dojutsu-user / IMDB-Scraper

License: MIT
A Scrapy project for scraping data from IMDB, together with a movie dataset covering 58,623 movies.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to IMDB-Scraper

Scrapy Fake Useragent
Random User-Agent middleware based on fake-useragent
Stars: ✭ 520 (+1305.41%)
Mutual labels:  web-scraping, scrapy
Juno crawler
Scrapy crawler to collect data on the back catalog of songs listed for sale.
Stars: ✭ 150 (+305.41%)
Mutual labels:  web-scraping, scrapy
Faster Than Requests
Faster requests on Python 3
Stars: ✭ 639 (+1627.03%)
Mutual labels:  web-scraping, scrapy
scraping-ebay
Scraping Ebay's products using Scrapy Web Crawling Framework
Stars: ✭ 79 (+113.51%)
Mutual labels:  web-scraping, scrapy
OLX Scraper
📻 An OLX scraper using Scrapy + MongoDB. It scrapes recent ads posted for the requested product and dumps them to a NoSQL MongoDB database.
Stars: ✭ 15 (-59.46%)
Mutual labels:  web-scraping, scrapy
Scrapple
A framework for creating semi-automatic web content extractors
Stars: ✭ 464 (+1154.05%)
Mutual labels:  web-scraping, scrapy
Scrapyd Cluster On Heroku
Set up free and scalable Scrapyd cluster for distributed web-crawling with just a few clicks. DEMO 👉
Stars: ✭ 106 (+186.49%)
Mutual labels:  web-scraping, scrapy
restaurant-finder-featureReviews
Build a Flask web application to help users retrieve key restaurant information and feature-based reviews (generated by applying market-basket model – Apriori algorithm and NLP on user reviews).
Stars: ✭ 21 (-43.24%)
Mutual labels:  web-scraping, scrapy
City Scrapers
Scrape, standardize and share public meetings from local government websites
Stars: ✭ 220 (+494.59%)
Mutual labels:  web-scraping, scrapy
Scrapy Training
Scrapy Training companion code
Stars: ✭ 157 (+324.32%)
Mutual labels:  web-scraping, scrapy
Python3 Spider
Practical Python crawlers: simulated login to major websites, including but not limited to slider CAPTCHAs, Pinduoduo, Meituan, Baidu, bilibili, Dianping, and Taobao. If you like it, please star ❤️
Stars: ✭ 2,129 (+5654.05%)
Mutual labels:  scrapy, scrapy-crawler
PythonScrapyBasicSetup
Basic setup with random user agents and IP addresses for Python Scrapy Framework.
Stars: ✭ 57 (+54.05%)
Mutual labels:  web-scraping, scrapy-framework
Scrapy Craigslist
Web Scraping Craigslist's Engineering Jobs in NY with Scrapy
Stars: ✭ 54 (+45.95%)
Mutual labels:  web-scraping, scrapy
Netflix Clone
Netflix like full-stack application with SPA client and backend implemented in service oriented architecture
Stars: ✭ 156 (+321.62%)
Mutual labels:  web-scraping, scrapy
estate-crawler
Scraping the real estate agencies for up-to-date house listings as soon as they arrive!
Stars: ✭ 20 (-45.95%)
Mutual labels:  scrapy, scrapy-crawler
scrapy-wayback-machine
A Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
Stars: ✭ 92 (+148.65%)
Mutual labels:  web-scraping, scrapy
torchestrator
Spin up Tor containers and then proxy HTTP requests via these Tor instances
Stars: ✭ 32 (-13.51%)
Mutual labels:  scrapy
Scrapy IPProxyPool
Free IP proxy pool. A plugin for the Scrapy crawler framework.
Stars: ✭ 100 (+170.27%)
Mutual labels:  scrapy
rreddit
𝐫⟋ Get Reddit data
Stars: ✭ 49 (+32.43%)
Mutual labels:  web-scraping
browser-pool
A Node.js library to easily manage and rotate a pool of web browsers, using any of the popular browser automation libraries like Puppeteer, Playwright, or SecretAgent.
Stars: ✭ 71 (+91.89%)
Mutual labels:  web-scraping

IMDB Scraper

Overview

This is a Scrapy project that crawls the IMDB website, scrapes movie information, and stores the data in JSON format.
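
The spider itself lives in imdb_scraper/spiders/movie.py. As a rough, assumption-laden sketch (not the repository's actual code), a minimal Scrapy spider that walks an IMDB search listing, follows each title to its detail page and yields an item might look like the following; the selectors and field names are illustrative only, and in the repository the start URL comes from the SEARCH_QUERY variable described under Crawling:

import scrapy


class MovieSpider(scrapy.Spider):
    name = 'movie'
    start_urls = ['https://www.imdb.com/search/title?title_type=feature&view=simple']

    def parse(self, response):
        # Follow every title link on the search listing page.
        for href in response.css("a[href*='/title/tt']::attr(href)").getall():
            yield response.follow(href, callback=self.parse_movie)
        # Follow the pagination link, if present (this selector is an assumption).
        next_page = response.css('a.lister-page-next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_movie(self, response):
        # Yield a plain dict; the real project scrapes many more fields.
        yield {
            'title': response.css('h1::text').get(default='').strip(),
            'imdb_url': response.url,
        }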

Crawling

  1. Clone the repo and navigate into the IMDB-Scraper folder.
$ git clone https://github.com/dojutsu-user/IMDB-Scraper.git
$ cd IMDB-Scraper/
  2. Create and activate a virtual environment.
(IMDB-Scraper) $ pipenv shell
  3. Install all dependencies.
(IMDB-Scraper) $ pipenv install
  4. Navigate into the imdb_scraper folder.
(IMDB-Scraper) $ cd imdb_scraper/
  5. You can change the starting page of the crawler in the file imdb_scraper/spiders/movie.py by changing the SEARCH_QUERY variable. You can build your own query here: imdb.com/search/title. Copy the generated URL and paste it in place of the default URL. By default:
SEARCH_QUERY = (
    'https://www.imdb.com/search/title?'
    'title_type=feature&'       # feature films only
    'user_rating=1.0,10.0&'     # any IMDB user rating from 1.0 to 10.0
    'countries=us&'             # country: USA
    'languages=en&'             # language: English
    'count=250&'                # 250 results per listing page
    'view=simple'               # simple list view
)
  6. Start the crawler.
(IMDB-Scraper) $ scrapy crawl movie
  7. The scraped data will be stored in a JSON file named movie.json located at IMDB-Scraper/imdb_scraper/data/movie.json (a rough sketch of how such an export could be implemented is shown after these steps).
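
The repository presumably handles this JSON export itself, for example through an item pipeline. As a minimal sketch of how such an export could be implemented (not the project's actual code), a JSON-writing pipeline registered under ITEM_PIPELINES in settings.py might look like this:

import json


class JsonWriterPipeline:
    def open_spider(self, spider):
        # Collect all scraped items in memory for the duration of the crawl.
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # Write everything out as one JSON array, matching data/movie.json.
        with open('data/movie.json', 'w', encoding='utf-8') as f:
            json.dump(self.items, f, ensure_ascii=False, indent=4)

Alternatively, Scrapy's built-in feed exports can produce the same file without any custom code: scrapy crawl movie -o data/movie.json.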

The final data will be in the form:

[
    ...
    {
        ...
    },
    {
        "title": "12 Strong",
        "rating": "R",
        "year": "2018",
        "users_rating": "6.6",
        "votes": "42,919",
        "metascore": "54",
        "img_url": "https://m.media-amazon.com/images/M/MV5BNTEzMjk3NzkxMV5BMl5BanBnXkFtZTgwNjY2NDczNDM@._V1_UX182_CR0,0,182,268_AL__QL50.jpg",
        "countries": [
            "USA"
        ],
        "languages": [
            "English",
            "Dari",
            "Russian",
            "Spanish",
            "Uzbek"
        ],
        "actors": [
            "Chris Hemsworth",
            "Michael Shannon",
            "Michael Peña",
            "Navid Negahban",
            "Trevante Rhodes",
            "Geoff Stults",
            "Thad Luckinbill",
            "Austin Hébert",
            "Austin Stowell",
            "Ben O'Toole",
            "Kenneth Miller",
            "Kenny Sheard",
            "Jack Kesy",
            "Rob Riggle",
            "William Fichtner"
        ],
        "genre": [
            "Action",
            "Drama",
            "History",
            "War"
        ],
        "tagline": "The Declassified True Story of the Horse Soldiers",
        "description": "12 Strong tells the story of the first Special Forces team deployed to Afghanistan after 9/11; under the leadership of a new captain, the team must work with an Afghan warlord to take down the Taliban.",
        "directors": [
            "Nicolai Fuglsig"
        ],
        "runtime": "130 min",
        "imdb_url": "https://www.imdb.com/title/tt1413492/"
    },
    {
        ...
    }
    ...
]
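
Because the output is a plain JSON array, it can be inspected with the standard library once the crawl has finished. A small example, assuming the default output path relative to the imdb_scraper folder:

import json

# Load the scraped dataset from the crawler's default output location.
with open('data/movie.json', encoding='utf-8') as f:
    movies = json.load(f)

print(len(movies))  # total number of movies scraped

# Example query: titles of War movies with a users_rating above 8.0.
war_hits = [
    m['title']
    for m in movies
    if 'War' in (m.get('genre') or []) and float(m.get('users_rating') or 0) > 8.0
]
print(war_hits[:10])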

Stats

These are the final crawl stats reported by Scrapy when the default SEARCH_QUERY is used.

{
    "downloader/exception_count": 32,
    "downloader/exception_type_count/twisted.internet.error.ConnectError": 8,
    "downloader/exception_type_count/twisted.internet.error.TimeoutError": 24,
    "downloader/request_bytes": 46219942,
    "downloader/request_count": 58931,
    "downloader/request_method_count/GET": 58931,
    "downloader/response_bytes": 2522617013,
    "downloader/response_count": 58899,
    "downloader/response_status_count/200": 58899,
    "dupefilter/filtered": 9829,
    "finish_reason": "finished",
    "finish_time": datetime.datetime(2018, 12, 28, 12, 37, 8, 676592),
    "item_scraped_count": 58623,
    "log_count/DEBUG": 117556,
    "log_count/INFO": 299,
    "memusage/max": 639164416,
    "memusage/startup": 52166656,
    "request_depth_max": 235,
    "response_received_count": 58899,
    "retry/count": 32,
    "retry/reason_count/twisted.internet.error.ConnectError": 8,
    "retry/reason_count/twisted.internet.error.TimeoutError": 24,
    "scheduler/dequeued": 58930,
    "scheduler/dequeued/memory": 58930,
    "scheduler/enqueued": 58930,
    "scheduler/enqueued/memory": 58930,
    "start_time": datetime.datetime(2018, 12, 28, 7, 46, 8, 317470)
}
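
For reference, the start_time and finish_time above can be turned into an overall throughput figure: the full crawl ran for roughly 4 hours 51 minutes, i.e. about 3.4 items scraped per second.

import datetime

start = datetime.datetime(2018, 12, 28, 7, 46, 8, 317470)
finish = datetime.datetime(2018, 12, 28, 12, 37, 8, 676592)

duration = finish - start
print(duration)                          # 4:51:00.359122
print(58623 / duration.total_seconds())  # ≈ 3.36 items per second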

IMDB Movie Dataset (58,623 Movies)

The dataset obtained from the above crawler has been uploaded to Google Drive and can be downloaded from this link: https://drive.google.com/open?id=13OE6CyqqDqJRpP-8JR15l9Tjb2fSWa6r
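
If you would rather analyse the pre-scraped dataset than run the crawler yourself, the downloaded file (assumed here to be saved as movie.json) can be loaded straight into pandas:

import pandas as pd

# Each element of the JSON array becomes one row in the DataFrame.
df = pd.read_json('movie.json')
print(df.shape)  # expected: (58623, <number of fields>)
print(df[['title', 'year', 'users_rating']].head())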

Disclaimer

The project and the obtained dataset are intended only for educational purposes. The project is completely open source, and there is no intention to commercialise it or the dataset, in whole or in part. The developer is in no way the owner of any of the resources used and does not claim to hold permission to use them.
