testdrivenio / concurrent-web-scraping

Licence: other
Building a Concurrent Web Scraper with Python and Selenium

Programming Languages

HTML
75241 projects
python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to concurrent-web-scraping

Web Scraping
Detailed web scraping tutorials for dummies with financial data crawlers on Reddit WallStreetBets, CME (both options and futures), US Treasury, CFTC, LME, SHFE and news data crawlers on BBC, Wall Street Journal, Al Jazeera, Reuters, Financial Times, Bloomberg, CNN, Fortune, The Economist
Stars: ✭ 153 (+446.43%)
Mutual labels:  web-scraping
Trump Lies
Tutorial: Web scraping in Python with Beautiful Soup
Stars: ✭ 201 (+617.86%)
Mutual labels:  web-scraping
Wayback Machine Scraper
A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
Stars: ✭ 230 (+721.43%)
Mutual labels:  web-scraping
Scrapy Training
Scrapy Training companion code
Stars: ✭ 157 (+460.71%)
Mutual labels:  web-scraping
Twitter Intelligence
The Twitter Intelligence OSINT project performs tracking and analysis of Twitter.
Stars: ✭ 179 (+539.29%)
Mutual labels:  web-scraping
Short Jokes Dataset
Python scripts for building 'Short Jokes' dataset, featured on Kaggle
Stars: ✭ 215 (+667.86%)
Mutual labels:  web-scraping
Juno crawler
Scrapy crawler to collect data on the back catalog of songs listed for sale.
Stars: ✭ 150 (+435.71%)
Mutual labels:  web-scraping
UofT-Timetable-Generator
A web application that generates timetables for university students at the University of Toronto
Stars: ✭ 34 (+21.43%)
Mutual labels:  web-scraping
Bet On Sibyl
Machine Learning Model for Sport Predictions (Football, Basketball, Baseball, Hockey, Soccer & Tennis)
Stars: ✭ 190 (+578.57%)
Mutual labels:  web-scraping
Docbao
A tool for scanning and analyzing keywords from Vietnamese online news sites
Stars: ✭ 230 (+721.43%)
Mutual labels:  web-scraping
Learnpythonforresearch
This repository provides everything you need to get started with Python for (social science) research.
Stars: ✭ 163 (+482.14%)
Mutual labels:  web-scraping
Grab
Web Scraping Framework
Stars: ✭ 2,147 (+7567.86%)
Mutual labels:  web-scraping
Selenium Python Helium
Selenium-python but lighter: Helium is the best Python library for web automation.
Stars: ✭ 2,732 (+9657.14%)
Mutual labels:  web-scraping
Netflix Clone
Netflix like full-stack application with SPA client and backend implemented in service oriented architecture
Stars: ✭ 156 (+457.14%)
Mutual labels:  web-scraping
Scrape Linkedin Selenium
`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.
Stars: ✭ 239 (+753.57%)
Mutual labels:  web-scraping
Helena
A Chrome extension for writing custom web scraping programs and web automation programs. Just demonstrate how to collect the first row of data, then let the extension write the program for collecting all rows.
Stars: ✭ 151 (+439.29%)
Mutual labels:  web-scraping
R Web Scraping Cheat Sheet
Guide, reference and cheatsheet on web scraping using rvest, httr and Rselenium.
Stars: ✭ 207 (+639.29%)
Mutual labels:  web-scraping
wayback
⏪ Tools to Work with the Various Internet Archive Wayback Machine APIs
Stars: ✭ 52 (+85.71%)
Mutual labels:  web-scraping
Quora Api
An unofficial API for Quora.
Stars: ✭ 250 (+792.86%)
Mutual labels:  web-scraping
City Scrapers
Scrape, standardize and share public meetings from local government websites
Stars: ✭ 220 (+685.71%)
Mutual labels:  web-scraping

Concurrent Web Scraping with Python and Selenium

Want to learn how to build this project?

Check out the blog post.

Want to use this project?

  1. Fork/Clone

  2. Create and activate a virtual environment

  3. Install the requirements

  4. Run the scrapers:

    # sync
    (env)$ python script.py headless
    
    # parallel with multiprocessing
    (env)$ python script_parallel_1.py headless
    
    # parallel with concurrent.futures
    (env)$ python script_parallel_2.py headless
    
    # concurrent with concurrent.futures (should be the fastest!)
    (env)$ python script_concurrent.py headless
    
    # parallel with concurrent.futures and concurrent with asyncio
    (env)$ python script_asyncio.py headless

  5. Run the tests:

    (env)$ python -m pytest test/test_scraper.py
    (env)$ python -m pytest test/test_scraper_mock.py
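
The run commands above name several execution strategies (sync, `concurrent.futures`, `asyncio`). The following is a minimal sketch of how those strategies differ for I/O-bound work, not the code from the scripts themselves. `fake_scrape` is a hypothetical stand-in that sleeps instead of driving Selenium, and the choice of a thread pool under `asyncio` is an assumption made for illustration; the project's scripts may use processes instead.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def fake_scrape(page: int) -> str:
    """Hypothetical stand-in for one page fetch (an I/O-bound wait)."""
    time.sleep(0.1)  # simulates waiting on the network or the browser
    return f"page-{page}"


def run_sync(pages):
    # One page at a time: total time is roughly the sum of all waits.
    return [fake_scrape(p) for p in pages]


def run_threaded(pages, workers=4):
    # Threads overlap the waits; map() yields results in input order.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fake_scrape, pages))


async def run_asyncio(pages, workers=4):
    # asyncio hands each blocking call to a thread pool and awaits them all.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        tasks = [loop.run_in_executor(executor, fake_scrape, p) for p in pages]
        return await asyncio.gather(*tasks)


def timed(fn, *args):
    # Returns (result, elapsed seconds) for a single call.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    pages = list(range(1, 9))
    _, sync_time = timed(run_sync, pages)
    _, threaded_time = timed(run_threaded, pages)
    _, asyncio_time = timed(lambda: asyncio.run(run_asyncio(pages)))
    print(f"sync:     {sync_time:.2f}s")
    print(f"threaded: {threaded_time:.2f}s")
    print(f"asyncio:  {asyncio_time:.2f}s")
```

Because the simulated work only waits, the threaded and asyncio versions should finish well ahead of the sequential one. With a real browser the gap depends on the site, the number of workers, and the cost of starting each driver.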