Top 135 web-scraping open source projects

Quora Api
An unofficial API for Quora.
Scrape Linkedin Selenium
`scrape_linkedin` is a Python package that allows you to scrape personal LinkedIn profiles and company pages, turning the data into structured JSON.
Wayback Machine Scraper
A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
Docbao
A tool for crawling and analyzing keywords from Vietnamese online news sites
City Scrapers
Scrape, standardize and share public meetings from local government websites
Selenium Python Helium
Selenium-python but lighter: Helium is the best Python library for web automation.
Short Jokes Dataset
Python scripts for building 'Short Jokes' dataset, featured on Kaggle
R Web Scraping Cheat Sheet
Guide, reference and cheatsheet on web scraping using rvest, httr and RSelenium.
Trump Lies
Tutorial: Web scraping in Python with Beautiful Soup
Bet On Sibyl
Machine Learning Model for Sport Predictions (Football, Basketball, Baseball, Hockey, Soccer & Tennis)
Twitter Intelligence
An OSINT project for tracking and analyzing Twitter
Learnpythonforresearch
This repository provides everything you need to get started with Python for (social science) research.
Scrapy Training
Scrapy Training companion code
Netflix Clone
Netflix-like full-stack application with an SPA client and a backend implemented in a service-oriented architecture
Web Scraping
Detailed web scraping tutorials for dummies with financial data crawlers on Reddit WallStreetBets, CME (both options and futures), US Treasury, CFTC, LME, SHFE and news data crawlers on BBC, Wall Street Journal, Al Jazeera, Reuters, Financial Times, Bloomberg, CNN, Fortune, The Economist
Helena
A Chrome extension for writing custom web scraping programs and web automation programs. Just demonstrate how to collect the first row of data, then let the extension write the program for collecting all rows.
Juno crawler
Scrapy crawler to collect data on the back catalog of songs listed for sale.
Phpscraper
PHP Scraper - a highly opinionated web-interface for PHP
Sqrape
Simple Query Scraping with CSS and Go Reflection (MOVED to Gitlab)
Zillow
Zillow Scraper for Python using Selenium
Html Metadata
MetaData html scraper and parser for Node.js (supports Promises and callback style)
Actor Page Analyzer
Apify actor that opens a web page in headless Chrome and analyzes the HTML and JavaScript objects, looks for schema.org microdata and JSON-LD metadata, analyzes AJAX requests, etc.
Ayakashi
⚡️ Ayakashi.io - The next generation web scraping framework
Save For Offline
Android app for saving webpages for offline reading.
Scrapyd Cluster On Heroku
Set up a free and scalable Scrapyd cluster for distributed web-crawling with just a few clicks.
Pulsar
Turn large websites into tables and charts using simple SQL.
Splashr
💦 Tools to Work with the 'Splash' JavaScript Rendering Service in R
Hockey Scraper
Python Package for scraping NHL Play-by-Play and Shift data
Humanoid
Node.js package to bypass CloudFlare's anti-bot JavaScript challenges
Daftlistings
A library that enables programmatic interaction with daft.ie. Daft.ie has nationwide coverage and contains about 80% of the total available properties in Ireland.
Rvest
Simple web scraping for R
Detect Cms
PHP Library for detecting CMS
Reader
Extract clean(er), readable text from web pages via Mercury Web Parser.
Ping Sm
Receive an email or Telegram message as soon as Migros Sanalmarket is available for delivery in your neighborhood.
Arachnid
Powerful web scraping framework for Crystal
Decapitated
Headless 'Chrome' Orchestration in R
Instago
Download/access photos, videos, stories, story highlights, postlives, following and followers of Instagram
Scrapy Craigslist
Web Scraping Craigslist's Engineering Jobs in NY with Scrapy
Actor Google Search Scraper
Apify actor that crawls Google Search result pages (SERPs) and extracts a list of organic results, ads, related queries and more. It supports selection of custom country, language and location.
Snoop
Snoop: a reconnaissance tool based on open data (OSINT world)
Webmiddle
Node.js framework for modular web scraping and data extraction
Letterboxd recommendations
Scrapes publicly accessible Letterboxd data and builds a movie recommendation model that generates recommendations for a given Letterboxd username
Youtube tutorials
Collection of scripts corresponding to LucidProgramming YouTube tutorials
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Coolqlcool
Next.js server to query websites with GraphQL
Scrapy Fake Useragent
Random User-Agent middleware based on fake-useragent
User Agents
A JavaScript library for generating random user agents with data that's updated daily.
Rpa
UI.Vision: Open-Source RPA Software (formerly Kantu) - Modern Robotic Process Automation with Selenium IDE++