Top 229 scraping open source projects

Nimquery
Nim library for querying HTML using CSS selectors (like JavaScript's document.querySelector)
Api Store
Contains all the public APIs listed in Phantombuster's API store. Pull requests welcome!
Torrengo
Torrengo is a CLI (command line) program written in Go which concurrently searches torrents from various sources.
Mechaml
OCaml functional web scraping library
Awesome Python Primer
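Index of high-quality Chinese-language resources for self-taught Python beginners, including books, documentation, and videos, suited to web scraping, web development, data analysis, and machine learning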
Mtnt
Code for the collection and analysis of the MTNT dataset
Artoo
artoo.js - the client-side scraping companion.
Django Dynamic Scraper
Creating Scrapy scrapers via the Django admin interface
Pge Outages
Tracking PG&E outages
Configs
Public, free to use, repository with diggers configs for scraping / extracting data from various e-commerce websites and online stores
Pypatent
Search for and retrieve US Patent and Trademark Office Patent Data
Scrapy Cluster
This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
Instagram Scraper
Scrape the Instagram frontend. Inspired by twitter-scraper by @kennethreitz.
Webhere
HTML scraping for Objective-C.
Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
Imagescraper
✂️ High performance, multi-threaded image scraper
Parsel
Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
Newcrawler
Free Web Scraping Tool with Java
Tabula
Tabula is a tool for liberating data tables trapped inside PDF files
Gazpacho
🥫 The simple, fast, and modern web scraping library
Oj
Tools for various online judges. Downloading sample cases, generating additional test cases, testing your code, and submitting it.
Facebook data analyzer
Analyze your copy of your Facebook data with Ruby. Download the zip file from Facebook and get information about friend rankings by message count, vocabulary, contacts, friends-added statistics, and more
Facebook Scraper
Scrape Facebook public pages without an API key
Nickjs
Web scraping library made by the Phantombuster team. Modern, simple & works on all websites. (Deprecated)
Geeksforgeeks.pdf
Topic-wise PDFs of Geeks for Geeks articles. (Last updated in October 2018)
Scrapple
A framework for creating semi-automatic web content extractors
Dataflowkit
Extract structured data from websites. Website scraping.
Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Jekyll
Jekyll-based static site for The Programming Historian
Lookyloo
Lookyloo is a web interface that allows users to capture a website page and then display a tree of domains that call each other.
Undetected Chromedriver
Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / CloudFlare IUAM)
Coronadatascraper
COVID-19 Coronavirus data scraped from government and curated data sources.
Post Tuto Deployment
Build and deploy a machine learning app from scratch 🚀
Comic Dl
Comic-dl is a command line tool to download manga and comics from various comic and manga sites. Supported sites: readcomiconline.to, mangafox.me, comic naver and many more.
Socialreaper
Social media scraping / data collection library for Facebook, Twitter, Reddit, YouTube, Pinterest, and Tumblr APIs
Tinking
🧶 Extract data from any website without code, just clicks.
Spidermon
Scrapy extension for monitoring spider execution.
Linkedin
LinkedIn scraper using Selenium WebDriver, headless Chromium, Docker, and Scrapy
Elixir Scrape
Scrape any website, article or RSS/Atom Feed with ease!
Edu Mail Generator
Generate Free Edu Mail(s) within minutes
Sasila
A flexible, friendly crawler framework
Clean Text
🧹 Python package for text cleaning
Scrapy Crawlera
Crawlera middleware for Scrapy
Lambdasoup
Functional HTML scraping and rewriting with CSS in OCaml
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Apify Js
Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Mechanize
Mechanize is a Ruby library that makes automated web interaction easy.
facebook-discussion-tk
A collection of tools to (semi-)automatically collect and analyze data from online discussions on Facebook groups and pages.
ARGUS
ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the websites, ARGUS is able to perform tasks like scraping texts or collecting hyperlinks between websites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9