Top 80 open-source crawling projects

Memorious
Distributed crawling framework for documents and structured data.
Colly
Elegant Scraper and Crawler Framework for Golang
Antch
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
Nutch
Apache Nutch is an extensible and scalable web crawler
N2h4
A tool for collecting Naver news.
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Holiday Cn
📅🇨🇳 Chinese statutory holiday data, scraped automatically every day from State Council announcements.
Crawler
Go process used to crawl websites
Massivedl
Download a large list of files concurrently
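
For flavor, here is a minimal Python sketch of the core idea behind tools like this (concurrent downloads from a URL list via a thread pool); it illustrates the technique, not massivedl's own CLI, and the URLs are placeholders:

```python
# Minimal sketch of concurrent file downloading (illustration only,
# not massivedl's own interface). URLs and output dir are hypothetical.
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

urls = [
    "https://example.com/files/a.pdf",
    "https://example.com/files/b.pdf",
]

def download(url, out_dir="downloads"):
    os.makedirs(out_dir, exist_ok=True)
    dest = os.path.join(out_dir, url.rsplit("/", 1)[-1])
    urllib.request.urlretrieve(url, dest)  # fetch one file to disk
    return dest

# A thread pool keeps several HTTP requests in flight at once.
with ThreadPoolExecutor(max_workers=8) as pool:
    for path in pool.map(download, urls):
        print("saved", path)
```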
Newspaper
News, full-text, and article metadata extraction in Python 3.
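
A short example of Newspaper's documented Article workflow (the URL is a placeholder):

```python
# Extract the full text and metadata of one article with newspaper3k.
from newspaper import Article

article = Article("https://example.com/some-news-story")
article.download()   # fetch the raw HTML
article.parse()      # extract title, authors, body text, etc.

print(article.title)
print(article.authors)
print(article.publish_date)
print(article.text[:200])
```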
Bhban rpa
Example code from the book "Work Automation That Finishes Six Months of Work in a Single Day" (생능출판사, 2020). The examples are written for readers who have never learned Python and cover a wide range of office-automation topics, from Excel and design to macros and crawling.
Corpuscrawler
Crawler for linguistic corpora
Squidwarc
Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium, headless or headful.
Scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
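
A minimal Scrapy spider in the style of the project's own tutorial (quotes.toscrape.com is Scrapy's public practice site):

```python
# Crawl a site and yield structured items, following pagination.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow "next page" links and parse them the same way.
        yield from response.follow_all(response.css("li.next a"), self.parse)
```

Run it with `scrapy runspider quotes_spider.py -O quotes.json` to write the items to a file.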
Dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library for .NET Core that writes its output through Entity Framework Core. It is designed along the lines of established crawler libraries such as WebMagic and Scrapy, while remaining extensible for custom requirements. Medium article: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Grawler
Grawler is a tool written in PHP with a web interface that automates Google dork queries, scrapes the results, and stores them in a file.
Dig Etl Engine
Download DIG to run on your laptop or server.
Arachnid
Powerful web scraping framework for Crystal
Crawling Projects
Web scraping and automation using Python.
Pdf downloader
A Scrapy Spider for downloading PDF files from a webpage.
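
A sketch of how such a spider typically looks in Scrapy (an illustration of the pattern, not necessarily this repo's exact code; the start URL is a placeholder):

```python
# Find every PDF linked from a page and save it to disk.
import scrapy

class PdfSpider(scrapy.Spider):
    name = "pdf"
    start_urls = ["https://example.com/reports"]

    def parse(self, response):
        # Select anchors whose href ends in .pdf and fetch each one.
        for href in response.css('a[href$=".pdf"]::attr(href)').getall():
            yield response.follow(href, callback=self.save_pdf)

    def save_pdf(self, response):
        filename = response.url.rsplit("/", 1)[-1]
        with open(filename, "wb") as f:
            f.write(response.body)  # raw PDF bytes
```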
Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
Scrapyrt
HTTP API for Scrapy spiders
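
Scrapyrt exposes spiders over HTTP; a sketch of a client call, assuming a Scrapyrt server running inside a Scrapy project on its default port 9080, with a hypothetical spider named "quotes":

```python
# Trigger a crawl through Scrapyrt's /crawl.json endpoint and read
# the items and stats from the JSON response.
import requests

resp = requests.get(
    "http://localhost:9080/crawl.json",
    params={"spider_name": "quotes", "url": "https://quotes.toscrape.com/"},
)
data = resp.json()
print(data["stats"])          # crawl statistics
for item in data["items"]:    # items yielded by the spider
    print(item)
```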
Scrapy Selenium
Scrapy middleware to handle JavaScript pages using Selenium.
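
A sketch of the documented scrapy-selenium setup; the driver name, executable path, and URL below are assumptions to adjust for your environment:

```python
# settings.py: wire the Selenium middleware into the project.
SELENIUM_DRIVER_NAME = "firefox"
SELENIUM_DRIVER_EXECUTABLE_PATH = "/usr/local/bin/geckodriver"
SELENIUM_DRIVER_ARGUMENTS = ["-headless"]  # run the browser headless
DOWNLOADER_MIDDLEWARES = {"scrapy_selenium.SeleniumMiddleware": 800}

# spider: issue SeleniumRequest instead of scrapy.Request so the page
# is rendered in the browser before the response reaches parse().
import scrapy
from scrapy_selenium import SeleniumRequest

class JsPageSpider(scrapy.Spider):
    name = "js_pages"

    def start_requests(self):
        yield SeleniumRequest(url="https://example.com", callback=self.parse)

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
```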
Dataflowkit
Extract structured data from websites.
Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Isp Data Pollution
ISP Data Pollution to Protect Private Browsing History with Obfuscation
Webster
A reliable, high-level web crawling & scraping framework for Node.js.
Spidermon
Scrapy extension for monitoring spider execution.
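
A sketch of a Spidermon monitor along the lines of its getting-started docs; the item-count threshold and the settings module path are illustrative:

```python
# A monitor that fails when a spider scrapes too few items.
from spidermon import Monitor, MonitorSuite, monitors

@monitors.name("Item count")
class ItemCountMonitor(Monitor):
    @monitors.name("Minimum number of items extracted")
    def test_minimum_number_of_items(self):
        items_extracted = getattr(self.data.stats, "item_scraped_count", 0)
        self.assertTrue(items_extracted >= 10,
                        msg="Extracted fewer than 10 items")

class SpiderCloseMonitorSuite(MonitorSuite):
    monitors = [ItemCountMonitor]

# settings.py wiring (the monitors module path is hypothetical):
# SPIDERMON_ENABLED = True
# EXTENSIONS = {"spidermon.contrib.scrapy.extensions.Spidermon": 500}
# SPIDERMON_SPIDER_CLOSE_MONITORS = ("myproject.monitors.SpiderCloseMonitorSuite",)
```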
Sasila
A flexible, friendly crawler framework.
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Apify Js
Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Spidy
The simple, easy-to-use command-line web crawler.
Skycaiji
SkyCaiji (蓝天采集器) is a free data-collection and publishing crawler built with PHP + MySQL. It can be deployed on a cloud server, collects nearly every type of web page, integrates seamlessly with common CMS platforms, and publishes data in real time without logins or manual intervention. A fully cross-platform cloud crawler system for large-scale web data collection.
ARGUS
ARGUS is an easy-to-use web scraping tool built on the Scrapy Python framework. It can crawl a broad range of websites, performing tasks such as scraping text and collecting hyperlinks between sites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9
flink-crawler
Continuous scalable web crawler built on top of Flink and crawler-commons
img-cli
An interactive command-line interface built in Node.js for downloading one or more images to disk from a URL.
popular restaurants from officials
A project that analyzes Seoul city officials' business-expense records to find genuinely good restaurants.
serverless-instagram-crawler
A serverless Instagram hashtag crawler built on AWS Lambda and DynamoDB.
talospider
A simple, lightweight scraping micro-framework.
pomp
Screen scraping and web crawling framework
kasthack.osp
A generator of raw dumps of VK user data.
EngineeringTeam
A repository for organizing materials from the YBIGTA engineering team.
Mimo-Crawler
A web crawler that uses Firefox and JavaScript injection to interact with web pages and crawl their content, written in Node.js.
crawlkit
A crawler based on Phantom. Allows discovery of dynamic content and supports custom scrapers.
scrapy-distributed
A series of distributed components for Scrapy, including RabbitMQ-, Kafka-, and RedisBloom-based components.
custom-crawler
🌌 A high-productivity, semi-automatic crawler generator 🛠️🧰
go-scrapy
Web crawling and scraping framework for Golang
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.