Top 54 web-crawler open source projects

Strong Web Crawler
An advanced web crawler built on C#.NET, PhantomJS, and Selenium. It can execute JavaScript code, trigger various events, and manipulate the page DOM structure.
Kochat
Opensource Korean chatbot framework
Antch
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
Nutch
Apache Nutch is an extensible and scalable web crawler
Zhihu Crawler People
A simple distributed crawler for Zhihu, with data analysis
Crawler Commons
A set of reusable Java components that implement functionality common to any web crawler
Collector Http
Norconex HTTP Collector is a flexible web crawler for collecting, parsing, and manipulating data from the Internet (or Intranet) to various data repositories such as search engines.
Proxy
A simple tool for fetching usable proxies from several websites.
Crawlab Lite
Lite version of Crawlab, a crawler management platform.
Pspider
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
Pulsar
Turn large websites into tables and charts using simple SQL.
Infinitycrawler
A simple but powerful web crawler library for .NET
Ospider
An open-source tool for acquiring and preprocessing vector geographic data (POI/AOI/administrative districts/road networks/land use)
Cvpr2019
Displays all the 2019 CVPR Accepted Papers in a way that they are easy to parse.
Abotx
Cross Platform C# Web crawler framework, headless browser, parallel crawler. Please star this project! +1.
Terpene Profile Parser For Cannabis Strains
Parser and database to index the terpene profile of different strains of Cannabis from online databases
Crawlab
Distributed web crawler admin platform for managing spiders, regardless of language or framework.
Maman
Rust Web Crawler saving pages on Redis
Dutsso
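A Python module for quickly logging in to the Dalian University of Technology unified identity authentication system (SSO). It makes it easy to implement grade notifications, course grabbing, Yulan card information lookup, personal information queries, and similar features.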
Storm Crawler
A scalable, mature and versatile web crawler based on Apache Storm
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Awesome Crawler
A collection of awesome web crawlers and spiders in different languages
Spider Flow
A next-generation crawler platform that defines crawl workflows graphically, so a crawler can be built without writing code.
Sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Ache
ACHE is a web crawler for domain-specific search.
Supercrawler
A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Spidy
The simple, easy to use command line web crawler.
UnChain
A tool to find redirection chains in multiple URLs
CrawlBox
An easy way to brute-force web directories.
flink-crawler
Continuous scalable web crawler built on top of Flink and crawler-commons
SchweizerMesser
🎯 Python 3 web crawling in practice and data analysis collection | Dangdang | NetEase Cloud Music | unsplash | Pizza Hut | Maoyan |
pyCreeper
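An information-gathering (crawler) framework for quickly extracting web page content, with support for dynamically loading and controlling pages.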
proxi
Proxy pool. Finds and checks proxies, with a REST API for querying results. Can find over 25k proxies in under 5 minutes.
Mimo-Crawler
A web crawler that uses Firefox and js injection to interact with webpages and crawl their content, written in nodejs.
siteshooter
📷 Automate full website screenshots and PDF generation with multiple viewport support.
WebCrawler
Just a simple web crawler that returns crawled links as an IObservable using Reactive Extensions and async/await.
bolsa
A Python library intended to make it easier to access data on your stock exchange investments (B3/CEI) through the Portal CEI.
leek
Distributed task redisqueue (the simplest Python distributed function scheduling framework)
json-web-crawler
Use JSON to list all elements (with CSS 3 and jQuery selectors) that you want to crawl.
StackOverflow-Crawler
A web crawler that crawls the Stack Overflow website (http://stackoverflow.com/) and finds the most popular technologies at the current point in time by collecting the tag info of the newest questions asked on the site.
WeReadScan
A crawler that scans books purchased on WeRead (微信读书) and downloads them as local PDFs
ant
A web crawler for Go
doc crawler.py
Explore a website recursively and download all the wanted documents (PDF, ODT…)
Market-Trend-Prediction
A project for a knowledge graph building course. It leverages historical stock prices and integrates social media listening from customers to predict market trends on the Dow Jones Industrial Average (DJIA).