Awesome Web Scraper

A collection of awesome web scrapers and crawlers.

Java

  • Apache Nutch - Highly extensible, highly scalable Web crawler. Pluggable parsing, protocols, storage and indexing.
  • websphinx - Website-Specific Processors for HTML INformation eXtraction.
  • Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
  • crawler4j - Open source web crawler for Java that provides a simple interface for crawling the web. With it, you can set up a multi-threaded web crawler in a few minutes.

C/C++

  • HTTrack - Free and easy-to-use offline browser utility that downloads a website to a local directory, building all directories recursively and fetching HTML, images, and other files.

C#

  • ccrawler - Built in C# 3.5. Includes a simple web content categorizer extension that can separate web pages by their content.

Erlang

  • ebot - Open source web crawler built on top of a NoSQL database (Apache CouchDB or Riak), an AMQP message broker (RabbitMQ), Webmachine, and Mochiweb.

Python

  • scrapy - A fast, high-level web crawling and scraping framework for Python (see the minimal spider sketch below).
  • gdom - DOM traversing and scraping using GraphQL.
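
For readers new to the Python tools above, the sketch below shows roughly what a Scrapy spider looks like. It is a minimal, hedged example rather than part of this list: the target site, CSS selectors, and field names (quotes.toscrape.com, div.quote, text, author) are placeholders chosen for illustration.

```python
# quotes_spider.py - minimal Scrapy spider sketch (assumes: pip install scrapy)
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Placeholder start URL; swap in the site you actually want to scrape.
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block matched by the (assumed) CSS selectors.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if the page has one.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Run it with `scrapy runspider quotes_spider.py -o quotes.json` to crawl the pages and dump the scraped items to a JSON file.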

PHP

  • Goutte - A simple PHP web scraper.
  • DiDOM - Simple and fast HTML parser.
  • simple_html_dom - Just a Simple HTML DOM library fork.
  • PHPCrawl - PHPCrawl is a framework for crawling/spidering websites written in PHP.

Node.js

  • puppeteer - Headless Chrome Node.js API (https://pptr.dev).
  • Phantomjs - Scriptable Headless WebKit.
  • node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery.
  • node-simplecrawler - Flexible event-driven crawler for Node.
  • spider - Programmable spidering of web sites with node.js and jQuery.
  • slimerjs - A PhantomJS-like tool running Gecko.
  • casperjs - Navigation scripting & testing utility for PhantomJS and SlimerJS.
  • zombie - Insanely fast, full-stack, headless browser testing using node.js.
  • nightmare - A high-level wrapper for PhantomJS that lets you automate browser tasks.
  • jsdom - A JavaScript implementation of the WHATWG DOM and HTML standards, for use with Node.js.
  • xray - The next web scraper. See through the <html> noise.
  • lightcrawler - Crawl a website and run it through Google lighthouse.

Ruby

  • wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.

Go

  • gocrawl - Polite, slim and concurrent web crawler.
  • fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

License

MIT

Contributing

Please read the Contribution Guidelines before submitting your suggestion.

Feel free to open an issue or create a pull request with your additions.
