Awesome Web Scraper

A collection of awesome web scrapers and crawlers.

Java

  • Apache Nutch - Highly extensible, highly scalable Web crawler. Pluggable parsing, protocols, storage and indexing.
  • websphinx - Website-Specific Processors for HTML INformation eXtraction.
  • Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
  • crawler4j - Open source web crawler for Java that provides a simple interface for crawling the web. With it, you can set up a multi-threaded web crawler in a few minutes.

C/C++

  • HTTrack - Free and easy-to-use offline browser utility that downloads a website to a local directory, building all directories recursively and fetching HTML, images, and other files.

C#

  • ccrawler - Built in C# 3.5. Includes a simple web content categorizer extension that can separate web pages by their content.

Erlang

  • ebot - Open source web crawler built on top of a NoSQL database (Apache CouchDB or Riak), an AMQP message broker (RabbitMQ), Webmachine, and Mochiweb.

Python

  • scrapy - A fast, high-level web crawling and scraping framework for Python (see the minimal spider sketch below).
  • gdom - DOM traversing and scraping using GraphQL.
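
For readers new to the Python tools above, the sketch below shows roughly what a Scrapy spider looks like. It is a minimal, hedged example rather than part of this list: the target site, CSS selectors, and field names (quotes.toscrape.com, div.quote, text, author) are placeholders chosen for illustration.

```python
# quotes_spider.py - minimal Scrapy spider sketch (assumes: pip install scrapy)
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Placeholder start URL; swap in the site you actually want to scrape.
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block matched by the (assumed) CSS selectors.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if the page has one.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Run it with `scrapy runspider quotes_spider.py -o quotes.json` to crawl the pages and dump the scraped items to a JSON file.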

PHP

  • Goutte - A simple PHP web scraper.
  • DiDOM - Simple and fast HTML parser.
  • simple_html_dom - Just a Simple HTML DOM library fork.
  • PHPCrawl - PHPCrawl is a framework for crawling/spidering websites written in PHP.

Node.js

  • puppeteer - Headless Chrome Node.js API (https://pptr.dev).
  • Phantomjs - Scriptable Headless WebKit.
  • node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery.
  • node-simplecrawler - Flexible event-driven crawler for Node.
  • spider - Programmable spidering of web sites with node.js and jQuery.
  • slimerjs - A PhantomJS-like tool running Gecko.
  • casperjs - Navigation scripting & testing utility for PhantomJS and SlimerJS.
  • zombie - Insanely fast, full-stack, headless browser testing using node.js.
  • nightmare - A high-level wrapper for PhantomJS that lets you automate browser tasks.
  • jsdom - A JavaScript implementation of the WHATWG DOM and HTML standards, for use with Node.js.
  • xray - The next web scraper. See through the <html> noise.
  • lightcrawler - Crawl a website and run it through Google lighthouse.

Ruby

  • wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.

Go

  • gocrawl - Polite, slim and concurrent web crawler.
  • fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

License

MIT

Contributing

Please read the Contribution Guidelines before submitting your suggestion.

Feel free to open an issue or create a pull request with your additions.
