
schollz / Linkcrawler

License: MIT
Cross-platform persistent and distributed web crawler 🔗

Programming Languages

Go
31211 projects - #10 most used programming language

Projects that are alternatives to or similar to Linkcrawler

Infinitycrawler
A simple but powerful web crawler library for .NET
Stars: ✭ 97 (-11.01%)
Mutual labels:  crawler
Dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library for Entity Framework Core output, based on .NET Core. The library is designed like other strong crawler libraries such as WebMagic and Scrapy, but built to be extensible for your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Stars: ✭ 100 (-8.26%)
Mutual labels:  crawler
Not Your Average Web Crawler
A web crawler (for bug hunting) that gathers more than you can imagine.
Stars: ✭ 107 (-1.83%)
Mutual labels:  crawler
Thesaurusspider
A Python crawler that downloads the lexicon files of the Sogou, Baidu, and QQ input methods; it can be used to build vocabularies for different industries.
Stars: ✭ 98 (-10.09%)
Mutual labels:  crawler
Crawlerpack
A Java web data crawler package.
Stars: ✭ 99 (-9.17%)
Mutual labels:  crawler
D4n155
OWASP D4N155 - Intelligent and dynamic wordlist using OSINT
Stars: ✭ 105 (-3.67%)
Mutual labels:  crawler
Lightcrawler
Crawl a website and run it through Google lighthouse
Stars: ✭ 1,339 (+1128.44%)
Mutual labels:  crawler
Fawkes
Fawkes is a tool to search for targets vulnerable to SQL injection. It performs the search using the Google search engine.
Stars: ✭ 108 (-0.92%)
Mutual labels:  crawler
Ruia
Async Python 3.6+ web scraping micro-framework based on asyncio
Stars: ✭ 1,366 (+1153.21%)
Mutual labels:  crawler
Crawler Detect
🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent
Stars: ✭ 1,549 (+1321.1%)
Mutual labels:  crawler
Gopa Abandoned
GOPA, a spider written in Go. (NOTE: this project moved to https://github.com/infinitbyte/gopa)
Stars: ✭ 98 (-10.09%)
Mutual labels:  crawler
Antispider
Stars: ✭ 99 (-9.17%)
Mutual labels:  crawler
Skycaiji
SkyCaiji is a free data collection and publishing crawler built with PHP + MySQL. It can be deployed on a cloud server, collect almost every type of web page, integrate seamlessly with all kinds of CMS site builders, and publish data in real time without logging in, fully automatically with no manual intervention. It is a fully cross-platform, cloud-based crawler system for web big-data collection.
Stars: ✭ 1,514 (+1288.99%)
Mutual labels:  crawler
Amazonrobot
A Python crawler for driving traffic to Amazon products.
Stars: ✭ 97 (-11.01%)
Mutual labels:  crawler
Webmagic
A scalable web crawler framework for Java.
Stars: ✭ 10,186 (+9244.95%)
Mutual labels:  crawler
Scaleable Crawler With Docker Cluster
A scalable and efficient crawler with a Docker cluster; crawls a million pages in 2 hours with a single machine.
Stars: ✭ 96 (-11.93%)
Mutual labels:  crawler
Andvaranaut
A dungeon crawler
Stars: ✭ 103 (-5.5%)
Mutual labels:  crawler
Lumberjack
An automated website accessibility scanner and cli
Stars: ✭ 109 (+0%)
Mutual labels:  crawler
Scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Stars: ✭ 42,343 (+38746.79%)
Mutual labels:  crawler
Crawler
Crawler, HTTP proxy, simulated login!
Stars: ✭ 106 (-2.75%)
Mutual labels:  crawler

linkcrawler

Cross-platform persistent and distributed web crawler

linkcrawler is persistent because the queue is stored in a remote database that is automatically re-initialized if interrupted. linkcrawler is distributed because multiple instances work from the same remotely stored queue, so you can start as many crawlers as you want on separate machines to speed up the process. linkcrawler is also fast because it is threaded and uses connection pools.
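To make the distribution model concrete, here is a minimal, self-contained Go sketch of the pattern: a pool of workers all draining one shared queue. The in-memory queue below stands in for the remote boltdb-server, and none of these names come from linkcrawler's actual code.

package main

import (
	"fmt"
	"net/http"
	"sync"
)

// queue stands in for the remote store (boltdb-server in linkcrawler's
// case); this in-memory version is only for illustration.
type queue struct {
	mu    sync.Mutex
	links []string
}

// pop takes the next un-crawled link, or reports the queue is drained.
func (q *queue) pop() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.links) == 0 {
		return "", false
	}
	u := q.links[0]
	q.links = q.links[1:]
	return u, true
}

func main() {
	q := &queue{links: []string{"http://rpiai.com"}}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // worker pool: 4 threads on this machine
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				u, ok := q.pop()
				if !ok {
					return // nothing left to crawl
				}
				resp, err := http.Get(u) // the shared client reuses pooled connections
				if err != nil {
					continue
				}
				resp.Body.Close()
				fmt.Println("crawled", u)
				// newly discovered links would be pushed back to the shared queue here
			}
		}()
	}
	wg.Wait()
}

Because every worker, on every machine, pops from the same store, adding machines simply adds consumers to the queue.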

Crawl responsibly.

This repo has been superseded by schollz/goredis-crawler

Getting Started

Install

If you have Go installed, just do

go get github.com/schollz/linkcrawler/...
go get github.com/schollz/boltdb-server/...

Otherwise, download the linkcrawler and boltdb-server binaries from their respective releases pages.

Run

Crawl a site

First run the database server which will create a LAN hub:

$ ./boltdb-server
boltdb-server running on http://X.Y.Z.W:8050

Then, to capture all the links on a website:

$ linkcrawler --server http://X.Y.Z.W:8050 crawl http://rpiai.com

Make sure to replace http://X.Y.Z.W:8050 with the address printed by boltdb-server.

You can run this last command on as many machines as you want; each instance helps crawl the same website and adds collected links to the shared queue on the server.

The current state of the crawler is saved. If the crawler is interrupted, you can simply run the command again and it will restart from the last state.

See the help (-help) if you'd like to see more options, such as exclusions/inclusions and modifying the worker pool and connection pools.
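For a sense of what tuning the connection pool means in Go: the standard library's http.Transport keeps idle connections per host for reuse, and raising its limits avoids repeated TCP handshakes against the same site. The following is only an illustration using the standard library; the specific numbers are examples, not linkcrawler's defaults.

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// A client whose Transport keeps more idle connections per host,
	// so repeated requests to the same site skip the TCP handshake.
	client := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			MaxIdleConns:        100,
			MaxIdleConnsPerHost: 20, // stdlib default is 2; crawlers benefit from more
			IdleConnTimeout:     30 * time.Second,
		},
	}
	resp, err := client.Get("http://rpiai.com")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}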

Download a site

You can also use linkcrawler to download webpages from a newline-delimited list of websites. As before, first start up a boltdb-server. Then you can run:

$ linkcrawler --server http://X.Y.Z.W:8050 download links.txt

Downloads are saved into a folder named downloaded, with each file named after the Base32 encoding of its URL and its contents compressed with gzip.
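That naming scheme is invertible. The Go sketch below assumes standard Base32 (consistent with the dump example further down, where NB2HI4B2F4XXE4DJMFUS4Y3PNU====== decodes to http://rpiai.com) and a hypothetical .gz extension for the saved file:

package main

import (
	"compress/gzip"
	"encoding/base32"
	"fmt"
	"io"
	"os"
	"strings"
)

func main() {
	// Hypothetical saved file; the stem is the Base32-encoded URL.
	name := "NB2HI4B2F4XXE4DJMFUS4Y3PNU======.gz"
	stem := strings.TrimSuffix(name, ".gz")

	// Decode the filename back to the crawled URL.
	urlBytes, err := base32.StdEncoding.DecodeString(stem)
	if err != nil {
		panic(err)
	}
	fmt.Println("URL:", string(urlBytes)) // http://rpiai.com

	// Decompress the stored page.
	f, err := os.Open("downloaded/" + name)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	zr, err := gzip.NewReader(f)
	if err != nil {
		panic(err)
	}
	defer zr.Close()
	html, err := io.ReadAll(zr)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes of HTML\n", len(html))
}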

Dump the current list of links

To dump the current database, just use

$ linkcrawler --server http://X.Y.Z.W:8050 dump http://rpiai.com
Wrote 32 links to NB2HI4B2F4XXE4DJMFUS4Y3PNU======.txt

License

MIT
