
schollz / Linkcrawler

License: MIT
Cross-platform persistent and distributed web crawler 🔗

Programming Languages

Go
31211 projects - #10 most used programming language

Projects that are alternatives to or similar to Linkcrawler

Infinitycrawler
A simple but powerful web crawler library for .NET
Stars: ✭ 97 (-11.01%)
Mutual labels:  crawler
Dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library for Entity Framework Core output, based on .NET Core. The library is designed like other strong crawler libraries such as WebMagic and Scrapy, but built to be extensible for your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Stars: ✭ 100 (-8.26%)
Mutual labels:  crawler
Not Your Average Web Crawler
A web crawler (for bug hunting) that gathers more than you can imagine.
Stars: ✭ 107 (-1.83%)
Mutual labels:  crawler
Thesaurusspider
A Python crawler that downloads the lexicon files of the Sogou, Baidu, and QQ input methods; it can be used to build vocabularies for different industries.
Stars: ✭ 98 (-10.09%)
Mutual labels:  crawler
Crawlerpack
A Java web data crawler package.
Stars: ✭ 99 (-9.17%)
Mutual labels:  crawler
D4n155
OWASP D4N155 - Intelligent and dynamic wordlist using OSINT
Stars: ✭ 105 (-3.67%)
Mutual labels:  crawler
Lightcrawler
Crawl a website and run it through Google lighthouse
Stars: ✭ 1,339 (+1128.44%)
Mutual labels:  crawler
Fawkes
Fawkes is a tool to search for targets vulnerable to SQL injection. It performs the search using the Google search engine.
Stars: ✭ 108 (-0.92%)
Mutual labels:  crawler
Ruia
Async Python 3.6+ web scraping micro-framework based on asyncio
Stars: ✭ 1,366 (+1153.21%)
Mutual labels:  crawler
Crawler Detect
🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent
Stars: ✭ 1,549 (+1321.1%)
Mutual labels:  crawler
Gopa Abandoned
GOPA, a spider written in Go. (NOTE: this project moved to https://github.com/infinitbyte/gopa)
Stars: ✭ 98 (-10.09%)
Mutual labels:  crawler
Antispider
Stars: ✭ 99 (-9.17%)
Mutual labels:  crawler
Skycaiji
SkyCaiji is a free data collection and publishing crawler built with PHP + MySQL. It can be deployed on a cloud server, collect almost every type of web page, integrate seamlessly with all kinds of CMS site builders, and publish data in real time without logging in, fully automatically with no manual intervention. It is a fully cross-platform, cloud-based crawler system for web big-data collection.
Stars: ✭ 1,514 (+1288.99%)
Mutual labels:  crawler
Amazonrobot
A Python crawler for driving traffic to Amazon products.
Stars: ✭ 97 (-11.01%)
Mutual labels:  crawler
Webmagic
A scalable web crawler framework for Java.
Stars: ✭ 10,186 (+9244.95%)
Mutual labels:  crawler
Scaleable Crawler With Docker Cluster
A scalable and efficient crawler with a Docker cluster; crawls a million pages in 2 hours with a single machine.
Stars: ✭ 96 (-11.93%)
Mutual labels:  crawler
Andvaranaut
A dungeon crawler
Stars: ✭ 103 (-5.5%)
Mutual labels:  crawler
Lumberjack
An automated website accessibility scanner and cli
Stars: ✭ 109 (+0%)
Mutual labels:  crawler
Scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Stars: ✭ 42,343 (+38746.79%)
Mutual labels:  crawler
Crawler
Crawler, HTTP proxy, simulated login!
Stars: ✭ 106 (-2.75%)
Mutual labels:  crawler

linkcrawler

Cross-platform persistent and distributed web crawler

linkcrawler is persistent because the queue is stored in a remote database that is automatically re-initialized if interrupted. linkcrawler is distributed because multiple instances work from the same remotely stored queue, so you can start as many crawlers as you want on separate machines to speed up the process. linkcrawler is also fast because it is threaded and uses connection pools.
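To make the distribution model concrete, here is a minimal, self-contained Go sketch of the pattern: a pool of workers all draining one shared queue. The in-memory queue below stands in for the remote boltdb-server, and none of these names come from linkcrawler's actual code.

package main

import (
	"fmt"
	"net/http"
	"sync"
)

// queue stands in for the remote store (boltdb-server in linkcrawler's
// case); this in-memory version is only for illustration.
type queue struct {
	mu    sync.Mutex
	links []string
}

// pop takes the next un-crawled link, or reports the queue is drained.
func (q *queue) pop() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.links) == 0 {
		return "", false
	}
	u := q.links[0]
	q.links = q.links[1:]
	return u, true
}

func main() {
	q := &queue{links: []string{"http://rpiai.com"}}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // worker pool: 4 threads on this machine
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				u, ok := q.pop()
				if !ok {
					return // nothing left to crawl
				}
				resp, err := http.Get(u) // the shared client reuses pooled connections
				if err != nil {
					continue
				}
				resp.Body.Close()
				fmt.Println("crawled", u)
				// newly discovered links would be pushed back to the shared queue here
			}
		}()
	}
	wg.Wait()
}

Because every worker, on every machine, pops from the same store, adding machines simply adds consumers to the queue.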

Crawl responsibly.

This repo has been superseded by schollz/goredis-crawler

Getting Started

Install

If you have Go installed, just do

go get github.com/schollz/linkcrawler/...
go get github.com/schollz/boltdb-server/...

Otherwise, download the linkcrawler and boltdb-server binaries from their respective releases pages.

Run

Crawl a site

First run the database server which will create a LAN hub:

$ ./boltdb-server
boltdb-server running on http://X.Y.Z.W:8050

Then, to capture all the links on a website:

$ linkcrawler --server http://X.Y.Z.W:8050 crawl http://rpiai.com

Make sure to replace http://X.Y.Z.W:8050 with the address printed by boltdb-server.

You can run this last command on as many machines as you want; each instance helps crawl the same website and adds collected links to the shared queue on the server.

The current state of the crawler is saved. If the crawler is interrupted, you can simply run the command again and it will restart from the last state.

See the help (-help) if you'd like to see more options, such as exclusions/inclusions and modifying the worker pool and connection pools.
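For a sense of what tuning the connection pool means in Go: the standard library's http.Transport keeps idle connections per host for reuse, and raising its limits avoids repeated TCP handshakes against the same site. The following is only an illustration using the standard library; the specific numbers are examples, not linkcrawler's defaults.

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// A client whose Transport keeps more idle connections per host,
	// so repeated requests to the same site skip the TCP handshake.
	client := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			MaxIdleConns:        100,
			MaxIdleConnsPerHost: 20, // stdlib default is 2; crawlers benefit from more
			IdleConnTimeout:     30 * time.Second,
		},
	}
	resp, err := client.Get("http://rpiai.com")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}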

Download a site

You can also use linkcrawler to download webpages from a newline-delimited list of websites. As before, first start up a boltdb-server. Then you can run:

$ linkcrawler --server http://X.Y.Z.W:8050 download links.txt

Downloads are saved into a folder named downloaded, with each file named after the Base32 encoding of its URL and its contents compressed with gzip.
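That naming scheme is invertible. The Go sketch below assumes standard Base32 (consistent with the dump example further down, where NB2HI4B2F4XXE4DJMFUS4Y3PNU====== decodes to http://rpiai.com) and a hypothetical .gz extension for the saved file:

package main

import (
	"compress/gzip"
	"encoding/base32"
	"fmt"
	"io"
	"os"
	"strings"
)

func main() {
	// Hypothetical saved file; the stem is the Base32-encoded URL.
	name := "NB2HI4B2F4XXE4DJMFUS4Y3PNU======.gz"
	stem := strings.TrimSuffix(name, ".gz")

	// Decode the filename back to the crawled URL.
	urlBytes, err := base32.StdEncoding.DecodeString(stem)
	if err != nil {
		panic(err)
	}
	fmt.Println("URL:", string(urlBytes)) // http://rpiai.com

	// Decompress the stored page.
	f, err := os.Open("downloaded/" + name)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	zr, err := gzip.NewReader(f)
	if err != nil {
		panic(err)
	}
	defer zr.Close()
	html, err := io.ReadAll(zr)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes of HTML\n", len(html))
}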

Dump the current list of links

To dump the current database, just use

$ linkcrawler --server http://X.Y.Z.W:8050 dump http://rpiai.com
Wrote 32 links to NB2HI4B2F4XXE4DJMFUS4Y3PNU======.txt

License

MIT
