
Arachnod

High performance crawler for Node.js

A powerful & easy-to-use web crawler for Node.js. Arachnod has been designed for heavy, long-running tasks, with performance & effective resource usage in mind. To achieve this, Arachnod uses the power of Redis as a backend, covering all of the heavy & time-consuming work such as tracking URLs and their tasks, and storing & distributing information among Arachnod's child tasks (Spiderlings). Arachnod also avoids server-side DOM techniques such as jQuery on top of jsdom, in order to use resources properly. Frankly, I tested jsdom for a long time with no luck: always memory leaks & high memory usage. libxml-based XPath solutions were not really practical either. Instead, Arachnod uses Cheerio for accessing DOM elements, and SuperAgent as its HTTP client.

How to install

$ npm install arachnod

Or via Git: $ git clone git@github.com:risyasin/arachnod.git

Then install the required Node.js modules with npm: $ npm install

Please make sure you have a running redis-server.
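
If you are not sure whether Redis is up, redis-cli should answer with PONG (assuming a default local installation):

$ redis-cli ping
PONG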

How to use

var bot = require('arachnod');

bot.on('hit', function (doc, $) {

    // Do whatever you want with the parsed HTML content.
    var desc = $('article.entry-content').text();

    console.log(doc.path, desc);

    // Pause the bot if you don't need to follow all links.
    bot.pause();

});

bot.crawl({
    'redis': '127.0.0.1',
    'parallel': 4,
    'start': 'https://github.com/risyasin/arachnod',
    'resume': false
});

bot.on('error', function (err, task) {
    console.log('Bot error:', err, err.stack, task);
});

bot.on('end', function (err, status) {
    console.log('Bot finished:', err, status);
});

Documentation

Parameters

Parameter Name - Description
start - Start URL for crawling. (Mandatory)
parallel - Number of child processes that will handle network tasks. Do not set this higher than 20. (Default: 8)
redis - Host name or IP address that Redis runs on. (Default: 127.0.0.1)
redisPort - Port number for Redis. (Default: 6379)
verbose - How much Arachnod will tell you, from 1 (silence) to 10 (everything). (Default: 1)
resume - Resume support; simply does not reset the queues if there are any. (Default: false)
ignorePaths - Ignores paths starting with any of the given prefixes. Must be an array, such as ['/blog', '/gallery'].
ignoreParams - Ignores the given query string parameters. Must be an array, such as ['color', 'type'].
sameDomain - Stays on the same hostname. (Implemented as of 0.4.4)
useCookies - Enables cookies. (Implemented in 0.4.4 as the cookie parameter)
basicAuth - Basic authentication credentials as user:pass.
obeyRobotsTxt - As its name says, honors robots.txt. (Will be implemented in v0.5)
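
For instance, a crawl that stays on one hostname, skips a couple of sections and sends basic auth credentials could be configured as sketched below; all options are from the table above, but the URL and values are made up for illustration:

bot.crawl({
    'start': 'https://example.com/',          // mandatory start URL
    'redis': '127.0.0.1',
    'redisPort': 6379,
    'parallel': 8,                            // keep this at 20 or below
    'verbose': 5,
    'sameDomain': true,                       // do not leave example.com
    'ignorePaths': ['/blog', '/gallery'],     // skip these path prefixes
    'ignoreParams': ['color', 'type'],        // drop these query string parameters
    'basicAuth': 'user:pass'
});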
Events

Event Name - Description
hit - Emitted when a URL has been downloaded & processed. Sends two parameters, in order: doc (parsed URL info) and $ (a Cheerio object).
error - Emitted when an error occurs at any level, including child processes. Single parameter: the Error or exception.
end - Emitted when the end of the task queue is reached. Returns statistics.
stats - Emits bot stats whenever a child changes its state (such as downloading or querying queues). Use wisely.
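
Since stats fires on every child state change, a handler should stay cheap. A minimal sketch, assuming only that the event delivers a single stats payload:

var statsSeen = 0;

bot.on('stats', function (stats) {
    // Fires very often; only log every 100th event to keep overhead low.
    statsSeen += 1;
    if (statsSeen % 100 === 0) {
        console.log('progress:', stats);
    }
});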
Methods

Method Name - Description
crawl(Parameters) - Starts a new crawling session with the given parameters.
pause() - Stops the bot but does not delete any task queue.
resume() - Resumes a paused session. Useful for controlling resource usage on low-spec systems (single core, etc.).
queue(url) - Adds the given URL to the task queue.
getStats() - Returns various statistics such as downloaded, checked and finished URL counts, memory usage, etc.
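
Putting the methods together, manual queue and resource control might look like the sketch below (the sitemap URL and the one-second delay are arbitrary examples):

// Seed an extra URL next to the start URL.
bot.queue('https://example.com/sitemap.html');

bot.on('hit', function (doc, $) {
    // Throttle on a low-spec machine: pause after each hit, resume shortly after.
    bot.pause();
    setTimeout(function () { bot.resume(); }, 1000);
});

bot.on('end', function (err, status) {
    // Dump the final counters: downloaded, checked, finished URLs, memory, etc.
    console.log(bot.getStats());
});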
What's Next
  • Regex support for ignore parameters
  • Cookie support
  • Robots.txt & rel=nofollow support
  • Actions for content-type or any given response headers
  • Custom headers
  • Custom POST/PUT method queues
  • Free-Ride mode (will be fun, probably not useful)
  • Stats for each download/hit event
  • Plugin support

Support

If you like Arachnod, help me improve it.

License

Copyright 2015-17 yasin inat

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
