
tijme / Not Your Average Web Crawler

License: MIT
A web crawler (for bug hunting) that gathers more than you can imagine.

Programming Languages

python

Projects that are alternatives to or similar to Not Your Average Web Crawler

Xcrawler
A fast, concise, and powerful PHP crawler framework.
Stars: ✭ 344 (+221.5%)
Mutual labels:  crawler, spider, scraper
Awesome Crawler
A collection of awesome web crawlers and spiders in different languages
Stars: ✭ 4,793 (+4379.44%)
Mutual labels:  crawler, spider, scraper
Freshonions Torscraper
Fresh Onions is an open-source Tor spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion
Stars: ✭ 348 (+225.23%)
Mutual labels:  crawler, spider, scraper
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+14418.69%)
Mutual labels:  crawler, spider, scraper
Geziyor
Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.
Stars: ✭ 1,246 (+1064.49%)
Mutual labels:  crawler, spider, scraper
arachnod
A high-performance crawler for Node.js
Stars: ✭ 17 (-84.11%)
Mutual labels:  crawler, scraper, spider
Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Stars: ✭ 440 (+311.21%)
Mutual labels:  crawler, spider, scraper
Gosint
OSINT Swiss Army Knife
Stars: ✭ 401 (+274.77%)
Mutual labels:  crawler, spider, scraper
Crawler
A high-performance web crawler in Elixir.
Stars: ✭ 781 (+629.91%)
Mutual labels:  crawler, spider, scraper
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (+513.08%)
Mutual labels:  crawler, spider, scraper
Querylist
🕷️ The elegant, progressive PHP crawler and scraping framework!
Stars: ✭ 2,392 (+2135.51%)
Mutual labels:  crawler, spider, scraper
Blackwidow
A Python-based web application scanner to gather OSINT and fuzz for OWASP vulnerabilities on a target website.
Stars: ✭ 887 (+728.97%)
Mutual labels:  spider, scanner, vulnerability
Goribot
[Crawler/Scraper for Golang] 🕷 A lightweight, distributed-friendly Golang crawler framework.
Stars: ✭ 190 (+77.57%)
Mutual labels:  crawler, spider, scraper
Dumpall
An information-leak exploitation tool for .git/.svn source code leaks and .DS_Store leaks.
Stars: ✭ 250 (+133.64%)
Mutual labels:  spider, scanner, bug-bounty
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+59.81%)
Mutual labels:  crawler, spider, scraper
Fbcrawl
A Facebook crawler
Stars: ✭ 536 (+400.93%)
Mutual labels:  crawler, spider, scraper
Scrapit
Scraping scripts for various websites.
Stars: ✭ 25 (-76.64%)
Mutual labels:  crawler, spider, scraper
Avbook
An adult video (AV) management system with crawlers for avmoo, javbus, and javlibrary; an online Japanese adult video library and magnet-link database.
Stars: ✭ 8,133 (+7500.93%)
Mutual labels:  crawler, spider, scraper
Beanbun
Beanbun is a multi-process web crawler framework written in PHP, based on Workerman, with good openness and high extensibility.
Stars: ✭ 1,096 (+924.3%)
Mutual labels:  crawler, spider
Car Prices
A Golang crawler for scraping Autohome's (汽车之家) used-car product database.
Stars: ✭ 57 (-46.73%)
Mutual labels:  crawler, spider

.. raw:: html

   <p align="center">

.. image:: https://tijme.github.io/not-your-average-web-crawler/latest/_static/img/logo.svg?pypi=png.from.svg
   :width: 300px
   :height: 300px
   :alt: N.Y.A.W.C. logo
   :align: center

.. raw:: html

   <br class="title">

.. image:: https://raw.finnwea.com/shield/?firstText=Donate%20via&secondText=Bunq
   :target: https://bunq.me/tijme/0/A%20web%20crawler%20(for%20bug%20hunting)%20that%20gathers%20more%20than%20you%20can%20imagine
   :alt: Donate via Bunq

.. image:: https://raw.finnwea.com/shield/?typeKey=TravisBuildStatus&typeValue1=tijme/not-your-average-web-crawler&typeValue2=master&cache=1
   :target: https://travis-ci.org/tijme/not-your-average-web-crawler
   :alt: Build Status

.. image:: https://raw.finnwea.com/vector-shields-v1/?typeKey=SemverVersion&typeValue1=tijme&typeValue2=not-your-average-web-crawler
   :target: https://pypi.python.org/pypi/nyawc/
   :alt: PyPi version

.. image:: https://raw.finnwea.com/shield/?firstText=License&secondText=MIT
   :target: https://github.com/tijme/not-your-average-web-crawler/blob/master/LICENSE.rst
   :alt: License: MIT

.. raw:: html

   </p>

Not Your Average Web Crawler

N.Y.A.W.C is a Python library that enables you to test your payload against all requests of a given domain. It crawls all requests (e.g. GET, POST, or PUT) in the specified scope and keeps track of the request and response data. During the crawling process, callbacks enable you to insert your payload at specific places and check whether it worked.

Table of contents

  • `Installation <#installation>`__
  • `Crawling flow <#crawling-flow>`__
  • `Documentation <#documentation>`__
  • `Minimal implementation <#minimal-implementation>`__
  • `Testing <#testing>`__
  • `Issues <#issues>`__
  • `License <#license>`__

Installation

First make sure you're on `Python 2.7/3.3 <https://www.python.org/>`__ or higher. Then run the command below to install N.Y.A.W.C.

.. code:: bash

    $ pip install --upgrade nyawc

Crawling flow

  1. You define your start point (a request) and the crawling scope, and then start the crawler.
  2. The crawler repeatedly starts the first request in the queue until the maximum number of threads is reached.
  3. The crawler adds all requests found in the response to the end of the queue (except duplicates).
  4. The crawler goes back to step 2 to spawn new requests until the maximum number of threads is reached (a minimal sketch of this loop follows below).

.. image:: https://tijme.github.io/not-your-average-web-crawler/latest/_static/img/flow.svg
   :alt: N.Y.A.W.C crawling flow

Please note that if the queue is empty and all crawler threads are finished, the crawler will stop.
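
For clarity, the sketch below mirrors that flow in plain, single-threaded Python. It is illustrative only: ``fetch``, ``extract_requests``, and ``in_scope`` are hypothetical stand-ins for the real HTTP, parsing, and scope logic, and N.Y.A.W.C itself spreads the work across multiple request threads.

.. code:: python

    from collections import deque

    def crawl(start_url, in_scope, fetch, extract_requests):
        queue = deque([start_url])   # step 1: the start point seeds the queue
        seen = {start_url}           # remembered so duplicates are dropped (step 3)

        while queue:                 # the crawl stops once the queue is drained
            url = queue.popleft()    # step 2: take the first request in the queue
            response = fetch(url)
            for found in extract_requests(response):  # step 3: enqueue new requests
                if found in seen or not in_scope(found):
                    continue
                seen.add(found)
                queue.append(found)  # appended to the end of the queue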

Documentation

Please refer to the `documentation <https://tijme.github.io/not-your-average-web-crawler/>`__ or the `API <https://tijme.github.io/not-your-average-web-crawler/latest/py-modindex.html>`__ for all the information about N.Y.A.W.C.

Minimal implementation

You can use the callbacks in ``example_minimal.py`` to run your own exploit against the requests. If you want an example of automated exploit scanning, please take a look at `ACSTIS <https://github.com/tijme/angularjs-csti-scanner>`__ (it uses N.Y.A.W.C to scan for AngularJS client-side template injection vulnerabilities).

You can also use the `kitchen sink <https://tijme.github.io/not-your-average-web-crawler/latest/kitchen_sink.html>`__ (which contains all the functionality of N.Y.A.W.C.) instead of the example below. The code below is a minimal implementation of N.Y.A.W.C.

  • $ python example_minimal.py
  • $ python -u example_minimal.py > output.log

.. code:: python

    # example_minimal.py

    from nyawc.Options import Options
    from nyawc.QueueItem import QueueItem
    from nyawc.Crawler import Crawler
    from nyawc.CrawlerActions import CrawlerActions
    from nyawc.http.Request import Request

    def cb_crawler_before_start():
        print("Crawler started.")

    def cb_crawler_after_finish(queue):
        print("Crawler finished.")
        print("Found " + str(len(queue.get_all(QueueItem.STATUS_FINISHED))) + " requests.")

    def cb_request_before_start(queue, queue_item):
        print("Starting: {}".format(queue_item.request.url))
        return CrawlerActions.DO_CONTINUE_CRAWLING

    def cb_request_after_finish(queue, queue_item, new_queue_items):
        print("Finished: {}".format(queue_item.request.url))
        return CrawlerActions.DO_CONTINUE_CRAWLING

    options = Options()

    options.callbacks.crawler_before_start = cb_crawler_before_start # Called before the crawler starts crawling. Default is a null route.
    options.callbacks.crawler_after_finish = cb_crawler_after_finish # Called after the crawler finished crawling. Default is a null route.
    options.callbacks.request_before_start = cb_request_before_start # Called before the crawler starts a new request. Default is a null route.
    options.callbacks.request_after_finish = cb_request_after_finish # Called after the crawler finishes a request. Default is a null route.

    crawler = Crawler(options)
    crawler.start_with(Request("https://finnwea.com/"))
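
The example below builds on the minimal implementation above and sketches where a payload could be inserted, as described in the introduction. Treat it as a hedged sketch rather than an official N.Y.A.W.C recipe: the payload value and the query-parameter injection are arbitrary illustrative choices, and it assumes that ``queue_item.response`` behaves like a ``requests`` response object once a request has finished.

.. code:: python

    # payload_example.py (illustrative sketch)

    from nyawc.Options import Options
    from nyawc.Crawler import Crawler
    from nyawc.CrawlerActions import CrawlerActions
    from nyawc.http.Request import Request

    PAYLOAD = "{{7*7}}"  # example probe value; substitute whatever your test needs

    def cb_request_before_start(queue, queue_item):
        # Append the payload as an extra query parameter before the request is sent.
        separator = "&" if "?" in queue_item.request.url else "?"
        queue_item.request.url += separator + "payload=" + PAYLOAD
        return CrawlerActions.DO_CONTINUE_CRAWLING

    def cb_request_after_finish(queue, queue_item, new_queue_items):
        # Assumption: queue_item.response holds the finished HTTP response.
        if queue_item.response is not None and PAYLOAD in queue_item.response.text:
            print("Payload reflected at: {}".format(queue_item.request.url))
        return CrawlerActions.DO_CONTINUE_CRAWLING

    options = Options()
    options.callbacks.request_before_start = cb_request_before_start
    options.callbacks.request_after_finish = cb_request_after_finish

    crawler = Crawler(options)
    crawler.start_with(Request("https://finnwea.com/"))

A reflected payload is only a hint; for automated checks of evaluated payloads (such as template injection), see ACSTIS mentioned above.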

Testing

Testing is done automatically by `Travis CI <https://travis-ci.org/tijme/not-your-average-web-crawler>`__ on every push to the master branch. If you want to run the unit tests manually, use the command below.

.. code:: bash

    $ python -m unittest discover
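
If you want a starting point for tests of your own callbacks, the sketch below is one possibility that ``discover`` would pick up. The file name and assertions are illustrative and rely only on the ``Options`` attributes already shown in the minimal example above.

.. code:: python

    # test_callbacks.py (illustrative; found by `python -m unittest discover`)

    import unittest

    from nyawc.Options import Options

    class TestOptionsCallbacks(unittest.TestCase):

        def test_callback_slots_exist(self):
            # A fresh Options instance should expose the callback slots used in the examples.
            options = Options()
            self.assertTrue(hasattr(options.callbacks, "crawler_before_start"))
            self.assertTrue(hasattr(options.callbacks, "request_after_finish"))

        def test_callbacks_are_assignable(self):
            # Assigning a custom callable replaces the default null route.
            options = Options()
            options.callbacks.request_before_start = lambda queue, queue_item: None
            self.assertTrue(callable(options.callbacks.request_before_start))

    if __name__ == "__main__":
        unittest.main()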

Issues

Issues and feature requests can be reported via the GitHub issue tracker. Please make sure your issue or feature request has not already been reported by someone else before submitting a new one.

License

Not Your Average Web Crawler (N.Y.A.W.C) is open-sourced software licensed under the `MIT license <https://github.com/tijme/not-your-average-web-crawler/blob/master/LICENSE.rst>`__.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].