
NikosRig / Mimo-Crawler

Licence: GPL-3.0 license
A web crawler, written in Node.js, that uses Firefox and JavaScript injection to interact with webpages and crawl their content.

Projects that are alternatives of or similar to Mimo-Crawler

Singlefilez
Web Extension for Firefox/Chrome/MS Edge and CLI tool to save a faithful copy of an entire web page in a self-extracting HTML/ZIP polyglot file
Stars: ✭ 882 (+3909.09%)
Mutual labels:  firefox, webpage
crawlkit
A crawler based on Phantom. Allows discovery of dynamic content and supports custom scrapers.
Stars: ✭ 23 (+4.55%)
Mutual labels:  scraper, crawling
Spam Bot 3000
Social media research and promotion, semi-autonomous CLI bot
Stars: ✭ 79 (+259.09%)
Mutual labels:  firefox, scraper
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+70513.64%)
Mutual labels:  scraper, crawling
gotor
This program provides efficient web scraping services for Tor and non-Tor sites. The program has both a CLI and REST API.
Stars: ✭ 97 (+340.91%)
Mutual labels:  webcrawler, webscraping
Polite
Be nice on the web
Stars: ✭ 253 (+1050%)
Mutual labels:  scraper, webscraping
BookingScraper
🌎 🏨 Scrape Booking.com 🏨 🌎
Stars: ✭ 68 (+209.09%)
Mutual labels:  scraper, webscraping
Goscraper
Golang pkg to quickly return a preview of a webpage (title/description/images)
Stars: ✭ 72 (+227.27%)
Mutual labels:  scraper, webpage
robotstxt
robots.txt file parsing and checking for R
Stars: ✭ 65 (+195.45%)
Mutual labels:  scraper, webscraping
diffbot-php-client
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
Stars: ✭ 53 (+140.91%)
Mutual labels:  scraper, crawling
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+677.27%)
Mutual labels:  scraper, crawling
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (+136.36%)
Mutual labels:  scraper, crawling
Youtube Projects
This repository contains all the code I use in my YouTube tutorials.
Stars: ✭ 144 (+554.55%)
Mutual labels:  scraper, webscraping
Linkedin scraper
A library that scrapes Linkedin for user data
Stars: ✭ 413 (+1777.27%)
Mutual labels:  firefox, scraper
Newspaper
News, full-text, and article metadata extraction in Python 3. Advanced docs:
Stars: ✭ 11,545 (+52377.27%)
Mutual labels:  scraper, crawling
web-crawler
Python Web Crawler with Selenium and PhantomJS
Stars: ✭ 19 (-13.64%)
Mutual labels:  scraper, webcrawler
Huginn
Create agents that monitor and act on your behalf. Your agents are standing by!
Stars: ✭ 33,694 (+153054.55%)
Mutual labels:  scraper, webscraping
Django Dynamic Scraper
Creating Scrapy scrapers via the Django admin interface
Stars: ✭ 1,024 (+4554.55%)
Mutual labels:  scraper, webscraping
ant
A web crawler for Go
Stars: ✭ 264 (+1100%)
Mutual labels:  scraper, web-crawler
proxycrawl-python
ProxyCrawl Python library for scraping and crawling
Stars: ✭ 51 (+131.82%)
Mutual labels:  scraper, crawling

Mimo Crawler

Mimo is a "state of the art" web crawler that uses a non-headless Firefox and JavaScript injection to crawl webpages.

Why Mimo?

What makes Mimo special is that, instead of using the DevTools Protocol and a browser in headless mode, it uses websockets as the communication channel between a non-headless browser and the client. You interact with and crawl a webpage by evaluating your own JavaScript code in the page's context.

This way:

  • Crawling is extremely fast
  • The crawler is harder for firewalls to trace
  • Headless browser detectors can be bypassed

Features

  • Simple Client API
  • Interactive crawling
  • Extremely fast compared to similar tools
  • Fully operated by your JavaScript code
  • Web spidering

Requirements

Installation

git clone https://github.com/NikosRig/Mimo-Crawler
cd Mimo-Crawler && npm install
sudo npm link

Getting started

Start Firefox and the Mimo server:

mimo-start

  • --firefox (optional) Overrides the default Firefox binary path.

You can also use Mimo on machines with no display hardware and no physical input devices with the help of Xvfb. Mimo will still use a non-headless Firefox.

xvfb-run mimo-start

You are then ready to use the Mimo API by including mimoClient.js in your script.

Using the Mimo client API

mimoClient.crawl(options)

Sends a new crawl request to Mimo.

  • options {Object}
    • url {String} The URL that you want to be crawled.
    • code {String} The JavaScript code that will be evaluated in the webpage.
    • closeTabDelay (optional) {millisecond} Overrides the tab's default closing time.
    • disableWindowAlert (optional) {boolean} If set to true, disables window.alert().
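
As a rough illustration of the options listed above, the sketch below assembles and sanity-checks an options object before it is handed to crawl(). The helper name buildCrawlOptions and its validation rules are assumptions made for this example; they are not part of Mimo, and Mimo may accept values that this helper rejects.

```javascript
// Hypothetical helper (not part of Mimo): assembles and sanity-checks
// the options object described above before it is passed to crawl().
function buildCrawlOptions({ url, code, closeTabDelay, disableWindowAlert }) {
    // new URL() throws a TypeError if the url is not absolute and well-formed.
    new URL(url);

    if (typeof code !== 'string' || code.trim() === '') {
        throw new TypeError('code must be a non-empty string of JavaScript');
    }

    const options = { url, code };

    // Optional fields are only included when the caller supplies them,
    // so Mimo's own defaults stay in effect otherwise.
    if (closeTabDelay !== undefined) {
        // Assumption: the delay is a whole, non-negative number of milliseconds.
        if (!Number.isInteger(closeTabDelay) || closeTabDelay < 0) {
            throw new TypeError('closeTabDelay must be a non-negative integer (ms)');
        }
        options.closeTabDelay = closeTabDelay;
    }
    if (disableWindowAlert !== undefined) {
        options.disableWindowAlert = Boolean(disableWindowAlert);
    }
    return options;
}

const options = buildCrawlOptions({
    url: 'https://www.example.com',
    code: 'response(document.title)',
    closeTabDelay: 5000,
    disableWindowAlert: true
});
// mimoClient.crawl(options);  // needs a running Mimo server, so left commented out
```

Failing early on the client side is a design choice for this sketch only: it keeps a malformed request from opening a tab at all.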

You can also write your code in a separate file, read it with Node's fs.readFileSync, and pass the contents as the value of code.

   const fs = require('fs');
   let options = { code: fs.readFileSync('./myscript.js', 'utf8') };

To get a response from Mimo, your code must call the response method, passing the value that you want returned as its argument.

let mycode = `setTimeout(() => {
   // do some things

   // Then return an object with the page title and the body.
   response({
      pageTitle: document.title,
      body: document.body.innerHTML
   });
}, 2000)`;
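
To make the response contract easier to see without a browser, here is a small simulation. It is not how Mimo is implemented; runSnippet and fakeDocument are hypothetical stand-ins for the page context, used only to show that the snippet is a string of code which hands a value back by calling response.

```javascript
// Hypothetical stand-in for the page context: NOT Mimo's implementation,
// just an illustration of the "call response(value)" contract.
function runSnippet(code, fakeDocument) {
    return new Promise((resolve) => {
        // Expose `response` and `document` to the snippet as parameters.
        const snippet = new Function('response', 'document', code);
        snippet(resolve, fakeDocument);
    });
}

// Same shape as the snippet above, with a short delay so the demo is quick.
const mycode = `setTimeout(() => {
    response({
        pageTitle: document.title,
        body: document.body.innerHTML
    });
}, 10)`;

// Minimal fake of the two document properties the snippet reads.
const fakeDocument = {
    title: 'Example Domain',
    body: { innerHTML: '<h1>Example Domain</h1>' }
};

const result = runSnippet(mycode, fakeDocument);
result.then((value) => console.log(value));
```

If the snippet never calls response, the promise in this simulation never settles. That mirrors the point made above: without a call to response, nothing is sent back.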

mimoClient.addResponseListener(callback)

Every time Mimo sends you back a response, this callback function will be called with the response as an argument.

mimoClient.addResponseListener(response => console.log(response) )
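
If you prefer promises to callbacks, one possible pattern is to wrap a single crawl request as sketched below. The function crawlOnce and the in-memory stubClient are assumptions made for this example so that it runs without Firefox or a Mimo server. With the real client you would pass the object returned by require('./src/app/mimoClient') instead. The sketch also assumes only one request is in flight, since the listener receives every response.

```javascript
// Wraps one crawl request in a Promise. `client` is any object exposing
// crawl() and addResponseListener() as documented above.
function crawlOnce(client, options) {
    return new Promise((resolve) => {
        // The listener fires for every response; a Promise settles only once,
        // so any later responses are simply ignored by this wrapper.
        client.addResponseListener((response) => resolve(response));
        client.crawl(options);
    });
}

// Hypothetical in-memory stub standing in for the real Mimo client.
const stubClient = {
    listeners: [],
    addResponseListener(callback) {
        this.listeners.push(callback);
    },
    crawl(options) {
        // Pretend the page answered with its URL after a short delay.
        setTimeout(() => {
            this.listeners.forEach((callback) => callback({ visited: options.url }));
        }, 10);
    }
};

const pending = crawlOnce(stubClient, {
    url: 'https://www.example.com',
    code: 'response(document.title)'
});
pending.then((response) => console.log(response));
```

The listener is registered before crawl() is called so that a fast response cannot arrive while nothing is listening.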

mimoClient.close()

Closes the connection with Mimo and terminates the client script.

Basic Example

const mimo_client = require('./src/app/mimoClient');

let message = {
    url: 'https://www.amazon.com/s?bbn=493964&rh=n%3A172282%2Cn%3A%21493964%2Cn%3A281407%2Cp_n_shipping_option-bin%3A3242350011&dc&fst=as%3Aoff&pf_rd_i=16225009011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=82d03e2f-30e3-48bf-a811-d3d2a6628949&pf_rd_r=MF600JK13S83FRSH3667&pf_rd_s=merchandised-search-4&pf_rd_t=101&qid=1486423355&rnid=493964&ref=s9_acss_bw_cts_AEElectr_T1_w',
    code: `
   
       let product_urls = [];
       
       document.querySelectorAll('a.a-link-normal').forEach(aElement => {
       
           product_urls.push('https://www.amazon.com' + aElement.getAttribute('href'))
       })
            
       response({category_products: product_urls})
    `
};

mimo_client.crawl(message)

mimo_client.addResponseListener((msg) => {
    console.log(msg)
    mimo_client.close();
})

Web Spidering

Every request that you send to Mimo creates a new tab, stores your attached code in the browser's storage, and executes it every time a webpage is opened in that tab. For example, if you reload the page or click on a link, your code will be re-executed.

const mimo_client = require('./src/app/mimoClient');

let spiderCode = `
    if (document.querySelector('a')) {
        // This will open a new URL in that tab, and your code will be re-executed
        document.querySelector('a').click()
    }
    response(document.title)
`;

mimo_client.crawl({
    url: 'https://www.example.com',
    code: spiderCode
})
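
Because the spider code runs again on every page load, responses keep arriving until the tab closes or the client disconnects. The sketch below shows one way to stop after a fixed number of pages. The function collectPages and the stubSpiderClient object are hypothetical and exist only so the example runs without Firefox. The result is delivered through a callback before close() is called, because close() terminates the client script.

```javascript
// Collects responses from a spidering tab and closes the client after
// `maxPages` of them. `client` follows the API documented above.
function collectPages(client, options, maxPages, onDone) {
    const pages = [];
    client.addResponseListener((response) => {
        if (pages.length >= maxPages) return; // ignore late responses
        pages.push(response);
        if (pages.length === maxPages) {
            // Deliver the result first: close() ends the client script.
            onDone(pages);
            client.close();
        }
    });
    client.crawl(options);
}

// Hypothetical stub: emits a new "page title" every few milliseconds,
// imitating the spider code re-running after each navigation.
const stubSpiderClient = {
    listener: null,
    timer: null,
    count: 0,
    addResponseListener(callback) {
        this.listener = callback;
    },
    crawl() {
        this.timer = setInterval(() => this.listener(`Page ${++this.count}`), 5);
    },
    close() {
        clearInterval(this.timer);
    }
};

const spiderCode = `
    if (document.querySelector('a')) {
        document.querySelector('a').click()
    }
    response(document.title)
`;

let collected = null;
collectPages(
    stubSpiderClient,
    { url: 'https://www.example.com', code: spiderCode },
    3,
    (pages) => { collected = pages; console.log(pages); }
);
```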

Licence

Copyright (c) 2020 Nikos Rigas

This software is released under the terms of the GNU General Public License v3.0. See the licence file for further information.