NikosRig / Mimo-Crawler

Licence: GPL-3.0

A web crawler, written in Node.js, that uses Firefox and JavaScript injection to interact with webpages and crawl their content.


Projects that are alternatives of or similar to Mimo-Crawler

Web Extension for Firefox/Chrome/MS Edge and CLI tool to save a faithful copy of an entire web page in a self-extracting HTML/ZIP polyglot file

Stars: ✭ 882 (+3909.09%)

Mutual labels: firefox, webpage

crawlkit

A crawler based on Phantom. Allows discovery of dynamic content and supports custom scrapers.

Stars: ✭ 23 (+4.55%)

Mutual labels: scraper, crawling

Spam Bot 3000

Social media research and promotion, semi-autonomous CLI bot

Stars: ✭ 79 (+259.09%)

Mutual labels: firefox, scraper

Colly

Elegant Scraper and Crawler Framework for Golang

Stars: ✭ 15,535 (+70513.64%)

Mutual labels: scraper, crawling

gotor

This program provides efficient web scraping services for Tor and non-Tor sites. The program has both a CLI and REST API.

Stars: ✭ 97 (+340.91%)

Mutual labels: webcrawler, webscraping

Polite

Be nice on the web

Stars: ✭ 253 (+1050%)

Mutual labels: scraper, webscraping

BookingScraper

🌎 🏨 Scrape Booking.com 🏨 🌎

Stars: ✭ 68 (+209.09%)

Mutual labels: scraper, webscraping

Goscraper

Golang pkg to quickly return a preview of a webpage (title/description/images)

Stars: ✭ 72 (+227.27%)

Mutual labels: scraper, webpage

robotstxt

robots.txt file parsing and checking for R

Stars: ✭ 65 (+195.45%)

Mutual labels: scraper, webscraping

diffbot-php-client

[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library

Stars: ✭ 53 (+140.91%)

Mutual labels: scraper, crawling

Linkedin Profile Scraper

🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.

Stars: ✭ 171 (+677.27%)

Mutual labels: scraper, crawling

wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

Stars: ✭ 52 (+136.36%)

Mutual labels: scraper, crawling

Youtube Projects

This repository contains all the code I use in my YouTube tutorials.

Stars: ✭ 144 (+554.55%)

Mutual labels: scraper, webscraping

Linkedin scraper

A library that scrapes Linkedin for user data

Stars: ✭ 413 (+1777.27%)

Mutual labels: firefox, scraper

Newspaper

News, full-text, and article metadata extraction in Python 3. Advanced docs:

Stars: ✭ 11,545 (+52377.27%)

Mutual labels: scraper, crawling

web-crawler

Python Web Crawler with Selenium and PhantomJS

Stars: ✭ 19 (-13.64%)

Mutual labels: scraper, webcrawler

Huginn

Create agents that monitor and act on your behalf. Your agents are standing by!

Stars: ✭ 33,694 (+153054.55%)

Mutual labels: scraper, webscraping

Django Dynamic Scraper

Creating Scrapy scrapers via the Django admin interface

Stars: ✭ 1,024 (+4554.55%)

Mutual labels: scraper, webscraping

ant

A web crawler for Go

Stars: ✭ 264 (+1100%)

Mutual labels: scraper, web-crawler

proxycrawl-python

ProxyCrawl Python library for scraping and crawling

Stars: ✭ 51 (+131.82%)

Mutual labels: scraper, crawling


Mimo Crawler

Mimo is a "state of the art" web crawler that uses non-headless Firefox and js injection to crawl webpages.

Why Mimo?

What makes Mimo special is that, instead of using the DevTools Protocol and a browser in headless mode, it uses websockets as a communication channel between a non-headless browser and the client. You interact with and crawl the webpage by evaluating your JavaScript code in the page's context.

This way:

Crawling is extremely fast
Traceability by firewalls is reduced
Headless browser detectors can be bypassed

Features

Simple Client API
Interactive crawling
Extremely fast compared to similar tools
Fully operated by your javascript code
Web spidering

Requirements

Firefox
node >= 14
Xvfb (optional)

Installation

git clone https://github.com/NikosRig/Mimo-Crawler
cd Mimo-Crawler && npm install
sudo npm link

Getting started

Start Firefox and the Mimo Server

--firefox (optional) Overrides the default firefox binary path.

mimo-start

You can also use Mimo on machines with no display hardware and no physical input devices with the help of Xvfb. Mimo will continue to use a non-headless Firefox.

xvfb-run mimo-start

Then you are ready to use the Mimo API by including mimoClient.js

Using the Mimo client API

`mimoClient.crawl(options)`

Sends a new crawl request to Mimo.

options {Object}
- url {String} The url that you want to be crawled.
- code {String} The JavaScript code that will be evaluated in the webpage.
- closeTabDelay (optional) {millisecond} Overrides the tab's default closing time.
- disableWindowAlert (optional) {boolean} If set to true, window.alert() is disabled.

You can also write the script in a separate file, read it with Node's fs.readFileSync, and pass its contents as the value of code.

   const fs = require('fs');
   let options = { code: fs.readFileSync('./myscript.js', 'utf8') };
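As a rough illustration of the options listed above, the sketch below assembles an options object and checks the documented fields before it would be handed to mimoClient.crawl. The helper name buildCrawlOptions and its validation rules are assumptions made for this example; they are not part of Mimo.

```javascript
// Hypothetical helper (not part of Mimo): builds the options object
// described above and rejects obviously invalid values early.
function buildCrawlOptions({ url, code, closeTabDelay, disableWindowAlert }) {
    if (typeof url !== 'string' || url.length === 0) {
        throw new TypeError('url must be a non-empty string');
    }
    if (typeof code !== 'string' || code.length === 0) {
        throw new TypeError('code must be a non-empty string');
    }

    const options = { url, code };

    // Optional fields are only added when the caller supplies them,
    // so Mimo's own defaults stay in effect otherwise.
    if (closeTabDelay !== undefined) {
        if (!Number.isFinite(closeTabDelay) || closeTabDelay < 0) {
            throw new TypeError('closeTabDelay must be a non-negative number of milliseconds');
        }
        options.closeTabDelay = closeTabDelay;
    }
    if (disableWindowAlert !== undefined) {
        options.disableWindowAlert = Boolean(disableWindowAlert);
    }
    return options;
}

const options = buildCrawlOptions({
    url: 'https://www.example.com',
    code: 'response(document.title)',
    closeTabDelay: 5000,
    disableWindowAlert: true
});
console.log(options);
```

The resulting object can then be passed to mimoClient.crawl in place of a hand-written literal.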

To get a response from Mimo, your code must call the response method, passing the value that you want returned as its argument.

let mycode = `setTimeout(() => {
   // do some things

   // Then return an object with the page title and the body.
   response({
      pageTitle: document.title,
      body: document.body.innerHTML
   });
}, 2000)`;
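One way to check such a code string before sending it to Mimo is a local dry run in Node. The sketch below is only that: Mimo itself evaluates the code inside Firefox, whereas here the string is compiled with stub response, document and setTimeout arguments. The dryRun helper and the fake document object are assumptions for illustration.

```javascript
// Local dry run only: Mimo evaluates the code inside the real page.
// Here the string is compiled with stub `response`, `document` and
// `setTimeout` arguments so the returned value can be inspected in Node.
function dryRun(code, fakeDocument) {
    let captured;
    const response = value => { captured = value; };
    // Run timer callbacks immediately instead of waiting for the delay.
    const immediateTimeout = callback => callback();
    const run = new Function('response', 'document', 'setTimeout', code);
    run(response, fakeDocument, immediateTimeout);
    return captured;
}

const mycode = `setTimeout(() => {
    response({
        pageTitle: document.title,
        body: document.body.innerHTML
    });
}, 2000)`;

// Minimal stand-in for the parts of `document` the code touches.
const result = dryRun(mycode, {
    title: 'Example Domain',
    body: { innerHTML: '<h1>Example Domain</h1>' }
});
console.log(result);
```

A dry run like this catches syntax errors and a missing response call, but it says nothing about how the code behaves against a live page.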

`mimoClient.addResponseListener(callback)`

Every time Mimo sends you back a response, this callback function will be called with the response as an argument.

mimoClient.addResponseListener(response => console.log(response) )
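If only a single response is expected, the listener can be wrapped in a Promise. The following is a minimal sketch under that assumption; crawlOnce is a hypothetical helper, and fakeClient is a stand-in used so the example runs without a Mimo server. It relies only on the crawl and addResponseListener methods described in this document.

```javascript
// Hypothetical helper: resolves with the first response the client delivers.
// `client` is any object exposing crawl(options) and
// addResponseListener(callback), as described for the Mimo client.
function crawlOnce(client, options) {
    return new Promise(resolve => {
        // Register the listener before sending the request so the
        // response cannot arrive unobserved.
        client.addResponseListener(resolve);
        client.crawl(options);
    });
}

// Stand-in client for illustration only; it echoes the requested url
// instead of talking to a Mimo server.
const fakeClient = {
    listeners: [],
    addResponseListener(callback) {
        this.listeners.push(callback);
    },
    crawl(options) {
        setImmediate(() => {
            this.listeners.forEach(callback => callback({ url: options.url }));
        });
    }
};

const pending = crawlOnce(fakeClient, {
    url: 'https://www.example.com',
    code: 'response(document.title)'
});
pending.then(response => console.log(response));
```

Because a Promise settles once, later responses from the same listener are ignored; for spidering, where many responses arrive, a plain callback remains the better fit.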

`mimoClient.close()`

Closes the connection with Mimo and terminates the client script.

Basic Example

const mimo_client = require('./src/app/mimoClient');

let message = {
    url: 'https://www.amazon.com/s?bbn=493964&rh=n%3A172282%2Cn%3A%21493964%2Cn%3A281407%2Cp_n_shipping_option-bin%3A3242350011&dc&fst=as%3Aoff&pf_rd_i=16225009011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=82d03e2f-30e3-48bf-a811-d3d2a6628949&pf_rd_r=MF600JK13S83FRSH3667&pf_rd_s=merchandised-search-4&pf_rd_t=101&qid=1486423355&rnid=493964&ref=s9_acss_bw_cts_AEElectr_T1_w',
    code: `
   
       let product_urls = [];
       
       document.querySelectorAll('a.a-link-normal').forEach(aElement => {
       
           product_urls.push('https://www.amazon.com' + aElement.getAttribute('href'))
       })
            
       response({category_products: product_urls})
    `
};

mimo_client.crawl(message)

mimo_client.addResponseListener((msg) => {
    console.log(msg)
    mimo_client.close();
})

Web Spidering

Every request that you send to Mimo creates a new tab, stores your attached code in the browser's storage, and executes it every time a webpage is opened in that tab. For example, if you reload the page or click on a link, your code will be re-executed.

const mimo_client = require('./src/app/mimoClient');

let spiderCode = `
   if (document.querySelector('a')) {
        // This will open a new url in that tab, and your code will be re-executed
        document.querySelector('a').click()
   }
    response(document.title)
`;

mimo_client.crawl({
    url: 'https://www.example.com',
    code: spiderCode
})
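Because the code runs again on every page the tab loads, a spider usually needs a rule for choosing the next link and for stopping. The sketch below expresses one such rule as a plain function. The name pickNextLink, the visited list and the maxPages cap are assumptions for illustration; how the visited list is kept between page loads (for example in the page's sessionStorage) is left open here.

```javascript
// Hypothetical link-selection rule for a spider script.
// hrefs:    candidate links found on the current page
// visited:  urls already crawled in this tab
// maxPages: stop once this many pages have been crawled
function pickNextLink(hrefs, visited, maxPages) {
    if (visited.length >= maxPages) {
        return null; // page budget exhausted, stop spidering
    }
    const seen = new Set(visited);
    // Return the first link that has not been crawled yet.
    const next = hrefs.find(href => !seen.has(href));
    return next === undefined ? null : next;
}

const next = pickNextLink(
    ['https://www.example.com/', 'https://www.example.com/about'],
    ['https://www.example.com/'],
    10
);
console.log(next);
```

Inside a spider script, a null result would mean "do not click anything", which ends the crawl for that tab.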

Licence

This software is released under the terms of the GNU General Public License v3.0. See the licence file for further information.

