duckduckgo / Tracker Radar Collector

Licence: other
🕸 Modular, multithreaded, puppeteer-based crawler

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to Tracker Radar Collector

Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+155.22%)
Mutual labels:  crawler, puppeteer
Rendora
dynamic server-side rendering using headless Chrome to effortlessly solve the SEO problem for modern javascript websites
Stars: ✭ 1,853 (+2665.67%)
Mutual labels:  crawler, puppeteer
Puppeteer Walker
a puppeteer walker 🕷 🕸
Stars: ✭ 78 (+16.42%)
Mutual labels:  crawler, puppeteer
Webster
a reliable high-level web crawling & scraping framework for Node.js.
Stars: ✭ 364 (+443.28%)
Mutual labels:  crawler, puppeteer
Ppspider
A web spider built on puppeteer, with task-queue and task-scheduling via decorators, nedb / mongodb persistence, and data visualization and user-interaction support.
Stars: ✭ 237 (+253.73%)
Mutual labels:  crawler, puppeteer
Jvppeteer
Headless Chrome for Java (Java crawler)
Stars: ✭ 193 (+188.06%)
Mutual labels:  crawler, puppeteer
Squidwarc
Squidwarc is a high-fidelity, user-scriptable, archival crawler that uses Chrome or Chromium with or without a head
Stars: ✭ 125 (+86.57%)
Mutual labels:  crawler, puppeteer
Chromium for spider
A dynamic crawler for web vulnerability scanners
Stars: ✭ 220 (+228.36%)
Mutual labels:  crawler, puppeteer
bots-zoo
No description or website provided.
Stars: ✭ 59 (-11.94%)
Mutual labels:  crawler, puppeteer
Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+7555.22%)
Mutual labels:  crawler, puppeteer
Car Prices
A Golang crawler for scraping Autohome's (汽车之家) used-car product database
Stars: ✭ 57 (-14.93%)
Mutual labels:  crawler
Nuxt Jest Puppeteer
🚀 Nuxt.js zero-configuration tests, run with Jest and Puppeteer
Stars: ✭ 57 (-14.93%)
Mutual labels:  puppeteer
Puppeteer Docs Zh Cn
Chinese translation of the Google Puppeteer docs, targeting version 1.9.0 (translation in progress)
Stars: ✭ 61 (-8.96%)
Mutual labels:  puppeteer
Terpene Profile Parser For Cannabis Strains
Parser and database to index the terpene profile of different strains of Cannabis from online databases
Stars: ✭ 63 (-5.97%)
Mutual labels:  crawler
Awesome Python Primer
A curated index of high-quality Chinese resources for learning Python on your own, including books / docs / videos, covering crawling / web / data analysis / machine learning
Stars: ✭ 57 (-14.93%)
Mutual labels:  crawler
Boj Autocommit
When you solve a problem on Baekjoon Online Judge, it automatically commits and pushes the solution to a remote repository.
Stars: ✭ 60 (-10.45%)
Mutual labels:  crawler
Picacomic downloader
A downloader for Picacomic (哔咔漫画) favorites
Stars: ✭ 57 (-14.93%)
Mutual labels:  crawler
Capture Website
Capture screenshots of websites
Stars: ✭ 1,075 (+1504.48%)
Mutual labels:  puppeteer
Marvelheroes
Marvel Heroes
Stars: ✭ 54 (-19.4%)
Mutual labels:  puppeteer
Page2image
📷 page2image is an npm package for taking screenshots that also provides a CLI command
Stars: ✭ 66 (-1.49%)
Mutual labels:  puppeteer

DuckDuckGo Tracker Radar Collector

🕸 Modular, multithreaded, puppeteer-based crawler used to generate third party request data for the Tracker Radar.

How do I use it?

Use it from the command line

  1. Clone this project locally (git clone git@github.com:duckduckgo/tracker-radar-collector.git)
  2. Install all dependencies (npm i)
  3. Run the command line tool:
npm run crawl -- -u "https://example.com" -o ./data/ -v

Available options:

  • -o, --output <path> - (required) output folder where output files will be created
  • -u, --url <url> - single URL to crawl
  • -i, --input-list <path> - path to a text file with a list of URLs to crawl (each on a separate line)
  • -d, --data-collectors <list> - comma-separated list (e.g. -d 'requests,cookies') of data collectors that should be used (all by default)
  • -c, --crawlers <number> - override the default number of concurrent crawlers (default number is picked based on the number of CPU cores)
  • -v, --verbose - log additional information on screen (progress bar will not be shown when verbose logging is enabled)
  • -l, --log-file <path> - save log data to a file
  • -f, --force-overwrite - overwrite existing output files (by default entries with existing output files are skipped)
  • -3, --only-3p - don't save any first-party data (e.g. requests, API calls for the same eTLD+1 as the main document)
  • -m, --mobile - emulate a mobile device when crawling
  • -p, --proxy-config <host> - optional SOCKS proxy host
  • -r, --region-code <region> - optional two-letter region code (for metadata only)
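
For example, a crawl of a list of URLs using only the requests and cookies collectors, with verbose logging also written to a log file, could look like this (./urls.txt and ./data/ are placeholder paths):

npm run crawl -- -i ./urls.txt -o ./data/ -d 'requests,cookies' -v -l ./crawl.log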

Use it as a module

  1. Install this project as a dependency (npm i git+https://github.com/duckduckgo/tracker-radar-collector.git).

  2. Import it:

// you can either import a "crawlerConductor" that runs multiple crawlers for you
const {crawlerConductor} = require('tracker-radar-collector');
// or a single crawler
const {crawler} = require('tracker-radar-collector');

// you will also need some data collectors (the /collectors/ folder contains all built-in collectors)
const {RequestCollector, CookieCollector, } = require('tracker-radar-collector');

  3. Use it:

crawlerConductor({
    // required ↓
    urls: ['https://example.com', 'https://duck.com', ],
    dataCallback: (url, result) => {},
    // optional ↓
    dataCollectors: [new RequestCollector(), new CookieCollector()],
    failureCallback: (url, error) => {},
    numberOfCrawlers: 12, // custom number of crawlers (there is a hard limit of 38 though)
    logFunction: (...msg) => {}, // custom logging function
    filterOutFirstParty: true, // don't save any first-party data (false by default)
    emulateMobile: true, // emulate a mobile device (false by default)
    proxyHost: 'socks5://myproxy:8080', // SOCKS proxy host (none by default)
});

OR (if you prefer to run a single crawler)

// crawler will throw an exception if crawl fails
const data = await crawler(new URL('https://example.com'), {
    // optional ↓
    collectors: [new RequestCollector(), new CookieCollector(), ],
    log: (...msg) => {},
    rank: 1,
    urlFilter: (url) => {}, // function that, for each request URL, decides if its data should be stored or not
    emulateMobile: false,
    emulateUserAgent: false, // don't use the default puppeteer UA (default true)
    proxyHost: 'socks5://myproxy:8080',
    browserContext: context, // if you prefer to create the browser context yourself (e.g. to use another browser or a non-incognito context) you can pass it here (by default the crawler will create an incognito context using standard Chromium for you)
});

ℹ️ Hint: check out crawl-cli.js and crawlerConductor.js to see how crawlerConductor and crawler are used in the wild.

Output format

Each successfully crawled website will create a separate file named after the website (when using the CLI tool). The output data format is specified in crawler.js (see the CollectResult type definition). Additionally, for each crawl a metadata.json file will be created containing the crawl configuration, system configuration and some high-level stats.
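
As a rough illustration of how these per-site files could be consumed, the sketch below simply iterates over the output folder and prints the top-level keys of each result; only the file layout described above is assumed, and the exact shape of each result should be checked against the CollectResult definition in crawler.js:

// minimal sketch: iterate over the per-site JSON files produced by the CLI tool
// (assumes the ./data/ output folder used earlier; metadata.json is skipped)
const fs = require('fs');
const path = require('path');

const outputDir = './data/';

for (const file of fs.readdirSync(outputDir)) {
    if (!file.endsWith('.json') || file === 'metadata.json') {
        continue;
    }
    const result = JSON.parse(fs.readFileSync(path.join(outputDir, file), 'utf8'));
    // the top-level structure is defined by CollectResult in crawler.js
    console.log(file, Object.keys(result));
}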

Data post-processing

An example post-processing script, which can be used as a template, can be found in post-processing/summary.js. Execute it from the command line like this:

node ./post-processing/summary.js -i ./collected-data/ -o ./result.json

ℹ️ Hint: When dealing with huge amounts of data you may need to increase Node.js's memory limit, e.g. node --max_old_space_size=4096.
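
For example, running the summary script from above with a larger heap might look like this (the paths are the same placeholders as before):

node --max_old_space_size=4096 ./post-processing/summary.js -i ./collected-data/ -o ./result.json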

Creating new collectors

Each collector needs to extend the BaseCollector class and has to override the following methods:

  • id() which returns name of the collector (e.g. 'cookies')
  • getData(options) which should return the collected data. options has the following properties:
    • finalUrl - final URL of the main document (after all redirects) that you may want to use,
    • filterFunction which, if provided, takes a URL and returns a boolean telling you if a given piece of data should be kept or filtered out, based on its origin.

Additionally, each collector can override the following methods:

  • init(options) which is called before the crawl begins
  • addTarget(targetInfo) which is called whenever a new target is created (main page, iframe, web worker, etc.)

There are a couple of built-in collectors in the collectors/ folder. CookieCollector is the simplest one and can be used as a template.
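
To make that surface concrete, below is a minimal, illustrative sketch of a custom collector based on the method list above. The class name, the internal targets array, the assumption that targetInfo carries a url property, and the require path are illustrative assumptions rather than documented API; use CookieCollector as the authoritative template:

// illustrative sketch only: a collector that records the targets it has seen
// (assumes it lives next to BaseCollector.js in the collectors/ folder)
const BaseCollector = require('./BaseCollector');

class TargetCollector extends BaseCollector {

    id() {
        return 'targets'; // name used with the -d / dataCollectors options
    }

    init(options) {
        // called before the crawl begins
        this._targets = [];
    }

    addTarget(targetInfo) {
        // called whenever a new target is created (main page, iframe, web worker, etc.)
        // here we assume targetInfo exposes a url property
        this._targets.push(targetInfo);
    }

    getData({filterFunction}) {
        // return the collected data; filterFunction (if provided) decides, based on a URL,
        // whether a given piece of data should be kept
        return this._targets.filter(t => !filterFunction || filterFunction(t.url));
    }
}

module.exports = TargetCollector;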

Each new collector has to be added in two places to be discoverable:

  • crawlerConductor.js - so that crawlerConductor knows about it (and it can be used in the CLI tool)
  • main.js - so that the new collector can be imported by other projects