
ulixee / double-agent

License: MIT
A test suite of common scraper detection techniques. See how detectable your scraper stack is.

Programming Languages

TypeScript, JavaScript

Projects that are alternatives of or similar to double-agent

scrapy-fieldstats
A Scrapy extension to log items coverage when the spider shuts down
Stars: ✭ 17 (-86.18%)
Mutual labels:  scraping, crawling, scrapy
Linkedin Profile Scraper
πŸ•΅οΈβ€β™‚οΈ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+39.02%)
Mutual labels:  scraping, crawling, puppeteer
Dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library for Entity Framework Core output, based on .NET Core. The library is designed like other strong crawler libraries such as WebMagic and Scrapy, but enables extending it with your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Stars: ✭ 100 (-18.7%)
Mutual labels:  scraping, crawling, scrapy
Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+4069.92%)
Mutual labels:  scraping, crawling, puppeteer
ARGUS
ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the websites, ARGUS is able to perform tasks like scraping texts or collecting hyperlinks between websites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9
Stars: ✭ 68 (-44.72%)
Mutual labels:  scraping, crawling, scrapy
scrapy-distributed
A series of distributed components for Scrapy. Including RabbitMQ-based components, Kafka-based components, and RedisBloom-based components for Scrapy.
Stars: ✭ 38 (-69.11%)
Mutual labels:  scraping, crawling, scrapy
Awesome Puppeteer
A curated list of awesome puppeteer resources.
Stars: ✭ 1,728 (+1304.88%)
Mutual labels:  scraping, crawling, puppeteer
bots-zoo
No description or website provided.
Stars: ✭ 59 (-52.03%)
Mutual labels:  scraping, crawling, puppeteer
Apify Js
Apify SDK β€” The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Stars: ✭ 3,154 (+2464.23%)
Mutual labels:  scraping, crawling, puppeteer
Easy Scraping Tutorial
Simple but useful Python web scraping tutorial code.
Stars: ✭ 583 (+373.98%)
Mutual labels:  scraping, crawling, scrapy
Grawler
Grawler is a tool written in PHP which comes with a web interface that automates the task of using google dorks, scrapes the results, and stores them in a file.
Stars: ✭ 98 (-20.33%)
Mutual labels:  scraping, crawling
Email Extractor
The main functionality is to extract all the emails from one or several URLs - La funcionalidad principal es extraer todos los correos electrΓ³nicos de una o varias Url
Stars: ✭ 81 (-34.15%)
Mutual labels:  scraping, scrapy
Scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Stars: ✭ 42,343 (+34325.2%)
Mutual labels:  scraping, crawling
Seleniumcrawler
An example using Selenium webdrivers for python and Scrapy framework to create a web scraper to crawl an ASP site
Stars: ✭ 117 (-4.88%)
Mutual labels:  scraping, scrapy
Django Dynamic Scraper
Creating Scrapy scrapers via the Django admin interface
Stars: ✭ 1,024 (+732.52%)
Mutual labels:  scraping, scrapy
Secret Agent
The web browser that's built for scraping.
Stars: ✭ 151 (+22.76%)
Mutual labels:  scraping, puppeteer
Scrapy Cluster
This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
Stars: ✭ 921 (+648.78%)
Mutual labels:  scraping, scrapy
Educative.io Downloader
πŸ“– This tool is to download course from educative.io for offline usage. It uses your login credentials and download the course.
Stars: ✭ 139 (+13.01%)
Mutual labels:  scraping, puppeteer
Antch
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
Stars: ✭ 198 (+60.98%)
Mutual labels:  scraping, crawling
Memorious
Distributed crawling framework for documents and structured data.
Stars: ✭ 248 (+101.63%)
Mutual labels:  scraping, crawling

NOTICE πŸ“ This module is merged into the unblocked monorepo for future development!


Double Agent is a suite of tools written to allow a scraper engine to test if it is detectable when trying to blend into the most common web traffic.

Structure:

DoubleAgent has been organized into two main layers:

  • /collect: scripts/plugins for collecting browser profiles. Each plugin generates a series of pages to test how a browser behaves.
  • /analyze: scripts/plugins for analyzing browser profiles against verified profiles. Scraper results from collect are compared to legitimate "profiles" to find discrepancies. These checks produce a "Looks Human"™ score, which indicates the likelihood that a scraper would be flagged as a bot rather than a human.

The easiest way to use collect is with the collect-controller:

  • /collect-controller: a server that can generate step-by-step assignments for a scraper to run all tests
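The collect/analyze split above can be sketched as follows. This is an illustrative reconstruction, not the actual DoubleAgent API: the Profile shape, the reference data, and the looksHumanScore function are all hypothetical names standing in for the real plugins.

```typescript
// Hypothetical sketch of the collect/analyze flow. A Profile maps a
// signal name to the values a browser emitted for that signal.
type Profile = Record<string, string[]>;

// Reference data a verified browser produced for the same tests (made up here).
const verified: Profile = {
  'codecs:audio': ['mp3', 'aac', 'opus'],
  'headers:Document': ['host', 'connection', 'user-agent', 'accept'],
};

// Raw data collected from the scraper under test: codecs match,
// but the header order differs from the real browser.
const scraper: Profile = {
  'codecs:audio': ['mp3', 'aac', 'opus'],
  'headers:Document': ['host', 'user-agent', 'connection', 'accept'],
};

// Score each check 1 when the scraper's signal matches the verified
// browser exactly and 0 otherwise, then average across all checks.
function looksHumanScore(candidate: Profile, reference: Profile): number {
  const keys = Object.keys(reference);
  const matches = keys.filter(
    key => JSON.stringify(candidate[key]) === JSON.stringify(reference[key]),
  ).length;
  return keys.length === 0 ? 1 : matches / keys.length;
}
```

With the sample data above, the scraper matches one of two checks and scores 0.5; an identical profile scores 1.0.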

Plugins

The bulk of the collect and analyze logic has been organized into what we call plugins.

Collect Plugins

Name Description
browser-codecs Collects the audio, video and WebRTC codecs of the browser
browser-dom-environment Collects the browser's DOM environment such as object structure, class inheritance and key order
browser-fingerprints Collects various browser attributes that can be used to fingerprint a given session
browser-fonts Collects the fonts of the current browser/OS.
browser-speech Collects browser speech synthesis voices
http-assets Collects the headers used when loading assets such as css, js, and images in a browser
http-basic-headers Collects the headers sent by browser when requesting documents in various contexts
http-ua-hints Collects User Agent hints for a browser
http-websockets Collects the headers used when initializing and facilitating web sockets
http-xhr Collects the headers used by browsers when facilitating XHR requests
http2-session Collects the settings, pings and frames sent across by a browser http2 client
tcp Collects tcp packet values such as window-size and time-to-live
tls-clienthello Collects the TLS clienthello handshake when initiating a secure connection
http-basic-cookies Collects a wide range of cookies configuration options and whether they're settable/gettable
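To make the browser-dom-environment row above concrete, the sketch below shows the kind of data such a plugin might gather: own-property names in definition order plus the prototype chain of an object. This is illustrative only and not the plugin's actual code; spoofed environments often get exactly these details (key order, inheritance) wrong.

```typescript
// Illustrative sketch of DOM-environment data: own-property key order
// and the class-inheritance (prototype) chain of a given object.
function describeObject(obj: object): { keys: string[]; protoChain: string[] } {
  // Own properties, in definition order.
  const keys = Object.getOwnPropertyNames(obj);
  // Walk the prototype chain, recording each constructor's name.
  const protoChain: string[] = [];
  let proto = Object.getPrototypeOf(obj);
  while (proto) {
    protoChain.push(proto.constructor?.name ?? '(anonymous)');
    proto = Object.getPrototypeOf(proto);
  }
  return { keys, protoChain };
}
```

Run against a real browser's `window` or `navigator`, output like this forms part of the collected profile.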

Analyze Plugins

Name Description
browser-codecs Analyzes whether the audio, video and WebRTC codecs match the given user agent
browser-dom-environment Analyzes whether the DOM environment, such as functionality and object structure, matches the given user agent
browser-fingerprints Analyzes whether the browser's fingerprints leak across sessions
http-assets Analyzes http header order, capitalization and default values for common document assets (images, fonts, media, scripts, stylesheet, etc)
http-basic-cookies Analyzes whether cookies are enabled correctly, including same-site and secure
http-basic-headers Analyzes header order, capitalization and default values
http-websockets Analyzes websocket upgrade request header order, capitalization and default values
http-xhr Analyzes header order, capitalization and default values of XHR requests
http2-session Analyzes http2 session settings and frames
tcp Analyzes tcp packet values, including window-size and time-to-live
tls-clienthello Analyzes clienthello handshake signatures, including ciphers, extensions and version
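Several of the analyze plugins above compare header order against a verified baseline. A hedged sketch of that kind of check (not DoubleAgent's actual implementation) might look like this: the headers a real browser always sends must appear in the same order and capitalization, while extra headers are tolerated.

```typescript
// Sketch of a header-order check: every expected header must appear in
// the observed list, in the same relative order and exact capitalization.
function headerOrderMatches(expected: string[], observed: string[]): boolean {
  let i = 0;
  for (const header of observed) {
    // Advance through the expected sequence whenever the next one appears.
    if (i < expected.length && header === expected[i]) i++;
  }
  return i === expected.length;
}
```

A scraper that sends User-Agent before Host, or lowercases a header a browser capitalizes, would fail this check even though the request is otherwise valid HTTP.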

Probes:

DoubleAgent operates on the notion of "probes". Probes are checks, or "tests", that reliably verify a piece of information emitted by a browser. The collect phase of DoubleAgent gathers raw data from browsers running a series of tests. The analyze phase turns that raw data into "probes" using these patterns.

Each measured "signal" from a browser is stored under a probe-id, which encodes the raw values actually emitted.

Probes are created during "Profile Generation", which creates all the possible probe-ids, along with which browsers and operating systems they correspond to. These are called "Probe Buckets". They're a tool to find overlap between the millions of signals browsers put out and to reduce the noise when presenting the information. An example probe definition:

{
  "id": "aord-accv",
  "checkName": "ArrayOrderIndexCheck",
  "checkType": "Individual",
  "checkMeta": {
    "path": "headers:none:Document:host",
    "protocol": "http",
    "httpMethod": "GET"
  },
  "args": [
    [
      [],
      [
        "connection",
        "upgrade-insecure-requests",
        "user-agent",
        "accept",
        "accept-encoding",
        "accept-language",
        "cookie"
      ]
    ]
  ]
}

Probe ids for that pattern look like: http:GET:headers:none:Document:host:ArrayOrderIndexCheck:;connection,upgrade-insecure-requests,user-agent,accept,accept-encoding,accept-language,cookie. This probe id captures a bit about the test, as well as the measured signal from the browser.
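The exact serialization lives in DoubleAgent's source; the following is a plausible reconstruction, from the example above, of how a probe id can be derived from its check definition. The ProbeCheck type and toProbeId helper are illustrative names, not the real API.

```typescript
// Shape of the probe definition shown above (reconstructed, not the
// actual DoubleAgent types).
interface ProbeCheck {
  checkName: string;
  checkMeta: { path: string; protocol: string; httpMethod: string };
  args: string[][][];
}

// Plausible probe-id derivation: protocol, method, path and check name,
// then the args serialized with arrays comma-joined and groups separated
// by semicolons -- which yields the ";connection,..." suffix above.
function toProbeId(check: ProbeCheck): string {
  const { protocol, httpMethod, path } = check.checkMeta;
  const value = check.args[0].map(arr => arr.join(',')).join(';');
  return `${protocol}:${httpMethod}:${path}:${check.checkName}:${value}`;
}

// The example probe from above.
const example: ProbeCheck = {
  checkName: 'ArrayOrderIndexCheck',
  checkMeta: {
    path: 'headers:none:Document:host',
    protocol: 'http',
    httpMethod: 'GET',
  },
  args: [[
    [],
    ['connection', 'upgrade-insecure-requests', 'user-agent', 'accept',
     'accept-encoding', 'accept-language', 'cookie'],
  ]],
};
```

Applied to the example, this reproduces the probe id quoted above.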

Updating the Probe "Sources"

Probes are generated from a baseline of browsers. Double Agent comes with some built-in profiles in probe-data based on the browsers here. Double Agent is built to allow testing a single browser, or to generate a massive data set to see how well scrapers can emulate many browsers. As this is very time consuming, we tend to limit the tested browsers to the last couple versions of Chrome, which is what Unblocked Agent can currently emulate.

If you wish to generate probes for different browsers (or a wider set), you can follow these steps to update the data:

  1. Clone the unblocked-web/unblocked monorepo and install git submodules.
  2. Download the unblocked-web/browser-profiler data by running yarn downloadData in that workspace folder.
  3. Modify double-agent/stacks/data/external/userAgentConfig.json to include browser ids you wish to test (<browser.toLowercase()>-<major>-<minor ?? 0>).
  4. Run yarn 0 to copy in the profile data.
  5. Run yarn 1 to create new probes.
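The browser-id format from step 3 can be expressed as a small helper. This is just the template from the step written out; the function name is illustrative, and userAgentConfig.json simply contains an array of such ids.

```typescript
// Browser id format from step 3: <browser.toLowercase()>-<major>-<minor ?? 0>
function toBrowserId(browser: string, major: number, minor?: number): string {
  return `${browser.toLowerCase()}-${major}-${minor ?? 0}`;
}

// e.g. toBrowserId('Chrome', 115) produces 'chrome-115-0'
```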

Testing your Scraper:

To view examples of running the test suite with a custom browser, check out the DoubleAgent Stacks project in Unblocked.
