
simon987 / Architeuthis

Licence: GPL-3.0
MITM HTTP(S) proxy with integrated load-balancing, rate-limiting and error handling. Built for automated web scraping.

Programming Languages

  • Go
  • Python
  • Groovy
  • HTML
  • Dockerfile
  • Shell

Projects that are alternatives to or similar to Architeuthis

unpoller
Application: Collect ALL UniFi Controller, Site, Device & Client Data - Export to InfluxDB or Prometheus
Stars: ✭ 1,613 (+4508.57%)
Mutual labels:  influxdb
turtle
Instagram Photo Downloader
Stars: ✭ 15 (-57.14%)
Mutual labels:  scraping
docker-compose-scale-example
Example of Docker Compose scale and load balancing features
Stars: ✭ 18 (-48.57%)
Mutual labels:  load-balancer
docker-speedtest-influxdb
Speedtest results to InfluxDB for Grafana
Stars: ✭ 20 (-42.86%)
Mutual labels:  influxdb
iot-edge-offline-dashboarding
Azure IoT Edge offline dashboarding/reporting sample. Guidance and sample dashboards
Stars: ✭ 31 (-11.43%)
Mutual labels:  influxdb
docker-iot-dashboard
A complete IoT server for LoRaWAN IoT projects: node-red + influxdb + grafana + ssl + let's encrypt using docker-compose.
Stars: ✭ 79 (+125.71%)
Mutual labels:  influxdb
InfluxDB
App Metrics Extensions for InfluxDB reporting
Stars: ✭ 17 (-51.43%)
Mutual labels:  influxdb
socials
👨‍👩‍👦 Social account detection and extraction in Python, e.g. for crawling/scraping.
Stars: ✭ 37 (+5.71%)
Mutual labels:  scraping
NBA-Fantasy-Optimizer
NBA Daily Fantasy Lineup Optimizer for FanDuel Using Python
Stars: ✭ 21 (-40%)
Mutual labels:  scraping
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Stars: ✭ 711 (+1931.43%)
Mutual labels:  scraping
gochanges
[ARCHIVED] Website changes tracker 🔍
Stars: ✭ 12 (-65.71%)
Mutual labels:  scraping
double-agent
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
Stars: ✭ 123 (+251.43%)
Mutual labels:  scraping
etf4u
📊 Python tool to scrape real-time information about ETFs from the web and mixing them together by proportionally distributing their assets allocation
Stars: ✭ 29 (-17.14%)
Mutual labels:  scraping
styx
Programmable, asynchronous, event-based reverse proxy for JVM.
Stars: ✭ 250 (+614.29%)
Mutual labels:  load-balancer
dagger
Dagger is an easy-to-use, configuration over code, cloud-native framework built on top of Apache Flink for stateful processing of real-time streaming data.
Stars: ✭ 238 (+580%)
Mutual labels:  influxdb
crawler-chrome-extensions
Chrome extensions commonly used by crawler developers
Stars: ✭ 53 (+51.43%)
Mutual labels:  scraping
tracker
Track your activities!
Stars: ✭ 14 (-60%)
Mutual labels:  influxdb
influx4mqtt
Insert incoming MQTT values into InfluxDB. Follows mqtt-smarthome architecture.
Stars: ✭ 34 (-2.86%)
Mutual labels:  influxdb
balance
Client side load balancing for Kubernetes clusters
Stars: ✭ 18 (-48.57%)
Mutual labels:  load-balancer
scrapers
scrapers for building your own image databases
Stars: ✭ 46 (+31.43%)
Mutual labels:  scraping

Architeuthis 🦑


HTTP(S) proxy with integrated load-balancing, rate-limiting and error handling. Built for automated web scraping.

  • Strictly obeys configured rate-limiting for each IP & Host
  • Seamless exponential backoff retries on timeout or error HTTP codes
  • Requires no additional configuration for integration into existing programs
  • Configurable per-host behavior
  • Monitoring with InfluxDB

(Screenshot: Grafana monitoring dashboard)

Typical use case

(Diagram: typical use case)

Usage

git clone https://github.com/simon987/Architeuthis
vim config.json # Configure settings here

docker-compose up

You can add proxies using the /add_proxy API:

curl "http://<Architeuthis IP>:5050/add_proxy?url=<url>&name=<name>"
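
If you maintain a longer list of upstream proxies, the same endpoint can be called in a loop. Below is a minimal sketch using Python's requests library; the proxy names and URLs are placeholders, and the Architeuthis address assumes the default addr from the sample configuration:

import requests

ARCHITEUTHIS = "http://localhost:5050"  # assumes the default "addr" from config.json

# Placeholder upstream proxies to register
upstream_proxies = {
    "p0": "http://user:pass@192.0.2.10:8080",
    "p1": "http://192.0.2.11:3128",
}

for name, url in upstream_proxies.items():
    # /add_proxy takes the upstream URL and a display name as query parameters
    r = requests.get(ARCHITEUTHIS + "/add_proxy", params={"url": url, "name": name})
    r.raise_for_status()
    print("registered", name, r.status_code)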

Or automatically using Proxybroker:

python3 import_from_broker.py http://<Architeuthis IP>:5050

Example usage with wget

export http_proxy="http://localhost:5050"
# --no-check-certificate is necessary for HTTPS MITM
# You don't need to specify a user agent if it's already set in your config.json
wget -m -np -c --no-check-certificate -R "index.html*" http://ca.releases.ubuntu.com/

With "every": "500ms" and a single proxy, you should see

...
level=trace msg=Sleeping wait=414.324437ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/SHA1SUMS.gpg"
level=trace msg=Sleeping wait=435.166127ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/SHA256SUMS"
level=trace msg=Sleeping wait=438.657784ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/SHA256SUMS.gpg"
level=trace msg=Sleeping wait=457.06543ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/ubuntu-12.04.5-alternate-amd64.iso"
level=trace msg=Sleeping wait=433.394361ms
...
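
Any other HTTP client can use Architeuthis the same way, since it behaves like a regular HTTP(S) proxy. A minimal sketch with Python's requests library is shown below; verify=False plays the same role as wget's --no-check-certificate for HTTPS MITM, and the target URL is just the one from the example above:

import requests

# Point the client at Architeuthis like any ordinary HTTP(S) proxy
proxies = {
    "http": "http://localhost:5050",
    "https": "http://localhost:5050",
}

# verify=False is the requests equivalent of wget's --no-check-certificate;
# it is needed for https:// URLs because the proxy intercepts TLS (MITM)
r = requests.get("http://ca.releases.ubuntu.com/", proxies=proxies, verify=False)
print(r.status_code, len(r.text))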

Hot config reload

# Note: this will reset current rate limiters, if there are many active
# connections, this might cause a small request spike and go over
# the rate limits.
./reload.sh

Rules

Conditions

Left operand     Description               Allowed operators  Right operand
body             Contents of the response  =, !=              String w/ wildcard
body             Contents of the response  <, >               float
status           HTTP response code        =, !=              String w/ wildcard
status           HTTP response code        <, >               float
response_time    Response time             <, >               duration (e.g. 20s)
header:<header>  Response header           =, !=              String w/ wildcard
header:<header>  Response header           <, >               float

Note that response_time can never be higher than the configured timeout value.

Examples:

[
  {"condition":  "header:X-Test>10", "action":  "..."},
  {"condition":  "body=*Try again in a few minutes*", "action":  "..."},
  {"condition":  "response_time>10s", "action":  "..."},
  {"condition":  "status>500", "action":  "..."},
  {"condition":  "status=404", "action":  "..."},
  {"condition":  "status=40*", "action":  "..."}
]

Actions

Action        Description
should_retry  Override the default retry behavior for HTTP errors (by default, retries on 403, 408, 429, 444, 499 and codes >500)
force_retry   Always retry (up to retries_hard times)
dont_retry    Immediately stop retrying

In the event of a temporary network error, should_retry is ignored (the request is always retried unless dont_retry is set).

Note that having too many rules for one host might negatively impact performance (especially the body condition on large responses).
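
Putting conditions and actions together, a host entry's rules array could look like the following; the condition strings are illustrative and reuse the examples above (see the sample configuration below for where rules fits in config.json):

"rules": [
  {"condition": "body=*Try again in a few minutes*", "action": "force_retry"},
  {"condition": "status=404", "action": "dont_retry"}
]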

Sample configuration

{
  "addr": "localhost:5050",
  "timeout": "15s",
  "wait": "4s",
  "multiplier": 2.5,
  "retries": 3,
  "hosts": [
    {
      "host": "*",
      "every": "500ms",
      "burst": 25,
      "headers": {
        "User-Agent": "Some user agent for all requests",
        "X-Test": "Will be overwritten"
      }
    },
    {
      "host": "*.reddit.com",
      "every": "2s",
      "burst": 2,
      "headers": {
        "X-Test": "Will overwrite default"
      }
    },
    {
      "host": ".s3.amazonaws.com",
      "every": "2s",
      "burst": 30,
      "rules": [
        {"condition": "status=403", "action": "dont_retry"}
      ]
    }
  ]
}