
simon987 / Architeuthis

Licence: GPL-3.0
MITM HTTP(S) proxy with integrated load-balancing, rate-limiting and error handling. Built for automated web scraping.

Programming Languages

  • Go
  • Python
  • Groovy
  • HTML
  • Dockerfile
  • Shell

Projects that are alternatives to or similar to Architeuthis

unpoller
Application: Collect ALL UniFi Controller, Site, Device & Client Data - Export to InfluxDB or Prometheus
Stars: ✭ 1,613 (+4508.57%)
Mutual labels:  influxdb
turtle
Instagram Photo Downloader
Stars: ✭ 15 (-57.14%)
Mutual labels:  scraping
docker-compose-scale-example
Example of Docker Compose scale and load balancing features
Stars: ✭ 18 (-48.57%)
Mutual labels:  load-balancer
docker-speedtest-influxdb
Speedtest results to InfluxDB for Grafana
Stars: ✭ 20 (-42.86%)
Mutual labels:  influxdb
iot-edge-offline-dashboarding
Azure IoT Edge offline dashboarding/reporting sample. Guidance and sample dashboards
Stars: ✭ 31 (-11.43%)
Mutual labels:  influxdb
docker-iot-dashboard
A complete IoT server for LoRaWAN IoT projects: node-red + influxdb + grafana + ssl + let's encrypt using docker-compose.
Stars: ✭ 79 (+125.71%)
Mutual labels:  influxdb
InfluxDB
App Metrics Extensions for InfluxDB reporting
Stars: ✭ 17 (-51.43%)
Mutual labels:  influxdb
socials
👨‍👩‍👦 Social account detection and extraction in Python, e.g. for crawling/scraping.
Stars: ✭ 37 (+5.71%)
Mutual labels:  scraping
NBA-Fantasy-Optimizer
NBA Daily Fantasy Lineup Optimizer for FanDuel Using Python
Stars: ✭ 21 (-40%)
Mutual labels:  scraping
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Stars: ✭ 711 (+1931.43%)
Mutual labels:  scraping
gochanges
[ARCHIVED] Website changes tracker 🔍
Stars: ✭ 12 (-65.71%)
Mutual labels:  scraping
double-agent
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
Stars: ✭ 123 (+251.43%)
Mutual labels:  scraping
etf4u
📊 Python tool to scrape real-time information about ETFs from the web and mixing them together by proportionally distributing their assets allocation
Stars: ✭ 29 (-17.14%)
Mutual labels:  scraping
styx
Programmable, asynchronous, event-based reverse proxy for JVM.
Stars: ✭ 250 (+614.29%)
Mutual labels:  load-balancer
dagger
Dagger is an easy-to-use, configuration over code, cloud-native framework built on top of Apache Flink for stateful processing of real-time streaming data.
Stars: ✭ 238 (+580%)
Mutual labels:  influxdb
crawler-chrome-extensions
Chrome extensions commonly used by crawler developers
Stars: ✭ 53 (+51.43%)
Mutual labels:  scraping
tracker
Track your activities!
Stars: ✭ 14 (-60%)
Mutual labels:  influxdb
influx4mqtt
Insert incoming MQTT values into InfluxDB. Follows mqtt-smarthome architecture.
Stars: ✭ 34 (-2.86%)
Mutual labels:  influxdb
balance
Client side load balancing for Kubernetes clusters
Stars: ✭ 18 (-48.57%)
Mutual labels:  load-balancer
scrapers
scrapers for building your own image databases
Stars: ✭ 46 (+31.43%)
Mutual labels:  scraping

Architeuthis 🦑


HTTP(S) proxy with integrated load-balancing, rate-limiting and error handling. Built for automated web scraping.

  • Strictly obeys configured rate-limiting for each IP & Host
  • Seamless exponential backoff retries on timeout or error HTTP codes
  • Requires no additional configuration for integration into existing programs
  • Configurable per-host behavior
  • Monitoring with InfluxDB

(Screenshot: Grafana monitoring dashboard)

Typical use case

(Diagram: typical use case)

Usage

git clone https://github.com/simon987/Architeuthis
vim config.json # Configure settings here

docker-compose up

You can add proxies using the /add_proxy API:

curl "http://<Architeuthis IP>:5050/add_proxy?url=<url>&name=<name>"
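
If you maintain a longer list of upstream proxies, the same endpoint can be called in a loop. Below is a minimal sketch using Python's requests library; the proxy names and URLs are placeholders, and the Architeuthis address assumes the default addr from the sample configuration:

import requests

ARCHITEUTHIS = "http://localhost:5050"  # assumes the default "addr" from config.json

# Placeholder upstream proxies to register
upstream_proxies = {
    "p0": "http://user:pass@192.0.2.10:8080",
    "p1": "http://192.0.2.11:3128",
}

for name, url in upstream_proxies.items():
    # /add_proxy takes the upstream URL and a display name as query parameters
    r = requests.get(ARCHITEUTHIS + "/add_proxy", params={"url": url, "name": name})
    r.raise_for_status()
    print("registered", name, r.status_code)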

Or automatically using Proxybroker:

python3 import_from_broker.py http://<Architeuthis IP>:5050

Example usage with wget

export http_proxy="http://localhost:5050"
# --no-check-certificate is necessary for HTTPS MITM
# You don't need to specify a user agent if it's already set in your config.json
wget -m -np -c --no-check-certificate -R "index.html*" http://ca.releases.ubuntu.com/

With "every": "500ms" and a single proxy, you should see

...
level=trace msg=Sleeping wait=414.324437ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/SHA1SUMS.gpg"
level=trace msg=Sleeping wait=435.166127ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/SHA256SUMS"
level=trace msg=Sleeping wait=438.657784ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/SHA256SUMS.gpg"
level=trace msg=Sleeping wait=457.06543ms
level=trace msg="Routing request" conns=0 proxy=p0 url="http://ca.releases.ubuntu.com/12.04/ubuntu-12.04.5-alternate-amd64.iso"
level=trace msg=Sleeping wait=433.394361ms
...
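
Any other HTTP client can use Architeuthis the same way, since it behaves like a regular HTTP(S) proxy. A minimal sketch with Python's requests library is shown below; verify=False plays the same role as wget's --no-check-certificate for HTTPS MITM, and the target URL is just the one from the example above:

import requests

# Point the client at Architeuthis like any ordinary HTTP(S) proxy
proxies = {
    "http": "http://localhost:5050",
    "https": "http://localhost:5050",
}

# verify=False is the requests equivalent of wget's --no-check-certificate;
# it is needed for https:// URLs because the proxy intercepts TLS (MITM)
r = requests.get("http://ca.releases.ubuntu.com/", proxies=proxies, verify=False)
print(r.status_code, len(r.text))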

Hot config reload

# Note: this will reset current rate limiters, if there are many active
# connections, this might cause a small request spike and go over
# the rate limits.
./reload.sh

Rules

Conditions

Left operand     Description               Allowed operators  Right operand
body             Contents of the response  =, !=              String w/ wildcard
body             Contents of the response  <, >               float
status           HTTP response code        =, !=              String w/ wildcard
status           HTTP response code        <, >               float
response_time    Response time             <, >               duration (e.g. 20s)
header:<header>  Response header           =, !=              String w/ wildcard
header:<header>  Response header           <, >               float

Note that response_time can never be higher than the configured timeout value.

Examples:

[
  {"condition":  "header:X-Test>10", "action":  "..."},
  {"condition":  "body=*Try again in a few minutes*", "action":  "..."},
  {"condition":  "response_time>10s", "action":  "..."},
  {"condition":  "status>500", "action":  "..."},
  {"condition":  "status=404", "action":  "..."},
  {"condition":  "status=40*", "action":  "..."}
]

Actions

Action        Description
should_retry  Override the default retry behavior for HTTP errors (by default, retries on 403, 408, 429, 444, 499 and codes >500)
force_retry   Always retry (up to retries_hard times)
dont_retry    Immediately stop retrying

In the event of a temporary network error, should_retry is ignored (the request is always retried unless dont_retry is set).

Note that having too many rules for one host might negatively impact performance (especially the body condition on large responses).
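
Putting conditions and actions together, a host entry's rules array could look like the following; the condition strings are illustrative and reuse the examples above (see the sample configuration below for where rules fits in config.json):

"rules": [
  {"condition": "body=*Try again in a few minutes*", "action": "force_retry"},
  {"condition": "status=404", "action": "dont_retry"}
]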

Sample configuration

{
  "addr": "localhost:5050",
  "timeout": "15s",
  "wait": "4s",
  "multiplier": 2.5,
  "retries": 3,
  "hosts": [
    {
      "host": "*",
      "every": "500ms",
      "burst": 25,
      "headers": {
        "User-Agent": "Some user agent for all requests",
        "X-Test": "Will be overwritten"
      }
    },
    {
      "host": "*.reddit.com",
      "every": "2s",
      "burst": 2,
      "headers": {
        "X-Test": "Will overwrite default"
      }
    },
    {
      "host": ".s3.amazonaws.com",
      "every": "2s",
      "burst": 30,
      "rules": [
        {"condition": "status=403", "action": "dont_retry"}
      ]
    }
  ]
}