
Nebula Crawler Logo

Nebula Crawler


A libp2p DHT crawler that also monitors the liveness and availability of peers. The crawler connects to the standard DHT bootstrap nodes and then recursively follows all entries in their k-buckets until all peers have been visited. Currently I'm running it for the IPFS and Filecoin networks.

🏆 The crawler was awarded a prize in the context of the DI2F Workshop hackathon. 🏆

📊 A Demo Dashboard can be found here! 📊

Screenshot from a Grafana dashboard


Project Status

The crawler is used for a couple of academic projects, and it has been running continuously since July '21.

The gathered numbers about the IPFS network are in line with existing data, such as that from the wiberlin/ipfs-crawler. Their crawler also powers a dashboard, which can be found here.

Usage

Nebula is a command line tool and provides the crawl sub-command. To simply crawl the IPFS network run:

nebula crawl --dry-run

Usually the crawler persists its results in a Postgres database; the --dry-run flag prevents it from doing that. One run takes ~5-10 minutes depending on your internet connection.

See the command line help page below for configuration options:

NAME:
   nebula - A libp2p DHT crawler, monitor and measurement tool that exposes timely information about DHT networks.

USAGE:
   nebula [global options] command [command options] [arguments...]

VERSION:
   vdev+5f3759df

AUTHOR:
   Dennis Trautwein <[email protected]>

COMMANDS:
   crawl    Crawls the entire network starting with a set of bootstrap nodes.
   monitor  Monitors the network by periodically dialing previously crawled peers.
   resolve  Resolves all multi addresses to their IP addresses and geo location information
   ping     Runs an ICMP latency measurement over the set of online peers of the most recent crawl
   provide  Starts a DHT measurement experiment by providing and requesting random content.
   help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --debug                  Set this flag to enable debug logging (default: false) [$NEBULA_DEBUG]
   --log-level value        Set this flag to a value from 0 (least verbose) to 6 (most verbose). Overrides the --debug flag (default: 4) [$NEBULA_LOG_LEVEL]
   --config FILE             Load configuration from FILE [$NEBULA_CONFIG_FILE]
   --dial-timeout value     How long should be waited before a dial is considered unsuccessful (default: 1m0s) [$NEBULA_DIAL_TIMEOUT]
   --prom-port value        On which port should prometheus serve the metrics endpoint (default: 6666) [$NEBULA_PROMETHEUS_PORT]
   --prom-host value        Where should prometheus serve the metrics endpoint (default: 0.0.0.0) [$NEBULA_PROMETHEUS_HOST]
   --db-host value          On which host address can nebula reach the database (default: 0.0.0.0) [$NEBULA_DATABASE_HOST]
   --db-port value          On which port can nebula reach the database (default: 5432) [$NEBULA_DATABASE_PORT]
   --db-name value          The name of the database to use (default: nebula) [$NEBULA_DATABASE_NAME]
   --db-password value      The password for the database to use (default: password) [$NEBULA_DATABASE_PASSWORD]
   --db-user value          The user with which to access the database to use (default: nebula) [$NEBULA_DATABASE_USER]
   --protocols value        Comma separated list of protocols that this crawler should look for (default: "/ipfs/kad/1.0.0", "/ipfs/kad/2.0.0") [$NEBULA_PROTOCOLS]
   --bootstrap-peers value  Comma separated list of multi addresses of bootstrap peers [$NEBULA_BOOTSTRAP_PEERS]
   --help, -h               show help (default: false)
   --version, -v            print the version (default: false)

How does it work?

crawl

The crawl sub-command starts by connecting to a set of bootstrap nodes and constructing the routing tables (kademlia k-buckets) of the remote peers based on their PeerIds. Then nebula builds random PeerIds with a common prefix length (CPL) and asks each remote peer if they know any peers that are closer to the ones nebula just constructed (XOR distance). This will effectively yield a list of all PeerIds that a peer has in its routing table. The process repeats for all found peers until nebula does not find any new PeerIds.
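
The following is a minimal sketch of that traversal, assuming a hypothetical ClosestPeers helper that wraps the actual DHT query; it illustrates the breadth-first crawl and is not nebula's implementation.

type RoutingTableClient interface {
  // ClosestPeers asks the remote peer for the peers it knows that are
  // closest (by XOR distance) to the given target ID.
  ClosestPeers(remote string, target string) ([]string, error)
}

// crawl visits every reachable peer exactly once, starting from the bootstrap set.
func crawl(client RoutingTableClient, bootstrap []string, targetsFor func(peer string) []string) map[string]struct{} {
  visited := make(map[string]struct{})
  queue := append([]string{}, bootstrap...)

  for len(queue) > 0 {
    peer := queue[0]
    queue = queue[1:]
    if _, ok := visited[peer]; ok {
      continue
    }
    visited[peer] = struct{}{}

    // targetsFor generates random PeerIds that share an increasing common
    // prefix length (CPL) with the remote peer's ID; querying for them
    // enumerates the remote peer's k-buckets one by one.
    for _, target := range targetsFor(peer) {
      neighbors, err := client.ClosestPeers(peer, target)
      if err != nil {
        continue // peer was not dialable or the query failed
      }
      for _, n := range neighbors {
        if _, ok := visited[n]; !ok {
          queue = append(queue, n)
        }
      }
    }
  }
  return visited
}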

This process is heavily inspired by the basic-crawler in libp2p/go-libp2p-kad-dht from @aschmahmann.

Every peer that was visited is persisted in the database. The information includes latency measurements (dial/connect/crawl durations), the current set of multi addresses, the current agent version, and the current set of supported protocols. If the peer was dialable, nebula also creates a session instance that contains the following information:

type Session struct {
  // A unique id that identifies a particular session
  ID int
  // The peer ID in the form of Qm... or 12D3...
  PeerID string
  // When was the peer successfully dialed the first time
  FirstSuccessfulDial time.Time
  // When was the most recent successful dial to the peer above
  LastSuccessfulDial time.Time
  // When should we try to dial the peer again
  NextDialAttempt null.Time
  // When did we notice that this peer is not reachable.
  // This cannot be null because otherwise the unique constraint
  // uq_peer_id_first_failed_dial would not work (nulls are distinct).
  // An unset value corresponds to the timestamp 1970-01-01
  FirstFailedDial time.Time
  // The duration that this peer was online due to multiple subsequent successful dials
  MinDuration null.String
  // The duration from the first successful dial to the point where the peer became unreachable
  MaxDuration null.String
  // Indicates whether this session is finished. Equivalent to checking for
  // 1970-01-01 in the first_failed_dial field.
  Finished bool
  // How many subsequent successful dials could we track
  SuccessfulDials int
  // When was this session instance updated the last time
  UpdatedAt time.Time
  // When was this session instance created
  CreatedAt time.Time
}

At the end of each crawl, nebula persists general statistics about the crawl, like the total duration, the number of dialable peers, encountered errors, agent versions, and so on.
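
For illustration, such a crawl record could look like the struct below. The field names are assumptions made for this sketch and not necessarily nebula's actual schema.

type Crawl struct {
  // When the crawl was started and when it finished
  StartedAt  time.Time
  FinishedAt time.Time
  // Total number of peers that were visited during this crawl
  CrawledPeers int
  // Number of peers that could successfully be connected to
  DialablePeers int
  // Number of peers that could not be reached
  UndialablePeers int
}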

Info: You can use the crawl sub-command with the --dry-run option that skips any database operations.

monitor

Every 10 seconds, the monitor sub-command polls the database for all sessions (see above) that are due to be dialed within the next 10 seconds (based on the NextDialAttempt timestamp). It attempts to dial these peers using their previously saved multi addresses and updates their session instances depending on whether they were dialable or not.

The NextDialAttempt timestamp is calculated based on the uptime that nebula has observed for the given peer. If the peer has been up for a long time, nebula assumes that it will stay up and therefore decreases the dial frequency, i.e., it sets the NextDialAttempt timestamp to a time further in the future.
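
As a rough illustration of this scheduling idea (the factor and bounds below are assumptions; the actual values in nebula may differ):

// nextDialAttempt schedules the next dial proportionally to the observed uptime:
// the longer a peer has been online, the longer nebula waits before dialing again.
func nextDialAttempt(firstSuccessfulDial, lastSuccessfulDial time.Time) time.Time {
  uptime := lastSuccessfulDial.Sub(firstSuccessfulDial)

  // Wait a fraction of the observed uptime before the next dial, clamped to sensible bounds.
  wait := uptime / 10
  if wait < 30*time.Second {
    wait = 30 * time.Second
  }
  if wait > 15*time.Minute {
    wait = 15 * time.Minute
  }
  return lastSuccessfulDial.Add(wait)
}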

ping

The ping sub-command fetches all peers from the database that were found online during the most recent successful crawl and sends ten ICMP pings to each host. The measured latencies are saved in the latencies table.
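
A latency measurement like this could be implemented with the go-ping library, roughly as sketched below; whether nebula uses this exact library is an assumption.

import (
  "time"

  "github.com/go-ping/ping"
)

// measureLatency sends ten ICMP echo requests to the host and returns the average RTT.
func measureLatency(host string) (time.Duration, error) {
  pinger, err := ping.NewPinger(host)
  if err != nil {
    return 0, err
  }
  pinger.Count = 10
  pinger.Timeout = 30 * time.Second
  // Sending raw ICMP packets usually requires elevated privileges on Linux.
  pinger.SetPrivileged(true)

  // Run blocks until all pings were sent and answered or the timeout is reached.
  if err := pinger.Run(); err != nil {
    return 0, err
  }
  return pinger.Statistics().AvgRtt, nil
}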

resolve

The resolve sub-command goes through all multi addresses that are present in the database and resolves them to their respective IP addresses. A single multi address can map to multiple IP addresses, e.g., due to the dnsaddr protocol. Furthermore, it queries MaxMind's GeoLite2 database to extract country information about the IP addresses and saves it alongside the resolved addresses.
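
A sketch of this flow using the go-multiaddr-dns and geoip2-golang libraries could look like the function below; the function itself is illustrative and not nebula's actual code.

import (
  "context"

  ma "github.com/multiformats/go-multiaddr"
  madns "github.com/multiformats/go-multiaddr-dns"
  manet "github.com/multiformats/go-multiaddr/net"
  "github.com/oschwald/geoip2-golang"
)

// countriesFor resolves a (possibly dnsaddr-based) multi address into concrete
// IP addresses and returns the ISO country code for each of them. The geoDB is
// a MaxMind reader opened with geoip2.Open, e.g. on a GeoLite2-Country.mmdb file.
func countriesFor(ctx context.Context, addr string, geoDB *geoip2.Reader) ([]string, error) {
  maddr, err := ma.NewMultiaddr(addr)
  if err != nil {
    return nil, err
  }

  // One multi address can resolve to several concrete addresses (e.g. dnsaddr).
  resolved, err := madns.Resolve(ctx, maddr)
  if err != nil {
    return nil, err
  }

  var countries []string
  for _, r := range resolved {
    ip, err := manet.ToIP(r) // extract the IP component of the multi address
    if err != nil {
      continue // no IP component (e.g. a relay or onion address)
    }
    record, err := geoDB.Country(ip)
    if err != nil {
      continue
    }
    countries = append(countries, record.Country.IsoCode)
  }
  return countries, nil
}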

Install

Release download

There is no release yet.

From source

To compile it yourself run:

go install github.com/dennis-tra/nebula-crawler/cmd/nebula@latest # Go 1.16 or higher is required (may work with a lower version too)

Make sure $GOPATH/bin is in your PATH variable to be able to access the installed nebula executable.

Development

To develop this project you need Go >= 1.16 and a few additional tools.

To install the necessary tools you can run make tools. This uses the go install command to download and install the tools into your $GOPATH/bin directory, so make sure it is in your $PATH environment variable.

Database

You need a running postgres instance to persist and/or read the crawl results. Use the following command to start a local instance of postgres:

docker run -p 5432:5432 -e POSTGRES_PASSWORD=password -e POSTGRES_USER=nebula -e POSTGRES_DB=nebula postgres:13

Info: You can use the crawl sub-command with the --dry-run option that skips any database operations.

The default database settings are:

Name     = "nebula",
Password = "password",
User     = "nebula",
Host     = "localhost",
Port     = 5432,

To apply the migrations, run:

# Up migrations
migrate -database 'postgres://nebula:password@localhost:5432/nebula?sslmode=disable' -path migrations up
# OR
make migrate-up

# Down migrations
migrate -database 'postgres://nebula:password@localhost:5432/nebula?sslmode=disable' -path migrations down
# OR
make migrate-down

# Create new migration
migrate create -ext sql -dir migrations -seq some_migration_name

To generate the ORM with SQLBoiler run:

sqlboiler psql

Deployment

First, you need to build the nebula docker image:

make docker
# OR
docker build . -t dennis-tra/nebula-crawler:latest

The deploy subfolder contains a docker-compose setup to get up and running quickly. It will start and configure nebula (monitoring mode), postgres, prometheus, and grafana. The configuration can serve as a starting point to see how things fit together. You can then start the aforementioned services by changing into the ./deploy directory and running:

docker compose up 

A few seconds later you should be able to access Grafana at localhost:3000. The initial credentials are

USERNAME: admin
PASSWORD: admin

There is one preconfigured dashboard in the General folder with the name IPFS Dashboard. To start a crawl that puts its results in the docker compose provisioned postgres database run:

./deploy/crawl.sh
# OR
docker run \
  --network nebula \
  --name nebula_crawler \
  --hostname nebula_crawler \
  dennis-tra/nebula-crawler:latest \
  nebula --db-host=postgres crawl

Currently, I'm running the crawler docker-less on a tiny VPS at a 30-minute interval. The corresponding crontab configuration is:

*/30 * * * * /some/path/nebula crawl 2> /var/log/nebula/crawl-$(date "+\%w-\%H-\%M")-stderr.log 1> /var/log/nebula/crawl-$(date "+\%w-\%H-\%M")-stdout.log

The logs rotate every 7 days, because the file names only encode the weekday, hour, and minute.


To run the crawler for multiple DHTs, the idea is to start multiple instances of nebula with the corresponding configurations. For instance, I'm running the crawler for the IPFS and Filecoin networks. The monitoring commands look like this:

nebula --prom-port=6667 monitor --workers=1000 # for IPFS
nebula --prom-port=6669 --config filecoin.json monitor --workers=1000 # for Filecoin

The filecoin.json file contains the following content:

 {
  "BootstrapPeers": [
    "/ip4/3.224.142.21/tcp/1347/p2p/12D3KooWCVe8MmsEMes2FzgTpt9fXtmCY7wrq91GRiaC8PHSCCBj",
    "/ip4/107.23.112.60/tcp/1347/p2p/12D3KooWCwevHg1yLCvktf2nvLu7L9894mcrJR4MsBCcm4syShVc",
    "/ip4/100.25.69.197/tcp/1347/p2p/12D3KooWEWVwHGn2yR36gKLozmb4YjDJGerotAPGxmdWZx2nxMC4",
    "/ip4/3.123.163.135/tcp/1347/p2p/12D3KooWKhgq8c7NQ9iGjbyK7v7phXvG6492HQfiDaGHLHLQjk7R",
    "/ip4/18.198.196.213/tcp/1347/p2p/12D3KooWL6PsFNPhYftrJzGgF5U18hFoaVhfGk7xwzD8yVrHJ3Uc",
    "/ip4/18.195.111.146/tcp/1347/p2p/12D3KooWLFynvDQiUpXoHroV1YxKHhPJgysQGH2k3ZGwtWzR4dFH",
    "/ip4/52.77.116.139/tcp/1347/p2p/12D3KooWP5MwCiqdMETF9ub1P3MbCvQCcfconnYHbWg6sUJcDRQQ",
    "/ip4/18.136.2.101/tcp/1347/p2p/12D3KooWRs3aY1p3juFjPy8gPN95PEQChm2QKGUCAdcDCC4EBMKf",
    "/ip4/13.250.155.222/tcp/1347/p2p/12D3KooWScFR7385LTyR4zU1bYdzSiiAb5rnNABfVahPvVSzyTkR",
    "/ip4/47.115.22.33/tcp/41778/p2p/12D3KooWGhufNmZHF3sv48aQeS13ng5XVJZ9E6qy2Ms4VzqeUsHk",
    "/ip4/61.147.123.111/tcp/12757/p2p/12D3KooWGHpBMeZbestVEWkfdnC9u7p6uFHXL1n7m1ZBqsEmiUzz",
    "/ip4/61.147.123.121/tcp/12757/p2p/12D3KooWQZrGH1PxSNZPum99M1zNvjNFM33d1AAu5DcvdHptuU7u",
    "/ip4/3.129.112.217/tcp/1235/p2p/12D3KooWBF8cpp65hp2u9LK5mh19x67ftAam84z9LsfaquTDSBpt",
    "/ip4/36.103.232.198/tcp/34721/p2p/12D3KooWQnwEGNqcM2nAcPtRR9rAX8Hrg4k9kJLCHoTR5chJfz6d",
    "/ip4/36.103.232.198/tcp/34723/p2p/12D3KooWMKxMkD5DMpSWsW7dBddKxKT7L2GgbNuckz9otxvkvByP"
  ],
  "DialTimeout": 60000000000,
  "CrawlWorkerCount": 1000,
  "MonitorWorkerCount": 1000,
  "CrawlLimit": 0,
  "MinPingInterval": 30000000000,
  "PrometheusHost": "localhost",
  "PrometheusPort": 6668, // this is overwritten by the command line arg and only picked up by the crawl command
  "DatabaseHost": "localhost",
  "DatabasePort": 5432,
  "DatabaseName": "nebula_filecoin",
  "DatabasePassword": "<password>",
  "DatabaseUser": "nebula_filecoin",
  "Protocols": [
    "/fil/kad/testnetnet/kad/1.0.0"
  ]
}

This configuration is created upon the first start of nebula in $XDG_CONFIG_HOME/nebula/config.json and can be adapted from there.

The corresponding crawl commands look like:

nebula crawl --workers=1000 # for IPFS (uses defaults like prom-port 6666)
nebula --config filecoin.json crawl --workers=1000 # for Filecoin (uses configuration prom-port 6668)

The workers flag configures the number of concurrent connections/dials. I increased it until I no longer noticed any performance improvement.

This results in the following Prometheus port configuration:

  • nebula crawl - ipfs - 6666
  • nebula monitor - ipfs - 6667
  • nebula crawl - filecoin - 6668
  • nebula monitor - filecoin - 6669

Furthermore, nebula has a hidden flag called --pprof-port. If this flag is set, nebula also serves pprof at localhost:<given-port> for debugging.

Analysis

There is a top-level analysis folder that contains various scripts to help understand the gathered data. More information can be found in the respective subfolders' README files. The following evaluations can be found there:

  • geoip - Uses a Maxmind database to map IP addresses to country ISO codes and plots the results.
  • churn - Uses a sessions database dump to construct a CDF of peer session lengths.
  • mixed - Multiple plotting scripts for various metrics of interest. See wcgcyx/nebula-crawler for plots as I have just copied the scripts from there.
  • report - A semi-automated set of scripts to generate the reports for dennis-tra/nebula-crawler-reports
  • More to come...

Related Efforts

Maintainers

@dennis-tra.

Contributing

Feel free to dive in! Open an issue or submit PRs.

Support

It would really make my day if you supported this project through Buy Me A Coffee.

Other Projects

You may be interested in one of my other projects:

  • pcp - Command line peer-to-peer data transfer tool based on libp2p.
  • image-stego - A novel approach to image manipulation detection: steganography-based image integrity, where Merkle tree nodes are embedded into image chunks so that each chunk's integrity can be verified on its own.

License

Apache License Version 2.0 © Dennis Trautwein
