Domains Project: Processing petabytes of data so you don't have to

World's single largest Internet domains dataset

This public dataset contains a freely available, sorted list of Internet domains.

Dataset statistics

Project news

Support needed!

You can support this project by doing any combination of the following:

  • Posting a link on your website to DomainsProject
  • Sponsoring this project on Patreon
  • Opening an issue and attaching other domain datasets that are not here yet (be sure to scroll through this README first)

Milestones:

Domains

  • 10 Million
  • 100 Million
  • 1 Billion
  • 1.7 Billion

(Wasted) Internet traffic:

  • 500TB
  • 925TB
  • 1PB
  • 1.3PB
  • 1.5PB

Random facts:

  • More than 1TB of Internet traffic yields just 3 MB of compressed domain data
  • 1 million domains is just 5 MB compressed
  • More than 5.7PB of Internet traffic is necessary to crawl 1.7 billion domains (3.4TB per 1 million)
  • Only 4.6 GB of disk space is required to store 1.7 billion domains in compressed form
  • A fully saturated 1Gbit link is good for about 2 million new domains every day (see the back-of-envelope sketch after this list)
  • An 8c/16t machine with 64 GB of RAM is good for about 2 million new domains every day
  • Two ISC BIND 9 instances (>400 MB RSS each) are required to process 2 million new domains every day
  • After reaching 9 million domains, the repository was switched to compressed files. Please use the freely available XZ to unpack them.
  • After reaching 30 million records, files were moved to /data so the repository doesn't have its README at the very bottom.
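
The bandwidth claim above can be checked with back-of-envelope arithmetic in Python; the 3.4TB-per-million figure comes from the facts list, everything else follows from it:

# Sustained bandwidth needed for 2 million new domains per day,
# assuming ~3.4 TB of crawl traffic per 1 million domains (figure from the list above).
bytes_per_million = 3.4e12
domains_per_day = 2_000_000
bytes_per_day = bytes_per_million * (domains_per_day / 1_000_000)  # 6.8 TB/day
gbits_per_second = bytes_per_day * 8 / 86_400 / 1e9
print(f"{gbits_per_second:.2f} Gbit/s sustained")  # ~0.63 Gbit/s, within a 1 Gbit link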

Used by

CloudSEK

Using dataset

This repository employs Git LFS, so users need both git lfs and xz to retrieve the data. The cloning procedure is as follows:

git clone https://github.com/tb0hdan/domains.git
cd domains
git lfs install
./unpack.sh

Getting unfiltered dataset

For Patreon subscribers, raw data is available at https://dataset.domainsproject.org.

wget -m https://dataset.domainsproject.org

Data format

After unpacking, domain lists are just text files (~49 GB at 1.7 billion domains) with one domain per line. Sample from data/afghanistan/domain2multi-af.txt:

1tv.af
1tvnews.af
3rdeye.af
8am.af
aan.af
acaa.gov.af
acb.af
acbr.gov.af
acci.org.af
ach.af
acku.edu.af
acsf.af
adras.af
aeiti.af
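
Since each file holds one domain per line, any line-oriented tool works. A minimal Python sketch (the path is a real file from the dataset; the .gov.af filter is just an illustration):

# Stream one unpacked domain list and pick out Afghan government domains.
with open("data/afghanistan/domain2multi-af.txt") as fh:
    gov = [line.strip() for line in fh if line.strip().endswith(".gov.af")]
print(len(gov), "government domains")
# lzma.open(path, "rt") from the standard library can read the .xz files
# directly if you prefer to skip unpacking.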

Search engines and crawlers

Crawlers

Domains Project bot

Domains Project uses a crawler and DNS checks to discover new domains.

The DNS checks client, called Freya, is in its early stages and is used by a select few; I'm working on making it stable and good enough for the general public.

The HTTP crawler, called Idun, is being rewritten as well.
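
Freya's code isn't included here, but the idea behind a DNS liveness check is straightforward. A minimal sketch using only Python's standard library (not the project's actual client):

import socket

def resolves(domain: str) -> bool:
    # True if the domain has at least one A/AAAA record.
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False

print(resolves("domainsproject.org"))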

A typical user agent for the Domains Project bot looks like this:

Mozilla/5.0 (compatible; Domains Project/1.0.8; +https://domainsproject.org)

Some older versions set it to the GitHub repo:

Mozilla/5.0 (compatible; Domains Project/1.0.4; +https://github.com/tb0hdan/domains)
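
To identify the bot in your own access logs, matching the stable "Domains Project" token is more robust than pinning a version number. A hypothetical sketch (the log line is illustrative):

import re

# Matches any released version of the bot's user agent.
BOT_RE = re.compile(r"Domains Project/\d+\.\d+\.\d+")

line = '203.0.113.7 - - [01/Jan/2024:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Domains Project/1.0.8; +https://domainsproject.org)"'
if BOT_RE.search(line):
    print("Domains Project bot")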

All data in this dataset is gathered using Scrapy and Colly frameworks.

Starting with version 1.0.7, the crawler has partial robots.txt support and rate limiting. Please open an issue if you experience any problems, and don't forget to include your domain.
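
The crawler's own source isn't reproduced in this README, but Scrapy exposes the relevant politeness controls directly. A hypothetical minimal spider (USER_AGENT, ROBOTSTXT_OBEY and DOWNLOAD_DELAY are standard Scrapy settings; the spider itself is illustrative, not the project's code):

import scrapy

class PoliteSpider(scrapy.Spider):
    name = "polite_example"
    start_urls = ["https://example.com"]
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (compatible; Domains Project/1.0.8; +https://domainsproject.org)",
        "ROBOTSTXT_OBEY": True,   # fetch and honor robots.txt before crawling
        "DOWNLOAD_DELAY": 1.0,    # at most one request per second per domain
    }

    def parse(self, response):
        # Emit every outgoing link; a real crawler would extract hostnames here.
        for href in response.css("a::attr(href)").getall():
            yield {"url": response.urljoin(href)}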

Others

Yacy

Yacy is a great open-source search engine. Here's my post on the Yacy forum: https://searchlab.eu/t/domain-list-for-easier-search-bootstrapping/231

Additional sources

Rapid7 Sonar FDNS - no longer open

List of .FR domains from AfNIC.fr

Majestic Million

Internetstiftelsen Zone Data

DNS Census 2013

bigdatanews extract from Common Crawl (circa 2012)

Common Crawl - March/April 2020

The CAIDA UCSD IPv4 Routed /24 DNS Names Dataset - January/July 2019

GSA Data

OpenPageRank 10m hosts

Switch.ch Open Data

Slovak domains - Open Data

Research

This dataset can be used for research. The papers below cover different topics; links are left here for reference.

Published works based on this dataset

Phishing Protection SPF, DKIM, DMARC

Email address analysis (Czech)

Proteus: A Self-Designing Range Filter

Large Scale String Analytics in Arkouda

Analysis

The Internet of Names: A DNS Big Dataset

Enabling Network Security Through Active DNS Datasets

Re-registration and general statistics

Analysis of the Internet Domain Names Re-registration Market

Lexical analysis of malicious domains

Detection of malicious domains through lexical analysis

Malicious Domain Names Detection Algorithm Based on Lexical Analysis and Feature Quantification

Detecting Malicious URLs Using Lexical Analysis
