centic9 / CommonCrawlDocumentDownload

License: BSD-2-Clause
A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. It is used for mass-testing of frameworks like Apache POI and Apache Tika.

Programming Languages

java
shell

Projects that are alternatives of or similar to CommonCrawlDocumentDownload

Mime Db
Media Type Database
Stars: ✭ 612 (+1323.26%)
Mutual labels:  mime-types
Swime
🗂 Swift MIME type checking based on magic bytes
Stars: ✭ 119 (+176.74%)
Mutual labels:  mime-types
php-mimetyper
PHP mime type and extension mapping library: built with jshttp/mime-db, compatible with Symfony and Laravel
Stars: ✭ 21 (-51.16%)
Mutual labels:  mime-types
Mime Types
The ultimate javascript content-type utility.
Stars: ✭ 865 (+1911.63%)
Mutual labels:  mime-types
Filetype
Fast, dependency-free, small Go package to infer the binary file type based on the magic numbers signature
Stars: ✭ 1,278 (+2872.09%)
Mutual labels:  mime-types
Fileio.jl
Main Package for IO, loading all different kind of files
Stars: ✭ 133 (+209.3%)
Mutual labels:  mime-types
Ruby Mime Types
Ruby MIME type registry library
Stars: ✭ 288 (+569.77%)
Mutual labels:  mime-types
mimer
A simple Mime type getter
Stars: ✭ 15 (-65.12%)
Mutual labels:  mime-types
Mime
The Hoa\Mime library.
Stars: ✭ 100 (+132.56%)
Mutual labels:  mime-types
KeywordAnalysis
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
Stars: ✭ 49 (+13.95%)
Mutual labels:  commoncrawl
Sixarm ruby magic number type
SixArm.com » Ruby » MagicNumberType infers a data type from the data's leading bytes
Stars: ✭ 13 (-69.77%)
Mutual labels:  mime-types
Mog
A different take on the UNIX tool cat
Stars: ✭ 62 (+44.19%)
Mutual labels:  mime-types
Yagmail
Send email in Python conveniently for gmail using yagmail
Stars: ✭ 2,169 (+4944.19%)
Mutual labels:  mime-types
Mime
Shared MIME-info database in D programming language
Stars: ✭ 7 (-83.72%)
Mutual labels:  mime-types
chatnoir-resiliparse
A robust web archive analytics toolkit
Stars: ✭ 26 (-39.53%)
Mutual labels:  warc
Mimetype
A fast golang library for MIME type and file extension detection, based on magic numbers
Stars: ✭ 452 (+951.16%)
Mutual labels:  mime-types
Apaxy
a simple, customisable theme for your apache directory listing
Stars: ✭ 1,672 (+3788.37%)
Mutual labels:  mime-types
khudro
Khudro is a very light weight web-server built with C.
Stars: ✭ 19 (-55.81%)
Mutual labels:  mime-types
ungoliant
🕷️ The pipeline for the OSCAR corpus
Stars: ✭ 69 (+60.47%)
Mutual labels:  commoncrawl
mimesniff
MIME Sniffing Standard
Stars: ✭ 89 (+106.98%)
Mutual labels:  mime-types

This is a small tool to find matching URLs and download the corresponding binary data from the CommonCrawl indexes.

The newer URL Index (http://blog.commoncrawl.org/2015/04/announcing-the-common-crawl-index/) is supported. Support for the older URL Index, as described at https://github.com/trivio/common_crawl_index and http://blog.commoncrawl.org/2013/01/common-crawl-url-index/, is still available in the "oldindex" package.

Please note that a full run usually finds a huge number of files and thus downloading will require a large amount of time and lots of disk-space if the data is stored locally!

Getting started

Grab it

git clone https://github.com/centic9/CommonCrawlDocumentDownload.git

Build it and create the distribution files

cd CommonCrawlDocumentDownload
./gradlew check

Run it

Fetch a list of interesting documents

./gradlew lookupURLs

Reads the current Common Crawl URL index data, extracts all URLs with interesting mime-types or file extensions, and stores them in a file called commoncrawl-CC-MAIN-<year>-<crawl>.txt.
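
The filtering step described above could look roughly like the following sketch. It assumes each index line has the form "<urlkey> <timestamp> <JSON>" with "url" and "mime" fields in the JSON part, as in the published Common Crawl CDX index files. The class name CdxLineFilter and the small extension and mime-type sets are hypothetical stand-ins for the project's Extensions and MimeTypes classes, and the regex-based field extraction is a simplification, not a real JSON parser.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the project: decides whether one CDX line is "interesting".
class CdxLineFilter {
    // Illustrative values only; the project keeps its own lists in Extensions and MimeTypes.
    static final Set<String> EXTENSIONS = Set.of(".doc", ".xls", ".ppt", ".pdf");
    static final Set<String> MIME_TYPES = Set.of("application/msword", "application/pdf");

    // Simplified extraction of two string fields from the JSON part of the line.
    private static final Pattern URL = Pattern.compile("\"url\"\\s*:\\s*\"([^\"]*)\"");
    private static final Pattern MIME = Pattern.compile("\"mime\"\\s*:\\s*\"([^\"]*)\"");

    static String field(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    static boolean isInteresting(String cdxLine) {
        String url = field(URL, cdxLine);
        if (url == null) {
            return false; // not a line we can interpret
        }
        String mime = field(MIME, cdxLine);
        if (mime != null && MIME_TYPES.contains(mime.toLowerCase(Locale.ROOT))) {
            return true;
        }
        // Compare the extension against the URL without its query string.
        String path = url.toLowerCase(Locale.ROOT);
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        for (String ext : EXTENSIONS) {
            if (path.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String line = "com,example)/report.pdf 20150401000000 "
                + "{\"url\": \"http://example.com/report.pdf\", \"mime\": \"application/pdf\"}";
        System.out.println(isInteresting(line));
    }
}
```

A line matches if either its mime-type or its file extension is in the configured set, which mirrors the "mime-types or file extensions" wording above.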

Download documents

./gradlew downloadDocuments

Uses the URLs listed in commoncrawl-CC-MAIN-<year>-<crawl>.txt to download the documents from the Common Crawl archives.
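
As an illustration of how a single document can be fetched without downloading a whole archive file, here is a minimal sketch using an HTTP range request. It assumes that each index entry provides the "filename", "offset" and "length" of the record, that the archives are served under https://data.commoncrawl.org/, and that every record is stored as its own gzip member. The class WarcRecordFetcher and its methods are hypothetical and are not the project's actual implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of the project.
class WarcRecordFetcher {
    // Assumption: public base URL for Common Crawl data files.
    static final String BASE_URL = "https://data.commoncrawl.org/";

    /** Builds an HTTP Range value covering exactly one record (the end position is inclusive). */
    static String rangeHeader(long offset, long length) {
        if (offset < 0 || length <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and length > 0");
        }
        return "bytes=" + offset + "-" + (offset + length - 1);
    }

    /** Opens a stream over the decompressed record (WARC headers, HTTP headers, then the body). */
    static InputStream openRecord(String filename, long offset, long length)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(BASE_URL + filename))
                .header("Range", rangeHeader(offset, length))
                .GET()
                .build();
        HttpResponse<InputStream> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofInputStream());
        if (response.statusCode() != 206) { // 206 = Partial Content
            response.body().close();
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        // Assumption: the requested slice is a complete gzip member and can be unzipped on its own.
        return new GZIPInputStream(response.body());
    }
}
```

The caller still has to skip the WARC and HTTP headers inside the returned stream before writing the document bytes to disk.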

Deduplicate files

./gradlew deduplicate

Some files have identical content. This task detects them based on file size and content hash and moves all duplicates to a backup directory so that only unique files remain in place.
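
The size-then-hash approach can be sketched as follows. This is only an illustration of the technique: the class DuplicateFinder, the choice of SHA-256 and the method names are assumptions, not the code behind the deduplicate task.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the project.
class DuplicateFinder {
    /** Computes the SHA-256 hash of a file as a hex string. */
    static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Returns every file whose size and hash equal those of an earlier file in the list. */
    static List<Path> findDuplicates(List<Path> files) throws IOException, NoSuchAlgorithmException {
        // Group by size first, so that files with a unique size never need to be hashed.
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path file : files) {
            bySize.computeIfAbsent(Files.size(file), size -> new ArrayList<>()).add(file);
        }
        List<Path> duplicates = new ArrayList<>();
        for (List<Path> group : bySize.values()) {
            if (group.size() < 2) {
                continue;
            }
            Set<String> seenHashes = new HashSet<>();
            for (Path file : group) {
                if (!seenHashes.add(sha256(file))) {
                    duplicates.add(file); // same size and same hash as an earlier file
                }
            }
        }
        return duplicates;
    }

    /** Moves the given duplicates into a backup directory, keeping their file names. */
    static void moveToBackup(List<Path> duplicates, Path backupDir) throws IOException {
        Files.createDirectories(backupDir);
        for (Path file : duplicates) {
            Files.move(file, backupDir.resolve(file.getFileName()), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Grouping by size before hashing keeps the expensive step, reading every byte of a file, limited to the files that could actually be duplicates.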

Deprecated: Download documents from the old-index

./gradlew downloadOldIndex

Starts downloading the URL index files from the old index and looks at each URL, downloading binary data from the common crawl archives.

The longer stuff

Change it

Run unit tests

./gradlew check jacocoTestReport

Adjust which files are found

There are a few things that you can tweak:

  • The file-extensions that are detected as download-able files are handled in the class Extensions.
  • The mime-types that are detected as download-able files are handled in the class MimeTypes.
  • Adjust the name of the list of found files in DownloadURLIndex.COMMON_CRAWL_FILE.
  • Adjust the location where files are downloaded to in Utils.DOWNLOAD_DIR.
  • The starting file-index (of the approximately 300 cdx-files) is currently set as a constant in the class org.dstadler.commoncrawl.index.DownloadURLIndex. Adjusting it also lets you restart a download that was interrupted earlier.

Adjust which commoncrawl-index is fetched

CommonCrawl periodically runs crawls and publishes them. You can switch to newer crawls by adjusting the constant CURRENT_CRAWL in DownloadURLIndex.java to the proper <year>-<week> number of the newer crawl.

See https://commoncrawl.org/connect/blog/ for announcements of the latest crawls.
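
To show how the crawl identifier and the starting file-index relate to the files that are fetched, here is a small sketch that builds the URL of one cdx-file. The URL layout (https://data.commoncrawl.org/cc-index/collections/CC-MAIN-<year>-<week>/indexes/cdx-NNNNN.gz) is an assumption based on how Common Crawl publishes its index files, and the class CrawlIndexUrls is hypothetical. How DownloadURLIndex actually builds its URLs may differ.

```java
import java.util.Locale;

// Hypothetical helper, not part of the project.
class CrawlIndexUrls {
    // Assumption: layout under which Common Crawl publishes the cdx-files of one crawl.
    static final String PATTERN =
            "https://data.commoncrawl.org/cc-index/collections/CC-MAIN-%s/indexes/cdx-%05d.gz";

    /** Builds the URL of one cdx-file, e.g. for yearWeek "2015-14" and fileIndex 0. */
    static String indexUrl(String yearWeek, int fileIndex) {
        if (fileIndex < 0) {
            throw new IllegalArgumentException("fileIndex must not be negative");
        }
        return String.format(Locale.ROOT, PATTERN, yearWeek, fileIndex);
    }

    public static void main(String[] args) {
        // Resuming an interrupted run amounts to starting the loop at a later file-index.
        int startIndex = 42; // example value
        for (int i = startIndex; i < startIndex + 3; i++) {
            System.out.println(indexUrl("2015-14", i));
        }
    }
}
```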

Ideas

  • Old Index: By adding a new implementation of BlockProcesser (likely re-using existing code by deriving from one of the available implementations), you can do things like stream-processing the file instead of storing it locally, which avoids using too much disk-space.

Estimates (based on Old Index)

  • Size of overall URL Index is 233689120776 bytes, i.e. about 217 GiB
  • Header: 6 bytes
  • Index-Blocks: 2644
  • Block-Size: 65536 bytes
  • => Data-Blocks: 3563169
  • Approx. files per block: 2.421275
  • Resulting approx. number of files: 8627412
  • Avg. size per file: 221613 bytes
  • Needed storage: 1911954989425 bytes, i.e. about 1.7 TiB!
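
The figures above follow from each other by simple arithmetic, which the sketch below reproduces. The class OldIndexEstimate is hypothetical; the constants are taken from the list. Because the sketch multiplies with the rounded average file size, its storage result differs from the listed 1911954989425 bytes by a few megabytes.

```java
// Hypothetical helper, not part of the project: re-derives the estimates listed above.
class OldIndexEstimate {
    static final long INDEX_SIZE = 233_689_120_776L; // bytes
    static final long HEADER_SIZE = 6;               // bytes
    static final long INDEX_BLOCKS = 2_644;
    static final long BLOCK_SIZE = 65_536;           // bytes
    static final double FILES_PER_BLOCK = 2.421275;
    static final long AVG_FILE_SIZE = 221_613;       // bytes

    /** All blocks in the index minus the blocks that hold the index itself. */
    static long dataBlocks() {
        return (INDEX_SIZE - HEADER_SIZE) / BLOCK_SIZE - INDEX_BLOCKS;
    }

    /** Approximate number of files referenced by the data blocks. */
    static long files() {
        return Math.round(dataBlocks() * FILES_PER_BLOCK);
    }

    /** Approximate storage in bytes needed to keep all files locally. */
    static long storageBytes() {
        return files() * AVG_FILE_SIZE;
    }

    public static void main(String[] args) {
        System.out.println("Data blocks: " + dataBlocks());
        System.out.println("Files:       " + files());
        System.out.printf("Storage:     %d bytes = %.2f TiB%n",
                storageBytes(), storageBytes() / Math.pow(1024, 4));
    }
}
```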

Related projects/pages

Release it

./gradlew --console=plain release && ./gradlew closeAndReleaseRepository
  • This should automatically release the new version on Maven Central.
  • Afterwards, go to the GitHub releases page and add release notes.

Support this project

If you find this library useful and would like to support it, you can Sponsor the author
