commoncrawl / Commoncrawl

Common Crawl support library to access 2008-2012 crawl archives (ARC files)

Common Crawl Support Library

Overview

This library provides support code for consuming the Common Crawl corpus raw crawl data (ARC files) stored on S3. See https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set for more information about how to access the corpus.

You can take two primary routes to consuming the ARC File content:

(1) You can run a Hadoop job, either on your own Hadoop cluster on EC2 or on EMR. In this case, you can use the ARCFileInputFormat to drive data to your mappers/reducers. There are two versions of the InputFormat: one written to conform to the deprecated mapred package, located at org.commoncrawl.hadoop.io.mapred, and one written for the mapreduce package, located at org.commoncrawl.hadoop.io.mapreduce.

(2) You can decode data directly by feeding an InputStream to the ARCFileReader class located in the org.commoncrawl.util.shared package.

Both routes (the InputFormat route or the direct ARCFileReader route) produce a tuple consisting of a UTF-8 encoded URL (Text) and the raw content (BytesWritable) that the crawler downloaded, including the HTTP headers. The HTTP headers are UTF-8 encoded, and the headers are separated from the content by consecutive CRLF tokens (that is, a blank line). The content itself, when it has a text MIME type, is encoded using the source text encoding.
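
As a rough illustration of how the raw content value might be handled once you have it as a byte array, the sketch below splits the bytes at the first CRLF CRLF sequence, decodes the header block as UTF-8, and decodes the body using the charset named in the Content-Type header. ArcRecordSplitter, Parts, split, and charsetOf are hypothetical names and are not part of this library; the sketch also assumes the Content-Type header carries a charset parameter, and falls back to a caller-supplied default when it does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of the commoncrawl library. */
final class ArcRecordSplitter {

    /** Header block and body of one raw content value. */
    static final class Parts {
        final String headers; // HTTP headers, decoded as UTF-8
        final byte[] body;    // payload bytes, still in the source encoding

        Parts(String headers, byte[] body) {
            this.headers = headers;
            this.body = body;
        }
    }

    // Matches a charset parameter on a Content-Type header line (case-insensitive).
    private static final Pattern CHARSET = Pattern.compile(
            "(?im)^content-type:.*?charset\\s*=\\s*\"?([^\";\\s]+)");

    /** Splits the raw bytes at the first CRLF CRLF sequence. */
    static Parts split(byte[] raw) {
        for (int i = 0; i + 3 < raw.length; i++) {
            if (raw[i] == '\r' && raw[i + 1] == '\n'
                    && raw[i + 2] == '\r' && raw[i + 3] == '\n') {
                String headers = new String(raw, 0, i, StandardCharsets.UTF_8);
                byte[] body = Arrays.copyOfRange(raw, i + 4, raw.length);
                return new Parts(headers, body);
            }
        }
        // No delimiter found: treat the whole value as headers with an empty body.
        return new Parts(new String(raw, StandardCharsets.UTF_8), new byte[0]);
    }

    /** Returns the charset named in Content-Type, or the fallback if absent or unknown. */
    static Charset charsetOf(String headers, Charset fallback) {
        Matcher m = CHARSET.matcher(headers);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback below.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Made-up sample value standing in for the bytes of a BytesWritable.
        byte[] raw = ("HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/html; charset=ISO-8859-1\r\n"
                + "\r\n"
                + "<html>caf\u00e9</html>").getBytes(StandardCharsets.ISO_8859_1);

        Parts parts = split(raw);
        Charset cs = charsetOf(parts.headers, StandardCharsets.UTF_8);
        System.out.println(parts.headers);
        System.out.println(new String(parts.body, cs));
    }
}
```

With a Hadoop BytesWritable, only the first getLength() bytes of the backing array are valid, so copy that range (for example with Arrays.copyOf) before passing it to a helper like this one.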

Build Notes:

  1. You need to define JAVA_HOME, and make sure you have Ant & Maven installed.
  2. Set hadoop.path (in build.properties) to point to your Hadoop distribution.

Sample Usage:

Once commoncrawl.jar has been built, you can validate that the ARCFileReader works for you by running the following sample command from the root of the commoncrawl source directory:

./bin/launcher.sh org.commoncrawl.util.shared.ARCFileReader --awsAccessKey <ACCESS KEY> --awsSecret <SECRET> --file s3n://aws-publicdatasets/common-crawl/parse-output/segment/1341690164240/1341819847375_4319.arc.gz