Java and Clojure examples for processing Common Crawl WARC files

Mark Watson 2014/1/26

There are two Java examples and one Clojure example for now (more to come):

  • ReadWARC - reads a local WARC file that was manually copied from S3 storage to your laptop
  • ReadS3Bucket - this should be run on an EC2 instance for fast access to S3
  • clojure-examples/src/clojure-examples/core.clj - reads a local WARC file that was manually copied from S3 storage to your laptop

A JDK 1.7 or later is required (JDK 1.6 will not work).

Special thanks to the developers of the edu.cmu.lemurproject package from Carnegie Mellon University. Their code reads WARC files, and its source is included in the src subdirectory.

I have just started experimenting with Common Crawl data. I plan on adding a Hadoop/Elastic MapReduce example and also more examples using other JVM languages like Clojure and JRuby.

ReadWARC

Assuming that you have the AWS command line tools installed, you can list the contents of a crawl using:

aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2014-10/ --recursive | head -6

You can copy one segment file to your laptop (each segment file is smaller than 1 gigabyte) using:

aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2014-10/segments/1394023864559/warc/CC-MAIN-20140305125104-00002-ip-10-183-142-35.ec2.internal.warc.gz .

Then run this example using:

mvn install
mvn exec:java -Dexec.mainClass=org.commoncrawl.examples.java_warc.ReadWARC
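
To give a feel for what reading a local WARC file involves, here is a minimal sketch that uses only the JDK. It is not the ReadWARC source and it does not use the edu.cmu.lemurproject classes. The class name WarcHeaderScan and its methods are made up for this illustration. It assumes each record starts with a "WARC/" version line, is followed by "Name: value" headers and a blank line, and carries a Content-Length header giving the body size. It also relies on GZIPInputStream reading through concatenated gzip members, which works for local files but may not for every kind of stream.

```java
import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of this repository.
public class WarcHeaderScan {

  /** Reads one line of bytes as ISO-8859-1 text; returns null at end of stream. */
  static String readLine(InputStream in) throws IOException {
    int b = in.read();
    if (b < 0) {
      return null;
    }
    StringBuilder sb = new StringBuilder();
    while (b >= 0 && b != '\n') {
      if (b != '\r') {
        sb.append((char) b);
      }
      b = in.read();
    }
    return sb.toString();
  }

  /** Skips exactly n bytes, failing if the stream ends early. */
  static void skipFully(InputStream in, long n) throws IOException {
    byte[] buf = new byte[8192];
    while (n > 0) {
      int read = in.read(buf, 0, (int) Math.min(buf.length, n));
      if (read < 0) {
        throw new EOFException("record body is shorter than its Content-Length");
      }
      n -= read;
    }
  }

  /** Returns the WARC-Target-URI of every record that declares one. */
  public static List<String> targetUris(InputStream gzipped) throws IOException {
    InputStream in = new BufferedInputStream(new GZIPInputStream(gzipped));
    List<String> uris = new ArrayList<String>();
    String line;
    while ((line = readLine(in)) != null) {
      if (!line.startsWith("WARC/")) {
        continue; // blank lines that separate records
      }
      long bodyLength = 0;
      // Read the record headers up to the blank line that ends them.
      while ((line = readLine(in)) != null && !line.isEmpty()) {
        int colon = line.indexOf(':');
        if (colon < 0) {
          continue;
        }
        String name = line.substring(0, colon).trim();
        String value = line.substring(colon + 1).trim();
        if (name.equalsIgnoreCase("WARC-Target-URI")) {
          uris.add(value);
        } else if (name.equalsIgnoreCase("Content-Length")) {
          bodyLength = Long.parseLong(value);
        }
      }
      // Skip the body so that payload text is never mistaken for headers.
      skipFully(in, bodyLength);
    }
    return uris;
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: java WarcHeaderScan <file.warc.gz>");
      return;
    }
    try (InputStream file = new FileInputStream(args[0])) {
      for (String uri : targetUris(file)) {
        System.out.println(uri);
      }
    }
  }
}
```

Run against the segment file copied above, this should print one URI per record that has a WARC-Target-URI header. For real work the lemurproject reader used by ReadWARC is the better choice, since it also gives access to the record contents.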

ReadS3Bucket

You can set the maximum number of segment files to process using the max argument:

public class ReadS3Bucket {
  static public void process(AmazonS3 s3, String bucketName, String prefix, int max) {

As you can see in the example code, I pass the bucket and prefix as:

    process(s3, "commoncrawl", "crawl-data/CC-MAIN-2014-10", 2);
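
The effect of the max argument can be sketched without any AWS dependency. The following is an illustration only, not the ReadS3Bucket source: the class SegmentKeyLimit, the method selectWarcKeys, and the sample keys are hypothetical, and a plain list of key strings stands in for the S3 object listing. It assumes that segment files are recognized by the prefix and a ".warc.gz" suffix and that processing stops once max files have been selected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of this repository.
public class SegmentKeyLimit {

  /** Returns at most max keys that start with prefix and end in ".warc.gz". */
  public static List<String> selectWarcKeys(Iterable<String> keys, String prefix, int max) {
    List<String> selected = new ArrayList<String>();
    for (String key : keys) {
      if (selected.size() >= max) {
        break; // limit reached; the remaining keys are never examined
      }
      if (key.startsWith(prefix) && key.endsWith(".warc.gz")) {
        selected.add(key);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    // Made-up keys standing in for an S3 listing of the bucket.
    List<String> keys = Arrays.asList(
        "crawl-data/CC-MAIN-2014-10/segments/1/warc/a.warc.gz",
        "crawl-data/CC-MAIN-2014-10/segments/1/wet/a.warc.wet.gz",
        "crawl-data/CC-MAIN-2014-10/segments/1/warc/b.warc.gz",
        "crawl-data/CC-MAIN-2014-10/segments/1/warc/c.warc.gz");
    for (String key : selectWarcKeys(keys, "crawl-data/CC-MAIN-2014-10", 2)) {
      System.out.println(key);
    }
  }
}
```

Stopping as soon as the limit is reached matters on a real bucket, where a crawl prefix can contain far more keys than you want to list in one test run.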

Note: on the Common Crawl AMI (I run it on a medium EC2 instance), I installed JDK 1.7, which is required for the edu.cmu.lemurproject package:

sudo yum install java-1.7.0-openjdk-devel.x86_64

TODO: In addition to installing Java 7, you also need to configure it using

sudo alternatives --config javac
sudo alternatives --config java

TODO: Maven needs to be installed and it's not available through yum without some gymnastics.

After cloning the GitHub repository to get these examples on an EC2 instance:

git clone https://github.com/commoncrawl/example-warc-java.git
cd example-warc-java

build and run using:

mvn install
mvn exec:java -Dexec.mainClass=org.commoncrawl.examples.java_warc.ReadS3Bucket

Note: I also tested this on a micro EC2 instance, where processing two gzipped segment files (each a little under 1 gigabyte) takes about 45 seconds.

Clojure Examples

You need to install the commoncrawl JAR file in your local maven repository:

mvn install:install-file -Durl=file:repo -DpomFile=pom.xml -DgroupId=local -DartifactId=commoncrawl -Dversion=0.0.1 -Dpackaging=jar -Dfile=target/commoncrawl-0.0.1.jar

Then you can:

cd clojure-examples
lein deps
lein test

License

This code is licensed under the Apache 2 license. Please give back to Common Crawl if you find it useful.
