
commoncrawl / Commoncrawl Crawler

The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)

Programming Languages

java

This is the primary repository for the services & map-reduce jobs used to produce the CommonCrawl web corpus from 2008 to 2012.

Tree Structure

  • org.commoncrawl.async - Utility code used to build asynchronous servers.
  • org.commoncrawl.hadoop.io - ARCInputFormat and related classes.
  • org.commoncrawl.hadoop.mergeutils - Support for merge-sorts outside the context of a Hadoop job.
  • org.commoncrawl.hadoop.template - Sample Hadoop Job.
  • org.commoncrawl.io - CommonCrawl IO library used by crawlers.
  • org.commoncrawl.mapred - Root for all MapReduce jobs. Also contains data structure definitions shared across jobs (database.jr).
  • org.commoncrawl.mapred.ec2.parser - Code used to generate ARCFiles and intermediate data on EC2 using EMR.
  • org.commoncrawl.mapred.ec2.postprocess.deduper - Code to support a parallel dedupe using a 64-bit Simhash.
  • org.commoncrawl.mapred.ec2.postprocess.linkCollector - Code to merge metadata generated by the parser job.
  • org.commoncrawl.mapred.pipelineV3 - The start of the new Nutch-free map-reduce pipeline used to process crawl metadata and generate new crawl lists.
  • org.commoncrawl.mapred.segmenter - Support code used to generate Crawl Segments (URL lists consumed by the crawlers).
  • org.commoncrawl.protocol - Shared data structure and enum definitions (generated).
  • org.commoncrawl.rpc - CommonCrawl RPC library used to build distributed systems.
  • org.commoncrawl.server - CommonCrawl Server base class used by various services.
  • org.commoncrawl.service - All long-lived processes in the CommonCrawl system are housed under this directory.
  • org.commoncrawl.service.crawler - The long-running crawler process (consumes Crawl Lists, writes content to HDFS).
  • org.commoncrawl.service.crawlhistory - A service that manages a crawler's crawl state in a BloomFilter.
  • org.commoncrawl.service.directory - A barebones service used to store and subscribe to lists via a path.
  • org.commoncrawl.service.dns - CommonCrawl DNS Service (used by crawlers to queue up DNS requests).
  • org.commoncrawl.service.listcrawler - A different type of list crawler that supports dynamic uploading and crawling of very large lists of URLs.
  • org.commoncrawl.service.pagerank - PageRank Master / Slave implementations (and related code) used to compute PageRank across the graph.
  • org.commoncrawl.service.parser - The beginnings of a distributed parser service that crawlers can use to do on-demand link extraction.
  • org.commoncrawl.service.queryserver - The (deprecated) crawl metadata service.
  • org.commoncrawl.service.statscollector - Service that receives crawl stats.
  • org.commoncrawl.util - The catch-all repository of Utility classes used by the CommonCrawl system.
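
For readers unfamiliar with the 64-bit Simhash mentioned under the deduper package, the sketch below illustrates the general technique only. It is not taken from this repository: the class name SimhashSketch, the whitespace tokenizer, and the FNV-1a token hash are all assumptions chosen for illustration, and the actual deduper may tokenize, weight, and hash content differently. The idea is that each token votes on every bit of the fingerprint, so documents sharing most tokens tend to produce fingerprints separated by a small Hamming distance.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; not the implementation used in this repository.
public class SimhashSketch {

    // 64-bit FNV-1a hash of a token (an assumed stand-in for the real token hash).
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;          // FNV-1a 64-bit offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;               // FNV 64-bit prime
        }
        return hash;
    }

    // Computes a 64-bit Simhash over whitespace-separated, lower-cased tokens.
    static long simhash(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;                         // skip the empty token from leading whitespace
            }
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                // Each token votes +1 where its hash bit is set, -1 otherwise.
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= (1L << bit);       // keep bits with a positive vote total
            }
        }
        return fingerprint;
    }

    // Number of bit positions in which two fingerprints differ.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long a = simhash("the quick brown fox jumps over the lazy dog");
        long b = simhash("the quick brown fox jumped over the lazy dog");
        System.out.println("fingerprint a: " + Long.toHexString(a));
        System.out.println("fingerprint b: " + Long.toHexString(b));
        System.out.println("hamming distance: " + hammingDistance(a, b));
    }
}
```

A deduplication pass would typically treat two documents as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold and the way candidate pairs are grouped in parallel are design choices this sketch does not cover.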

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Contributors

Ahad Rana (ahad at commoncrawl.org)
