
apache / Giraph

Licence: other
Mirror of Apache Giraph

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to Giraph

Cortx
CORTX Community Object Storage is 100% open source object storage uniquely optimized for mass capacity storage devices.
Stars: ✭ 426 (-25.13%)
Mutual labels:  big-data
Stream Framework
Stream Framework is a Python library, which allows you to build news feed, activity streams and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology.
Stars: ✭ 4,576 (+704.22%)
Mutual labels:  big-data
Thrill
Thrill - An EXPERIMENTAL Algorithmic Distributed Big Data Batch Processing Framework in C++
Stars: ✭ 528 (-7.21%)
Mutual labels:  big-data
Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+3774.87%)
Mutual labels:  big-data
Redislite
Redis in a python module.
Stars: ✭ 464 (-18.45%)
Mutual labels:  big-data
Magellan
Geo Spatial Data Analytics on Spark
Stars: ✭ 507 (-10.9%)
Mutual labels:  big-data
Datascience Ai Machinelearning Resources
Alex Castrounis' curated set of resources for artificial intelligence (AI), machine learning, data science, internet of things (IoT), and more.
Stars: ✭ 414 (-27.24%)
Mutual labels:  big-data
Pachyderm
Reproducible Data Science at Scale!
Stars: ✭ 5,305 (+832.34%)
Mutual labels:  big-data
Fit Sne
Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE)
Stars: ✭ 485 (-14.76%)
Mutual labels:  big-data
Arkime
Arkime (formerly Moloch) is an open source, large scale, full packet capturing, indexing, and database system.
Stars: ✭ 4,994 (+777.68%)
Mutual labels:  big-data
Conjure Up
Deploying complex solutions, magically.
Stars: ✭ 454 (-20.21%)
Mutual labels:  big-data
Hazelcast
Open-source distributed computation and storage platform
Stars: ✭ 4,662 (+719.33%)
Mutual labels:  big-data
Onlinestats.jl
Single-pass algorithms for statistics
Stars: ✭ 507 (-10.9%)
Mutual labels:  big-data
Circosjs
d3 library to build circular graphs
Stars: ✭ 436 (-23.37%)
Mutual labels:  big-data
Couchdb
Seamless multi-master syncing database with an intuitive HTTP/JSON API, designed for reliability
Stars: ✭ 5,166 (+807.91%)
Mutual labels:  big-data
Listenbrainz Server
Server for the ListenBrainz project
Stars: ✭ 420 (-26.19%)
Mutual labels:  big-data
Pgm Index
🏅State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes
Stars: ✭ 499 (-12.3%)
Mutual labels:  big-data
Scanner
Efficient video analysis at scale
Stars: ✭ 569 (+0%)
Mutual labels:  big-data
Nipype
Workflows and interfaces for neuroimaging packages
Stars: ✭ 557 (-2.11%)
Mutual labels:  big-data
Beam
Apache Beam is a unified programming model for Batch and Streaming
Stars: ✭ 5,149 (+804.92%)
Mutual labels:  big-data

Giraph: Large-scale graph processing on Hadoop

Web and online social graphs have been growing rapidly in size and scale during the past decade. In 2008, Google estimated that the number of web pages had reached over a trillion. Online social networking and email sites, including Yahoo!, Google, Microsoft, Facebook, LinkedIn, and Twitter, have hundreds of millions of users and are expected to grow much more in the future. Processing these graphs plays a big role in delivering relevant and personalized information to users, such as results from a search engine or news in an online social networking site.

Graph processing platforms to run large-scale algorithms (such as PageRank, shared connections, and personalization-based popularity) have become quite popular. Some recent examples include Pregel and HaLoop. For general-purpose big data computation, the map-reduce computing model has been well adopted, and the most widely deployed map-reduce infrastructure is Apache Hadoop. We have implemented a graph-processing framework that is launched as a typical Hadoop job, so it can leverage existing Hadoop infrastructure such as Amazon's EC2. Giraph builds upon the graph-oriented nature of Pregel but additionally adds fault tolerance to the coordinator process by using ZooKeeper as its centralized coordination service.

Giraph follows the bulk-synchronous parallel model relative to graphs where vertices can send messages to other vertices during a given superstep. Checkpoints are initiated by the Giraph infrastructure at user-defined intervals and are used for automatic application restarts when any worker in the application fails. Any worker in the application can act as the application coordinator and one will automatically take over if the current application coordinator fails.
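
As an illustration of this model, below is a minimal vertex computation in the PageRank style. This is only a sketch written against the Giraph 1.x BasicComputation API (which postdates parts of this README); class names and signatures may differ in your release, and it assumes every vertex has at least one out-edge.

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class SimplePageRank extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final int MAX_SUPERSTEPS = 30;

  @Override
  public void compute(
      Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) {
    // Messages sent in superstep S are delivered to their targets in superstep S + 1.
    if (getSuperstep() >= 1) {
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      vertex.setValue(new DoubleWritable(
          0.15 / getTotalNumVertices() + 0.85 * sum));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      // Distribute this vertex's rank evenly along its out-edges.
      sendMessageToAllEdges(vertex, new DoubleWritable(
          vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      // A halted vertex is reactivated only if it receives a message.
      vertex.voteToHalt();
    }
  }
}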


Hadoop versions for use with Giraph:

Secure Hadoop versions:

  • Apache Hadoop 1 (latest version: 1.2.1)

    This is the default version used by Giraph: if you do not specify a profile with the -P flag, maven will use this version. You may also explicitly specify it with "mvn -Phadoop_1".

  • Apache Hadoop 2 (latest version: 2.5.1)

    This is the latest version of Hadoop 2 (supporting YARN in addition to MapReduce) that Giraph can use. You may tell maven to use this version with "mvn -Phadoop_2".

  • Apache Hadoop YARN (version 2.2.0)

    You may tell maven to use this version with "mvn -Phadoop_yarn -Dhadoop.version=2.2.0".

  • Apache Hadoop 3.0.0-SNAPSHOT

    You may tell maven to use this version with "mvn -Phadoop_snapshot".

Unsecure Hadoop versions:

Other versions reported working include:

  • Cloudera CDH3u0, CDH3u1

While we provide support for unsecure and Facebook versions of Hadoop with the maven profiles 'hadoop_non_secure' and 'hadoop_facebook', respectively, we have been primarily focusing on secure Hadoop releases at this time.


Building and testing:

You will need a JDK and Maven installed.

Use the maven commands with secure Hadoop to:

  • compile (i.e. mvn compile)
  • package (i.e. mvn package)
  • test (i.e. mvn test)

For the non-secure versions of Hadoop, run the maven commands with the additional argument '-Phadoop_non_secure'. An example compilation command is 'mvn -Phadoop_non_secure compile'.

For the Facebook Hadoop release, run the maven commands with the additional argument '-Phadoop_facebook'. An example compilation command is 'mvn -Phadoop_facebook compile'.


Developing:

Giraph is a multi-module maven project. The top-level POM carries information common to all of the modules, and each module creates a jar with the code it contains.

The giraph/ module contains the main giraph code. If you only want to work on the main code, you can do all your work inside this subdirectory. Specifically, you would do something like:

giraph-root/giraph/ $ mvn verify        # build from current state
giraph-root/giraph/ $ mvn clean         # wipe out build files
giraph-root/giraph/ $ mvn clean verify  # build from fresh state
giraph-root/giraph/ $ mvn install       # install jar to local repository

The giraph-formats/ module contains hooks to read/write from various formats (e.g. Accumulo, HBase, Hive). It depends on the giraph module. This means that if you make local changes to the giraph codebase, you will first need to install the giraph/ jar locally so that giraph-formats/ will pick it up. In other words, something like this:

giraph-root/giraph/ $ mvn install
giraph-root/giraph-formats/ $ mvn verify

To build everything at once, you can issue the maven commands at the top level. Note that we use the "install" target so that if you have any local changes to giraph/ that giraph-formats/ needs, they will get picked up, because the giraph/ jar is installed locally first.

giraph-root/ $ mvn clean install


Scripting:

Giraph has support for writing user logic in languages other than Java. A Giraph job involves, at the very least, a Computation and Input/Output Formats. There are other optional pieces as well, such as Aggregators and Combiners.

As of this writing, we support writing the Computation logic in Jython. The Computation class is at the core of the algorithm, so it was a natural starting point. Our eventual goal is to allow users to write any or all components of their algorithms in any language they desire.

To use Jython with our job launcher, GiraphRunner, pass the path to the script as the Computation class argument. Additionally, you should set the -jythonClass option to let Giraph know the name of your Jython Computation class. Lastly, you will need to set -typesHolder to a class that extends Giraph's TypesHolder so that Giraph can infer the types you use. Look at page-rank.py as an example.
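
For example, a launch could look like the following (a sketch only: the jar name, the Jython class name PageRank, the TypesHolder subclass, and the worker count are illustrative placeholders, and the input/output format options your job needs are omitted):

hadoop jar giraph-with-dependencies.jar org.apache.giraph.GiraphRunner \
  /path/to/page-rank.py \
  -jythonClass PageRank \
  -typesHolder com.example.PageRankTypesHolder \
  -w 4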


How to run the unittests on a local pseudo-distributed Hadoop instance:

As mentioned earlier, Giraph supports several versions of Hadoop. In this section, we describe how to run the Giraph unittests against a single node instance of Apache Hadoop 0.20.203.

Download Apache Hadoop 0.20.203 (hadoop-0.20.203.0/hadoop-0.20.203.0rc1.tar.gz) from one of the mirrors listed at http://www.apache.org/dyn/closer.cgi/hadoop/common/ and unpack it into a local directory.

Follow the guide at http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed to set up a pseudo-distributed single-node Hadoop cluster.

Giraph’s code assumes that you can run at least 4 mappers at once; unfortunately, the default configuration allows only 2. You therefore need to update conf/mapred-site.xml:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
</property>

After preparing the local filesystem with:

rm -rf /tmp/hadoop-
/path/to/hadoop/bin/hadoop namenode -format

you can start the local hadoop instance:

/path/to/hadoop/bin/start-all.sh

and finally run Giraph’s unittests:

mvn clean test -Dprop.mapred.job.tracker=localhost:9001

Now you can open a browser, point it to http://localhost:50030 and watch the Giraph jobs from the unittests running on your local Hadoop instance!

Notes:

Counter limit: From Hadoop 0.20.203.0 onwards, there is a limit on the number of counters one can use, which is set to 120 by default. This limit restricts the number of iterations/supersteps possible in Giraph. It can be increased by setting the parameter "mapreduce.job.counters.limit" in the job tracker's config file, mapred-site.xml.
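
For example, to raise the limit you would add a property like the following to mapred-site.xml (the value 512 here is only illustrative):

<property>
  <name>mapreduce.job.counters.limit</name>
  <value>512</value>
</property>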
