
imri / mizo

Licence: Apache-2.0
Super-fast Spark RDD for Titan Graph Database on HBase

Programming Languages

java

Mizo

Super-fast Spark RDD for Titan Graph Database on HBase

Mizo enables you to perform Spark transformations and actions over a Titan DB, under the following circumstances:

  • It runs with an HBase backend
  • Its HBase internal data files (HFiles) are accessible via the network

Mizo was originally developed because Titan lacked an efficient and quick OLAP engine. OLAP over Titan was meant to be solved by libraries such as Faunus and TinkerPop's SparkGraphComputer, but neither can be used in production: the former is buggy and misses data, and the latter is generally an inefficient mechanism that spills lots of data. Moreover, both rely on the HBase API to retrieve the graph data in bulk, and this API is itself very slow. Mizo instead reads HBase's internal data files (called HFiles), parses them, and builds vertices and edges from them, without interacting with the HBase API.

In production

Mizo was tested in production on a Titan graph with about ten billion vertices and hundreds of billions of edges. Using a Spark cluster with a total of 100 cores and 150 GB of RAM (each Spark worker gets 1 core and 1.5 GB of RAM), it takes about 8 hours for Mizo to iterate over a graph with 2000 HBase regions.
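
To put those figures in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes the regions are spread evenly across the workers and that each worker processes one region at a time, neither of which is stated above. The class and method names are made up for illustration.

```java
class MizoRunEstimate {

    /** Regions each worker handles if regions are spread evenly (rounded up). */
    static long regionsPerWorker(long regions, long workers) {
        return (regions + workers - 1) / workers;
    }

    /** Average minutes a worker spends on one region, given the total wall-clock hours. */
    static double minutesPerRegion(double totalHours, long regions, long workers) {
        return totalHours * 60.0 / regionsPerWorker(regions, workers);
    }

    public static void main(String[] args) {
        long workers = 100;   // 100 cores, 1 core per Spark worker
        long regions = 2000;  // HBase regions in the graph
        double hours = 8.0;   // reported wall-clock time

        System.out.println("Total RAM (GB): " + workers * 1.5);
        System.out.println("Regions per worker: " + regionsPerWorker(regions, workers));
        System.out.println("Minutes per region: " + minutesPerRegion(hours, regions, workers));
    }
}
```

Under these assumptions each worker handles 20 regions and spends roughly 24 minutes on each one.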

Limitations

Mizo is limited in how it can traverse the graph: it is intended for single-hop queries only. You can reach a vertex and its edges, but you cannot jump to the vertex at the other end of an edge; you can only get its ID. For example, Mizo can be used to count how many vertices have a property called 'first_name', but it cannot be used to count edges that connect two vertices that both have a property called 'first_name', because only one vertex is available at a time.
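
The single-hop restriction can be illustrated without Spark or Mizo at all. The sketch below uses hypothetical stand-in classes (Vertex, Edge) that are not Mizo's own types; they only model what is visible at one time: a vertex, its properties, its edges, and the ID of the vertex at the far end of each edge.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class SingleHopSketch {

    /** Hypothetical edge: the far vertex is known only by its ID. */
    static final class Edge {
        final String label;
        final long otherVertexId;

        Edge(String label, long otherVertexId) {
            this.label = label;
            this.otherVertexId = otherVertexId;
        }
    }

    /** Hypothetical vertex as seen in one hop: its own properties and edges. */
    static final class Vertex {
        final long id;
        final Map<String, String> properties;
        final List<Edge> edges;

        Vertex(long id, Map<String, String> properties, List<Edge> edges) {
            this.id = id;
            this.properties = properties;
            this.edges = edges;
        }
    }

    /** Answerable in one hop: everything needed is on the vertex itself. */
    static long countVerticesWithProperty(List<Vertex> vertices, String key) {
        return vertices.stream()
                .filter(v -> v.properties.containsKey(key))
                .count();
    }

    /** Also answerable: edges whose near vertex has the property. */
    static long countEdgesOfVerticesWithProperty(List<Vertex> vertices, String key) {
        return vertices.stream()
                .filter(v -> v.properties.containsKey(key))
                .mapToLong(v -> v.edges.size())
                .sum();
    }

    /** Tiny made-up graph used only for illustration. */
    static List<Vertex> sampleGraph() {
        return Arrays.asList(
                new Vertex(1L, Collections.singletonMap("first_name", "Ada"),
                        Arrays.asList(new Edge("knows", 2L), new Edge("knows", 3L))),
                new Vertex(2L, Collections.singletonMap("first_name", "Bob"),
                        Collections.<Edge>emptyList()),
                new Vertex(3L, Collections.<String, String>emptyMap(),
                        Arrays.asList(new Edge("knows", 1L))));
    }

    public static void main(String[] args) {
        List<Vertex> graph = sampleGraph();
        System.out.println(countVerticesWithProperty(graph, "first_name"));
        System.out.println(countEdgesOfVerticesWithProperty(graph, "first_name"));
    }
}
```

A filter on the far vertex's properties cannot be written here, because an Edge carries only otherVertexId. Answering that kind of question would need a separate join on vertex IDs, which is outside what a single pass provides.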

You can run Mizo on a working HBase cluster, but HBase regularly performs compactions, which change and delete HFiles. Because Mizo does not lock the HFiles, it can miss data if an HFile is removed (it skips the file and moves on to the next). The best practice is to run Mizo on an idle HBase cluster.
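
If running on an idle cluster is not possible, one way to at least notice that a compaction interfered is to record the HFile names before and after a run and compare the two listings. This is not something Mizo does itself; the sketch below is a hypothetical helper, and it leaves out how the listings are obtained.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class HFileSnapshotCheck {

    /** Returns the files present before the run but gone afterwards, sorted by name. */
    static Set<String> removedDuringRun(Set<String> before, Set<String> after) {
        Set<String> removed = new TreeSet<>(before);
        removed.removeAll(after);
        return removed;
    }

    public static void main(String[] args) {
        // Hypothetical file names; real ones would come from listing the region directories.
        Set<String> before = new TreeSet<>(Arrays.asList("hfile-a", "hfile-b", "hfile-c"));
        Set<String> after = new TreeSet<>(Arrays.asList("hfile-a", "hfile-d"));

        Set<String> removed = removedDuringRun(before, after);
        if (removed.isEmpty()) {
            System.out.println("No HFiles disappeared during the run.");
        } else {
            System.out.println("HFiles removed during the run (possible data misses): " + removed);
        }
    }
}
```

A non-empty result does not prove that data was missed, since a file may have been fully read before it was removed, but it is a signal that the run is worth repeating.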

RDDs and customization

Mizo supports different levels of customization. By default, it parses every vertex and edge. More precisely, Titan's internal data structure keeps each edge twice, once on the 'in' vertex and once on the 'out' vertex, so Mizo returns each edge twice (once when parsing the HBase region containing the 'in' vertex, and again when parsing the region containing the 'out' vertex). You can prevent this by customizing Mizo to parse only in edges or only out edges.
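
The effect of the double storage on counts can be shown with a small model. The classes below are hypothetical and do not mirror Mizo's internals; they only represent the fact that every logical edge has one stored copy per endpoint.

```java
import java.util.ArrayList;
import java.util.List;

class EdgeDirectionSketch {

    enum Direction { IN, OUT }

    /** Hypothetical stored entry: one edge as recorded on one of its two vertices. */
    static final class StoredEdge {
        final long edgeId;
        final Direction direction;

        StoredEdge(long edgeId, Direction direction) {
            this.edgeId = edgeId;
            this.direction = direction;
        }
    }

    /** Builds both stored copies for each logical edge ID. */
    static List<StoredEdge> storeTwice(long... edgeIds) {
        List<StoredEdge> stored = new ArrayList<>();
        for (long id : edgeIds) {
            stored.add(new StoredEdge(id, Direction.OUT)); // copy kept on the 'out' vertex
            stored.add(new StoredEdge(id, Direction.IN));  // copy kept on the 'in' vertex
        }
        return stored;
    }

    /** Counts only the copies stored in the given direction. */
    static long countWithDirection(List<StoredEdge> stored, Direction wanted) {
        return stored.stream().filter(e -> e.direction == wanted).count();
    }

    public static void main(String[] args) {
        List<StoredEdge> stored = storeTwice(10L, 11L, 12L);
        System.out.println("All stored copies: " + stored.size());
        System.out.println("OUT copies only: " + countWithDirection(stored, Direction.OUT));
    }
}
```

Three logical edges produce six stored copies, and keeping a single direction brings the count back to three. This is why the counting examples under Getting Started pass parseInEdges(v -> false).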

Mizo exposes two types of RDDs:

  • MizoVerticesRDD is an RDD of vertices, with their in and/or out edges and their properties. It is much heavier in terms of memory, because each vertex returned also contains a collection of edges.
  • MizoEdgesRDD is an RDD of edges, each with the vertex it originated from. An edge is much lighter in terms of memory usage: each edge contains a vertex, but that vertex does not contain a list of edges (more accurately, the list is always null), so there are no heavy lists to keep in memory.

Build

Use Maven to build Mizo:

cd mizo/
mvn -DskipTests=true package

Getting Started

Using Mizo to count the edges of a graph:

import mizo.rdd.MizoBuilder;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;

public class MizoEdgesCounter {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("Mizo Edges Counter")
                .setMaster("local[1]")
                .set("spark.executor.memory", "4g")
                .set("spark.executor.cores", "1")
                .set("spark.rpc.askTimeout", "1000000")
                .set("spark.rpc.frameSize", "1000000")
                .set("spark.network.timeout", "1000000")
                .set("spark.rdd.compress", "true")
                .set("spark.core.connection.ack.wait.timeout", "6000")
                .set("spark.driver.maxResultSize", "100m")
                .set("spark.task.maxFailures", "20")
                .set("spark.shuffle.io.maxRetries", "20");

        SparkContext sc = new SparkContext(conf);

        long count = new MizoBuilder()
                .titanConfigPath("titan-graph.properties")
                .regionDirectoriesPath("hdfs://my-graph/*/e")
                .parseInEdges(v -> false)
                .edgesRDD(sc)
                .toJavaRDD()
                .count();

        System.out.println("Edges count is: " + count);
    }
}

Using Mizo to count the vertices of a graph:

import mizo.rdd.MizoBuilder;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;

public class MizoVerticesCounter {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("Mizo Vertices Counter")
                .setMaster("local[1]")
                .set("spark.executor.memory", "4g")
                .set("spark.executor.cores", "1")
                .set("spark.rpc.askTimeout", "1000000")
                .set("spark.rpc.frameSize", "1000000")
                .set("spark.network.timeout", "1000000")
                .set("spark.rdd.compress", "true")
                .set("spark.core.connection.ack.wait.timeout", "6000")
                .set("spark.driver.maxResultSize", "100m")
                .set("spark.task.maxFailures", "20")
                .set("spark.shuffle.io.maxRetries", "20");

        SparkContext sc = new SparkContext(conf);

        long count = new MizoBuilder()
                .titanConfigPath("titan-graph.properties")
                .regionDirectoriesPath("hdfs://my-graph/*/e")
                .parseInEdges(v -> false)
                .verticesRDD(sc)
                .toJavaRDD()
                .count();

        System.out.println("Vertices count is: " + count);
    }
}