
bomeng / Heracles

License: Apache-2.0
High performance HBase / Spark SQL engine

Programming Languages

Scala

Projects that are alternatives to or similar to Heracles

BigData-News
A real-time big data news-site system project based on Spark 2.2
Stars: ✭ 36 (+33.33%)
Mutual labels:  spark, hbase
Bigdata Interview
🎯 🌟 [Big data interview questions] Big-data interview questions collected from around the web, together with the author's own answer summaries. Currently covers the Hadoop/Hive/Spark/Flink/HBase/Kafka/ZooKeeper frameworks.
Stars: ✭ 857 (+3074.07%)
Mutual labels:  spark, hbase
bigdata-fun
A complete (distributed) BigData stack, running in containers
Stars: ✭ 14 (-48.15%)
Mutual labels:  spark, hbase
Gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (+700%)
Mutual labels:  spark, hbase
God Of Bigdata
Focused on big data learning and interviews; the road to big data mastery starts here. Flink/Spark/Hadoop/HBase/Hive...
Stars: ✭ 6,008 (+22151.85%)
Mutual labels:  spark, hbase
swordfish
An open-source distributed workflow scheduling tool that also supports streaming tasks.
Stars: ✭ 35 (+29.63%)
Mutual labels:  spark, hbase
Spark Hbase Connector
Connect Spark to HBase for reading and writing data with ease
Stars: ✭ 299 (+1007.41%)
Mutual labels:  spark, hbase
Gaffer
A large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (+5981.48%)
Mutual labels:  spark, hbase
Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu
Stars: ✭ 847 (+3037.04%)
Mutual labels:  spark, hbase
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (+1403.7%)
Mutual labels:  spark, hbase
Sparkstreaming
💥 🚀 Wraps Spark Streaming to adjust batch time dynamically (computing as soon as data arrives); 🚀 supports adding and removing topics at runtime; 🚀 wraps Spark Streaming 1.6 with Kafka 0.10 to support SSL.
Stars: ✭ 179 (+562.96%)
Mutual labels:  spark, hbase
Bigdataguide
Big data learning from scratch, including videos for each learning stage and interview materials.
Stars: ✭ 817 (+2925.93%)
Mutual labels:  spark, hbase
Bigdata docker
Big Data Ecosystem Docker
Stars: ✭ 161 (+496.3%)
Mutual labels:  spark, hbase
yuzhouwan
Code Library for My Blog
Stars: ✭ 39 (+44.44%)
Mutual labels:  spark, hbase
Technology Talk
A digest of common frameworks and open-source middleware in the Java ecosystem, plus system architecture, databases, architecture case studies from large companies, common third-party libraries, project management, production troubleshooting, personal growth, and reflections.
Stars: ✭ 12,136 (+44848.15%)
Mutual labels:  spark, hbase
Hbase Rdd
Spark RDD to read, write and delete from HBase
Stars: ✭ 277 (+925.93%)
Mutual labels:  spark, hbase
Python Bigdata
Data science and Big Data with Python
Stars: ✭ 112 (+314.81%)
Mutual labels:  spark, hbase
Spring Boot Quick
🌿 Quick-start examples based on Spring Boot, integrating open-source frameworks the author has encountered, such as RabbitMQ (delay queues), Kafka, JPA, Redis, OAuth2, Swagger, JSP, Docker, Spring Batch, exception handling, logging, multi-module development, multi-environment packaging, caching, crawlers, JWT, GraphQL, Dubbo, ZooKeeper, Async, and more 📌
Stars: ✭ 1,819 (+6637.04%)
Mutual labels:  spark, hbase
Wedatasphere
WeDataSphere is a financial-grade, one-stop open-source suite for big data platforms. Currently the source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!
Stars: ✭ 372 (+1277.78%)
Mutual labels:  spark, hbase
Bdp Dataplatform
A data platform for the big data ecosystem: a big data solution built on big data, data platforms, microservices, machine learning, e-commerce, automated operations, DevOps, container deployment, and data platform ingestion, storage, computation, development, and application building.
Stars: ✭ 456 (+1588.89%)
Mutual labels:  spark, hbase

Heracles: Fast SQL on HBase using SparkSQL

Note: The project was originally named "HSpark"; it was renamed to the current name due to trademark concerns raised by the Apache Foundation.

Apache HBase is a distributed key-value store on top of HDFS. It is modeled after Google's Bigtable and provides APIs to query the data. Data is organized, partitioned, and distributed by its "row keys". Within each partition, the data is further physically grouped by "column families", which hold collections of "columns". The data model is designed for wide, sparse tables, where columns are dynamic and the data may well be sparse.
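To make this model concrete, the sketch below writes a single cell through the standard HBase client API. The table name "users", the column family "profile", and all values are assumptions for illustration only, not part of this project:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("users"))
    // The row key ("user42") determines how the row is partitioned and located.
    val put = new Put(Bytes.toBytes("user42"))
    // Columns are dynamic: any qualifier may be written under an existing column family.
    put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Ada"))
    table.put(put)
    table.close()
    connection.close()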

Although HBase is a very useful big data store, its access mechanisms are primitive: client-side APIs, Map/Reduce interfaces, and interactive shells. SQL access to HBase data is available either through Map/Reduce-based mechanisms such as Apache Hive and Impala, or through "native" SQL technologies like Apache Phoenix. The former are usually cheaper to implement and use, but their latency and efficiency often cannot compare favorably with the latter, making them suitable mainly for offline analysis. The latter category, in contrast, usually performs better and qualifies more as online engines; these tend to be built on top of purpose-built execution engines.

Currently, Spark supports queries against HBase data through HBase's Map/Reduce interface (i.e., TableInputFormat). Spark SQL supports the use of Hive data, which in theory should support HBase data access out of the box through HBase's Map/Reduce interface, and therefore falls into the first category of "SQL on HBase" technologies.
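For illustration, a typical scan of an HBase table from Spark through TableInputFormat looks like the following sketch. It assumes an existing SparkContext "sc" (e.g., inside spark-shell), and the table name is hypothetical:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "users")
    // Each record is (row key, full row) as produced by HBase's Map/Reduce input format.
    val rdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(s"rows scanned: ${rdd.count()}")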

We believe that, as a unified big data processing engine, Spark is in a good position to provide better HBase support.

Online Documentation

Online documentation is in the doc folder.

Requirements

Version 2.2.0 of Heracles requires Spark 2.2.0.

Building Spark HBase

Spark HBase is built using Apache Maven.

I. Clone the Heracles project from GitHub

$ git clone https://github.com/bomeng/Heracles.git

or

$ git clone [email protected]:bomeng/Heracles.git

II. Go to the root of the source tree

$ cd Heracles

III. Build the project

Build without testing:

$ mvn -DskipTests clean install 

Or, build with testing; this will run the test suites against an HBase minicluster:

$ mvn clean install

Coprocessor

Currently, HBase coprocessors are not supported in this release.

Interactive Scala Shell

The shell will connect to a local HBase master. You need to configure HBase's hbase-env.sh file (under the "conf" folder) by adding hspark.jar to its classpath:

export HBASE_CLASSPATH=<path_to_hspark>/hspark-2.2.0.jar

You may need to set JAVA_HOME in hbase-env.sh as well. Follow the HBase configuration instructions to set up its other settings properly (e.g., hbase-site.xml; a minimal sketch follows the command below). After that, you can start HBase with the following command:

start-hbase.sh
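As an illustration only, a minimal hbase-site.xml for a local standalone setup might look like the following sketch; the paths are placeholders to adjust for your environment:

    <configuration>
      <!-- Where HBase stores its data; a local path like this is for standalone testing only. -->
      <property>
        <name>hbase.rootdir</name>
        <value>file:///tmp/hbase</value>
      </property>
      <!-- Where the embedded ZooKeeper keeps its data. -->
      <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/tmp/zookeeper</value>
      </property>
    </configuration>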

Then, the easiest way to start using Spark HBase is through the Scala shell:

./bin/hbase-sql
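Once the shell is up, ordinary Spark SQL queries can be issued against tables that have been mapped to HBase (the exact mapping DDL is described in the online documentation). The table and column names below are illustrative only:

    SELECT teacher_name, teacher_age FROM teachers WHERE grade = 1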

Python Shell

First, add the spark-hbase jar to the SPARK_CLASSPATH in the $SPARK_HOME/conf directory, as follows:

SPARK_CLASSPATH=$SPARK_CLASSPATH:/spark-hbase-root-dir/target/Heracles-2.2.0.jar

Then go to the spark-hbase installation directory and issue:

./bin/pyspark-hbase

On success, you will see a message like the following:

   You are using Heracles !!!
   HBaseSQLContext available as hsqlContext.
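
From there, queries can be issued through hsqlContext. A minimal hypothetical example, assuming a table has already been mapped (the table name is illustrative):

    >>> hsqlContext.sql("SELECT count(*) FROM teachers").show()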

To run a Python script, the PYTHONPATH environment variable should be set to the "python" directory of the Spark-HBase installation. For example:

export PYTHONPATH=/root-of-Heracles/python

Note that the shell commands are not included in the Zip file of the Spark release; for version 2.2.0 they are intended for developers' use only. Instead, users can start the SQL shell with:

$SPARK_HOME/bin/spark-shell --packages Heracles/Heracles:2.2.0

or the Python shell with:

$SPARK_HOME/bin/pyspark --packages Heracles/Heracles:2.2.0

Running Tests

Testing first requires building Spark HBase. Once Spark HBase is built, tests can be run as follows.

Run all test suites from Maven:

mvn -Phbase,hadoop-2.4 test

Run a single test suite from Maven, for example:

mvn -Phbase,hadoop-2.4 test -DwildcardSuites=org.apache.spark.sql.hbase.BasicQueriesSuite

IDE Setup

We use IntelliJ IDEA for Spark HBase development. You can get the community edition for free and install the JetBrains Scala plugin from Preferences > Plugins.

To import the current Spark HBase project for IntelliJ:

  1. Download IntelliJ and install the Scala plug-in for IntelliJ. You may also need to install the Maven plug-in for IntelliJ.
  2. Go to "File -> Import Project", locate the Spark HBase source directory, and select "Maven Project".
  3. In the Import Wizard, select "Import Maven projects automatically" and leave the other settings at their defaults.
  4. Make sure the required profiles are enabled: select the corresponding Hadoop version, "maven3", and "hbase" in order to pull in the right dependencies.
  5. Leave the other settings at their defaults, and you should be able to start development.
  6. When you run the Scala tests, you may occasionally get an out-of-memory exception. You can increase the VM memory with a setting such as:
-XX:MaxPermSize=512m -Xmx3072m

You can also make these settings the default via "Defaults -> ScalaTest".

Configuration

Please refer to the Configuration guide in the online documentation for an overview on how to configure Spark.

For HBase 1.2, higher "open files" and "max user processes" ulimit values are recommended. A typical value is 65536 (64K).
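For example, on Linux both limits can be raised for the current shell before starting HBase (persistent settings usually belong in /etc/security/limits.conf; the value below matches the recommendation above):

    $ ulimit -n 65536    # open files
    $ ulimit -u 65536    # max user processes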
