
spoddutur / Cloud Based Sql Engine Using Spark

Cloud-based SQL engine using Apache Spark, where data is accessible as a JDBC/ODBC data source via the Spark Thrift Server.

Programming Languages

java

Projects that are alternatives to or similar to Cloud Based Sql Engine Using Spark

Flintrock
A command-line tool for launching Apache Spark clusters.
Stars: ✭ 568 (+1793.33%)
Mutual labels:  apache-spark
Mycat2
MySQL proxy using Java NIO, based on sharding SQL and Calcite; simple and fast
Stars: ✭ 750 (+2400%)
Mutual labels:  jdbc
Live log analyzer spark
Spark application for analyzing Apache access logs and detecting anomalies, along with a Medium article.
Stars: ✭ 14 (-53.33%)
Mutual labels:  apache-spark
Hibernate Springboot
Collection of best practices for Java persistence performance in Spring Boot applications
Stars: ✭ 589 (+1863.33%)
Mutual labels:  jdbc
Hasor
Hasor is a Java-based development framework. Unlike other frameworks, it has its own complete ecosystem while still integrating smoothly with existing technology stacks. It includes an IoC/AOP container framework, a web framework, a JDBC framework, the RSF distributed RPC framework, the DataQL engine, and more.
Stars: ✭ 713 (+2276.67%)
Mutual labels:  jdbc
Goodreads etl pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Stars: ✭ 793 (+2543.33%)
Mutual labels:  apache-spark
Openscoring
REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models
Stars: ✭ 536 (+1686.67%)
Mutual labels:  apache-spark
Datahacksummit 2017
Apache Zeppelin notebooks for Recommendation Engines using Keras and Machine Learning on Apache Spark
Stars: ✭ 30 (+0%)
Mutual labels:  apache-spark
Kafka Storm Starter
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
Stars: ✭ 728 (+2326.67%)
Mutual labels:  apache-spark
Mobius
C# and F# language binding and extensions to Apache Spark
Stars: ✭ 929 (+2996.67%)
Mutual labels:  apache-spark
Dist Keras
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
Stars: ✭ 613 (+1943.33%)
Mutual labels:  apache-spark
Kafka Connect Jdbc
Kafka Connect connector for JDBC-compatible databases
Stars: ✭ 698 (+2226.67%)
Mutual labels:  jdbc
Myjdbc Rainbow
JPA-style lightweight ORM API for mapping objects to databases
Stars: ✭ 23 (-23.33%)
Mutual labels:  jdbc
Jailer
Database Subsetting and Relational Data Browsing Tool.
Stars: ✭ 576 (+1820%)
Mutual labels:  jdbc
Spark Streaming Monitoring With Lightning
Plot live stats as graphs from an Apache Spark application using Lightning-viz
Stars: ✭ 15 (-50%)
Mutual labels:  apache-spark
Streaming Readings
Papers and readings on streaming systems
Stars: ✭ 554 (+1746.67%)
Mutual labels:  apache-spark
Sparklyr
R interface for Apache Spark
Stars: ✭ 775 (+2483.33%)
Mutual labels:  apache-spark
Spark Flamegraph
Easy CPU Profiling for Apache Spark applications
Stars: ✭ 30 (+0%)
Mutual labels:  apache-spark
Spark
Apache Spark - A unified analytics engine for large-scale data processing
Stars: ✭ 31,618 (+105293.33%)
Mutual labels:  jdbc
Pgjdbc
PostgreSQL JDBC Driver
Stars: ✭ 925 (+2983.33%)
Mutual labels:  jdbc

Spark as Cloud-Based SQL Engine

This project shows how to use Apache Spark as a cloud-based SQL engine and expose your big data as a JDBC/ODBC data source via the Spark Thrift Server.

1. Central Idea

Traditional relational database engines hit scalability limits with big data, and a couple of SQL-on-Hadoop frameworks evolved in response: Hive, Cloudera Impala, Presto, etc. These frameworks are essentially cloud-based solutions, and they all come with their own advantages and limitations. This project demos how Spark SQL comes across as one more SQL-on-Hadoop framework.

2. Architecture

The following picture illustrates how Apache Spark can be used as a SQL-on-Hadoop framework to serve your big data as a JDBC/ODBC data source via the Spark Thrift Server:

  • Data from multiple sources can be pushed into Spark and then exposed as SQL tables.
  • These tables are then made accessible as a JDBC/ODBC data source via the Spark Thrift Server.
  • Multiple clients like the Beeline CLI, JDBC/ODBC applications, or BI tools like Tableau connect to the Spark Thrift Server.
  • Once a connection is established, the Thrift Server contacts the Spark SQL engine to access Hive or Spark temp tables and run the SQL queries on the Spark framework.
  • The Spark Thrift Server works much like HiveServer2's Thrift interface, with one key difference: HiveServer2 submits SQL queries as Hive MapReduce jobs, whereas the Spark Thrift Server runs them on the Spark SQL engine, using the full capabilities of Spark. A minimal sketch of this server-side flow is given below.
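
To make the flow concrete, here is a minimal Scala sketch of the server side. It is not the repo's MainApp.scala itself: it assumes the sample data/input.json from the data folder and the records table registered in section 4.1, that the spark-hive-thriftserver module is on the classpath, and the object name SqlEngineSketch is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object SqlEngineSketch {

  def main(args: Array[String]): Unit = {
    // Start a SparkSession with Hive support so the Thrift Server
    // can see the tables we register.
    val spark = SparkSession.builder()
      .appName("cloud-based-sql-engine-using-spark")
      .enableHiveSupport()
      .getOrCreate()

    // Push data from any Spark-readable source and expose it as a SQL table.
    val df = spark.read.json("data/input.json")
    df.createOrReplaceTempView("records")

    // Start the Thrift Server inside this application so that JDBC/ODBC
    // clients (Beeline, Tableau, ...) can query the registered view.
    HiveThriftServer2.startWithContext(spark.sqlContext)

    // Keep the session (and hence the temp view) alive.
    while (true) Thread.sleep(10000)
  }
}
```

Starting the Thrift Server with startWithContext is what lets clients see temp views registered in the same SparkSession.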

To know more about this topic, please refer to my blog here, where I describe the concept in detail.

3. Structure of the project:

  • data: Contains the input JSON used in MainApp to register sample data with Spark SQL.
  • src/main/java/MainApp.scala: Spark 2.1 implementation that starts a SparkSession and registers data from input.json with Spark SQL. (To keep the Spark session alive, it runs a continuous while-loop.)
  • src/test/java/TestThriftClient.java: Java class that demos how to connect to the Thrift Server as a JDBC source and query the registered data.

4. How to run this project?

This project demos two things:

  • 4.1. How to register data with Spark SQL
  • 4.2. How to query the registered data via the Spark Thrift Server, using Beeline and JDBC

4.1 How to register data with Spark SQL

  • Download this project.
  • Build it: mvn clean install
  • Run MainApp: spark-submit --class MainApp cloud-based-sql-engine-using-spark.jar. That's it!
  • It will register some sample data in a records table with Spark SQL.

4.2 How to query registered data via the Spark Thrift Server using Beeline and JDBC?

For this, first connect to the Spark Thrift Server. Once the connection is established, you can, just like with HiveServer2, access Hive or Spark temp tables and run SQL queries on the Spark framework. I'll show two ways to do this:

  1. Beeline: Perhaps the simplest way is to use the beeline command-line tool provided in Spark's bin folder.

`$> beeline`
Beeline version 2.1.1-amzn-0 by Apache Hive

// Connect to the Spark thrift server..
`beeline> !connect jdbc:hive2://localhost:10000`
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000:
Enter password for jdbc:hive2://localhost:10000:

// Run your SQL queries and access the data..
`jdbc:hive2://localhost:10000> show tables;`

  2. Java JDBC: Please refer to this project's test folder, where I've shared a Java example, TestThriftClient.java, that demos the same; a minimal sketch of such a client follows below.
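
For completeness, here is a minimal JDBC client sketch in Scala (the repo's own example is the Java class TestThriftClient.java). It assumes the Thrift Server from section 4.1 is listening on localhost:10000 with the records table registered, that the hive-jdbc driver is on the classpath, and that ThriftClientSketch is a made-up name for illustration.

```scala
import java.sql.DriverManager

object ThriftClientSketch {

  def main(args: Array[String]): Unit = {
    // The Hive JDBC driver handles the jdbc:hive2:// protocol
    // (the same endpoint Beeline connected to above).
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Empty username/password, as in the Beeline session above.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "", "")
    try {
      val stmt = conn.createStatement()
      val rs   = stmt.executeQuery("SELECT * FROM records")
      val cols = rs.getMetaData.getColumnCount

      // Print each row as a comma-separated line.
      while (rs.next()) {
        println((1 to cols).map(i => rs.getString(i)).mkString(", "))
      }
    } finally {
      conn.close() // also closes the statement and result set
    }
  }
}
```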

5. Requirements

  • Spark 2.1.0, Java 1.8 and Scala 2.11

6. References:

  • A complete guide and references for this project are covered in my blog here.