
ssavvides / tpch-spark

Licence: MIT License
TPC-H queries in Apache Spark SQL using the native DataFrames API

Programming Languages

c
50402 projects - #5 most used programming language
scala
5932 projects
objective c
16641 projects - #2 most used programming language
shell
77523 projects
Makefile
30231 projects
C++
36643 projects - #6 most used programming language

Projects that are alternatives of or similar to tpch-spark

Spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Stars: ✭ 1,721 (+2631.75%)
Mutual labels:  spark, tpch
BinKit
Binary Code Similarity Analysis (BCSA) Benchmark
Stars: ✭ 54 (-14.29%)
Mutual labels:  benchmark
spark-kubernetes
spark on kubernetes
Stars: ✭ 80 (+26.98%)
Mutual labels:  spark
best
🏆 Delightful Benchmarking & Performance Testing
Stars: ✭ 73 (+15.87%)
Mutual labels:  benchmark
micro bench
⏰ Dead simple, non intrusive, realtime benchmarks
Stars: ✭ 13 (-79.37%)
Mutual labels:  benchmark
sentry-spark
Apache Spark Sentry Integration
Stars: ✭ 14 (-77.78%)
Mutual labels:  spark
Search Ads Web Service
Online search advertisement platform & Realtime Campaign Monitoring [Maybe Deprecated]
Stars: ✭ 30 (-52.38%)
Mutual labels:  spark
frovedis
Framework of vectorized and distributed data analytics
Stars: ✭ 59 (-6.35%)
Mutual labels:  spark
leaflet heatmap
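A simple visualization of Huzhou phone call data. Assuming the data volume is too large to draw the heatmap directly in the browser, the heatmap rendering step is moved offline for computation and analysis. After the data is computed in parallel with Apache Spark, the heatmap is also drawn with Apache Spark, and leafletjs then loads the OpenStreetMap layer and the heatmap layer for good interactivity. The rendering currently uses Apache Spark; either Apache Spark is not well suited to this kind of computation or I did not design the algorithm well, because the parallel computation is slower than single-machine computation. The Apache Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .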
Stars: ✭ 13 (-79.37%)
Mutual labels:  spark
Python Master Courses
Life is short, I use Python
Stars: ✭ 61 (-3.17%)
Mutual labels:  spark
spark-acid
ACID Data Source for Apache Spark based on Hive ACID
Stars: ✭ 91 (+44.44%)
Mutual labels:  spark
spark-word2vec
A parallel implementation of word2vec based on Spark
Stars: ✭ 24 (-61.9%)
Mutual labels:  spark
docker-spark
Apache Spark docker container image (Standalone mode)
Stars: ✭ 34 (-46.03%)
Mutual labels:  spark
kubernetes-iperf3
Simple wrapper around iperf3 to measure network bandwidth from all nodes of a Kubernetes cluster
Stars: ✭ 80 (+26.98%)
Mutual labels:  benchmark
BigData-News
A real-time big data system project for a news site, based on Spark 2.2
Stars: ✭ 36 (-42.86%)
Mutual labels:  spark
shamash
Autoscaling for Google Cloud Dataproc
Stars: ✭ 31 (-50.79%)
Mutual labels:  spark
spark-sql-flow-plugin
Visualize column-level data lineage in Spark SQL
Stars: ✭ 20 (-68.25%)
Mutual labels:  spark
KLUE
📖 Korean NLU Benchmark
Stars: ✭ 420 (+566.67%)
Mutual labels:  benchmark
incubator-linkis
Linkis helps easily connect to various back-end computation/storage engines(Spark, Python, TiDB...), exposes various interfaces(REST, JDBC, Java ...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,459 (+3803.17%)
Mutual labels:  spark
Spark-PMoF
Spark Shuffle Optimization with RDMA+AEP
Stars: ✭ 28 (-55.56%)
Mutual labels:  spark

tpch-spark

TPC-H queries implemented in Spark using the DataFrames API. Tested with Spark 2.4.0.

Savvas Savvides

[email protected]

Generating tables

In the dbgen directory, run:

make

This should produce an executable called dbgen. Running

./dbgen -h

lists the available options for generating the tables. The simplest case is:

./dbgen

which generates .tbl files at scale factor 1 (the default), roughly 1 GB in total across all tables. For other sizes, use the -s option. For example,

./dbgen -s 10

generates roughly 10 GB of input data.
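Before moving on, it can help to confirm that dbgen wrote every table. The sketch below assumes the standard eight TPC-H table names and that the .tbl files sit in a single directory; the function name missing_tables is made up for illustration and is not part of this repository.

```shell
# The eight TPC-H tables; dbgen writes each one as <name>.tbl.
TPCH_TABLES="customer lineitem nation orders part partsupp region supplier"

# missing_tables DIR (hypothetical helper): print each expected .tbl file
# that is absent or empty in DIR, one per line. Returns non-zero if any
# table is missing.
missing_tables() {
  mt_dir="$1"
  mt_status=0
  for mt_t in $TPCH_TABLES; do
    if [ ! -s "$mt_dir/$mt_t.tbl" ]; then
      echo "$mt_t.tbl"
      mt_status=1
    fi
  done
  return "$mt_status"
}
```

For example, `missing_tables . && echo "all tables present"` run from the dbgen directory prints the confirmation only when all eight files exist and are non-empty.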

You can then either upload the data to HDFS or read it from the local filesystem.
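If you choose HDFS, the upload could look like the sketch below. It uses the standard `hdfs dfs -mkdir -p` and `hdfs dfs -put` commands; the helper name upload_tables, the DRY_RUN switch, and the destination path in the example are assumptions for illustration, not something this repository provides.

```shell
# upload_tables SRC_DIR HDFS_DIR (hypothetical helper): copy every .tbl
# file from SRC_DIR into HDFS_DIR, creating HDFS_DIR if needed.
# With DRY_RUN=1 the hdfs commands are printed instead of executed.
upload_tables() {
  ut_src="$1"
  ut_dest="$2"
  ut_run=""
  if [ "${DRY_RUN:-0}" = "1" ]; then
    ut_run="echo"
  fi
  $ut_run hdfs dfs -mkdir -p "$ut_dest" || return 1
  $ut_run hdfs dfs -put "$ut_src"/*.tbl "$ut_dest"
}
```

Running `DRY_RUN=1 upload_tables ./dbgen /data/tpch` first shows the exact commands, which is a cheap way to check the paths before copying several gigabytes.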

Running

First compile using:

sbt package

Before compiling, set INPUT_DIR and OUTPUT_DIR in the TpchQuery class to point to the location of the input data and to where the output should be saved.

You can then run a query using:

spark-submit --class "main.scala.TpchQuery" --master MASTER target/scala-2.11/spark-tpc-h-queries_2.11-1.0.jar ##

where ## is the number of the query to run (1, 2, ..., 22) and MASTER specifies the Spark master, e.g. local, yarn, or a standalone cluster's master URL.
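To run the whole benchmark rather than a single query, the command above can be wrapped in a loop. This is a minimal sketch: the function names, the DRY_RUN switch, and the default of `local` for MASTER are assumptions, while the class name and jar path are copied from the command above.

```shell
# Assumed defaults; override MASTER or JAR in the environment as needed.
MASTER="${MASTER:-local}"
JAR="${JAR:-target/scala-2.11/spark-tpc-h-queries_2.11-1.0.jar}"

# run_query N (hypothetical helper): submit TPC-H query N.
# With DRY_RUN=1 the command is printed instead of executed.
run_query() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo spark-submit --class main.scala.TpchQuery --master "$MASTER" "$JAR" "$1"
  else
    spark-submit --class main.scala.TpchQuery --master "$MASTER" "$JAR" "$1"
  fi
}

# run_all_queries (hypothetical helper): run queries 1 through 22 in
# order, stopping at the first one that fails.
run_all_queries() {
  rq_n=1
  while [ "$rq_n" -le 22 ]; do
    run_query "$rq_n" || return 1
    rq_n=$((rq_n + 1))
  done
}
```

Calling `DRY_RUN=1 run_all_queries` prints the 22 spark-submit commands without starting Spark, so the master and jar path can be checked first.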

Other Implementations

  1. Data generator (http://www.tpc.org/tpch/)

  2. TPC-H for Hive (https://issues.apache.org/jira/browse/hive-600)

  3. TPC-H for PIG (https://github.com/ssavvides/tpch-pig)
