
zrlio / Parquet Generator

Licence: apache-2.0
Parquet file generator

Programming Languages

scala

Projects that are alternatives of or similar to Parquet Generator

Parquet Index
Spark SQL index for Parquet tables
Stars: ✭ 109 (+581.25%)
Mutual labels:  sql, spark, parquet
Linkis
Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB, ...), exposes various interfaces (REST, JDBC, Java, ...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,323 (+14418.75%)
Mutual labels:  sql, spark
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+837.5%)
Mutual labels:  sql, spark
Roapi
Create full-fledged APIs for static datasets without writing a single line of code.
Stars: ✭ 253 (+1481.25%)
Mutual labels:  sql, parquet
Spark Website
Apache Spark Website
Stars: ✭ 75 (+368.75%)
Mutual labels:  sql, spark
Quicksql
A Flexible, Fast, Federated(3F) SQL Analysis Middleware for Multiple Data Sources
Stars: ✭ 1,821 (+11281.25%)
Mutual labels:  sql, spark
experiments
Code examples for my blog posts
Stars: ✭ 21 (+31.25%)
Mutual labels:  spark, parquet
Schemer
Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.
Stars: ✭ 97 (+506.25%)
Mutual labels:  spark, parquet
Metorikku
A simplified, lightweight ETL Framework based on Apache Spark
Stars: ✭ 361 (+2156.25%)
Mutual labels:  sql, spark
Kyuubi
Kyuubi is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark
Stars: ✭ 363 (+2168.75%)
Mutual labels:  sql, spark
Iceberg
Iceberg is a table format for large, slow-moving tabular data
Stars: ✭ 393 (+2356.25%)
Mutual labels:  spark, parquet
Kamu Cli
Next generation tool for decentralized exchange and transformation of semi-structured data
Stars: ✭ 69 (+331.25%)
Mutual labels:  sql, spark
Spark
Apache Spark - A unified analytics engine for large-scale data processing
Stars: ✭ 31,618 (+197512.5%)
Mutual labels:  sql, spark
Gaffer
A large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (+10162.5%)
Mutual labels:  spark, parquet
Datafusion
DataFusion has now been donated to the Apache Arrow project
Stars: ✭ 611 (+3718.75%)
Mutual labels:  sql, spark
Xsql
Unified SQL Analytics Engine Based on SparkSQL
Stars: ✭ 176 (+1000%)
Mutual labels:  sql, spark
Pucket
Bucketing and partitioning system for Parquet
Stars: ✭ 29 (+81.25%)
Mutual labels:  spark, parquet
Rumble
⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Stars: ✭ 58 (+262.5%)
Mutual labels:  spark, parquet
Oap
Optimized Analytics Package for Spark* Platform
Stars: ✭ 343 (+2043.75%)
Mutual labels:  spark, parquet
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (+2437.5%)
Mutual labels:  spark, parquet

Parquet-Generator

Parquet file generator for humans

How to build

mvn -DskipTests -T 1C install

This should give you parquet-generator-1.0.jar in your target folder. To build a jar with dependencies, you can use:

mvn -DskipTests -T 1C clean compile assembly:single

How to run

./bin/spark-submit --master yarn \
--class com.ibm.crail.spark.tools.ParquetGenerator \
parquet-generator-1.0.jar [OPTIONS]

Current options are:

  usage: parquet-generator
   -a,--affix                                   affix random payload. Means that in each worker instance,
                                                the variable payload data is generated once and reused
                                                multiple times (default: false)
   -c,--case <arg>                              case class schema currently supported are:
                                                ParquetExample (default),
                                                IntWithPayload, and tpcds.
                                                These classes are in ./schema/ in src.
   -C,--compress <arg>                          <String> compression type, valid values are:
                                                uncompressed, snappy, gzip,
                                                lzo (default: uncompressed)
   -f,--format <arg>                            <String> output format type (e.g., parquet (default), csv, etc.)
   -h,--help                                    show help
   -o,--output <arg>                            <String> the output file name (default: /ParqGenOutput.parquet)
   -O,--options <arg>                           <str,str> key,value strings that will be passed to the data source of spark in
                                                writing. E.g., for parquet you may want to re-consider parquet.block.size. The
                                                default is 128MB (the HDFS block size).
   -p,--partitions <arg>                        <int> number of output partitions (default: 1)
   -r,--rows <arg>                              <long> total number of rows (default: 10)
   -R,--rangeInt <arg>                          <int> maximum int value, value for any Int column will be generated between
                                                [0,rangeInt), (default: 2147483647)
   -s,--size <arg>                              <int> any variable payload size, string or payload in IntPayload (default: 100)
   -S,--show <arg>                              <int> show <int> number of rows (default: 0, zero means do not show)
   -t,--tasks <arg>                             <int> number of tasks to generate this data (default: 1)
   -tcbp,--clusterByPartition <arg>             <int> true(1) or false(0, default), pass the int
   -tdd,--doubleForDecimal <arg>                <int> true(1) or false(0, default), pass the int
   -tdsd,--dsdgenDir <arg>                      <String> location of the dsdgen tool
   -tfon,--filterOutNullPartitionValues <arg>   <int> true(1) or false (0, default), pass the int
   -tow,--overWrite <arg>                       <int> true(1, default) or false(0), pass the int
   -tpt,--partitionTable <arg>                  <int> true(1) or false(0, default), pass the int
   -tsd,--stringForDate <arg>                   <int> true(1) or false(0, default), pass the int
   -tsf,--scaleFactor <arg>                     <Int> scaling factor (default: 1) 
   -ttf,--tableFiler <arg>                      <String> ?

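For orientation, the IntWithPayload schema listed under -c above is essentially an Int key plus a variable-size byte payload. Below is a minimal sketch of what such a schema class and a single-row generator could look like; the object, field, and function names are illustrative assumptions, and the real definitions live under ./schema/ in src.

import scala.util.Random

object IntWithPayloadSketch {
  // Hypothetical shape of the IntWithPayload schema; the actual case class and
  // its field names live under ./schema/ in the source tree.
  case class IntWithPayload(intKey: Int, payload: Array[Byte])

  // -R (rangeInt) bounds the random Int; -s (size) sets the payload size in bytes.
  def makeRow(rangeInt: Int, payloadSize: Int): IntWithPayload = {
    val payload = new Array[Byte](payloadSize)
    Random.nextBytes(payload)
    IntWithPayload(Random.nextInt(rangeInt), payload)
  }
}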
An example run would be:

./bin/spark-submit --master yarn \
--class com.ibm.crail.spark.tools.ParquetGenerator parquet-generator-1.0.jar \
-c IntWithPayload -C snappy -o /myfile.parquet -r 84 -s 42 -p 12

This will create 84 rows in total (-r 84) for the case class IntWithPayload as [Int, Array[Byte]], where each byte array is 42 bytes (-s 42), and save the result in Parquet format at /myfile.parquet split across 12 partitions (-p 12).
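To sanity-check the output, the generated file can be read back with plain Spark SQL. A minimal spark-shell sketch (not part of this project), assuming the output path from the example above:

// spark-shell: verify the file written by the example run above
val df = spark.read.parquet("/myfile.parquet")
df.printSchema()                      // expect an Int column plus a binary payload column
println(s"row count = ${df.count()}") // should match the -r value
df.show(5, truncate = false)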

How to generate the TPC-DS dataset

This is an example command that generates the dataset with a scale factor of 2, using 8 tasks but writing 2 files (or partitions), when running Spark locally. The output goes to Crail.

./bin/spark-submit -v --num-executors 2 --executor-cores 1 --executor-memory 1G --driver-memory 1G --master local \
--class com.ibm.crail.spark.tools.ParquetGenerator \
~/parquet-generator/target/parquet-generator-1.0.jar \
-c tpcds \
-o crail://localhost:9060/tpcds \
-t 8 \
-p 2 \
-tsf 2 \
-tdsd ~/tpcds-kit/tools/

Note: on a cluster, the dsdgen directory must be accessible at the same location on every machine.
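Once generated, the tables can be inspected directly from spark-shell. A minimal sketch, assuming the output URI from the command above and that each TPC-DS table (for example store_sales) ends up in its own subdirectory under that path:

// spark-shell: query one of the generated TPC-DS tables
// The crail:// URI matches the example above; use hdfs:// or file:// paths as appropriate.
val storeSales = spark.read.parquet("crail://localhost:9060/tpcds/store_sales")
storeSales.createOrReplaceTempView("store_sales")
spark.sql("SELECT COUNT(*) AS row_count FROM store_sales").show()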

Acknowledgement: The data generation logic is derived from https://github.com/databricks/spark-sql-perf

How to get and build the dsdgen-kit tool

As described here, the logic uses a slightly modified version of the original TPC-DS kit. It can be downloaded and built from https://github.com/databricks/tpcds-kit as follows:

$ git clone https://github.com/databricks/tpcds-kit.git
$ cd ./tpcds-kit/tools/
$ make OS=LINUX

Good-to-know

 17/11/13 14:10:01 12276 main INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
 Exception in thread "main" java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
	at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1053)
	at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
	at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
	at scala.Option.getOrElse(Option.scala:121)

A recent commit introduced Hive and ORC dependencies. If you run into the above problem, you probably have spark.driver.userClassPathFirst=true set. See https://issues.apache.org/jira/browse/SPARK-16680

Contributions

PRs are always welcome. Please fork the repository, make the modifications you propose, and let us know.

Contact

If you have questions or suggestions, feel free to post at:

https://groups.google.com/forum/#!forum/zrlio-users

or email: [email protected]
