
zrlio / Parquet Generator

Licence: apache-2.0
Parquet file generator

Programming Languages

scala

Projects that are alternatives of or similar to Parquet Generator

Parquet Index
Spark SQL index for Parquet tables
Stars: ✭ 109 (+581.25%)
Mutual labels:  sql, spark, parquet
Linkis
Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB, ...), exposes various interfaces (REST, JDBC, Java, ...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,323 (+14418.75%)
Mutual labels:  sql, spark
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+837.5%)
Mutual labels:  sql, spark
Roapi
Create full-fledged APIs for static datasets without writing a single line of code.
Stars: ✭ 253 (+1481.25%)
Mutual labels:  sql, parquet
Spark Website
Apache Spark Website
Stars: ✭ 75 (+368.75%)
Mutual labels:  sql, spark
Quicksql
A Flexible, Fast, Federated(3F) SQL Analysis Middleware for Multiple Data Sources
Stars: ✭ 1,821 (+11281.25%)
Mutual labels:  sql, spark
experiments
Code examples for my blog posts
Stars: ✭ 21 (+31.25%)
Mutual labels:  spark, parquet
Schemer
Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.
Stars: ✭ 97 (+506.25%)
Mutual labels:  spark, parquet
Metorikku
A simplified, lightweight ETL Framework based on Apache Spark
Stars: ✭ 361 (+2156.25%)
Mutual labels:  sql, spark
Kyuubi
Kyuubi is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark
Stars: ✭ 363 (+2168.75%)
Mutual labels:  sql, spark
Iceberg
Iceberg is a table format for large, slow-moving tabular data
Stars: ✭ 393 (+2356.25%)
Mutual labels:  spark, parquet
Kamu Cli
Next generation tool for decentralized exchange and transformation of semi-structured data
Stars: ✭ 69 (+331.25%)
Mutual labels:  sql, spark
Spark
Apache Spark - A unified analytics engine for large-scale data processing
Stars: ✭ 31,618 (+197512.5%)
Mutual labels:  sql, spark
Gaffer
A large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (+10162.5%)
Mutual labels:  spark, parquet
Datafusion
DataFusion has now been donated to the Apache Arrow project
Stars: ✭ 611 (+3718.75%)
Mutual labels:  sql, spark
Xsql
Unified SQL Analytics Engine Based on SparkSQL
Stars: ✭ 176 (+1000%)
Mutual labels:  sql, spark
Pucket
Bucketing and partitioning system for Parquet
Stars: ✭ 29 (+81.25%)
Mutual labels:  spark, parquet
Rumble
⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Stars: ✭ 58 (+262.5%)
Mutual labels:  spark, parquet
Oap
Optimized Analytics Package for Spark* Platform
Stars: ✭ 343 (+2043.75%)
Mutual labels:  spark, parquet
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (+2437.5%)
Mutual labels:  spark, parquet

Parquet-Generator

Parquet file generator for humans

How to build

mvn -DskipTests -T 1C install

This should give you parquet-generator-1.0.jar in your target folder. To build a jar with dependencies, you can use:

mvn -DskipTests -T 1C clean compile assembly:single

How to run

./bin/spark-submit --master yarn \
--class com.ibm.crail.spark.tools.ParquetGenerator \
parquet-generator-1.0.jar [OPTIONS]

Current options are:

  usage: parquet-generator
   -a,--affix                                   affix random payload. Means that in each worker instance,
                                                the variable payload data is generated once and reused
                                                multiple times (default: false)
   -c,--case <arg>                              case class schema currently supported are:
                                                ParquetExample (default),
                                                IntWithPayload, and tpcds.
                                                These classes are in ./schema/ in src.
   -C,--compress <arg>                          <String> compression type, valid values are:
                                                uncompressed, snappy, gzip,
                                                lzo (default: uncompressed)
   -f,--format <arg>                            <String> output format type (e.g., parquet (default), csv, etc.)
   -h,--help                                    show help
   -o,--output <arg>                            <String> the output file name (default: /ParqGenOutput.parquet)
   -O,--options <arg>                           <str,str> key,value strings that will be passed to the data source of spark in
                                                writing. E.g., for parquet you may want to re-consider parquet.block.size. The
                                                default is 128MB (the HDFS block size).
   -p,--partitions <arg>                        <int> number of output partitions (default: 1)
   -r,--rows <arg>                              <long> total number of rows (default: 10)
   -R,--rangeInt <arg>                          <int> maximum int value, value for any Int column will be generated between
                                                [0,rangeInt), (default: 2147483647)
   -s,--size <arg>                              <int> any variable payload size, string or payload in IntPayload (default: 100)
   -S,--show <arg>                              <int> show <int> number of rows (default: 0, zero means do not show)
   -t,--tasks <arg>                             <int> number of tasks to generate this data (default: 1)
   -tcbp,--clusterByPartition <arg>             <int> true(1) or false(0, default), pass the int
   -tdd,--doubleForDecimal <arg>                <int> true(1) or false(0, default), pass the int
   -tdsd,--dsdgenDir <arg>                      <String> location of the dsdgen tool
   -tfon,--filterOutNullPartitionValues <arg>   <int> true(1) or false (0, default), pass the int
   -tow,--overWrite <arg>                       <int> true(1, default) or false(0), pass the int
   -tpt,--partitionTable <arg>                  <int> true(1) or false(0, default), pass the int
   -tsd,--stringForDate <arg>                   <int> true(1) or false(0, default), pass the int
   -tsf,--scaleFactor <arg>                     <Int> scaling factor (default: 1) 
   -ttf,--tableFiler <arg>                      <String> ?

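For orientation, the IntWithPayload schema listed under -c above is essentially an Int key plus a variable-size byte payload. Below is a minimal sketch of what such a schema class and a single-row generator could look like; the object, field, and function names are illustrative assumptions, and the real definitions live under ./schema/ in src.

import scala.util.Random

object IntWithPayloadSketch {
  // Hypothetical shape of the IntWithPayload schema; the actual case class and
  // its field names live under ./schema/ in the source tree.
  case class IntWithPayload(intKey: Int, payload: Array[Byte])

  // -R (rangeInt) bounds the random Int; -s (size) sets the payload size in bytes.
  def makeRow(rangeInt: Int, payloadSize: Int): IntWithPayload = {
    val payload = new Array[Byte](payloadSize)
    Random.nextBytes(payload)
    IntWithPayload(Random.nextInt(rangeInt), payload)
  }
}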
An example run would be:

./bin/spark-submit --master yarn \
--class com.ibm.crail.spark.tools.ParquetGenerator parquet-generator-1.0.jar \
-c IntWithPayload -C snappy -o /myfile.parquet -r 84 -s 42 -p 12

This will create 84 rows in total (-r 84) for the case class IntWithPayload as [Int, Array[Byte]], where each byte array is 42 bytes (-s 42), and save the result in Parquet format at /myfile.parquet split across 12 partitions (-p 12).
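To sanity-check the output, the generated file can be read back with plain Spark SQL. A minimal spark-shell sketch (not part of this project), assuming the output path from the example above:

// spark-shell: verify the file written by the example run above
val df = spark.read.parquet("/myfile.parquet")
df.printSchema()                      // expect an Int column plus a binary payload column
println(s"row count = ${df.count()}") // should match the -r value
df.show(5, truncate = false)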

How to generate the TPC-DS dataset

This is an example command that generates the dataset with a scale factor of 2, using 8 tasks but writing 2 files (or partitions), when running Spark locally. The output goes to Crail.

./bin/spark-submit -v --num-executors 2 --executor-cores 1 --executor-memory 1G --driver-memory 1G --master local \
--class com.ibm.crail.spark.tools.ParquetGenerator \
~/parquet-generator/target/parquet-generator-1.0.jar \
-c tpcds \
-o crail://localhost:9060/tpcds \
-t 8 \
-p 2 \
-tsf 2 \
-tdsd ~/tpcds-kit/tools/

Note: on a cluster, the dsdgen directory must be accessible at the same location on every machine.
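Once generated, the tables can be inspected directly from spark-shell. A minimal sketch, assuming the output URI from the command above and that each TPC-DS table (for example store_sales) ends up in its own subdirectory under that path:

// spark-shell: query one of the generated TPC-DS tables
// The crail:// URI matches the example above; use hdfs:// or file:// paths as appropriate.
val storeSales = spark.read.parquet("crail://localhost:9060/tpcds/store_sales")
storeSales.createOrReplaceTempView("store_sales")
spark.sql("SELECT COUNT(*) AS row_count FROM store_sales").show()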

Acknowledgement: The data generation logic is derived from https://github.com/databricks/spark-sql-perf

How to get and build the dsdgen-kit tool

As described here, the logic uses a slightly modified version of the original TPC-DS kit. It can be downloaded and built from https://github.com/databricks/tpcds-kit as follows:

$ git clone https://github.com/databricks/tpcds-kit.git
$ cd ./tpcds-kit/tools/
$ make OS=LINUX

Good-to-know

 17/11/13 14:10:01 12276 main INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
 Exception in thread "main" java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
	at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1053)
	at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
	at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
	at scala.Option.getOrElse(Option.scala:121)

A recent commit introduced Hive and ORC dependencies. If you run into the above problem, you probably have spark.driver.userClassPathFirst=true set. See https://issues.apache.org/jira/browse/SPARK-16680

Contributions

PRs are always welcome. Please fork the repository, make the modifications you propose, and let us know.

Contact

If you have questions or suggestions, feel free to post at:

https://groups.google.com/forum/#!forum/zrlio-users

or email: [email protected]
