Parquet MR

Parquet-MR contains the Java implementation of the Parquet format. Parquet is a columnar storage format for Hadoop; it provides efficient storage and encoding of data. Parquet uses the record shredding and assembly algorithm described in the Dremel paper to represent nested structures.

You can find some details about the format and intended use cases in our Hadoop Summit 2013 presentation.

Building

Parquet-MR uses Maven to build and depends on the Thrift compiler (protoc is now managed by a Maven plugin).

Install Thrift

To build and install the Thrift compiler, run:

wget -nv http://archive.apache.org/dist/thrift/0.13.0/thrift-0.13.0.tar.gz
tar xzf thrift-0.13.0.tar.gz
cd thrift-0.13.0
chmod +x ./configure
./configure --disable-libs
sudo make install

If you're on macOS and use Homebrew, you can instead install Thrift 0.13.0 with brew and ensure that it comes first in your PATH:

brew install thrift@0.13.0
export PATH="/usr/local/opt/thrift@0.13.0/bin:$PATH"

Build Parquet with Maven

Once Thrift is available in your PATH (protoc is fetched by the Maven plugin), you can build the project by running:

LC_ALL=C mvn clean install

Features

Parquet is a very active project, and new features are being added quickly. Here are a few features:

Type-specific encoding
Hive integration (deprecated)
Pig integration
Cascading integration
Crunch integration
Apache Arrow integration
Twitter Scrooge integration
Impala integration (non-nested)
Java Map/Reduce API
Native Avro support
Native Thrift support
Native Protocol Buffers support
Complex structure support
Run-length encoding (RLE)
Bit Packing
Adaptive dictionary encoding
Predicate pushdown
Column stats
Delta encoding
Index pages
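
To give a feel for one of the encodings listed above, here is a minimal sketch of plain run-length encoding in Java. It is only an illustration: Parquet's actual format uses a hybrid of RLE and bit packing, and the class and method names below (RunLengthSketch, encode, decode) are hypothetical and not part of Parquet-MR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: simple run-length encoding, not Parquet's on-disk layout.
final class RunLengthSketch {
  // Collapses consecutive equal values into {value, runLength} pairs.
  static List<int[]> encode(int[] values) {
    List<int[]> runs = new ArrayList<>();
    int i = 0;
    while (i < values.length) {
      int value = values[i];
      int length = 1;
      while (i + length < values.length && values[i + length] == value) {
        length++;
      }
      runs.add(new int[] {value, length});
      i += length;
    }
    return runs;
  }

  // Expands {value, runLength} pairs back into the original sequence.
  static int[] decode(List<int[]> runs) {
    int total = 0;
    for (int[] run : runs) {
      total += run[1];
    }
    int[] out = new int[total];
    int pos = 0;
    for (int[] run : runs) {
      Arrays.fill(out, pos, pos + run[1], run[0]);
      pos += run[1];
    }
    return out;
  }

  public static void main(String[] args) {
    int[] column = {4, 4, 4, 4, 0, 0, 7};
    List<int[]> runs = encode(column);
    System.out.println(runs.size() + " runs for " + column.length + " values");
    System.out.println(Arrays.equals(decode(runs), column));
  }
}
```

Columns with long stretches of repeated values shrink the most under this scheme, which is why it pairs well with columnar layouts.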

Map/Reduce integration

Parquet-MR provides Hadoop Input and Output formats. Note that to use an Input or Output format, you need to implement a WriteSupport or ReadSupport class, which implements the conversion of your objects to and from a Parquet schema.

We've implemented this for two popular data formats to provide a clean migration path as well:

Thrift

Thrift integration is provided by the parquet-thrift sub-project. If you are using Thrift through Scala, you may be using Twitter's Scrooge. If that's the case, not to worry -- we took care of the Scrooge/Apache Thrift glue for you in the parquet-scrooge sub-project.

Avro

Avro conversion is implemented via the parquet-avro sub-project.

Create your own objects

The ParquetOutputFormat can be provided a WriteSupport to write your own objects to an event-based RecordConsumer.
The ParquetInputFormat can be provided a ReadSupport to materialize your own objects by implementing a RecordMaterializer.
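
The event-based style described above can be sketched without any Parquet dependency. In the sketch below, EventConsumerSketch is a simplified stand-in written for this example and is not the real RecordConsumer class; UserSketch, UserWriteSketch, and RecordingConsumerSketch are likewise hypothetical names. The point is only to show the shape of the idea: the write side turns one object into an ordered stream of start/add/end events.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for an event-based consumer; NOT the real Parquet API.
interface EventConsumerSketch {
  void startMessage();
  void startField(String name, int index);
  void addInteger(int value);
  void addString(String value);
  void endField(String name, int index);
  void endMessage();
}

// Hypothetical user object to be written.
final class UserSketch {
  final int id;
  final String name;

  UserSketch(int id, String name) {
    this.id = id;
    this.name = name;
  }
}

// Plays the role a WriteSupport would play: one object in, a sequence of events out.
final class UserWriteSketch {
  static void write(UserSketch user, EventConsumerSketch consumer) {
    consumer.startMessage();
    consumer.startField("id", 0);
    consumer.addInteger(user.id);
    consumer.endField("id", 0);
    consumer.startField("name", 1);
    consumer.addString(user.name);
    consumer.endField("name", 1);
    consumer.endMessage();
  }

  public static void main(String[] args) {
    RecordingConsumerSketch consumer = new RecordingConsumerSketch();
    write(new UserSketch(7, "ada"), consumer);
    consumer.events.forEach(System.out::println);
  }
}

// Records each event as text so the call order can be inspected.
final class RecordingConsumerSketch implements EventConsumerSketch {
  final List<String> events = new ArrayList<>();

  public void startMessage() { events.add("startMessage"); }
  public void startField(String name, int index) { events.add("startField(" + name + "," + index + ")"); }
  public void addInteger(int value) { events.add("addInteger(" + value + ")"); }
  public void addString(String value) { events.add("addString(" + value + ")"); }
  public void endField(String name, int index) { events.add("endField(" + name + "," + index + ")"); }
  public void endMessage() { events.add("endMessage"); }
}
```

A read-side materializer would do the reverse: receive the same kind of events and assemble them back into an object.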

See the APIs:

Apache Pig integration

A Loader and a Storer are provided to read and write Parquet files with Apache Pig.

Storing data into Parquet in Pig is simple:

-- options you might want to tune
SET parquet.page.size 1048576 -- default (1 MiB); this is your minimum read/write unit
SET parquet.block.size 134217728 -- default (128 MiB); your memory budget for buffering data
SET parquet.compression lzo -- or you can use none, gzip, snappy
STORE mydata into '/some/path' USING parquet.pig.ParquetStorer;

Reading in Pig is also simple:

mydata = LOAD '/some/path' USING parquet.pig.ParquetLoader();

If the data was stored using Pig, things will "just work". If the data was stored using another method, you will need to provide the Pig schema equivalent to the data you stored (you can also write the schema to the file footer while writing it -- but that's pretty advanced). We will provide a basic automatic schema conversion soon.

Hive integration

Hive integration is provided via the parquet-hive sub-project.

Hive integration is now deprecated within the Parquet project. It is now maintained by Apache Hive.

Build

To run the unit tests: mvn test

To build the jars: mvn package

The build runs in GitHub Actions:

Add Parquet as a dependency in Maven

The current release is version 1.11.0

  <dependencies>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-common</artifactId>
      <version>1.11.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-encoding</artifactId>
      <version>1.11.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-column</artifactId>
      <version>1.11.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-hadoop</artifactId>
      <version>1.11.0</version>
    </dependency>
  </dependencies>

How To Contribute

We prefer to receive contributions in the form of GitHub pull requests. Please send pull requests against the parquet-mr Git repository. If you've previously forked Parquet from its old location, you will need to add a remote or update your origin remote to https://github.com/apache/parquet-mr.git

If you are looking for ideas on what to contribute, check out the JIRA issues for this project labeled "pick-me-up". Comment on the issue and/or contact [email protected] with your questions and ideas.

If you’d like to report a bug but don’t have time to fix it, you can still post it to our issue tracker, or email the mailing list [email protected]

To contribute a patch:

Break your work into small, single-purpose patches if possible. It’s much harder to merge in a large change with a lot of disjoint features.
Create a JIRA for your patch on the Parquet Project JIRA.
Submit the patch as a GitHub pull request against the master branch. For a tutorial, see the GitHub guides on forking a repo and sending a pull request. Prefix your pull request name with the JIRA name (ex: https://github.com/apache/parquet-mr/pull/240).
Make sure that your code passes the unit tests. You can run the tests with mvn test in the root directory.
Add new unit tests for your code.

We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some common issues that are not code structure related, but still important:

Use 2 spaces for whitespace. Not tabs, not 4 spaces. The number of the spacing shall be 2.
Give your operators some room. Not a+b but a + b and not foo(int a,int b) but foo(int a, int b).
Generally speaking, stick to the Sun Java Code Conventions.
Make sure tests pass!
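
As a small illustration of the spacing rules above, the following sketch uses two-space indentation, spaces around operators, and a space after each comma. The class and method names (StyleSketch, sum) are made up for this example and do not come from the Parquet code base.

```java
// Illustrative only: formatting that follows the spacing rules listed above.
final class StyleSketch {
  // "a + b" rather than "a+b"; "(int a, int b)" rather than "(int a,int b)".
  static int sum(int a, int b) {
    int total = a + b;
    return total;
  }

  public static void main(String[] args) {
    System.out.println(sum(2, 3));
  }
}
```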

Thank you for getting involved!

Authors and contributors

Code of Conduct

We hold ourselves and the Parquet developer community to two codes of conduct:

Discussions

Mailing list: [email protected]
Bug tracker: JIRA
Discussions also take place in GitHub pull requests.

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0 See also:
