
elastacloud / parquet-usql

Licence: MIT
A custom extractor designed to read parquet for Azure Data Lake Analytics

Programming Languages

C#
PowerShell

Projects that are alternatives to, or similar to, parquet-usql

pan-cortex-data-lake-python
Python idiomatic SDK for Cortex™ Data Lake.
Stars: ✭ 36 (+176.92%)
Mutual labels:  datalake
IMCtermite
Enables extraction of measurement data from binary files with extension 'raw' used by proprietary software imcFAMOS/imcSTUDIO and facilitates its storage in open source file formats
Stars: ✭ 20 (+53.85%)
Mutual labels:  parquet
PowerPointAudio-Extractor
Python script which extracts and joins audio files from powerpoints
Stars: ✭ 12 (-7.69%)
Mutual labels:  extractor
columnify
Converts record-oriented data to columnar formats.
Stars: ✭ 28 (+115.38%)
Mutual labels:  parquet
databricks-notebooks
Collection of Databricks and Jupyter Notebooks
Stars: ✭ 19 (+46.15%)
Mutual labels:  parquet
hadoop-etl-udfs
The Hadoop ETL UDFs are the main way to load data from Hadoop into EXASOL
Stars: ✭ 17 (+30.77%)
Mutual labels:  parquet
proc-that
proc(ess)-that - an easily extendable ETL tool for Node.js, written in TypeScript.
Stars: ✭ 25 (+92.31%)
Mutual labels:  extractor
DaFlow
An Apache Spark-based data flow (ETL) framework which supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.
Stars: ✭ 24 (+84.62%)
Mutual labels:  parquet
meta-extractor
Super simple and fast HTML page metadata extractor with a low memory footprint
Stars: ✭ 38 (+192.31%)
Mutual labels:  extractor
apiary
Apiary provides modules which can be combined to create a federated cloud data lake
Stars: ✭ 30 (+130.77%)
Mutual labels:  datalake
RecursiveExtractor
RecursiveExtractor is a .NET Standard 2.0 archive extraction library and command-line tool that can process 7zip, ar, bzip2, deb, gzip, iso, rar, tar, vhd, vhdx, vmdk, wim, xzip, and zip archives, and any nested combination of the supported formats.
Stars: ✭ 109 (+738.46%)
Mutual labels:  extractor
CTR-tools
Crash Team Racing (PS1) tools - a C# framework by DCxDemo and a set of tools to parse files found in the original kart racing game by Naughty Dog.
Stars: ✭ 93 (+615.38%)
Mutual labels:  extractor
wasp
WASP is a framework to build complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze them through complex pipelines, this is the framework for you.
Stars: ✭ 19 (+46.15%)
Mutual labels:  parquet
terraform-aws-kinesis-firehose
This code creates a Kinesis Firehose in AWS to send CloudWatch log data to S3.
Stars: ✭ 25 (+92.31%)
Mutual labels:  parquet
Parquet.jl
Julia implementation of Parquet columnar file format reader
Stars: ✭ 93 (+615.38%)
Mutual labels:  parquet
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Has a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+200%)
Mutual labels:  datalake
odbc2parquet
A command line tool to query an ODBC data source and write the result into a parquet file.
Stars: ✭ 95 (+630.77%)
Mutual labels:  parquet
Spark
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning or SQL workloads requiring fast iterative access to datasets. This project contains sample Spark programs written in Scala.
Stars: ✭ 55 (+323.08%)
Mutual labels:  parquet
Real-time-Data-Warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
Stars: ✭ 52 (+300%)
Mutual labels:  datalake
crohme-data-extractor
A modified extractor for the CROHME handwritten math symbols dataset.
Stars: ✭ 18 (+38.46%)
Mutual labels:  extractor

Apache Parquet for Azure Data Lake

Summary

This custom extractor and outputter consume parquet-dotnet to enable reading and writing of Parquet files in Azure Data Lake Analytics. The extractor supports both the native Apache Parquet format and the type representations used by Apache Spark, Hive and Impala, so that outputs remain interchangeable despite several discrepancies in how annotated types are represented.

Deployment

The Parquet.Adla project is compiled, together with all of its dependent assemblies, into a single assembly created through ILMerge. The deploy.ps1 PowerShell script can be run locally to:

  • Merge all dependent assemblies into Parquet.Adla.dll
  • Copy the assembly to your chosen blob storage container
  • Copy the assembly to your chosen ADLS account and register it with the catalog database

To install for use with ADLA, open a command prompt at the solution root and enter the following:

powershell -File .\deploy.ps1 -BlobStorageAccountName xx ^
	-BlobStorageAccountKey xx ^
	-BlobStorageContainer xx ^
	-BlobStoragePath xx ^
	-AzureDataLakeStoreName xx ^
	-AzureDataLakeAnalyticsName xx ^
	-TenantId xx ^
	-ApplicationId xx ^
	-ApplicationKey xx ^
	-SubscriptionId xx

If the blob storage parameters are omitted, the script will not deploy to storage; if the ADLS and ADLA names are omitted, the DLL will not be deployed to ADLS and registered with the catalog.
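For example, a storage-only deployment that merges the assembly and copies it to blob storage (the account, key, container and path values below are placeholders, not values from this repo) might look like:

powershell -File .\deploy.ps1 -BlobStorageAccountName mystorageacct ^
	-BlobStorageAccountKey mykey ^
	-BlobStorageContainer assemblies ^
	-BlobStoragePath parquet-adla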

The deployment uses a Service Principal, which must be created to enable a non-interactive login. Use the following guide to create one.

Creating a Service Principal

Follow the steps to get the ApplicationId and the Key, and then use them in the deployment script. You will also need to select the resources (ADLA and ADLS) in the Azure Portal and give the Service Principal at least a Contributor role under the IAM tab.
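If you would rather script this than use the Portal, a minimal sketch with the Azure CLI can create the Service Principal and role assignment in one step. The name and scope below are illustrative, and the az CLI is not part of this repo's tooling:

# Creates a Service Principal with the Contributor role at subscription scope.
# The appId, password and tenant values in the output correspond to the
# deployment script's -ApplicationId, -ApplicationKey and -TenantId parameters.
az ad sp create-for-rbac --name "parquet-usql-deploy" --role Contributor --scopes "/subscriptions/<SubscriptionId>"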

To find out your TenantId, use the following URI:

https://login.windows.net/xxx.onmicrosoft.com/.well-known/openid-configuration

Replace xxx with your own Azure Active Directory name. This returns the OpenID configuration document for your tenant; the GUID that appears in each of the endpoint URLs is the TenantId.
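As a quick way to inspect the response, a minimal PowerShell sketch (contoso is a placeholder for your directory name) is:

# Fetch the tenant's OpenID configuration document.
$config = Invoke-RestMethod -Uri "https://login.windows.net/contoso.onmicrosoft.com/.well-known/openid-configuration"

# Any of the endpoint URLs embeds the tenant GUID, e.g.
# https://login.windows.net/<tenant-guid>/oauth2/token
$config.token_endpoint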

Usage

Outputter

To use the outputter, reference Parquet.Adla as follows.

REFERENCE ASSEMBLY [Parquet.Adla];

@a =
    SELECT * FROM
        (VALUES
            ("Contoso", 1500.0),
            ("Woodgrove", 2700.0)
        ) AS D(customer, amount);
OUTPUT @a
	TO "/pqnet/test1.parquet"
	USING new Parquet.Adla.Outputter.ParquetOutputter();

Extractor

To use the extractor, reference Parquet.Adla as follows.

USE DATABASE master;
REFERENCE ASSEMBLY [Parquet.Adla];

DECLARE @input_file string = @"alltypes.plain.parquet";
DECLARE @output_file string = @"alltypes.plain.csv";

@a =
	EXTRACT bool_col bool, timestamp_col DateTime
	FROM @input_file USING new Parquet.Adla.Extractors.ParquetExtractor();

OUTPUT @a
	TO @output_file
	USING Outputters.Csv();

Limitations
