
elastacloud / parquet-usql

Licence: MIT
A custom extractor designed to read parquet for Azure Data Lake Analytics

Programming Languages

C#
PowerShell

Projects that are alternatives to, or similar to, parquet-usql

pan-cortex-data-lake-python
Python idiomatic SDK for Cortex™ Data Lake.
Stars: ✭ 36 (+176.92%)
Mutual labels:  datalake
IMCtermite
Enables extraction of measurement data from binary files with extension 'raw' used by proprietary software imcFAMOS/imcSTUDIO and facilitates its storage in open source file formats
Stars: ✭ 20 (+53.85%)
Mutual labels:  parquet
PowerPointAudio-Extractor
Python script which extracts and joins audio files from powerpoints
Stars: ✭ 12 (-7.69%)
Mutual labels:  extractor
columnify
Converts record-oriented data to columnar formats.
Stars: ✭ 28 (+115.38%)
Mutual labels:  parquet
databricks-notebooks
Collection of Databricks and Jupyter Notebooks
Stars: ✭ 19 (+46.15%)
Mutual labels:  parquet
hadoop-etl-udfs
The Hadoop ETL UDFs are the main way to load data from Hadoop into EXASOL
Stars: ✭ 17 (+30.77%)
Mutual labels:  parquet
proc-that
proc(ess)-that - an easily extendable ETL tool for Node.js, written in TypeScript.
Stars: ✭ 25 (+92.31%)
Mutual labels:  extractor
DaFlow
An Apache Spark-based data flow (ETL) framework which supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.
Stars: ✭ 24 (+84.62%)
Mutual labels:  parquet
meta-extractor
Super simple and fast HTML page metadata extractor with a low memory footprint
Stars: ✭ 38 (+192.31%)
Mutual labels:  extractor
apiary
Apiary provides modules which can be combined to create a federated cloud data lake
Stars: ✭ 30 (+130.77%)
Mutual labels:  datalake
RecursiveExtractor
RecursiveExtractor is a .NET Standard 2.0 archive extraction library and command-line tool that can process 7zip, ar, bzip2, deb, gzip, iso, rar, tar, vhd, vhdx, vmdk, wim, xzip, and zip archives, and any nested combination of the supported formats.
Stars: ✭ 109 (+738.46%)
Mutual labels:  extractor
CTR-tools
Crash Team Racing (PS1) tools - a C# framework by DCxDemo and a set of tools to parse files found in the original kart racing game by Naughty Dog.
Stars: ✭ 93 (+615.38%)
Mutual labels:  extractor
wasp
WASP is a framework to build complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze them through complex pipelines, this is the framework for you.
Stars: ✭ 19 (+46.15%)
Mutual labels:  parquet
terraform-aws-kinesis-firehose
This code creates a Kinesis Firehose in AWS to send CloudWatch log data to S3.
Stars: ✭ 25 (+92.31%)
Mutual labels:  parquet
Parquet.jl
Julia implementation of Parquet columnar file format reader
Stars: ✭ 93 (+615.38%)
Mutual labels:  parquet
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Has a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+200%)
Mutual labels:  datalake
odbc2parquet
A command line tool to query an ODBC data source and write the result into a parquet file.
Stars: ✭ 95 (+630.77%)
Mutual labels:  parquet
Spark
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning or SQL workloads requiring fast iterative access to datasets. This project contains sample Spark programs written in Scala.
Stars: ✭ 55 (+323.08%)
Mutual labels:  parquet
Real-time-Data-Warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
Stars: ✭ 52 (+300%)
Mutual labels:  datalake
crohme-data-extractor
A modified extractor for the CROHME handwritten math symbols dataset.
Stars: ✭ 18 (+38.46%)
Mutual labels:  extractor

Apache Parquet for Azure Data Lake

Summary

This custom extractor and outputter consume parquet-dotnet to enable reading and writing of Parquet files in Azure Data Lake Analytics. The extractor supports both the native Apache Parquet format and the type representations used by Apache Spark, Hive and Impala, so that outputs remain interchangeable despite several discrepancies in how annotated types are represented.

Deployment

The Parquet.Adla project is compiled, together with all of its dependent assemblies, into a single assembly created through ILMerge. The deploy.ps1 PowerShell script can be run locally to:

  • Merge all dependent assemblies into Parquet.Adla.dll
  • Copy the assembly to your chosen blob storage container
  • Copy the assembly to your chosen ADLS account and register it with the catalog database

To install for use with ADLA, open a command prompt at the solution root and enter the following:

powershell -File .\deploy.ps1 -BlobStorageAccountName xx ^
	-BlobStorageAccountKey xx ^
	-BlobStorageContainer xx ^
	-BlobStoragePath xx ^
	-AzureDataLakeStoreName xx ^
	-AzureDataLakeAnalyticsName xx ^
	-TenantId xx ^
	-ApplicationId xx ^
	-ApplicationKey xx ^
	-SubscriptionId xx

If the blob storage parameters are omitted, the script will not deploy to storage; if the ADLS and ADLA names are omitted, the DLL will not be deployed to ADLS and registered with the catalog.
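For example, a storage-only deployment that merges the assembly and copies it to blob storage (the account, key, container and path values below are placeholders, not values from this repo) might look like:

powershell -File .\deploy.ps1 -BlobStorageAccountName mystorageacct ^
	-BlobStorageAccountKey mykey ^
	-BlobStorageContainer assemblies ^
	-BlobStoragePath parquet-adla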

The deployment uses a Service Principal, which must be created to enable a non-interactive login. Use the following guide to create one.

Creating a Service Principal

Follow the steps to get the ApplicationId and the Key, and then use them in the deployment script. You will also need to select the resources (ADLA and ADLS) in the Azure Portal and give the Service Principal at least a Contributor role under the IAM tab.
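If you would rather script this than use the Portal, a minimal sketch with the Azure CLI can create the Service Principal and role assignment in one step. The name and scope below are illustrative, and the az CLI is not part of this repo's tooling:

# Creates a Service Principal with the Contributor role at subscription scope.
# The appId, password and tenant values in the output correspond to the
# deployment script's -ApplicationId, -ApplicationKey and -TenantId parameters.
az ad sp create-for-rbac --name "parquet-usql-deploy" --role Contributor --scopes "/subscriptions/<SubscriptionId>"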

To find out your TenantId, use the following URI:

https://login.windows.net/xxx.onmicrosoft.com/.well-known/openid-configuration

Replace xxx with your own Azure Active Directory name. This returns the OpenID configuration document for your tenant; the GUID that appears in each of the endpoint URLs is the TenantId.
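As a quick way to inspect the response, a minimal PowerShell sketch (contoso is a placeholder for your directory name) is:

# Fetch the tenant's OpenID configuration document.
$config = Invoke-RestMethod -Uri "https://login.windows.net/contoso.onmicrosoft.com/.well-known/openid-configuration"

# Any of the endpoint URLs embeds the tenant GUID, e.g.
# https://login.windows.net/<tenant-guid>/oauth2/token
$config.token_endpoint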

Usage

Outputter

To use the outputter, reference Parquet.Adla as follows.

REFERENCE ASSEMBLY [Parquet.Adla];

@a =
    SELECT * FROM
        (VALUES
            ("Contoso", 1500.0),
            ("Woodgrove", 2700.0)
        ) AS D(customer, amount);
OUTPUT @a
	TO "/pqnet/test1.parquet"
	USING new Parquet.Adla.Outputter.ParquetOutputter();

Extractor

To use the extractor, reference Parquet.Adla as follows.

USE DATABASE master;
REFERENCE ASSEMBLY [Parquet.Adla];

DECLARE @input_file string = @"alltypes.plain.parquet";
DECLARE @output_file string = @"alltypes.plain.csv";

@a =
	EXTRACT bool_col bool, timestamp_col DateTime
	FROM @input_file USING new Parquet.Adla.Extractors.ParquetExtractor();

OUTPUT @a
	TO @output_file
	USING Outputters.Csv();

Limitations
