Amazon S3 Find And Forget
Amazon S3 Find and Forget is a solution for handling data erasure requests in data lakes stored on Amazon S3, for example pursuant to the European General Data Protection Regulation (GDPR).
Stars: ✭ 115 (+296.55%)
Petastorm
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Stars: ✭ 1,108 (+3720.69%)
Kartothek
A consistent table management library in Python.
Stars: ✭ 144 (+396.55%)
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr, etc.
Stars: ✭ 406 (+1300%)
columnify
Converts record-oriented data to a columnar format.
Stars: ✭ 28 (-3.45%)
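The record-to-columnar transformation that tools like columnify perform can be sketched in a few lines of plain Python (an illustrative example only, not columnify's actual API or implementation):

```python
def to_columnar(records):
    """Convert row-oriented records (a list of dicts) into a
    columnar layout (a dict mapping column name to list of values).
    Assumes every record has the same keys; real tools also handle
    missing fields, nesting, and type coercion."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns

rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
]
print(to_columnar(rows))  # {'id': [1, 2], 'name': ['a', 'b']}
```

Storing each column contiguously like this is what enables the compression and column-pruning benefits of formats such as Parquet.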
Schemer
Schema registry for CSV, TSV, JSON, Avro, and Parquet schemas. Supports schema inference and a GraphQL API.
Stars: ✭ 97 (+234.48%)
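Schema inference over untyped text columns, as mentioned for Schemer above, boils down to finding the narrowest type that fits every value in a column. A minimal sketch in plain Python (illustrative only; not Schemer's actual algorithm or API):

```python
def infer_type(values):
    """Return the narrowest of 'int', 'float', or 'string'
    that can represent every value in the column."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"

def infer_schema(header, rows):
    """Infer a {column_name: type_name} schema from string rows."""
    columns = list(zip(*rows))  # transpose rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}

schema = infer_schema(["id", "price", "label"],
                      [["1", "9.99", "x"], ["2", "0.5", "y"]])
print(schema)  # {'id': 'int', 'price': 'float', 'label': 'string'}
```

Production schema registries additionally handle nulls, dates, nested structures, and schema evolution across versions.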
Parquet.jl
A Julia reader for the Parquet columnar file format.
Stars: ✭ 93 (+220.69%)
Quilt
Quilt is a self-organizing data hub for S3.
Stars: ✭ 1,007 (+3372.41%)
openmrs-fhir-analytics
A collection of tools for extracting FHIR resources and analytics services on top of that data.
Stars: ✭ 55 (+89.66%)
Parquetviewer
A simple Windows desktop application for viewing & querying Apache Parquet files.
Stars: ✭ 145 (+400%)
Oap
Optimized Analytics Package for Spark* Platform.
Stars: ✭ 343 (+1082.76%)
Gaffer
A large-scale entity and relation database supporting aggregation of properties.
Stars: ✭ 1,642 (+5562.07%)
Spark
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that let data workers efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets. This project contains sample Spark programs written in Scala.
Stars: ✭ 55 (+89.66%)
Parquet Index
Spark SQL index for Parquet tables.
Stars: ✭ 109 (+275.86%)
albis
Albis: High-Performance File Format for Big Data Systems.
Stars: ✭ 20 (-31.03%)
Bigdata File Viewer
A cross-platform (Windows, macOS, Linux) desktop application for viewing common big data binary formats such as Parquet, ORC, Avro, etc. Supports local file systems, HDFS, AWS S3, Azure Blob Storage, etc.
Stars: ✭ 86 (+196.55%)
graphique
GraphQL service for Arrow tables and Parquet data sets.
Stars: ✭ 28 (-3.45%)
Gcs Tools
GCS support for avro-tools, parquet-tools, and protobuf.
Stars: ✭ 57 (+96.55%)
miniparquet
Library to read a subset of Parquet files.
Stars: ✭ 38 (+31.03%)
hadoop-etl-udfs
The Hadoop ETL UDFs are the main way to load data from Hadoop into EXASOL.
Stars: ✭ 17 (-41.38%)
Skale
High-performance distributed data processing engine.
Stars: ✭ 390 (+1244.83%)
Awkward 0.x
Manipulate arrays of complex data structures as easily as NumPy arrays.
Stars: ✭ 216 (+644.83%)
Parquet Rs
Apache Parquet implementation in Rust.
Stars: ✭ 144 (+396.55%)
Eel Sdk
Big Data Toolkit for the JVM.
Stars: ✭ 140 (+382.76%)
common-datax
A general-purpose data synchronization microservice based on DataX; a single RESTful API handles all common data synchronization tasks.
Stars: ✭ 51 (+75.86%)
Parquet4s
Read and write Parquet in Scala. Use Scala classes as the schema. No need to start a cluster.
Stars: ✭ 125 (+331.03%)
DataXServer
Provides remote multi-language invocation (ThriftServer, HttpServer) and distributed execution (DataX on YARN) for DataX (https://github.com/alibaba/DataX).
Stars: ✭ 130 (+348.28%)
Parquet Go
Go package to read and write Parquet files. Parquet is a file format for storing nested data structures in a flat columnar layout. It can be used in the Hadoop ecosystem and with tools such as Presto and AWS Athena.
Stars: ✭ 114 (+293.1%)
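The phrase "nested data structures in a flat columnar layout" can be illustrated with a simplified flattening pass that turns nested records into dotted column paths (a hypothetical sketch: real Parquet encodes nesting with Dremel-style repetition and definition levels, which this example does not attempt):

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into {'a.b.c': value} column paths.
    Handles dict nesting only; lists/repeated fields are where
    Parquet's repetition levels would be needed."""
    out = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, path + "."))
        else:
            out[path] = value
    return out

row = {"id": 1, "user": {"name": "a", "address": {"city": "b"}}}
print(flatten(row))
# {'id': 1, 'user.name': 'a', 'user.address.city': 'b'}
```

Each dotted path then becomes one flat column, which is the shape columnar readers and writers operate on.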
Kglab
Graph-Based Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, RDFlib, pySHACL, RAPIDS, NetworkX, iGraph, PyVis, pslpython, pyarrow, etc.
Stars: ✭ 98 (+237.93%)
DaFlow
An Apache Spark based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.
Stars: ✭ 24 (-17.24%)
Parquet Mr
Apache Parquet.
Stars: ✭ 1,278 (+4306.9%)
parquet-extra
A collection of Apache Parquet add-on modules.
Stars: ✭ 30 (+3.45%)
LarkMidTable
LarkMidTable is a one-stop open-source data middle platform, covering foundational infrastructure, data governance, data development, monitoring and alerting, data services, and data visualization, efficiently empowering the data front end and providing data services.
Stars: ✭ 873 (+2910.34%)
Rumble
⛈️ Rumble 1.11.0 "Banyan Tree" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more.
Stars: ✭ 58 (+100%)
qsv
CSVs sliced, diced & analyzed.
Stars: ✭ 438 (+1410.34%)
Node Parquet
Node.js module to access Apache Parquet format files.
Stars: ✭ 46 (+58.62%)
wasp
WASP is a framework for building complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze them through complex pipelines, this is the framework for you.
Stars: ✭ 19 (-34.48%)
Pucket
Bucketing and partitioning system for Parquet.
Stars: ✭ 29 (+0%)
DataX-src
DataX is a widely used offline data synchronization tool/platform for heterogeneous data, providing efficient data synchronization between heterogeneous data sources such as MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, OTS, ODPS, and others.
Stars: ✭ 21 (-27.59%)
parquet2
Fastest and safest Rust implementation of Parquet. `unsafe`-free. Integration-tested against pyarrow.
Stars: ✭ 157 (+441.38%)
Iceberg
Iceberg is a table format for large, slow-moving tabular data.
Stars: ✭ 393 (+1255.17%)
Vscode Data Preview
Data Preview 🈸 extension for importing 📤 viewing 🔎 slicing 🔪 dicing 🎲 charting 📊 & exporting 📥 large JSON array/config, YAML, Apache Arrow, Avro, Parquet & Excel data files.
Stars: ✭ 245 (+744.83%)
Choetl
ETL framework for .NET / C# (parser/writer for CSV, flat, XML, JSON, key-value, Parquet, YAML, and Avro formatted files).
Stars: ✭ 372 (+1182.76%)
odbc2parquet
A command-line tool to query an ODBC data source and write the result into a Parquet file.
Stars: ✭ 95 (+227.59%)
Parquetjs
A fully asynchronous, pure JavaScript implementation of the Parquet file format.
Stars: ✭ 200 (+589.66%)
experiments
Code examples for my blog posts.
Stars: ✭ 21 (-27.59%)
cloud
Cloud computing: environment setup and configuration files for Hadoop, Hive, Hue, Oozie, Sqoop, HBase, and ZooKeeper.
Stars: ✭ 48 (+65.52%)
parquet-usql
A custom extractor designed to read Parquet for Azure Data Lake Analytics.
Stars: ✭ 13 (-55.17%)
IMCtermite
Enables extraction of measurement data from binary files with the extension 'raw' used by the proprietary software imcFAMOS/imcSTUDIO, and facilitates its storage in open-source file formats.
Stars: ✭ 20 (-31.03%)
Bigdata Playground
A complete example of a big data application using: Kubernetes (kops/AWS), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, Node.js, Angular, GraphQL.
Stars: ✭ 177 (+510.34%)