Devops Python Tools: 80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (-63.36%)
aut: The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-89.98%)
Oap: Optimized Analytics Package for Spark* Platform
Stars: ✭ 343 (-69.04%)
ODSC India 2018: My presentation at ODSC India 2018 about Deep Learning with Apache Spark
Stars: ✭ 26 (-97.65%)
Pyspark Setup Demo: Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks
Stars: ✭ 24 (-97.83%)
pyspark-cheatsheet: PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Stars: ✭ 115 (-89.62%)
Spark Gotchas: A subjective compilation of Apache Spark tips and tricks
Stars: ✭ 308 (-72.2%)
lineage: Generate beautiful documentation for your data pipelines in markdown format
Stars: ✭ 16 (-98.56%)
basin: Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
Stars: ✭ 25 (-97.74%)
Pyspark Example Project: Example project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (-42.87%)
dbd: dbd is a database prototyping tool that enables data analysts and engineers to quickly load and transform data in SQL databases.
Stars: ✭ 30 (-97.29%)
Live log analyzer spark: Spark application for analyzing Apache access logs and detecting anomalies, along with a Medium article.
Stars: ✭ 14 (-98.74%)
kafka-compose: 🎼 Docker compose files for various kafka stacks
Stars: ✭ 32 (-97.11%)
Skale: High performance distributed data processing engine
Stars: ✭ 390 (-64.8%)
Quilt: Quilt is a self-organizing data hub for S3
Stars: ✭ 1,007 (-9.12%)
meepo: Heterogeneous storage data migration
Stars: ✭ 29 (-97.38%)
Pystore: Fast data store for Pandas time-series data
Stars: ✭ 325 (-70.67%)
dlsa: Distributed least squares approximation (dlsa) implemented with Apache Spark
Stars: ✭ 25 (-97.74%)
Cluster Pack: A library on top of either pex or conda-pack to make your Python code easily available on a cluster
Stars: ✭ 23 (-97.92%)
kuwala: Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition we provide third-party data into data sc…
Stars: ✭ 474 (-57.22%)
Ratatool: A tool for data sampling, data generation, and data diffing
Stars: ✭ 279 (-74.82%)
Roapi: Create full-fledged APIs for static datasets without writing a single line of code.
Stars: ✭ 253 (-77.17%)
Scriptis: Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, and resource management, and intelligent diagnosis.
Stars: ✭ 696 (-37.18%)
mmtf-workshop-2018: Structural Bioinformatics Training Workshop & Hackathon 2018
Stars: ✭ 50 (-95.49%)
Pucket: Bucketing and partitioning system for Parquet
Stars: ✭ 29 (-97.38%)
spark-extension: A library that provides useful extensions to Apache Spark and PySpark.
Stars: ✭ 25 (-97.74%)
Spark Syntax: This is a repo documenting the best practices in PySpark.
Stars: ✭ 412 (-62.82%)
Node Parquet: NodeJS module to access Apache Parquet format files
Stars: ✭ 46 (-95.85%)
incubator-linkis: Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB...), exposes various interfaces (REST, JDBC, Java ...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,459 (+121.93%)
Iceberg: Iceberg is a table format for large, slow-moving tabular data
Stars: ✭ 393 (-64.53%)
HybridBackend: Efficient training of deep recommenders on cloud.
Stars: ✭ 30 (-97.29%)
Sparkling Titanic: Training models with Apache Spark and PySpark for the Titanic Kaggle competition
Stars: ✭ 12 (-98.92%)
Choetl: ETL Framework for .NET / C# (Parser / Writer for CSV, Flat, XML, JSON, Key-Value, Parquet, YAML, Avro formatted files)
Stars: ✭ 372 (-66.43%)
Gcs Tools: GCS support for avro-tools, parquet-tools and protobuf
Stars: ✭ 57 (-94.86%)
centurion: Kotlin Bigdata Toolkit
Stars: ✭ 320 (-71.12%)
experiments: Code examples for my blog posts
Stars: ✭ 21 (-98.1%)
big data: A collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (-96.93%)
Optimus: 🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Stars: ✭ 986 (-11.01%)
DataEngineering: This repo contains commands that data engineers use in day-to-day work.
Stars: ✭ 47 (-95.76%)
Elasticsearch loader: A tool for batch loading data files (json, parquet, csv, tsv) into Elasticsearch
Stars: ✭ 300 (-72.92%)
graphique: GraphQL service for arrow tables and parquet data sets.
Stars: ✭ 28 (-97.47%)
sparklanes: A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (-98.47%)
Rumble: ⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Stars: ✭ 58 (-94.77%)
Awesome Spark: A curated list of awesome Apache Spark packages and resources.
Stars: ✭ 1,061 (-4.24%)
Sparkmagic: Jupyter magics and kernels for working with remote Spark clusters
Stars: ✭ 954 (-13.9%)