Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+108.33%)
big data
A collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (-52.78%)
Geni
A Clojure dataframe library that runs on Spark
Stars: ✭ 152 (+111.11%)
Tdigest
t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark
Stars: ✭ 274 (+280.56%)
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (+54.17%)
Helicalinsight
Helical Insight software is the world's first open source business intelligence framework, which helps you make sense of your data and make well-informed decisions.
Stars: ✭ 214 (+197.22%)
dislib
The Distributed Computing library for Python, implemented using the PyCOMPSs programming model for HPC.
Stars: ✭ 39 (-45.83%)
arrow-datafusion
Apache Arrow DataFusion SQL Query Engine
Stars: ✭ 2,360 (+3177.78%)
check-engine
Data validation library for PySpark 3.0.0
Stars: ✭ 29 (-59.72%)
pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Stars: ✭ 115 (+59.72%)
mmtf-workshop-2018
Structural Bioinformatics Training Workshop & Hackathon 2018
Stars: ✭ 50 (-30.56%)
pyspark-ML-in-Colab
PySpark in Google Colab: a simple machine learning (linear regression) model
Stars: ✭ 32 (-55.56%)
Eland
Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
Stars: ✭ 235 (+226.39%)
HadoopDedup
🍉 Large-scale deduplication of massive data sets, based on Hadoop and HBase
Stars: ✭ 27 (-62.5%)
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (-45.83%)
SynapseML
Simple and Distributed Machine Learning
Stars: ✭ 3,355 (+4559.72%)
dlsa
Distributed least squares approximation (dlsa) implemented with Apache Spark
Stars: ✭ 25 (-65.28%)
Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+30522.22%)
Hazelcast
Open-source distributed computation and storage platform
Stars: ✭ 4,662 (+6375%)
Zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Stars: ✭ 5,513 (+7556.94%)
Iotdb
Apache IoTDB
Stars: ✭ 1,221 (+1595.83%)
Pyspark Setup Demo
Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks
Stars: ✭ 24 (-66.67%)
Bitcoin Value Predictor
[NOT MAINTAINED] Predicting Bitcoin price using time series analysis and sentiment analysis of tweets about Bitcoin
Stars: ✭ 91 (+26.39%)
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+1758.33%)
Moosefs
MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)
Stars: ✭ 1,025 (+1323.61%)
Mobius
C# and F# language bindings and extensions to Apache Spark
Stars: ✭ 929 (+1190.28%)
Mmlspark
Simple and Distributed Machine Learning
Stars: ✭ 2,899 (+3926.39%)
Gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (+200%)
Data Algorithms Book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
Stars: ✭ 949 (+1218.06%)
nebula
A distributed block-based data storage and compute engine
Stars: ✭ 127 (+76.39%)
Nakedtensor
Bare-bones examples of machine learning in TensorFlow
Stars: ✭ 2,443 (+3293.06%)
Selinon
Advanced distributed task flow management on top of Celery
Stars: ✭ 237 (+229.17%)
MLBD
Materials for "Machine Learning on Big Data" course
Stars: ✭ 20 (-72.22%)
Thrill
Thrill - An EXPERIMENTAL Algorithmic Distributed Big Data Batch Processing Framework in C++
Stars: ✭ 528 (+633.33%)
Metorikku
A simplified, lightweight ETL Framework based on Apache Spark
Stars: ✭ 361 (+401.39%)
isarn-sketches-spark
Routines and data structures for using isarn-sketches idiomatically in Apache Spark
Stars: ✭ 28 (-61.11%)
Asakusafw
Asakusa Framework
Stars: ✭ 114 (+58.33%)
Koalas
Koalas: pandas API on Apache Spark
Stars: ✭ 3,044 (+4127.78%)
merkle-db
High-scalability analytics database built on immutable Merkle trees
Stars: ✭ 44 (-38.89%)
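cdp-service
CDP data platform that helps enterprises fully understand their customers and deliver precision marketing personalized to each individual.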
Stars: ✭ 30 (-58.33%)
mesos-pinspider
A framework called "pinspider" on Apache Mesos that retrieves basic user information from a user's Pinterest page.
Stars: ✭ 18 (-75%)
dynamodb-onetable
DynamoDB access and management for one table designs with NodeJS
Stars: ✭ 508 (+605.56%)
elearning
elearning linux/mac/db/cache/server/tools/artificial intelligence
Stars: ✭ 72 (+0%)
Quantitative-Big-Imaging-2018
(Latest semester at https://github.com/kmader/Quantitative-Big-Imaging-2019) The material for the Quantitative Big Imaging course at ETHZ for the Spring Semester 2018
Stars: ✭ 50 (-30.56%)
soda-spark
Soda Spark is a PySpark library that helps you test your data in Spark DataFrames
Stars: ✭ 58 (-19.44%)
metriql
The metrics layer for your data. Join us at https://metriql.com/slack
Stars: ✭ 227 (+215.28%)
sgd
An R package for large-scale estimation with stochastic gradient descent
Stars: ✭ 55 (-23.61%)
xslweb
Web application framework for XSLT and XQuery developers
Stars: ✭ 39 (-45.83%)
SANSA-Stack
Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/
Stars: ✭ 130 (+80.56%)
meesee
Task queue with long-lived workers for work-based parallelization, using processes and Redis as the back end. For distributed computing.
Stars: ✭ 14 (-80.56%)