Cc PysparkProcess Common Crawl data with Python and Spark
Stars: ✭ 147 (+1125%)
SparkmagicJupyter magics and kernels for working with remote Spark clusters
Stars: ✭ 954 (+7850%)
Spark PracticeApache Spark (PySpark) Practice on Real Data
Stars: ✭ 200 (+1566.67%)
Pyspark Cheatsheet🐍 Quick reference guide to common patterns & functions in PySpark.
Stars: ✭ 108 (+800%)
autThe Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (+825%)
GimelBig Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (+1700%)
Spark With PythonFundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+1150%)
ODSC India 2018My presentation at ODSC India 2018 about Deep Learning with Apache Spark
Stars: ✭ 26 (+116.67%)
spark-extensionA library that provides useful extensions to Apache Spark and PySpark.
Stars: ✭ 25 (+108.33%)
LinkisLinkis helps easily connect to various back-end computation/storage engines(Spark, Python, TiDB...), exposes various interfaces(REST, JDBC, Java ...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,323 (+19258.33%)
HnswlibJava library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs
Stars: ✭ 108 (+800%)
Spark Py NotebooksApache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+11050%)
Optimus🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Stars: ✭ 986 (+8116.67%)
W2vWord2Vec models with Twitter data using Spark. Blog:
Stars: ✭ 64 (+433.33%)
kafka-compose🎼 Docker compose files for various kafka stacks
Stars: ✭ 32 (+166.67%)
incubator-linkisLinkis helps easily connect to various back-end computation/storage engines(Spark, Python, TiDB...), exposes various interfaces(REST, JDBC, Java ...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,459 (+20391.67%)
HandysparkHandySpark - bringing pandas-like capabilities to Spark dataframes
Stars: ✭ 158 (+1216.67%)
basinBasin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
Stars: ✭ 25 (+108.33%)
Spark NlpState of the Art Natural Language Processing
Stars: ✭ 2,518 (+20883.33%)
ScriptisScriptis is for interactive data analysis with script development(SQL, Pyspark, HiveQL), task submission(Spark, Hive), UDF, function, resource management and intelligent diagnosis.
Stars: ✭ 696 (+5700%)
Live log analyzer sparkSpark Application for analysis of Apache Access logs and detect anamolies! Along with Medium Article.
Stars: ✭ 14 (+16.67%)
Pyspark Example ProjectExample project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+5175%)
MmlsparkSimple and Distributed Machine Learning
Stars: ✭ 2,899 (+24058.33%)
Devops Python Tools80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (+3283.33%)
HailScalable genomic data analysis.
Stars: ✭ 706 (+5783.33%)
MobiusC# and F# language binding and extensions to Apache Spark
Stars: ✭ 929 (+7641.67%)
FreestyleA cohesive & pragmatic framework of FP centric Scala libraries
Stars: ✭ 627 (+5125%)
Dev SetupmacOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.
Stars: ✭ 5,590 (+46483.33%)
Bigdata Interview🎯 🌟[大数据面试题]分享自己在网络上收集的大数据相关的面试题以及自己的答案总结.目前包含Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper框架的面试题知识总结
Stars: ✭ 857 (+7041.67%)
ChroniclerScala toolchain for InfluxDB
Stars: ✭ 24 (+100%)
Sparkling WaterSparkling Water provides H2O functionality inside Spark cluster
Stars: ✭ 887 (+7291.67%)
H2o 3H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+47033.33%)
DatafusionDataFusion has now been donated to the Apache Arrow project
Stars: ✭ 611 (+4991.67%)
ZeppelinWeb-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Stars: ✭ 5,513 (+45841.67%)
Mongo SparkThe MongoDB Spark Connector
Stars: ✭ 588 (+4800%)
Pyspark Setup DemoDemo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks
Stars: ✭ 24 (+100%)
Bigdataguide大数据学习,从零开始学习大数据,包含大数据学习各阶段学习视频、面试资料
Stars: ✭ 817 (+6708.33%)
AlluxioAlluxio, data orchestration for analytics and machine learning in the cloud
Stars: ✭ 5,379 (+44725%)
SparklearningLearning Apache spark,including code and data .Most part can run local.
Stars: ✭ 558 (+4550%)
Goodreads etl pipelineAn end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Stars: ✭ 793 (+6508.33%)
Spark DariaEssential Spark extensions and helper methods ✨😲
Stars: ✭ 553 (+4508.33%)
MareMaRe leverages the power of Docker and Spark to run and scale your serial tools in MapReduce fashion.
Stars: ✭ 11 (-8.33%)
Dockerfiles50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu
Stars: ✭ 847 (+6958.33%)