
lynnlangit / learning-hadoop-and-spark

License: Apache-2.0
Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning

Programming Languages

HTML
75241 projects
Java
68154 projects - #9 most used programming language
Python
139335 projects - #7 most used programming language
TeX
3793 projects
R
7636 projects
Scala
5932 projects

Projects that are alternatives to, or similar to, learning-hadoop-and-spark

GooglePlay-Web-Crawler
MapReduce project using Hadoop, Nutch, AWS EMR, Pig, Tez, and Hive
Stars: ✭ 18 (-87.67%)
Mutual labels:  emr, hadoop, mapreduce
gomrjob
gomrjob - a Go framework for Hadoop MapReduce jobs
Stars: ✭ 39 (-73.29%)
Mutual labels:  hadoop, mapreduce, dataproc
Dist Keras
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
Stars: ✭ 613 (+319.86%)
Mutual labels:  apache-spark, hadoop
Mobius
C# and F# language binding and extensions to Apache Spark
Stars: ✭ 929 (+536.3%)
Mutual labels:  apache-spark, mapreduce
Bigdata Playground
A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
Stars: ✭ 177 (+21.23%)
Mutual labels:  apache-spark, hadoop
connected-component
Map Reduce Implementation of Connected Component on Apache Spark
Stars: ✭ 68 (-53.42%)
Mutual labels:  apache-spark, mapreduce
leaflet heatmap
A simple visualization of Huzhou call data. Assuming the data volume is too large to render a heatmap directly in the browser, the heatmap rendering step is moved offline for computation and analysis. The data is computed in parallel with Apache Spark, the heatmap is then also drawn with Apache Spark, and finally leafletjs loads the OpenStreetMap layer and the heatmap layer for good interactivity. With the current Apache Spark implementation the parallel computation is slower than a single machine, perhaps because Apache Spark is not well suited to this kind of computation, or because my algorithm is poorly designed. The Apache Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .
Stars: ✭ 13 (-91.1%)
Mutual labels:  apache-spark, hadoop
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+2.74%)
Mutual labels:  apache-spark, hadoop
Asakusafw
Asakusa Framework
Stars: ✭ 114 (-21.92%)
Mutual labels:  hadoop, mapreduce
rail
Scalable RNA-seq analysis
Stars: ✭ 74 (-49.32%)
Mutual labels:  emr, mapreduce
basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
Stars: ✭ 25 (-82.88%)
Mutual labels:  emr, hadoop
DaFlow
An Apache Spark-based data flow (ETL) framework that supports multiple read/write destinations of different types, as well as multiple categories of transformation rules.
Stars: ✭ 24 (-83.56%)
Mutual labels:  apache-spark, hadoop
sparkucx
A high-performance, scalable and efficient ShuffleManager plugin for Apache Spark, utilizing UCX communication layer
Stars: ✭ 32 (-78.08%)
Mutual labels:  apache-spark, hadoop
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-23.97%)
Mutual labels:  apache-spark, hadoop
datalake-etl-pipeline
A simplified ETL process in Hadoop using Apache Spark: a complete ETL pipeline for a data lake, with SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (-73.29%)
Mutual labels:  apache-spark, hadoop
Griffon Vm
Griffon Data Science Virtual Machine
Stars: ✭ 128 (-12.33%)
Mutual labels:  apache-spark, hadoop
Bigdata Notes
A beginner's guide to big data ⭐
Stars: ✭ 10,991 (+7428.08%)
Mutual labels:  hadoop, mapreduce
Avro Hadoop Starter
Example MapReduce jobs in Java, Hive, Pig, and Hadoop Streaming that work on Avro data.
Stars: ✭ 110 (-24.66%)
Mutual labels:  hadoop, mapreduce
Sparkrdma
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (+47.26%)
Mutual labels:  apache-spark, hadoop
Spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Stars: ✭ 1,721 (+1078.77%)
Mutual labels:  emr, apache-spark

Learning Hadoop and Spark

Contents

This is the companion repo to my LinkedIn Learning Courses on Apache Hadoop and Apache Spark.

🐘 1. Learning Hadoop - link
- uses mostly GCP Dataproc
- for running Hadoop and associated library workloads (e.g., Hive, Pig, Spark...)

🌩️ 2. Cloud Hadoop: Scaling Apache Spark - link
- uses GCP Dataproc, AWS EMR --or--
- Databricks on AWS

⛈️ 3. Azure Databricks Spark Essential Training - link
- uses Azure with Databricks
- for scaling Apache Spark workloads


Development Environment Setup Information

You have a number of options. Although it is possible to set up a local Hadoop/Spark cluster, I do NOT recommend this approach, as it's needlessly complex for initial study. Rather, I recommend that you use a partially or fully managed cluster. For learning, I most often use a fully managed (free tier) cluster.

1. SaaS - Databricks --> MANAGED

Databricks offers managed Apache Spark clusters. Databricks can run on AWS, Azure, or GCP (GCP support announced in 2021 - link). In this course, I use Databricks running on AWS, as the Community Edition is simple and fast to set up for learning purposes.

  • Use Databricks Community Edition (managed, hosted Apache Spark), running on AWS. Example notebook shown in screenshot above; a minimal notebook cell is sketched after this list.
    • uses Databricks (Jupyter-style) notebooks to connect to one or more custom-sized, managed Spark clusters
    • creates and manages your data files, stored in cloud buckets, as part of the Databricks service
    • uses the Databricks file system (DBFS) for cluster data operations
    • use Databricks AWS Community Edition (simplest setup - free tier on AWS) - link --OR--
    • use the Databricks Azure trial edition - Azure may require a pay-as-you-go account to get the needed CPU/GPU resources --OR--
    • try the Databricks on GCP beta - announced recently - link
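Below is a minimal sketch of a first Databricks notebook cell, assuming a CSV file has been uploaded through the Databricks UI; the DBFS path is a hypothetical placeholder. In Databricks notebooks, the `spark` SparkSession and the `display()` helper are predefined by the runtime.

```python
# Minimal Databricks notebook cell (sketch). The `spark` SparkSession and
# `display()` are provided by the Databricks runtime; the DBFS path below is
# a hypothetical placeholder for a file uploaded through the Databricks UI.
df = (spark.read
      .option("header", True)        # first row contains column names
      .option("inferSchema", True)   # let Spark guess column types
      .csv("dbfs:/FileStore/tables/sample_data.csv"))

df.printSchema()       # inspect the inferred schema
display(df.limit(10))  # Databricks-native rich table rendering
```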

2. PaaS Cloud on GCP (or AWS) --> PARTIALLY-MANAGED

  • Set up a managed Hadoop/Spark cloud cluster via GCP Dataproc or AWS EMR
    • see the setup-hadoop folder in this repo for instructions/scripts
      • create a GCS (or AWS S3) bucket for input/output job data
      • see the example_datasets folder in this repo for sample data files
    • for GCP, use Dataproc, which includes a Jupyter notebook interface (a job-submission sketch follows this list) --OR--
    • for AWS, use EMR with EMR Studio (which includes managed Jupyter instances) - link; example screenshot shown above
    • for Azure, it is possible to use the HDInsight service, but I prefer Databricks on Azure because I find it more feature-complete and performant.
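As one illustration of the partially managed workflow, here is a hedged sketch of submitting a PySpark script to an existing Dataproc cluster using the google-cloud-dataproc Python client (`pip install google-cloud-dataproc`). The project, region, cluster, and bucket names are hypothetical placeholders, not values from this repo.

```python
# Sketch: submit a PySpark job to an existing Dataproc cluster with the
# google-cloud-dataproc client library. All names below are hypothetical.
from google.cloud import dataproc_v1

project_id = "my-gcp-project"          # hypothetical project
region = "us-central1"                 # hypothetical region
cluster_name = "my-dataproc-cluster"   # hypothetical cluster

# The job client must point at the regional Dataproc endpoint.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    # PySpark script staged in a GCS bucket (hypothetical path)
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/scripts/wordcount.py"},
}

# Submit the job and block until it completes.
operation = client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()
print(f"Job {response.reference.job_id} finished: {response.status.state.name}")
```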

3. IaaS local or cloud --> MANUAL

  • Set up Hadoop/Spark locally or on a 'raw' cloud VM, such as AWS EC2
    • NOT RECOMMENDED for learning - too complex to set up
    • Cloudera Learning VM - also NOT recommended; it changes too often, and the documentation is not kept aligned

Example Jobs or Scripts

EXAMPLES from the org.apache.hadoop.examples / org.apache.spark.examples packages - link for Spark examples

  • Run a Hadoop WordCount Job with Java (jar file)
  • Run a Hadoop and/or Spark CalculatePi (digits) script with PySpark or other libraries (see the sketch after this list)
  • Run using Cloudera shared demo env
    • at https://demo.gethue.com/
    • login: demo, password: demo
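For readers who want to see what these two example jobs look like in code, here is a minimal PySpark sketch of a WordCount and a Monte Carlo Pi estimation, similar in spirit to the examples shipped with Spark. The gs:// paths are hypothetical placeholders and assume a Dataproc cluster (or another environment with the GCS connector); substitute your own bucket and input file.

```python
# Sketch of the two example jobs above: WordCount and CalculatePi.
# The gs:// paths are hypothetical placeholders.
import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCountAndPi").getOrCreate()
sc = spark.sparkContext

# --- WordCount ---
lines = sc.textFile("gs://my-bucket/input/sample.txt")     # hypothetical input
counts = (lines.flatMap(lambda line: line.split())          # split lines into words
               .map(lambda word: (word, 1))                 # pair each word with 1
               .reduceByKey(add))                           # sum counts per word
counts.saveAsTextFile("gs://my-bucket/output/wordcount")    # hypothetical output

# --- CalculatePi (Monte Carlo estimate) ---
NUM_SAMPLES = 1_000_000  # arbitrary sample count

def inside(_):
    # Does a random point in the unit square fall inside the quarter circle?
    x, y = random.random(), random.random()
    return x * x + y * y < 1

hits = sc.parallelize(range(NUM_SAMPLES)).filter(inside).count()
print(f"Pi is roughly {4.0 * hits / NUM_SAMPLES}")

spark.stop()
```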

Other LinkedIn Learning Courses on Hadoop or Spark

There are ~10 courses on Hadoop/Spark topics on LinkedIn Learning; see the graphic below.
Learning Paths

  • Hadoop for Data Science Tips and Tricks - link
    • Set up the Cloudera Environment
    • Working with Files in HDFS
    • Connecting to Hadoop Hive
    • Complex Data Structures in Hive
  • Spark courses - link
    • Various Topics - see screenshot below

[Screenshot: LinkedIn Learning Spark courses]
