
anish749 / spark2-etl-examples

Licence: other
A project with examples of a few commonly used data manipulation/processing/transformation APIs in Apache Spark 2.0.0

Programming Languages

scala

Projects that are alternatives to or similar to spark2-etl-examples

big data
A collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (+47.83%)
Mutual labels:  spark-sql
Spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Stars: ✭ 1,721 (+7382.61%)
Mutual labels:  spark-sql
albis
Albis: High-Performance File Format for Big Data Systems
Stars: ✭ 20 (-13.04%)
Mutual labels:  spark-sql
recsys spark
Spark SQL implementations of ItemCF, UserCF, and Swing; recommender systems, recommendation algorithms, collaborative filtering
Stars: ✭ 76 (+230.43%)
Mutual labels:  spark-sql
spark-structured-streaming-examples
Spark Structured Streaming examples using version 3.0.0
Stars: ✭ 23 (+0%)
Mutual labels:  spark-sql
spark-twitter-sentiment-analysis
Sentiment Analysis of a Twitter Topic with Spark Structured Streaming
Stars: ✭ 55 (+139.13%)
Mutual labels:  spark-sql
spark-sql-internals
The Internals of Spark SQL
Stars: ✭ 331 (+1339.13%)
Mutual labels:  spark-sql
spark-vcf
Spark VCF data source implementation for Dataframes
Stars: ✭ 15 (-34.78%)
Mutual labels:  spark-sql
Redash
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Stars: ✭ 20,147 (+87495.65%)
Mutual labels:  spark-sql
dt-sql-parser
SQL Parsers for BigData, built with antlr4.
Stars: ✭ 135 (+486.96%)
Mutual labels:  spark-sql
spark-data-sources
Developing Spark External Data Sources using the V2 API
Stars: ✭ 36 (+56.52%)
Mutual labels:  spark-sql
MCW-Big-data-analytics-and-visualization
MCW Big data analytics and visualization
Stars: ✭ 172 (+647.83%)
Mutual labels:  spark-sql
geospark
bring sf to spark in production
Stars: ✭ 53 (+130.43%)
Mutual labels:  spark-sql
litemall-dw
A big data project based on the open-source Litemall e-commerce project, covering front-end event tracking (OpenResty + Lua) and back-end event tracking, a five-layer data warehouse, real-time computation, and user profiling. The big data platform runs on CDH 6.3.2 (scripted with Vagrant + Ansible) and also includes Azkaban workflows.
Stars: ✭ 36 (+56.52%)
Mutual labels:  spark-sql
opaque-sql
An encrypted data analytics platform
Stars: ✭ 169 (+634.78%)
Mutual labels:  spark-sql
SparkProgrammingInScala
Apache Spark Course Material
Stars: ✭ 57 (+147.83%)
Mutual labels:  spark-sql
bigdatatutorial
bigdatatutorial
Stars: ✭ 34 (+47.83%)
Mutual labels:  spark-sql
databricks-notebooks
Collection of Databricks and Jupyter Notebooks
Stars: ✭ 19 (-17.39%)
Mutual labels:  spark-sql
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+69.57%)
Mutual labels:  spark-sql
Tweet-Analysis-With-Kafka-and-Spark
A real time analytics dashboard to analyze the trending hashtags and @ mentions at any location using kafka and spark streaming.
Stars: ✭ 18 (-21.74%)
Mutual labels:  spark-sql

Transformations using Apache Spark 2.0.0

A project with examples of a few commonly used data manipulation/processing/transformation APIs in Apache Spark 2.0.0

Tech Stack used:

Framework: Spark v2.0.0

Programming Language: Scala v2.11.6

About the project

The project can be loaded in IntelliJ IDEA, and the class org.anish.spark.etc.ProcessData can be run directly. This produces all the output.

Code File descriptions

org.anish.spark.etc.ProcessData.scala : Main object with all transformations and aggregations to process the data. Running this object (tested on a local system) should produce all the required results. The input data has the following fields:

member_id, name, email, joined, ip_address, posts, bday_day, bday_month, bday_year, members_profile_views, referred_by

A sample output is saved in SampleOutput.txt. The output of IP address occurrences grouped by the first 3 octets has been truncated at 500 rows to make it more presentable; the complete DataFrame is, however, saved in the Hive tables.
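As a rough illustration of this kind of transformation, the sketch below (not the project's actual code) reads the member data with a Spark 2.0 SparkSession and counts IP addresses grouped by their first 3 octets. The input format (CSV with a header row) and the object name are assumptions made for this example.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object IpFrequencySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ip-frequency-sketch")
      .master("local[*]")
      .getOrCreate()

    // Assumption: the input under data/allData/ is CSV with a header row
    // containing the fields listed above (member_id, ip_address, ...).
    val members = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/allData/")

    // Count occurrences of each first-3-octet prefix, largest groups first.
    val ipFrequency = members
      .withColumn("ip_prefix", regexp_extract(col("ip_address"), "^(\\d+\\.\\d+\\.\\d+)\\.", 1))
      .groupBy("ip_prefix")
      .count()
      .orderBy(desc("count"))

    ipFrequency.show(500, truncate = false) // the saved sample output is truncated at 500 rows

    spark.stop()
  }
}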

Build with maven:

mvn clean install package

To run the main Scala object, the data (for testing) should be in data/allData/:

java -jar target/spark2-etl-examples-1.0-SNAPSHOT-jar-with-dependencies.jar 

org.anish.spark.etl.hive.Constants.scala : Configuration values stored as Strings in a class; these can be made configurable later.

org.anish.spark.etl.hive.HiveSetup.scala : Creates Hive tables and loads the initial data.
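A minimal sketch of what such a setup can look like with a Hive-enabled SparkSession; the database and table names below are hypothetical and not necessarily those used in HiveSetup.scala.

import org.apache.spark.sql.{SaveMode, SparkSession}

object HiveSetupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-setup-sketch")
      .enableHiveSupport() // requires a reachable Hive metastore
      .getOrCreate()

    // Hypothetical database name, for illustration only.
    spark.sql("CREATE DATABASE IF NOT EXISTS members_db")

    // Read the initial data and persist it as a Hive table.
    val initialData = spark.read.option("header", "true").csv("data/allData/")
    initialData.write.mode(SaveMode.Overwrite).saveAsTable("members_db.members")

    spark.stop()
  }
}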

org.anish.spark.etl.hive.LoadToHive.scala : Does incremental loads to Hive. Also has a function to perform an update-else-insert (upsert) on the whole data set in a Hive table.
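The sketch below illustrates the update-else-insert idea keyed on member_id (an illustration, not the project's actual implementation): rows from the increment replace matching rows in the existing data, and new member_ids are appended. Both DataFrames are assumed to share the same schema and column order.

import org.apache.spark.sql.DataFrame

object UpsertSketch {
  def upsertByMemberId(existing: DataFrame, increment: DataFrame): DataFrame = {
    // Existing rows whose member_id does NOT appear in the increment...
    val untouched = existing.join(increment, Seq("member_id"), "left_anti")
    // ...plus every row of the increment as the latest version of that member.
    untouched.union(increment)
  }
}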

org.anish.spark.etl.hive.DemoRunner.scala : Runs a demo that loads the initial data to Hive and then applies one increment. All sources are taken from the appropriate folders in the data/* directory. This requires running from an edge node with Hive and Spark clients installed and connected to a Hive metastore and Spark server.

org.anish.spark.etl.ProcessDataTest.scala : Test class covering all utility methods defined in the ProcessData and LoadToHive objects.

Avro Outputs:

For analyses that produce a single number or a list of numbers as output (such as the day with the most birthdays, the month with the fewest birthdays, and the years with the most sign-ups), the output from the provided sample is in SampleOutput.txt, along with DataFrames truncated at 500 records.

All queries that produce a dataset as output are saved as Avro files in the spark-warehouse/ folder. These can be recreated by executing java -jar target/spark2-etl-examples-1.0-SNAPSHOT-jar-with-dependencies.jar
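For reference, a result DataFrame can be written as Avro through the com.databricks:spark-avro package declared in the spark-submit commands below; this is a sketch, and the output sub-directory name is illustrative.

import org.apache.spark.sql.{DataFrame, SaveMode}

object AvroOutputSketch {
  def saveAsAvro(result: DataFrame, name: String): Unit = {
    result.write
      .mode(SaveMode.Overwrite)
      .format("com.databricks.spark.avro") // provided by com.databricks:spark-avro_2.11:3.1.0
      .save(s"spark-warehouse/$name")      // e.g. spark-warehouse/ip_frequency (illustrative name)
  }
}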

Running the project

  1. Run mvn clean install to build the project. This runs the Scala tests and should end with a successful build.
  2. Run java -jar target/spark2-etl-examples-1.0-SNAPSHOT-jar-with-dependencies.jar to produce the analysis results. This shows the following outputs:
    • Most birthdays are on: 1 day(s)
    • Least birthdays are on: 11 month(s)
    • Email providers with more than 10K
    • Posts by email providers
    • Year(s) with max sign ups: 2015
    • Class C IP address frequency by 1st octet
    • Frequency of IP address based on first 3 octets (truncated)
    • Number of referrals by members

Hive related Demo

For loading incremental data to Hive tables: this first creates a table in Hive from the already existing data and loads that data.

Increment Load: Loads an increment of data, updating fields for records that already exist (matched on member_id) and appending records that do not. (New members are added; data for existing members is updated.) For the sample data I have not partitioned or bucketed the data, since the frequency of incoming increments and the size and query pattern of the data are not known.

This assumes that the Hive metastore is up and running, HiveServer2 is running, and the Hive client jars are present. It should ideally be run from an 'edge node' of a cluster. I've tested it in Spark local mode, not in cluster mode.

java -cp target/spark2-etl-examples-1.0-SNAPSHOT-jar-with-dependencies.jar org.anish.spark.etl.hive.DemoRunner

Submitting to Spark Standalone

spark-submit --class org.anish.spark.etl.ProcessData --master local[4] \
--jars $(find '<***lib directory with spark jars***>' -name '*.jar' | xargs echo | tr ' ' ',') \
--packages com.databricks:spark-avro_2.11:3.1.0 \
spark2-etl-examples-1.0-SNAPSHOT.jar 

Currently the source path is hard-coded to the local directory data/all_data/. To read from HDFS, the path should be given appropriately, e.g. hdfs://data/all_data/. The HDFS path is picked up automatically if HDFS is running on the same node.
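One way to avoid the hard-coded location (a sketch, not the project's current behaviour) is to accept the input path as a program argument, so the same jar can read either a local directory or an HDFS URI:

import org.apache.spark.sql.SparkSession

object ConfigurablePathSketch {
  def main(args: Array[String]): Unit = {
    // First argument: local path or HDFS URI; defaults to the local directory mentioned above.
    val inputPath = args.headOption.getOrElse("data/all_data/")
    val spark = SparkSession.builder().appName("configurable-path-sketch").getOrCreate()

    val members = spark.read.option("header", "true").csv(inputPath)
    println(s"Read ${members.count()} rows from $inputPath")

    spark.stop()
  }
}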

Submitting from "edge nodes" (Yarn Client Mode)

spark-submit --class org.anish.spark.etl.ProcessData --master yarn-client \
--jars $(find '<***lib directory with spark jars***>' -name '*.jar' | xargs echo | tr ' ' ',') \
--packages com.databricks:spark-avro_2.11:3.1.0 \
spark2-etl-examples-1.0-SNAPSHOT.jar

Use for educational purposes

If you are trying to run these examples to understand Spark and you need data, have a look at the 'data' branch.
