
Apache Spark Connector for Azure Cosmos DB

License: MIT

Programming Languages

Scala

Projects that are alternatives to or similar to Azure Cosmosdb Spark

Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-9.09%)
Mutual labels:  jupyter-notebook, spark, apache-spark, pyspark
Spark Tdd Example
A simple Spark TDD example
Stars: ✭ 23 (-86.06%)
Mutual labels:  jupyter-notebook, spark, pyspark
Azure Event Hubs Spark
Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
Stars: ✭ 140 (-15.15%)
Mutual labels:  spark, apache-spark, connector
Pyspark Learning
Updated repository
Stars: ✭ 147 (-10.91%)
Mutual labels:  jupyter-notebook, spark, pyspark
Spark Practice
Apache Spark (PySpark) Practice on Real Data
Stars: ✭ 200 (+21.21%)
Mutual labels:  jupyter-notebook, spark, pyspark
Spark Jupyter Aws
A guide on how to set up Jupyter with Pyspark painlessly on AWS EC2 clusters, with S3 I/O support
Stars: ✭ 259 (+56.97%)
Mutual labels:  jupyter-notebook, spark, apache-spark
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-32.73%)
Mutual labels:  spark, apache-spark, pyspark
Handyspark
HandySpark - bringing pandas-like capabilities to Spark dataframes
Stars: ✭ 158 (-4.24%)
Mutual labels:  jupyter-notebook, spark, pyspark
Optimus
🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Stars: ✭ 986 (+497.58%)
Mutual labels:  jupyter-notebook, spark, pyspark
Sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
Stars: ✭ 954 (+478.18%)
Mutual labels:  jupyter-notebook, spark, pyspark
Mmlspark
Simple and Distributed Machine Learning
Stars: ✭ 2,899 (+1656.97%)
Mutual labels:  spark, pyspark, apache-spark
W2v
Word2Vec models with Twitter data using Spark. Blog:
Stars: ✭ 64 (-61.21%)
Mutual labels:  jupyter-notebook, spark, pyspark
Agile data code 2
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
Stars: ✭ 413 (+150.3%)
Mutual labels:  jupyter-notebook, spark, apache-spark
Live log analyzer spark
Spark application for analyzing Apache access logs and detecting anomalies! Along with a Medium article.
Stars: ✭ 14 (-91.52%)
Mutual labels:  spark, apache-spark, pyspark
Pysparkgeoanalysis
🌐 Interactive Workshop on GeoAnalysis using PySpark
Stars: ✭ 63 (-61.82%)
Mutual labels:  jupyter-notebook, spark, pyspark
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+710.91%)
Mutual labels:  jupyter-notebook, spark, pyspark
Spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Stars: ✭ 1,721 (+943.03%)
Mutual labels:  spark, apache-spark
Griffon Vm
Griffon Data Science Virtual Machine
Stars: ✭ 128 (-22.42%)
Mutual labels:  jupyter-notebook, apache-spark
Repo 2019
BERT, AWS RDS, AWS Forecast, EMR Spark Cluster, Hive, Serverless, Google Assistant + Raspberry Pi, Infrared, Google Cloud Platform Natural Language, Anomaly detection, Tensorflow, Mathematics
Stars: ✭ 133 (-19.39%)
Mutual labels:  jupyter-notebook, pyspark
Eat pyspark in 10 days
pyspark🍒🥭 is delicious, just eat it!😋😋
Stars: ✭ 116 (-29.7%)
Mutual labels:  spark, pyspark

Azure Cosmos DB Connector for Apache Spark


azure-cosmosdb-spark is the official connector for Azure Cosmos DB and Apache Spark. The connector allows you to easily read from and write to Azure Cosmos DB via Apache Spark DataFrames in Python and Scala. It also allows you to easily create a lambda architecture for batch processing, stream processing, and a serving layer while being globally replicated and minimizing the latency involved in working with big data.

Jump Start

Reading from Cosmos DB

Below are excerpts in Python and Scala showing how to create a Spark DataFrame that reads from Cosmos DB.

# Read Configuration
readConfig = {
  "Endpoint" : "https://doctorwho.documents.azure.com:443/",
  "Masterkey" : "SPSVkSfA7f6vMgMvnYdzc1MaWb65v4VQNcI2Tp1WfSP2vtgmAwGXEPcxoYra5QBHHyjDGYuHKSkguHIz1vvmWQ==",
  "Database" : "DepartureDelays",
  "preferredRegions" : "Central US;East US2",
  "Collection" : "flights_pcoll",
  "SamplingRatio" : "1.0",
  "schema_samplesize" : "1000",
  "query_pagesize" : "2147483647",
  "query_custom" : "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'"
}

# Connect via azure-cosmosdb-spark to create Spark DataFrame
flights = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**readConfig).load()
flights.count()
Scala excerpt:

// Import Necessary Libraries
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.config.Config

// Configure connection to your collection
val readConfig = Config(Map(
  "Endpoint" -> "https://doctorwho.documents.azure.com:443/",
  "Masterkey" -> "SPSVkSfA7f6vMgMvnYdzc1MaWb65v4VQNcI2Tp1WfSP2vtgmAwGXEPcxoYra5QBHHyjDGYuHKSkguHIz1vvmWQ==",
  "Database" -> "DepartureDelays",
  "PreferredRegions" -> "Central US;East US2;",
  "Collection" -> "flights_pcoll",
  "SamplingRatio" -> "1.0",
  "query_custom" -> "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'"
))

// Connect via azure-cosmosdb-spark to create Spark DataFrame
val flights = spark.read.cosmosDB(readConfig)
flights.count()
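Either excerpt yields a standard Spark DataFrame, so the results can be explored with plain Spark SQL. Below is a minimal Python sketch, assuming the flights DataFrame and spark session from the excerpts above:

# Register the DataFrame as a temporary view so it can be queried with SQL
flights.createOrReplaceTempView("flights")

# Average departure delay per destination for flights originating in Seattle
spark.sql("""
    SELECT destination, AVG(delay) AS avg_delay
    FROM flights
    GROUP BY destination
    ORDER BY avg_delay DESC
""").show(10)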

Writing to Cosmos DB

Below are excerpts in Python and Scala showing how to write a Spark DataFrame to Cosmos DB.

# Write configuration
writeConfig = {
 "Endpoint" : "https://doctorwho.documents.azure.com:443/",
 "Masterkey" : "SPSVkSfA7f6vMgMvnYdzc1MaWb65v4VQNcI2Tp1WfSP2vtgmAwGXEPcxoYra5QBHHyjDGYuHKSkguHIz1vvmWQ==",
 "Database" : "DepartureDelays",
 "Collection" : "flights_fromsea",
 "Upsert" : "true"
}

# Write to Cosmos DB from the flights DataFrame
flights.write.format("com.microsoft.azure.cosmosdb.spark").options(**writeConfig).save()
Scala excerpt:

// Configure connection to the sink collection
val writeConfig = Config(Map(
  "Endpoint" -> "https://doctorwho.documents.azure.com:443/",
  "Masterkey" -> "SPSVkSfA7f6vMgMvnYdzc1MaWb65v4VQNcI2Tp1WfSP2vtgmAwGXEPcxoYra5QBHHyjDGYuHKSkguHIz1vvmWQ==",
  "Database" -> "DepartureDelays",
  "PreferredRegions" -> "Central US;East US2;",
  "Collection" -> "flights_fromsea",
  "WritingBatchSize" -> "100"
))

// Upsert the dataframe to Cosmos DB
import org.apache.spark.sql.SaveMode
flights.write.mode(SaveMode.Overwrite).cosmosDB(writeConfig)
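In the Python excerpt, the Upsert flag in writeConfig controls the write behavior; the DataFrameWriter save mode can also be set explicitly, mirroring the Scala SaveMode.Overwrite call above. A minimal sketch, reusing the flights DataFrame and writeConfig from the Python excerpt:

# Overwrite rather than fail if the collection already contains data
flights.write.format("com.microsoft.azure.cosmosdb.spark") \
    .mode("overwrite") \
    .options(**writeConfig) \
    .save()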

 

Requirements

Review supported component versions

Component      Versions Supported
Apache Spark   2.2.1, 2.3.x, 2.4.x
Scala          2.11
Python         2.7, 3.6

 

Working with the connector

You can build the connector yourself and/or use the published Maven coordinates to work with azure-cosmosdb-spark.

Review the connector's Maven versions

Spark   Scala   Latest version
2.4.0   2.11    azure-cosmosdb-spark_lkg_version
2.3.0   2.11    azure-cosmosdb-spark_2.3.0_2.11_1.3.3
2.2.0   2.11    azure-cosmosdb-spark_2.2.0_2.11_1.1.1
2.1.0   2.11    azure-cosmosdb-spark_2.1.0_2.11_1.2.2
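For reference, here is how the 2.4.0 / 2.11 coordinates above would be declared as a Maven dependency in a pom.xml (a sketch using the 1.3.5 version referenced later in this README):

<dependency>
  <groupId>com.microsoft.azure</groupId>
  <artifactId>azure-cosmosdb-spark_2.4.0_2.11</artifactId>
  <version>1.3.5</version>
</dependency>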

Using Databricks notebooks

Please create a library within your Databricks workspace by following the guidance in the Azure Databricks Guide > Use the Azure Cosmos DB Spark connector.

Note, the Databricks documentation at docs.azuredatabricks.net is not up to date. Instead of downloading the six separate jars into six different libraries, you can download the uber jar from Maven at https://search.maven.org/artifact/com.microsoft.azure/azure-cosmosdb-spark_2.4.0_2.11/1.3.5/jar and install that single jar/library.

Using spark-cli

To work with the connector from the Spark CLI (i.e., spark-shell, pyspark, spark-submit), use the --packages parameter with the connector's Maven coordinates.

spark-shell --master yarn --packages "com.microsoft.azure:azure-cosmosdb-spark_2.4.0_2.11:1.3.5"
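The same --packages flag applies to pyspark and spark-submit; for example (my_app.py is a placeholder for your own script):

pyspark --master yarn --packages "com.microsoft.azure:azure-cosmosdb-spark_2.4.0_2.11:1.3.5"
spark-submit --master yarn --packages "com.microsoft.azure:azure-cosmosdb-spark_2.4.0_2.11:1.3.5" my_app.py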

Using Jupyter notebooks

If you're using Jupyter notebooks within HDInsight, you can use the spark-magic %%configure cell to specify the connector's Maven coordinates.

{ "name":"Spark-to-Cosmos_DB_Connector",
  "conf": {
    "spark.jars.packages": "com.microsoft.azure:azure-cosmosdb-spark_2.4.0_2.11:1.3.5",
    "spark.jars.excludes": "org.scala-lang:scala-reflect"
   }
   ...
}

Note: spark.jars.excludes is included specifically to avoid potential conflicts between the connector, Apache Spark, and Livy.
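Once the session starts with those settings, later cells can use the connector exactly as in the Jump Start; a minimal sketch, assuming a readConfig dictionary like the one shown earlier:

# Read from Cosmos DB in a subsequent notebook cell
df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**readConfig).load()
df.show(5)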

Build the connector

Currently, this connector project uses Maven, so to build it without dependencies, you can run:

mvn clean package
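The packaged jar can then be passed to Spark directly with --jars instead of --packages; a sketch, assuming Maven writes the artifact under target/ (the exact file name depends on the version you build):

# Hypothetical jar name; substitute the file Maven actually produces under target/
spark-shell --master yarn --jars target/azure-cosmosdb-spark_2.4.0_2.11-1.3.5.jar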

 

Working with our samples

Included in this GitHub repository are a number of sample notebooks and scripts that you can utilize.

 

More Information

We have more information in the azure-cosmosdb-spark wiki, including:

Configuration and Setup

Troubleshooting

Performance

Change Feed

Monitoring

 

Contributing & Feedback

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

See CONTRIBUTING.md for contribution guidelines.

To give feedback and/or report an issue, open a GitHub Issue.

Apache®, Apache Spark, and Spark® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
