hortonworks-spark / cloud-integration

License: Apache-2.0
Spark cloud integration: tests, cloud committers and more

Programming Languages

scala

Projects that are alternatives of or similar to cloud-integration

flyio
Input Output Files in R from Cloud or Local
Stars: ✭ 46 (+130%)
Mutual labels:  aws-s3, gcs
Goofys
a high-performance, POSIX-ish Amazon S3 file system written in Go
Stars: ✭ 3,932 (+19560%)
Mutual labels:  aws-s3, gcs
Spark Jupyter Aws
A guide on how to set up Jupyter with Pyspark painlessly on AWS EC2 clusters, with S3 I/O support
Stars: ✭ 259 (+1195%)
Mutual labels:  apache-spark, aws-s3
Sparkora
Powerful rapid automatic EDA and feature engineering library with a very easy to use API 🌟
Stars: ✭ 51 (+155%)
Mutual labels:  apache-spark
net.jgp.books.spark.ch07
Spark in Action, 2nd edition - chapter 7 - Ingestion from files
Stars: ✭ 13 (-35%)
Mutual labels:  apache-spark
documentai-notebooks
A centralized repository for AI Platform notebooks using the Google Cloud Document AI API.
Stars: ✭ 61 (+205%)
Mutual labels:  gcs
reaction-file-collections-sa-s3
An S3 storage adapter for Reaction Commerce's reaction-file-collections
Stars: ✭ 14 (-30%)
Mutual labels:  aws-s3
node-storage
📬 A unified file storage library for storage in cloud or on premise
Stars: ✭ 29 (+45%)
Mutual labels:  aws-s3
google-cloud
A collection of Google Cloud Platform (GCP) plugins
Stars: ✭ 34 (+70%)
Mutual labels:  gcs
s3tree
🌲 Access S3 like a tree.
Stars: ✭ 26 (+30%)
Mutual labels:  aws-s3
BigCLAM-ApacheSpark
Overlapping community detection in Large-Scale Networks using BigCLAM model build on Apache Spark
Stars: ✭ 40 (+100%)
Mutual labels:  apache-spark
fluent-plugin-gcs
Google Cloud Storage output plugin for Fluentd.
Stars: ✭ 39 (+95%)
Mutual labels:  gcs
BlobHelper
BlobHelper is a common, consistent storage interface for Microsoft Azure, Amazon S3, Komodo, Kvpbase, and local filesystem written in C#.
Stars: ✭ 23 (+15%)
Mutual labels:  aws-s3
black-postoffice
[Musinsa new hires] A social network service for sharing worries and everyday life comfortably and anonymously.
Stars: ✭ 31 (+55%)
Mutual labels:  aws-s3
sparkucx
A high-performance, scalable and efficient ShuffleManager plugin for Apache Spark, utilizing UCX communication layer
Stars: ✭ 32 (+60%)
Mutual labels:  apache-spark
django-s3file
A lightweight file upload input for Django and Amazon S3
Stars: ✭ 66 (+230%)
Mutual labels:  aws-s3
simple-flask-s3-uploader
Simple and easy to use Flask app to upload files to Amazon S3. Based on Python, Flask, and using Boto3. Securely storing your AWS credentials as environment variables. Quick AWS S3 Flask uploader example.
Stars: ✭ 24 (+20%)
Mutual labels:  aws-s3
tug
Private Composer registry for private PHP packages on AWS Serverless
Stars: ✭ 33 (+65%)
Mutual labels:  aws-s3
moments v2 backend
backend for a sharing app using SpringBoot, Redis, MySQL, and AWS S3.
Stars: ✭ 54 (+170%)
Mutual labels:  aws-s3
spark-records
Bulletproof Apache Spark jobs with fast root cause analysis of failures.
Stars: ✭ 67 (+235%)
Mutual labels:  apache-spark

Cloud Integration for Apache Spark

The cloud-integration repository provides modules to improve Apache Spark's integration with cloud infrastructures.

Module spark-cloud-integration

Classes and tools to make Spark work better in the cloud:

  • Committer integration with the s3a committers.
  • Proof-of-concept, cloud-first distcp replacement.
  • Serialization for Hadoop Configuration: class ConfigSerDeser. Use this to get a configuration into an RDD method.
  • Trait HConf to manipulate the Hadoop options in a Spark config.
  • Anything else which turns out to be useful.
  • Variant of FileInputStream for cloud storage, org.apache.spark.streaming.cloudera.CloudInputDStream
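
The need for a class such as ConfigSerDeser comes from the fact that a Hadoop Configuration is a Writable but not java.io.Serializable, so it cannot simply be captured in a closure that Spark ships to executors. The sketch below illustrates the general wrapper pattern only; it is not the source of ConfigSerDeser, which is not shown here. To stay runnable with nothing but the JDK, it uses a hypothetical stand-in class, FakeConf, that mimics the write/readFields contract of a Writable. The names FakeConf, SerializableConf and ConfSerDemo are assumptions made for this illustration.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInput, DataOutput,
  ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable

// Hypothetical stand-in for a Hadoop Configuration: NOT Serializable,
// but able to write itself to a DataOutput and read itself back.
class FakeConf {
  private val props = mutable.LinkedHashMap[String, String]()

  def set(key: String, value: String): Unit = { props(key) = value }
  def get(key: String): Option[String] = props.get(key)

  def write(out: DataOutput): Unit = {
    out.writeInt(props.size)
    props.foreach { case (k, v) => out.writeUTF(k); out.writeUTF(v) }
  }

  def readFields(in: DataInput): Unit = {
    props.clear()
    val count = in.readInt()
    (0 until count).foreach { _ =>
      val key = in.readUTF()
      val value = in.readUTF()
      props(key) = value
    }
  }
}

// Serializable holder: the wrapped object is transient, and the custom
// writeObject/readObject hooks delegate to its own write/readFields.
class SerializableConf(@transient private var conf: FakeConf) extends Serializable {
  def get: FakeConf = conf

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    conf.write(out)
  }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    conf = new FakeConf
    conf.readFields(in)
  }
}

object ConfSerDemo {
  // Serialize and deserialize through Java serialization, which is
  // roughly what happens when a closure is sent to an executor.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bytes)
    try oos.writeObject(value) finally oos.close()
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try ois.readObject().asInstanceOf[T] finally ois.close()
  }

  def main(args: Array[String]): Unit = {
    val conf = new FakeConf
    conf.set("fs.s3a.committer.name", "directory")
    val copy = roundTrip(new SerializableConf(conf)).get
    println(copy.get("fs.s3a.committer.name"))
  }
}
```

In a real job the wrapper would hold an org.apache.hadoop.conf.Configuration instead of FakeConf, and the wrapped value would be referenced inside an RDD operation so that each task rebuilds its own copy on deserialization.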

See Spark Cloud Integration

Module cloud-examples

This module runs the packaging and integration tests for Spark and cloud storage against AWS, Azure and OpenStack.

These are basic tests of the core I/O and streaming functionality; they also verify that the committers work.

As well as running as unit tests, they have CLI entry points which can be used for scalable functional testing.

Module minimal-integration-test

This is a minimal JAR for integration tests.

Usage

spark-submit --class com.cloudera.spark.cloud.integration.Generator \
--master yarn \
--num-executors 2 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
minimal-integration-test-1.0-SNAPSHOT.jar \
adl://example.azuredatalakestore.net/output/dest/1 \
2 2 15