criteo / Cluster Pack

Licence: apache-2.0
A library on top of either pex or conda-pack to make your Python code easily available on a cluster

cluster-pack

cluster-pack is a library on top of either pex or conda-pack to make your Python code easily available on a cluster.

Its goal is to make your prod/dev Python code and libraries easily available on any cluster. cluster-pack supports HDFS and S3 as distributed storage.

The first examples use Skein (a simple library for deploying applications on Apache YARN) and PySpark with HDFS storage. We intend to add more examples for other applications (like Dask, Ray) and S3 storage.

An introductory blog post can be found here.

Installation

Install with pip

$ pip install cluster-pack

Install from source

$ git clone https://github.com/criteo/cluster-pack
$ cd cluster-pack
$ pip install .

Prerequisites

cluster-pack supports Python ≥3.6.

Features

  • Ships a package with all the dependencies from your current virtual environment or your conda environment

  • Stores metadata for an environment

  • Supports an "under development" mode by taking advantage of pip's editable installs: all editable requirements are uploaded on every run, which makes local changes directly visible on the cluster

  • Interactive (Jupyter notebook) mode

  • Provides config helpers to directly use the uploaded zip file inside your application

  • Launches jobs from jobs by propagating all artifacts
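
To illustrate the editable-install and metadata features listed above, the sketch below shows one way to list the packages of the current environment and flag the ones installed in editable mode, using only the standard library. It is not cluster-pack's implementation. It assumes Python 3.8+ (for importlib.metadata) and a pip version that records PEP 610 direct_url.json metadata; the helper names is_editable and describe_environment are hypothetical.

```python
import json
from importlib import metadata  # standard library since Python 3.8
from typing import Dict, List, Optional


def is_editable(direct_url_text: Optional[str]) -> bool:
    """Return True if a PEP 610 direct_url.json payload marks an editable install.

    Assumption: the package was installed by a pip version that writes
    direct_url.json; legacy .egg-link installs are not detected here.
    """
    if not direct_url_text:
        return False
    try:
        info = json.loads(direct_url_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(info, dict):
        return False
    dir_info = info.get("dir_info")
    return isinstance(dir_info, dict) and bool(dir_info.get("editable", False))


def describe_environment() -> Dict[str, object]:
    """Collect package versions and editable package names (hypothetical helper)."""
    packages: Dict[str, str] = {}
    editable: List[str] = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if not name:
            continue  # skip distributions with incomplete metadata
        packages[name] = dist.metadata.get("Version", "unknown")
        # read_text returns None when the file is absent from the distribution
        if is_editable(dist.read_text("direct_url.json")):
            editable.append(name)
    return {"packages": packages, "editable": sorted(editable)}


if __name__ == "__main__":
    print(json.dumps(describe_environment(), indent=2))
```

The dictionary returned by describe_environment() could be stored next to a packaged environment as metadata. The packages it lists as editable are the ones whose local sources would need to be uploaded again on each run.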

Basic examples with skein

  1. Interactive mode

  2. Self shipping project

Basic examples with PySpark

  1. PySpark with HDFS on Yarn

  2. Docker with PySpark on S3
