zero-one-group / Geni

Licence: apache-2.0
A Clojure dataframe library that runs on Spark

Projects that are alternatives of or similar to Geni

Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-1.32%)
Mutual labels:  parallel-computing, dataframe, spark, big-data, distributed-computing
Koalas
Koalas: pandas API on Apache Spark
Stars: ✭ 3,044 (+1902.63%)
Mutual labels:  dataframe, data-science, spark, big-data
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (-48.03%)
Mutual labels:  data-science, spark, big-data, data-engineering
Accelerator
The Accelerator is a tool for fast and reproducible processing of large amounts of data.
Stars: ✭ 137 (-9.87%)
Mutual labels:  data-science, big-data, data-engineering, high-performance-computing
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-26.97%)
Mutual labels:  big-data, spark, dataframe
pyspark-algorithms
PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
Stars: ✭ 72 (-52.63%)
Mutual labels:  big-data, distributed-computing, dataframe
Metorikku
A simplified, lightweight ETL Framework based on Apache Spark
Stars: ✭ 361 (+137.5%)
Mutual labels:  spark, big-data, distributed-computing
Feast
Feature Store for Machine Learning
Stars: ✭ 2,576 (+1594.74%)
Mutual labels:  spark, big-data, data-engineering
ParallelUtilities.jl
Fast and easy parallel mapreduce on HPC clusters
Stars: ✭ 28 (-81.58%)
Mutual labels:  parallel-computing, distributed-computing, high-performance-computing
Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+14405.26%)
Mutual labels:  data-science, spark, big-data
Spark Alchemy
Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive
Stars: ✭ 122 (-19.74%)
Mutual labels:  data-science, spark, data-engineering
Rsparkling
RSparkling: Use H2O Sparkling Water from R (Spark + R + Machine Learning)
Stars: ✭ 65 (-57.24%)
Mutual labels:  data-science, spark, big-data
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+780.26%)
Mutual labels:  data-science, spark, big-data
H2o 3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+3621.05%)
Mutual labels:  data-science, spark, big-data
Pyspark Example Project
Example project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+316.45%)
Mutual labels:  data-science, spark, data-engineering
Just Dashboard
📊 📋 Dashboards using YAML or JSON files
Stars: ✭ 1,511 (+894.08%)
Mutual labels:  data-science, big-data, data-engineering
Butterfree
A tool for building feature stores.
Stars: ✭ 126 (-17.11%)
Mutual labels:  data-science, data-engineering
Cape Python
Collaborate on privacy-preserving policy for data science projects in Pandas and Apache Spark
Stars: ✭ 125 (-17.76%)
Mutual labels:  data-science, spark
Spark.jl
Julia binding for Apache Spark
Stars: ✭ 153 (+0.66%)
Mutual labels:  spark, big-data
Datasciencevm
Tools and Docs on the Azure Data Science Virtual Machine (http://aka.ms/dsvm)
Stars: ✭ 153 (+0.66%)
Mutual labels:  data-science, big-data

Geni (/gɜni/ or "gurney" without the r) is a Clojure dataframe library that runs on Apache Spark. The name means "fire" in Javanese.

Overview

Geni provides an idiomatic Spark interface for Clojure without the hassle of Java or Scala interop. Geni uses Clojure's -> threading macro as the main way to compose Spark's Dataset and Column operations in place of the usual method chaining in Scala. It also provides a greater degree of dynamism by allowing args of mixed types such as columns, strings and keywords in a single function invocation. See the docs section on Geni semantics for more details.
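The two ideas in the overview can be illustrated in plain Clojure, without Geni or Spark on the classpath. The sketch below first expands a quoted threading form to show how `->` turns a pipeline into nested calls, then defines a small helper that accepts either keywords or strings as column names. The helper `column-name` is hypothetical and only illustrates the idea of mixed argument types; it is not Geni's implementation.

```clojure
;; `->` inserts each result as the first argument of the next form.
;; The form is quoted, so the g/ symbols are never resolved or called.
(def threaded '(-> dataframe (g/limit 5) g/show))
(def expanded (macroexpand-1 threaded))
;; expanded is (g/show (g/limit dataframe 5))

;; Hypothetical helper (an assumption, not part of Geni):
;; coerce keywords and strings to a plain column-name string.
(defn column-name
  [x]
  (cond
    (keyword? x) (name x)
    (string? x)  x
    :else        (throw (ex-info "Unsupported column argument" {:arg x}))))

(def names (mapv column-name [:longitude "latitude" :total_rooms]))
;; names is ["longitude" "latitude" "total_rooms"]
```

Reading the expanded form from the inside out gives the same order of operations as reading the threaded form from top to bottom, which is why the threaded style stays readable as pipelines grow.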

Resources

Docs Cookbook
  1. Getting Started with Clojure, Geni and Spark
  2. Reading and Writing Datasets
  3. Selecting Rows and Columns
  4. Grouping and Aggregating
  5. Combining Datasets with Joins and Unions
  6. String Operations
  7. Cleaning up Messy Data
  8. Timestamps and Dates
  9. Window Functions
  10. Reading from and Writing to SQL Databases
  11. Avoiding Repeated Computations with Caching
  12. Basic ML Pipelines
  13. Customer Segmentation with NMF

Basic Examples

All examples below use the Statlib California housing prices data available for free on Kaggle.

Spark SQL API for data wrangling:

(require '[zero-one.geni.core :as g])

(def dataframe (g/read-parquet! "test/resources/housing.parquet"))

(g/count dataframe)
=> 5000

(g/print-schema dataframe)
; root
;  |-- longitude: double (nullable = true)
;  |-- latitude: double (nullable = true)
;  |-- housing_median_age: double (nullable = true)
;  |-- total_rooms: double (nullable = true)
;  |-- total_bedrooms: double (nullable = true)
;  |-- population: double (nullable = true)
;  |-- households: double (nullable = true)
;  |-- median_income: double (nullable = true)
;  |-- median_house_value: double (nullable = true)
;  |-- ocean_proximity: string (nullable = true)

(-> dataframe (g/limit 5) g/show)
; +---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
; |longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
; +---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
; |-122.23  |37.88   |41.0              |880.0      |129.0         |322.0     |126.0     |8.3252       |452600.0          |NEAR BAY       |
; |-122.22  |37.86   |21.0              |7099.0     |1106.0        |2401.0    |1138.0    |8.3014       |358500.0          |NEAR BAY       |
; |-122.24  |37.85   |52.0              |1467.0     |190.0         |496.0     |177.0     |7.2574       |352100.0          |NEAR BAY       |
; |-122.25  |37.85   |52.0              |1274.0     |235.0         |558.0     |219.0     |5.6431       |341300.0          |NEAR BAY       |
; |-122.25  |37.85   |52.0              |1627.0     |280.0         |565.0     |259.0     |3.8462       |342200.0          |NEAR BAY       |
; +---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+

(-> dataframe (g/describe :housing_median_age :total_rooms :population) g/show)
; +-------+------------------+------------------+-----------------+
; |summary|housing_median_age|total_rooms       |population       |
; +-------+------------------+------------------+-----------------+
; |count  |5000              |5000              |5000             |
; |mean   |30.9842           |2393.2132         |1334.9684        |
; |stddev |12.969656616832669|1812.4457510408017|954.0206427949117|
; |min    |1.0               |1000.0            |100.0            |
; |max    |9.0               |999.0             |999.0            |
; +-------+------------------+------------------+-----------------+

(-> dataframe
    (g/group-by :ocean_proximity)
    (g/agg {:count        (g/count "*")
            :mean-rooms   (g/mean :total_rooms)
            :distinct-lat (g/count-distinct (g/int :latitude))})
    (g/order-by (g/desc :count))
    g/show)
; +---------------+-----+------------------+------------+
; |ocean_proximity|count|mean-rooms        |distinct-lat|
; +---------------+-----+------------------+------------+
; |INLAND         |1823 |2358.181020296215 |10          |
; |<1H OCEAN      |1783 |2467.5361749859785|7           |
; |NEAR BAY       |1287 |2368.72027972028  |2           |
; |NEAR OCEAN     |107  |2046.1869158878505|2           |
; +---------------+-----+------------------+------------+

(-> dataframe
    (g/select {:ocean :ocean_proximity
               :house (g/struct {:rooms (g/struct :total_rooms :total_bedrooms)
                                 :age   :housing_median_age})
               :coord (g/struct {:lat :latitude :long :longitude})})
    (g/limit 3)
    g/collect)
=> ({:ocean "NEAR BAY",
     :house {:rooms {:total_rooms 880.0, :total_bedrooms 129.0}, 
             :age 41.0},
     :coord {:lat 37.88, :long -122.23}}
    {:ocean "NEAR BAY",
     :house {:rooms {:total_rooms 7099.0, :total_bedrooms 1106.0}, 
             :age 21.0},
     :coord {:lat 37.86, :long -122.22}}
    {:ocean "NEAR BAY",
     :house {:rooms {:total_rooms 1467.0, :total_bedrooms 190.0}, 
             :age 52.0},
     :coord {:lat 37.85, :long -122.24}})

Spark ML example translated from Spark's programming guide:

(require '[zero-one.geni.core :as g])
(require '[zero-one.geni.ml :as ml])

(def training-set
  (g/table->dataset
    [[0 "a b c d e spark"  1.0]
     [1 "b d"              0.0]
     [2 "spark f g h"      1.0]
     [3 "hadoop mapreduce" 0.0]]
    [:id :text :label]))

(def pipeline
  (ml/pipeline
    (ml/tokenizer {:input-col :text
                   :output-col :words})
    (ml/hashing-tf {:num-features 1000
                    :input-col :words
                    :output-col :features})
    (ml/logistic-regression {:max-iter 10
                             :reg-param 0.001})))

(def model (ml/fit training-set pipeline))

(def test-set
  (g/table->dataset
    [[4 "spark i j k"]
     [5 "l m n"]
     [6 "spark hadoop spark"]
     [7 "apache hadoop"]]
    [:id :text]))

(-> test-set
    (ml/transform model)
    (g/select :id :text :probability :prediction)
    g/show)
;; +---+------------------+----------------------------------------+----------+
;; |id |text              |probability                             |prediction|
;; +---+------------------+----------------------------------------+----------+
;; |4  |spark i j k       |[0.1596407738787411,0.8403592261212589] |1.0       |
;; |5  |l m n             |[0.8378325685476612,0.16216743145233883]|0.0       |
;; |6  |spark hadoop spark|[0.0692663313297627,0.9307336686702373] |1.0       |
;; |7  |apache hadoop     |[0.9821575333444208,0.01784246665557917]|0.0       |
;; +---+------------------+----------------------------------------+----------+

More detailed examples can be found here.

Quick Start

Install Geni

Install the geni script to /usr/local/bin with:

wget https://raw.githubusercontent.com/zero-one-group/geni/develop/scripts/geni
chmod a+x geni
sudo mv geni /usr/local/bin/

The geni command downloads the latest Geni uberjar, places it in ~/.geni/geni-repl-uberjar.jar, and runs it with java -jar.

Uberjar

Download the latest Geni REPL uberjar from the release page. Run the uberjar as follows:

java -jar <uberjar-name>

The uberjar app prints the default SparkSession instance, starts an nREPL server with an .nrepl-port file for easy text-editor connection, and then steps into a Clojure REPL (REPL-y).

Leiningen Template

Use Leiningen to create a template of a Geni project:

lein new geni <project-name>

Change into the project directory and run lein run. The templated app runs a Spark ML example, then steps into a Clojure REPL (REPL-y) and writes an .nrepl-port file for editor connections.

Screencast Demos

Install Uberjar Leiningen

Installation

Add the Geni dependency to the :dependencies vector of your project.clj. The current coordinates and version are listed on the project's Clojars page.

You also need to add Spark as a provided dependency. For instance, add the following key-value pair to the :profiles map:

:provided
{:dependencies [;; Spark
                [org.apache.spark/spark-avro_2.12 "3.1.1"]
                [org.apache.spark/spark-core_2.12 "3.1.1"]
                [org.apache.spark/spark-hive_2.12 "3.1.1"]
                [org.apache.spark/spark-mllib_2.12 "3.1.1"]
                [org.apache.spark/spark-sql_2.12 "3.1.1"]
                [org.apache.spark/spark-streaming_2.12 "3.1.1"]
                [com.github.fommil.netlib/all "1.1.2" :extension "pom"]
                ;; Arrow
                [org.apache.arrow/arrow-memory-netty "2.0.0"]
                [org.apache.arrow/arrow-memory-core "2.0.0"]
                [org.apache.arrow/arrow-vector "2.0.0"
                 :exclusions [commons-codec com.fasterxml.jackson.core/jackson-databind]]
                ;; Databases
                [mysql/mysql-connector-java "8.0.23"]
                [org.postgresql/postgresql "42.2.19"]
                [org.xerial/sqlite-jdbc "3.34.0"]
                ;; Optional: Spark XGBoost
                [ml.dmlc/xgboost4j-spark_2.12 "1.2.0"]
                [ml.dmlc/xgboost4j_2.12 "1.2.0"]]}

You may also need to install libatlas3-base and libopenblas-base to use a native BLAS, and libgomp1 to train XGBoost4J models. When the optional dependencies are not present, the vars for the corresponding functions (such as ml/xgboost-classifier) are left unbound.
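Because a missing optional dependency leaves a var unbound rather than absent, code can check for it before calling. The following is a minimal sketch in plain Clojure of that check, using `bound?` from clojure.core. The names `optional-classifier`, `present-fn` and `available?` are made up for illustration; `declare` stands in for a function whose optional dependency was not found.

```clojure
;; `declare` interns a var without a root value, which mimics a function
;; left unbound because its optional dependency is missing.
(declare optional-classifier)

;; An ordinary, bound var for comparison.
(def present-fn (fn [x] x))

(defn available?
  "True when the var has a root binding and can safely be called."
  [v]
  (bound? v))

(def optional-ok (available? #'optional-classifier)) ; false: declared only
(def present-ok  (available? #'present-fn))          ; true
```

With Geni on the classpath, the same check would presumably read `(bound? #'ml/xgboost-classifier)`, assuming `ml` is aliased to zero-one.geni.ml as in the examples above.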

License

Copyright 2020 Zero One Group.

Geni is licensed under Apache License v2.0, see LICENSE.

Mentions

Some parts of the project have been taken from or inspired by:
