Kotlin Dataframe: typesafe in-memory structured data processing for JVM

Status: JetBrains incubator project, alpha stability. Published on Maven Central under the Apache 2.0 License.

Kotlin Dataframe aims to reconcile Kotlin's static typing with the dynamic nature of data. It does so by using both the full power of the Kotlin language and the opportunities provided by incremental code execution in Jupyter notebooks and the REPL.

  • Hierarchical: represents hierarchical data structures, such as JSON or a tree of JVM objects.
  • Functional: the data processing pipeline is organized as a chain of DataFrame transformation operations. Every operation returns a new DataFrame instance, reusing the underlying storage wherever possible.
  • Readable: data transformation operations are defined in a DSL close to natural language.
  • Practical: provides simple solutions for common problems and the ability to perform complex tasks.
  • Minimalistic: a simple yet powerful data model with three column kinds.
  • Interoperable: convertible to and from Kotlin data classes and collections.
  • Generic: can store objects of any type, not only numbers or strings.
  • Typesafe: on-the-fly generation of extension properties for type-safe data access, with Kotlin-style care for null safety.
  • Polymorphic: type compatibility derives from column schema compatibility. You can define a function that requires a specific subset of columns in a dataframe but doesn't care about the other columns.

Integrates with the Kotlin kernel for Jupyter. Inspired by krangl, Kotlin Collections and pandas.

Explore the documentation for details.

Setup

Gradle

repositories {
    mavenCentral()
}
dependencies {
    implementation 'org.jetbrains.kotlinx:dataframe:0.8.1'
}
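
The snippet above uses the Groovy DSL. If your build script is `build.gradle.kts`, the same coordinates can be declared in the Gradle Kotlin DSL. This is a sketch of the equivalent fragment, assuming the rest of the build (the Kotlin plugin and so on) is already configured:

```kotlin
// build.gradle.kts
repositories {
    mavenCentral()
}

dependencies {
    // Same artifact and version as in the Groovy example above
    implementation("org.jetbrains.kotlinx:dataframe:0.8.1")
}
```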

Jupyter Notebook

Install the Kotlin kernel for Jupyter.

Import the stable dataframe version into the notebook:

%use dataframe

or a specific version:

%use dataframe(<version>)

Data model

  • DataFrame is a list of columns with equal sizes and distinct names.
  • DataColumn is a named list of values. It can be one of three kinds:
    • ValueColumn: contains data values
    • ColumnGroup: contains nested columns
    • FrameColumn: contains dataframes, one per row
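
To make the three column kinds concrete, here is a simplified plain-Kotlin sketch of the data model described above. All names in it (`Column`, `ValueCol`, `GroupCol`, `FrameCol`, `Frame`, `requireConsistent`) are hypothetical and chosen for illustration; they are not the library's actual declarations, which are considerably richer.

```kotlin
// Illustration only: NOT the library's real types.
sealed interface Column {
    val name: String
    val size: Int
}

// Shared invariant from the data model: equal sizes and distinct names.
fun requireConsistent(columns: List<Column>) {
    require(columns.map { it.size }.distinct().size <= 1) { "columns must have equal sizes" }
    require(columns.map { it.name }.distinct().size == columns.size) { "column names must be distinct" }
}

// Value column: holds plain values of any type.
data class ValueCol<T>(override val name: String, val values: List<T>) : Column {
    override val size: Int get() = values.size
}

// Column group: holds nested columns.
data class GroupCol(override val name: String, val columns: List<Column>) : Column {
    init { requireConsistent(columns) }
    override val size: Int get() = columns.firstOrNull()?.size ?: 0
}

// Frame column: holds one nested frame per row.
data class FrameCol(override val name: String, val frames: List<Frame>) : Column {
    override val size: Int get() = frames.size
}

// A frame is a list of columns with equal sizes and distinct names.
data class Frame(val columns: List<Column>) {
    init { requireConsistent(columns) }
    val rowCount: Int get() = columns.firstOrNull()?.size ?: 0
}
```

In this sketch, a two-row `Frame` can mix all three kinds, and constructing a `Frame` from columns of different sizes fails with an `IllegalArgumentException`.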

Usage example

Create:

// create columns
val fromTo by columnOf("LoNDon_paris", "MAdrid_miLAN", "londON_StockhOlm", "Budapest_PaRis", "Brussels_londOn")
val flightNumber by columnOf(10045.0, Double.NaN, 10065.0, Double.NaN, 10085.0)
val recentDelays by columnOf("23,47", null, "24, 43, 87", "13", "67, 32")
val airline by columnOf("KLM(!)", "{Air France} (12)", "(British Airways. )", "12. Air France", "'Swiss Air'")

// create dataframe
val df = dataFrameOf(fromTo, flightNumber, recentDelays, airline)

Clean:

// typed accessors for columns
// that will appear during
// dataframe transformation
val origin by column<String>()
val destination by column<String>()

val clean = df
    // fill missing flight numbers
    .fillNA { flightNumber }.with { prev()!!.flightNumber + 10 }

    // convert flight numbers to int
    .convert { flightNumber }.toInt()

    // clean 'airline' column
    .update { airline }.with { "([a-zA-Z\\s]+)".toRegex().find(it)?.value ?: "" }

    // split 'fromTo' column into 'origin' and 'destination'
    .split { fromTo }.by("_").into(origin, destination)

    // clean 'origin' and 'destination' columns
    .update { origin and destination }.with { it.lowercase().replaceFirstChar(Char::uppercase) }

    // split lists of delays in 'recentDelays' into separate columns
    // 'delay1', 'delay2'... and nest them inside original column `recentDelays`
    .split { recentDelays }.inward { "delay$it" }

    // convert string values in `delay1`, `delay2` into ints
    .parse { recentDelays }
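
The lambdas passed to `update` and the delimiter passed to `split` work on one value at a time, so their effect can be checked with the Kotlin standard library alone. The helper names below (`cleanAirline`, `normalizeCity`, `splitRoute`) are hypothetical and exist only for this sketch; the expressions inside them are the ones used in the pipeline.

```kotlin
// Same regex as in the pipeline: the first run of ASCII letters and whitespace.
val airlineName = "([a-zA-Z\\s]+)".toRegex()

// Hypothetical helper: what the 'airline' update does to a single value.
fun cleanAirline(raw: String): String = airlineName.find(raw)?.value ?: ""

// Hypothetical helper: what the 'origin'/'destination' update does to a single value.
fun normalizeCity(raw: String): String =
    raw.lowercase().replaceFirstChar { it.uppercase() }

// Hypothetical helper: splits one 'fromTo' value on the "_" delimiter.
fun splitRoute(raw: String): Pair<String, String> {
    val (origin, destination) = raw.split("_")
    return origin to destination
}
```

For example, `cleanAirline("{Air France} (12)")` returns `"Air France"` and `normalizeCity("LoNDon")` returns `"London"`. Because the character class also matches whitespace, `cleanAirline("12. Air France")` returns `" Air France"` with a leading space; appending `.trim()` would remove it if that matters for your data.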

Aggregate:

clean
    // group by the flight origin renamed into "from"
    .groupBy { origin named "from" }.aggregate {
        // we are in the context of a single data group

        // total number of flights from origin
        count() into "count"

        // list of flight numbers
        flightNumber into "flight numbers"

        // counts of flights per airline
        airline.valueCounts() into "airlines"

        // max delay across all delays in `delay1` and `delay2`
        recentDelays.maxOrNull { delay1 and delay2 } into "major delay"

        // separate lists of recent delays for `delay1`, `delay2` and `delay3`
        recentDelays.implode(dropNulls = true) into "recent delays"

        // total delay per destination
        pivot { destination }.sum { recentDelays.intCols() } into "total delays to"
    }
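
For readers more familiar with Kotlin Collections, the aggregation above can be approximated on a plain list of objects. This is an analogy under stated assumptions, not how the library implements `groupBy` or `aggregate`: `Flight`, `OriginSummary` and `summarizeByOrigin` are hypothetical names, and the nested `recentDelays` columns are modeled as a simple `List<Int>`.

```kotlin
// Hypothetical row type standing in for one row of the cleaned dataframe.
data class Flight(
    val origin: String,
    val destination: String,
    val flightNumber: Int,
    val airline: String,
    val delays: List<Int>,
)

// Hypothetical result type: one summary per origin.
data class OriginSummary(
    val from: String,
    val count: Int,
    val flightNumbers: List<Int>,
    val airlines: Map<String, Int>,
    val majorDelay: Int?,
    val totalDelaysTo: Map<String, Int>,
)

fun summarizeByOrigin(flights: List<Flight>): List<OriginSummary> =
    flights.groupBy { it.origin }.map { (origin, group) ->
        OriginSummary(
            from = origin,
            // total number of flights from this origin
            count = group.size,
            // list of flight numbers
            flightNumbers = group.map { it.flightNumber },
            // counts of flights per airline
            airlines = group.groupingBy { it.airline }.eachCount(),
            // max over the first two delays of each flight (mirrors delay1 and delay2)
            majorDelay = group.flatMap { it.delays.take(2) }.maxOrNull(),
            // sum of all delays per destination
            totalDelaysTo = group
                .groupBy({ it.destination }, { it.delays.sum() })
                .mapValues { (_, sums) -> sums.sum() },
        )
    }
```

Given two flights from London, one to Paris with delays 23 and 47 and one to Stockholm with delays 24, 43 and 87, the London summary has a count of 2, a major delay of 47, and total delays of 70 to Paris and 154 to Stockholm.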

Try it in Datalore and explore more examples here.

Code of Conduct

This project and the corresponding community are governed by the JetBrains Open Source and Community Code of Conduct. Please make sure you read it.

License

Kotlin Dataframe is licensed under the Apache 2.0 License.
