
nevi-me / Rust Dataframe

Licence: apache-2.0
A Rust DataFrame implementation, built on Apache Arrow

Projects that are alternatives of or similar to Rust Dataframe

hamilton
A scalable general purpose micro-framework for defining dataflows. You can use it to create dataframes, numpy matrices, python objects, ML models, etc.
Stars: ✭ 612 (+125.83%)
Mutual labels:  dataframe
bioinf-commons
Bioinformatics library in Kotlin
Stars: ✭ 21 (-92.25%)
Mutual labels:  dataframe
Dominando-Pandas
This repository is intended for learning the Pandas library.
Stars: ✭ 22 (-91.88%)
Mutual labels:  dataframe
torch-dataframe
Utility class to manipulate dataset from CSV file
Stars: ✭ 67 (-75.28%)
Mutual labels:  dataframe
woodwork
Woodwork is a Python library that provides robust methods for managing and communicating data typing information.
Stars: ✭ 97 (-64.21%)
Mutual labels:  dataframe
polars
Fast multi-threaded DataFrame library in Rust | Python | Node.js
Stars: ✭ 6,368 (+2249.82%)
Mutual labels:  dataframe
heidi
heidi : tidy data in Haskell
Stars: ✭ 24 (-91.14%)
Mutual labels:  dataframe
connector-x
Fastest library to load data from DB to DataFrames in Rust and Python
Stars: ✭ 550 (+102.95%)
Mutual labels:  dataframe
dflib
In-memory Java DataFrame library
Stars: ✭ 50 (-81.55%)
Mutual labels:  dataframe
raccoon
Python DataFrame with fast insert and appends
Stars: ✭ 64 (-76.38%)
Mutual labels:  dataframe
saddle
SADDLE: Scala Data Library
Stars: ✭ 23 (-91.51%)
Mutual labels:  dataframe
cognipy
In-memory Graph Database and Knowledge Graph with Natural Language Interface, compatible with Pandas
Stars: ✭ 31 (-88.56%)
Mutual labels:  dataframe
DataFrame
DataFrame Library for Java
Stars: ✭ 51 (-81.18%)
Mutual labels:  dataframe
tv
📺(tv) Tidy Viewer is a cross-platform CLI csv pretty printer that uses column styling to maximize viewer enjoyment.
Stars: ✭ 1,763 (+550.55%)
Mutual labels:  dataframe
pywedge
Makes Interactive Chart Widget, Cleans raw data, Runs baseline models, Interactive hyperparameter tuning & tracking
Stars: ✭ 49 (-81.92%)
Mutual labels:  dataframe
pyjanitor
Clean APIs for data cleaning. Python implementation of R package Janitor
Stars: ✭ 970 (+257.93%)
Mutual labels:  dataframe
tablecloth
Dataset manipulation library built on the top of tech.ml.dataset
Stars: ✭ 167 (-38.38%)
Mutual labels:  dataframe
Nimdata
DataFrame API written in Nim, enabling fast out-of-core data processing
Stars: ✭ 261 (-3.69%)
Mutual labels:  dataframe
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-59.04%)
Mutual labels:  dataframe
tableau-scraping
Tableau scraper python library. R and Python scripts to scrape data from Tableau viz
Stars: ✭ 91 (-66.42%)
Mutual labels:  dataframe

Rust DataFrame

A dataframe implementation in Rust, powered by Apache Arrow.

What is a dataframe?

A dataframe is a 2-dimensional tabular data structure that is often used for computations and other data transformations. Each column typically holds values of a single data type, similar to a column in a SQL table.
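To make the idea concrete, here is a toy illustration in plain Rust. This is not this crate's API (which is built on Arrow arrays and record batches); it is only a sketch of "named, homogeneously-typed columns of equal length":

```rust
// Toy illustration (not this crate's API): a dataframe as named,
// homogeneously-typed columns of equal length.
#[derive(Debug, Clone)]
enum Column {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
}

struct DataFrame {
    names: Vec<String>,
    columns: Vec<Column>,
}

impl DataFrame {
    // Every column has the same length, so the first one determines row count.
    fn num_rows(&self) -> usize {
        match self.columns.first() {
            Some(Column::Int64(v)) => v.len(),
            Some(Column::Utf8(v)) => v.len(),
            None => 0,
        }
    }
}

fn main() {
    let df = DataFrame {
        names: vec!["id".into(), "city".into()],
        columns: vec![
            Column::Int64(vec![1, 2, 3]),
            Column::Utf8(vec!["Cairo".into(), "Lagos".into(), "Nairobi".into()]),
        ],
    };
    println!("{} columns, {} rows", df.names.len(), df.num_rows());
}
```

In the real implementation the columns are Arrow arrays, which gives a well-defined memory layout and ready-made compute kernels.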

Functionality

This project is inspired by Pandas and other dataframe libraries, but currently borrows most of its functions from Apache Spark.

It mainly focuses on computation, and aims to include:

  • Scalar functions
  • Aggregate functions
  • Window functions
  • Array functions

As a point of reference, we use Apache Spark's Python functions for function parity, and aim to remain compatible with Apache Spark's functions.

Eager vs Lazy Evaluation

The initial experiments in this project were to see whether it is possible to create some form of dataframe. We are happy that this goal has been met; however, the initial version relied on eager evaluation, which would make it difficult to use in a REPL fashion, and slow.

We are mainly focusing on creating a process for lazy evaluation (the current LazyFrame), which involves reading an input's schema, then applying transformations on that schema until a materialising action is required. While still figuring this out, there might not be much progress on the surface, as most of this exercise is happening offline.

The plan is to provide a reasonable API for lazily transforming data, and the ability to apply some optimisations on the computation graph (e.g. predicate pushdown, rearranging computations).

In the future, LazyFrame will probably be renamed to DataFrame, and the current DataFrame with eager evaluation removed/made private.

The ongoing experiments on lazy evaluation are in the master branch, and we would appreciate some help 🙏🏾.
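The lazy-evaluation flow described above can be sketched as follows. The types here are hypothetical stand-ins, not the crate's actual LazyFrame: transformations are recorded against a schema, and nothing touches data until a materialising action runs the accumulated plan:

```rust
// Hypothetical sketch of lazy evaluation, not the crate's actual LazyFrame.
// Each transformation only updates the schema and appends to a logical plan.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Select(Vec<String>),
    Filter(String), // predicate kept as a string for illustration
}

#[derive(Debug, Clone)]
struct LazyFrame {
    schema: Vec<String>, // column names only, for simplicity
    plan: Vec<Op>,
}

impl LazyFrame {
    fn scan(schema: Vec<String>) -> Self {
        LazyFrame { schema, plan: vec![] }
    }
    fn select(mut self, cols: Vec<String>) -> Self {
        self.schema = cols.clone();       // the schema is tracked eagerly…
        self.plan.push(Op::Select(cols)); // …but no data is read or moved
        self
    }
    fn filter(mut self, predicate: &str) -> Self {
        self.plan.push(Op::Filter(predicate.to_string()));
        self
    }
    // A real materialising action would optimise the plan (e.g. push the
    // filter down to the source) and execute it; here we just return it.
    fn collect(self) -> Vec<Op> {
        self.plan
    }
}

fn main() {
    let lf = LazyFrame::scan(vec!["id".into(), "city".into()])
        .filter("id > 1")
        .select(vec!["city".into()]);
    println!("{:?}", lf.collect());
}
```

Keeping the plan as data is what makes optimisations such as predicate pushdown possible: the graph can be rewritten before any IO happens.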

Non-Goals

Although we use Apache Spark as a reference, we do not intend on supporting distributed computation beyond a single machine.

Spark is a convenience to reduce bikeshedding, but we will probably provide a more Rust idiomatic API in future.

Status

A low-level API can already be used for simple tasks that do not require aggregations, joins or sorts. A simpler API is currently not a priority until we have more capabilities to transform data.

One immediate potential use of the library is copying data from one supported data source to another (e.g. PostgreSQL to Arrow or CSV) with minimal transformations.

Roadmap

  • [ ] Lazy evaluation (H1 2020)
    • [ ] Aggregations
    • [ ] Joins
    • [ ] Sorting
  • [ ] Adding compute fns (H1 2020)
  • [ ] Bindings to other languages (H2 2020)

IO

We are working on IO support, with priority for SQL read and write. PostgreSQL IO is supported using the binary protocol, although some data types are not yet supported (lists, structs, numeric, and a few other non-primitive types).

  • IO Support
    • [X] CSV
      • [X] Read
      • [X] Write
    • [ ] JSON
      • [X] Read
      • [ ] Write
    • [X] Arrow IPC
      • [X] Read File
      • [X] Write File
    • [ ] Parquet
      • [ ] Read File
      • [ ] Write File
    • [ ] SQL (part of an effort to create generic DB traits)
      • [X] PostgreSQL (Primitive and temporal types supported, PRs welcome for other types)
        • [X] Read
        • [X] Write
      • [ ] MSSQL (using tiberius)
        • [ ] Read
        • [ ] Write
      • [ ] MySQL
        • [ ] Read
        • [ ] Write

Functionality

  • DataFrame Operations

    • [X] Select single column
    • [X] Select subset of columns, drop columns
    • [X] Add or remove columns
    • [X] Rename columns
    • [X] Create dataframe from record batches (a Vec<RecordBatch> as well as an iterator)
    • [ ] Sort dataframes
    • [ ] Grouped operations
    • [ ] Filter dataframes
    • [ ] Join dataframes
  • Scalar Functions

    • [X] Trig functions (sin, cos, tan, asin, asinh, ...) (using the num crate where possible)
    • [X] Basic arithmetic (add, multiply, divide, subtract), implemented via Arrow
    • [ ] Date/Time functions
    • [ ] String functions
      • [-] Basic string manipulation
      • [ ] Regular expressions (leveraging regex)
      • [ ] Casting to and from strings (using Arrow compute's cast kernel)
    • [ ] Crypto/hash functions (md5, crc32, sha{x}, ...)
    • [ ] Other functions (that we haven't classified)
  • Aggregate Functions

    • [X] Sum, max, min
    • [X] Count
    • [ ] Statistical aggregations (mean, mode, median, stddev, ...)
  • Window Functions

    • [ ] Lead, lag
    • [ ] Rank, percent rank
    • [ ] Other
  • Array Functions

    • [ ] Compatibility with Spark 2.4 functions
    • [ ] Compatibility with Spark 3.0 functions

Performance

We plan on providing simple benchmarks in the near future. The current blockers are:

  • [ ] IO
    • [X] Text format (CSV)
    • [X] Binary format (Arrow IPC)
    • [ ] SQL
  • [-] Lazy operations
  • [ ] Aggregation
  • [ ] Joins