scicloj / tablecloth

Licence: MIT license
Dataset manipulation library built on top of tech.ml.dataset

Versions

tech.ml.dataset 6.x (master branch)

tech.ml.dataset 4.x (4.0 branch)

[scicloj/tablecloth "4.04"]

Introduction

tech.ml.dataset is a great, fast library that brings columnar datasets to Clojure. Chris Nuernberger has been working on it for the last year as part of the bigger tech.ml stack.

I started testing the library and helping to fix the bugs this uncovered. My main goal was to compare its functionality with established tools on other platforms. I focused on R solutions: dplyr, tidyr and data.table.

While converting the examples, I worked out how to reorganize the existing tech.ml.dataset functions into a simple-to-use API. The main goals were:

  • Focus on dataset manipulation functionality, leaving out other parts of tech.ml such as pipelines, datatypes, readers, ML, etc.
  • A single entry point for common operations: one function dispatching on the given arguments.
  • group-by returns a special kind of dataset: a dataset that holds the subsets created by grouping in a column.
  • Most operations recognize both regular and grouped datasets and process the data accordingly.
  • A single function form, with the dataset as the first argument, to enable thread-first (->) on datasets.
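The goals above can be illustrated with a small sketch. It assumes tablecloth is on the classpath. The toy data and the :mean-price column name are made up for illustration, and dfn is an alias for the same tech.v3.datatype.functional namespace that the usage example in this README calls.

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype.functional :as dfn])

;; Hypothetical toy data, not taken from the project.
(def ds (tc/dataset {:symbol ["A" "A" "B" "B"]
                     :price  [1.0 3.0 10.0 20.0]}))

;; The dataset is always the first argument, so operations chain with ->.
;; Grouping by a vector of column names makes each group name a map,
;; which later becomes regular columns again.
(def grouped (-> ds (tc/group-by [:symbol])))

;; A grouped dataset is still a dataset: one row per group,
;; with the sub-datasets stored in a column.
(tc/grouped? grouped)
(tc/column-names grouped)

;; aggregate recognizes the grouped dataset, runs once per group
;; and returns a regular dataset.
(def means
  (-> grouped
      (tc/aggregate {:mean-price #(dfn/mean (% :price))})
      (tc/order-by :symbol)))
```

Printing means at the REPL should show one row per symbol with its mean price; the same aggregate call on the ungrouped ds would instead produce a single row for the whole dataset.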

Important! This library is neither a replacement for tech.ml.dataset nor a separate library. It should be considered an addition on top of tech.ml.dataset.

If you want to know more about tech.ml.dataset and dtype-next, please refer to their documentation.

Join the discussion on Zulip

Documentation

Please refer to the detailed documentation with examples.

Usage example

(require '[tablecloth.api :as tc]
         '[tech.v3.datatype.datetime :as dtype-dt]
         '[tech.v3.datatype.functional :as dfn])

;; Mean price per symbol and year; show the first 10 rows of the result.
(-> "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv"
    (tc/dataset {:key-fn keyword})
    ;; the map returned for each row becomes the group name
    (tc/group-by (fn [row]
                   {:symbol (:symbol row)
                    :year (dtype-dt/long-temporal-field :years (:date row))}))
    (tc/aggregate #(dfn/mean (% :price)))
    (tc/order-by [:symbol :year])
    (tc/head 10))

_unnamed [10 3]:

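|      summary | :year | :symbol |
|-------------:|------:|---------|
|  21.74833333 |  2000 | AAPL    |
|  10.17583333 |  2001 | AAPL    |
|   9.40833333 |  2002 | AAPL    |
|   9.34750000 |  2003 | AAPL    |
|  18.72333333 |  2004 | AAPL    |
|  48.17166667 |  2005 | AAPL    |
|  72.04333333 |  2006 | AAPL    |
| 133.35333333 |  2007 | AAPL    |
| 138.48083333 |  2008 | AAPL    |
| 150.39333333 |  2009 | AAPL    |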

Contributing

Tablecloth is open to contributions. The best way to start is a discussion on Zulip.

Development tools for documentation

Documentation is written in RMarkdown, which means that you need R to create the html/md/pdf files. The documentation contains around 600 code snippets, which are run during the build. There are two files:

  • README.Rmd
  • docs/index.Rmd

Prepare the following software and render the documents:

  1. Install R
  2. Install rep, an nREPL client
  3. Install pandoc
  4. Run nREPL
  5. Run R and install the R packages: install.packages(c("rmarkdown","knitr"), dependencies=T)
  6. Load rmarkdown: library(rmarkdown)
  7. Render the readme: render("README.Rmd","md_document")
  8. Render the documentation: render("docs/index.Rmd","all")

API file generation

The tablecloth.api namespace is generated from api-template; please run the following before building the documentation:

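;; Assumption: exporter is an alias for tech.v3.datatype.export-symbols,
;; the namespace named in the guideline below.
(require '[tech.v3.datatype.export-symbols :as exporter])

(exporter/write-api! 'tablecloth.api.api-template
                     'tablecloth.api
                     "src/tablecloth/api.clj"
                     '[group-by drop concat rand-nth first last shuffle])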

Guideline

  1. Before committing changes, please run the tests. I usually run lein do clean, check, test and then build the documentation as described above (which also tests the whole library).
  2. Keep the API as simple as possible:
    • the first argument should be a dataset
    • if the parametrization is complex, the last argument should accept a map of optional function arguments
    • avoid variadic associative destructuring for function arguments
    • a function should usually work on grouped datasets as well; in that case, accept a parallel? argument (if applicable).
  3. Follow the potemkin pattern and import functions into the API namespace using the tech.v3.datatype.export-symbols/export-symbols function.
  4. Functions that are composed of API functions to cover specific case(s) should go in the tablecloth.utils namespace.
  5. Always update README.Rmd, CHANGELOG.md and docs/index.Rmd; tests and function docs are highly welcome.
  6. Always discuss changes and PRs first.
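As a rough illustration of the API conventions in guideline 2, here is a minimal sketch of a function written in that style. top-rows and its :n and :desc? options are hypothetical and not part of tablecloth; only tc/order-by and tc/head are existing API functions.

```clojure
(require '[tablecloth.api :as tc])

;; Hypothetical helper, NOT part of tablecloth.
;; Dataset first, optional settings in a trailing map,
;; no variadic keyword arguments.
(defn top-rows
  ([ds column] (top-rows ds column nil))
  ([ds column {:keys [n desc?] :or {n 5 desc? true}}]
   (-> ds
       (tc/order-by column (if desc? :desc :asc))
       (tc/head n))))

;; Made-up data for the example.
(def sample (tc/dataset {:g ["a" "a" "b"]
                         :v [1 3 2]}))

(top-rows sample :v {:n 2})
```

Because the helper only composes API functions that take the dataset first, it can be threaded after tc/group-by as well, for example (-> sample (tc/group-by [:g]) (top-rows :v {:n 1}) (tc/ungroup)), relying on order-by and head handling grouped datasets.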

TODO

  • tests
  • tutorials

Licence

Copyright (c) 2020 Scicloj

The MIT Licence
