
tech.ml.dataset


tech.ml.dataset is a Clojure library for data processing and machine learning. Datasets are currently in-memory columnwise databases, and we support parsing from a file or input stream. Supported input formats include raw or gzipped csv/tsv, xls, xlsx, json, and sequences of maps. SQL bindings are provided as a separate library.

Data size in memory is minimized: columns are backed by primitive arrays, datetime types are often converted to an integer representation, and strings are loaded into string tables. Together these features dramatically decrease the working-set size in memory. Because data is stored columnwise, columnwise operations on the dataset are very fast.

Conversion back into sequences of maps is very efficient, and we support writing the dataset back out to csv, tsv, and gzipped varieties of those.

Upgraded support for Apache Arrow. We support a copying pathway using the standard api - data is copied from disk into buffers. We also support a more or less from-scratch in-place pathway, built on top of the 'tech.v3.datatype.mmap' namespace, expressly to enable datasets larger than machine RAM and to maximize performance.

  • An alternative cutting-edge api with some important extra features is available via tablecloth.
  • Simple regression/classification machine learning pathways are available in tech.ml.
  • It's damn fast, even with Parquet - see the independent benchmarks.

Mini Walkthrough

user> (require '[tech.v3.dataset :as ds])
nil
;; We support many file formats
user> (def csv-data (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv"))
#'user/csv-data
user> (ds/head csv-data)
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv [5 3]:

| symbol |       date | price |
|--------|------------|-------|
|   MSFT | 2000-01-01 | 39.81 |
|   MSFT | 2000-02-01 | 36.35 |
|   MSFT | 2000-03-01 | 43.22 |
|   MSFT | 2000-04-01 | 28.37 |
|   MSFT | 2000-05-01 | 25.45 |

user> (require '[tech.v3.libs.poi])
nil
user> (def xls-data (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLS_1000.xls"))
#'user/xls-data
user> (ds/head xls-data)
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLS_1000.xls [5 8]:

| column-0 | First Name | Last Name | Gender |       Country |  Age |       Date |     Id |
|----------|------------|-----------|--------|---------------|------|------------|--------|
|      1.0 |      Dulce |     Abril | Female | United States | 32.0 | 15/10/2017 | 1562.0 |
|      2.0 |       Mara | Hashimoto | Female | Great Britain | 25.0 | 16/08/2016 | 1582.0 |
|      3.0 |     Philip |      Gent |   Male |        France | 36.0 | 21/05/2015 | 2587.0 |
|      4.0 |   Kathleen |    Hanner | Female | United States | 25.0 | 15/10/2017 | 3549.0 |
|      5.0 |    Nereida |   Magwood | Female | United States | 58.0 | 16/08/2016 | 2468.0 |

;;And you have fine-grained control over parsing

user> (ds/head (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLS_1000.xls"
                             {:parser-fn {"Date" [:local-date "dd/MM/yyyy"]}}))
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLS_1000.xls [5 8]:

| column-0 | First Name | Last Name | Gender |       Country |  Age |       Date |     Id |
|----------|------------|-----------|--------|---------------|------|------------|--------|
|      1.0 |      Dulce |     Abril | Female | United States | 32.0 | 2017-10-15 | 1562.0 |
|      2.0 |       Mara | Hashimoto | Female | Great Britain | 25.0 | 2016-08-16 | 1582.0 |
|      3.0 |     Philip |      Gent |   Male |        France | 36.0 | 2015-05-21 | 2587.0 |
|      4.0 |   Kathleen |    Hanner | Female | United States | 25.0 | 2017-10-15 | 3549.0 |
|      5.0 |    Nereida |   Magwood | Female | United States | 58.0 | 2016-08-16 | 2468.0 |
user>


;;Loading from the web is no problem
user>
user> (def airports (ds/->dataset "https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat"
                                  {:header-row? false :file-type :csv}))
#'user/airports
user> (ds/head airports)
https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat [5 14]:

| column-0 |                                    column-1 |     column-2 |         column-3 | column-4 | column-5 |    column-6 |     column-7 | column-8 | column-9 | column-10 |            column-11 | column-12 |   column-13 |
|----------|---------------------------------------------|--------------|------------------|----------|----------|-------------|--------------|----------|----------|-----------|----------------------|-----------|-------------|
|        1 |                              Goroka Airport |       Goroka | Papua New Guinea |      GKA |     AYGA | -6.08168983 | 145.39199829 |     5282 |     10.0 |         U | Pacific/Port_Moresby |   airport | OurAirports |
|        2 |                              Madang Airport |       Madang | Papua New Guinea |      MAG |     AYMD | -5.20707989 | 145.78900147 |       20 |     10.0 |         U | Pacific/Port_Moresby |   airport | OurAirports |
|        3 |                Mount Hagen Kagamuga Airport |  Mount Hagen | Papua New Guinea |      HGU |     AYMH | -5.82678986 | 144.29600525 |     5388 |     10.0 |         U | Pacific/Port_Moresby |   airport | OurAirports |
|        4 |                              Nadzab Airport |       Nadzab | Papua New Guinea |      LAE |     AYNZ | -6.56980300 | 146.72597700 |      239 |     10.0 |         U | Pacific/Port_Moresby |   airport | OurAirports |
|        5 | Port Moresby Jacksons International Airport | Port Moresby | Papua New Guinea |      POM |     AYPY | -9.44338036 | 147.22000122 |      146 |     10.0 |         U | Pacific/Port_Moresby |   airport | OurAirports |

;;At any point you can get a sequence of maps back.  We implement a special version
;;of Clojure's APersistentMap that is much more efficient than even records and shares
;;the backing store with the dataset.

user> (take 2 (ds/mapseq-reader csv-data))
({"date" #object[java.time.LocalDate 0x4a998af0 "2000-01-01"],
  "symbol" "MSFT",
  "price" 39.81}
 {"date" #object[java.time.LocalDate 0x6d8c0bcd "2000-02-01"],
  "symbol" "MSFT",
  "price" 36.35})

;;Datasets are comprised of named columns, and provide a Clojure hashmap-compatible
;;collection.  Datasets allow reading and updating column data associated with a column name,
;;and provide a sequential view of [column-name column] entries.

;;You can look up columns via `get`, keyword lookup, and invoking the dataset as a function on
;;a key (a column name). `keys` and `vals` retrieve respective sequences of column names and columns.
;;The functions `assoc` and `dissoc` work to define new associations to conveniently
;;add, update, or remove columns, with add/update semantics defined by `tech.v3.dataset/add-or-update-column`.
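
;;For example, using the stocks dataset loaded above:
user> (keys csv-data)
("symbol" "date" "price")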

;;Column data is stored in primitive arrays (even most datetimes!) and strings are stored
;;in string tables.  You can load really large datasets with this thing!

;;Columns themselves are sequences of their entries.
user> (csv-data "symbol")
#tech.v3.dataset.column<string>[560]
symbol
[MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, ...]
user> (xls-data "Gender")
#tech.v3.dataset.column<string>[1000]
Gender
[Female, Female, Male, Female, Female, Male, Female, Female, Female, Female, Female, Male, Female, Male, Female, Female, Female, Female, Female, Female, ...]
user> (take 5 (xls-data "Gender"))
("Female" "Female" "Male" "Female" "Female")


;;Datasets and columns implement the clojure metadata interfaces (`meta`, `with-meta`, `vary-meta`)

;;You can access a sequence of columns of a dataset with `ds/columns`, or `vals` like a map,
;;and access the metadata with `meta`:

user> (->> csv-data
           vals  ;synonymous with ds/columns
           (map (fn [column]
                  (meta column))))
({:categorical? true, :name "symbol", :size 560, :datatype :string}
 {:name "date", :size 560, :datatype :packed-local-date}
 {:name "price", :size 560, :datatype :float32})

;;We can similarly destructure datasets like normal clojure
;;maps:

user> (for [[k column] csv-data]
        [k (meta column)])
(["symbol" {:categorical? true, :name "symbol", :size 560, :datatype :string}]
 ["date" {:name "date", :size 560, :datatype :packed-local-date}]
 ["price" {:name "price", :size 560, :datatype :float64}])

user> (let [{:strs [symbol date]} csv-data]
        [symbol (meta date)])
[#tech.v3.dataset.column<string>[560]
symbol
[MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, ...]
 {:name "date", :size 560, :datatype :packed-local-date}]

;;We can get a brief description of the dataset:

user> (ds/brief csv-data)
({:min #object[java.time.LocalDate 0x5b2ea1d5 "2000-01-01"],
  :n-missing 0,
  :col-name "date",
  :mean #object[java.time.LocalDate 0x729b7395 "2005-05-12"],
  :datatype :packed-local-date,
  :quartile-3 #object[java.time.LocalDate 0x6c75fa43 "2007-11-23"],
  :n-valid 560,
  :quartile-1 #object[java.time.LocalDate 0x13d9aabe "2002-11-08"],
  :max #object[java.time.LocalDate 0x493bf7ef "2010-03-01"]}
 {:min 5.97,
  :n-missing 0,
  :col-name "price",
  :mean 100.7342857142857,
  :datatype :float64,
  :skew 2.4130946430619233,
  :standard-deviation 132.55477114107083,
  :quartile-3 100.88,
  :n-valid 560,
  :quartile-1 24.169999999999998,
  :max 707.0}
 {:mode "MSFT",
  :values ["MSFT" "AMZN" "IBM" "AAPL" "GOOG"],
  :n-values 5,
  :n-valid 560,
  :col-name "symbol",
  :n-missing 0,
  :datatype :string,
  :histogram (["MSFT" 123] ["AMZN" 123] ["IBM" 123] ["AAPL" 123] ["GOOG" 68])})

;;Another view of that brief:

user> (ds/descriptive-stats csv-data)
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv: descriptive-stats [3 10]:

| :col-name |          :datatype | :n-valid | :n-missing |       :min |      :mean | :mode |       :max | :standard-deviation |      :skew |
|-----------|--------------------|----------|------------|------------|------------|-------|------------|---------------------|------------|
|      date | :packed-local-date |      560 |          0 | 2000-01-01 | 2005-05-12 |       | 2010-03-01 |                     |            |
|     price |           :float64 |      560 |          0 |      5.970 |      100.7 |       |      707.0 |        132.55477114 | 2.41309464 |
|    symbol |            :string |      560 |          0 |            |            |  MSFT |            |                     |            |


;;There are analogues of the clojure.core functions that apply to datasets:
;;filter, group-by, sort-by.  These are all implemented efficiently.
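
;;A minimal sketch using the column-specialized variants (function names per
;;the v3 api - check the docs for the full set):
user> (ds/row-count (ds/filter-column csv-data "symbol" #(= % "MSFT")))
123
user> (count (ds/group-by-column csv-data "symbol"))
5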

;;You can add/remove/update columns, or use the map idioms of `assoc` and `dissoc`

user> (-> csv-data
          (assoc "always-ten" 10) ;scalar values are expanded as needed
          (assoc "random"   (repeatedly (ds/row-count csv-data) #(rand-int 100)))
          ds/head)
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv [5 5]:

| symbol |       date | price | always-ten | random |
|--------|------------|-------|------------|--------|
|   MSFT | 2000-01-01 | 39.81 |         10 |     47 |
|   MSFT | 2000-02-01 | 36.35 |         10 |     35 |
|   MSFT | 2000-03-01 | 43.22 |         10 |     54 |
|   MSFT | 2000-04-01 | 28.37 |         10 |      6 |
|   MSFT | 2000-05-01 | 25.45 |         10 |     52 |

user> (-> csv-data
          (dissoc "price")
          ds/head)
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv [5 2]:

| symbol |       date |
|--------|------------|
|   MSFT | 2000-01-01 |
|   MSFT | 2000-02-01 |
|   MSFT | 2000-03-01 |
|   MSFT | 2000-04-01 |
|   MSFT | 2000-05-01 |


;;Since `conj` works as it does with clojure maps and sequences of map-entries
;;or pairs, you can use idioms like `reduce conj` or `into` to construct new
;;datasets on the fly with familiar clojure idioms:

user> (let [new-cols [["always-ten" 10] ["new-price" (map inc (csv-data "price"))]]
            new-data (into (dissoc csv-data "price") new-cols)]
        (ds/head new-data))
https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv [5 4]:

| symbol |       date | always-ten | new-price |
|--------|------------|------------|-----------|
|   MSFT | 2000-01-01 |         10 |     40.81 |
|   MSFT | 2000-02-01 |         10 |     37.35 |
|   MSFT | 2000-03-01 |         10 |     44.22 |
|   MSFT | 2000-04-01 |         10 |     29.37 |
|   MSFT | 2000-05-01 |         10 |     26.45 |

;;You can write the result back out to csv, tsv, and gzipped variations of those.
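;;For example (a sketch - ds/write! infers the format and gzip compression from
;;the extension; the output path here is just an example):
user> (ds/write! csv-data "stocks-copy.csv.gz")
nil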

;;Joins (left, right, inner) are all implemented.
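;;e.g., a hedged sketch assuming the tech.v3.dataset.join namespace and a
;;hypothetical second dataset `company-info` sharing the "symbol" column:
;;  (require '[tech.v3.dataset.join :as ds-join])
;;  (ds-join/left-join "symbol" csv-data company-info)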

;;Columnwise arithmetic manipulations (+, -, and many more) are provided via the
;;tech.v3.datatype.functional namespace.

;;Datetime columns can be operated on - plus, minus, get-years, get-days, and
;;many more - uniformly via the tech.v3.datatype.datetime namespace.
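
;;For example, a minimal sketch of columnwise arithmetic (scalars broadcast):
user> (require '[tech.v3.datatype.functional :as dfn])
nil
user> (take 5 (dfn/* 2.0 (csv-data "price")))
(79.62 72.7 86.44 56.74 50.9)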

;;There is much more.  Please check out the walkthrough and try it out!

Arrow Dependencies

;; project.clj -

;;Unsafe works with graal native
[org.apache.arrow/arrow-memory-unsafe "2.0.0"]
[org.apache.arrow/arrow-memory-core "2.0.0"]
[org.apache.arrow/arrow-vector "2.0.0" :exclusions [commons-codec]]

;;require -
(require '[tech.v3.libs.arrow])
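
;;A hedged sketch of round-tripping a dataset; the function names here are
;;assumed from the tech.v3.libs.arrow namespace and may differ across versions:
(tech.v3.libs.arrow/write-dataset-to-stream! csv-data "stocks.arrow")
(tech.v3.libs.arrow/read-stream-dataset-copying "stocks.arrow")
;;an in-place (mmap) variant also exists for the larger-than-RAM pathway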

Parquet Support

Parquet now has first-class support, which means we should be able to load most Parquet files. Failure to load a parquet file quickly and accurately, with correct missing values, is an issue.

Parquet Dependencies

;;namespace needs to be required for parquet support
(require '[tech.v3.libs.parquet])

;;Also works with 1.10.X
org.apache.parquet/parquet-hadoop {:mvn/version "1.11.0"}
org.apache.hadoop/hadoop-common {:mvn/version "3.1.1"}
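
;;After the require above, ->dataset should read parquet paths directly
;;(the file name here is hypothetical):
(ds/->dataset "data.parquet")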



License

Copyright © 2019 Complements of TechAscent, LLC

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.
