
AlexMili / torch-dataframe

Licence: MIT
Utility class to manipulate datasets from CSV files

Programming Languages

  • Lua
  • Jupyter Notebook

Projects that are alternatives of or similar to torch-dataframe

Tech.ml.dataset
A Clojure high performance data processing system
Stars: ✭ 205 (+205.97%)
Mutual labels:  csv, dataframe
tv
📺(tv) Tidy Viewer is a cross-platform CLI csv pretty printer that uses column styling to maximize viewer enjoyment.
Stars: ✭ 1,763 (+2531.34%)
Mutual labels:  csv, dataframe
MPowerTCX
Share stationary bike data with Strava, Garmin Connect and Golden Cheetah
Stars: ✭ 22 (-67.16%)
Mutual labels:  csv
uzbekistan-regions-data
Full database of Uzbekistan regions available in JSON, SQL & CSV format. All regions, districts & quarters with Latin, Cyrillic and Russian versions (districts (tumans) of the Republic of Uzbekistan and cities of regional (republican) subordination).
Stars: ✭ 46 (-31.34%)
Mutual labels:  csv
flambeau
Nim bindings to libtorch
Stars: ✭ 60 (-10.45%)
Mutual labels:  torch
convey
CSV processing and web related data types mutual conversion
Stars: ✭ 16 (-76.12%)
Mutual labels:  csv
tabtools
🔧 SQL for csv file in UNIX command line with awk.
Stars: ✭ 16 (-76.12%)
Mutual labels:  csv
pandoc-placetable
Pandoc filter to include CSV data (from file or URL)
Stars: ✭ 35 (-47.76%)
Mutual labels:  csv
dogETL
A lib to transform data between JDBC, CSV and JSON formats.
Stars: ✭ 15 (-77.61%)
Mutual labels:  csv
hamilton
A scalable general purpose micro-framework for defining dataflows. You can use it to create dataframes, numpy matrices, python objects, ML models, etc.
Stars: ✭ 612 (+813.43%)
Mutual labels:  dataframe
burp-suite-http-proxy-history-converter
Python script that converts Burp Suite HTTP proxy history files to CSV or HTML
Stars: ✭ 63 (-5.97%)
Mutual labels:  csv
mdtable2csv
convert tables in .md to .csv
Stars: ✭ 91 (+35.82%)
Mutual labels:  csv
csvlixir
A CSV reading/writing application for Elixir.
Stars: ✭ 32 (-52.24%)
Mutual labels:  csv
elastic-query-export
🚚 Export Data from ElasticSearch to CSV/JSON using a Lucene Query (e.g. from Kibana) or a raw JSON Query string
Stars: ✭ 56 (-16.42%)
Mutual labels:  csv
sentence2vec
Deep sentence embedding using Sequence to Sequence learning
Stars: ✭ 23 (-65.67%)
Mutual labels:  torch
fastapi-csv
🏗️ Create APIs from CSV files within seconds, using fastapi
Stars: ✭ 46 (-31.34%)
Mutual labels:  csv
eec
A fast and low-memory Excel write/read tool (a non-POI, streaming, efficient and ultra-low-memory Excel reader/writer).
Stars: ✭ 93 (+38.81%)
Mutual labels:  csv
artemis cli
A command-line application for tutors to more productively grade programming exercises on ArTEMiS
Stars: ✭ 12 (-82.09%)
Mutual labels:  csv
Emma
Emma Memory and Mapfile Analyser
Stars: ✭ 21 (-68.66%)
Mutual labels:  csv
strapi-plugin-import-export-content
CSV and JSON import/export content plugin for Strapi
Stars: ✭ 129 (+92.54%)
Mutual labels:  csv

Licence MIT Build Status codecov

Dataframe

Dataframe is a Torch7 class to load and manipulate tabular data (e.g. Kaggle-style CSVs), inspired by R's and pandas' data frames.

As of release 1.5 it fully supports the torchnet data structure. It also has custom iterators for convenient integration with torchnet's engines; see the mnist example. As of release 1.6 the internal storage has changed to tensors.

For a more detailed look at the changes between versions, have a look at the NEWS file.

Requirements

Installation

You can clone this repository or directly install it through luarocks:

git clone https://github.com/AlexMili/torch-dataframe
cd torch-dataframe
luarocks make rocks/torch-dataframe-scm-1.rockspec

or the same in one line:

luarocks install torch-dataframe scm-1

or

luarocks install torch-dataframe

Changelog

Version: 1.7

  • Added faster torch.Tensor functions to fill/stat functions for speed
  • Added mutate function to Dataseries
  • __index__ access for Df_Array
  • More complete documentation for Df_Array and specs
  • Df_Dict elements can be accessed using myDict[index] or myDict["$colname"] (see the short sketch after this list)
  • Df_Dict key property available. It lists the Df_Dict's keys
  • Df_Dict length property available. It lists, by key, the length of its content
  • Df_Dict check_length() checks whether all elements have the same length
  • Df_Dict set_keys(table) replaces every key with the given table (must be the same size)
  • More complete documentation for Df_Dict and specs
  • More complete documentation for Df_Tbl and specs
  • Internal methods _infer_csvigo_schema() and _infer_data_schema() renamed to _infer_schema()
  • Type inference is now based on type frequencies, but if it encounters a single double/float in an integer column it will treat the column as double/float
  • It is now possible to directly set a schema for a Dataframe without any checks using set_schema(). Use it wisely
  • Possibility to init a Dataframe with a schema, a column order and a number of rows with internal method _init_with_schema()
  • Added bulk_load_csv() method which loads large CSV files using threads but without checking missing values or data integrity. Use with caution. See #28
  • Added load_threadcsv()
  • Added the possibility to create empty Dataseries
  • Added Dataseries load() method to directly load a tensor or tds.Vec in memory without any check
  • Added iris dataset in /specs/data
  • New specs structure
  • Fixed CSV loading when there is no header, and added a test case for it
  • Changed assert_is_index return value to true on success instead of self
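
As a rough illustration of the Df_Dict additions above (a sketch only; the access patterns and method names are taken from the changelog entries and should be verified against your version):

local my_dict = Df_Dict{col_a = {1, 2, 3}, col_b = {4, 5, 6}}

print(my_dict["$col_a"]) -- access a column by name
print(my_dict[1])        -- access by index
my_dict:check_length()   -- checks that all elements have the same length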

See NEWS.md file for previous changes.

Usage

Named arguments

The Dataframe relies on argcheck for parsing arguments. This means that you can use named parameters via the function{arg_name=value} syntax. Named arguments are supported by all functions except the constructor, and in certain functions they are mandatory in order to avoid ambiguity.

The argcheck package also works as the API documentation. It checks arguments and if you happen to provide the function with invalid arguments it will automatically output the function documentation.

Important: Due to limitations in the Lua language the package uses helper classes to separate plain Lua tables that carry data from regular table arguments (see the short sketch after the list). The three classes are:

  • Df_Array - contains only values and no keys
  • Df_Dict - a dictionary table that has named keys that map to all values
  • Df_Tbl - a raw table wrapper that does a shallow argument copy
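
A minimal sketch of how these wrappers are used when passing plain Lua tables; the insert call mirrors the Manipulate section further down, while the Df_Tbl line is an assumption about typical usage:

local arr  = Df_Array(1, 2, 3)                  -- plain values, no keys
local dict = Df_Dict{first_column = {1, 2, 3}}  -- named keys mapping to values
local tbl  = Df_Tbl({a = 1, b = 2})             -- shallow copy of a raw table (assumed usage)

df:insert(dict) -- e.g. insert new rows described by a Df_Dict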

Load data

Initiate the object:

require 'Dataframe'
df = Dataframe()

Load CSV file:

df:load_csv{path='./data/training.csv', header=true}

Load from table:

df:load_table{data=Df_Dict{firstColumn={1,2,3},
                           secondColumn={4,5,6}}}

You can also instantiate the object with a csv-filename or a table by passing the table or filename as an argument:

require 'Dataframe'
df = Dataframe('./data/training.csv')
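
Passing a table at construction presumably mirrors load_table and uses the Df_Dict wrapper; a sketch, assuming the constructor accepts a Df_Dict directly:

df = Dataframe(Df_Dict{firstColumn={1,2,3},
                       secondColumn={4,5,6}})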

Data inspection

You can discover your dataset with the following functions:

-- you can either view the data as a plain text output or itorch html table
df:output() -- prints html if in itorch otherwise prints plain table
df:output{html=true} -- forces html output

df:show() -- prints the head + tail of the table

-- You can also directly call print() on the object
-- and it will print the ascii-table
print(df)

General dataset information can be found using:

df:shape() -- prints {rows=3, cols=3}
#df -- gets the number of rows
df:size() -- returns a tensor with the number of rows and columns
df.column_order -- table of column names
df:count_na() -- prints the number of missing values by column name

If you want to inspect random elements you can use get_random():

df:get_random(10):output()

Manipulate

You can manipulate it:

df:insert(Df_Dict({['first_column']={7,8,9},['second_column']={10,11,12}}))
df:remove_index(3) -- remove line 3 of the entire dataset

df:has_column('x') -- returns true if the column exists
df:get_column('y') -- returns column 'y' as a table
df["$y"] -- alias for get_column

df:add_column('z', 0) -- Add column with default value 0 at the end (right side of the table)
df:add_column('first_column', 1, 2) -- Add column with default value 2 at the beginning (left side of the table)
df:drop('x') -- delete the column
df:rename_column('x', 'y') -- rename column 'x' to 'y'

df:reset_column('my_col', 0) -- reset the given column with 0
df:fill_na('x', 0) -- replace missing values in 'x' column with 0
df:fill_all_na(0) -- replace all missing values with the value 0

df:unique('col_name') -- return table with unique values of the given column
df:unique('col_name', true) -- return table with unique values of the given column as keys

df:where('column_name','my_value') -- find the first row where the column has the given value

-- Update all rows matching the condition defined in the first lambda
df:update(function(row) return row['column'] == 'test' end,
          function(row) row['other_column'] = 'new_value' return row end)

Categorical variables

You can define categorical variables that will be treated internally as numbers ranging from 1 to n levels while being displayed as strings. The numeric representation is retained when exporting with to_tensor, allowing a simpler understanding of a classifier's output:

df:as_categorical('my string column') -- converts a column to categorical
df:get_cat_keys('my string column') -- retrieves the keys used for the conversion
df:to_categorical(Df_Array({1,2,1}), 'my string column') -- converts numbers to the categories
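
For instance, after converting a column you can export it with the to_tensor call described in the Exporting section below; the resulting tensor then holds the numeric levels rather than the original strings:

df:as_categorical('my string column')
local t = df:to_tensor{columns = Df_Array('my string column')} -- t contains the numeric levels (1..n)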

Subsetting

You can subset your data using:

df:head(20) -- prints the first 20 rows (10 by default)
df:tail(5) -- prints the last 5 rows (10 by default)
df:show() -- prints the first 10 and last 10 rows

df[13] -- returns a table with the row values
df["13:17"] -- returns a Dataframe with values in that span
df["13:"] -- returns a Dataframe with values starting from index 13
df[Df_Array(1,3,4)] -- returns a Dataframe with rows 1, 3 and 4

Exporting

Finally, you can export your dataset to a tensor (only numerical/categorical columns will be included):

df:to_tensor{filename = './data/train.th7'} -- saves data
data = df:to_tensor{columns = Df_Array('first_column', 'my string column')} -- converts the two columns into a tensor

or to CSV:

df:to_csv('data.csv')

Batch loading

The Dataframe provides a built-in system for handling batch loading. It also has an extensive set of samplers that you can use; see the API docs for which ones are available.

The gist of it is:

  • The main Dataframe is initialized for batch loading by calling create_subsets. This creates random subsets that have their own samplers. The default is a 70% train, 20% validate and 10% test split, but you can choose any split and any names (see the sketch after the simple example below).
  • Each subset is a separate Dataframe subclass that has two columns: (1) indexes with the corresponding index in the main Dataframe, and (2) labels that some of the samplers require.
  • When you want to retrieve a batch from a subset you call it using my_dataframe:get_subset('train'):get_batch(30) or my_dataframe['/train']:get_batch(30).
  • The batch returned is also a subclass that has a custom to_tensor function returning the data and corresponding label tensors. You can provide custom functions that receive the full row as an argument, allowing you to use e.g. a filename column to load an external resource.

A simple example:

local df = Dataframe('my_csv'):
	create_subsets()

local batch = df["/train"]:get_batch(10)
local data, label = batch:to_tensor{
	load_data_fn = my_image_loader
}
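
If you want a different split than the default 70/20/10 you can name the subsets and proportions yourself. A sketch, assuming create_subsets accepts a subsets argument holding a Df_Dict of name/proportion pairs (check the API docs for the exact signature):

local df = Dataframe('my_csv'):
	create_subsets{subsets = Df_Dict{train = 0.8, test = 0.2}}

local batch = df["/test"]:get_batch(10)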

As of version 1.5 you may also want to consider using the iterators that integrate with the torchnet infrastructure. Take a look at the iterator API and the mnist example for how an implementation may look; a rough sketch is also included below.
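
A rough sketch of how such an iterator might be set up (the Df_Iterator name and its arguments are taken from the iterator API and should be verified there):

-- sketch only: see the iterator API and the mnist example for exact usage
local train_iterator = Df_Iterator{
	dataset    = df["/train"],
	batch_size = 32
}
-- the iterator can then be handed to a torchnet engine, e.g. engine:train{iterator = train_iterator, ...}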

Tests

The package contains an extensive test suite and applies a behavior-driven development approach. All features should be accompanied by a test case.

To launch the tests you need to install busted (See: Olivine-Labs/busted) via luarocks:

luarocks install busted

then you can run all tests via command line:

cd specs/
./run_all.sh

Documentation

The package relies on self-documenting functions via the argcheck package; the generated docs reside in the doc folder. The GitHub Wiki is intended for more extensive, detailed documentation.

To generate the documentation please run:

th doc.lua > /dev/null

Contributing

See CONTRIBUTING.md for further details.
