nikolaydubina / go-featureprocessing

Licence: MIT license
🔥 Fast, simple sklearn-like feature processing for Go

go-featureprocessing

Fast, simple sklearn-like feature processing for Go

  • Does not cross cgo boundary
  • No memory allocation
  • No reflection
  • Convenient serialization
  • Generated code has 100% test coverage and benchmarks
  • Fitting
  • UTF-8
  • Parallel batch transform
  • Faster than sklearn in batch mode

//go:generate go run github.com/nikolaydubina/go-featureprocessing/cmd/generate -struct=Employee

type Employee struct {
	Age         int     `feature:"identity"`
	Salary      float64 `feature:"minmax"`
	Kids        int     `feature:"maxabs"`
	Weight      float64 `feature:"standard"`
	Height      float64 `feature:"quantile"`
	City        string  `feature:"onehot"`
	Car         string  `feature:"ordinal"`
	Income      float64 `feature:"kbins"`
	Description string  `feature:"tfidf"`
	SecretValue float64
}

The code above generates a new struct, as well as benchmarks and tests that use google/gofuzz.

employee := Employee{
   Age:         22,
   Salary:      1000.0,
   Kids:        2,
   Weight:      85.1,
   Height:      160.0,
   City:        "Pangyo",
   Car:         "Tesla",
   Income:      9000.1,
   SecretValue: 42,
   Description: "large text fields is not a problem neither, tf-idf can help here too! more advanced NLP will be added later!",
}

var fp EmployeeFeatureTransformer

config, _ := ioutil.ReadFile("employee_feature_processor.json")
json.Unmarshal(config, &fp)

features := fp.Transform(&employee)
// []float64{22, 1, 0.5, 1.0039999999999998, 1, 1, 0, 0, 0, 1, 5, 0.7674945674619879, 0.4532946552278861, 0.4532946552278861}

names := fp.FeatureNames()
// []string{"Age", "Salary", "Kids", "Weight", "Height", "City_Pangyo", "City_Seoul", "City_Daejeon", "City_Busan", "Car", "Income", "Description_text", "Description_problem", "Description_help"}
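In the example output above, the names and the values line up index by index, so the two slices can be zipped to inspect a single sample. A minimal sketch, with the values hard-coded from the comments above so that it runs without the generated transformer; `pairFeatures` is a hypothetical helper and not part of the library:

```go
package main

import "fmt"

// pairFeatures is a hypothetical helper (not part of go-featureprocessing).
// It zips feature names with values, assuming both slices share one order.
func pairFeatures(names []string, values []float64) (map[string]float64, error) {
	if len(names) != len(values) {
		return nil, fmt.Errorf("got %d names but %d values", len(names), len(values))
	}
	out := make(map[string]float64, len(names))
	for i, name := range names {
		out[name] = values[i]
	}
	return out, nil
}

func main() {
	// Names and values copied from the example output above.
	names := []string{"Age", "Salary", "Kids", "Weight", "Height",
		"City_Pangyo", "City_Seoul", "City_Daejeon", "City_Busan",
		"Car", "Income", "Description_text", "Description_problem", "Description_help"}
	values := []float64{22, 1, 0.5, 1.0039999999999998, 1,
		1, 0, 0, 0,
		1, 5, 0.7674945674619879, 0.4532946552278861, 0.4532946552278861}

	named, err := pairFeatures(names, values)
	if err != nil {
		panic(err)
	}
	fmt.Println(named["City_Pangyo"], named["Income"]) // prints: 1 5
}
```

The length check guards against pairing names from one transformer with features from another.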

You can also fit the transformer on data.

fp := EmployeeFeatureTransformer{}
fp.Fit([]Employee{...})

config, _ := json.Marshal(fp)
_ = ioutil.WriteFile("employee_feature_processor.json", config, 0644)

The transformer can be serialized and de-serialized with standard Go routines. The serialized transformer is easy to read, update, and integrate with other tools.

{
   "Age_identity": {},
   "Salary_minmax": {"Min": 500, "Max": 900},
   "Kids_maxabs": {"Max": 4},
   "Weight_standard": {"Mean": 60, "STD": 25},
   "Height_quantile": {"Quantiles": [20, 100, 110, 120, 150]},
   "City_onehot": {"Mapping": {"Pangyo": 0, "Seoul": 1, "Daejeon": 2, "Busan": 3}},
   "Car_ordinal": {"Mapping": {"BMW": 90000, "Tesla": 1}},
   "Income_kbins": {"Quantiles": [1000, 1100, 2000, 3000, 10000]},
   "Description_tfidf": {
      "Mapping": {"help": 2, "problem": 1, "text": 0},
      "Separator": " ",
      "DocCount": [1, 2, 2],
      "NumDocuments": 2,
      "Normalizer": {}
   }
}
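Because the serialized form is plain JSON, another tool can patch a parameter without the generated Go types. The sketch below assumes a two-entry fragment shaped like the config above and decodes it into generic maps with `encoding/json`; `updateParam` is a hypothetical helper for illustration, and it only handles entries whose parameters are plain numbers:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// updateParam is a hypothetical helper (not part of go-featureprocessing).
// It decodes a serialized config into generic maps, overwrites one numeric
// parameter, and re-encodes the result. Entries with non-numeric parameters
// (for example "Mapping" or "Quantiles") are out of scope for this sketch.
func updateParam(config []byte, entry, param string, value float64) ([]byte, error) {
	var params map[string]map[string]float64
	if err := json.Unmarshal(config, &params); err != nil {
		return nil, err
	}
	p, ok := params[entry]
	if !ok || p == nil {
		return nil, fmt.Errorf("no entry %q in config", entry)
	}
	p[param] = value
	return json.Marshal(params)
}

func main() {
	config := []byte(`{"Salary_minmax": {"Min": 500, "Max": 900}, "Kids_maxabs": {"Max": 4}}`)

	updated, err := updateParam(config, "Salary_minmax", "Max", 1200)
	if err != nil {
		panic(err)
	}
	// json.Marshal writes map keys in sorted order.
	fmt.Println(string(updated)) // {"Kids_maxabs":{"Max":4},"Salary_minmax":{"Max":1200,"Min":500}}
}
```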

Or you can manually initialize it.

fp := EmployeeFeatureTransformer{
   Salary: MinMaxScaler{Min: 500, Max: 900},
   Kids:   MaxAbsScaler{Max: 4},
   Weight: StandardScaler{Mean: 60, STD: 25},
   Height: QuantileScaler{Quantiles: []float64{20, 100, 110, 120, 150}},
   City:   OneHotEncoder{Mapping: map[string]uint{"Pangyo": 0, "Seoul": 1, "Daejeon": 2, "Busan": 3}},
   Car:    OrdinalEncoder{Mapping: map[string]uint{"Tesla": 1, "BMW": 90000}},
   Income: KBinsDiscretizer{QuantileScaler: QuantileScaler{Quantiles: []float64{1000, 1100, 2000, 3000, 10000}}},
   Description: TFIDFVectorizer{
      NumDocuments:    2,
      DocCount:        []uint{1, 2, 2},
      CountVectorizer: CountVectorizer{Mapping: map[string]uint{"text": 0, "problem": 1, "help": 2}, Separator: " "},
   },
}
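To see how the parameters above relate to the sample output shown earlier, here is a minimal sketch of the usual min-max, max-abs, and standard scaling formulas. The function names are hypothetical and this is not the library's code; clamping out-of-range values in `minMax` is an assumption inferred from the sample output, where a salary of 1000 with Max 900 maps to 1.

```go
package main

import "fmt"

// minMax rescales v into [0, 1] using the fitted range [lo, hi].
// Clamping values outside the range is an assumption (see lead-in).
func minMax(v, lo, hi float64) float64 {
	if hi == lo || v <= lo {
		return 0
	}
	if v >= hi {
		return 1
	}
	return (v - lo) / (hi - lo)
}

// maxAbs divides v by the largest absolute value seen during fitting.
func maxAbs(v, maxAbsValue float64) float64 {
	if maxAbsValue == 0 {
		return 0
	}
	return v / maxAbsValue
}

// standard subtracts the mean and divides by the standard deviation.
func standard(v, mean, std float64) float64 {
	if std == 0 {
		return 0
	}
	return (v - mean) / std
}

func main() {
	fmt.Println(minMax(1000, 500, 900)) // Salary: clamped to 1
	fmt.Println(maxAbs(2, 4))           // Kids: 0.5
	fmt.Println(standard(85.1, 60, 25)) // Weight: approximately 1.004
}
```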

Benchmarks

For typical use, this struct encoder processes a single sample in ~100ns. How fast do you need to be? Here are some numbers for comparison:

                       0 - C++ FlatBuffers decode
                     ...
                   200ps - 4.6GHz single cycle time
                1ns      - L1 cache latency
               10ns      - L2/L3 cache SRAM latency
               20ns      - DDR4 CAS, first byte from memory latency
               20ns      - C++ raw hardcoded structs access
               80ns      - C++ FlatBuffers decode/traverse/dealloc
 ---------->  100ns      - go-featureprocessing typical processing
              150ns      - PCIe bus latency
              171ns      - Go cgo call boundary, 2015
              200ns      - some High Frequency Trading FPGA claims
              800ns      - Go Protocol Buffers Marshal
              837ns      - Go json-iterator/go json decode
           1µs           - Go Protocol Buffers Unmarshal
           1µs           - High Frequency Trading FPGA
           3µs           - Go JSON Marshal
           7µs           - Go JSON Unmarshal
           9µs           - Go XML Marshal
          10µs           - PCIe/NVLink startup time
          17µs           - Python JSON encode or decode times
          30µs           - UNIX domain socket, eventfd, fifo pipes latency
          30µs           - Go XML Unmarshal
         100µs           - Redis intrinsic latency
         100µs           - AWS DynamoDB + DAX
         100µs           - KDB+ queries
         100µs           - High Frequency Trading direct market access range
         200µs           - 1GB/s network air latency
         200µs           - Go garbage collector latency 2018
         500µs           - NGINX/Kong added latency
     10ms                - AWS DynamoDB
     10ms                - WIFI6 "air" latency
     15ms                - AWS Sagemaker latency
     30ms                - 5G "air" latency
    100ms                - typical roundtrip from mobile to backend
    200ms                - AWS RDS MySQL/PostgreSQL or AWS Aurora
 10s                     - AWS Cloudfront 1MB transfer time

This is significantly faster than sklearn, or calling sklearn from Go, for a small number of samples, and it performs similarly to or faster than sklearn for a large number of samples.

[Figures: bench_log, bench_lin]

For full benchmarks, see /docs/benchmarks. Here is an extract for a typical struct:

goos: darwin
goarch: amd64
pkg: github.com/nikolaydubina/go-featureprocessing/cmd/generate/tests
BenchmarkEmployeeFeatureTransformer_Transform-8                                  	62135674	        206 ns/op	       208 B/op	       1 allocs/op
BenchmarkEmployeeFeatureTransformer_Transform_Inplace-8                          	89993084	        123 ns/op	         0 B/op	       0 allocs/op
BenchmarkEmployeeFeatureTransformer_TransformAll_10elems-8                       	 5921253	       1881 ns/op	      2048 B/op	       1 allocs/op
BenchmarkEmployeeFeatureTransformer_TransformAll_100elems-8                      	  528890	      20532 ns/op	     21760 B/op	       1 allocs/op
BenchmarkEmployeeFeatureTransformer_TransformAll_1000elems-8                     	   53524	     238542 ns/op	    221185 B/op	       1 allocs/op
BenchmarkEmployeeFeatureTransformer_TransformAll_10000elems-8                    	    4879	    2267683 ns/op	   2007048 B/op	       1 allocs/op
BenchmarkEmployeeFeatureTransformer_TransformAll_100000elems-8                   	     475	   23257147 ns/op	  20004876 B/op	       1 allocs/op
BenchmarkEmployeeFeatureTransformer_TransformAll_1000000elems-8                  	      46	  284763749 ns/op	 192004098 B/op	       1 allocs/op
BenchmarkEmployeeFeatureTransformer_TransformAll_10elems_8workers-8              	 1552704	       7362 ns/op	      2064 B/op	       2 allocs/op
BenchmarkEmployeeFeatureTransformer_TransformAll_100elems_8workers-8             	  412455	      29814 ns/op	     21776 B/op	       2 allocs/op
BenchmarkEmployeeFeatureTransformer_TransformAll_1000elems_8workers-8            	   63822	     177183 ns/op	    213008 B/op	       2 allocs/op
BenchmarkEmployeeFeatureTransformer_TransformAll_10000elems_8workers-8           	    8704	    1505994 ns/op	   2162707 B/op	       2 allocs/op
BenchmarkEmployeeFeatureTransformer_TransformAll_100000elems_8workers-8          	     800	   15840396 ns/op	  21602323 B/op	       2 allocs/op
BenchmarkEmployeeFeatureTransformer_TransformAll_1000000elems_8workers-8         	      72	  139700740 ns/op	 192004112 B/op	       2 allocs/op
BenchmarkEmployeeFeatureTransformer_TransformAll_5000000elems_8workers-8         	       9	 1720488586 ns/op       1040007184 B/op	       2 allocs/op
BenchmarkEmployeeFeatureTransformer_TransformAll_15000000elems_8workers-8        	       1	14009776007 ns/op       3240001552 B/op	       2 allocs/op
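The `_8workers` benchmarks above exercise the parallel batch transform. One common way to build such a transform is to split the input into contiguous chunks and let each worker write into its own region of a single preallocated output slice. The sketch below illustrates that pattern only; `sample`, `transformInto`, and `transformAllParallel` are hypothetical names, and the library's generated code may be organized differently.

```go
package main

import (
	"fmt"
	"sync"
)

// sample and transformInto are hypothetical stand-ins for a feature struct
// and its in-place transform; each sample yields numFeatures values.
type sample struct{ A, B float64 }

const numFeatures = 2

func transformInto(dst []float64, s *sample) {
	dst[0] = s.A
	dst[1] = s.B * 2
}

// transformAllParallel splits samples into contiguous chunks, one per worker.
// Each worker writes only to its own region of one flat output slice,
// so no locking is needed.
func transformAllParallel(samples []sample, workers int) []float64 {
	out := make([]float64, len(samples)*numFeatures)
	if workers < 1 {
		workers = 1
	}
	chunk := (len(samples) + workers - 1) / workers // ceiling division

	var wg sync.WaitGroup
	for start := 0; start < len(samples); start += chunk {
		end := start + chunk
		if end > len(samples) {
			end = len(samples)
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			for i := start; i < end; i++ {
				transformInto(out[i*numFeatures:(i+1)*numFeatures], &samples[i])
			}
		}(start, end)
	}
	wg.Wait()
	return out
}

func main() {
	samples := []sample{{1, 2}, {3, 4}, {5, 6}}
	fmt.Println(transformAllParallel(samples, 2)) // [1 4 3 8 5 12]
}
```

For small batches, the cost of starting goroutines can outweigh the gain, which is consistent with the 10-element and 100-element rows above being slower with 8 workers than without.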

[beta] Reflection-based version

If you can't use the go:generate version, you can try the reflection-based version. Note that the reflection version introduces overhead that is particularly noticeable if your struct has many fields: expect a ~2x time increase for a struct with large composite transformers, and a ~20x time increase for a struct with 32 fields. Some features, such as serialization and de-serialization, are not supported yet.

Benchmarks:

goos: darwin
goarch: amd64

// reflection
pkg: github.com/nikolaydubina/go-featureprocessing/structtransformer
BenchmarkStructTransformerTransform_32fields-4                           1732573              2079 ns/op             512 B/op          2 allocs/op

// non-reflection
pkg: github.com/nikolaydubina/go-featureprocessing/cmd/generate/tests
BenchmarkWith32FieldsFeatureTransformer_Transform-8                     31678317	       116 ns/op	     256 B/op	       1 allocs/op
BenchmarkWith32FieldsFeatureTransformer_Transform_Inplace-8           	80729049	        43 ns/op	       0 B/op	       0 allocs/op
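To give a sense of where per-field overhead can come from, the sketch below walks a struct's fields at run time with the standard `reflect` package and reads the `feature` tags. It is an illustration only, not the code in the `structtransformer` package; `featureTags` and the trimmed-down `Employee` struct are assumptions made for this example.

```go
package main

import (
	"fmt"
	"reflect"
)

// Employee is a trimmed-down copy of the struct from the example above.
type Employee struct {
	Age         int     `feature:"identity"`
	Salary      float64 `feature:"minmax"`
	SecretValue float64 // no tag, so it is skipped
}

// featureTags is a hypothetical helper: it inspects the struct type at run
// time and returns the `feature` tag of every tagged field. This per-field
// work is what generated code can avoid by resolving fields at compile time.
func featureTags(v interface{}) map[string]string {
	tags := make(map[string]string)
	t := reflect.TypeOf(v)
	if t == nil {
		return tags
	}
	if t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return tags
	}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if tag, ok := f.Tag.Lookup("feature"); ok {
			tags[f.Name] = tag
		}
	}
	return tags
}

func main() {
	tags := featureTags(&Employee{})
	fmt.Println(len(tags), tags["Salary"]) // 2 minmax
}
```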

Profiling

Profiling the benchmarks for a struct with 32 fields shows that the reflection version takes much longer and spends its time in what looks like reflection-related code. Meanwhile, the go:generate version is fast enough to be comparable to the testing routines themselves, and it spends 50% of its time allocating the single output slice. That is a good sign, since it means memory access is the bottleneck. Run make profile to generate profiles. Flamegraphs were produced from pprof output with https://www.speedscope.app/.

gencode: [flamegraphs: gencode, gencode_selected]

reflect: [flamegraph: reflect]
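The cost of allocating the output slice, as opposed to writing into a caller-provided buffer, can be checked without a full profile by using `testing.AllocsPerRun` from the standard library. A minimal sketch, assuming a hypothetical two-field `point` struct and hand-written `transform` and `transformInplace` functions that stand in for the generated `Transform` and in-place variants:

```go
package main

import (
	"fmt"
	"testing"
)

// point, transform, and transformInplace are hypothetical stand-ins
// for a feature struct and its generated transform methods.
type point struct{ X, Y float64 }

// sink is package-level so the slice returned by transform escapes to the heap.
var sink []float64

// transform allocates a fresh output slice on every call.
func transform(p *point) []float64 {
	out := make([]float64, 2)
	transformInplace(out, p)
	return out
}

// transformInplace writes into a slice owned by the caller.
func transformInplace(dst []float64, p *point) {
	dst[0] = p.X
	dst[1] = p.Y
}

func main() {
	p := point{X: 1, Y: 2}
	buf := make([]float64, 2)

	// AllocsPerRun reports the average number of heap allocations per call.
	allocating := testing.AllocsPerRun(1000, func() { sink = transform(&p) })
	inplace := testing.AllocsPerRun(1000, func() { transformInplace(buf, &p) })

	fmt.Println("allocs per call, allocating:", allocating)
	fmt.Println("allocs per call, in-place:  ", inplace)
}
```

Reusing one buffer across calls is the same idea as the `Transform_Inplace` rows in the benchmarks above, which report 0 allocs/op.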
