

spotify / Featran

Licence: apache-2.0

A Scala feature transformation library for data science and machine learning

Programming Languages

scala

5932 projects

Labels

tensorflow data spark ml xgboost flink

Projects that are alternatives of or similar to Featran

Data science blogs

A repository to keep track of all the code that I end up writing for my blog posts.

Stars: ✭ 139 (-66.9%)

Mutual labels: spark, data, xgboost

Benchm Ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).

Stars: ✭ 1,835 (+336.9%)

Mutual labels: spark, xgboost

Datacompy

Pandas and Spark DataFrame comparison for humans

Stars: ✭ 147 (-65%)

Mutual labels: spark, data

Sparkstreaming

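💥 🚀 Wraps Spark Streaming to adjust the batch time dynamically (computation runs as soon as data arrives); 🚀 supports adding and removing topics at runtime; 🚀 wraps Spark Streaming 1.6 with Kafka 0.10 to support SSL.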

Stars: ✭ 179 (-57.38%)

Mutual labels: spark, flink

Quicksql

A Flexible, Fast, Federated(3F) SQL Analysis Middleware for Multiple Data Sources

Stars: ✭ 1,821 (+333.57%)

Mutual labels: spark, flink

Ecommercerecommendsystem

Real-time big data product recommendation system. Front end: Vue + TypeScript + ElementUI; back end: Spring + Spark

Stars: ✭ 139 (-66.9%)

Mutual labels: spark, flink

Transmogrifai

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning

Stars: ✭ 2,084 (+396.19%)

Mutual labels: spark, ml

Java learning practice

The road to advanced Java: frequently asked interview algorithms, Akka, multithreading, NIO, Netty, SpringBoot, Spark && Flink, etc.

Stars: ✭ 110 (-73.81%)

Mutual labels: spark, flink

neptune-client

📒 Experiment tracking tool and model registry

Stars: ✭ 348 (-17.14%)

Mutual labels: ml, xgboost

fastdata-cluster

Fast Data Cluster (Apache Cassandra, Kafka, Spark, Flink, YARN and HDFS with Vagrant and VirtualBox)

Stars: ✭ 20 (-95.24%)

Mutual labels: spark, flink

Agile data code 2

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition

Stars: ✭ 413 (-1.67%)

Mutual labels: spark, data

Feast

Feature Store for Machine Learning

Stars: ✭ 2,576 (+513.33%)

Mutual labels: spark, ml

Hadoopcryptoledger

Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive

Stars: ✭ 126 (-70%)

Mutual labels: spark, flink

Sk Dist

Distributed scikit-learn meta-estimators in PySpark

Stars: ✭ 260 (-38.1%)

Mutual labels: spark, ml

Waterdrop

Production Ready Data Integration Product, documentation：

Stars: ✭ 1,856 (+341.9%)

Mutual labels: spark, flink

Big Whale

Scheduling of offline tasks and monitoring of real-time tasks for Spark, Flink, and similar engines

Stars: ✭ 163 (-61.19%)

Mutual labels: spark, flink

Flink Learning

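Flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, internals, hands-on practice, performance tuning, and source code analysis. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, Table API & SQL, and more, plus case studies of large production Flink projects (PV/UV, log storage, real-time deduplication of ten billion records, monitoring and alerting). Support for my column "Big Data Real-Time Computing Engine: Flink in Action and Performance Optimization" is welcome.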

Stars: ✭ 11,378 (+2609.05%)

Mutual labels: spark, flink

Pyspark Cheatsheet

🐍 Quick reference guide to common patterns & functions in PySpark.

Stars: ✭ 108 (-74.29%)

Mutual labels: spark, data

Mmlspark

Simple and Distributed Machine Learning

Stars: ✭ 2,899 (+590.24%)

Mutual labels: spark, ml

awesome-AI-kubernetes

❄️ 🐳 Awesome tools and libs for AI, Deep Learning, Machine Learning, Computer Vision, Data Science, Data Analytics and Cognitive Computing that are baked in the oven to be Native on Kubernetes and Docker with Python, R, Scala, Java, C#, Go, Julia, C++ etc

Stars: ✭ 95 (-77.38%)

Mutual labels: spark, ml


featran

Featran, also known as Featran77 or F77 (get it?), is a Scala library for feature transformation. It aims to simplify the time-consuming task of feature engineering in data science and machine learning processes. It supports various collection types for feature extraction and various output formats for feature representation.

Introduction

Most feature transformation logic requires two steps: a global aggregation that summarizes the data, followed by an element-wise mapping that transforms each record. For example:

Min-Max Scaler
- Aggregation: global min & max
- Mapping: rescale each value from [min, max] to [0, 1]
One-Hot Encoder
- Aggregation: distinct labels
- Mapping: convert each label to a binary vector

We can implement this in a naive way using reduce and map.

case class Point(score: Double, label: String)
val data = Seq(Point(1.0, "a"), Point(2.0, "b"), Point(3.0, "c"))

// Aggregation: a single pass computing (global min, global max, distinct labels)
val a = data
  .map(p => (p.score, p.score, Set(p.label)))
  .reduce((x, y) => (math.min(x._1, y._1), math.max(x._2, y._2), x._3 ++ y._3))

// Mapping: the min-max scaled score followed by the one-hot encoding of the label
val features = data.map { p =>
  (p.score - a._1) / (a._2 - a._1) :: a._3.toList.sorted.map(s => if (s == p.label) 1.0 else 0.0)
}

This approach quickly becomes unmanageable for complex feature sets. The same logic can be expressed concisely in Featran.

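import com.spotify.featran._
import com.spotify.featran.transformers._

// Declare which field feeds which transformer
val fs = FeatureSpec.of[Point]
  .required(_.score)(MinMaxScaler("min-max"))
  .required(_.label)(OneHotEncoder("one-hot"))

// Run the aggregation and mapping steps over the input collection
val fe = fs.extract(data)
// Names of the extracted features
val names = fe.featureNames
// Feature values, one Seq[Double] per input record
val features = fe.featureValues[Seq[Double]]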
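
The two-step pattern from the introduction can also be written as a small abstraction in plain Scala. This is only a sketch for illustration: SimpleTransformer, MinMax and OneHot are hypothetical names and do not reflect how Featran is implemented.

```scala
// Hypothetical two-step abstraction; these names are illustrative, not Featran's API.
trait SimpleTransformer[A, C] {
  // Step 1: global aggregation that summarizes all values
  def aggregate(values: Seq[A]): C
  // Step 2: element-wise mapping that uses the summary
  def transform(value: A, summary: C): Seq[Double]

  def apply(values: Seq[A]): Seq[Seq[Double]] = {
    val summary = aggregate(values)
    values.map(transform(_, summary))
  }
}

object MinMax extends SimpleTransformer[Double, (Double, Double)] {
  def aggregate(values: Seq[Double]): (Double, Double) = (values.min, values.max)
  def transform(value: Double, summary: (Double, Double)): Seq[Double] = {
    val (lo, hi) = summary
    // Guard against a constant column, where hi == lo
    Seq(if (hi == lo) 0.0 else (value - lo) / (hi - lo))
  }
}

object OneHot extends SimpleTransformer[String, Seq[String]] {
  def aggregate(values: Seq[String]): Seq[String] = values.distinct.sorted
  def transform(value: String, summary: Seq[String]): Seq[Double] =
    summary.map(label => if (label == value) 1.0 else 0.0)
}

val scores = Seq(1.0, 2.0, 3.0)
val labels = Seq("a", "b", "c")

// Concatenate the per-column outputs into one feature vector per row
val rows: Seq[Seq[Double]] =
  MinMax(scores).zip(OneHot(labels)).map { case (s, l) => s ++ l }
```

Each transformer keeps its aggregation and its mapping side by side, so adding a feature means adding one object rather than widening a shared tuple as in the naive version.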

Featran also supports these additional features.

- Extract from Scala collections, Flink DataSets, Scalding TypedPipes, Scio SCollections and Spark RDDs
- Output as Scala collections, Breeze dense and sparse vectors, TensorFlow Example Protobuf, XGBoost LabeledPoint and NumPy .npy file
- Import aggregation from a previous extraction for training, validation and test sets
- Compose feature specifications and separate outputs
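
To see why reusing an aggregation across training, validation and test sets matters, consider the plain-Scala sketch below. It does not use Featran's API; MinMaxSettings and fit are hypothetical names. The summary statistics are computed once on the training data and then applied unchanged to the test data, so both sets are scaled consistently.

```scala
// Hypothetical helper, not part of Featran: summary statistics fitted once and reused.
final case class MinMaxSettings(min: Double, max: Double) {
  // Values outside the fitted range map outside [0, 1] in this sketch
  def scale(x: Double): Double =
    if (max == min) 0.0 else (x - min) / (max - min)
}

// Aggregation step: run on the training set only
def fit(values: Seq[Double]): MinMaxSettings =
  MinMaxSettings(values.min, values.max)

val training = Seq(10.0, 20.0, 30.0)
val test = Seq(15.0, 40.0)

val settings = fit(training)

// Mapping step: the same settings are applied to every data set
val scaledTraining = training.map(settings.scale) // 0.0, 0.5, 1.0
val scaledTest = test.map(settings.scale)         // 0.25, 1.5
```

Refitting on the test set instead would map 15.0 to 0.0 and 40.0 to 1.0, which no longer matches the scale the model was trained on.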

See Examples (source) for detailed examples. See transformers package for a complete list of available feature transformers.

See ScalaDocs for current API documentation.

Presentations

Featran - Type safe and generic feature transformation in Scala - NABD Conf Palo Alto 2017 talk

Artifacts

Featran includes the following artifacts:

- featran-core - core library, support for extraction from Scala collections and output as Scala collections, Breeze dense and sparse vectors
- featran-java - Java interface, see JavaExample.java
- featran-flink - support for extraction from Flink DataSet
- featran-scalding - support for extraction from Scalding TypedPipe
- featran-scio - support for extraction from Scio SCollection
- featran-spark - support for extraction from Spark RDD
- featran-tensorflow - support for output as TensorFlow Example Protobuf
- featran-xgboost - support for output as XGBoost LabeledPoint
- featran-numpy - support for output as NumPy .npy file
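
As a rough sketch, these artifacts could be pulled into an sbt build as shown below. The "com.spotify" group ID is an assumption inferred from the com.spotify.featran package name, and the version string is a placeholder; check the published artifacts for the actual coordinates.

```scala
// build.sbt (sketch). The group ID is assumed from the package name;
// replace the placeholder with a released Featran version.
val featranVersion = "<version>"

libraryDependencies ++= Seq(
  // Core library: Scala collections in, Scala collections or Breeze vectors out
  "com.spotify" %% "featran-core" % featranVersion,
  // Optional integrations, added only as needed
  "com.spotify" %% "featran-spark" % featranVersion,
  "com.spotify" %% "featran-tensorflow" % featranVersion
)
```

Splitting the integrations into separate artifacts means a project that only needs Spark input, for example, does not have to depend on Flink, Scalding or Scio.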

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0


Stars: ✭ 420
