
crackcell / Mlfeature

License: Apache-2.0
Feature engineering toolkit for Spark MLlib.


Projects that are alternatives of or similar to Mlfeature

Goodreads etl pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Stars: ✭ 793 (+6508.33%)
Mutual labels:  spark
Kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
Stars: ✭ 916 (+7533.33%)
Mutual labels:  spark
Tiledb Vcf
Efficient variant-call data storage and retrieval library using the TileDB storage library.
Stars: ✭ 26 (+116.67%)
Mutual labels:  spark
Szt Bigdata
Shenzhen Metro big-data passenger-flow analysis system 🚇🚄🌟
Stars: ✭ 826 (+6783.33%)
Mutual labels:  spark
Yandex Big Data Engineering
Stars: ✭ 17 (+41.67%)
Mutual labels:  spark
Spark Tdd Example
A simple Spark TDD example
Stars: ✭ 23 (+91.67%)
Mutual labels:  spark
Sparklyr
R interface for Apache Spark
Stars: ✭ 775 (+6358.33%)
Mutual labels:  spark
Sparkjni
A heterogeneous Apache Spark framework.
Stars: ✭ 11 (-8.33%)
Mutual labels:  spark
Spark Scala Tutorial
A free tutorial for Apache Spark.
Stars: ✭ 907 (+7458.33%)
Mutual labels:  spark
Spark Swagger
Spark (http://sparkjava.com/) support for Swagger (https://swagger.io/)
Stars: ✭ 25 (+108.33%)
Mutual labels:  spark
Sparkling Water
Sparkling Water provides H2O functionality inside Spark cluster
Stars: ✭ 887 (+7291.67%)
Mutual labels:  spark
Parquet Generator
Parquet file generator
Stars: ✭ 16 (+33.33%)
Mutual labels:  spark
Chronicler
Scala toolchain for InfluxDB
Stars: ✭ 24 (+100%)
Mutual labels:  spark
Bigdataguide
Learn big data from scratch; includes learning videos for every stage and interview materials
Stars: ✭ 817 (+6708.33%)
Mutual labels:  spark
Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu
Stars: ✭ 847 (+6958.33%)
Mutual labels:  spark
Spark Redis
A connector for Spark that allows reading and writing to/from Redis cluster
Stars: ✭ 773 (+6341.67%)
Mutual labels:  spark
Digitrecognizer
Java Convolutional Neural Network example for Hand Writing Digit Recognition
Stars: ✭ 23 (+91.67%)
Mutual labels:  spark
Mare
MaRe leverages the power of Docker and Spark to run and scale your serial tools in MapReduce fashion.
Stars: ✭ 11 (-8.33%)
Mutual labels:  spark
Bigdata Interview
🎯 🌟 [Big data interview questions] A collection of big-data interview questions gathered online, with the author's own answer summaries. Currently covers the Hadoop/Hive/Spark/Flink/HBase/Kafka/ZooKeeper frameworks
Stars: ✭ 857 (+7041.67%)
Mutual labels:  spark
Mobius
C# and F# language binding and extensions to Apache Spark
Stars: ✭ 929 (+7641.67%)
Mutual labels:  spark

MLfeature

Feature engineering toolkit for Spark MLlib:

  • Data preprocessing:
    • Handle imbalanced dataset: DataBalancer
    • Handle missing values: (Implemented in Spark 2.2, SPARK-13568)
      • Impute continuous missing values with mean: MissingValueMeanImputor
  • Feature selection:
    • VarianceSelector: remove features with low variance
    • UnivariateSelector: feature selection with univariate metrics
    • ByModelSelector: feature selection with a model
  • Feature transformers:
    • Enhanced Bucketizer: MyBucketizer (Waiting to be merged, SPARK-19781)
    • Enhanced StringIndexer: MyStringIndexer (Merged in Spark 2.2, SPARK-17233)

Handle imbalanced dataset

  • DataBalancer: Make a balanced dataset with multiple strategies:
    • Re-sampling:
      • over-sampling
      • under-sampling
      • middle-sampling
    • SMOTE: TODO
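The over-sampling idea behind DataBalancer can be sketched in plain Scala. This is a deterministic illustration of the strategy, not the DataBalancer internals (which may sample randomly): each class's rows are replicated until every class matches the size of the largest class.

```scala
// Over-sampling sketch: replicate each class's rows until every
// class reaches the majority class's row count.
val labels = Seq("a", "a", "b", "c")

val byClass = labels.groupBy(identity)
val target = byClass.values.map(_.size).max  // majority class size: 2

val balanced = byClass.values.toSeq.flatMap { rows =>
  // Cycle through the class's rows until we have `target` of them.
  Iterator.continually(rows).flatten.take(target).toSeq
}

val counts = balanced.groupBy(identity).map { case (label, rows) => label -> rows.size }
println(counts)  // every class now has 2 rows
```

Under-sampling is the mirror image (trim every class down to the smallest class's count); middle-sampling picks a target in between.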

Example (over-sampling):

val data = Seq("a", "a", "b", "c")
val dataFrame = data.toDF("feature")

val balancer = new DataBalancer()
  .setStrategy("oversampling")
  .setInputCol("feature")

val result = balancer.transform(dataFrame)
result.show(100)

Example (under-sampling):

val data: Seq[String] = Seq("a", "a", "a", "a", "b", "b", "b", "c")
val dataFrame = data.toDF("feature")

val balancer = new DataBalancer()
  .setStrategy("undersampling")
  .setInputCol("feature")

val result = balancer.transform(dataFrame)
result.show(100)

Example (middle-sampling):

val data: Seq[String] = Seq("a", "a", "a", "a", "b", "b", "b", "c")
val dataFrame = data.toDF("feature")

val balancer = new DataBalancer()
  .setStrategy("middlesampling")
  .setInputCol("feature")

val result = balancer.transform(dataFrame)
result.show(100)

Handle missing values

  • MissingValueMeanImputer: Impute continuous missing values with mean
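The imputation rule itself is simple and can be sketched in plain Scala (this illustrates the idea only; the imputer's actual API is not shown here): NaN entries of a numeric column are replaced with the mean of the observed values.

```scala
// Mean-imputation sketch: compute the mean over non-missing values,
// then substitute it for every NaN.
val column = Seq(1.0, Double.NaN, 3.0, Double.NaN, 5.0)

val observed = column.filterNot(_.isNaN)
val mean = observed.sum / observed.size  // (1 + 3 + 5) / 3 = 3.0

val imputed = column.map(v => if (v.isNaN) mean else v)
println(imputed)  // List(1.0, 3.0, 3.0, 3.0, 5.0)
```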

Feature Selection

VarianceSelector

VarianceSelector is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
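The selection rule can be sketched in plain Scala. Whether the selector uses the sample or the population variance is an assumption here; the sample variance is what makes the threshold-3 example below keep only the middle column.

```scala
// VarianceSelector sketch: keep only the columns whose sample
// variance meets the threshold.
val rows = Seq(
  Seq(0.0, 1.0, 0.0),
  Seq(0.0, 3.0, 0.0),
  Seq(0.0, 4.0, 0.0),
  Seq(0.0, 5.0, 0.0),
  Seq(1.0, 6.0, 0.0)
)

def sampleVariance(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.size
  xs.map(x => math.pow(x - mean, 2)).sum / (xs.size - 1)
}

val threshold = 3.0
val columns = rows.transpose
val kept = columns.indices.filter(i => sampleVariance(columns(i)) >= threshold)
// Only index 1 survives: the variances are 0.2, 3.7, and 0.0.
println(kept)
```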

val data = Array(
  Vectors.dense(0, 1.0, 0),
  Vectors.dense(0, 3.0, 0),
  Vectors.dense(0, 4.0, 0),
  Vectors.dense(0, 5.0, 0),
  Vectors.dense(1, 6.0, 0)
)

val expected = Array(
  Vectors.dense(1.0),
  Vectors.dense(3.0),
  Vectors.dense(4.0),
  Vectors.dense(5.0),
  Vectors.dense(6.0)
)

val df = data.zip(expected).toSeq.toDF("features", "expected")

val selector = new VarianceSelector()
  .setInputCol("features")
  .setOutputCol("selected")
  .setThreshold(3)

val result = selector.transform(df)

result.select("expected", "selected").collect()
  .foreach { case Row(vector1: Vector, vector2: Vector) =>
    assert(vector1.equals(vector2), "Transformed vector differs from the expected vector.")
  }

UnivariateSelector

TODO

ByModelSelector

TODO

Feature transformers

MyBucketizer: Enhanced Bucketizer

Puts NULLs, NaNs, and out-of-bounds values into a special bucket.

Example:

val splits = Array(-0.5, 0.0, 0.5)
val validData = Array(-0.5, -0.3, 0.0, 0.2)
val expectedBuckets = Array(0.0, 0.0, 1.0, 1.0)
val dataFrame: DataFrame = validData.zip(expectedBuckets).toSeq.toDF("feature", "expected")

val bucketizer: MyBucketizer = new MyBucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)

val transformed = bucketizer.transform(dataFrame)
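The bucket-assignment rule above can be sketched in plain Scala. The special-bucket index (one past the last regular bucket) is an assumption, mirroring Spark Bucketizer's handleInvalid = "keep" behaviour; in the DataFrame API, NULLs would land in the same bucket as NaNs.

```scala
// MyBucketizer sketch: regular values map to the split interval that
// contains them; NaN and out-of-range values go to one extra bucket.
val splits = Array(-0.5, 0.0, 0.5)
val specialBucket = (splits.length - 1).toDouble  // regular buckets are 0 and 1

def bucket(x: Double): Double =
  if (x.isNaN || x < splits.head || x > splits.last) specialBucket
  else {
    val i = splits.lastIndexWhere(_ <= x)
    // splits.last belongs to the final regular bucket (right-closed).
    math.min(i, splits.length - 2).toDouble
  }

val assigned = Seq(-0.5, -0.3, 0.0, 0.2, Double.NaN, 9.9).map(bucket)
println(assigned)  // List(0.0, 0.0, 1.0, 1.0, 2.0, 2.0)
```

The first four inputs match the validData/expectedBuckets pairs from the example above; the last two show the special bucket.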

MyStringIndexer: Enhanced StringIndexer

Gives NULLs and unseen labels a special index.

Example:

val data = Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
val df = data.toDF("id", "label")
val indexer = new MyStringIndexer()
  .setInputCol("label")
  .setOutputCol("labelIndex")
  .fit(df)

val transformed = indexer.transform(df)
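The indexing rule can be sketched in plain Scala. The ordering and the placement of the special index are assumptions, modelled on StringIndexer's default behaviour: seen labels are indexed by descending frequency, while NULL and unseen labels share one extra index at the end.

```scala
// MyStringIndexer sketch: fit an index on training labels by
// descending frequency; NULL / unseen labels get one shared index.
val training = Seq("a", "b", "c", "a", "a", "c")

val labelToIndex: Map[String, Double] = training
  .groupBy(identity)
  .toSeq
  .sortBy { case (label, occurrences) => (-occurrences.size, label) }  // ties broken alphabetically
  .map(_._1)
  .zipWithIndex
  .map { case (label, i) => label -> i.toDouble }
  .toMap

val specialIndex = labelToIndex.size.toDouble  // 3.0: shared by NULL and unseen labels

def index(label: String): Double =
  if (label == null) specialIndex else labelToIndex.getOrElse(label, specialIndex)

val indexed = Seq("a", "c", "b", "d", null).map(index)
println(indexed)  // List(0.0, 1.0, 2.0, 3.0, 3.0)
```

Here "a" (3 occurrences) gets index 0, "c" (2) gets 1, "b" (1) gets 2, and both the unseen "d" and NULL fall into the special index 3.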