
sramirez / Spark Infotheoretic Feature Selection

Licence: apache-2.0
This package contains a generic implementation of greedy Information Theoretic Feature Selection (FS) methods. The implementation is based on the common theoretic framework presented by Gavin Brown. Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are provided.

Programming Languages

scala

Projects that are alternatives of or similar to Spark Infotheoretic Feature Selection

Cube.js
📊 Cube — Open-Source Analytics API for Building Data Apps
Stars: ✭ 11,983 (+9642.28%)
Mutual labels:  spark
Teddy
Spark Streaming monitoring platform with support for task deployment, alerting, and automatic restart
Stars: ✭ 120 (-2.44%)
Mutual labels:  spark
Ng2 Smart Table
Angular Smart Data Table component
Stars: ✭ 1,590 (+1192.68%)
Mutual labels:  filter
Hosts Blocklists
Automatically updated, moderated and optimized lists for blocking ads, trackers, malware and other garbage
Stars: ✭ 1,749 (+1321.95%)
Mutual labels:  filter
Elassandra
Elassandra = Elasticsearch + Apache Cassandra
Stars: ✭ 1,610 (+1208.94%)
Mutual labels:  spark
Example Spark Kafka
Apache Spark and Apache Kafka integration example
Stars: ✭ 120 (-2.44%)
Mutual labels:  spark
Spark Lucenerdd
Spark RDD with Lucene's query and entity linkage capabilities
Stars: ✭ 114 (-7.32%)
Mutual labels:  spark
Fungen
Replace boilerplate code with functional patterns using 'go generate'
Stars: ✭ 122 (-0.81%)
Mutual labels:  filter
Kinesis Sql
Kinesis Connector for Structured Streaming
Stars: ✭ 120 (-2.44%)
Mutual labels:  spark
Glslsmartdenoise
Fast glsl deNoise spatial filter, with circular gaussian kernel, full configurable
Stars: ✭ 121 (-1.63%)
Mutual labels:  filter
Ibis
A pandas-like deferred expression system, with first-class SQL support
Stars: ✭ 1,630 (+1225.2%)
Mutual labels:  spark
Wavelets.jl
A Julia package for fast discrete wavelet transforms and utilities
Stars: ✭ 118 (-4.07%)
Mutual labels:  filter
Eat pyspark in 10 days
pyspark🍒🥭 is delicious,just eat it!😋😋
Stars: ✭ 116 (-5.69%)
Mutual labels:  spark
Jsonapi.rb
Lightweight, simple and maintained JSON:API support for your next Ruby HTTP API.
Stars: ✭ 116 (-5.69%)
Mutual labels:  filter
Deequ
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Stars: ✭ 2,020 (+1542.28%)
Mutual labels:  spark
Active hash relation
ActiveHash Relation: Simple gem that allows you to run multiple ActiveRecord::Relation using hash. Perfect for APIs.
Stars: ✭ 115 (-6.5%)
Mutual labels:  filter
Vue2 Bootstrap Table
A sortable and searchable table, as a Vue2 component, using bootstrap styling.
Stars: ✭ 120 (-2.44%)
Mutual labels:  filter
Sieve
A simple, clean and elegant way to filter Eloquent models.
Stars: ✭ 123 (+0%)
Mutual labels:  filter
Spark Alchemy
Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive
Stars: ✭ 122 (-0.81%)
Mutual labels:  spark
Zparkio
Boiler plate framework to use Spark and ZIO together.
Stars: ✭ 121 (-1.63%)
Mutual labels:  spark

An Information Theoretic Feature Selection Framework

The present framework implements Feature Selection (FS) on Spark for application to Big Data problems. This package contains a generic implementation of greedy Information Theoretic Feature Selection methods. The implementation is based on the common theoretic framework presented in [1]. Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are provided. In addition, the framework can be extended with user-provided criteria as long as the process complies with the framework proposed in [1].
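
As a rough illustration of the unifying framework in [1], each candidate feature Xk is scored by its relevance to the class minus weighted redundancy terms with respect to the already-selected set S; different weightings of the two redundancy terms yield mRMR, JMI, InfoGain-style ranking, and other criteria. A minimal, hypothetical Scala sketch of this scoring rule (not the package's internal API):

// Greedy scoring rule from Brown et al. [1]:
//   J(Xk) = I(Xk; Y) - beta * sum_{Xj in S} I(Xk; Xj) + gamma * sum_{Xj in S} I(Xk; Xj | Y)
// Example instantiations: MIM/InfoGain ranking (beta = 0, gamma = 0),
// mRMR (beta = 1/|S|, gamma = 0), JMI (beta = gamma = 1/|S|).
def criterionScore(
    relevance: Double,              // I(Xk; Y)
    redundancies: Seq[Double],      // I(Xk; Xj) for each already-selected Xj
    condRedundancies: Seq[Double],  // I(Xk; Xj | Y) for each already-selected Xj
    beta: Double,
    gamma: Double): Double =
  relevance - beta * redundancies.sum + gamma * condRedundancies.sum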

Spark package: http://spark-packages.org/package/sramirez/spark-infotheoretic-feature-selection

Please cite as: S. Ramírez-Gallego; H. Mouriño-Talín; D. Martínez-Rego; V. Bolón-Canedo; J. M. Benítez; A. Alonso-Betanzos; F. Herrera, "An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark," in IEEE Transactions on Systems, Man, and Cybernetics: Systems, in press, pp.1-13, doi: 10.1109/TSMC.2017.2670926 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7970198&isnumber=6376248

Main features:

  • Version for the new ml (DataFrame-based) library.
  • Support for sparse data and high-dimensional datasets (millions of features); a sparse-input sketch follows this list.
  • Improved performance (less than 1 minute per iteration for datasets such as ECBDL14 and kddb with 400 cores).
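
Because the selector consumes standard Spark feature vectors, sparse and high-dimensional inputs can be represented with Vectors.sparse. A minimal sketch, assuming an existing SparkSession named spark; the column names "features" and "class" are only illustrative and match the ml example below:

import org.apache.spark.ml.linalg.Vectors

// A single row with 1,000,000 features, only three of them non-zero.
// Feature values are integer-valued doubles (see the Prerequisites section below).
val row = (Vectors.sparse(1000000, Seq((3, 5.0), (17, 1.0), (999999, 42.0))), 0.0)
val df = spark.createDataFrame(Seq(row)).toDF("features", "class")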

Two contributions associated with this work have been submitted to international journals and will be linked here as soon as they are accepted. This software has been tested on large real-world datasets such as ECBDL14 and kddb (see the performance note above).

Example (ml):

import org.apache.spark.ml.feature._

// df is a DataFrame with a vector column "features" and a label column "class"
val selector = new InfoThSelector()
  .setSelectCriterion("mrmr")
  .setNPartitions(100)
  .setNumTopFeatures(10)
  .setFeaturesCol("features")
  .setLabelCol("class")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)

Example (MLlib):

import org.apache.spark.mllib.feature._
import org.apache.spark.mllib.regression.LabeledPoint

// data is an RDD[LabeledPoint] with discretized feature values (see Prerequisites)
val criterion = new InfoThCriterionFactory("mrmr")
val nToSelect = 100
val nPartitions = 100

println("*** FS criterion: " + criterion.getCriterion.toString)
println("*** Number of features to select: " + nToSelect)
println("*** Number of partitions: " + nPartitions)

val featureSelector = new InfoThSelector(criterion, nToSelect, nPartitions).fit(data)

// Project each point onto the selected features
val reduced = data.map(i => LabeledPoint(i.label, featureSelector.transform(i.features)))
reduced.first()

Design doc: https://docs.google.com/document/d/1HOaPL_HJzTbL2tVdzbTjhr5wxVvPe9e-23S7rc2VcsY/edit?usp=sharing

Prerequisites:

LabeledPoint data must be discretized as integer values in double representation, ranging from 0 to 255. In this way, double values can be cast directly to byte, which makes the overall selection process much more efficient (communication overhead is greatly reduced).
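
For instance, a valid MLlib input record could look like the following sketch (purely illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Every feature value is an integer in [0, 255] stored as a double,
// so it can later be packed into a single byte.
val point = LabeledPoint(1.0, Vectors.dense(0.0, 17.0, 255.0, 3.0))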

Please refer to the MDLP package if you need to discretize your dataset:

https://spark-packages.org/package/sramirez/spark-MDLP-discretization

Contributors

References

[1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). "Conditional likelihood maximisation: a unifying framework for information theoretic feature selection." The Journal of Machine Learning Research, 13(1), 27-66.
