
sramirez / Spark Infotheoretic Feature Selection

Licence: apache-2.0
This package contains a generic implementation of greedy Information Theoretic Feature Selection (FS) methods. The implementation is based on the common theoretic framework presented by Gavin Brown. Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are provided.

Programming Languages

scala

Projects that are alternatives of or similar to Spark Infotheoretic Feature Selection

Cube.js
📊 Cube — Open-Source Analytics API for Building Data Apps
Stars: ✭ 11,983 (+9642.28%)
Mutual labels:  spark
Teddy
Spark Streaming monitoring platform with support for task deployment, alerting, and automatic restart
Stars: ✭ 120 (-2.44%)
Mutual labels:  spark
Ng2 Smart Table
Angular Smart Data Table component
Stars: ✭ 1,590 (+1192.68%)
Mutual labels:  filter
Hosts Blocklists
Automatically updated, moderated and optimized lists for blocking ads, trackers, malware and other garbage
Stars: ✭ 1,749 (+1321.95%)
Mutual labels:  filter
Elassandra
Elassandra = Elasticsearch + Apache Cassandra
Stars: ✭ 1,610 (+1208.94%)
Mutual labels:  spark
Example Spark Kafka
Apache Spark and Apache Kafka integration example
Stars: ✭ 120 (-2.44%)
Mutual labels:  spark
Spark Lucenerdd
Spark RDD with Lucene's query and entity linkage capabilities
Stars: ✭ 114 (-7.32%)
Mutual labels:  spark
Fungen
Replace boilerplate code with functional patterns using 'go generate'
Stars: ✭ 122 (-0.81%)
Mutual labels:  filter
Kinesis Sql
Kinesis Connector for Structured Streaming
Stars: ✭ 120 (-2.44%)
Mutual labels:  spark
Glslsmartdenoise
Fast glsl deNoise spatial filter, with circular gaussian kernel, full configurable
Stars: ✭ 121 (-1.63%)
Mutual labels:  filter
Ibis
A pandas-like deferred expression system, with first-class SQL support
Stars: ✭ 1,630 (+1225.2%)
Mutual labels:  spark
Wavelets.jl
A Julia package for fast discrete wavelet transforms and utilities
Stars: ✭ 118 (-4.07%)
Mutual labels:  filter
Eat pyspark in 10 days
pyspark🍒🥭 is delicious,just eat it!😋😋
Stars: ✭ 116 (-5.69%)
Mutual labels:  spark
Jsonapi.rb
Lightweight, simple and maintained JSON:API support for your next Ruby HTTP API.
Stars: ✭ 116 (-5.69%)
Mutual labels:  filter
Deequ
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Stars: ✭ 2,020 (+1542.28%)
Mutual labels:  spark
Active hash relation
ActiveHash Relation: Simple gem that allows you to run multiple ActiveRecord::Relation using hash. Perfect for APIs.
Stars: ✭ 115 (-6.5%)
Mutual labels:  filter
Vue2 Bootstrap Table
A sortable and searchable table, as a Vue2 component, using bootstrap styling.
Stars: ✭ 120 (-2.44%)
Mutual labels:  filter
Sieve
A simple, clean and elegant way to filter Eloquent models.
Stars: ✭ 123 (+0%)
Mutual labels:  filter
Spark Alchemy
Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive
Stars: ✭ 122 (-0.81%)
Mutual labels:  spark
Zparkio
Boiler plate framework to use Spark and ZIO together.
Stars: ✭ 121 (-1.63%)
Mutual labels:  spark

An Information Theoretic Feature Selection Framework

The present framework implements Feature Selection (FS) on Spark for application to Big Data problems. This package contains a generic implementation of greedy Information Theoretic Feature Selection methods. The implementation is based on the common theoretic framework presented in [1]. Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are provided. In addition, the framework can be extended with user-provided criteria as long as the process complies with the framework proposed in [1].
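
As a rough illustration of the unifying framework in [1], each candidate feature Xk is scored by its relevance to the class minus weighted redundancy terms with respect to the already-selected set S; different weightings of the two redundancy terms yield mRMR, JMI, InfoGain-style ranking, and other criteria. A minimal, hypothetical Scala sketch of this scoring rule (not the package's internal API):

// Greedy scoring rule from Brown et al. [1]:
//   J(Xk) = I(Xk; Y) - beta * sum_{Xj in S} I(Xk; Xj) + gamma * sum_{Xj in S} I(Xk; Xj | Y)
// Example instantiations: MIM/InfoGain ranking (beta = 0, gamma = 0),
// mRMR (beta = 1/|S|, gamma = 0), JMI (beta = gamma = 1/|S|).
def criterionScore(
    relevance: Double,              // I(Xk; Y)
    redundancies: Seq[Double],      // I(Xk; Xj) for each already-selected Xj
    condRedundancies: Seq[Double],  // I(Xk; Xj | Y) for each already-selected Xj
    beta: Double,
    gamma: Double): Double =
  relevance - beta * redundancies.sum + gamma * condRedundancies.sum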

Spark package: http://spark-packages.org/package/sramirez/spark-infotheoretic-feature-selection

Please cite as: S. Ramírez-Gallego; H. Mouriño-Talín; D. Martínez-Rego; V. Bolón-Canedo; J. M. Benítez; A. Alonso-Betanzos; F. Herrera, "An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark," in IEEE Transactions on Systems, Man, and Cybernetics: Systems, in press, pp.1-13, doi: 10.1109/TSMC.2017.2670926 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7970198&isnumber=6376248

Main features:

  • Version for the new ml (DataFrame-based) library.
  • Support for sparse data and high-dimensional datasets (millions of features); a sparse-input sketch follows this list.
  • Improved performance (less than 1 minute per iteration for datasets such as ECBDL14 and kddb with 400 cores).
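
Because the selector consumes standard Spark feature vectors, sparse and high-dimensional inputs can be represented with Vectors.sparse. A minimal sketch, assuming an existing SparkSession named spark; the column names "features" and "class" are only illustrative and match the ml example below:

import org.apache.spark.ml.linalg.Vectors

// A single row with 1,000,000 features, only three of them non-zero.
// Feature values are integer-valued doubles (see the Prerequisites section below).
val row = (Vectors.sparse(1000000, Seq((3, 5.0), (17, 1.0), (999999, 42.0))), 0.0)
val df = spark.createDataFrame(Seq(row)).toDF("features", "class")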

Two contributions associated with this work have been submitted to international journals and will be linked here as soon as they are accepted. This software has been tested on large real-world datasets such as ECBDL14 and kddb (see the performance note above).

Example (ml):

import org.apache.spark.ml.feature._

// df is a DataFrame with a vector column "features" and a label column "class"
val selector = new InfoThSelector()
  .setSelectCriterion("mrmr")
  .setNPartitions(100)
  .setNumTopFeatures(10)
  .setFeaturesCol("features")
  .setLabelCol("class")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)

Example (MLlib):

import org.apache.spark.mllib.feature._
import org.apache.spark.mllib.regression.LabeledPoint

// data is an RDD[LabeledPoint] with discretized feature values (see Prerequisites)
val criterion = new InfoThCriterionFactory("mrmr")
val nToSelect = 100
val nPartitions = 100

println("*** FS criterion: " + criterion.getCriterion.toString)
println("*** Number of features to select: " + nToSelect)
println("*** Number of partitions: " + nPartitions)

val featureSelector = new InfoThSelector(criterion, nToSelect, nPartitions).fit(data)

// Project each point onto the selected features
val reduced = data.map(i => LabeledPoint(i.label, featureSelector.transform(i.features)))
reduced.first()

Design doc: https://docs.google.com/document/d/1HOaPL_HJzTbL2tVdzbTjhr5wxVvPe9e-23S7rc2VcsY/edit?usp=sharing

Prerequisites:

LabeledPoint data must be discretized as integer values in double representation, ranging from 0 to 255. In this way, double values can be cast directly to byte, which makes the overall selection process much more efficient (communication overhead is greatly reduced).
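
For instance, a valid MLlib input record could look like the following sketch (purely illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Every feature value is an integer in [0, 255] stored as a double,
// so it can later be packed into a single byte.
val point = LabeledPoint(1.0, Vectors.dense(0.0, 17.0, 255.0, 3.0))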

Please refer to the MDLP package if you need to discretize your dataset:

https://spark-packages.org/package/sramirez/spark-MDLP-discretization

Contributors

References

[1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). "Conditional likelihood maximisation: a unifying framework for information theoretic feature selection." The Journal of Machine Learning Research, 13(1), 27-66.
