All Projects → sramirez → Fast Mrmr

sramirez / Fast Mrmr

An improved implementation of the classical feature selection method: minimum Redundancy and Maximum Relevance (mRMR).

Labels

Projects that are alternatives of or similar to Fast Mrmr

Mrthumb
【拇指先生】 a simple easy video thumbnail provider,顺滑的获取视频缩略图,支持本地和网络视频,有问题大胆提Issues
Stars: ✭ 60 (-10.45%)
Mutual labels:  fast
Spark Doc Zh
Apache Spark 官方文档中文版
Stars: ✭ 1,126 (+1580.6%)
Mutual labels:  spark
Rsparkling
RSparkling: Use H2O Sparkling Water from R (Spark + R + Machine Learning)
Stars: ✭ 65 (-2.99%)
Mutual labels:  spark
Data Science Cookbook
🎓 Jupyter notebooks from UFC data science course
Stars: ✭ 60 (-10.45%)
Mutual labels:  spark
Csstree
A tool set for CSS including fast detailed parser, walker, generator and lexer based on W3C specs and browser implementations
Stars: ✭ 1,121 (+1573.13%)
Mutual labels:  fast
Pyspark Twitter Stream Mining
Real-time Machine Learning with Apache Spark on Twitter Public Stream
Stars: ✭ 64 (-4.48%)
Mutual labels:  spark
Rumble
⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Stars: ✭ 58 (-13.43%)
Mutual labels:  spark
Thingsboard
Open-source IoT Platform - Device management, data collection, processing and visualization.
Stars: ✭ 10,526 (+15610.45%)
Mutual labels:  spark
Roffildlibrary
Library for MQL5 (MetaTrader) with Python, Java, Apache Spark, AWS
Stars: ✭ 63 (-5.97%)
Mutual labels:  spark
Spark Bigquery
Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.
Stars: ✭ 65 (-2.99%)
Mutual labels:  spark
Waimak
Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Stars: ✭ 60 (-10.45%)
Mutual labels:  spark
Silex
something to help you spark
Stars: ✭ 61 (-8.96%)
Mutual labels:  spark
W2v
Word2Vec models with Twitter data using Spark. Blog:
Stars: ✭ 64 (-4.48%)
Mutual labels:  spark
Zemberek Nlp Server
Zemberek Türkçe NLP Java Kütüphanesi üzerine REST Docker Sunucu
Stars: ✭ 60 (-10.45%)
Mutual labels:  spark
Torch Models
Stars: ✭ 65 (-2.99%)
Mutual labels:  fast
Pyspark Examples
Code examples on Apache Spark using python
Stars: ✭ 58 (-13.43%)
Mutual labels:  spark
Pysparkgeoanalysis
🌐 Interactive Workshop on GeoAnalysis using PySpark
Stars: ✭ 63 (-5.97%)
Mutual labels:  spark
Kontextfrei
Writing application logic for Spark jobs that can be unit-tested without a SparkContext
Stars: ✭ 67 (+0%)
Mutual labels:  spark
Zzzjson
The fastest JSON parser written in pure C
Stars: ✭ 66 (-1.49%)
Mutual labels:  fast
Reenv
dotenv-cli implementation in native ReasonML providing near-instant startup times
Stars: ✭ 65 (-2.99%)
Mutual labels:  fast

Welcome to the fast-mRMR wiki!

This is an improved implementation of the classical feature selection method: minimum Redundancy and Maximum Relevance (mRMR); presented by Peng in [1].

Main features

Several optimizations have been introduced in this improved version in order to speed up the costliest computation of the original algorithm: Mutual Information (MI) calculations. These optimizations are described in the followings:

  • Cache marginal probabilities: Instead of obtaining the marginal probabilities in each MI computation, those are calculated only once at the beginning of the program, and cached to be re-used in the next iterations.

  • Accumulating redundancy: The most important optimization is the greedy nature of the algorithm. Instead of computing the mutual information between every pair of features, now redundancy is accumulated in each iteration and the computations are performed between the last selected feature in S and each feature in non-selected set of attributes.

  • Data access pattern: The access pattern of mRMR to the dataset is thought to be feature-wise, in contrast to many other ML (machine learning) algorithms, in which access pattern is row-wise. Although being a low-level technical nuance, this aspect can significantly degrade mRMR performance since random access has a much greater cost than block-wise access. This is specially important in the case of GPU, since data has to be transferred from CPU memory to GPU global memory. Here, we reorganize the way in which data is stored in memory, changing it to a columnar format.

Implementations

Here, we include several implementations for different platforms, in order to ease the application of our proposal. These are:

  1. Sequential version: we provide a basic implementation in C++ for CPU processing. This is designed to be executed in a single machine. This version includes all aforementioned optimizations.
  2. CUDA implementation: we also provide a GPU implementation (using CUDA) with the aim of speeding up the previous version through GPU's thread-parallelism.
  3. Apache Spark: a Apache Spark's implementation of the algorithm is also included for large-scale problems, which require a distributed processing in order to obtain efficient solutions.

Please, for further information refer to our wiki: https://github.com/sramirez/fast-mRMR/wiki

Project structure

The code is organized as follows:

  • cpu: C++ code for CPU (+info).
  • gpu: CUDA code for GPU implementation (+info).
  • spark: Scala code for distributed processing in Apache Spark platform (+info).
  • utils: this folder contains a data reader program that transforms data in CSV format to the format required by fast-mRMR algorithm (in binary and columnar-wise format) (+info). It also includes a data generator method in case we want to generate synthetic data specifying the structure of this data.
  • examples: a folder with examples for all versions implemented.

License

Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

References

[1] "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," Hanchuan Peng, Fuhui Long, and Chris Ding IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, pp.1226-1238, 2005.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].