punit-naik / MLHadoop

License: Apache-2.0
This repository contains machine-learning MapReduce code for Hadoop, written from scratch (without using any package or library), e.g. prediction (linear and logistic regression), clustering (K-Means), classification (KNN), etc.

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to MLHadoop

rastercube
rastercube is a python library for big data analysis of georeferenced time series data (e.g. MODIS NDVI)
Stars: ✭ 15 (-70%)
Mutual labels:  hadoop
wasp
WASP is a framework to build complex real time big data applications. It relies on a kind of Kappa/Lambda architecture mainly leveraging Kafka and Spark. If you need to ingest huge amount of heterogeneous data and analyze them through complex pipelines, this is the framework for you.
Stars: ✭ 19 (-62%)
Mutual labels:  hadoop
darwin
Avro Schema Evolution made easy
Stars: ✭ 26 (-48%)
Mutual labels:  hadoop
sparkucx
A high-performance, scalable and efficient ShuffleManager plugin for Apache Spark, utilizing UCX communication layer
Stars: ✭ 32 (-36%)
Mutual labels:  hadoop
hadoop-ecosystem
Visualizations of the Hadoop Ecosystem
Stars: ✭ 20 (-60%)
Mutual labels:  hadoop
hadoop-crypto
Library for per-file client-side encryption in Hadoop FileSystems such as HDFS or S3.
Stars: ✭ 38 (-24%)
Mutual labels:  hadoop
oci-cloudera
Terraform module to deploy Cloudera on Oracle Cloud Infrastructure (OCI)
Stars: ✭ 20 (-60%)
Mutual labels:  hadoop
DaFlow
Apache Spark-based Data Flow (ETL) framework which supports multiple read and write destinations of different types and also supports multiple categories of transformation rules.
Stars: ✭ 24 (-52%)
Mutual labels:  hadoop
presto
Teradata Distribution of Presto -- A Distributed SQL Query Engine for Big Data
Stars: ✭ 91 (+82%)
Mutual labels:  hadoop
UBA
UEBA Solution for Insider Security. This repo is archived. Thanks!
Stars: ✭ 36 (-28%)
Mutual labels:  hadoop
memex-gate
General Architecture for Text Engineering
Stars: ✭ 47 (-6%)
Mutual labels:  hadoop
liquibase-impala
Liquibase extension to add Impala Database support
Stars: ✭ 23 (-54%)
Mutual labels:  hadoop
implyr
SQL backend to dplyr for Impala
Stars: ✭ 74 (+48%)
Mutual labels:  hadoop
hadoopoffice
HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)
Stars: ✭ 56 (+12%)
Mutual labels:  hadoop
aaocp
A big data project for analyzing user behavior logs
Stars: ✭ 53 (+6%)
Mutual labels:  hadoop
learning-spark
Tidy up Spark and Hadoop tutorials.
Stars: ✭ 28 (-44%)
Mutual labels:  hadoop
datasqueeze
Hadoop utility to compact small files
Stars: ✭ 18 (-64%)
Mutual labels:  hadoop
clickhouse hadoop
Import data from clickhouse to hadoop with pure SQL
Stars: ✭ 26 (-48%)
Mutual labels:  hadoop
Movies-Analytics-in-Spark-and-Scala
Data cleaning, pre-processing, and Analytics on a million movies using Spark and Scala.
Stars: ✭ 47 (-6%)
Mutual labels:  hadoop
hive-jdbc-driver
An alternative to the "hive standalone" jar for connecting Java applications to Apache Hive via JDBC
Stars: ✭ 31 (-38%)
Mutual labels:  hadoop

MLHadoop

This repository contains machine-learning MapReduce code for Hadoop, written from scratch (without using any package or library). Everything is implemented starting from the basic mathematics each algorithm requires. Examples include prediction algorithms (linear and logistic regression, iterative version), a clustering algorithm (K-Means), a classification algorithm (KNN classifier), MBA, Common Friends, etc.

NOTE: I think some of the algorithms implemented here can be improved in both time and space by controlling the shuffle-sort phase that runs between the map and reduce stages of a MapReduce job, i.e. by writing your own custom Secondary Sort class, since the shuffle-sort phase takes up a lot of time. If you have a particular sort order of key-value pairs in mind, and you are running multiple jobs or extra sorting steps inside mappers and reducers just to get that order, then secondary sorting may come in handy: it lets the framework deliver values to the reducer already sorted, which should speed up the jobs and use less RAM.
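To make the secondary-sort idea in the note above concrete, here is a minimal plain-Java sketch that simulates it without any Hadoop dependency. It is not code from this repository. The names `SecondarySortSketch`, `CompositeKey`, `partition` and `shuffleAndGroup` are hypothetical and exist only for illustration. In a real job, the same three roles would roughly correspond to a custom composite key plus the classes registered through `Job.setPartitionerClass`, `Job.setSortComparatorClass` and `Job.setGroupingComparatorClass`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSketch {

    /** Hypothetical composite key: the natural key plus the field the values should be sorted by. */
    public static final class CompositeKey {
        final String naturalKey;
        final int order;

        public CompositeKey(String naturalKey, int order) {
            this.naturalKey = naturalKey;
            this.order = order;
        }
    }

    /** Sort order used during the shuffle: natural key first, then the secondary field. */
    static final Comparator<CompositeKey> SORT_ORDER =
            Comparator.comparing((CompositeKey k) -> k.naturalKey)
                      .thenComparingInt(k -> k.order);

    /** Partition on the natural key only, so every record of one key reaches the same reducer. */
    public static int partition(CompositeKey key, int numReducers) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /**
     * Simulates shuffle-sort followed by grouping on the natural key.
     * Each group's values arrive already sorted, so the "reducer" never sorts or buffers them itself.
     */
    public static Map<String, List<Integer>> shuffleAndGroup(List<CompositeKey> mapOutput) {
        List<CompositeKey> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT_ORDER);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (CompositeKey k : sorted) {
            groups.computeIfAbsent(k.naturalKey, unused -> new ArrayList<>()).add(k.order);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> mapOutput = Arrays.asList(
                new CompositeKey("b", 3), new CompositeKey("a", 2),
                new CompositeKey("b", 1), new CompositeKey("a", 1));
        System.out.println(shuffleAndGroup(mapOutput)); // prints {a=[1, 2], b=[1, 3]}
    }
}
```

The point of the design is that the sort happens once, in the framework's shuffle phase, instead of in an extra job or in an in-memory sort inside the reducer.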

Language used: Java

IDE used: Eclipse IDE with HDT (Hadoop Development Tools) plugin installed.

Hadoop version used: 1.2.1

I wrote this code when I was still a novice (in MapReduce programming as well as programming in general), so I am certain it is very inefficient and that many optimisations remain to be done. Feel free to point out mistakes or open PRs if you are interested.
