punit-naik / MLHadoop

License: Apache-2.0

This repository contains machine-learning MapReduce code for Hadoop, written from scratch (without using any package or library), e.g. prediction (linear and logistic regression), clustering (K-Means), classification (KNN), etc.

Programming Languages

Java

Projects that are alternatives of or similar to MLHadoop

rastercube

rastercube is a python library for big data analysis of georeferenced time series data (e.g. MODIS NDVI)

Stars: ✭ 15 (-70%)

Mutual labels: hadoop

wasp

WASP is a framework to build complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze it through complex pipelines, this is the framework for you.

Stars: ✭ 19 (-62%)

Mutual labels: hadoop

darwin

Avro Schema Evolution made easy

Stars: ✭ 26 (-48%)

Mutual labels: hadoop

sparkucx

A high-performance, scalable and efficient ShuffleManager plugin for Apache Spark, utilizing UCX communication layer

Stars: ✭ 32 (-36%)

Mutual labels: hadoop

hadoop-ecosystem

Visualizations of the Hadoop Ecosystem

Stars: ✭ 20 (-60%)

Mutual labels: hadoop

hadoop-crypto

Library for per-file client-side encryption in Hadoop FileSystems such as HDFS or S3.

Stars: ✭ 38 (-24%)

Mutual labels: hadoop

oci-cloudera

Terraform module to deploy Cloudera on Oracle Cloud Infrastructure (OCI)

Stars: ✭ 20 (-60%)

Mutual labels: hadoop

DaFlow

Apache Spark based data flow (ETL) framework which supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.

Stars: ✭ 24 (-52%)

Mutual labels: hadoop

presto

Teradata Distribution of Presto -- A Distributed SQL Query Engine for Big Data

Stars: ✭ 91 (+82%)

Mutual labels: hadoop

UBA

UEBA Solution for Insider Security. This repo is archived. Thanks!

Stars: ✭ 36 (-28%)

Mutual labels: hadoop

memex-gate

General Architecture for Text Engineering

Stars: ✭ 47 (-6%)

Mutual labels: hadoop

liquibase-impala

Liquibase extension to add Impala Database support

Stars: ✭ 23 (-54%)

Mutual labels: hadoop

implyr

SQL backend to dplyr for Impala

Stars: ✭ 74 (+48%)

Mutual labels: hadoop

hadoopoffice

HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)

Stars: ✭ 56 (+12%)

Mutual labels: hadoop

aaocp

A big data project for analyzing user behavior logs

Stars: ✭ 53 (+6%)

Mutual labels: hadoop

learning-spark

Tidy up Spark and Hadoop tutorials.

Stars: ✭ 28 (-44%)

Mutual labels: hadoop

datasqueeze

Hadoop utility to compact small files

Stars: ✭ 18 (-64%)

Mutual labels: hadoop

clickhouse hadoop

Import data from clickhouse to hadoop with pure SQL

Stars: ✭ 26 (-48%)

Mutual labels: hadoop

Movies-Analytics-in-Spark-and-Scala

Data cleaning, pre-processing, and Analytics on a million movies using Spark and Scala.

Stars: ✭ 47 (-6%)

Mutual labels: hadoop

hive-jdbc-driver

An alternative to the "hive standalone" jar for connecting Java applications to Apache Hive via JDBC

Stars: ✭ 31 (-38%)

Mutual labels: hadoop

MLHadoop

This repository contains machine-learning MapReduce code for Hadoop, written from scratch (without using any package or library), so you'll find code that starts right from the basic mathematics each algorithm requires. Examples include prediction algorithms (linear and logistic regression, iterative version), a clustering algorithm (K-Means clustering), a classification algorithm (KNN classifier), MBA, Common Friends, etc.
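To give a feel for what "from scratch" can look like, here is a minimal plain-Java sketch of one K-Means iteration split into a map-like step (assign each point to its nearest centroid) and a reduce-like step (average the points of each cluster). It is not taken from this repository and does not use the Hadoop API; the class name `KMeansStepSketch`, its method names, and the sample data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KMeansStepSketch {

    // "Map" step: return the index of the centroid closest to the point
    // (squared Euclidean distance, computed by hand).
    public static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // "Reduce" step: the new centroid is the component-wise mean of its points.
    public static double[] mean(List<double[]> points) {
        double[] sum = new double[points.get(0).length];
        for (double[] p : points) {
            for (int d = 0; d < sum.length; d++) {
                sum[d] += p[d];
            }
        }
        for (int d = 0; d < sum.length; d++) {
            sum[d] /= points.size();
        }
        return sum;
    }

    // One full iteration: group points by nearest centroid, then average each group.
    // A centroid that attracts no points is kept unchanged.
    public static double[][] iterate(double[][] points, double[][] centroids) {
        Map<Integer, List<double[]>> groups = new TreeMap<>();
        for (double[] p : points) {
            int cluster = nearestCentroid(p, centroids);
            groups.computeIfAbsent(cluster, idx -> new ArrayList<>()).add(p);
        }
        double[][] updated = new double[centroids.length][];
        for (int c = 0; c < centroids.length; c++) {
            List<double[]> members = groups.get(c);
            updated[c] = (members == null) ? centroids[c].clone() : mean(members);
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {8, 8}, {9, 8}};
        double[][] centroids = {{0, 0}, {10, 10}};
        // prints [[1.0, 1.5], [8.5, 8.0]]
        System.out.println(Arrays.deepToString(iterate(points, centroids)));
    }
}
```

In a real MapReduce job the grouping done here by the `TreeMap` would be performed by the shuffle-sort phase, and the driver would rerun the job until the centroids stop moving.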

NOTE: I think some of the algorithms implemented here can be improved in both time and space by controlling the shuffle-sort phase of a MapReduce job, i.e. by writing and using your own custom secondary sort, since the shuffle-sort phase takes up a lot of time. If you have a sort order of key-value pairs in mind and you are running multiple jobs or extra sorting steps inside mappers and reducers just to get that order, secondary sorting might come in handy: it can speed up the jobs and use less RAM.
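The secondary sort idea in the note above can be sketched in plain Java, without the Hadoop API. The value to be ordered is moved into a composite key, records are partitioned and grouped by the natural key only, and the sort comparator orders by the full composite key, so each group's values arrive already sorted. The class `SecondarySortSketch`, the `CompositeKey` type, and the sample records are hypothetical and are not code from this repository; in a real job these roles would be played by a custom key type, partitioner, sort comparator, and grouping comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSketch {

    // Composite key: the natural key plus the field we want sorted within each group.
    public static final class CompositeKey {
        public final String naturalKey;
        public final int secondary;

        public CompositeKey(String naturalKey, int secondary) {
            this.naturalKey = naturalKey;
            this.secondary = secondary;
        }
    }

    // Sort comparator: natural key first, then the secondary field.
    public static final Comparator<CompositeKey> SORT_ORDER =
            Comparator.comparing((CompositeKey k) -> k.naturalKey)
                      .thenComparingInt(k -> k.secondary);

    // Partitioner: uses the natural key only, so every record of one
    // natural key lands on the same reducer.
    public static int partition(CompositeKey key, int numReducers) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Simulated shuffle: sort by the composite key, then group by natural key.
    // Each group's values come out already ordered, so the reducer needs no
    // in-memory sort of its own.
    public static Map<String, List<Integer>> shuffle(List<CompositeKey> mapOutput) {
        List<CompositeKey> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT_ORDER);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (CompositeKey key : sorted) {
            groups.computeIfAbsent(key.naturalKey, unused -> new ArrayList<>())
                  .add(key.secondary);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> mapOutput = Arrays.asList(
                new CompositeKey("u2", 30),
                new CompositeKey("u1", 20),
                new CompositeKey("u1", 5),
                new CompositeKey("u2", 10),
                new CompositeKey("u1", 12));
        // prints {u1=[5, 12, 20], u2=[10, 30]}
        System.out.println(shuffle(mapOutput));
    }
}
```

The saving comes from letting the framework's existing sort do the ordering: the reducer can stream through values in order instead of buffering them all in memory and sorting them itself.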

Language used: Java

IDE used: Eclipse IDE with HDT (Hadoop Development Tools) plugin installed.

Hadoop version used: 1.2.1

I wrote this code when I was just a novice (in MapReduce programming as well as programming in general), so I am certain it is very inefficient and that many optimisations remain to be done. Feel free to point out mistakes or create PRs if you are interested.
