Qihoo360 / Xlearning Xdml

extremely distributed machine learning

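XDML is a distributed machine learning platform built on a parameter server and a purpose-built caching mechanism. It incorporates recent results from academic research and, while keeping model quality stable, substantially speeds up convergence and markedly improves the performance of models and algorithms. XDML also integrates several excellent open-source projects along with components developed in-house at 360, drawing on the strengths of each. It is compatible with the Hadoop ecosystem, offering a better experience with big data frameworks and freeing developers from tedious work. XDML has been extensively tested and tuned on massive datasets inside 360, and it shows good stability, scalability, and compatibility on machine learning tasks with large data volumes and ultra-high-dimensional features.

Anyone interested in machine learning or distributed systems is welcome to contribute code, file Issues, or open Pull Requests.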

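Architecture

[Figure: XDML architecture diagram]

To address ultra-large-scale machine learning scenarios, Qihoo 360 has open-sourced XDML, its internal computing framework for ultra-large-scale machine learning. XDML is a distributed machine learning platform built on a parameter server and a purpose-built caching mechanism. It has been tested and tuned on massive datasets inside 360, and it shows good stability, scalability, and compatibility on machine learning tasks with large data volumes and ultra-high-dimensional features.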

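Features

1. Provides functional modules for feature preprocessing and analysis, offline training, and model management

2. Implements machine learning algorithms commonly used on large-scale data

3. Makes full use of existing mature technologies to keep the whole framework efficient and stable

4. Is fully compatible with the Hadoop ecosystem and integrates seamlessly with existing big data tools, improving the ability to process massive data

5. Applies deep engineering optimizations at both the system-architecture and algorithm levels, greatly improving performance without loss of accuracy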

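Code Structure

1. ps

The core parameter server architecture of XDML and its components.

2. conf

XDML's configuration package, covering parameter server configuration as well as job and model configuration.

3. task

Jobs that XDML submits to the PS, covering pull and push operations. The tasks are: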

  • Task
  • PullTask
  • PushTask
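
The pull and push operations can be illustrated with a toy, in-memory stand-in for a parameter server. This is only a sketch in Python under stated assumptions: `ToyParameterServer`, `pull`, `push`, and `worker_step` are hypothetical names, the loss is a plain squared error chosen for brevity, and nothing here reflects the actual interfaces of XDML's Task, PullTask, or PushTask.

```python
from collections import defaultdict


class ToyParameterServer:
    """In-memory stand-in for a parameter server (hypothetical, not XDML's API)."""

    def __init__(self):
        # Sparse weight table: feature id -> weight, defaulting to 0.0.
        self.weights = defaultdict(float)

    def pull(self, keys):
        """Return the current weights for the requested feature ids."""
        return {k: self.weights[k] for k in keys}

    def push(self, deltas):
        """Add each worker-computed delta to the stored weight."""
        for k, delta in deltas.items():
            self.weights[k] += delta


def worker_step(ps, batch, learning_rate):
    """One worker iteration: pull weights, compute gradients, push updates."""
    # Only the features present in this batch are pulled.
    keys = {k for features, _ in batch for k in features}
    w = ps.pull(keys)
    grads = defaultdict(float)
    for features, label in batch:
        pred = sum(w[k] * v for k, v in features.items())
        err = pred - label  # derivative of 0.5 * (pred - label)^2
        for k, v in features.items():
            grads[k] += err * v / len(batch)
    ps.push({k: -learning_rate * g for k, g in grads.items()})


ps = ToyParameterServer()
batch = [({"f1": 1.0}, 2.0), ({"f2": 1.0}, -1.0)]
for _ in range(200):
    worker_step(ps, batch, learning_rate=0.5)
final_weights = ps.pull(["f1", "f2"])
print(final_weights)  # weights approach 2.0 for f1 and -1.0 for f2
```

Pulling only the keys that occur in a batch is what keeps this pattern workable when the feature space is very high-dimensional and each sample touches few features.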

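4. optimization

The package of optimization algorithms for XDML models.

5. ml

Some of the machine learning models already implemented in XDML.

6. feature

The feature analysis and feature processing module of XDML.

  • Feature analysis

    Feature analysis covers common metrics: skewness, kurtosis, and quantiles for numeric features, plus label-related metrics such as AUC, NDCG, mutual information, and correlation coefficients.

  • Feature processing

    Feature processing covers common preprocessing methods for numeric and categorical features. It includes the following operators: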

    • CategoryEncoder
    • MultiCategoryEncoder
    • NumericBuckter
    • NumericStandardizer
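
To make these metrics and operators concrete, here is a minimal single-machine sketch in Python. All function names are hypothetical and chosen for illustration; the README does not describe how XDML's operators choose bucket boundaries or category ids, so the conventions below (population moments, explicit boundaries, ids assigned in order of first appearance) are assumptions.

```python
import math
from bisect import bisect_right


def central_moment(xs, k):
    """k-th central moment of a sample (population form, divides by n)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)


def skewness(xs):
    """Third central moment divided by the standard deviation cubed."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def kurtosis(xs):
    """Fourth central moment divided by the variance squared (non-excess)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2


def auc(feature, labels):
    """AUC of a single feature against 0/1 labels via pairwise comparison."""
    pos = [f for f, y in zip(feature, labels) if y == 1]
    neg = [f for f, y in zip(feature, labels) if y == 0]
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def standardize(xs):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(central_moment(xs, 2))
    return [(x - mean) / std for x in xs]


def bucketize(xs, boundaries):
    """Map each value to a bucket index given sorted boundaries.

    Values below boundaries[0] get 0, values in
    [boundaries[0], boundaries[1]) get 1, and so on.
    """
    return [bisect_right(boundaries, x) for x in xs]


def encode_categories(values):
    """Assign each distinct category an integer id by first appearance."""
    index = {}
    for v in values:
        index.setdefault(v, len(index))
    return [index[v] for v in values], index


values = [1.0, 2.0, 3.0, 4.0, 10.0]
print(skewness(values), kurtosis(values))
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))     # 0.75
print(bucketize(values, [2.5, 5.0]))                # [0, 0, 1, 1, 2]
print(encode_categories(["a", "b", "a", "c"])[0])   # [0, 1, 0, 2]
```

The pairwise AUC is quadratic in the number of samples and is meant only to show the definition; a distributed implementation would normally rely on sorting or histograms instead.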

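7. model

XDML includes linear models trained with the Scope optimization algorithm proposed by Prof. Wu-Jun Li of Nanjing University, as well as Spark pipeline wrappers for some H2O models. Specifically, it includes the following models: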

Model:

  • LinearScope
  • MultiLinearScope
  • OVRLinearScope
  • H2ODRF
  • H2OGBM
  • H2OGLM
  • H2OMLP
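
The name OVRLinearScope suggests a one-vs-rest reduction of multiclass classification to several binary linear models, although the README does not say so. Assuming that reading, the sketch below shows the reduction in Python. It trains each binary model with plain batch gradient descent on the logistic loss, not with the Scope algorithm, and every function name in it is hypothetical.

```python
import math


def train_binary(xs, ys, lr=0.5, epochs=200):
    """Fit logistic-regression weights (bias last) by batch gradient descent."""
    dim = len(xs[0])
    w = [0.0] * (dim + 1)  # w[-1] is the bias term
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, y in zip(xs, ys):
            # zip stops at len(x), so the bias is added separately.
            z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            p = 1.0 / (1.0 + math.exp(-z))
            for i, xi in enumerate(x):
                grad[i] += (p - y) * xi
            grad[-1] += p - y
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad)]
    return w


def train_ovr(xs, labels):
    """Train one binary model per class: that class (1) against the rest (0)."""
    return {c: train_binary(xs, [1.0 if l == c else 0.0 for l in labels])
            for c in sorted(set(labels))}


def predict_ovr(models, x):
    """Return the class whose binary model assigns the highest score."""
    def score(w):
        return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
    return max(models, key=lambda c: score(models[c]))


# Three well-separated toy clusters (made-up data for illustration).
xs = [(1.0, 0.0), (1.2, 0.1), (0.0, 1.0), (0.1, 1.2), (-1.0, -1.0), (-1.2, -0.9)]
labels = ["a", "a", "b", "b", "c", "c"]
models = train_ovr(xs, labels)
print([predict_ovr(models, x) for x in xs])
```

One-vs-rest needs only a binary learner, which is why it pairs naturally with a linear model: each class gets its own weight vector and the vectors can be trained independently.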

8. example

Sample job submissions for XDML; see Example for reference.

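Build & Deployment Guide

XDML is a parameter-server-based distributed machine learning platform with a purpose-built caching mechanism, built on Kudu, Hazelcast, and the Hadoop ecosystem.

Requirements

  • CentOS >= 7
  • JDK >= 1.8
  • Maven >= 3.5.4
  • Scala >= 2.11
  • Hadoop >= 2.7.3
  • Spark >= 2.3.0
  • sparkling-water-core >= 2.3.0
  • Kudu >= 1.9
  • Hazelcast >= 3.9.3

Installing and Deploying Kudu

XDML is built on Kudu, so deploy Kudu first. For installation and deployment instructions, see the Kudu documentation.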

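Downloading the Source

git clone https://github.com/Qihoo360/XLearning-XDML

Building

mvn clean package -Dmaven.test.skip=true

After the build finishes, the target directory under the source root contains several files, including xdml-1.0.jar and xdml-1.0-jar-with-dependencies.jar. xdml-1.0.jar does not bundle third-party dependencies such as Spark and Kudu, whereas xdml-1.0-jar-with-dependencies.jar includes them.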

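Running the Example

Submission Parameters

  • Algorithm parameters
    • spark.xdml.learningRate: learning rate
  • Training parameters
    • spark.xdml.job.type: job type
    • spark.xdml.train.data.path: path to the training data
    • spark.xdml.train.data.partitionNum: number of partitions for the training data
    • spark.xdml.model.path: path where the model is stored
    • spark.xdml.train.iter: number of training iterations
    • spark.xdml.train.batchsize: batch size of the training data
  • PS-related parameters
    • spark.xdml.hz.clusterNum: number of machines in the Hazelcast cluster
    • spark.xdml.table.name: name of the Kudu table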

Submit Command

The example training job can be submitted with the following command:

  $SPARK_HOME/bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class net.qihoo.xitong.xdml.example.LRTest \
    --num-executors 50 \
    --executor-memory 40g \
    --executor-cores 2 \
    --driver-memory 4g \
    --conf "spark.xdml.table.name=lrtest" \
    --conf "spark.xdml.job.type=train" \
    --conf "spark.xdml.train.data.path=$trainpath" \
    --conf "spark.xdml.train.data.partitionNum=50" \
    --conf "spark.xdml.hz.clusterNum=50" \
    --conf "spark.xdml.model.path=$modelpath" \
    --conf "spark.xdml.train.iter=5" \
    --conf "spark.xdml.train.batchsize=10000" \
    --conf "spark.xdml.learningRate=0.1" \
    --jars xdml-1.0-jar-with-dependencies.jar \
    xdml-1.0-jar-with-dependencies.jar

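Note: $SPARK_HOME, $trainpath, and $modelpath in the submit command stand for the Spark client path, the HDFS path of the training data, and the HDFS path where the model is stored, respectively.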
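
When the same job is launched repeatedly with different settings, it can help to assemble the command from a dictionary of spark.xdml.* parameters. The Python sketch below is one way to do that and is not part of XDML: `build_submit_command` is a hypothetical helper, and the Spark home and HDFS paths are placeholder values.

```python
import shlex


def build_submit_command(spark_home, app_jar, main_class, resources, conf):
    """Assemble a spark-submit argument list from plain dictionaries."""
    cmd = [
        f"{spark_home}/bin/spark-submit",
        # yarn + cluster deploy mode replaces the deprecated "yarn-cluster" master.
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--class", main_class,
    ]
    for flag, value in resources.items():
        cmd += [f"--{flag}", str(value)]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    # The dependency-bundled jar is passed both via --jars and as the application jar.
    cmd += ["--jars", app_jar, app_jar]
    return cmd


resources = {
    "num-executors": 50,
    "executor-memory": "40g",
    "executor-cores": 2,
    "driver-memory": "4g",
}
xdml_conf = {
    "spark.xdml.table.name": "lrtest",
    "spark.xdml.job.type": "train",
    "spark.xdml.train.data.path": "hdfs:///tmp/example/train",  # placeholder path
    "spark.xdml.train.data.partitionNum": 50,
    "spark.xdml.hz.clusterNum": 50,
    "spark.xdml.model.path": "hdfs:///tmp/example/model",       # placeholder path
    "spark.xdml.train.iter": 5,
    "spark.xdml.train.batchsize": 10000,
    "spark.xdml.learningRate": 0.1,
}
cmd = build_submit_command(
    spark_home="/opt/spark",  # placeholder for $SPARK_HOME
    app_jar="xdml-1.0-jar-with-dependencies.jar",
    main_class="net.qihoo.xitong.xdml.example.LRTest",
    resources=resources,
    conf=xdml_conf,
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so the result can be handed to `subprocess.run(cmd, check=True)` without any shell quoting concerns.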

FAQ

XDML frequently asked questions

References

XDML draws on many outstanding results from academia and industry, and we are grateful for them!

Contact Us

Mail: [email protected]
QQ group: 874050710
