
feng-li / dlsa

Licence: GPL-3.0
Distributed least squares approximation (dlsa) implemented with Apache Spark

Programming Languages

  • Python
  • Shell
  • R
  • Makefile

Projects that are alternatives of or similar to dlsa

data-algorithms-with-spark
O'Reilly book: Data Algorithms with Spark by Mahmoud Parsian
Stars: ✭ 34 (+36%)
Mutual labels:  pyspark, spark-ml
machine-learning-course
Machine Learning Course @ Santa Clara University
Stars: ✭ 17 (-32%)
Mutual labels:  pyspark, spark-ml
ai-deployment
Focused on putting AI models into production and on model deployment
Stars: ✭ 149 (+496%)
Mutual labels:  pyspark, spark-ml
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+9972%)
Mutual labels:  pyspark, spark-ml
Tdigest
t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark
Stars: ✭ 274 (+996%)
Mutual labels:  distributed-computing, pyspark
isarn-sketches-spark
Routines and data structures for using isarn-sketches idiomatically in Apache Spark
Stars: ✭ 28 (+12%)
Mutual labels:  pyspark, spark-ml
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+500%)
Mutual labels:  distributed-computing, pyspark
pyspark-algorithms
PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
Stars: ✭ 72 (+188%)
Mutual labels:  distributed-computing, pyspark
Spark-for-data-engineers
Apache Spark for data engineers
Stars: ✭ 22 (-12%)
Mutual labels:  pyspark
Springboard-Data-Science-Immersive
No description or website provided.
Stars: ✭ 52 (+108%)
Mutual labels:  pyspark
hyperqueue
Scheduler for sub-node tasks for HPC systems with batch scheduling
Stars: ✭ 48 (+92%)
Mutual labels:  distributed-computing
protoactor-go
Proto Actor - Ultra fast distributed actors for Go, C# and Java/Kotlin
Stars: ✭ 4,138 (+16452%)
Mutual labels:  distributed-computing
distex
Distributed process pool for Python
Stars: ✭ 101 (+304%)
Mutual labels:  distributed-computing
check-engine
Data validation library for PySpark 3.0.0
Stars: ✭ 29 (+16%)
Mutual labels:  pyspark
DataEngineering
This repo contains commands that data engineers use in day to day work.
Stars: ✭ 47 (+88%)
Mutual labels:  pyspark
SynapseML
Simple and Distributed Machine Learning
Stars: ✭ 3,355 (+13320%)
Mutual labels:  pyspark
zmq
ZeroMQ based distributed patterns
Stars: ✭ 27 (+8%)
Mutual labels:  distributed-computing
kuwala
Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition we provide third-party data into data sc…
Stars: ✭ 474 (+1796%)
Mutual labels:  pyspark
sparklanes
A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (-32%)
Mutual labels:  pyspark
Prime95
Prime95 source code from GIMPS to find Mersenne primes.
Stars: ✭ 25 (+0%)
Mutual labels:  distributed-computing

dlsa: Distributed Least Squares Approximation

Implemented with Apache Spark

Introduction

In this work, we develop a distributed least squares approximation (DLSA) method that can solve a large family of regression problems (e.g., linear regression, logistic regression, and Cox's model) on a distributed system. By approximating each local objective function with a local quadratic form, we obtain a combined estimator as a weighted average of the local estimators. The resulting estimator is shown to be statistically as efficient as the global estimator, and it requires only one round of communication. We further conduct shrinkage estimation on top of the DLSA estimator using an adaptive Lasso approach; the solution is easily obtained by running the LARS algorithm on the master node. The resulting estimator is theoretically shown to possess the oracle property and to be selection consistent when combined with a newly designed distributed Bayesian information criterion (DBIC). The finite-sample performance and the computational efficiency are further illustrated by an extensive numerical study and an airline dataset.
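
As a sketch of the one-round combination step (the notation here is illustrative and may differ from the paper's): suppose worker k holds n_k of the N observations and returns a local estimator \hat{\theta}_k together with an estimate \hat{\Sigma}_k of its covariance. Replacing each local objective by its quadratic approximation around \hat{\theta}_k and minimizing the summed surrogate yields a weighted least squares combination

    \tilde{\theta} = \Bigl( \sum_{k=1}^{K} \alpha_k \hat{\Sigma}_k^{-1} \Bigr)^{-1} \sum_{k=1}^{K} \alpha_k \hat{\Sigma}_k^{-1} \hat{\theta}_k,
    \qquad \alpha_k = n_k / N.

Only the pairs (\hat{\theta}_k, \hat{\Sigma}_k^{-1}) need to travel to the master, which is why a single communication round suffices; the adaptive-Lasso/LARS shrinkage step then operates on this quadratic surrogate on the master node.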

System Requirements

  • Spark >= 2.3.1

  • Python >= 3.7.0 (note that Spark < 3.0 is only compatible with Python < 3.8)

  • R >= 3.5 (optional)

    • lars

    See setup.py for detailed requirements.

Make a Python module

  • First, pack the core code into a Python module:
make zip

A dlsa.zip file will then be created within the folder projects/.

  • Then you can ship it to the Spark cluster from within your Python script:
spark.sparkContext.addPyFile("dlsa.zip")
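
A minimal driver-side sketch that puts the two steps above together (the application name and the path to dlsa.zip are placeholders that depend on where the driver is launched; the functions exported by the dlsa module are not shown here, as they depend on the package layout):

from pyspark.sql import SparkSession

# Start (or attach to) a Spark session; the application name is a placeholder.
spark = SparkSession.builder.appName("dlsa-example").getOrCreate()

# Ship the packed module to the driver's sys.path and to every executor,
# so that `import dlsa` also works in the distributed code paths.
spark.sparkContext.addPyFile("dlsa.zip")  # path relative to where the driver runs

import dlsa  # importable once the zip has been added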

Run the PySpark code on the Spark platform

projects/bash/spark_dlsa_run.sh

or simply run

projects/logistic_dlsa.py
