
Azure / Pyspark Predictive Maintenance

License: MIT
Predictive Maintenance using PySpark

Projects that are alternatives of or similar to Pyspark Predictive Maintenance

Nab
The Numenta Anomaly Benchmark
Stars: ✭ 1,352 (+1265.66%)
Mutual labels:  jupyter-notebook
Afnet
Code for paper in CVPR2019, 'Attentive Feedback Network for Boundary-aware Salient Object Detection'
Stars: ✭ 99 (+0%)
Mutual labels:  jupyter-notebook
100days Ml Code
100 Days of ML Code (translation + hands-on practice)
Stars: ✭ 98 (-1.01%)
Mutual labels:  jupyter-notebook
Physlight
Stars: ✭ 99 (+0%)
Mutual labels:  jupyter-notebook
Pytorch Bert Document Classification
Enriching BERT with Knowledge Graph Embedding for Document Classification (PyTorch)
Stars: ✭ 99 (+0%)
Mutual labels:  jupyter-notebook
Facedetector
A re-implementation of mtcnn. Joint training, tutorial and deployment together.
Stars: ✭ 99 (+0%)
Mutual labels:  jupyter-notebook
Objectron
Objectron is a dataset of short, object-centric video clips. In addition, the videos also contain AR session metadata including camera poses, sparse point-clouds and planes. In each video, the camera moves around and above the object and captures it from different views. Each object is annotated with a 3D bounding box. The 3D bounding box describes the object’s position, orientation, and dimensions. The dataset contains about 15K annotated video clips and 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes
Stars: ✭ 1,352 (+1265.66%)
Mutual labels:  jupyter-notebook
Organic
Code repo for optimizing distributions of molecules.
Stars: ✭ 99 (+0%)
Mutual labels:  jupyter-notebook
Datascience book
Stars: ✭ 99 (+0%)
Mutual labels:  jupyter-notebook
Ucl Deep Learning And Reinforcement Learning
Deep learning and Reinforcement learning lecture and course work
Stars: ✭ 99 (+0%)
Mutual labels:  jupyter-notebook
Kmeans pytorch
kmeans using PyTorch
Stars: ✭ 98 (-1.01%)
Mutual labels:  jupyter-notebook
Hands On Exploratory Data Analysis With Python
Hands-on Exploratory Data Analysis with Python, published by Packt
Stars: ✭ 99 (+0%)
Mutual labels:  jupyter-notebook
Adversarially Learned Anomaly Detection
ALAD (Proceedings of IEEE ICDM 2018) official code
Stars: ✭ 99 (+0%)
Mutual labels:  jupyter-notebook
Linear algebra with python
Lecture Notes for Linear Algebra Featuring Python
Stars: ✭ 1,355 (+1268.69%)
Mutual labels:  jupyter-notebook
Math of machine learning
This is the code for "Mathematics of Machine Learning" by Siraj Raval on YouTube
Stars: ✭ 99 (+0%)
Mutual labels:  jupyter-notebook
Almond
A Scala kernel for Jupyter
Stars: ✭ 1,354 (+1267.68%)
Mutual labels:  jupyter-notebook
Bds
Code and examples from Business Data Science
Stars: ✭ 99 (+0%)
Mutual labels:  jupyter-notebook
Mv3d tf
Tensorflow implementation of Multi-View 3D Object Detection Network (in progress)
Stars: ✭ 99 (+0%)
Mutual labels:  jupyter-notebook
Delf enhanced
Wrapper of DELF Tensorflow Model
Stars: ✭ 98 (-1.01%)
Mutual labels:  jupyter-notebook
Ml Sound Classifier
Machine Learning Sound Classifier
Stars: ✭ 98 (-1.01%)
Mutual labels:  jupyter-notebook

Predictive Maintenance using PySpark

Predictive maintenance is one of the most common machine learning use cases. With the latest advancements in information technology, the volume of stored data in this domain is growing faster than ever before, which makes it necessary to leverage big data analytic capabilities to efficiently transform large amounts of data into business intelligence. Microsoft has published a series of learning materials, including blogs, solution templates, modeling guides, and sample tutorials, in the domain of predictive maintenance. In this tutorial, we extend those materials by providing a detailed step-by-step process of using the Spark Python API, PySpark, to demonstrate how to approach predictive maintenance for big data scenarios. The tutorial covers typical data science steps such as data ingestion, cleansing, feature engineering, and model development.

Business Scenario and Data

The input data is simulated to reflect features that are generic to most predictive maintenance scenarios. To keep the tutorial quick to complete, the data was simulated to be around 1.3 GB, but the same PySpark framework can easily be applied to a much larger data set. The data is hosted on a publicly accessible Azure Blob Storage container and can be downloaded from here. In this tutorial, we import the data directly from blob storage, as sketched below.
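
As a minimal sketch of that import step (the storage URL, file name, and read options below are illustrative placeholders, not the exact values used in the notebooks), reading the CSV into a Spark DataFrame looks like this:

    # Sketch of importing the simulated data directly from blob storage.
    # The wasb:// URL is a placeholder; reading it requires the hadoop-azure
    # connector, which the Data Science Virtual Machine has preconfigured.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PdMTutorial").getOrCreate()
    df = spark.read.csv(
        "wasb://<container>@<account>.blob.core.windows.net/pdm_data.csv",
        header=True,
        inferSchema=True)
    df.printSchema()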

The data set has around 2 million records with 172 columns, simulated for 1,900 machines collected over 4 years. Each machine includes a device which stores data such as warnings, problems, and errors generated by the machine over time. Each record has a device ID and a time stamp for each day, along with aggregated features for that day, such as the total number of a certain type of warning received in a day. Four categorical columns were also included to demonstrate generic handling of categorical variables. The goal is to predict whether a machine will fail in the next 7 days; the last column of the data set indicates whether a failure occurred and was reported on that day.
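
As an illustrative sketch of such a forward-looking label (the column names deviceid, date, and failure are assumptions, not the actual schema), a day can be marked positive if any failure is reported for the same device within the following 7 days:

    # Hypothetical columns: deviceid, date (date/timestamp), failure (0/1).
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    SECONDS_PER_DAY = 24 * 60 * 60
    # Forward-looking window per device: today through the next 7 days.
    w = (Window.partitionBy("deviceid")
               .orderBy(F.col("date").cast("timestamp").cast("long"))
               .rangeBetween(0, 7 * SECONDS_PER_DAY))
    labeled = df.withColumn("label",
                            (F.max("failure").over(w) > 0).cast("int"))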

Jupyter Notebooks

There are three Jupyter notebooks in this GitHub repository.

File Name  Description
Notebook_1_DataCleansing_FeatureEngineering  Data cleansing, exploration, and some parts of feature engineering.
Notebook_2_FeatureEngineering_RollingCompute  How to deal with rolling feature computation for big data; this was one of the major roadblocks (see the sketch after this table).
Notebook_3_Labeling_FeatureSelection_Modeling  Over-labeling technique, feature reduction, down-sampling, modeling, hyper-parameter tuning, and cross-validation.
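
A rolling computation of this kind can be expressed with Spark SQL window functions. The sketch below is a minimal illustration under assumed column names (deviceid, date, warning_a), not the notebook's exact code:

    # Trailing 7-day rolling aggregates per device; the column names are
    # illustrative, not the actual schema used in the notebooks.
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    SECONDS_PER_DAY = 24 * 60 * 60
    w = (Window.partitionBy("deviceid")
               .orderBy(F.col("date").cast("timestamp").cast("long"))
               .rangeBetween(-7 * SECONDS_PER_DAY, 0))
    df = (df.withColumn("warning_a_sum_7d", F.sum("warning_a").over(w))
            .withColumn("warning_a_mean_7d", F.avg("warning_a").over(w)))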

We formatted this tutorial as Jupyter notebooks because it is easy to show the step-by-step process this way. You can also easily turn the notebooks into executable PySpark script(s) using your favorite IDE.
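
For the modeling steps in Notebook_3, the Spark ML API supports cross-validated hyper-parameter tuning. The following minimal sketch uses assumed column names ('features', 'label') and an illustrative parameter grid; it is not the notebook's exact setup:

    # Cross-validated random forest; 'train' is assumed to be a DataFrame
    # with an assembled 'features' vector column and a binary 'label'.
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    rf = RandomForestClassifier(featuresCol="features", labelCol="label")
    grid = (ParamGridBuilder()
            .addGrid(rf.numTrees, [20, 50])
            .addGrid(rf.maxDepth, [5, 10])
            .build())
    cv = CrossValidator(estimator=rf,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3)
    model = cv.fit(train)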

Hardware Specifications

The hardware used in this tutorial is a Linux Data Science Virtual Machine with 32 cores and 448 GB of memory. For more detailed information about the Data Science Virtual Machine, please visit the link. For the size of the data used in this tutorial (1.3 GB), a machine with fewer cores and less memory would also be adequate. However, in real-life scenarios, you should choose a hardware configuration appropriate for the specific big data use case. The Jupyter notebooks included in this tutorial can also be downloaded and run on any machine that has PySpark enabled.

Spark Configuration

The Spark version installed on the Linux Data Science Virtual Machine for this tutorial is 2.0.2 with Python version 2.7.5.

Here are some configurations that need to be performed before running this tutorial on a Linux machine.

  1. For standalone Spark, the driver is the executor. The default executor memory is 5g, which needs to be changed manually in "spark-defaults.conf" using the following commands from the Linux terminal:

    cd /dsvm/tools/spark/current/conf
    sudo cp spark-defaults.conf.template spark-defaults.conf
    sudo vi spark-defaults.conf
    

    Uncomment and change option 'spark.driver.memory' from 5g to Xg (whatever works for your machine).

  2. If you are using a machine with many cores (e.g., 32), you will usually encounter the error "spark job failed with too many open files". This is because the default soft limit is 1024. It is recommended to increase the ulimit to 64K. You can configure the ulimit using the following commands:

    sudo vi /etc/security/limits.conf
    

    Add the following lines to the configuration file, then log out and log back in for the change to take effect:

    *   soft    nofile 65536
    *   hard    nofile 65536
    
  3. Sometimes you might encounter the error "job failed with no space left on device". This is because, by default, Spark uses the /tmp directory to store intermediate data. To solve this problem, add the following line to spark-defaults.conf:

    spark.local.dir                     SOME/DIR/WHERE/YOU/HAVE/SPACE
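
To confirm that a running PySpark session picked up these settings, you can inspect the effective Spark configuration (an illustrative check, run from a PySpark session):

    # Print the effective values of the settings configured above.
    conf = spark.sparkContext.getConf()
    print(conf.get("spark.driver.memory", "not set"))
    print(conf.get("spark.local.dir", "not set"))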

Prerequisites

  1. The user should already know some basics of PySpark. This is not meant to be a PySpark 101 tutorial.
  2. Have PySpark (Spark 2.0.2, Python 2.7) already configured. Please note that if you are using Python 3 on your machine, a few functions in this tutorial require very minor tweaks, because some Python 2 idioms were removed or renamed in Python 3 (see the examples after this list).
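
A few illustrative examples of such tweaks (these are generic Python 2 idioms, not lines taken from the notebooks):

    # Python 2 idioms and their Python 3 equivalents:
    #   print "hello"     ->  print("hello")
    #   d.iteritems()     ->  d.items()
    #   xrange(n)         ->  range(n)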

References

  1. https://blogs.technet.microsoft.com/machinelearning/2016/04/21/predictive-maintenance-modelling-guide-in-the-cortana-intelligence-gallery/

  2. https://gallery.cortanaintelligence.com/Collection/Predictive-Maintenance-Modelling-Guide-1

  3. https://gallery.cortanaintelligence.com/Notebook/Predictive-Maintenance-Modelling-Guide-R-Notebook-1

  4. https://gallery.cortanaintelligence.com/Notebook/Predictive-Maintenance-Modelling-Guide-Python-Notebook-1

  5. https://gallery.cortanaintelligence.com/Solution/Predictive-Maintenance-10

  6. https://gallery.cortanaintelligence.com/Experiment/Predictive-Maintenance-Template-2

Acknowledgement

Special thanks to Said Bleik, Yiyu Chen, and Ke Huang for great discussions about PySpark, and to Fidan Boylu Uz for proofreading and improving the tutorial materials.

Contributing and Adapting

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].