AICoE / Prometheus Anomaly Detector

Licence: gpl-3.0
A newer, updated version of the Prometheus Anomaly Detector (legacy version: https://github.com/AICoE/prometheus-anomaly-detector-legacy)

Projects that are alternatives of or similar to Prometheus Anomaly Detector

Byte Sized Code
A collection of Jupyter notebooks for learning Python from the ground up.
Stars: ✭ 142 (-47.99%)
Mutual labels:  hacktoberfest, jupyter-notebook
100daysofmlcode
My journey to learn and grow in the domain of Machine Learning and Artificial Intelligence by performing the #100DaysofMLCode Challenge.
Stars: ✭ 146 (-46.52%)
Mutual labels:  hacktoberfest, jupyter-notebook
Lacmus
Lacmus is a cross-platform application that helps to find people who are lost in the forest using computer vision and neural networks.
Stars: ✭ 142 (-47.99%)
Mutual labels:  hacktoberfest, jupyter-notebook
Rasa Ptbr Boilerplate
A template for creating an FAQ chatbot using Rasa, Rocket.chat, and Elasticsearch
Stars: ✭ 128 (-53.11%)
Mutual labels:  hacktoberfest, jupyter-notebook
Hacktoberfest2020 Contributions
A beginner-friendly project to help you in open-source contributions. Made specifically for contributions in HACKTOBERFEST 2020! Hello World Programs and Algorithms! Please leave a star ⭐ to support this project! ✨
Stars: ✭ 196 (-28.21%)
Mutual labels:  hacktoberfest, jupyter-notebook
Dea Notebooks
Repository for Digital Earth Australia Jupyter Notebooks: tools and workflows for geospatial analysis with Open Data Cube and xarray
Stars: ✭ 133 (-51.28%)
Mutual labels:  hacktoberfest, jupyter-notebook
Sqlcell
SQLCell is a magic function for the Jupyter Notebook that executes raw, parallel, parameterized SQL queries with the ability to accept Python values as parameters and assign output data to Python variables while concurrently running Python code. And *much* more.
Stars: ✭ 145 (-46.89%)
Mutual labels:  hacktoberfest, jupyter-notebook
Perfil Politico
A platform for profiling public figures in Brazilian politics
Stars: ✭ 117 (-57.14%)
Mutual labels:  hacktoberfest, jupyter-notebook
Virgilio
Virgilio is developed and maintained by these awesome people. You can email us virgilio.datascience (at) gmail.com or join the Discord chat.
Stars: ✭ 13,200 (+4735.16%)
Mutual labels:  hacktoberfest, jupyter-notebook
Nlp profiler
A simple NLP library allows profiling datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.
Stars: ✭ 181 (-33.7%)
Mutual labels:  hacktoberfest, jupyter-notebook
Mastering Python For Finance Second Edition
Source code for: Mastering Python for Finance, Second Edition
Stars: ✭ 127 (-53.48%)
Mutual labels:  hacktoberfest, jupyter-notebook
Naucse.python.cz
Website with learning materials
Stars: ✭ 248 (-9.16%)
Mutual labels:  hacktoberfest, jupyter-notebook
Pandaset Devkit
Stars: ✭ 121 (-55.68%)
Mutual labels:  hacktoberfest, jupyter-notebook
99 Ml Learning Projects
A list of 99 machine learning projects for anyone interested to learn from coding and building projects
Stars: ✭ 139 (-49.08%)
Mutual labels:  hacktoberfest, jupyter-notebook
Software Training
RoboJackets Software Training
Stars: ✭ 124 (-54.58%)
Mutual labels:  hacktoberfest, jupyter-notebook
Jupyter
Stars: ✭ 145 (-46.89%)
Mutual labels:  hacktoberfest, jupyter-notebook
Lab Workshops
Materials for workshops on text mining, machine learning, and data visualization
Stars: ✭ 112 (-58.97%)
Mutual labels:  hacktoberfest, jupyter-notebook
Scriptsdump
The biggest dump of scripts ever!
Stars: ✭ 114 (-58.24%)
Mutual labels:  hacktoberfest, jupyter-notebook
Tamburetei
Turning the courses ("chairs") of [email protected] into stools
Stars: ✭ 177 (-35.16%)
Mutual labels:  hacktoberfest, jupyter-notebook
Hacktoberfest2020
A repo for new open source contributors to begin with open source contribution. Contribute and earn awesome swags.
Stars: ✭ 221 (-19.05%)
Mutual labels:  hacktoberfest, jupyter-notebook

Anomaly Detection in Prometheus Metrics

This repository contains the prototype for a Prometheus Anomaly Detector (PAD), which can be deployed on OpenShift. The PAD is a framework for deploying a metric prediction model that detects anomalies in Prometheus metrics.

Prometheus is the chosen application for monitoring across multiple products and platforms. Prometheus metrics are time series data identified by a metric name and key/value pairs. As the volume of incoming metrics grows, it becomes harder to see the signal in the noise. The current state of the art is to graph metrics on dashboards and alert on thresholds. This application leverages machine learning algorithms such as Fourier and Prophet models to perform time series forecasting and predict anomalous behavior in the metrics. The predicted values are compared with the actual values, and if they differ by more than the default threshold values, the data point is flagged as an anomaly.
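The comparison between predicted and actual values can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the "threshold" is the prediction band given by the yhat_lower and yhat_upper values listed under Implementation, and the Forecast class and is_anomaly function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Forecast:
    """Hypothetical container for one predicted data point."""
    yhat: float        # predicted value
    yhat_lower: float  # lower bound of the uncertainty interval
    yhat_upper: float  # upper bound of the uncertainty interval


def is_anomaly(actual: float, forecast: Forecast) -> bool:
    """Flag the observation when it falls outside the predicted band (assumed rule)."""
    return actual < forecast.yhat_lower or actual > forecast.yhat_upper


forecast = Forecast(yhat=0.95, yhat_lower=0.80, yhat_upper=1.10)
print(is_anomaly(1.00, forecast))  # inside the band
print(is_anomaly(0.20, forecast))  # far below the band
```

A band check like this treats every excursion equally; a real deployment might also consider how far and for how long the value stays outside the band.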

Use Case

The use case for this framework is to assist teams with real-time alerting on their system and application metrics. Developers can also use the time series forecasts produced by the models to update or enhance their systems so that similar anomalies are handled in the future.

Configurations

  • FLT_PROM_URL - URL of the Prometheus host from which the metric data will be collected
  • FLT_PROM_ACCESS_TOKEN - OAuth token passed as a header to connect to the Prometheus host (Optional)
  • FLT_METRICS_LIST - List of metrics to collect from Prometheus and use to train the Prophet model. Multiple metrics are separated by a semicolon (;).
    Example: "up{app='openshift-web-console', instance='172.44.0.18:8443'}; up{app='openshift-web-console', instance='172.44.4.18:8443'}; es_process_cpu_percent{instance='172.44.17.134:30290'}"
    If one metric and label configuration matches more than one time series, all matching time series will be collected.
  • FLT_RETRAINING_INTERVAL_MINUTES - How often the model is retrained, in minutes. (Default: 15)
    Example: If set to 15, the past 15 minutes of metric data are collected every 15 minutes and appended to the training dataframe.
  • FLT_ROLLING_TRAINING_WINDOW_SIZE - Limits the size of the training dataframe to prevent out-of-memory errors. Set it to the duration of data that should be kept in memory as dataframes. (Default: 15d)
    Example: If set to 1d, metric data older than 1 day is deleted from the training dataframe each time before the model is trained.
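To make the semantics of these options concrete, here is a rough sketch of how they could be parsed and how a rolling window might be applied. It is an assumption-laden illustration, not the contents of configuration.py: load_config, parse_metrics_list, parse_duration and trim_to_window are hypothetical helpers, and only the m/h/d duration suffixes are handled.

```python
import re
from datetime import datetime, timedelta

_UNITS = {"m": "minutes", "h": "hours", "d": "days"}  # assumed suffixes


def parse_metrics_list(raw):
    """Split a semicolon-separated metric list, dropping empty entries."""
    return [metric.strip() for metric in raw.split(";") if metric.strip()]


def parse_duration(text):
    """Turn a string such as '15d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def load_config(env):
    """Read the options from a mapping, using the documented defaults."""
    return {
        "metrics": parse_metrics_list(env["FLT_METRICS_LIST"]),
        "retraining_interval": timedelta(
            minutes=int(env.get("FLT_RETRAINING_INTERVAL_MINUTES", "15"))
        ),
        "rolling_window": parse_duration(
            env.get("FLT_ROLLING_TRAINING_WINDOW_SIZE", "15d")
        ),
    }


def trim_to_window(samples, window, now):
    """Keep only (timestamp, value) samples newer than now - window."""
    cutoff = now - window
    return [(ts, value) for ts, value in samples if ts >= cutoff]


config = load_config({
    "FLT_METRICS_LIST": "up{app='web'}; es_process_cpu_percent",
    "FLT_ROLLING_TRAINING_WINDOW_SIZE": "1d",
})
now = datetime(2019, 9, 4, 12, 0)
samples = [(now - timedelta(days=2), 1.0), (now - timedelta(hours=3), 0.0)]
print(config["metrics"])
print(trim_to_window(samples, config["rolling_window"], now))
```

In practice the mapping passed to load_config would be os.environ. Note that the naive split on ";" would break a label value that itself contains a semicolon.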

If you are testing locally, you can do the following:

  • Environment variables are loaded from .env. pipenv loads these automatically, so make sure you execute everything via pipenv run or from within a pipenv shell.

Configuration is currently done via environment variables. The configuration options are defined in prometheus-anomaly-detector/configuration.py.
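As an illustration of how FLT_PROM_URL and FLT_PROM_ACCESS_TOKEN might be used together, the sketch below builds a request against the Prometheus HTTP API range-query endpoint (/api/v1/query_range). The build_range_query helper is hypothetical, and sending the token as a Bearer Authorization header is an assumption; the source only says the token is passed as a header.

```python
import urllib.parse
import urllib.request


def build_range_query(prom_url, metric, start, end, step="60s", token=None):
    """Build (but do not send) a range query for one metric."""
    params = urllib.parse.urlencode(
        {"query": metric, "start": start, "end": end, "step": step}
    )
    url = f"{prom_url.rstrip('/')}/api/v1/query_range?{params}"
    # Assumption: the OAuth token is sent as a Bearer token.
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return urllib.request.Request(url, headers=headers)


request = build_range_query(
    "http://demo.robustperception.io:9090", "up", 1567600000, 1567603600
)
print(request.full_url)
# To actually fetch the data: urllib.request.urlopen(request), then parse the JSON body.
```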

Once the environment variables are set, you can run the application locally as:

python app.py

You can also use the Makefile to run the application:

make run_app

Using the pre-built Container Image

  • We have a pre-built container image available that you can use to deploy the Prometheus Anomaly Detector.
  • The image is hosted at: quay.io/aicoe/prometheus-anomaly-detector:latest
  • Example command: (make sure port 8080 is available)
    docker run --name pad -p 8080:8080 --network host \
        --env FLT_PROM_URL=http://demo.robustperception.io:9090 \
        --env FLT_RETRAINING_INTERVAL_MINUTES=15 \
        --env FLT_METRICS_LIST='up' \
        --env APP_FILE=app.py \
        --env FLT_DATA_START_TIME=3d \
        --env FLT_ROLLING_TRAINING_WINDOW_SIZE=15d \
        quay.io/aicoe/prometheus-anomaly-detector:latest
    
  • To remove the container, run docker rm pad

Implementation

The current setup is as follows:

[Figure: Thoth Dgraph anomaly detection - blog post (1)]

  • Data - Prometheus metrics scraped from specified hosts/targets
  • Models being trained -
    • Fourier - Maps signals from the time domain to the frequency domain. It represents periodic time series data as a sum of sinusoidal components (sine and cosine).
    • Prophet (https://facebook.github.io/prophet/) - Procedure developed by Facebook for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. The forecast provides the following values:
      • yhat - Predicted time series value
      • yhat_lower - Lower bound of the uncertainty interval
      • yhat_upper - Upper bound of the uncertainty interval
  • Visualization - Grafana dashboards are created to visualize the predicted metrics
  • Alerts - Prometheus alerts are configured based on predicted metric values
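The Fourier idea described above, representing a periodic series as a sum of sinusoids and continuing them into the future, can be sketched with the standard library alone. This is not the project's Fourier model: fourier_forecast and its n_harmonics parameter are hypothetical, and the naive O(n^2) transform is only suitable for short series.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform (O(n^2)); fine for a short sketch."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def fourier_forecast(values, steps, n_harmonics=3):
    """Extrapolate a series by continuing its strongest sinusoidal components."""
    n = len(values)
    mean = sum(values) / n
    spectrum = dft([v - mean for v in values])

    # Positive frequencies below Nyquist; each one stands for a conjugate pair.
    candidates = range(1, (n + 1) // 2)
    strongest = sorted(candidates, key=lambda k: abs(spectrum[k]), reverse=True)
    strongest = strongest[:n_harmonics]

    forecast = []
    for t in range(n, n + steps):
        value = mean
        for k in strongest:
            amplitude = 2 * abs(spectrum[k]) / n
            phase = cmath.phase(spectrum[k])
            value += amplitude * math.cos(2 * math.pi * k * t / n + phase)
        forecast.append(value)
    return forecast


# A clean sine wave with a period of 8 samples, observed for 4 periods.
history = [math.sin(2 * math.pi * t / 8) for t in range(32)]
predicted = fourier_forecast(history, steps=8)
print([round(v, 3) for v in predicted])
```

Keeping only a few harmonics acts as a crude noise filter; on real metrics, the choice of how many to keep would need tuning.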

[Figure: Thoth Dgraph anomaly detection - blog post (2)]

Model Testing

For a given timeframe of a metric with known anomalies, the PAD can be run in test mode to check whether the models report these anomalies. The accuracy and performance of the models can then be logged as metrics to MLflow so that results can be compared.
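One way to quantify whether the models reported the known anomalies is to compare the flagged points against the labelled ones. The sketch below computes precision and recall; which metrics the project actually logs is not stated here, so score_detections and the chosen metrics are assumptions for illustration.

```python
def score_detections(true_flags, predicted_flags):
    """Compare known anomalies with model-flagged anomalies, point by point."""
    pairs = list(zip(true_flags, predicted_flags))
    true_pos = sum(1 for truth, pred in pairs if truth and pred)
    false_pos = sum(1 for truth, pred in pairs if not truth and pred)
    false_neg = sum(1 for truth, pred in pairs if truth and not pred)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return {"precision": precision, "recall": recall}


known = [0, 0, 1, 1, 0, 1]    # labelled anomalies in the test timeframe
flagged = [0, 1, 1, 0, 0, 1]  # anomalies reported by a model
scores = score_detections(known, flagged)
print(scores)
```

Each entry in the resulting dictionary could then be recorded with a call such as mlflow.log_metric(name, value) so that runs can be compared in the tracking UI.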

MLflow (https://mlflow.org/) is an open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment. It currently offers three components:

[Figure: Screenshot from 2019-09-04 15-19-57]

Test Configurations

  • FLT_PROM_URL - URL of the Prometheus host from which the metric data will be collected
  • FLT_PROM_ACCESS_TOKEN - OAuth token passed as a header to connect to the Prometheus host (Optional)
  • FLT_METRICS_LIST - List of metrics to collect from Prometheus and use to train the Prophet model. Multiple metrics are separated by a semicolon (;).
    Example: "up{app='openshift-web-console', instance='172.44.0.18:8443'}; up{app='openshift-web-console', instance='172.44.4.18:8443'}; es_process_cpu_percent{instance='172.44.17.134:30290'}"
    If one metric and label configuration matches more than one time series, all matching time series will be collected.
  • FLT_RETRAINING_INTERVAL_MINUTES - How often the model is retrained, in minutes. (Default: 15)
    Example: If set to 15, the past 15 minutes of metric data are collected every 15 minutes and appended to the training dataframe.
  • FLT_ROLLING_TRAINING_WINDOW_SIZE - Limits the size of the training dataframe to prevent out-of-memory errors. Set it to the duration of data that should be kept in memory as dataframes. (Default: 15d)
    Example: If set to 1d, metric data older than 1 day is deleted from the training dataframe each time before the model is trained.
  • MLFLOW_TRACKING_URI - URI of the MLflow tracking server
  • FLT_TRUE_ANOMALY_THRESHOLD - Threshold value to calculate true anomalies using a linear function
  • FLT_DATA_START_TIME - Start of the time window of metric data to use
  • FLT_DATA_END_TIME - End of the time window of metric data to use

Environment variables are loaded from .env. pipenv loads these automatically, so make sure you execute everything via pipenv run or from within a pipenv shell.

Configuration is currently done via environment variables. The configuration options are defined in prometheus-anomaly-detector/test_configuration.py.

Once the environment variables are set, you can run the application locally as:

python test_model.py

You can also use the Makefile to run the application:

make run_test

You can now view the logged metrics in your MLflow tracking server UI.

[Figure: Screenshot from 2019-09-04 15-27-36]
