keikoproj / active-monitor

Licence: Apache-2.0 license
Provides deep monitoring and self-healing of Kubernetes clusters

Active-Monitor

Motivation

Active-Monitor is a Kubernetes custom resource controller which enables deep cluster monitoring and self-healing using Argo workflows.

While it is not too difficult to know that all entities in a cluster are running individually, it is often quite challenging to know that they can all coordinate with each other as required for successful cluster operation (network connectivity, volume access, etc).

Overview

Active-Monitor creates a new health namespace when it is installed in the cluster. Users can then create HealthCheck objects and submit them to the Kubernetes API server. A HealthCheck or Remedy is essentially an instrumented wrapper around an Argo workflow.

The HealthCheck workflow runs periodically, as defined by either the repeatAfterSec property or the schedule: cron property in its spec, and is watched by the Active-Monitor controller.

Active-Monitor sets the status of the HealthCheck CR to indicate whether the monitoring check succeeded or failed. If the monitoring check failed, the Remedy workflow is executed to fix the issue, and the status of the Remedy is also recorded in the CR. External systems can query these CRs and take appropriate action if a check failed.

The RemedyRunsLimit parameter configures how many times a remedy should be run. If the Remedy action fails for any reason, no further retries are attempted. The parameter is optional; if it is not set, the RemedyWorkflow is triggered whenever the HealthCheck workflow fails.

The RemedyResetInterval parameter resets the remedy once the reset interval has elapsed, so that the RemedyWorkflow can be retried if the monitor workflow fails again. If the remedy has reached RemedyRunsLimit, it is also reset when the HealthCheck passes in any subsequent run before RemedyResetInterval elapses.
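
The interaction between RemedyRunsLimit and RemedyResetInterval can be sketched as a small decision function. This is only one reading of the two paragraphs above, not the controller's actual implementation: the names RemedyState and should_run_remedy are hypothetical, timestamps are plain seconds, and the rule that a failed Remedy stops further retries is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemedyState:
    """Hypothetical bookkeeping for one HealthCheck (not a controller type)."""
    runs: int = 0               # remedy runs since the last reset
    last_reset_at: float = 0.0  # time of the last reset, in seconds


def should_run_remedy(
    state: RemedyState,
    check_passed: bool,
    now: float,
    runs_limit: Optional[int] = None,
    reset_interval: Optional[float] = None,
) -> bool:
    """Decide whether to trigger the remedy after one HealthCheck run."""
    if check_passed:
        # A passing check clears the run counter.
        state.runs = 0
        state.last_reset_at = now
        return False
    if runs_limit is None:
        # No limit configured: the remedy runs on every failure.
        return True
    if reset_interval is not None and now - state.last_reset_at >= reset_interval:
        # The reset interval has elapsed, so retries are allowed again.
        state.runs = 0
        state.last_reset_at = now
    if state.runs >= runs_limit:
        return False
    state.runs += 1
    return True
```

With a limit of 2 and a reset interval of 100 seconds, the first two failures trigger the remedy, later failures do not, and a passing check or the elapsed interval makes the remedy eligible again.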

Typical examples of such workflows include tests for basic Kubernetes object creation and deletion, and tests for cluster-wide services such as policy engine checks and authentication and authorization checks.

The kinds of HealthChecks one could run with Active-Monitor include:

  • verify namespace and deployment creation
  • verify AWS resources are using < 80% of their instance limits
  • verify kube-dns by running DNS lookups on the network
  • verify kube-dns by running DNS lookups on localhost
  • verify KIAM agent by running aws sts get-caller-identity on all available nodes
  • verify whether a pod has reached its maximum thread count
  • verify whether the storage volume for a pod (e.g. prometheus) has reached its capacity
  • verify whether critical pods (e.g. calico, kube-dns/core-dns) are in a Failed or CrashLoopBackOff state
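
As an illustration of the last item in the list, the script a workflow container runs could look roughly like the sketch below. It is not taken from the project: the function names are hypothetical, and it assumes the pod list is supplied as the JSON produced by kubectl get pods -o json. The fields it reads (status.phase and containerStatuses[].state.waiting.reason) are part of the standard Kubernetes Pod API.

```python
import json
import sys
from typing import Any, Dict, List


def unhealthy_pods(pod_list: Dict[str, Any]) -> List[str]:
    """Return names of pods that are Failed or have a container in CrashLoopBackOff."""
    bad: List[str] = []
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "<unknown>")
        status = pod.get("status", {})
        if status.get("phase") == "Failed":
            bad.append(name)
            continue
        for container in status.get("containerStatuses", []):
            waiting = container.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                bad.append(name)
                break  # one crashing container is enough to flag the pod
    return bad


def main(stream=sys.stdin) -> int:
    """Read a pod list as JSON and return 1 if any pod is unhealthy, else 0."""
    failing = unhealthy_pods(json.load(stream))
    for name in failing:
        print(f"unhealthy pod: {name}")
    return 1 if failing else 0
```

Calling sys.exit(main()) at the end of the script makes the container exit with a non-zero code when a critical pod is unhealthy, which causes the workflow step to fail.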

HealthChecks can run at either the Cluster or the Namespace level, in any namespace, provided that the namespace already exists. The level field in the HealthCheck spec defines the scope at which the check runs; it can be either Namespace or Cluster.

When level is set to Namespace, Active-Monitor creates a ServiceAccount in the namespace defined in the workflow spec. It also creates a Role and a RoleBinding with namespace-level permissions so that HealthChecks can be performed in that namespace.

When level is set to Cluster, Active-Monitor creates a ServiceAccount in the namespace defined in the workflow spec. It also creates a ClusterRole and a ClusterRoleBinding with cluster-level permissions so that HealthChecks can be performed at cluster scope.
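
The mapping described in the two paragraphs above can be summarised in a few lines. The helper below is purely illustrative (rbac_kinds_for_level is a hypothetical name, not a project function), and treating the level value as case-insensitive is an assumption made here, not documented behaviour.

```python
from typing import List


def rbac_kinds_for_level(level: str) -> List[str]:
    """List the RBAC object kinds created for a HealthCheck level."""
    normalized = level.strip().lower()  # assumption: case does not matter
    if normalized == "namespace":
        return ["ServiceAccount", "Role", "RoleBinding"]
    if normalized == "cluster":
        return ["ServiceAccount", "ClusterRole", "ClusterRoleBinding"]
    raise ValueError(f"unsupported level: {level!r}")
```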

Dependencies

Installation Guide

# step 0: ensure that all dependencies listed above are installed or present

# step 1: install argo workflow controller
kubectl apply -f https://raw.githubusercontent.com/keikoproj/active-monitor/master/deploy/deploy-argo.yaml

# step 2: install active-monitor CRD and start controller
kubectl apply -f https://raw.githubusercontent.com/keikoproj/active-monitor/master/config/crd/bases/activemonitor.keikoproj.io_healthchecks.yaml
kubectl apply -f https://raw.githubusercontent.com/keikoproj/active-monitor/master/deploy/deploy-active-monitor.yaml

Alternate Install - using locally cloned code

# step 0: ensure that all dependencies listed above are installed or present

# step 1: install argo workflow-controller
kubectl apply -f deploy/deploy-argo.yaml

# step 2: install active-monitor controller
make install
kubectl apply -f deploy/deploy-active-monitor.yaml

# step 3: run the controller via Makefile target
make run

Usage and Examples

Create a new healthcheck:

Example 1:

Create a new healthcheck with cluster level bindings to specified serviceaccount and in health namespace:

kubectl create -f https://raw.githubusercontent.com/keikoproj/active-monitor/master/examples/inlineHello.yaml

OR with local source code:

kubectl create -f examples/inlineHello.yaml

Then, list all healthchecks:

kubectl get healthcheck -n health OR kubectl get hc -n health

NAME                 LATEST STATUS   SUCCESS CNT     FAIL CNT    AGE
inline-hello-7nmzk   Succeeded        7               0          7m53s

View additional details/status of a healthcheck:

kubectl describe healthcheck inline-hello-7nmzk -n health

...
Status:
  Failed Count:              0
  Finished At:               2019-08-09T22:50:57Z
  Last Successful Workflow:  inline-hello-4mwxf
  Status:                    Succeeded
  Success Count:             13
Events:                      <none>
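
An external system that wants to act on this status could read the CR as JSON (for example from kubectl get healthcheck <name> -n health -o json) and reduce it to a few fields. The sketch below is a guess at how that might look: summarize_healthcheck is a hypothetical helper, and the JSON field names (status, successCount, failedCount) are inferred from the describe output above, not confirmed against the CRD.

```python
from typing import Any, Dict


def summarize_healthcheck(hc: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a HealthCheck CR (parsed JSON) to the fields an alerting job needs."""
    status = hc.get("status", {})
    return {
        "name": hc.get("metadata", {}).get("name", ""),
        # Assumed field names, inferred from the `kubectl describe` output.
        "healthy": status.get("status") == "Succeeded",
        "success_count": status.get("successCount", 0),
        "failed_count": status.get("failedCount", 0),
    }
```

A caller would typically alert or open a ticket when healthy is False.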

Example 2:

Create a new healthcheck with namespace level bindings to specified serviceaccount and in a specified namespace:

kubectl create ns test

kubectl create -f https://raw.githubusercontent.com/keikoproj/active-monitor/master/examples/inlineHello_ns.yaml

OR with local source code:

kubectl create -f examples/inlineHello_ns.yaml

Then, list all healthchecks:

kubectl get healthcheck -n test OR kubectl get hc -n test

NAME                 LATEST STATUS   SUCCESS CNT     FAIL CNT    AGE
inline-hello-zz5vm   Succeeded        7               0          7m53s

View additional details/status of a healthcheck:

kubectl describe healthcheck inline-hello-zz5vm -n test

...
Status:
  Failed Count:              0
  Finished At:               2019-08-09T22:50:57Z
  Last Successful Workflow:  inline-hello-4mwxf
  Status:                    Succeeded
  Success Count:             13
Events:                      <none>

argo list -n test

NAME                 STATUS      AGE   DURATION   PRIORITY
inline-hello-88rh2   Succeeded   29s   7s         0
inline-hello-xpsf5   Succeeded   1m    8s         0
inline-hello-z8llk   Succeeded   2m    7s         0

Generates Resources

  • activemonitor.keikoproj.io/v1alpha1/HealthCheck
  • argoproj.io/v1alpha1/Workflow

Sample HealthCheck CR:

apiVersion: activemonitor.keikoproj.io/v1alpha1
kind: HealthCheck
metadata:
  generateName: dns-healthcheck-
  namespace: health
spec:
  repeatAfterSec: 60
  description: "Monitor pod dns connections"
  workflow:
    generateName: dns-workflow-
    resource:
      namespace: health
      serviceAccount: activemonitor-controller-sa
      source:
        inline: |
            apiVersion: argoproj.io/v1alpha1
            kind: Workflow
            spec:
              ttlSecondsAfterFinished: 60
              entrypoint: start
              templates:
              - name: start
                retryStrategy:
                  limit: 3
                container: 
                  image: tutum/dnsutils
                  command: [sh, -c]
                  args: ["nslookup www.google.com"]
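
When many similar checks are needed, a manifest with the same shape as the sample above can be generated programmatically. The sketch below only mirrors the fields shown in the sample; build_healthcheck is a hypothetical helper, not part of the project. It emits JSON, which kubectl accepts in place of YAML.

```python
import json
from typing import Any, Dict


def build_healthcheck(
    name_prefix: str,
    namespace: str,
    repeat_after_sec: int,
    service_account: str,
    workflow_source: str,
    description: str = "",
) -> Dict[str, Any]:
    """Build a HealthCheck manifest with the same structure as the sample CR."""
    return {
        "apiVersion": "activemonitor.keikoproj.io/v1alpha1",
        "kind": "HealthCheck",
        "metadata": {
            "generateName": f"{name_prefix}-healthcheck-",
            "namespace": namespace,
        },
        "spec": {
            "repeatAfterSec": repeat_after_sec,
            "description": description,
            "workflow": {
                "generateName": f"{name_prefix}-workflow-",
                "resource": {
                    "namespace": namespace,
                    "serviceAccount": service_account,
                    # The Argo Workflow is passed through as inline text.
                    "source": {"inline": workflow_source},
                },
            },
        },
    }


def to_json(manifest: Dict[str, Any]) -> str:
    """Serialize the manifest for `kubectl create -f -`."""
    return json.dumps(manifest, indent=2)
```

Because the manifest uses generateName, it has to be submitted with kubectl create, not kubectl apply.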

Sample HealthCheck CR with a RemedyWorkflow:

apiVersion: activemonitor.keikoproj.io/v1alpha1
kind: HealthCheck
metadata:
  generateName: fail-healthcheck-
  namespace: health
spec:
  repeatAfterSec: 60 # duration in seconds
  level: cluster
  workflow:
    generateName: fail-workflow-
    resource:
      namespace: health # workflow will be submitted in this ns
      serviceAccount: activemonitor-healthcheck-sa # workflow will be submitted using this acct
      source:
        inline: |
            apiVersion: argoproj.io/v1alpha1
            kind: Workflow
            metadata:
              labels:
                workflows.argoproj.io/controller-instanceid: activemonitor-workflows
            spec:
              ttlSecondsAfterFinished: 60
              entrypoint: start
              templates:
              - name: start
                retryStrategy:
                  limit: 1
                container: 
                  image: ravihari/ctrmemory:v2
                  command: ["python"]
                  args: ["promanalysis.py", "http://prometheus.system.svc.cluster.local:9090", "health", "memory-demo", "memory-demo-ctr", "95"]
  remedyworkflow:
    generateName: remedy-test-
    resource:
      namespace: health # workflow will be submitted in this ns
      serviceAccount: activemonitor-remedy-sa # workflow will be submitted using this acct
      source:
        inline: |
          apiVersion: argoproj.io/v1alpha1
          kind: Workflow
          spec:
            ttlSecondsAfterFinished: 60
            entrypoint: kubectl
            templates:
              - name: kubectl
                container:
                  image: "ravihari/kubectl:v1"
                  command: ["/bin/bash", "-c"]
                  args: ["kubectl delete po/memory-demo"]

Active-Monitor Architecture

Access Workflows on Argo UI

kubectl -n health port-forward deployment/argo-ui 8001:8001

Then visit: http://127.0.0.1:8001

Prometheus Metrics

The Active-Monitor controller also exports metrics in Prometheus format, which can be used for notifications and alerting.

Prometheus metrics are available on :8080/metrics

kubectl -n health port-forward deployment/activemonitor-controller 8080:8080

Then visit: http://localhost:8080/metrics

By default, Active-Monitor exports the following Prometheus metrics:

  • healthcheck_success_count - The total number of successful healthcheck resources
  • healthcheck_error_count - The total number of erred healthcheck resources
  • healthcheck_runtime_seconds - Time taken for the healthcheck's workflow to complete
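
For a quick check without a Prometheus server, the text served on /metrics can be parsed directly. The sketch below is a deliberately small parser for the Prometheus text exposition format that sums sample values per metric name. The label name healthcheck_name in the sample text is invented for illustration; the labels Active-Monitor actually attaches are not shown in this document.

```python
from typing import Dict


def sum_metric_samples(exposition_text: str) -> Dict[str, float]:
    """Sum sample values per metric name, ignoring labels and timestamps."""
    totals: Dict[str, float] = {}
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{labels} value [timestamp]
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1 :].split()
        else:
            # name value [timestamp]
            name, *fields = line.split()
        if not fields:
            continue  # malformed line without a value
        totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals


# Hypothetical scrape output; the label name is an assumption.
SAMPLE = """\
# HELP healthcheck_success_count The total number of successful healthcheck resources
healthcheck_success_count{healthcheck_name="inline-hello"} 7
healthcheck_error_count{healthcheck_name="inline-hello"} 0
healthcheck_runtime_seconds{healthcheck_name="inline-hello"} 7.5
"""

totals = sum_metric_samples(SAMPLE)
```

With the port-forward above in place, the input text could be fetched with urllib.request.urlopen("http://localhost:8080/metrics").read().decode().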

Active-Monitor also supports custom metrics. For this to work, your workflow should export a global parameter. The parameter will be programmatically available in the completed workflow object under: workflow.status.outputs.parameters.

The value of the global output parameter should look like this:

"{\"metrics\":
  [
    {\"name\": \"custom_total\", \"value\": 123, \"metrictype\": \"gauge\", \"help\": \"custom total\"},
    {\"name\": \"custom_metric\", \"value\": 12.3, \"metrictype\": \"gauge\", \"help\": \"custom metric\"}
  ]
}"
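
A workflow step written in Python could build this payload with the standard json module, which avoids escaping the quotes by hand. The sketch below is an assumption-laden illustration: encode_custom_metrics and decode_custom_metrics are hypothetical helpers, and the set of required keys is simply taken from the sample above.

```python
import json
from typing import Any, Dict, List

# Keys that appear in every entry of the sample payload.
REQUIRED_KEYS = {"name", "value", "metrictype", "help"}


def encode_custom_metrics(metrics: List[Dict[str, Any]]) -> str:
    """Serialize metric entries into the {"metrics": [...]} JSON layout."""
    for entry in metrics:
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"metric entry is missing keys: {sorted(missing)}")
    return json.dumps({"metrics": metrics})


def decode_custom_metrics(payload: str) -> List[Dict[str, Any]]:
    """Parse a payload in the same layout back into a list of entries."""
    return json.loads(payload)["metrics"]


payload = encode_custom_metrics([
    {"name": "custom_total", "value": 123, "metrictype": "gauge", "help": "custom total"},
    {"name": "custom_metric", "value": 12.3, "metrictype": "gauge", "help": "custom metric"},
])
```

The resulting string is what the step would write as the value of its global output parameter.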

Contributing

Please see CONTRIBUTING.md.

To add a new example of a healthcheck and/or workflow:

Release Process

Please see RELEASE.

License

The Apache 2 license is used in this project. Details can be found in the LICENSE file.

Other Keiko Projects

Instance Manager - Kube Forensics - Addon Manager - Upgrade Manager - Minion Manager - Governor
