
cloud-bulldozer / kraken

Licence: Apache-2.0 license
Chaos and resiliency testing tool for Kubernetes and OpenShift

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to kraken

xk6-chaos
xk6 extension for running chaos experiments with k6 💣
Stars: ✭ 18 (-88.82%)
Mutual labels:  reliability, chaos-engineering
Simmy
Simmy is a chaos-engineering and fault-injection tool, integrating with the Polly resilience project for .NET
Stars: ✭ 313 (+94.41%)
Mutual labels:  resiliency, chaos-engineering
Awesome Sre
A curated list of Site Reliability and Production Engineering resources.
Stars: ✭ 7,687 (+4674.53%)
Mutual labels:  scalability, reliability
Chaostoolkit
Chaos Engineering Experiments Automation & Orchestration
Stars: ✭ 1,204 (+647.83%)
Mutual labels:  resiliency, chaos-engineering
Howtheysre
A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
Stars: ✭ 6,962 (+4224.22%)
Mutual labels:  reliability, chaos-engineering
Cloud Design Patterns
Prescriptive Architecture Guidance for Cloud Applications
Stars: ✭ 484 (+200.62%)
Mutual labels:  scalability, resiliency
Kubeinvaders
Gamified Chaos Engineering Tool for Kubernetes
Stars: ✭ 673 (+318.01%)
Mutual labels:  openshift, chaos-engineering
oshinko-s2i
This is a place to put s2i images and utilities for spark application builders for openshift
Stars: ✭ 16 (-90.06%)
Mutual labels:  openshift
Performance-Engineers-DevOps
This repository helps performance testers and engineers who wants to dive into DevOps and SRE world.
Stars: ✭ 35 (-78.26%)
Mutual labels:  chaos-engineering
onix
A reactive configuration manager designed to support Infrastructure as a Code provisioning, and bi-directional configuration management providing a single source of truth across multi-cloud environments.
Stars: ✭ 89 (-44.72%)
Mutual labels:  openshift
deploy
Deploy Development Builds of Open Cluster Management (OCM) on RedHat Openshift Container Platform
Stars: ✭ 133 (-17.39%)
Mutual labels:  openshift
OpenCossan
OpenCossan is an open and free toolbox for uncertainty quantification and management.
Stars: ✭ 40 (-75.16%)
Mutual labels:  reliability
REGAL
Representation learning-based graph alignment based on implicit matrix factorization and structural embeddings
Stars: ✭ 78 (-51.55%)
Mutual labels:  scalability
pyg autoscale
Implementation of "GNNAutoScale: Scalable and Expressive Graph Neural Networks via Historical Embeddings" in PyTorch
Stars: ✭ 136 (-15.53%)
Mutual labels:  scalability
openshift-sync-plugin
Synchronizes OpenShift BuildConfig objects as Jenkins jobs and synchronizes job status into OpenShift Build objects
Stars: ✭ 16 (-90.06%)
Mutual labels:  openshift
openshift-update-graph
Visualize the OpenShift Update Graph
Stars: ✭ 20 (-87.58%)
Mutual labels:  openshift
aws-fis-templates-cdk
Collection of AWS Fault Injection Simulator (FIS) experiment templates deploy-able via the AWS CDK
Stars: ✭ 43 (-73.29%)
Mutual labels:  chaos-engineering
s2i-build
Github Action to build an OCI-compatible container image from source code.
Stars: ✭ 26 (-83.85%)
Mutual labels:  openshift
nebula
A distributed, fast open-source graph database featuring horizontal scalability and high availability
Stars: ✭ 8,196 (+4990.68%)
Mutual labels:  scalability
Proxy
The type-safe REST library for .NET Standard 2.0 (NetCoreStack Flying Proxy)
Stars: ✭ 40 (-75.16%)
Mutual labels:  scalability

Krkn aka Kraken


Krkn logo

Chaos and resiliency testing tool for Kubernetes and OpenShift. Kraken injects deliberate failures into Kubernetes/OpenShift clusters to check whether they are resilient to turbulent conditions.

Workflow

Kraken workflow

Demo

Kraken demo

Chaos Testing Guide

The guide covers:

  • Test methodology that needs to be embraced.
  • Best practices that an OpenShift cluster, the platform and the applications running on top of it should follow for the best user experience, performance, resilience and reliability.
  • Tooling.
  • Scenarios supported.
  • Test environment recommendations as to how and where to run chaos tests.
  • Chaos testing in practice.

The guide is hosted at https://redhat-chaos.github.io/krkn.

How to Get Started

Instructions on how to set up, configure and run Kraken can be found at Installation.

See the getting started doc for how to create your own custom scenario or edit the current scenarios for your specific usage.

After installation, refer to the sections below for the supported scenarios and how to tweak the Kraken config to load them on your cluster.

Running Kraken with minimal configuration tweaks

For cases where you want to run Kraken with minimal configuration changes, refer to Kraken-hub. One use case is CI integration where you do not want to carry around different configuration files for the scenarios.

Setting up infrastructure dependencies

Kraken indexes the metrics specified in the profile into Elasticsearch, in addition to leveraging Cerberus to understand the health of the Kubernetes/OpenShift cluster under test. More information on these features is documented below. The infrastructure pieces can be installed and uninstalled by running:

$ cd kraken
$ podman-compose up      # or: docker-compose up. Spins up the containers specified in the docker-compose.yml file present in the run directory.
$ podman-compose down    # or: docker-compose down. Deletes the installed containers.

This manages the Cerberus and Elasticsearch containers on the host on which you are running Kraken.

NOTE: Make sure the machine running the containers has enough resources (memory and disk), as Elasticsearch is resource intensive. Cerberus monitors the system components by default; its config can be tweaked to also monitor application namespaces, routes and other components. The command keeps running until killed, since detached mode is not supported as of now.

Config

Instructions on how to set up the config and the options supported can be found at Config.

Kubernetes/OpenShift chaos scenarios supported

Scenario type               | Kubernetes | OpenShift
Pod Scenarios               | ✔️         | ✔️
Container Scenarios         | ✔️         | ✔️
Node Scenarios              | ✔️         | ✔️
Time Scenarios              | ❌         | ✔️
Litmus Scenarios            | ❌         | ✔️
Cluster Shut Down Scenarios | ✔️         | ✔️
Namespace Scenarios         | ✔️         | ✔️
Zone Outage Scenarios       | ✔️         | ✔️
Application_outages         | ✔️         | ✔️
PVC scenario                | ✔️         | ✔️
Network_Chaos               | ✔️         | ✔️

Kraken scenario pass/fail criteria and report

It is important to check both that the targeted component recovered from the chaos injection and that the Kubernetes/OpenShift cluster as a whole is healthy, since failures in one component can have an adverse impact on other components. Kraken does this by:

  • Having built-in checks for pod and node based scenarios to ensure the expected number of replicas and nodes are up. It also supports running custom scripts with the checks.
  • Leveraging Cerberus to monitor the cluster under test and consuming its aggregated go/no-go signal to determine pass/fail after the chaos. It is highly recommended to turn on the Cerberus health check feature available in Kraken. Instructions on installing and setting up Cerberus can be found here, or it can be installed from Kraken using the instructions. Once Cerberus is up and running, set cerberus_enabled to True and cerberus_url to the URL where Cerberus publishes the go/no-go signal in the Kraken config file. Cerberus can also monitor application routes during the chaos and fail the run if it encounters downtime, since that would likely mean downtime in a customer's or user's environment as well. This is especially important during control plane chaos scenarios targeting the API server, Etcd, Ingress, etc. It can be enabled by setting check_applicaton_routes: True in the Kraken config, provided application routes are being monitored in the Cerberus config.
  • Leveraging the kube-burner alerting feature to fail the runs in case of critical alerts.
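
The go/no-go check described above can be sketched roughly as follows. This is a minimal illustration, not Kraken's actual implementation: the function names are hypothetical, and it assumes the URL set in cerberus_url returns a plain-text body of "True" or "False".

```python
import urllib.request
from typing import Callable


def fetch_signal(cerberus_url: str, timeout: float = 5.0) -> str:
    """Return the body published at cerberus_url (assumed to be 'True' or 'False')."""
    with urllib.request.urlopen(cerberus_url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def cluster_is_healthy(cerberus_url: str,
                       fetch: Callable[[str], str] = fetch_signal) -> bool:
    """Treat anything other than an explicit 'True' as a no-go."""
    try:
        return fetch(cerberus_url) == "True"
    except OSError:
        # In this sketch an unreachable Cerberus counts as a failed health check.
        return False


def post_chaos_verdict(scenario_recovered: bool, cerberus_url: str,
                       fetch: Callable[[str], str] = fetch_signal) -> int:
    """Combine the scenario's own recovery check with the cluster-wide signal.

    Returns a process exit code: 0 for pass, 1 for fail.
    """
    healthy = cluster_is_healthy(cerberus_url, fetch)
    return 0 if (scenario_recovered and healthy) else 1
```

Passing the fetch function as a parameter keeps the decision logic testable without a running Cerberus instance.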

Signaling

In CI runs or any external job, it is useful to stop Kraken once a certain test or state is reached. Kraken can be signaled to pause the chaos or stop it completely using a signal posted to a port of your choice.

For example, if a test run is loading the cluster while Kraken runs separately, you can start or stop the Kraken run based on when the test run completes or reaches a certain loaded state.

More detailed information on enabling and leveraging this feature can be found here.
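
To make the idea concrete, here is a small self-contained sketch of a listener that changes state when a signal is posted to a port, plus a client that posts to it. It is not Kraken's code: the state names RUN, PAUSE and STOP, the URL layout (POST /<STATE>) and all function names are assumptions made for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

VALID_STATES = {"RUN", "PAUSE", "STOP"}  # assumed state names


class SignalState:
    """Thread-safe holder for the current signal state."""

    def __init__(self, initial="RUN"):
        self._lock = threading.Lock()
        self._value = initial

    def get(self):
        with self._lock:
            return self._value

    def set(self, value):
        with self._lock:
            self._value = value


def make_handler(state):
    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            # The requested state is taken from the path, e.g. POST /PAUSE.
            requested = self.path.strip("/").upper()
            if requested in VALID_STATES:
                state.set(requested)
                self.send_response(200)
            else:
                self.send_response(400)
            self.end_headers()

        def log_message(self, *args):
            pass  # keep the sketch quiet

    return Handler


def start_listener(state, host="127.0.0.1", port=0):
    """Start the listener in a background thread; port=0 picks a free port."""
    server = HTTPServer((host, port), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_signal(host, port, new_state):
    """Post a signal from the external job; returns the HTTP status code."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("POST", "/" + new_state)
        return conn.getresponse().status
    finally:
        conn.close()
```

A chaos loop would then check state.get() between scenario iterations, sleeping while it reads PAUSE and exiting when it reads STOP.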

Performance monitoring

Monitoring the Kubernetes/OpenShift cluster to observe the impact of Kraken chaos scenarios on various components is key to finding bottlenecks. It is important to make sure the cluster is healthy in terms of both recovery and performance during and after the failure injection. Instructions on enabling it can be found here.

Scraping and storing metrics long term

Kraken supports capturing metrics for the duration of the scenarios defined in the config and indexes them into Elasticsearch so that the state of the runs can be stored and evaluated long term. The indexed metrics can be visualized with the help of Grafana. It uses Kube-burner under the hood. The metrics to capture need to be defined in a metrics profile, which Kraken consumes to query Prometheus (installed by default in OpenShift) with the start and end timestamps of the run. Information on enabling and leveraging this feature can be found here.
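
As a rough illustration of what querying Prometheus "with the start and end timestamps of the run" involves, the sketch below builds range-query URLs for the standard Prometheus HTTP API endpoint /api/v1/query_range. Kraken delegates this work to Kube-burner, so this is not its code path; the function names are hypothetical, and the profile keys query and metricName are assumed here for the example.

```python
from urllib.parse import urlencode


def build_range_query_url(prometheus_url, promql, start_ts, end_ts, step_s=30):
    """Build a Prometheus range-query URL covering one chaos run.

    start_ts and end_ts are Unix timestamps in seconds.
    """
    if end_ts < start_ts:
        raise ValueError("end_ts must not precede start_ts")
    params = urlencode({
        "query": promql,
        "start": int(start_ts),
        "end": int(end_ts),
        "step": f"{step_s}s",
    })
    return f"{prometheus_url.rstrip('/')}/api/v1/query_range?{params}"


def urls_for_profile(prometheus_url, metrics_profile, start_ts, end_ts):
    """Map each metric name in the profile to its range-query URL."""
    return {
        metric["metricName"]: build_range_query_url(
            prometheus_url, metric["query"], start_ts, end_ts)
        for metric in metrics_profile
    }
```

Each resulting URL can be fetched with any HTTP client, and the returned series can then be indexed into a long-term store.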

Alerts

In addition to checking the recovery and health of the cluster and components under test, Kraken takes in a profile of Prometheus expressions to validate and, depending on the severity set, either alerts or exits with a non-zero return code. This feature can be used to determine pass/fail or to alert on abnormalities observed in the cluster based on the metrics. Information on enabling and leveraging this feature can be found here.
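
The severity-based behavior can be sketched as below. This is an assumption-laden illustration rather than Kraken's implementation: the profile keys (expr, description, severity), the set of severities that fail the run, and the function names are all hypothetical, and run_query stands in for whatever executes an expression against Prometheus.

```python
from typing import Callable, Dict, List, Tuple

# Assumed for this sketch: only these severities turn into a failing exit code.
FAILING_SEVERITIES = {"error", "critical"}


def evaluate_alerts(profile: List[Dict[str, str]],
                    run_query: Callable[[str], list]) -> Tuple[int, List[Dict[str, str]]]:
    """Evaluate each expression and return (exit_code, fired_rules).

    A rule fires when its expression returns at least one result.
    """
    fired = [rule for rule in profile if run_query(rule["expr"])]
    for rule in fired:
        print(f"[{rule['severity']}] {rule['description']}")
    failing = any(rule["severity"] in FAILING_SEVERITIES for rule in fired)
    return (1 if failing else 0), fired
```

With this split, lower severities only report the abnormality, while the higher ones decide the pass/fail outcome of the run.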

Blogs and other useful resources

Roadmap

Following is a list of enhancements that we plan to add support for in Kraken. Any help or contributions are greatly appreciated.

Contributions

We are always looking for enhancements and fixes to make Kraken better, and any contributions are most welcome. Feel free to report or work on the issues filed on GitHub.

More information on how to Contribute

If you add a new scenario or tweak the main config, be sure to update the CI as well so that it stays up to date. Please read this file for more information on updates.

Community

Key members (Slack username/full name): paigerube14/Paige Rubendall, mffiedler/Mike Fiedler, ravielluri/Naga Ravi Chaitanya Elluri.
