Crashd - Crash Diagnostics

Crash Diagnostics (Crashd) is a tool that helps human operators easily interact with and collect information from infrastructure running on Kubernetes, for tasks such as automated diagnosis and troubleshooting.

Crashd Features

  • Crashd uses the Starlark language, a Python dialect, to express and invoke automation functions
  • Easily automate interaction with infrastructures running Kubernetes
  • Interact with and capture information from compute resources such as machines (via SSH)
  • Automatically execute commands on compute nodes to capture results
  • Capture objects and cluster logs from the Kubernetes API server
  • Easily extract data from Cluster-API managed clusters

How Does it Work?

Crashd executes script files, written in Starlark, that interact with a specified infrastructure and its cluster resources. Script files contain predefined Starlark functions that interact with, and collect diagnostics and other information from, the servers in the cluster.

For details on the design of Crashd, see the Google Doc design document here.

Installation

There are two ways to get started with Crashd. Either download a pre-built binary or pull down the code and build it locally.

Download binary

  1. Download the latest binary release for your platform
  2. Extract tarball from release
    tar -xvf <RELEASE_TARBALL_NAME>.tar.gz
    
  3. Move the binary to your operating system's PATH
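
Taken together, the steps above amount to something like the following (a sketch; the exact release asset name varies by platform, and /usr/local/bin is just one common choice of install directory):

```shell
# extract the release tarball downloaded for your platform
tar -xvf <RELEASE_TARBALL_NAME>.tar.gz

# move the extracted crashd binary onto your PATH
# (assumes the tarball contains a single binary named crashd)
sudo mv crashd /usr/local/bin/
```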

Compiling from source

Crashd is written in Go and requires Go version 1.11 or later. Clone the source from its repo, or download it to a local directory. From the project's root directory, compile the code with the following:

GO111MODULE=on go build -o crashd .

Or, you can run a versioned build using the build.go source code:

go run .ci/build/build.go

Build amd64/darwin OK: .build/amd64/darwin/crashd
Build amd64/linux OK: .build/amd64/linux/crashd

Getting Started

A Crashd script consists of a collection of Starlark functions stored in a file. For instance, the following script (saved as diagnostics.crsh) collects system information from a list of provided hosts using SSH. The collected data is then bundled into a tar.gz file at the end:

# Crashd global config
crshd = crashd_config(workdir="{0}/crashd".format(os.home))

# Enumerate compute resources 
# Define a host list provider with configured SSH
hosts=resources(
    provider=host_list_provider(
        hosts=["170.10.20.30", "170.40.50.60"], 
        ssh_config=ssh_config(
            username=os.username,
            private_key_path="{0}/.ssh/id_rsa".format(os.home),
        ),
    ),
)

# collect data from hosts
capture(cmd="sudo df -i", resources=hosts)
capture(cmd="sudo crictl info", resources=hosts)
capture(cmd="df -h /var/lib/containerd", resources=hosts)
capture(cmd="sudo systemctl status kubelet", resources=hosts)
capture(cmd="sudo systemctl status containerd", resources=hosts)
capture(cmd="sudo journalctl -xeu kubelet", resources=hosts)

# archive collected data
archive(output_file="diagnostics.tar.gz", source_paths=[crshd.workdir])

The previous code snippet connects to the two hosts specified in the host_list_provider, executes each command remotely over SSH, and captures and stores the results.

See the complete list of supported functions here.

Running the script

To run the script, do the following:

$> crashd run diagnostics.crsh 

If you want to output debug information, use the --debug flag as shown:

$> crashd run --debug diagnostics.crsh

DEBU[0000] creating working directory /home/user/crashd
DEBU[0000] run: executing command on 2 resources
DEBU[0000] run: executing command on localhost using ssh: [sudo df -i]
DEBU[0000] ssh.run: /usr/bin/ssh -q -o StrictHostKeyChecking=no -i /home/user/.ssh/id_rsa -p 22  user@localhost "sudo df -i"
DEBU[0001] run: executing command on 170.10.20.30 using ssh: [sudo df -i]
...

Compute Resource Providers

Crashd uses the concept of a provider to enumerate compute resources. Each provider implementation is responsible for enumerating the compute resources on which Crashd can execute commands using a transport (i.e. SSH). Crashd comes with several providers, including:

  • Host List Provider - uses an explicit list of host addresses (see previous example)
  • Kubernetes Nodes Provider - extracts host information from Kubernetes API node objects
  • CAPV Provider - uses Cluster-API to discover machines in a vSphere cluster
  • CAPA Provider - uses Cluster-API to discover machines running on AWS
  • More providers coming!

Accessing script parameters

Crashd scripts can access external values that can be used as script parameters.

Environment variables

Crashd scripts can access environment variables at runtime using the os.getenv method:

kube_capture(what="logs", namespaces=[os.getenv("KUBE_DEFAULT_NS")])
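
For example, the variable can be supplied on the command line when launching the script (assuming the script above is saved as diagnostics.crsh):

```shell
# set the variable for this invocation only; in the script,
# os.getenv("KUBE_DEFAULT_NS") will then return "kube-system"
KUBE_DEFAULT_NS=kube-system crashd run diagnostics.crsh
```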

Command-line arguments

Scripts can also access command-line arguments passed as key/value pairs via the --args or --args-file flags. For instance, when the following command is used to start a script:

$ crashd run --args="kube_ns=kube-system, username=$(whoami)" diagnostics.crsh

Values from --args can be accessed as shown below:

kube_capture(what="logs", namespaces=["default", args.kube_ns])
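
The --args-file flag works the same way, reading the key/value pairs from a file instead. As a sketch (assuming the file holds one key=value pair per line, mirroring the --args format):

```shell
# args.txt (assumed format: one key=value pair per line)
cat > args.txt <<'EOF'
kube_ns=kube-system
username=admin
EOF

# the script can then read args.kube_ns and args.username as before
crashd run --args-file=args.txt diagnostics.crsh
```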

More Examples

SSH Connection via a jump host

The SSH configuration function can be configured with a jump user and jump host. This is useful for providers that require a proxy host for the SSH connection, as shown in the following example:

ssh=ssh_config(username=os.username, jump_user=args.jump_user, jump_host=args.jump_host)
hosts=host_list_provider(hosts=["some.host", "172.100.100.20"], ssh_config=ssh)
...

Connecting to Kubernetes nodes with SSH

The following uses the kube_nodes_provider to connect to Kubernetes nodes and execute remote commands against those nodes using SSH:

# SSH configuration
ssh=ssh_config(
    username=os.username,
    private_key_path="{0}/.ssh/id_rsa".format(os.home),
    port=args.ssh_port,
    max_retries=5,
)

# enumerate nodes as compute resources
nodes=resources(
    provider=kube_nodes_provider(
        kube_config=kube_config(path=args.kubecfg),
        ssh_config=ssh,
    ),
)

# exec `uptime` command on each node
uptimes = run(cmd="uptime", resources=nodes)

# print `run` result from first node
print(uptimes[0].result)
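
Because run returns one result per enumerated resource, a short Starlark loop can print the output from every node rather than just the first (a sketch; only the result field shown above is assumed):

```
# iterate over all per-node results returned by `run`
for uptime in uptimes:
    print(uptime.result)
```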

Retrieving Kubernetes API objects and logs

The kube_capture function is used, in the following example, to connect to a Kubernetes API server to retrieve Kubernetes API objects and logs. The retrieved data is then saved to the filesystem as shown below:

nspaces=[
    "capi-kubeadm-bootstrap-system",
    "capi-kubeadm-control-plane-system",
    "capi-system",
    "capi-webhook-system",
    "cert-manager",
    "tkg-system",
]

conf=kube_config(path=args.kubecfg)

# capture Kubernetes API object and store in files
kube_capture(what="logs", namespaces=nspaces, kube_config=conf)
kube_capture(what="objects", kinds=["services", "pods"], namespaces=nspaces, kube_config=conf)
kube_capture(what="objects", kinds=["deployments", "replicasets"], namespaces=nspaces, kube_config=conf)

Interacting with Cluster-API managed machines running on vSphere (CAPV)

As mentioned, Crashd provides the capv_provider, which allows scripts to interact with Cluster-API managed clusters running on vSphere infrastructure (CAPV). The following abbreviated snippet of a Crashd script retrieves diagnostic information from the machines of a CAPV-managed management cluster:

# enumerates management cluster nodes
nodes = resources(
    provider=capv_provider(
        ssh_config=ssh_config(username="capv", private_key_path=args.private_key),
        kube_config=kube_config(path=args.mc_config)
    )
)

# execute and capture commands output from management nodes
capture(cmd="sudo df -i", resources=nodes)
capture(cmd="sudo crictl info", resources=nodes)
capture(cmd="sudo cat /var/log/cloud-init-output.log", resources=nodes)
capture(cmd="sudo cat /var/log/cloud-init.log", resources=nodes)
...

The previous snippet interacts with management cluster machines. The provider can also be configured to enumerate workload machines (by specifying the name of a workload cluster), as shown in the following example:

# enumerates workload cluster nodes
nodes = resources(
    provider=capv_provider(
        workload_cluster=args.cluster_name,
        ssh_config=ssh_config(username="capv", private_key_path=args.private_key),
        kube_config=kube_config(path=args.mc_config)
    )
)

# execute and capture commands output from workload nodes
capture(cmd="sudo df -i", resources=nodes)
capture(cmd="sudo crictl info", resources=nodes)
...

All Examples

See all script examples in the ./examples directory.

Roadmap

This project has numerous possibilities ahead of it. Read about our evolving roadmap here.

Contributing

New contributors will need to sign a CLA (contributor license agreement). Details are described in our contributing documentation.

License

This project is available under the Apache License, Version 2.0
