KUBERNETES

Kubernetes Artifactory Triage

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Performs a triage on the Open Source version of Artifactory in a Kubernetes cluster.

Tasks:

Check Artifactory Liveness and Readiness Endpoints in `NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Raises Issues

Kubernetes Labeled Pod Count

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This codebundle counts the running pods that match a provided set of labels.

Tasks:

Measure Number of Running Pods with Label in `${NAMESPACE}`

Source Code
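
As a rough illustration of the measurement described above, the sketch below counts pods in the Running phase from a pod list in the JSON shape that `kubectl get pods -l <labels> -o json` produces. It is not the codebundle's actual implementation; the function name `count_running_pods` and the sample data are hypothetical.

```python
import json


def count_running_pods(pod_list_json: str) -> int:
    """Count pods whose status.phase is Running in a pod list JSON document."""
    pod_list = json.loads(pod_list_json)
    return sum(
        1
        for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("phase") == "Running"
    )


# Hypothetical sample standing in for the output of:
#   kubectl get pods -n <namespace> -l app=web -o json
SAMPLE = json.dumps(
    {
        "items": [
            {"metadata": {"name": "web-1"}, "status": {"phase": "Running"}},
            {"metadata": {"name": "web-2"}, "status": {"phase": "Pending"}},
            {"metadata": {"name": "web-3"}, "status": {"phase": "Running"}},
        ]
    }
)

print(count_running_pods(SAMPLE))  # prints 2
```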

K8s Jaeger Query

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This taskset queries the Jaeger API directly for trace details and parses the results.

Tasks:

Query Traces in Jaeger for Unhealthy HTTP Response Codes in Namespace `NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Kubernetes Workload Chaos Engineering

5 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Provides chaos injection tasks for specific workloads like your apps in a Kubernetes namespace. These are destructive tasks and the expectation is that you can heal these changes by enabling your GitOps reconciliation.

Tasks:

Test `WORKLOAD_NAME` High Availability in Namespace `NAMESPACE`
OOMKill `WORKLOAD_NAME` Pod
Mangle Service Selector For `WORKLOAD_NAME` in `NAMESPACE`
Mangle Service Port For `WORKLOAD_NAME` in `NAMESPACE`
Fill Tmp Directory Of Pod From `WORKLOAD_NAME`

Source Code

Kubernetes Tail Application Logs

2 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Performs application-level troubleshooting by inspecting the logs of a workload for parsable exceptions, and attempts to determine next steps.

Tasks:

Get `CONTAINER_NAME` Application Logs in Namespace `NAMESPACE`
Tail `CONTAINER_NAME` Application Logs For Stacktraces

Source Code

Kubernetes Tail Application Logs

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Measures the number of exception stacktraces present in an application's logs over a time period.

Tasks:

Tail `${CONTAINER_NAME}` Application Logs For Stacktraces

Source Code
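
One way to picture the stacktrace measurement above is to count lines that look like the start of a stacktrace. The sketch below assumes only two markers (a Python traceback header and a Java "Exception in thread" line); the real codebundle may recognize other formats, and the name `count_stacktraces` is hypothetical.

```python
import re

# Assumed stacktrace openers; extend this pattern for other languages.
STACKTRACE_START = re.compile(
    r"^(Traceback \(most recent call last\):|Exception in thread )"
)


def count_stacktraces(log_text: str) -> int:
    """Count log lines that open a stacktrace."""
    return sum(1 for line in log_text.splitlines() if STACKTRACE_START.match(line))


# Hypothetical log excerpt, as it might come from `kubectl logs`.
SAMPLE_LOG = "\n".join(
    [
        "INFO starting",
        "Traceback (most recent call last):",
        '  File "app.py", line 3, in <module>',
        "ValueError: bad input",
        "INFO retrying",
        'Exception in thread "main" java.lang.NullPointerException',
        "\tat com.example.Main.main(Main.java:5)",
    ]
)

print(count_stacktraces(SAMPLE_LOG))  # prints 2
```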

Kubernetes ArgoCD HelmRelease TaskSet

2 Troubleshooting Commands

Contributed by nmadhok

Codecollection: rw-cli-codecollection

This codebundle runs a series of tasks to identify potential helm release issues related to ArgoCD managed Helm objects.

Tasks:

Fetch all available ArgoCD Helm releases in namespace `NAMESPACE`
Fetch Installed ArgoCD Helm release versions in namespace `NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Kubernetes Redis Healthcheck

2 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

This taskset collects information on your redis workload in your Kubernetes cluster and raises issues if any health checks fail.

Tasks:

Ping `DEPLOYMENT_NAME` Redis Workload
Common scenarios that might relate to this command or script:
1. Troubleshooting a Kubernetes CrashLoopBackoff event for a specific Redis deployment to see if the server is running properly and responding to commands.
2. Performing routine health checks on Redis deployments within the Kubernetes cluster to ensure that the servers are operational and responsive.
3. Checking the status of the Redis server after a recent deployment or upgrade to ensure that it is functioning as expected within the Kubernetes environment.
4. Verifying the status of the Redis server in response to user-reported issues or errors related to data storage or retrieval.
5. Investigating performance or latency issues within the Kubernetes cluster by inspecting the responsiveness of the Redis servers using the redis-cli PING command.
Verify `DEPLOYMENT_NAME` Redis Read Write Operation in Kubernetes
Common scenarios that might relate to this command or script:
1. Troubleshooting application performance issues related to Redis in a Kubernetes environment.
2. Investigating and resolving connectivity issues between a Kubernetes deployment and the Redis database.
3. Monitoring and diagnosing potential data inconsistencies or corruption in the Redis database within a Kubernetes cluster.
4. Analyzing and troubleshooting CrashLoopBackoff events related to the Redis deployment in Kubernetes.
5. Providing support for developers by retrieving specific key values from the Redis database within a Kubernetes environment for debugging purposes.

Source Code

Discoverable

Troubleshooting CheatSheet

Raises Issues
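
The two Redis checks above (a PING and a read/write round trip) can be sketched as follows. This is a minimal illustration, not the taskset's implementation: the `Runner` type, the probe key name, and the in-memory fake runner are all hypothetical. In practice a runner would wrap something like `kubectl exec` into the deployment and invoke `redis-cli` with the given arguments.

```python
from typing import Callable, Dict, List

# A runner takes redis-cli arguments and returns the command's stdout.
Runner = Callable[[List[str]], str]


def ping_ok(run: Runner) -> bool:
    """True when the server answers PING with PONG."""
    return run(["PING"]).strip() == "PONG"


def read_write_ok(run: Runner, key: str = "healthcheck:probe", value: str = "ok") -> bool:
    """Write a probe key, read it back, and confirm the value round-trips."""
    if run(["SET", key, value]).strip() != "OK":
        return False
    return run(["GET", key]).strip() == value


def make_fake_runner() -> Runner:
    """In-memory stand-in for redis-cli so the sketch runs without a cluster."""
    store: Dict[str, str] = {}

    def run(args: List[str]) -> str:
        command = args[0].upper()
        if command == "PING":
            return "PONG\n"
        if command == "SET":
            store[args[1]] = args[2]
            return "OK\n"
        if command == "GET":
            return store.get(args[1], "") + "\n"
        raise ValueError(f"unsupported command: {command}")

    return run


runner = make_fake_runner()
print(ping_ok(runner), read_write_ok(runner))  # prints True True
```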

Kubernetes Cluster Resource Health

3 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Identify resource constraints or issues in a cluster.

Tasks:

Identify High Utilization Nodes for Cluster `CONTEXT`
Identify Pods Causing High Node Utilization in Cluster `CONTEXT`
Identify Pods with Resource Limits Exceeding Node Capacity in Cluster `CONTEXT`

Source Code

Discoverable

Troubleshooting CheatSheet

Kubernetes Cluster Resource Health

3 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Counts the number of nodes above 90% CPU or Memory Utilization from kubectl top.

Tasks:

Identify High Utilization Nodes for Cluster `${CONTEXT}`
Identify Pods with Resource Limits Exceeding Node Capacity in Cluster `${CONTEXT}`
Generate Cluster Resource Health Score

Source Code
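
To make the 90% threshold concrete, the sketch below parses text in the column layout that `kubectl top nodes` prints and counts nodes whose CPU% or MEMORY% is above the threshold. It is an illustration under stated assumptions, not the codebundle's code; the function name and sample output are hypothetical.

```python
def count_high_utilization_nodes(top_output: str, threshold: int = 90) -> int:
    """Count nodes whose CPU% or MEMORY% column exceeds the threshold."""
    count = 0
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        cpu_pct, mem_pct = fields[2], fields[4]
        if not (cpu_pct.endswith("%") and mem_pct.endswith("%")):
            continue  # metrics unavailable for this node
        if int(cpu_pct[:-1]) > threshold or int(mem_pct[:-1]) > threshold:
            count += 1
    return count


# Hypothetical sample standing in for the output of: kubectl top nodes
SAMPLE_TOP = """\
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-a   1900m        95%    4096Mi          52%
node-b   300m         15%    7300Mi          93%
node-c   250m         12%    2048Mi          26%
"""

print(count_high_utilization_nodes(SAMPLE_TOP))  # prints 2
```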

Kubernetes GitOps GitHub Remediation

4 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Provides a list of tasks that can remediate configuration issues with manifests in GitHub-based GitOps repositories.

Tasks:

Remediate Readiness and Liveness Probe GitOps Manifests in Namespace `NAMESPACE`
Increase ResourceQuota Limit for Namespace `NAMESPACE` in GitHub GitOps Repository
Adjust Pod Resources to Match VPA Recommendation in `NAMESPACE`
Expand Persistent Volume Claims in Namespace `NAMESPACE`

Source Code

Discoverable

Kubernetes Restart resource

3 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

This taskset restarts a resource with a given set of labels, typically used with other tasksets.

Tasks:

Get Current Resource State with Labels `LABELS`
Get Resource Logs with Labels `LABELS`
Restart Resource with Labels `LABELS` in `CONTEXT`

Source Code

Troubleshooting CheatSheet

Kubernetes DaemonSet Triage

8 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Triages issues related to a DaemonSet and its pods, including node scheduling and resource constraints.

Tasks:

Analyze Application Log Patterns for DaemonSet `DAEMONSET_NAME` in Namespace `NAMESPACE`
Detect Log Anomalies for DaemonSet `DAEMONSET_NAME` in Namespace `NAMESPACE`
Check Liveness Probe Configuration for DaemonSet `DAEMONSET_NAME`
Check Readiness Probe Configuration for DaemonSet `DAEMONSET_NAME` in Namespace `NAMESPACE`
Inspect DaemonSet Warning Events for `DAEMONSET_NAME` in Namespace `NAMESPACE`
Fetch DaemonSet Workload Details For `DAEMONSET_NAME` in Namespace `NAMESPACE`
Inspect DaemonSet Status for `DAEMONSET_NAME` in namespace `NAMESPACE`
Check Node Affinity and Tolerations for DaemonSet `DAEMONSET_NAME` in Namespace `NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Kubernetes Jenkins Healthcheck

2 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

This taskset collects information about persistent volumes and persistent volume claims to validate health or help troubleshoot potential issues.

Tasks:

Query The Jenkins Kubernetes Workload HTTP Endpoint in Kubernetes StatefulSet `STATEFULSET_NAME`
Query For Stuck Jenkins Jobs in Kubernetes Statefulset Workload `STATEFULSET_NAME`

Source Code

Discoverable

Troubleshooting CheatSheet

Raises Issues

Kubernetes Image Check

4 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

This taskset provides detailed information about the images used in a Kubernetes namespace.

Tasks:

Check Image Rollover Times for Namespace `NAMESPACE`
List Images and Tags for Every Container in Running Pods for Namespace `NAMESPACE`
List Images and Tags for Every Container in Failed Pods for Namespace `NAMESPACE`
List ImagePullBackOff Events and Test Path and Tags for Namespace `NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Kubernetes Postgres Healthcheck

8 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Runs a series of tasks to check the overall health of a postgres cluster and to provide detailed information useful for debugging or reviewing configurations.

Tasks:

List Resources Related to Postgres Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
Get Postgres Pod Logs & Events for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
Get Postgres Pod Resource Utilization for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
Get Running Postgres Configuration for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
Get Patroni Output and Add to Report for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
Fetch Patroni Database Lag for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
Check Database Backup Status for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
Run DB Queries for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Kubernetes Postgres Healthcheck

3 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Runs multiple Kubernetes and psql commands to report on the health of a postgres cluster. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Checks for database lag & backup health.

Tasks:

Check Patroni Database Lag in Namespace `${NAMESPACE}` on Host `${HOSTNAME}` using `patronictl`
Check Database Backup Status for Cluster `${OBJECT_NAME}` in Namespace `${NAMESPACE}`
Generate Namespace Score for Namespace `${NAMESPACE}`

Source Code

Kubernetes Deployment Operations

7 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Perform operational tasks for a Kubernetes deployment.

Tasks:

Restart Deployment `DEPLOYMENT_NAME` in Namespace `NAMESPACE`
Force Delete Pods in Deployment `DEPLOYMENT_NAME` in Namespace `NAMESPACE`
Rollback Deployment `DEPLOYMENT_NAME` in Namespace `NAMESPACE` to Previous Version
Scale Down Deployment `DEPLOYMENT_NAME` in Namespace `NAMESPACE`
Scale Up Deployment `DEPLOYMENT_NAME` in Namespace `NAMESPACE` by SCALE_UP_FACTORx
Clean Up Stale ReplicaSets for Deployment `DEPLOYMENT_NAME` in Namespace `NAMESPACE`
Scale Down Stale ReplicaSets for Deployment `DEPLOYMENT_NAME` in Namespace `NAMESPACE`

Source Code

Discoverable

Kubernetes Cluster Node Health

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Evaluate cluster node health using kubectl

Tasks:

Check for Node Restarts in Cluster `CONTEXT` within Interval `INTERVAL`

Source Code

Discoverable

Troubleshooting CheatSheet

Kubernetes Cluster Node Health

2 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Evaluate cluster node health using kubectl.

Tasks:

Check for Node Restarts in Cluster `${CONTEXT}`
Generate Namespace Score in Kubernetes Cluster `${CONTEXT}`

Source Code

Azure AKS Triage

4 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Runs diagnostic checks against an AKS cluster.

Tasks:

Check for Resource Health Issues Affecting AKS Cluster `AKS_CLUSTER` In Resource Group `AZ_RESOURCE_GROUP`
Check Configuration Health of AKS Cluster `AKS_CLUSTER` In Resource Group `AZ_RESOURCE_GROUP`
Check Network Configuration of AKS Cluster `AKS_CLUSTER` In Resource Group `AZ_RESOURCE_GROUP`
Fetch Activities for AKS Cluster `AKS_CLUSTER` In Resource Group `AZ_RESOURCE_GROUP`

Source Code

Discoverable

Troubleshooting CheatSheet

Kubernetes Flux Chaos Testing

5 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This taskset is used to suspend a flux resource for the purposes of executing chaos tasks.

Tasks:

Suspend the Flux Resource Reconciliation for `FLUX_RESOURCE_NAME` in namespace `FLUX_RESOURCE_NAMESPACE`
Select Random FluxCD Workload for Chaos Target in Namespace `FLUX_RESOURCE_NAMESPACE`
Execute Chaos Command on `TARGET_RESOURCE` in Namespace `TARGET_NAMESPACE`
Execute Additional Chaos Command on FLUX_RESOURCE_TYPE `FLUX_RESOURCE_NAME` in namespace `FLUX_RESOURCE_NAMESPACE`
Resume Flux Resource Reconciliation in `TARGET_NAMESPACE`

Source Code

Kubernetes Fluxcd Reconciliation Report

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Generates a report of the reconciliation errors for fluxcd in your cluster.

Tasks:

Check FluxCD Reconciliation Health in Kubernetes Namespace `FLUX_NAMESPACE`

Source Code

Discoverable

Kubernetes Fluxcd Reconciliation Monitor

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Measures failing reconciliations for fluxcd

Tasks:

Health Check Flux Reconciliation

Source Code

Kubernetes Namespace Inspection

9 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This taskset runs general troubleshooting checks against all applicable objects in a namespace. Looks for warning events, odd or frequent normal events, restarting containers and failed or pending pods.

Tasks:

Inspect Warning Events in Namespace `NAMESPACE`
Inspect Container Restarts In Namespace `NAMESPACE`
Inspect Pending Pods In Namespace `NAMESPACE`
Inspect Failed Pods In Namespace `NAMESPACE`
Inspect Workload Status Conditions In Namespace `NAMESPACE`
Get Listing Of Resources In Namespace `NAMESPACE`
Check Event Anomalies in Namespace `NAMESPACE`
Check Missing or Risky PodDisruptionBudget Policies in Namespace `NAMESPACE`
Check Resource Quota Utilization in Namespace `NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Kubernetes Namespace Healthcheck

4 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This SLI uses kubectl to score namespace health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, events, and pods not ready.

Tasks:

Get Error Event Count within ${EVENT_AGE} and calculate Score
Get Container Restarts and Score in Namespace `${NAMESPACE}`
Get NotReady Pods in `${NAMESPACE}`
Generate Namespace Score in `${NAMESPACE}`

Source Code

Kubernetes FluxCD Kustomization TaskSet

3 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This codebundle runs a series of tasks to identify potential Kustomization issues related to Flux managed Kustomization objects.

Tasks:

List All FluxCD Kustomization objects in Namespace `NAMESPACE` in Cluster `CONTEXT`
List Suspended FluxCD Kustomization objects in Namespace `NAMESPACE` in Cluster `CONTEXT`
List Unready FluxCD Kustomizations in Namespace `NAMESPACE` in Cluster `CONTEXT`

Source Code

Discoverable

Troubleshooting CheatSheet

Kubernetes FluxCD Kustomization Health

3 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This codebundle checks for unhealthy or suspended FluxCD Kustomization objects.

Tasks:

List Suspended FluxCD Kustomization objects in Namespace `${NAMESPACE}` in Cluster `${CONTEXT}`
List Unready FluxCD Kustomizations in Namespace `${NAMESPACE}` in Cluster `${CONTEXT}`
Generate FluxCD Kustomization Health Score for Namespace `${NAMESPACE}` in Cluster `${CONTEXT}`

Source Code

Kubernetes Ingress GCE & GCP HTTP Load Balancer Healthcheck

5 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Troubleshoot GCE Ingress Resources related to GCP HTTP Load Balancer in GKE

Tasks:

Search For GCE Ingress Warnings in GKE Context `CONTEXT`
Identify Unhealthy GCE HTTP Ingress Backends in GKE Namespace `NAMESPACE`
Validate GCP HTTP Load Balancer Configurations in GCP Project `GCP_PROJECT_ID`
Fetch Network Error Logs from GCP Operations Manager for Ingress Backends in GCP Project `GCP_PROJECT_ID`
Review GCP Operations Logging Dashboard in GCP project `GCP_PROJECT_ID`

Source Code

Troubleshooting CheatSheet

Raises Issues

Kubernetes Service Account Check

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This taskset provides tasks to troubleshoot service accounts in a Kubernetes namespace.

Tasks:

Test Service Account Access to Kubernetes API Server in Namespace `NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Raises Issues

Azure Internal LoadBalancer Triage

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Triages issues related to an Azure Load Balancer and its activity logs.

Tasks:

Check Activity Logs for Azure Load Balancer `AZ_LB_NAME`

Source Code

Discoverable

Raises Issues

Kubernetes StatefulSet Triage

8 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Triages issues related to a StatefulSet and its pods, including persistent volumes and ordered deployment characteristics.

Tasks:

Analyze Application Log Patterns for StatefulSet `STATEFULSET_NAME` in Namespace `NAMESPACE`
Detect Log Anomalies for StatefulSet `STATEFULSET_NAME` in Namespace `NAMESPACE`
Check Liveness Probe Configuration for StatefulSet `STATEFULSET_NAME`
Check Readiness Probe Configuration for StatefulSet `STATEFULSET_NAME` in Namespace `NAMESPACE`
Inspect StatefulSet Warning Events for `STATEFULSET_NAME` in Namespace `NAMESPACE`
Fetch StatefulSet Workload Details For `STATEFULSET_NAME` in Namespace `NAMESPACE`
Inspect StatefulSet Replicas for `STATEFULSET_NAME` in namespace `NAMESPACE`
Check StatefulSet PersistentVolumeClaims for `STATEFULSET_NAME` in Namespace `NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Kubernetes Persistent Volume Healthcheck

6 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This taskset collects information about storage such as PersistentVolumes and PersistentVolumeClaims to validate health or help troubleshoot potential storage issues.

Tasks:

Fetch Events for Unhealthy Kubernetes PersistentVolumeClaims in Namespace `NAMESPACE`
List PersistentVolumeClaims in Terminating State in Namespace `NAMESPACE`
List PersistentVolumes in Terminating State in Namespace `NAMESPACE`
List Pods with Attached Volumes and Related PersistentVolume Details in Namespace `NAMESPACE`
Fetch the Storage Utilization for PVC Mounts in Namespace `NAMESPACE`
Check for RWO Persistent Volume Node Attachment Issues in Namespace `NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Raises Issues

Kubernetes Persistent Volume Healthcheck

2 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This SLI collects information about storage such as PersistentVolumes and PersistentVolumeClaims and generates an aggregated health score for the namespace. 1 = Healthy, 0 = Failed, >0 <1 = Degraded

Tasks:

Fetch the Storage Utilization for PVC Mounts in Namespace `${NAMESPACE}`
Generate Namespace Score for Namespace `${NAMESPACE}`

Source Code
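
The scoring convention above (1 = Healthy, 0 = Failed, anything in between = Degraded) can be illustrated with a small aggregation sketch. It assumes the aggregate is the unweighted mean of binary sub-checks, which the listing does not confirm; the function names and check names are hypothetical.

```python
from typing import Mapping


def aggregate_score(checks: Mapping[str, bool]) -> float:
    """Average binary sub-checks into a score between 0 and 1 (assumed weighting)."""
    if not checks:
        raise ValueError("at least one check is required")
    return sum(1 for passed in checks.values() if passed) / len(checks)


def classify(score: float) -> str:
    """Map a score onto the Healthy / Degraded / Failed labels."""
    if score == 1:
        return "Healthy"
    if score == 0:
        return "Failed"
    return "Degraded"


# Hypothetical sub-check results for one namespace.
checks = {"pvc_utilization_ok": True, "no_terminating_pvcs": False}
score = aggregate_score(checks)
print(score, classify(score))  # prints 0.5 Degraded
```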

Kubernetes Istio System Health

7 Troubleshooting Commands

Contributed by Nbarola

Codecollection: rw-cli-codecollection

Checks Istio proxy sidecar injection status, high memory and CPU usage, warnings and errors in logs, certificate validity, and configuration, and verifies the Istio installation.

Tasks:

Verify Istio Sidecar Injection for Cluster `CONTEXT`
Check Istio Sidecar Resource Usage for Cluster `CONTEXT`
Validate Istio Installation in Cluster `CONTEXT`
Check Istio Controlplane Logs For Errors in Cluster `CONTEXT`
Fetch Istio Proxy Logs in Cluster `CONTEXT`
Verify Istio SSL Certificates in Cluster `CONTEXT`
Check Istio Configuration Health in Cluster `CONTEXT`

Source Code

Discoverable

Kubernetes Istio System Health

8 Troubleshooting Commands

Contributed by Nbarola

Codecollection: rw-cli-codecollection

Checks Istio proxy sidecar injection status, high memory and CPU usage, warnings and errors in logs, certificate validity, and configuration, and verifies the Istio installation.

Tasks:

Verify Istio Sidecar Injection for Cluster `${CONTEXT}`
Check Istio Sidecar Resource Usage for Cluster `${CONTEXT}`
Validate Istio Installation in Cluster `${CONTEXT}`
Check Istio Controlplane Logs For Errors in Cluster `${CONTEXT}`
Fetch Istio Proxy Logs in Cluster `${CONTEXT}`
Verify Istio SSL Certificates in Cluster `${CONTEXT}`
Check Istio Configuration Health in Cluster `${CONTEXT}`
Generate Health Score for Cluster `${CONTEXT}`

Source Code

Kubernetes Pod Resources Health

4 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Inspects the resources provisioned for a given set of pods and raises issues or recommendations as necessary.

Tasks:

Show Pods Without Resource Limit or Resource Requests Set in Namespace `NAMESPACE`
Check Pod Resource Utilization with Top in Namespace `NAMESPACE`
Identify VPA Pod Resource Recommendations in Namespace `NAMESPACE`
Identify Overutilized Pods in Namespace `NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Raises Issues

Kubernetes Deployment Triage

9 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Triages issues related to a deployment and its replicas.

Tasks:

Analyze Application Log Patterns for Deployment `DEPLOYMENT_NAME` in Namespace `NAMESPACE`
Detect Log Anomalies for Deployment `DEPLOYMENT_NAME` in Namespace `NAMESPACE`
Perform Comprehensive Log Analysis for Deployment `DEPLOYMENT_NAME` in Namespace `NAMESPACE`
Fetch Deployment Logs for `DEPLOYMENT_NAME` in Namespace `NAMESPACE`
Check Liveness Probe Configuration for Deployment `DEPLOYMENT_NAME`
Check Readiness Probe Configuration for Deployment `DEPLOYMENT_NAME` in Namespace `NAMESPACE`
Inspect Deployment Warning Events for `DEPLOYMENT_NAME` in Namespace `NAMESPACE`
Check Deployment Replica Status for `DEPLOYMENT_NAME` in Namespace `NAMESPACE`
Inspect Container Restarts for Deployment `DEPLOYMENT_NAME` in Namespace `NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Kubernetes Deployment Healthcheck

6 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This SLI uses kubectl to score deployment health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, critical log errors, pods not ready, deployment status, and recent events.

Tasks:

Get Container Restarts and Score for Deployment `${DEPLOYMENT_NAME}`
Get Critical Log Errors and Score for Deployment `${DEPLOYMENT_NAME}`
Get NotReady Pods Score for Deployment `${DEPLOYMENT_NAME}`
Get Deployment Replica Status and Score for `${DEPLOYMENT_NAME}`
Get Recent Warning Events Score for `${DEPLOYMENT_NAME}`
Generate Deployment Health Score for `${DEPLOYMENT_NAME}`

Source Code

Kubernetes Namespace Chaos Engineering

5 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Provides chaos injection tasks for Kubernetes namespaces. These are destructive tasks and the expectation is that you can heal these changes by enabling your GitOps reconciliation.

Tasks:

Kill Random Pods In Namespace `NAMESPACE`
OOMKill Pods In Namespace `NAMESPACE`
Mangle Service Selector In Namespace `NAMESPACE`
Mangle Service Port In Namespace `NAMESPACE`
Fill Random Pod Tmp Directory In Namespace `NAMESPACE`

Source Code

Kubernetes Application Troubleshoot

3 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Performs application-level troubleshooting by inspecting the logs of a workload for parsable exceptions, and attempts to determine next steps.

Tasks:

Get `CONTAINER_NAME` Application Logs from Workload `WORKLOAD_NAME` in Namespace `NAMESPACE`
Scan `CONTAINER_NAME` Application For Misconfigured Environment
Tail `CONTAINER_NAME` Application Logs For Stacktraces in Workload `WORKLOAD_NAME`

Source Code

Discoverable

Troubleshooting CheatSheet

Kubernetes Application Monitor

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Measures the number of exception stacktraces present in an application's logs over a time period.

Tasks:

Measure Application Exceptions in `${NAMESPACE}`

Source Code

Kubernetes ArgoCD Application Health & Troubleshoot

5 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This taskset collects information and runs general troubleshooting checks against argocd application objects within a namespace.

Tasks:

Fetch ArgoCD Application Sync Status & Health for `APPLICATION`
Fetch ArgoCD Application Last Sync Operation Details for `APPLICATION`
Fetch Unhealthy ArgoCD Application Resources for `APPLICATION`
Scan For Errors in Pod Logs Related to ArgoCD Application `APPLICATION`
Fully Describe ArgoCD Application `APPLICATION`

Source Code

Discoverable

Troubleshooting CheatSheet

Kubeprometheus Operator Troubleshoot

5 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

This taskset investigates the logs, state and health of Kubernetes Prometheus operator.

Tasks:

Check Prometheus Service Monitors in namespace `NAMESPACE`
Check For Successful Rule Setup in Kubernetes Namespace `NAMESPACE`
Verify Prometheus RBAC Can Access ServiceMonitors in Namespace `PROM_NAMESPACE`
Inspect Prometheus Operator Logs for Scraping Errors in Namespace `NAMESPACE`
Check Prometheus API Healthy in Namespace `PROM_NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Raises Issues

Kubernetes Ingress Healthcheck

2 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Triages issues related to ingress objects and services.

Tasks:

Fetch Ingress Object Health in Namespace `NAMESPACE`
Check for Ingress and Service Conflicts in Namespace `NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Kubernetes cert-manager Healthcheck

3 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Checks the overall health of certificates in a namespace that are managed by cert-manager.

Tasks:

Get Namespace Certificate Summary for Namespace `NAMESPACE`
Find Unhealthy Certificates in Namespace `NAMESPACE`
Find Failed Certificate Requests and Identify Issues for Namespace `NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Raises Issues

Kubernetes cert-manager Healthcheck

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Counts the number of unhealthy cert-manager managed certificates in a namespace.

Tasks:

Count Unready and Expired Certificates in Namespace `${NAMESPACE}`

Source Code

Kubernetes Vault Triage

9 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

A suite of tasks that can be used to triage potential issues in your vault namespace.

Tasks:

Fetch Vault CSI Driver Logs in Namespace `NAMESPACE`
Get Vault CSI Driver Warning Events in `NAMESPACE`
Check Vault CSI Driver Replicas
Common scenarios that might relate to this command or script:
1. Troubleshooting Kubernetes CrashLoopBackoff events for the "vault-csi-provider" daemonset.
2. Investigating why the "vault-csi-provider" daemonset is not running or experiencing errors in a specific context and namespace.
3. Monitoring and understanding the resource utilization and scheduling of the "vault-csi-provider" daemonset in the Kubernetes cluster.
4. Debugging issues related to the deployment and scaling of the "vault-csi-provider" daemonset.
5. Verifying the configuration and settings of the "vault-csi-provider" daemonset to ensure it meets the desired specifications and requirements.
Fetch Vault Pod Workload Logs in Namespace `NAMESPACE` with Labels `LABELS`
Get Related Vault Events in Namespace `NAMESPACE`
Fetch Vault StatefulSet Manifest Details in `NAMESPACE`
Fetch Vault DaemonSet Manifest Details in Kubernetes Cluster `NAMESPACE`
Verify Vault Availability in Namespace `NAMESPACE` and Context `CONTEXT`
Check Vault StatefulSet Replicas in `NAMESPACE`

Source Code

Troubleshooting CheatSheet

Raises Issues

Kubernetes FluxCD HelmRelease TaskSet

5 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This codebundle runs a series of tasks to identify potential helm release issues related to Flux managed Helm objects.

Tasks:

List all available FluxCD Helmreleases in Namespace `NAMESPACE`
Fetch Installed FluxCD Helmrelease Versions in Namespace `NAMESPACE`
Fetch Mismatched FluxCD HelmRelease Version in Namespace `NAMESPACE`
Fetch FluxCD HelmRelease Error Messages in Namespace `NAMESPACE`
Check for Available Helm Chart Updates in Namespace `NAMESPACE`

Source Code

Discoverable

Troubleshooting CheatSheet

Raises Issues

Kubernetes Application Log Health

10 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Analyzes logs from Kubernetes Application Logs fetched through kubectl

Tasks:

Scan WORKLOAD_TYPE `WORKLOAD_NAME` Logs for Errors in Namespace `NAMESPACE`
Scan WORKLOAD_TYPE `WORKLOAD_NAME` Logs for Stack Traces in Namespace `NAMESPACE`
Scan WORKLOAD_TYPE `WORKLOAD_NAME` Logs for Connection Failures in Namespace `NAMESPACE`
Scan WORKLOAD_TYPE `WORKLOAD_NAME` Logs for Timeout Errors in Namespace `NAMESPACE`
Scan WORKLOAD_TYPE `WORKLOAD_NAME` Logs for Authentication and Authorization Failures in Namespace `NAMESPACE`
Scan WORKLOAD_TYPE `WORKLOAD_NAME` Logs for Null Pointer and Unhandled Exceptions in Namespace `NAMESPACE`
Scan WORKLOAD_TYPE `WORKLOAD_NAME` for Log Anomalies in Namespace `NAMESPACE`
Scan WORKLOAD_TYPE `WORKLOAD_NAME` Logs for Application Restarts and Failures in Namespace `NAMESPACE`
Scan WORKLOAD_TYPE `WORKLOAD_NAME` Logs for Memory and CPU Resource Warnings in Namespace `NAMESPACE`
Scan WORKLOAD_TYPE `WORKLOAD_NAME` Logs for Service Dependency Failures in Namespace `NAMESPACE`

Source Code

Discoverable

Kubernetes Application Log Health

11 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Checks for issues in Kubernetes application logs fetched through kubectl. Returns 1 when healthy and 0 when unhealthy.

Tasks:

Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Errors in Namespace `${NAMESPACE}`
Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Stack Traces in Namespace `${NAMESPACE}`
Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Connection Failures in Namespace `${NAMESPACE}`
Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Timeout Errors in Namespace `${NAMESPACE}`
Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Authentication and Authorization Failures in Namespace `${NAMESPACE}`
Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Null Pointer and Unhandled Exceptions in Namespace `${NAMESPACE}`
Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` for Log Anomalies in Namespace `${NAMESPACE}`
Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Application Restarts and Failures in Namespace `${NAMESPACE}`
Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Memory and CPU Resource Warnings in Namespace `${NAMESPACE}`
Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Service Dependency Failures in Namespace `${NAMESPACE}`
Generate Application Gateway Health Score

Source Code

K8s OpenTelemetry Collector Health

3 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

This taskset performs diagnostic checks on an OpenTelemetry Collector to ensure it's pushing metrics.

Tasks:

Query Collector Queued Spans in Namespace `NAMESPACE`
Check OpenTelemetry Collector Logs For Errors In Namespace `NAMESPACE`
Query OpenTelemetry Logs For Dropped Spans In Namespace `NAMESPACE`

Source Code

Discoverable

Kubernetes Namespace Troubleshoot

5 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

This taskset runs general troubleshooting checks against all applicable objects in a namespace, checks error events, and searches pod logs for error entries.

Tasks:

Trace Namespace Errors
Fetch Unready Pods
Triage Namespace
Object Condition Check
Namespace Get All

Source Code

Kubernetes Namespace Healthcheck

4 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection

This SLI uses kubectl to score namespace health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, events, and pods that are not ready.

Tasks:

Get Event Count and Score
Get Container Restarts and Score
Get NotReady Pods
Generate Namspace Score

Source Code

Kubernetes Decomission Workload

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Searches a namespace for matching objects and provides the commands to decommission them.

Tasks:

Generate Decomission Commands

Source Code

Kubernetes Event Query

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Returns the number of events with matching messages as an SLI metric.
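As a rough illustration of the counting step described above, the sketch below tallies events whose message matches a regular expression. It assumes the events have already been fetched as JSON (for example with `kubectl get events -o json`, which returns an `items` list whose entries carry a `message` field). The function name `count_matching_events` and the sample data are hypothetical and not taken from the codebundle.

```python
import re


def count_matching_events(events: dict, pattern: str) -> int:
    """Count events whose message matches the given regular expression.

    `events` is assumed to be the parsed JSON of an event list, i.e. a dict
    with an "items" list. The function name is hypothetical.
    """
    regex = re.compile(pattern)
    return sum(
        1
        for item in events.get("items", [])
        if regex.search(item.get("message") or "")
    )


# Illustrative sample data, not real cluster output.
sample = {
    "items": [
        {"reason": "BackOff", "message": "Back-off restarting failed container"},
        {"reason": "Pulled", "message": "Successfully pulled image"},
        {"reason": "Failed", "message": "Failed to pull image"},
    ]
}

print(count_matching_events(sample, r"(?i)failed"))
```

The resulting count is the kind of single numeric value that could be pushed as an SLI metric.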

Tasks:

Get Number Of Matching Events

Source Code

Kubernetes Daemonset Health Check

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Checks that the current state of a daemonset is healthy and returns a score of either 1 (healthy) or 0 (unhealthy).
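One plausible way to reduce a DaemonSet's state to a 1 or 0 score is sketched below. It relies on the standard DaemonSet status fields `desiredNumberScheduled`, `numberReady`, and `numberUnavailable`; the exact rule the codebundle applies is not stated here, so the comparison and the function name `daemonset_health` are assumptions.

```python
def daemonset_health(status: dict) -> int:
    """Return 1 if the DaemonSet looks healthy, otherwise 0.

    `status` is assumed to be the `.status` object of a DaemonSet, e.g. from
    `kubectl get daemonset NAME -o json`. The health rule is an assumption:
    every desired pod must be ready and none may be unavailable.
    """
    desired = status.get("desiredNumberScheduled", 0)
    ready = status.get("numberReady", 0)
    unavailable = status.get("numberUnavailable", 0)
    return 1 if ready == desired and unavailable == 0 else 0


# Illustrative status objects, not real cluster output.
print(daemonset_health({"desiredNumberScheduled": 3, "numberReady": 3}))
print(daemonset_health({"desiredNumberScheduled": 3, "numberReady": 2, "numberUnavailable": 1}))
```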

Tasks:

Health Check Daemonset

Source Code

Kubernetes Triage Deployment Replicas

3 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Triages issues related to a deployment's replicas.

Tasks:

Fetch Logs
Get Related Events
Check Deployment Replicas

Source Code

Kubernetes Top

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Retrieves aggregate data via the kubectl top command.

Tasks:

Running Kubectl Top And Extracting Metric Data

Source Code

Kubernetes Troubleshoot Deployment

4 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

A taskset for troubleshooting general issues associated with typical Kubernetes Deployment resources. Supports API interactions via both the API client and the kubectl binary through RunWhen Shell Services.

Tasks:

Troubleshoot Resourcing
Troubleshoot Events
Troubleshoot PVC
Troubleshoot Pods

Source Code

Kubernetes Patroni Health Check

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Uses kubectl (or equivalent) to query the state of a Patroni cluster and determine if it's healthy.

Tasks:

Determine Patroni Health

Source Code

Kubernetes Run Shell Command

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

This codebundle runs an arbitrary kubectl command and writes the stdout to a report. Typically used in conjunction with other codebundles.

Tasks:

Running Kubectl And Adding Stdout To Report

Source Code

Kubernetes Triage StatefulSet

4 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

A taskset for troubleshooting issues for StatefulSets and their related resources.

Tasks:

Check StatefulSets Replicas Ready
Get Events For The StatefulSet
Get StatefulSet Logs
Get StatefulSet Manifests Dump

Source Code

Kubernetes PostgreSQL Triage

7 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection

Runs multiple Kubernetes and psql commands to report on the health of a PostgreSQL cluster.

Tasks:

Get Standard Resources
Describe Custom Resources
Get Pod Logs & Events
Get Pod Resource Utilization
Get Running Configuration
Get Patroni Output
Run DB Queries

Source Code

Kubernetes API Server Health

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Check the health of a Kubernetes API server using kubectl. Returns 1 when OK, or a 0 in the case of an unhealthy API server.

Tasks:

Running Kubectl Check Against API Server

Source Code

Kubernetes Synthetic PVC Test

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Creates an ad hoc one-shot job that mounts a PVC as a canary test. The job is polled for success before being torn down.

Tasks:

Run Canary Job

Source Code

Kubernetes Triage Patroni

3 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Taskset to triage issues related to Patroni.

Tasks:

Get Patroni Status
Get Pods Status
Fetch Logs

Source Code

Kubernetes Patroni Lag Health

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Detects laggy Patroni cluster members that are unable to catch up in replication and reinitializes them using kubectl and patronictl.

Tasks:

Determine Patroni Health

Source Code

Troubleshooting CheatSheet

Kubernetes Patroni Lag Metric

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Measures the maximum replica lag across a Patroni cluster.
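A minimal sketch of the measurement, assuming the member list is available as JSON in the shape produced by `patronictl list -f json`, where replicas carry a `Lag in MB` field and the leader does not. The function name `max_replica_lag_mb` and the sample payload are hypothetical; non-numeric lag values (such as an unknown lag) are skipped as a simplifying assumption.

```python
import json


def max_replica_lag_mb(patronictl_json: str) -> float:
    """Return the largest numeric "Lag in MB" across all cluster members.

    Members without a numeric lag value (typically the leader) are ignored.
    Returns 0.0 when no member reports a lag. The name is hypothetical.
    """
    members = json.loads(patronictl_json)
    lags = [
        float(member["Lag in MB"])
        for member in members
        if isinstance(member.get("Lag in MB"), (int, float))
    ]
    return max(lags, default=0.0)


# Illustrative payload, not real patronictl output.
sample = json.dumps(
    [
        {"Member": "db-0", "Role": "Leader", "State": "running"},
        {"Member": "db-1", "Role": "Replica", "State": "running", "Lag in MB": 0},
        {"Member": "db-2", "Role": "Replica", "State": "running", "Lag in MB": 16},
    ]
)

print(max_replica_lag_mb(sample))
```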

Tasks:

Measure Patroni Member Lag

Source Code

Kubernetes Workload Metric

1 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection

This codebundle runs a kubectl get command that produces a value and pushes it as a metric. Uses JMESPath for filtering and supports calculations such as count, sum, and avg on specified fields.

Tasks:

Running Kubectl get and push the metric

Source Code

Kubernetes PostgreSQL Query

1 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection

Runs a PostgreSQL query and pushes the returned result into a report. During execution, the SQL query is passed to a Kubernetes workload that has access to the psql binary. The workload runs the query and returns the results from stdout.

Tasks:

Run Postgres Query And Results to Report

Source Code

Cortex Metrics Ingester Health

1 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection

Uses kubectl to query the state of an ingester ring. Returns JSON containing each ingester's ID, status, and timestamp.

Tasks:

Fetch Ingestor Ring Member List and Status

Source Code

Troubleshooting CheatSheet

Cortex Metrics Ingester Health

1 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection

Uses kubectl to query the state of an ingester ring and determine if it's healthy. Returns 1 if healthy, 0 if unhealthy.
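The reduction from ring membership to a single score might look like the sketch below. It assumes the ring has already been fetched and parsed into a list of members with `id`, `status`, and `timestamp` keys, and that a ring counts as healthy only when every member reports `ACTIVE`. Both the key names and the rule are assumptions for illustration, as is the function name `ring_health`.

```python
def ring_health(members: list) -> int:
    """Return 1 if every ring member is ACTIVE, otherwise 0.

    `members` is assumed to be a list of dicts with "id", "status", and
    "timestamp" keys. An empty ring is treated as unhealthy (an assumption).
    """
    if not members:
        return 0
    return 1 if all(m.get("status") == "ACTIVE" for m in members) else 0


# Illustrative ring data, not real Cortex output.
healthy_ring = [
    {"id": "ingester-0", "status": "ACTIVE", "timestamp": "2024-01-01T00:00:00Z"},
    {"id": "ingester-1", "status": "ACTIVE", "timestamp": "2024-01-01T00:00:05Z"},
]
degraded_ring = healthy_ring + [
    {"id": "ingester-2", "status": "LEAVING", "timestamp": "2024-01-01T00:00:07Z"},
]

print(ring_health(healthy_ring), ring_health(degraded_ring))
```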

Tasks:

Determine Cortex Ingester Ring Health

Source Code

Kubernetes CLI Command

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-generic-codecollection

Runs an ad hoc user-provided command. If the command outputs a non-empty string to stdout, an issue is generated with a configurable title and content. User commands should filter out expected/healthy content (e.g., with grep) and output only the errors found.
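The "non-empty stdout means an issue" convention can be sketched as follows. This is not the codebundle's implementation: the function name `issue_from_command`, the shape of the returned dict, and the choice to treat whitespace-only output as empty are all assumptions made for illustration. The example commands use a POSIX shell with `printf` and `grep` rather than kubectl so that the sketch runs without a cluster.

```python
import subprocess


def issue_from_command(command: str, title: str):
    """Run a shell command and turn any stdout output into an issue.

    Returns None when stdout is empty (nothing to report), otherwise a dict
    with the configured title and the captured output. Hypothetical helper.
    """
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    output = result.stdout.strip()  # assumption: whitespace-only counts as empty
    if not output:
        return None
    return {"title": title, "details": output}


# grep filters out healthy lines, so only errors reach stdout.
found = issue_from_command(
    "printf 'ok\\nERROR disk full\\n' | grep ERROR", "Errors found in output"
)
clean = issue_from_command("printf 'ok\\n' | grep ERROR", "Errors found in output")

print(found)
print(clean)
```

Filtering inside the command keeps the decision rule trivial: any surviving output is treated as a finding.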

Tasks:

TASK_TITLE

Source Code

Metric from Kubernetes CLI Command

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-generic-codecollection

Runs an ad hoc user-provided command. If the command outputs a non-empty string to stdout, a health score of 0 (unhealthy) is pushed. If there is no output, indicating no issues, a 1 is pushed. User commands should filter out expected/healthy content (e.g., with grep) and output only the errors found.

Tasks:

${TASK_TITLE}

Source Code

Kubernetes CLI Command

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-generic-codecollection

This taskset runs a user-provided kubectl command and adds the output to the report. Command-line tools like jq are available.

Tasks:

TASK_TITLE

Source Code

Metric from Kubernetes CLI Command

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-generic-codecollection

This taskset runs a user-provided kubectl command and pushes the result as a metric. The supplied command must produce a single, distinct metric value. Command-line tools like jq are available.

Tasks:

${TASK_TITLE}

Source Code