GKE

4 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection


A taskset for troubleshooting general issues associated with typical Kubernetes deployment resources. Supports API interactions via both the API client and the kubectl binary through RunWhen Shell Services.

Tasks:
  • Troubleshoot Resourcing
  • Troubleshoot Events
  • Troubleshoot PVC
  • Troubleshoot Pods

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection


This codebundle runs an arbitrary kubectl command and writes the stdout to a report. Typically used in conjunction with other codebundles.

Tasks:
  • Running Kubectl And Adding Stdout To Report

5 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection


Uses PromQL on the Ops Suite API to determine the health of a Kong-managed ingress resource and pushes the result as an SLI metric. Produces a 1 for a healthy resource, or 0 for an unhealthy resource.

Tasks:
  • Get Access Token
  • Get HTTP Error Rate
  • Get Upstream Health
  • Get Request Latency Rate
  • Generate Kong Ingress Score

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection


Check the health of a Kubernetes API server using kubectl. Returns 1 when OK, or a 0 in the case of an unhealthy API server.

Tasks:
  • Running Kubectl Check Against API Server
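
A minimal sketch of how a 1/0 API server check of this kind could be scripted. It assumes the `/readyz` endpoint is queried with `kubectl get --raw`; the function name `score_api_server` and the injectable `run` parameter are illustrative and not taken from the codebundle.

```python
import subprocess


def score_api_server(run=subprocess.run) -> int:
    """Return 1 if the API server reports ready, otherwise 0."""
    try:
        # `kubectl get --raw /readyz` prints "ok" when the API server is ready.
        result = run(
            ["kubectl", "get", "--raw", "/readyz"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # kubectl is missing or the call hung: treat as unhealthy.
        return 0
    healthy = result.returncode == 0 and result.stdout.strip() == "ok"
    return 1 if healthy else 0
```

Passing `run` as a parameter keeps the scoring logic testable without a live cluster.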

1 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection


This codebundle runs a kubectl get command that produces a value and pushes the metric. Uses JMESPath for filtering and supports calculations such as count, sum, and avg on specified fields.

Tasks:
  • Running Kubectl get and push the metric
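
The count, sum, and avg calculations can be illustrated with a short sketch. It operates on the JSON that `kubectl get ... -o json` returns and uses a simple dotted-path lookup as a stand-in for a JMESPath expression. The helper names `dig` and `metric_from_kubectl_json` are hypothetical and do not come from the codebundle.

```python
import json
from statistics import mean

# Supported calculations, keyed by the name used in the description above.
AGGREGATIONS = {"count": len, "sum": sum, "avg": mean}


def dig(obj, dotted_path):
    """Follow a dotted path such as 'status.readyReplicas'; None if a key is absent."""
    for key in dotted_path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def metric_from_kubectl_json(raw_json, dotted_path, how):
    """Aggregate one numeric field across the items of `kubectl get -o json` output."""
    items = json.loads(raw_json).get("items", [])
    values = [v for v in (dig(item, dotted_path) for item in items) if v is not None]
    if not values:
        return 0
    return AGGREGATIONS[how](values)


sample = json.dumps({"items": [
    {"status": {"readyReplicas": 2}},
    {"status": {"readyReplicas": 3}},
    {"status": {}},  # field absent: skipped
]})
print(metric_from_kubectl_json(sample, "status.readyReplicas", "sum"))  # 5
```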

7 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection


Runs multiple Kubernetes and psql commands to report on the health of a postgres cluster.

Tasks:
  • Get Standard Resources
  • Describe Custom Resources
  • Get Pod Logs & Events
  • Get Pod Resource Utilization
  • Get Running Configuration
  • Get Patroni Output
  • Run DB Queries

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection


Creates an ad hoc one-shot job that mounts a PVC as a canary test. The job is polled for success before being torn down.

Tasks:
  • Run Canary Job

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection


Retrieves aggregate data via the kubectl top command.

Tasks:
  • Running Kubectl Top And Extracting Metric Data

5 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection


This taskset runs general troubleshooting checks against all applicable objects in a namespace, checks error events, and searches pod logs for error entries.

Tasks:
  • Trace Namespace Errors
  • Fetch Unready Pods
  • Triage Namespace
  • Object Condition Check
  • Namespace Get All

4 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection


This SLI uses kubectl to score namespace health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, events, and pods not ready.

Tasks:
  • Get Event Count and Score
  • Get Container Restarts and Score
  • Get NotReady Pods
  • Generate Namespace Score
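
One way a 0-to-1 namespace score could be combined from the three checks is sketched below. The equal weighting, the thresholds, and the function name `namespace_score` are assumptions for illustration; the codebundle's actual scoring may differ.

```python
def namespace_score(event_count, restart_count, not_ready_pods,
                    event_threshold=0, restart_threshold=0):
    """Average three pass/fail checks into a score between 0 and 1.

    Thresholds and equal weighting are illustrative assumptions.
    """
    checks = [
        event_count <= event_threshold,      # warning events within tolerance
        restart_count <= restart_threshold,  # container restarts within tolerance
        not_ready_pods == 0,                 # every pod is ready
    ]
    return sum(checks) / len(checks)


print(namespace_score(0, 0, 0))  # 1.0
print(namespace_score(5, 3, 1))  # 0.0
```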

8 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection


Uses PromQL on the Ops Suite API to determine the health of a MongoDB database instance and pushes the result as an SLI metric. Produces a 1 for a healthy resource, or 0 for an unhealthy resource.

Tasks:
  • Get Access Token
  • Get Instance Status
  • Get Connection Utilization Rate
  • Get MongoDB Member State Health
  • Get MongoDB Replication Lag
  • Get MongoDB Queue Size
  • Get Assertion Rate
  • Generate MongoDB Score

1 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection


Runs a postgres SQL query and pushes the returned result into a report. During execution, the SQL query should be passed to a Kubernetes workload that has access to the psql binary. The workload will run the query and return the results from stdout.

Tasks:
  • Run Postgres Query And Results to Report

3 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection


Triages issues related to a deployment's replicas.

Tasks:
  • Fetch Logs
  • Get Related Events
  • Check Deployment Replicas
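
A replica check of this kind can be sketched from the output of `kubectl get deployment <name> -o json`, comparing `spec.replicas` with `status.readyReplicas`. The function name `replica_shortfall` is hypothetical, and this is only an approximation of what the task may do.

```python
import json


def replica_shortfall(raw_json):
    """Return how many desired replicas are not yet ready (0 means fully ready)."""
    deployment = json.loads(raw_json)
    # spec.replicas defaults to 1 when unset; readyReplicas is omitted when zero.
    desired = deployment.get("spec", {}).get("replicas", 1)
    ready = deployment.get("status", {}).get("readyReplicas", 0)
    return max(desired - ready, 0)


sample = json.dumps({"spec": {"replicas": 3}, "status": {"readyReplicas": 1}})
print(replica_shortfall(sample))  # 2
```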

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection


Checks that the current state of a daemonset is healthy and returns a score of either 1 (healthy) or 0 (unhealthy).

Tasks:
  • Health Check Daemonset
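
The 1/0 daemonset score could be derived from the `status` block of `kubectl get daemonset <name> -o json`, as in the sketch below. The health criteria (all desired pods ready, none unavailable) and the name `daemonset_score` are assumptions, not the codebundle's confirmed logic.

```python
import json


def daemonset_score(raw_json):
    """Return 1 when every desired daemonset pod is ready, otherwise 0."""
    status = json.loads(raw_json).get("status", {})
    desired = status.get("desiredNumberScheduled")
    if desired is None:
        return 0  # no status reported yet: treat as unhealthy
    ready = status.get("numberReady", 0)
    unavailable = status.get("numberUnavailable", 0)
    return 1 if ready == desired and unavailable == 0 else 0


sample = json.dumps({"status": {"desiredNumberScheduled": 3, "numberReady": 3}})
print(daemonset_score(sample))  # 1
```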

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection


Searches a namespace for matching objects and provides the commands to decommission them.

Tasks:
  • Generate Decommission Commands

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection


Uses kubectl (or equivalent) to query the state of a Patroni cluster and determine if it's healthy.

Tasks:
  • Determine Patroni Health

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection


Detects and reinitializes laggy Patroni cluster members that are unable to catch up in replication, using kubectl and patronictl.

Tasks:
  • Determine Patroni Health

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection


Measures the maximum replica lag across a Patroni cluster.

Tasks:
  • Measure Patroni Member Lag
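
As a rough sketch, the maximum lag could be computed from the JSON member list that `patronictl list -f json` prints. The `"Lag in MB"` key is an assumption about that output and may differ between Patroni versions, so it is exposed as a parameter. The function name `max_replica_lag` is hypothetical.

```python
import json


def max_replica_lag(members_json, lag_key="Lag in MB"):
    """Return the largest numeric lag reported by any cluster member (0 if none)."""
    members = json.loads(members_json)
    # Leaders carry no lag value and some members report "unknown"; skip both.
    lags = [m[lag_key] for m in members if isinstance(m.get(lag_key), (int, float))]
    return max(lags, default=0)


sample = json.dumps([
    {"Member": "db-0", "Role": "Leader"},
    {"Member": "db-1", "Role": "Replica", "Lag in MB": 0},
    {"Member": "db-2", "Role": "Replica", "Lag in MB": 16},
    {"Member": "db-3", "Role": "Replica", "Lag in MB": "unknown"},
])
print(max_replica_lag(sample))  # 16
```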

3 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection


Taskset to triage issues related to Patroni.

Tasks:
  • Get Patroni Status
  • Get Pods Status
  • Fetch Logs

4 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection


A taskset for troubleshooting issues for StatefulSets and their related resources.

Tasks:
  • Check StatefulSets Replicas Ready
  • Get Events For The StatefulSet
  • Get StatefulSet Logs
  • Get StatefulSet Manifests Dump

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection


Returns the number of events with matching messages as an SLI metric.

Tasks:
  • Get Number Of Matching Events
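
Counting events by message can be sketched against the output of `kubectl get events -o json`. The regular-expression matching and the function name `count_matching_events` are illustrative assumptions; the codebundle may match messages differently.

```python
import json
import re


def count_matching_events(raw_json, pattern):
    """Count events whose message matches the given regular expression."""
    regex = re.compile(pattern)
    events = json.loads(raw_json).get("items", [])
    return sum(1 for event in events if regex.search(event.get("message", "")))


sample = json.dumps({"items": [
    {"message": "Back-off restarting failed container"},
    {"message": "Successfully pulled image"},
    {"message": "Back-off pulling image"},
]})
print(count_matching_events(sample, r"Back-off"))  # 2
```

This counts event objects, not the per-event `count` field, so a repeated event is counted once.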

2 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


This codebundle runs a series of tasks to identify potential issues with Flux-managed Kustomization objects.

Tasks:
  • List all available Kustomization objects in Namespace `NAMESPACE`
  • Get details for unready Kustomizations in Namespace `NAMESPACE`

5 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


This taskset collects information and runs general troubleshooting checks against ArgoCD Application objects within a namespace.

Tasks:
  • Fetch ArgoCD Application Sync Status & Health for `APPLICATION`
  • Fetch ArgoCD Application Last Sync Operation Details for `APPLICATION`
  • Fetch Unhealthy ArgoCD Application Resources for `APPLICATION`
  • Scan For Errors in Pod Logs Related to ArgoCD Application `APPLICATION`
  • Fully Describe ArgoCD Application `APPLICATION`

4 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection


This taskset provides detailed information about the images used in a Kubernetes namespace.

Tasks:
  • Check Image Rollover Times for Namespace `NAMESPACE`
  • List Images and Tags for Every Container in Running Pods for Namespace `NAMESPACE`
  • List Images and Tags for Every Container in Failed Pods for Namespace `NAMESPACE`
  • List ImagePullBackOff Events and Test Path and Tags for Namespace `NAMESPACE`

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection


Performs a triage on the Open Source version of Artifactory in a Kubernetes cluster.

Tasks:
  • Check Artifactory Liveness and Readiness Endpoints
    Common scenarios that might relate to this command or script:
    1. Monitoring and troubleshooting the health and performance of a Kubernetes statefulset hosting an artifactory API endpoint. 2. Investigating and resolving issues related to network connectivity or DNS resolution within the Kubernetes cluster that may be affecting the readiness of the artifactory API endpoint. 3. Identifying and mitigating potential resource constraints or bottlenecks in the Kubernetes environment that could be causing the artifactory API endpoint to become unresponsive. 4. Implementing automated alerting and remediation processes to proactively detect and address any future readiness issues with the artifactory API endpoint. 5. Collaborating with developers to optimize the performance and resilience of the artifactory API endpoint, potentially through changes to the underlying Kubernetes deployment configuration or resource allocation.

4 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection


Inspects the resources provisioned for a given set of pods and raises issues or recommendations as necessary.

Tasks:
  • Show Pods Without Resource Limit or Resource Requests Set in Namespace `NAMESPACE`
  • Get Pod Resource Utilization with Top in Namespace `NAMESPACE`
  • Identify VPA Pod Resource Recommendations in Namespace `NAMESPACE`
  • Identify Resource Constrained Pods In Namespace `NAMESPACE`
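
The first task in this list can be approximated by scanning the output of `kubectl get pods -n <namespace> -o json` for containers that lack requests or limits. This is a minimal sketch; the function name `pods_missing_resources` is hypothetical and the real task may report results differently.

```python
import json


def pods_missing_resources(raw_json):
    """Return (pod, container) pairs that lack resource requests or limits."""
    missing = []
    for pod in json.loads(raw_json).get("items", []):
        for container in pod.get("spec", {}).get("containers", []):
            resources = container.get("resources", {})
            if not resources.get("limits") or not resources.get("requests"):
                missing.append((pod["metadata"]["name"], container["name"]))
    return missing


sample = json.dumps({"items": [
    {"metadata": {"name": "api"}, "spec": {"containers": [
        {"name": "app", "resources": {"limits": {"cpu": "1"}, "requests": {"cpu": "500m"}}},
        {"name": "sidecar", "resources": {}},
    ]}},
]})
print(pods_missing_resources(sample))  # [('api', 'sidecar')]
```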

9 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection


A suite of tasks that can be used to triage potential issues in your vault namespace.

Tasks:
  • Fetch Vault CSI Driver Logs
    Common scenarios that might relate to this command or script:
    1. Investigating a Kubernetes CrashLoopBackoff event for the daemonset "vault-csi-provider" to identify the root cause of the issue and fix it. 2. Monitoring and analyzing the performance and behavior of the "vault-csi-provider" daemonset in a production environment to ensure smooth operation. 3. Troubleshooting user-reported issues related to the "vault-csi-provider" daemonset by reviewing its logs to determine any potential errors or anomalies. 4. Auditing and investigating security incidents or compliance violations related to the "vault-csi-provider" daemonset by examining its logs for suspicious activities. 5. Performing routine maintenance and oversight of the "vault-csi-provider" daemonset, such as checking for any operational errors or warnings in its logs.
  • Get Vault CSI Driver Warning Events
    Common scenarios that might relate to this command or script:
    1. Monitoring and troubleshooting the Kubernetes cluster for potential issues related to the "vault-csi-provider" in a specific namespace. 2. Investigating and addressing crash loop events that are affecting the performance of the "vault-csi-provider" in the Kubernetes context. 3. Identifying and resolving warning events that may impact the stability and reliability of the "vault-csi-provider" in the specified namespace. 4. Analyzing and debugging errors or warnings related to the "vault-csi-provider" in the Kubernetes environment to ensure smooth operation. 5. Troubleshooting and resolving any potential issues with the "vault-csi-provider" that could lead to a CrashLoopBackoff event in the Kubernetes context.
  • Check Vault CSI Driver Replicas
    Common scenarios that might relate to this command or script:
    1. Troubleshooting Kubernetes CrashLoopBackoff events for the "vault-csi-provider" daemonset. 2. Investigating why the "vault-csi-provider" daemonset is not running or experiencing errors in a specific context and namespace. 3. Monitoring and understanding the resource utilization and scheduling of the "vault-csi-provider" daemonset in the Kubernetes cluster. 4. Debugging issues related to the deployment and scaling of the "vault-csi-provider" daemonset. 5. Verifying the configuration and settings of the "vault-csi-provider" daemonset to ensure it meets the desired specifications and requirements.
  • Fetch Vault Logs
    Common scenarios that might relate to this command or script:
    1. Investigating an issue with the "vault" statefulset where it is reporting CrashLoopBackoff events, and troubleshooting possible reasons for the failures by inspecting the last 100 lines of logs. 2. Monitoring the performance of the "vault" statefulset and identifying any potential errors or issues by periodically checking the logs. 3. Debugging an incident where the "vault" statefulset is not functioning as expected, and using the logs to pinpoint the root cause of the problem. 4. Troubleshooting a deployment issue with the "vault" statefulset, such as pods not starting up properly, by examining the logs for any error messages or warnings. 5. Conducting routine maintenance or debugging tasks on the "vault" statefulset, such as during an upgrade or scaling operation, to ensure that the system is running smoothly and identify any potential issues before they become critical.
  • Get Related Vault Events
    Common scenarios that might relate to this command or script:
    1. Troubleshooting a critical production issue where the Kubernetes CrashLoopBackoff events are causing an application to fail or become unresponsive. 2. Investigating and resolving recurring warning events related to the "vault" in a specific Kubernetes context and namespace that may be affecting the performance or availability of the environment. 3. Monitoring and analyzing the error logs related to the "vault" in a Kubernetes context and namespace to proactively identify and address any potential issues before they impact the system. 4. Conducting a post-incident analysis to understand the root cause of a recent outage or disruption in the Kubernetes environment, involving the exploration of warning events related to the "vault". 5. Performing routine maintenance and system health checks to ensure the stability and reliability of the Kubernetes deployment, including addressing any warning events related to the "vault" that may have been flagged during regular monitoring activities.
  • Fetch Vault StatefulSet Manifest Details
    Common scenarios that might relate to this command or script:
    1. Troubleshooting an issue with the statefulset named "vault" not starting properly in a particular namespace, and needing to investigate its configuration to identify any potential misconfigurations. 2. Debugging a Kubernetes CrashLoopBackoff event for the "vault" statefulset and needing to review its configuration to identify any potential issues causing the crash loops. 3. Implementing changes or updates to the configuration of the "vault" statefulset in the specified namespace and needing to verify the current configuration before making any modifications. 4. Conducting a review of the current statefulset configuration for "vault" in order to ensure compliance with best practices or security standards. 5. Analyzing the configuration of the "vault" statefulset in the context of a specific Kubernetes cluster as part of a broader investigation into performance or stability issues.
  • Fetch Vault DaemonSet Manifest Details
    Common scenarios that might relate to this command or script:
    1. Troubleshooting CrashLoopBackoff events: DevOps or SREs might use this command to retrieve the configuration for a specific daemonset in order to debug and diagnose issues causing CrashLoopBackoff events in the Kubernetes cluster. 2. Updating or modifying configuration: When making changes to the configuration of a specific daemonset, DevOps or SREs might use this command to retrieve the current configuration as a reference and then make necessary modifications before applying the changes to the cluster. 3. Auditing and compliance checks: DevOps or SREs might use this command to audit the configuration of a specific daemonset to ensure it meets security and compliance standards. 4. Disaster recovery: In the event of a disaster or unexpected outage, DevOps or SREs might use this command to quickly retrieve the configuration of a specific daemonset in order to rebuild or restore the cluster to its previous state. 5. Investigating performance issues: When investigating performance issues within the Kubernetes cluster, DevOps or SREs might use this command to analyze the configuration of a specific daemonset to identify any potential bottlenecks or optimization opportunities.
  • Verify Vault Availability
    Common scenarios that might relate to this command or script:
    1. Troubleshooting Kubernetes CrashLoopBackoff events: A DevOps or Site Reliability Engineer might use the cURL tool to send requests to various endpoints in order to diagnose the cause of the crash loop and identify any issues with the server or application. 2. Testing API endpoints: The engineer might use cURL to test the functionality of different API endpoints within the system, ensuring that data is being retrieved and manipulated correctly. 3. Debugging network connectivity issues: If there are issues with network connectivity between servers or services, the engineer might use cURL to check for connectivity and diagnose any problems. 4. Monitoring vault operations: The engineer might use cURL to monitor and manage operations on a secure repository, such as fetching secrets from Vault or updating configurations. 5. Automating tasks: The engineer might use cURL in scripts or automation tools to perform routine tasks, such as regularly retrieving data from a server at the specified VAULT_URL and processing it for further analysis or storage.
  • Check Vault StatefulSet Replicas
    Common scenarios that might relate to this command or script:
    1. Investigating and troubleshooting issues related to the StatefulSet 'vault' in a Kubernetes cluster, such as CrashLoopBackoff events, high resource utilization, or deployment failures. 2. Monitoring and analyzing the performance and availability of the 'vault' StatefulSet in order to make informed decisions about scaling, updates, or other operational tasks. 3. Automating the retrieval and analysis of StatefulSet information for 'vault' as part of a larger monitoring and alerting system to proactively address potential issues. 4. Integrating the StatefulSet information retrieval into a larger DevOps or SRE workflow for managing and maintaining the 'vault' application within the Kubernetes environment. 5. Troubleshooting and resolving any unintended changes or discrepancies in the 'vault' StatefulSet configuration or deployment within the specified namespace and context.

9 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


Triages issues related to a deployment and its replicas.

Tasks:
  • Check Deployment Log For Issues with `DEPLOYMENT_NAME`
  • Check Liveness Probe Configuration for Deployment `DEPLOYMENT_NAME`
  • Check Readiness Probe Configuration for Deployment `DEPLOYMENT_NAME`
  • Inspect Container Restarts for Deployment `DEPLOYMENT_NAME` Namespace `NAMESPACE`
  • Inspect Deployment Warning Events for `DEPLOYMENT_NAME`
  • Get Deployment Workload Details For `DEPLOYMENT_NAME` and Add to Report
    Common scenarios that might relate to this command or script:
    1. Troubleshooting Kubernetes CrashLoopBackoff events: When a deployment is continuously crashing and restarting, a DevOps or SRE might use this command to retrieve the YAML configuration in order to review the settings and find any misconfigurations that could be causing the issue. 2. Investigating performance issues: If there are performance issues with a specific deployment, a DevOps or SRE might use this command to view the detailed configuration and identify any potential bottlenecks or inefficiencies. 3. Auditing and documentation: In order to keep track of the configurations for different deployments, a DevOps or SRE might use this command to retrieve the YAML configuration and document it for future reference or auditing purposes. 4. Comparing configurations: When comparing the settings of different deployments or versions of a deployment, a DevOps or SRE might use this command to retrieve the YAML configuration and compare them side by side. 5. Making changes to the deployment: Before making any changes to the deployment configuration, a DevOps or SRE might use this command to retrieve the current configuration as a reference point for the updates.
  • Inspect Deployment Replicas for `DEPLOYMENT_NAME`
  • Check Deployment Event Anomalies for `DEPLOYMENT_NAME`
  • Check ReplicaSet Health for Deployment `DEPLOYMENT_NAME`

8 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


Runs a series of tasks to check the overall health of a postgres cluster and to provide detailed information useful for debugging or reviewing configurations.

Tasks:
  • List Resources Related to Postgres Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
  • Get Postgres Pod Logs & Events for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
  • Get Postgres Pod Resource Utilization for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
  • Get Running Postgres Configuration for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
  • Get Patroni Output and Add to Report for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
  • Fetch Patroni Database Lag for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
  • Check Database Backup Status for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
  • Run DB Queries for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`

3 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


Runs multiple Kubernetes and psql commands to report on the health of a postgres cluster. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Checks for database lag and backup health.

Tasks:
  • Fetch Patroni Database Lag
  • Check Database Backup Status for Cluster `${OBJECT_NAME}` in Namespace `${NAMESPACE}`
  • Generate Namespace Score

3 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection


This taskset performs diagnostic checks on an OpenTelemetry Collector to ensure it's pushing metrics.

Tasks:
  • Query Collector Queued Spans in Namespace `NAMESPACE`
  • Check OpenTelemetry Collector Logs For Errors In Namespace `NAMESPACE`
  • Scan OpenTelemetry Logs For Dropped Spans In Namespace `NAMESPACE`

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


List all GCP nodes that have been preempted in the previous time interval.

Tasks:
  • List all nodes in an active preempt operation for GCP Project `GCP_PROJECT_ID`

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


Counts nodes that have been preempted within the defined time interval.

Tasks:
  • Count the number of nodes in active preempt operation

2 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


Triages issues related to ingress objects and services.

Tasks:
  • Fetch Ingress Object Health in Namespace `NAMESPACE`
  • Check for Ingress and Service Conflicts in Namespace `NAMESPACE`

5 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


This taskset suspends a Flux resource so that chaos tasks can be executed against it.

Tasks:
  • Suspend the Flux Resource Reconciliation
  • Find Random FluxCD Workload as Chaos Target
  • Execute Chaos Command
  • Execute Additional Chaos Command
  • Resume Flux Resource Reconciliation

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


This taskset queries the Jaeger API directly for trace details and parses the results.

Tasks:
  • Query Traces in Jaeger for Unhealthy HTTP Response Codes in Namespace `NAMESPACE`

3 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection


Performs application-level troubleshooting by inspecting the logs of a workload for parsable exceptions, and attempts to determine next steps.

Tasks:
  • Get `CONTAINER_NAME` Application Logs
    Common scenarios that might relate to this command or script:
    1. Troubleshooting a Kubernetes CrashLoopBackoff event to identify the root cause of the issue and fix the underlying problem. 2. Analyzing application errors or performance issues in a Kubernetes cluster by examining the logs of specific containers. 3. Monitoring and debugging a deployment rollout to verify that new pods are starting correctly and to investigate any potential failures. 4. Investigating a security incident or suspicious activity within a Kubernetes cluster by reviewing container logs for any unauthorized access or malicious behavior. 5. Troubleshooting connectivity or networking problems within a Kubernetes environment by inspecting the logs of affected containers.
  • Scan `CONTAINER_NAME` Application For Misconfigured Environment
  • Tail `CONTAINER_NAME` Application Logs For Stacktraces
    Common scenarios that might relate to this command or script:
    1. Troubleshooting Kubernetes CrashLoopBackoff events 2. Analyzing performance issues within a specific container in a deployment or stateful set 3. Investigating resource utilization and potential bottlenecks within a specific namespace and context 4. Monitoring and debugging application errors or crashes within a specific container 5. Identifying and resolving networking issues within a specific deployment or stateful set

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection


Measures the number of exception stacktraces present in an application's logs over a time period.

Tasks:
  • Measure Application Exceptions
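
To make the measurement concrete, the sketch below counts stacktraces in log text such as the output of `kubectl logs --since=<period>`. It assumes Python-style tracebacks only; the codebundle's own parsers and supported languages are not described here, and the name `count_stacktraces` is hypothetical.

```python
import re

# Each Python traceback begins with this header line.
TRACE_START = re.compile(r"^Traceback \(most recent call last\):", re.MULTILINE)


def count_stacktraces(log_text):
    """Count Python-style exception stacktraces in a block of log text."""
    return len(TRACE_START.findall(log_text))


sample_logs = """INFO starting worker
Traceback (most recent call last):
  File "app.py", line 10, in <module>
    main()
ValueError: bad input
INFO retrying
Traceback (most recent call last):
  File "app.py", line 10, in <module>
    main()
KeyError: 'id'
"""
print(count_stacktraces(sample_logs))  # 2
```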

2 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection


This taskset collects information on the Redis workload in your Kubernetes cluster and raises issues if any health checks fail.

Tasks:
  • Ping `DEPLOYMENT_NAME` Redis Workload
    Common scenarios that might relate to this command or script:
    1. Troubleshooting a Kubernetes CrashLoopBackoff event for a specific Redis deployment to see if the server is running properly and responding to commands. 2. Performing routine health checks on Redis deployments within the Kubernetes cluster to ensure that the servers are operational and responsive. 3. Checking the status of the Redis server after a recent deployment or upgrade to ensure that it is functioning as expected within the Kubernetes environment. 4. Verifying the status of the Redis server in response to user-reported issues or errors related to data storage or retrieval. 5. Investigating performance or latency issues within the Kubernetes cluster by inspecting the responsiveness of the Redis servers using the redis-cli PING command.
  • Verify `DEPLOYMENT_NAME` Redis Read Write Operation
    Common scenarios that might relate to this command or script:
    1. Troubleshooting application performance issues related to Redis in a Kubernetes environment. 2. Investigating and resolving connectivity issues between a Kubernetes deployment and the Redis database. 3. Monitoring and diagnosing potential data inconsistencies or corruption in the Redis database within a Kubernetes cluster. 4. Analyzing and troubleshooting CrashLoopBackoff events related to the Redis deployment in Kubernetes. 5. Providing support for developers by retrieving specific key values from the Redis database within a Kubernetes environment for debugging purposes.

6 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


This taskset collects information about storage such as PersistentVolumes and PersistentVolumeClaims to validate health or help troubleshoot potential storage issues.

Tasks:
  • Fetch Events for Unhealthy Kubernetes PersistentVolumeClaims in Namespace `NAMESPACE`
  • List PersistentVolumeClaims in Terminating State in Namespace `NAMESPACE`
  • List PersistentVolumes in Terminating State in Namespace `NAMESPACE`
  • List Pods with Attached Volumes and Related PersistentVolume Details in Namespace `NAMESPACE`
  • Fetch the Storage Utilization for PVC Mounts in Namespace `NAMESPACE`
  • Check for RWO Persistent Volume Node Attachment Issues in Namespace `NAMESPACE`

2 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


This SLI collects information about storage such as PersistentVolumes and PersistentVolumeClaims and generates an aggregated health score for the namespace. A score of 1 means healthy, 0 means failed, and any value between 0 and 1 means degraded.

Tasks:
  • Fetch the Storage Utilization for PVC Mounts in Namespace `${NAMESPACE}`
  • Generate Namespace Score
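
An aggregated score with healthy, degraded, and failed ranges could be produced as the fraction of PVCs below a utilization threshold. The 80% threshold, the input shape, and the name `storage_score` are assumptions for this sketch and are not taken from the codebundle.

```python
def storage_score(utilization_percent_by_pvc, threshold=80.0):
    """Return the fraction of PVCs whose utilization is below the threshold.

    1.0 means healthy, 0.0 means failed, anything in between means degraded.
    """
    if not utilization_percent_by_pvc:
        return 1.0  # nothing to check: assume healthy
    healthy = sum(1 for pct in utilization_percent_by_pvc.values() if pct < threshold)
    return healthy / len(utilization_percent_by_pvc)


print(storage_score({"data-db-0": 50.0, "data-db-1": 95.0}))  # 0.5
```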

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


This taskset provides tasks to troubleshoot service accounts in a Kubernetes namespace.

Tasks:
  • Test Service Account Access to Kubernetes API Server in Namespace `NAMESPACE`

5 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


This codebundle runs a series of tasks to identify potential Helm release issues related to Flux-managed Helm objects.

Tasks:
  • List all available FluxCD Helmreleases in Namespace `NAMESPACE`
  • Fetch Installed FluxCD Helmrelease Versions in Namespace `NAMESPACE`
  • Fetch Mismatched FluxCD HelmRelease Version in Namespace `NAMESPACE`
  • Fetch FluxCD HelmRelease Error Messages in Namespace `NAMESPACE`
  • Check for Available Helm Chart Updates in Namespace `NAMESPACE`

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


This codebundle fetches the number of running pods with the set of provided labels, letting you measure the number of running pods.

Tasks:
  • Measure Number of Running Pods with Label
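
A short sketch of the measurement, assuming the pods were fetched with a label selector such as `kubectl get pods -l app=web -o json` (the label is an example). The function name `count_running_pods` is hypothetical.

```python
import json


def count_running_pods(raw_json):
    """Count pods whose status.phase is Running in `kubectl get pods -o json` output."""
    pods = json.loads(raw_json).get("items", [])
    return sum(1 for pod in pods if pod.get("status", {}).get("phase") == "Running")


sample = json.dumps({"items": [
    {"status": {"phase": "Running"}},
    {"status": {"phase": "Pending"}},
    {"status": {"phase": "Running"}},
]})
print(count_running_pods(sample))  # 2
```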

5 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection


Troubleshoots GCE Ingress resources related to the GCP HTTP Load Balancer in GKE.

Tasks:
  • Search For GCE Ingress Warnings in GKE
    Common scenarios that might relate to this command or script:
    1. Troubleshooting Kubernetes CrashLoopBackoff events: A DevOps or Site Reliability Engineer might use this command to gather information on abnormal events related to an Ingress and its associated Services in order to identify and fix any issues causing the CrashLoopBackoff events. 2. Investigating service disruption: If there are reports of service disruption within a specific namespace, a DevOps or Site Reliability Engineer might use this command to retrieve events related to the Ingress and Services to identify any abnormal events causing the disruption. 3. Debugging failed deployments: When a deployment fails within a specified namespace, a DevOps or Site Reliability Engineer might use this command to gather information on any abnormal events related to the Ingress and Services that could be contributing to the failed deployment. 4. Monitoring for unusual behavior: As part of routine monitoring and maintenance, a DevOps or Site Reliability Engineer might use this command to regularly check for abnormal events related to Ingress and Services within a specific namespace for any unusual behavior that could indicate potential issues. 5. Identifying resource conflicts: In a multi-tenant environment, a DevOps or Site Reliability Engineer might use this command to retrieve events related to the Ingress and Services in order to identify any resource conflicts or issues arising from interactions between different applications or services within the same namespace.
  • Identify Unhealthy GCE HTTP Ingress Backends
    Common scenarios that might relate to this command or script:
    1. Troubleshooting Kubernetes CrashLoopBackoff events: A DevOps or Site Reliability Engineer may use this task to quickly identify and address unhealthy backends that are causing CrashLoopBackoff events in a Kubernetes cluster.
    2. Monitoring and alerting: This task can be used to set up automated monitoring and alerting for unhealthy backends in a Kubernetes cluster, allowing the team to proactively address any issues before they impact the system.
    3. Incident response: In the event of a system outage or performance degradation, a DevOps or SRE may use this task to quickly identify and address any unhealthy backends that are contributing to the issue.
    4. Capacity planning: This task can be used to analyze the health and status of backends in a Kubernetes cluster, allowing the team to make informed decisions about capacity planning and resource allocation.
    5. Continuous improvement: By regularly using this task to monitor and analyze the health of backends in a Kubernetes cluster, a DevOps or SRE can identify areas for improvement and optimize the system for better performance and reliability.
  • Validate GCP HTTP Load Balancer Configurations
  • Fetch Network Error Logs from GCP Operations Manager for Ingress Backends
    Common scenarios that might relate to this command or script:
    1. Monitoring and troubleshooting an Ingress controller in Kubernetes when it goes into CrashLoopBackoff due to unhealthy backends.
    2. Investigating and resolving issues with backend services not responding or returning errors within a Kubernetes cluster.
    3. Troubleshooting and identifying the root cause of failures in Kubernetes pods or deployments that result in CrashLoopBackoff events.
    4. Analyzing GCP logging data for error messages related to Kubernetes workloads and diagnosing issues such as connectivity problems or service outages.
    5. Automating the process of identifying and retrieving error logs from GCP logging for specific backends in an Ingress controller in Kubernetes.
  • Review GCP Operations Logging Dashboard
    Common scenarios that might relate to this command or script:
    1. Troubleshooting Kubernetes CrashLoopBackoff events: The DevOps or Site Reliability Engineer might use this command to quickly access and review logs for unhealthy backends in order to identify the root cause of the CrashLoopBackoff events.
    2. Investigating high error rates in a specific GCP project or namespace: The engineer might use the command to easily gather and analyze logs from specific environments to identify patterns or issues causing high error rates.
    3. Monitoring and analyzing traffic spikes or anomalies in a GCP ingress: The command could be used to generate logs for a specific ingress and quickly review them for unusual traffic patterns or anomalies.
    4. Troubleshooting performance issues in a particular GCP context: The engineer might utilize the command to gather and analyze logs for a specific context to diagnose and resolve performance-related issues.
    5. Investigating failures in a specific environment or application namespace: The command could be used to quickly access and investigate logs for a specific environment or application namespace experiencing failures or errors.

Icon 1 2 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Identify resource constraints or issues in a cluster.

Tasks:
  • Identify High Utilization Nodes for Cluster `CONTEXT`
  • Identify Pods Causing High Node Utilization in Cluster `CONTEXT`

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Counts the number of nodes above 90% CPU or Memory Utilization from kubectl top.

Tasks:
  • Identify High Utilization Nodes for Cluster `${CONTEXT}`
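
The count described above can be sketched as follows. This is a rough illustration under stated assumptions, not the codebundle's own code: it assumes the usual five-column `kubectl top nodes` table (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%) has already been captured as text, and the function name `count_high_utilization_nodes` is hypothetical.

```python
def count_high_utilization_nodes(top_output: str, threshold: float = 90.0) -> int:
    """Count nodes whose CPU% or MEMORY% is above the threshold."""
    count = 0
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # not a complete node row
        try:
            cpu = float(fields[2].rstrip("%"))
            mem = float(fields[4].rstrip("%"))
        except ValueError:
            continue  # e.g. "<unknown>" when metrics are unavailable
        if cpu > threshold or mem > threshold:
            count += 1
    return count
```

Rows without usable percentages are skipped rather than counted, which is a design choice for this sketch; a stricter check might treat missing metrics as a problem of their own.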

Icon 1 2 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-cli-codecollection


Performs application-level troubleshooting by inspecting the logs of a workload for parsable exceptions, and attempts to determine next steps.

Tasks:
  • Get `CONTAINER_NAME` Application Logs
  • Tail `CONTAINER_NAME` Application Logs For Stacktraces

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-cli-codecollection


Measures the number of exception stacktraces present in an application's logs over a time period.

Tasks:
  • Tail `${CONTAINER_NAME}` Application Logs For Stacktraces
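
One way to approximate this measurement is sketched below. The markers are assumptions chosen for illustration (Python's `Traceback (most recent call last):` header and the JVM's `Exception in thread "..."` line); the codebundle's real parsing rules are not shown in this listing, and the function name `count_stacktraces` is hypothetical.

```python
import re

# Assumed markers for the first line of a stacktrace; real logs vary by
# language and logger, and lines with a timestamp prefix would not match.
TRACE_START = re.compile(
    r'^(?:Traceback \(most recent call last\):|Exception in thread ".*")',
    re.MULTILINE,
)


def count_stacktraces(log_text: str) -> int:
    """Count lines that look like the start of an exception stacktrace."""
    return len(TRACE_START.findall(log_text))
```

The log text would typically come from a bounded window, for example `kubectl logs --since=10m`, so that the count reflects a time period.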

Icon 1 5 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-cli-codecollection


This taskset investigates the logs, state and health of the Kubernetes Prometheus operator.

Tasks:
  • Check Prometheus Service Monitors
  • Check For Successful Rule Setup
    Common scenarios that might relate to this command or script:
    1. Monitoring and troubleshooting the overall health and performance of Kubernetes clusters
    2. Investigating issues with applications or microservices running on Kubernetes pods, such as service failures or high resource usage
    3. Identifying and addressing problems with containerized applications, such as crashes or network connectivity issues
    4. Analyzing and debugging system and application logs for specific error patterns or anomalies
    5. Proactively monitoring and detecting potential security threats or unauthorized access within Kubernetes environments
  • Verify Prometheus RBAC Can Access ServiceMonitors
    Common scenarios that might relate to this command or script:
    1. Troubleshooting Kubernetes CrashLoopBackoff events: A DevOps or Site Reliability Engineer might use this command to retrieve the details of a specific ClusterRole in order to investigate if the role permissions are causing the CrashLoopBackoff events.
    2. Managing and auditing access control: This command can be used to view and manage the permissions and access controls for different resources within a Kubernetes cluster. A DevOps or Site Reliability Engineer might use this command to audit and update the permissions of a specific ClusterRole.
    3. Debugging deployment issues: If there are issues with deploying certain resources within the Kubernetes cluster, a DevOps or Site Reliability Engineer might use this command to retrieve the details of a specific ClusterRole to ensure that the necessary permissions are in place for the deployment to succeed.
    4. Monitoring and troubleshooting resource usage: This command can be used to retrieve information about the resources allocated and used by a specific ClusterRole within the Kubernetes cluster. A DevOps or Site Reliability Engineer might use this command to monitor and troubleshoot any resource usage issues related to the role.
    5. Performing routine maintenance and upgrades: As part of routine maintenance and upgrade tasks, a DevOps or Site Reliability Engineer might use this command to review and update the permissions of a specific ClusterRole to ensure compatibility and compliance with the latest changes and updates in the Kubernetes cluster.
  • Identify Endpoint Scraping Errors
    Common scenarios that might relate to this command or script:
    1. Monitoring the health and performance of the Prometheus container in a Kubernetes environment
    2. Troubleshooting issues with data scraping or ingestion in a Prometheus instance running in a Kubernetes cluster
    3. Investigating errors or anomalies related to Prometheus metrics collection and storage
    4. Performing log analysis and troubleshooting for Prometheus containers experiencing CrashLoopBackoff events
    5. Verifying the successful retrieval and filtering of logs from the Prometheus container for proactive monitoring and alerting purposes
  • Check Prometheus API Healthy
    Common scenarios that might relate to this command or script:
    1. Troubleshooting Kubernetes CrashLoopBackoff events: A DevOps or Site Reliability Engineer might use this command to check the health status of the Prometheus container after it experiences CrashLoopBackoff events, in order to diagnose and resolve any issues causing the continuous crashing.
    2. Monitoring application health during deployment: During the deployment of a new version of an application on Kubernetes, a DevOps or Site Reliability Engineer might use this command to continuously monitor the health status of the Prometheus container to ensure that the new version is functioning properly.
    3. Investigating intermittent connectivity issues: If there are intermittent connectivity issues reported by users accessing the application hosted on Kubernetes, a DevOps or Site Reliability Engineer might use this command to check the health status of the Prometheus container and investigate if there are any underlying network issues affecting the application.
    4. Performance troubleshooting: When performance issues are reported with an application running on Kubernetes, a DevOps or Site Reliability Engineer might use this command to monitor the health status of the Prometheus container and gather insights into potential performance bottlenecks.
    5. Post-incident analysis: After an incident or outage involving the application on Kubernetes, a DevOps or Site Reliability Engineer might use this command to analyze the health status of the Prometheus container and identify any issues that contributed to the incident, in order to prevent similar problems in the future.

Icon 1 8 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-cli-codecollection


Triages issues related to a StatefulSet and its replicas.

Tasks:
  • Check Readiness Probe Configuration for StatefulSet `STATEFULSET_NAME`
  • Check Liveness Probe Configuration for StatefulSet `STATEFULSET_NAME`
  • Troubleshoot StatefulSet Warning Events for `STATEFULSET_NAME`
  • Check StatefulSet Event Anomalies for `STATEFULSET_NAME`
  • Fetch StatefulSet Logs for `STATEFULSET_NAME`
  • Get Related StatefulSet `STATEFULSET_NAME` Events
    Common scenarios that might relate to this command or script:
    1. Troubleshooting a Kubernetes CrashLoopBackoff event in a production environment to ensure that the application is running smoothly and efficiently.
    2. Monitoring and managing resource utilization within a specific namespace to optimize performance and prevent potential issues.
    3. Investigating networking or connectivity issues within a Kubernetes cluster to ensure seamless communication between pods and services.
    4. Resolving deployment failures or errors within a statefulset to maintain the availability and stability of the application.
    5. Performing routine maintenance and checks on Kubernetes clusters to proactively identify and address any potential issues before they escalate.
  • Fetch Manifest Details for StatefulSet `STATEFULSET_NAME`
  • List StatefulSets with Unhealthy Replica Counts In Namespace `NAMESPACE`

Icon 1 2 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-cli-codecollection


This taskset collects information about persistent volumes and persistent volume claims to validate health or help troubleshoot potential issues.

Tasks:
  • Query The Jenkins Kubernetes Workload HTTP Endpoint
    Common scenarios that might relate to this command or script:
    1. Troubleshooting a CrashLoopBackoff event in a StatefulSet to identify any issues with the container startup process.
    2. Checking for connectivity issues or errors within a specific container in a StatefulSet.
    3. Monitoring and debugging the interaction between a Jenkins service account and a specific container within a StatefulSet.
    4. Gathering specific data or metrics from a container in a StatefulSet for analysis or debugging purposes.
    5. Verifying the response of a specific endpoint or API within a container in a StatefulSet.
  • Query For Stuck Jenkins Jobs
    Common scenarios that might relate to this command or script:
    1. Troubleshooting Kubernetes CrashLoopBackoff events: A DevOps or Site Reliability Engineer might use this command to investigate and gather information about stuck or blocked items in a Jenkins job queue that could be causing the CrashLoopBackoff.
    2. Monitoring and debugging performance issues: If there are performance issues with a specific statefulset within the Kubernetes cluster, the engineer might use this command to retrieve information and identify any stuck or blocked items impacting the performance.
    3. Investigating job queue delays: In the event of delays in the Jenkins job queue, the engineer may use this command to gather information on any stuck or blocked items that could be causing the delays.
    4. Identifying and resolving resource contention: This command could be used to gather data on any resource contention within a statefulset, helping the engineer to identify and address any stuck or blocked items contributing to the issue.
    5. Troubleshooting job failures: If there are frequent job failures within a specific statefulset, the engineer might use this command to retrieve information and pinpoint any stuck or blocked items that could be causing the failures.

Icon 1 3 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-cli-codecollection


Triages issues related to a DaemonSet and its available replicas.

Tasks:
  • Get DaemonSet Log Details For Report
    Common scenarios that might relate to this command or script:
    1. Monitoring the health and performance of a specific daemonset in a Kubernetes cluster to troubleshoot any issues or anomalies.
    2. Investigating frequent CrashLoopBackoff events for a particular daemonset to identify the root cause and potential solutions.
    3. Analyzing the logs of a specific daemonset to track down errors or issues related to resource utilization, connectivity, or application functionality.
    4. Troubleshooting networking problems or intermittent failures for a daemonset by reviewing its recent log entries to identify patterns or recurring issues.
    5. Performing regular maintenance or checks on a specific daemonset to proactively identify and address any potential issues before they impact production environments.
  • Get Related Daemonset Events
    Common scenarios that might relate to this command or script:
    1. Monitoring and troubleshooting Kubernetes cluster for potential issues such as CrashLoopBackoff events
    2. Investigating performance or stability issues within a specific Kubernetes context and namespace
    3. Troubleshooting errors related to a specific daemon set in a Kubernetes cluster
    4. Conducting regular maintenance and auditing of Kubernetes clusters for potential issues or misconfigurations
    5. Investigating and resolving any potential security vulnerabilities or breaches in a Kubernetes environment
  • Check Daemonset Replicas
    Common scenarios that might relate to this command or script:
    1. Troubleshooting Kubernetes CrashLoopBackoff events in a production environment to identify the root cause and resolve the issue.
    2. Conducting a routine check on various daemonsets in a Kubernetes cluster to ensure they are running as expected and have the correct configuration.
    3. Investigating performance issues related to a specific daemonset in a Kubernetes cluster and using the command to gather detailed information for analysis.
    4. Auditing the status and configuration of all daemonsets in a Kubernetes cluster as part of regular maintenance tasks.
    5. Resolving connectivity or networking issues affecting a specific daemonset by examining its current status and configuration with the command.

Icon 1 3 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Checks the overall health of certificates in a namespace that are managed by cert-manager.

Tasks:
  • Get Namespace Certificate Summary for Namespace `NAMESPACE`
  • Find Unhealthy Certificates in Namespace `NAMESPACE`
  • Find Failed Certificate Requests and Identify Issues for Namespace `NAMESPACE`

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Counts the number of unhealthy cert-manager managed certificates in a namespace.

Tasks:
  • Count Unready and Expired Certificates
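
A count of this kind could be computed roughly as sketched below. It assumes the certificates were fetched with something like `kubectl get certificates -n <namespace> -o json`, and that each item exposes cert-manager's `status.conditions` (with a `Ready` condition) and `status.notAfter` timestamp. The function name `count_unhealthy_certificates` and the rule "not Ready or already expired" are assumptions made for this illustration, not the codebundle's confirmed logic.

```python
import json
from datetime import datetime, timezone


def count_unhealthy_certificates(cert_list_json: str, now: datetime) -> int:
    """Count certificates that are not Ready or whose notAfter time has passed."""
    items = json.loads(cert_list_json).get("items", [])
    unhealthy = 0
    for cert in items:
        status = cert.get("status", {})
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        expired = False
        not_after = status.get("notAfter")
        if not_after:
            # Normalise a trailing "Z" so fromisoformat accepts it on older Pythons.
            expiry = datetime.fromisoformat(not_after.replace("Z", "+00:00"))
            expired = expiry <= now
        if not ready or expired:
            unhealthy += 1
    return unhealthy
```

Passing `now` explicitly (for example `datetime.now(timezone.utc)`) keeps the function deterministic and easy to test.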

Icon 1 3 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-cli-codecollection


This taskset restarts a resource with a given set of labels, typically used with other tasksets.

Tasks:
  • Get Current Resource State with Labels `LABELS`
  • Get Resource Logs with Labels `LABELS`
  • Restart Resource with Labels `LABELS`

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Evaluate cluster node health using kubectl.

Tasks:
  • Check for Node Restarts in Cluster `CONTEXT`

Icon 1 2 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Evaluate cluster node health using kubectl.

Tasks:
  • Check for Node Restarts in Cluster `${CONTEXT}`
  • Generate Namespace Score

Icon 1 4 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Provides a list of tasks that can remediate configuration issues with manifests in GitHub-based GitOps repositories.

Tasks:
  • Remediate Readiness and Liveness Probe GitOps Manifests in Namespace `NAMESPACE`
  • Increase ResourceQuota for Namespace `NAMESPACE`
  • Adjust Pod Resources to Match VPA Recommendation in `NAMESPACE`
  • Expand Persistent Volume Claims in Namespace `NAMESPACE`

Icon 1 9 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


This taskset runs general troubleshooting checks against all applicable objects in a namespace. Looks for warning events, odd or frequent normal events, restarting containers and failed or pending pods.

Tasks:
  • Inspect Warning Events in Namespace `NAMESPACE`
  • Inspect Container Restarts In Namespace `NAMESPACE`
  • Inspect Pending Pods In Namespace `NAMESPACE`
  • Inspect Failed Pods In Namespace `NAMESPACE`
  • Inspect Workload Status Conditions In Namespace `NAMESPACE`
  • Get Listing Of Resources In Namespace `NAMESPACE`
  • Check Event Anomalies in Namespace `NAMESPACE`
  • Check Missing or Risky PodDisruptionBudget Policies in Namespace `NAMESPACE`
  • Check Resource Quota Utilization in Namespace `NAMESPACE`

Icon 1 4 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


This SLI uses kubectl to score namespace health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, events, and pods not ready.

Tasks:
  • Get Event Count and Score
  • Get Container Restarts and Score
  • Get NotReady Pods
  • Generate Namespace Score
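
The listing does not say how the individual checks are combined into a single value. One plausible scheme, shown purely as an assumption, scores each check as 1 when its count is within a threshold and 0 otherwise, then averages the three results. The names `check_score` and `namespace_score` and the default thresholds of zero are hypothetical.

```python
def check_score(count: int, threshold: int = 0) -> int:
    """Return 1 if the observed count is within the threshold, else 0."""
    return 1 if count <= threshold else 0


def namespace_score(
    event_count: int,
    restart_count: int,
    not_ready_pods: int,
    event_threshold: int = 0,
    restart_threshold: int = 0,
    not_ready_threshold: int = 0,
) -> float:
    """Average the per-check scores into a value between 0 and 1 (assumed scheme)."""
    scores = [
        check_score(event_count, event_threshold),
        check_score(restart_count, restart_threshold),
        check_score(not_ready_pods, not_ready_threshold),
    ]
    return sum(scores) / len(scores)
```

Under this scheme a namespace with no warning events, no restarts and no unready pods scores 1, and each failing check lowers the score by one third.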

Icon 1 2 Troubleshooting Commands

Icon 2 Contributed by nmadhok

Icon 2 Codecollection: rw-cli-codecollection


This codebundle runs a series of tasks to identify potential helm release issues related to ArgoCD managed Helm objects.

Tasks:
  • Fetch all available ArgoCD Helm releases in namespace `NAMESPACE`
  • Fetch Installed ArgoCD Helm release versions in namespace `NAMESPACE`