KUBERNETES
Returns the number of events with matching messages as an SLI metric.
Tasks:
- Get Number Of Matching Events
Searches a namespace for matching objects and provides the commands to decommission them.
Tasks:
- Generate Decommission Commands
Checks that the current state of a daemonset is healthy and returns a score of either 1 (healthy) or 0 (unhealthy).
Tasks:
- Health Check Daemonset
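As a rough illustration of a 1/0 health score, the Python sketch below compares two status fields of a DaemonSet object, as found in the output of `kubectl get daemonset -o json`. The rule that every desired pod must be ready is an assumption, as is the function name `daemonset_health_score`; the codebundle's actual criteria are not stated here.

```python
def daemonset_health_score(daemonset: dict) -> int:
    """Return 1 (healthy) or 0 (unhealthy) for a parsed DaemonSet object.

    Assumption: "healthy" means numberReady equals desiredNumberScheduled.
    A DaemonSet with zero desired pods therefore also scores 1.
    """
    status = daemonset.get("status", {})
    desired = status.get("desiredNumberScheduled", 0)
    ready = status.get("numberReady", 0)
    return 1 if ready == desired else 0
```

The input would typically be the result of `json.loads` applied to the kubectl output.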
Check the health of a Kubernetes API server using kubectl.
Returns 1 when OK, or a 0 in the case of an unhealthy API server.
Tasks:
- Running Kubectl Check Against API Server
This codebundle runs a kubectl get command that produces a value and pushes the metric.
Uses jmespath for filtering and allows calculations such as count, sum, avg on specified fields.
Tasks:
- Running Kubectl get and push the metric
Uses kubectl (or equivalent) to query the state of a patroni cluster and determine if it's healthy.
Tasks:
- Determine Patroni Health
Runs multiple Kubernetes and psql commands to report on the health of a postgres cluster.
Tasks:
- Get Standard Resources
- Describe Custom Resources
- Get Pod Logs & Events
- Get Pod Resource Utilization
- Get Running Configuration
- Get Patroni Output
- Run DB Queries
Taskset to triage issues related to patroni.
Tasks:
- Get Patroni Status
- Get Pods Status
- Fetch Logs
Runs a postgres SQL query and pushes the returned result into a report.
During execution, the SQL query should be passed to a Kubernetes workload that has access to the psql binary.
The workload will run the query and return the results from stdout.
Tasks:
- Run Postgres Query And Results to Report
Triages issues related to a deployment's replicas.
Tasks:
- Fetch Logs
- Get Related Events
- Check Deployment Replicas
A taskset for troubleshooting general issues associated with typical Kubernetes deployment resources.
Supports API interactions via both the API client and the kubectl binary through RunWhen Shell Services.
Tasks:
- Troubleshoot Resourcing
- Troubleshoot Events
- Troubleshoot PVC
- Troubleshoot Pods
Retrieves aggregate data via the kubectl top command.
Tasks:
- Running Kubectl Top And Extracting Metric Data
This codebundle runs an arbitrary kubectl command and writes the stdout to a report.
Typically used in conjunction with other codebundles.
Tasks:
- Running Kubectl And Adding Stdout To Report
Creates an ad hoc one-shot job that mounts a PVC as a canary test; the job is polled for success before being torn down.
Tasks:
- Run Canary Job
A taskset for troubleshooting issues for StatefulSets and their related resources.
Tasks:
- Check StatefulSets Replicas Ready
- Get Events For The StatefulSet
- Get StatefulSet Logs
- Get StatefulSet Manifests Dump
Measures the maximum replica lag across a Patroni cluster.
Tasks:
- Measure Patroni Member Lag
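A minimal sketch of taking the maximum lag across cluster members is shown below in Python. It assumes the member list is available as JSON shaped like the output of `patronictl list -f json`, with a `Lag in MB` key on replicas; that key name, the handling of non-numeric lag values, and the function name `max_replica_lag` are assumptions and may not match what the codebundle does.

```python
import json


def max_replica_lag(members_json: str) -> float:
    """Return the largest numeric lag (in MB) among cluster members.

    Assumption: each member is a JSON object and replicas carry a numeric
    "Lag in MB" value. Members without a numeric lag (for example the
    leader, or a member reporting "unknown") are ignored.
    """
    members = json.loads(members_json)
    lags = [
        float(member["Lag in MB"])
        for member in members
        if isinstance(member.get("Lag in MB"), (int, float))
    ]
    # With no measurable replicas, report zero lag.
    return max(lags, default=0.0)
```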
Detects and reinitializes lagging Patroni cluster members that are unable to catch up in replication, using kubectl and patronictl.
Tasks:
- Determine Patroni Health
Uses kubectl to query the state of an ingester ring and determine if it's healthy. Returns 1 if healthy, 0 if unhealthy.
Tasks:
- Determine Cortex Ingester Ring Health
Uses kubectl to query the state of an ingester ring. Returns JSON containing each ingester's ID, status, and timestamp.
Tasks:
- Fetch Ingestor Ring Member List and Status
This SLI uses kubectl to score namespace health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, events, and pods not ready.
Tasks:
- Get Event Count and Score
- Get Container Restarts and Score
- Get NotReady Pods
- Generate Namespace Score
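One way three checks could be combined into a value between 0 and 1 is sketched below in Python. The catalog does not say how the sub-scores are weighted, so the unweighted mean, the thresholds, and the function name `namespace_score` are all assumptions made for illustration.

```python
def namespace_score(
    event_count: int,
    restart_count: int,
    not_ready_pods: int,
    event_threshold: int = 0,
    restart_threshold: int = 0,
    not_ready_threshold: int = 0,
) -> float:
    """Combine three pass/fail checks into a score between 0 and 1.

    Assumption: each check passes when its count is at or below its
    threshold, and the final score is the unweighted mean of the checks.
    """
    checks = [
        event_count <= event_threshold,
        restart_count <= restart_threshold,
        not_ready_pods <= not_ready_threshold,
    ]
    return sum(checks) / len(checks)
```

Under these assumptions, a namespace that fails only the restart check scores 2/3.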
This taskset runs general troubleshooting checks against all applicable objects in a namespace, checks error events, and searches pod logs for error entries.
Tasks:
- Trace Namespace Errors
- Fetch Unready Pods
- Triage Namespace
- Object Condition Check
- Namespace Get All
This codebundle fetches the number of running pods with the set of provided labels, letting you measure the number of running pods.
Tasks:
- Measure Number of Running Pods with Label
This taskset investigates the logs, state, and health of the Kubernetes Prometheus operator.
Tasks:
This taskset provides detailed information about the images used in a Kubernetes namespace.
Tasks:
- Check Image Rollover Times for Namespace `NAMESPACE`
- List Images and Tags for Every Container in Running Pods for Namespace `NAMESPACE`
- List Images and Tags for Every Container in Failed Pods for Namespace `NAMESPACE`
- List ImagePullBackOff Events and Test Path and Tags for Namespace `NAMESPACE`
This taskset is used to suspend a flux resource for the purposes of executing chaos tasks.
Tasks:
- Suspend the Flux Resource Reconciliation
- Find Random FluxCD Workload as Chaos Target
- Execute Chaos Command
- Execute Additional Chaos Command
- Resume Flux Resource Reconciliation
Troubleshoots GCE Ingress resources related to the GCP HTTP Load Balancer in GKE.
Tasks:
Counts the number of nodes above 90% CPU or Memory Utilization from kubectl top.
Tasks:
- Identify High Utilization Nodes for Cluster `${CONTEXT}`
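The counting step can be sketched in Python as a small parser over the text table printed by `kubectl top nodes` (columns NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%). The function name `count_high_utilization_nodes` and the choice to skip rows without percentage values are assumptions; this is not the codebundle's own implementation.

```python
def count_high_utilization_nodes(top_output: str, threshold: float = 90.0) -> int:
    """Count nodes whose CPU% or MEMORY% is strictly above the threshold."""
    count = 0
    # Skip the header row, then read the CPU% and MEMORY% columns.
    for line in top_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 5:
            continue
        percents = []
        for raw in (fields[2], fields[4]):
            # Rows without metrics (no trailing "%") are ignored.
            if raw.endswith("%"):
                try:
                    percents.append(float(raw[:-1]))
                except ValueError:
                    pass
        if any(value > threshold for value in percents):
            count += 1
    return count
```

A node at exactly 90% is not counted, matching the "above 90%" wording.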
Identify resource constraints or issues in a cluster.
Tasks:
- Identify High Utilization Nodes for Cluster `CONTEXT`
- Identify Pods Causing High Node Utilization in Cluster `CONTEXT`
Inspects the resources provisioned for a given set of pods and raises issues or recommendations as necessary.
Tasks:
- Show Pods Without Resource Limit or Resource Requests Set in Namespace `NAMESPACE`
- Get Pod Resource Utilization with Top in Namespace `NAMESPACE`
- Identify VPA Pod Resource Recommendations in Namespace `NAMESPACE`
- Identify Resource Constrained Pods In Namespace `NAMESPACE`
Triages issues related to a StatefulSet and its replicas.
Tasks:
- Check Readiness Probe Configuration for StatefulSet `STATEFULSET_NAME`
- Check Liveness Probe Configuration for StatefulSet `STATEFULSET_NAME`
- Troubleshoot StatefulSet Warning Events for `STATEFULSET_NAME`
- Check StatefulSet Event Anomalies for `STATEFULSET_NAME`
- Fetch StatefulSet Logs for `STATEFULSET_NAME`
- Get Related StatefulSet `STATEFULSET_NAME` Events
- Fetch Manifest Details for StatefulSet `STATEFULSET_NAME`
- List StatefulSets with Unhealthy Replica Counts In Namespace `NAMESPACE`
This SLI uses kubectl to score namespace health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, events, and pods not ready.
Tasks:
- Get Event Count and Score
- Get Container Restarts and Score
- Get NotReady Pods
- Generate Namespace Score
This taskset runs general troubleshooting checks against all applicable objects in a namespace. Looks for warning events, odd or frequent normal events, restarting containers and failed or pending pods.
Tasks:
- Inspect Warning Events in Namespace `NAMESPACE`
- Inspect Container Restarts In Namespace `NAMESPACE`
- Inspect Pending Pods In Namespace `NAMESPACE`
- Inspect Failed Pods In Namespace `NAMESPACE`
- Inspect Workload Status Conditions In Namespace `NAMESPACE`
- Get Listing Of Resources In Namespace `NAMESPACE`
- Check Event Anomalies in Namespace `NAMESPACE`
- Check Missing or Risky PodDisruptionBudget Policies in Namespace `NAMESPACE`
- Check Resource Quota Utilization in Namespace `NAMESPACE`
This taskset collects information about storage such as PersistentVolumes and PersistentVolumeClaims to
validate health or help troubleshoot potential storage issues.
Tasks:
- Fetch Events for Unhealthy Kubernetes PersistentVolumeClaims in Namespace `NAMESPACE`
- List PersistentVolumeClaims in Terminating State in Namespace `NAMESPACE`
- List PersistentVolumes in Terminating State in Namespace `NAMESPACE`
- List Pods with Attached Volumes and Related PersistentVolume Details in Namespace `NAMESPACE`
- Fetch the Storage Utilization for PVC Mounts in Namespace `NAMESPACE`
- Check for RWO Persistent Volume Node Attachment Issues in Namespace `NAMESPACE`
Performs a triage on the Open Source version of Artifactory in a Kubernetes cluster.
Tasks:
- Check Artifactory Liveness and Readiness Endpoints
Suspends the flux reconciliation being applied to a given namespace.
Tasks:
- Flux Suspend Namespace NAMESPACE
- Unsuspend Flux for Namespace NAMESPACE
Measures the count of error activity log entries as a SLI metric for the Azure tenancy.
Tasks:
- Run Azure Monitor Activity Log Triage
Triages issues related to Azure load balancers, Kubernetes ingress objects, and services.
Tasks:
- Run Azure Monitor Activity Log Triage
A suite of tasks that can be used to triage potential issues in your vault namespace.
Tasks:
- Fetch Vault CSI Driver Logs
- Get Vault CSI Driver Warning Events
- Check Vault CSI Driver Replicas
- Fetch Vault Logs
- Get Related Vault Events
- Fetch Vault StatefulSet Manifest Details
- Fetch Vault DaemonSet Manifest Details
- Verify Vault Availability
- Check Vault StatefulSet Replicas
Provides a list of tasks that can remediate configuration issues with manifests in GitHub-based GitOps repositories.
Tasks:
- Remediate Readiness and Liveness Probe GitOps Manifests in Namespace `NAMESPACE`
- Increase ResourceQuota for Namespace `NAMESPACE`
- Adjust Pod Resources to Match VPA Recommendation in `NAMESPACE`
- Expand Persistent Volume Claims in Namespace `NAMESPACE`
This taskset collects information and runs general troubleshooting checks against argocd application objects within a namespace.
Tasks:
- Fetch ArgoCD Application Sync Status & Health for `APPLICATION`
- Fetch ArgoCD Application Last Sync Operation Details for `APPLICATION`
- Fetch Unhealthy ArgoCD Application Resources for `APPLICATION`
- Scan For Errors in Pod Logs Related to ArgoCD Application `APPLICATION`
- Fully Describe ArgoCD Application `APPLICATION`
This codebundle runs a series of tasks to identify potential helm release issues related to ArgoCD managed Helm objects.
Tasks:
- Fetch all available ArgoCD Helm releases in namespace `NAMESPACE`
- Fetch Installed ArgoCD Helm release versions in namespace `NAMESPACE`
Triages issues related to ingress objects and services.
Tasks:
- Fetch Ingress Object Health in Namespace `NAMESPACE`
- Check for Ingress and Service Conflicts in Namespace `NAMESPACE`
This taskset performs diagnostic checks on an OpenTelemetry Collector to ensure it's pushing metrics.
Tasks:
- Query Collector Queued Spans in Namespace `NAMESPACE`
- Check OpenTelemetry Collector Logs For Errors In Namespace `NAMESPACE`
- Scan OpenTelemetry Logs For Dropped Spans In Namespace `NAMESPACE`
Measures the number of exception stacktraces present in an application's logs over a time period.
Tasks:
- Measure Application Exceptions
Performs application-level troubleshooting by inspecting the logs of a workload for parsable exceptions,
and attempts to determine next steps.
Tasks:
Triages issues related to a deployment and its replicas.
Tasks:
- Check Deployment Log For Issues with `DEPLOYMENT_NAME`
- Check Liveness Probe Configuration for Deployment `DEPLOYMENT_NAME`
- Check Readiness Probe Configuration for Deployment `DEPLOYMENT_NAME`
- Inspect Container Restarts for Deployment `DEPLOYMENT_NAME` Namespace `NAMESPACE`
- Inspect Deployment Warning Events for `DEPLOYMENT_NAME`
- Get Deployment Workload Details For `DEPLOYMENT_NAME` and Add to Report
- Inspect Deployment Replicas for `DEPLOYMENT_NAME`
- Check Deployment Event Anomalies for `DEPLOYMENT_NAME`
- Check ReplicaSet Health for Deployment `DEPLOYMENT_NAME`
Triages issues related to an Azure load balancer and its activity logs.
Tasks:
- Check Activity Logs for Azure Load Balancer `AZ_LB_NAME`
This codebundle runs a series of tasks to identify potential Kustomization issues related to Flux managed Kustomization objects.
Tasks:
- List all available Kustomization objects in Namespace `NAMESPACE`
- Get details for unready Kustomizations in Namespace `NAMESPACE`
This codebundle runs a series of tasks to identify potential helm release issues related to Flux managed Helm objects.
Tasks:
- List all available FluxCD Helmreleases in Namespace `NAMESPACE`
- Fetch Installed FluxCD Helmrelease Versions in Namespace `NAMESPACE`
- Fetch Mismatched FluxCD HelmRelease Version in Namespace `NAMESPACE`
- Fetch FluxCD HelmRelease Error Messages in Namespace `NAMESPACE`
- Check for Available Helm Chart Updates in Namespace `NAMESPACE`
Measures the number of exception stacktraces present in an application's logs over a time period.
Tasks:
- Tail `${CONTAINER_NAME}` Application Logs For Stacktraces
Performs application-level troubleshooting by inspecting the logs of a workload for parsable exceptions,
and attempts to determine next steps.
Tasks:
- Get `CONTAINER_NAME` Application Logs
- Tail `CONTAINER_NAME` Application Logs For Stacktraces
Runs multiple Kubernetes and psql commands to report on the health of a postgres cluster. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Checks for database lag and backup health.
Tasks:
- Fetch Patroni Database Lag
- Check Database Backup Status for Cluster `${OBJECT_NAME}` in Namespace `${NAMESPACE}`
- Generate Namespace Score
Runs a series of tasks to check the overall health of a postgres cluster and to provide detailed information useful for debugging or reviewing configurations.
Tasks:
- List Resources Related to Postgres Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
- Get Postgres Pod Logs & Events for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
- Get Postgres Pod Resource Utilization for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
- Get Running Postgres Configuration for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
- Get Patroni Output and Add to Report for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
- Fetch Patroni Database Lag for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
- Check Database Backup Status for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
- Run DB Queries for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
This taskset provides tasks to troubleshoot service accounts in a Kubernetes namespace.
Tasks:
- Test Service Account Access to Kubernetes API Server in Namespace `NAMESPACE`
Measures failing reconciliations for fluxcd.
Tasks:
- Health Check Flux Reconciliation
Generates a report of the reconciliation errors for fluxcd in your cluster.
Tasks:
- Health Check Flux Reconciliation
Provides chaos injection tasks for Kubernetes namespaces. These are destructive tasks and the expectation is that you can heal these changes by enabling your GitOps reconciliation.
Tasks:
- Kill Random Pods In Namespace `NAMESPACE`
- OOMKill Pods In Namespace `NAMESPACE`
- Mangle Service Selector In Namespace `NAMESPACE`
- Mangle Service Port In Namespace `NAMESPACE`
- Fill Random Pod Tmp Directory In Namespace `NAMESPACE`
Counts the number of unhealthy cert-manager managed certificates in a namespace.
Tasks:
- Count Unready and Expired Certificates
Checks the overall health of certificates in a namespace that are managed by cert-manager.
Tasks:
- Get Namespace Certificate Summary for Namespace `NAMESPACE`
- Find Unhealthy Certificates in Namespace `NAMESPACE`
- Find Failed Certificate Requests and Identify Issues for Namespace `NAMESPACE`
This taskset queries the Jaeger API directly for trace details and parses the results.
Tasks:
- Query Traces in Jaeger for Unhealthy HTTP Response Codes in Namespace `NAMESPACE`
Provides chaos injection tasks for specific workloads like your apps in a Kubernetes namespace. These are destructive tasks and the expectation is that you can heal these changes by enabling your GitOps reconciliation.
Tasks:
- Test `WORKLOAD_NAME` High Availability
- OOMKill `WORKLOAD_NAME` Pod
- Mangle Service Selector For `WORKLOAD_NAME`
- Mangle Service Port For `WORKLOAD_NAME`
- Fill Tmp Directory Of Pod From `WORKLOAD_NAME`
This taskset restarts a resource with a given set of labels, typically used with other tasksets.
Tasks:
- Get Current Resource State with Labels `LABELS`
- Get Resource Logs with Labels `LABELS`
- Restart Resource with Labels `LABELS`
Runs an ad-hoc user-provided command. If the command writes a non-empty string to stdout, a health score of 0 (unhealthy) is pushed; if there is no output, indicating no issues, a 1 is pushed.
User commands should filter out expected/healthy content (e.g. with grep) and output only the errors found.
Tasks:
- ${TASK_TITLE}
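The stdout-to-score rule described for this codebundle can be sketched in Python as follows. The function name `score_from_command` is hypothetical, and treating whitespace-only output as empty is an assumption; the codebundle itself may compare the raw string.

```python
import subprocess
import sys


def score_from_command(command: list) -> int:
    """Run a command and map its stdout to a health score.

    Returns 0 (unhealthy) when the command prints anything to stdout,
    and 1 (healthy) when stdout is empty. Assumption: whitespace-only
    output counts as empty.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return 0 if result.stdout.strip() else 1


if __name__ == "__main__":
    # Stand-in for a user command; prints nothing, so the score is 1.
    print(score_from_command([sys.executable, "-c", "pass"]))
```

In practice the command would be a kubectl pipeline whose filter step leaves only error lines, so that a healthy system produces no output.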
Runs an ad-hoc user-provided command. If the command writes a non-empty string to stdout, an issue is generated with a configurable title and content.
User commands should filter out expected/healthy content (e.g. with grep) and output only the errors found.
Tasks:
- TASK_TITLE
This taskset runs a user-provided kubectl command and pushes the metric. The supplied command must produce a single, distinct metric. Command-line tools like jq are available.
Tasks:
- ${TASK_TITLE}
This taskset runs a user-provided kubectl command and adds the output to the report. Command-line tools like jq are available.
Tasks:
- TASK_TITLE