AKS
A taskset for troubleshooting general issues associated with typical Kubernetes deployment resources.
Supports API interactions via both the API client and the kubectl binary through RunWhen Shell Services.
Tasks:
- Troubleshoot Resourcing
- Troubleshoot Events
- Troubleshoot PVC
- Troubleshoot Pods
This codebundle runs an arbitrary kubectl command and writes the stdout to a report.
Typically used in conjunction with other codebundles.
Tasks:
- Running Kubectl And Adding Stdout To Report
Check the health of a Kubernetes API server using kubectl.
Returns 1 when the API server is healthy, or 0 when it is unhealthy.
Tasks:
- Running Kubectl Check Against API Server
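The 1-or-0 result can be pictured as a thin wrapper around a kubectl health query. The following is a minimal sketch under stated assumptions, not the codebundle's implementation: it assumes the check reads the API server's `/readyz` endpoint and treats a zero exit code with an `ok` body as healthy. The function names `score_api_server` and `check_api_server` are hypothetical.

```python
import subprocess


def score_api_server(returncode: int, stdout: str) -> int:
    """Map a kubectl health probe result to 1 (healthy) or 0 (unhealthy)."""
    return 1 if returncode == 0 and stdout.strip() == "ok" else 0


def check_api_server(kubectl: str = "kubectl") -> int:
    """Query the readiness endpoint; any failure to run kubectl scores 0."""
    try:
        proc = subprocess.run(
            [kubectl, "get", "--raw=/readyz"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, permission problem, or a hung request.
        return 0
    return score_api_server(proc.returncode, proc.stdout)
```

Keeping the scoring separate from the subprocess call makes the mapping easy to test without a cluster.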
This codebundle runs a kubectl get command that produces a value and pushes the metric.
Uses JMESPath for filtering and supports calculations such as count, sum, and avg on specified fields.
Tasks:
- Running Kubectl get and push the metric
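To illustrate the count, sum, and avg calculations, here is a rough sketch that aggregates one field across the items returned by `kubectl get ... -o json`. It is an assumption about the general shape of the calculation, not the codebundle's code: it substitutes a simple dotted-path lookup for JMESPath so the example needs only the standard library, and the names `extract` and `compute_metric` are hypothetical.

```python
import json
from statistics import mean


def extract(obj, path):
    """Follow a dotted path such as 'status.readyReplicas'; return None if missing."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def compute_metric(kubectl_json: str, path: str, calculation: str) -> float:
    """Aggregate one numeric field across all items in a kubectl JSON list."""
    items = json.loads(kubectl_json).get("items", [])
    values = [v for v in (extract(item, path) for item in items) if v is not None]
    if calculation == "count":
        return float(len(values))
    if calculation == "sum":
        return float(sum(values))
    if calculation == "avg":
        return float(mean(values)) if values else 0.0
    raise ValueError(f"unsupported calculation: {calculation}")


# Hand-written sample standing in for `kubectl get deployments -o json` output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "web"}, "status": {"readyReplicas": 3}},
        {"metadata": {"name": "api"}, "status": {"readyReplicas": 1}},
        {"metadata": {"name": "idle"}, "status": {}},
    ]
})
```

Items that lack the field are skipped, so `count` reports how many objects actually carry a value.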
Runs multiple Kubernetes and psql commands to report on the health of a postgres cluster.
Tasks:
- Get Standard Resources
- Describe Custom Resources
- Get Pod Logs & Events
- Get Pod Resource Utilization
- Get Running Configuration
- Get Patroni Output
- Run DB Queries
Creates an ad hoc one-shot job that mounts a PVC as a canary test; the job is polled for success before being torn down.
Tasks:
- Run Canary Job
Retrieves aggregate data via the kubectl top command.
Tasks:
- Running Kubectl Top And Extracting Metric Data
This taskset runs general troubleshooting checks against all applicable objects in a namespace, checks error events, and searches pod logs for error entries.
Tasks:
- Trace Namespace Errors
- Fetch Unready Pods
- Triage Namespace
- Object Condition Check
- Namespace Get All
This SLI uses kubectl to score namespace health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, events, and pods not ready.
Tasks:
- Get Event Count and Score
- Get Container Restarts and Score
- Get NotReady Pods
- Generate Namespace Score
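One way a score between 0 and 1 could be assembled from these checks is sketched below. The aggregation rule is an assumption, since the listing does not say how the sub-scores are combined: each check yields 1 when its observed count is at or below a threshold, and the namespace score is the plain mean of the three. The function names and default thresholds are hypothetical.

```python
def check_score(observed: int, threshold: int = 0) -> int:
    """Return 1 if the observed count is within the threshold, else 0."""
    return 1 if observed <= threshold else 0


def namespace_score(
    event_count: int,
    container_restarts: int,
    not_ready_pods: int,
    event_threshold: int = 0,
    restart_threshold: int = 0,
    not_ready_threshold: int = 0,
) -> float:
    """Average three binary check results into a value between 0 and 1."""
    scores = [
        check_score(event_count, event_threshold),
        check_score(container_restarts, restart_threshold),
        check_score(not_ready_pods, not_ready_threshold),
    ]
    return sum(scores) / len(scores)
```

With this rule, a namespace that fails only one of the three checks scores two thirds rather than dropping straight to 0.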
Runs a postgres SQL query and pushes the returned result into a report.
During execution, the SQL query should be passed to a Kubernetes workload that has access to the psql binary.
The workload will run the query and return the results from stdout.
Tasks:
- Run Postgres Query And Results to Report
Triages issues related to a deployment's replicas.
Tasks:
- Fetch Logs
- Get Related Events
- Check Deployment Replicas
Checks that the current state of a daemonset is healthy and returns a score of either 1 (healthy) or 0 (unhealthy).
Tasks:
- Health Check Daemonset
Searches a namespace for matching objects and provides the commands to decommission them.
Tasks:
- Generate Decommission Commands
Uses kubectl (or equivalent) to query the state of a Patroni cluster and determine whether it is healthy.
Tasks:
- Determine Patroni Health
Detects and reinitializes laggy Patroni cluster members that are unable to catch up in replication, using kubectl and patronictl.
Tasks:
- Determine Patroni Health
Measures the maximum replica lag across a Patroni cluster.
Tasks:
- Measure Patroni Member Lag
Taskset to triage issues related to Patroni.
Tasks:
- Get Patroni Status
- Get Pods Status
- Fetch Logs
A taskset for troubleshooting issues for StatefulSets and their related resources.
Tasks:
- Check StatefulSets Replicas Ready
- Get Events For The StatefulSet
- Get StatefulSet Logs
- Get StatefulSet Manifests Dump
Returns the number of events with matching messages as an SLI metric.
Tasks:
- Get Number Of Matching Events
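The matching-event count can be sketched as a filter over the `message` field of each event. This is an illustrative assumption about the approach, not the codebundle's code: it assumes events are read as JSON (for example from `kubectl get events -o json`) and matched with a regular expression. The name `count_matching_events` is hypothetical.

```python
import json
import re


def count_matching_events(events_json: str, pattern: str) -> int:
    """Count events whose message matches the given regular expression."""
    regex = re.compile(pattern)
    items = json.loads(events_json).get("items", [])
    return sum(1 for event in items if regex.search(event.get("message") or ""))


# Hand-written sample standing in for `kubectl get events -o json` output.
SAMPLE_EVENTS = json.dumps({
    "items": [
        {"type": "Warning", "message": "Back-off restarting failed container"},
        {"type": "Normal", "message": "Pulled image successfully"},
        {"type": "Warning", "message": "Back-off pulling image"},
        {"type": "Normal"},
    ]
})
```

Events without a message are treated as non-matching rather than raising an error.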
This codebundle runs a series of tasks to identify potential Kustomization issues related to Flux managed Kustomization objects.
Tasks:
- List all available Kustomization objects in Namespace `NAMESPACE`
- Get details for unready Kustomizations in Namespace `NAMESPACE`
This taskset collects information and runs general troubleshooting checks against ArgoCD Application objects within a namespace.
Tasks:
- Fetch ArgoCD Application Sync Status & Health for `APPLICATION`
- Fetch ArgoCD Application Last Sync Operation Details for `APPLICATION`
- Fetch Unhealthy ArgoCD Application Resources for `APPLICATION`
- Scan For Errors in Pod Logs Related to ArgoCD Application `APPLICATION`
- Fully Describe ArgoCD Application `APPLICATION`
This taskset provides detailed information about the images used in a Kubernetes namespace.
Tasks:
- Check Image Rollover Times for Namespace `NAMESPACE`
- List Images and Tags for Every Container in Running Pods for Namespace `NAMESPACE`
- List Images and Tags for Every Container in Failed Pods for Namespace `NAMESPACE`
- List ImagePullBackOff Events and Test Path and Tags for Namespace `NAMESPACE`
Performs a triage on the Open Source version of Artifactory in a Kubernetes cluster.
Tasks:
- Check Artifactory Liveness and Readiness Endpoints
Inspects the resources provisioned for a given set of pods and raises issues or recommendations as necessary.
Tasks:
- Show Pods Without Resource Limit or Resource Requests Set in Namespace `NAMESPACE`
- Get Pod Resource Utilization with Top in Namespace `NAMESPACE`
- Identify VPA Pod Resource Recommendations in Namespace `NAMESPACE`
- Identify Resource Constrained Pods In Namespace `NAMESPACE`
A suite of tasks that can be used to triage potential issues in your vault namespace.
Tasks:
- Fetch Vault CSI Driver Logs
- Get Vault CSI Driver Warning Events
- Check Vault CSI Driver Replicas
- Fetch Vault Logs
- Get Related Vault Events
- Fetch Vault StatefulSet Manifest Details
- Fetch Vault DaemonSet Manifest Details
- Verify Vault Availability
- Check Vault StatefulSet Replicas
Triages issues related to a deployment and its replicas.
Tasks:
- Check Deployment Log For Issues with `DEPLOYMENT_NAME`
- Check Liveness Probe Configuration for Deployment `DEPLOYMENT_NAME`
- Check Readiness Probe Configuration for Deployment `DEPLOYMENT_NAME`
- Inspect Container Restarts for Deployment `DEPLOYMENT_NAME` Namespace `NAMESPACE`
- Inspect Deployment Warning Events for `DEPLOYMENT_NAME`
- Get Deployment Workload Details For `DEPLOYMENT_NAME` and Add to Report
- Inspect Deployment Replicas for `DEPLOYMENT_NAME`
- Check Deployment Event Anomalies for `DEPLOYMENT_NAME`
- Check ReplicaSet Health for Deployment `DEPLOYMENT_NAME`
Runs a series of tasks to check the overall health of a postgres cluster and to provide detailed information useful for debugging or reviewing configurations.
Tasks:
- List Resources Related to Postgres Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
- Get Postgres Pod Logs & Events for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
- Get Postgres Pod Resource Utilization for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
- Get Running Postgres Configuration for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
- Get Patroni Output and Add to Report for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
- Fetch Patroni Database Lag for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
- Check Database Backup Status for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
- Run DB Queries for Cluster `OBJECT_NAME` in Namespace `NAMESPACE`
Runs multiple Kubernetes and psql commands to report on the health of a postgres cluster. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Checks for database lag & backup health.
Tasks:
- Fetch Patroni Database Lag
- Check Database Backup Status for Cluster `${OBJECT_NAME}` in Namespace `${NAMESPACE}`
- Generate Namespace Score
This taskset performs diagnostic checks on an OpenTelemetry Collector to ensure it's pushing metrics.
Tasks:
- Query Collector Queued Spans in Namespace `NAMESPACE`
- Check OpenTelemetry Collector Logs For Errors In Namespace `NAMESPACE`
- Scan OpenTelemetry Logs For Dropped Spans In Namespace `NAMESPACE`
Triages issues related to ingress objects and services.
Tasks:
- Fetch Ingress Object Health in Namespace `NAMESPACE`
- Check for Ingress and Service Conflicts in Namespace `NAMESPACE`
This taskset is used to suspend a flux resource for the purposes of executing chaos tasks.
Tasks:
- Suspend the Flux Resource Reconciliation
- Find Random FluxCD Workload as Chaos Target
- Execute Chaos Command
- Execute Additional Chaos Command
- Resume Flux Resource Reconciliation
This taskset queries the Jaeger API directly for trace details and parses the results.
Tasks:
- Query Traces in Jaeger for Unhealthy HTTP Response Codes in Namespace `NAMESPACE`
Performs application-level troubleshooting by inspecting the logs of a workload for parsable exceptions,
and attempts to determine next steps.
Tasks:
Measures the number of exception stacktraces present in an application's logs over a time period.
Tasks:
- Measure Application Exceptions
This taskset collects information about storage such as PersistentVolumes and PersistentVolumeClaims to
validate health or help troubleshoot potential storage issues.
Tasks:
- Fetch Events for Unhealthy Kubernetes PersistentVolumeClaims in Namespace `NAMESPACE`
- List PersistentVolumeClaims in Terminating State in Namespace `NAMESPACE`
- List PersistentVolumes in Terminating State in Namespace `NAMESPACE`
- List Pods with Attached Volumes and Related PersistentVolume Details in Namespace `NAMESPACE`
- Fetch the Storage Utilization for PVC Mounts in Namespace `NAMESPACE`
- Check for RWO Persistent Volume Node Attachment Issues in Namespace `NAMESPACE`
This SLI collects information about storage such as PersistentVolumes and PersistentVolumeClaims and generates an aggregated health score for the namespace: 1 = healthy, 0 = failed, and a value between 0 and 1 = degraded.
Tasks:
- Fetch the Storage Utilization for PVC Mounts in Namespace `${NAMESPACE}`
- Generate Namespace Score
This taskset provides tasks to troubleshoot service accounts in a Kubernetes namespace.
Tasks:
- Test Service Account Access to Kubernetes API Server in Namespace `NAMESPACE`
This codebundle runs a series of tasks to identify potential helm release issues related to Flux managed Helm objects.
Tasks:
- List all available FluxCD Helmreleases in Namespace `NAMESPACE`
- Fetch Installed FluxCD Helmrelease Versions in Namespace `NAMESPACE`
- Fetch Mismatched FluxCD HelmRelease Version in Namespace `NAMESPACE`
- Fetch FluxCD HelmRelease Error Messages in Namespace `NAMESPACE`
- Check for Available Helm Chart Updates in Namespace `NAMESPACE`
This codebundle fetches the number of running pods with the set of provided labels, letting you measure the number of running pods.
Tasks:
- Measure Number of Running Pods with Label
Triages issues related to Azure Load Balancers and their activity logs.
Tasks:
- Check Activity Logs for Azure Load Balancer `AZ_LB_NAME`
Identify resource constraints or issues in a cluster.
Tasks:
- Identify High Utilization Nodes for Cluster `CONTEXT`
- Identify Pods Causing High Node Utilization in Cluster `CONTEXT`
Counts the number of nodes above 90% CPU or Memory Utilization from kubectl top.
Tasks:
- Identify High Utilization Nodes for Cluster `${CONTEXT}`
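Counting nodes above the 90% mark can be sketched by parsing the tabular text that `kubectl top nodes` prints. This is a minimal sketch, assuming the usual five-column layout (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%) and a strict greater-than comparison. The function name and the sample output are hypothetical.

```python
def count_high_utilization_nodes(top_output: str, threshold: int = 90) -> int:
    """Count nodes whose CPU% or MEMORY% is strictly above the threshold."""
    count = 0
    # Skip the header row, then read the percentage columns of each node.
    for line in top_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 5:
            continue
        try:
            cpu = int(fields[2].rstrip("%"))
            mem = int(fields[4].rstrip("%"))
        except ValueError:
            # Rows without numeric percentages are skipped.
            continue
        if cpu > threshold or mem > threshold:
            count += 1
    return count


# Hand-written sample in the shape of `kubectl top nodes` output.
SAMPLE_TOP = """\
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-a   1900m        95%    4096Mi          52%
node-b   250m         12%    7300Mi          93%
node-c   300m         15%    2048Mi          26%
"""
```

A node counts once even when both its CPU and memory exceed the threshold.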
Performs application-level troubleshooting by inspecting the logs of a workload for parsable exceptions,
and attempts to determine next steps.
Tasks:
- Get `CONTAINER_NAME` Application Logs
- Tail `CONTAINER_NAME` Application Logs For Stacktraces
Measures the number of exception stacktraces present in an application's logs over a time period.
Tasks:
- Tail `${CONTAINER_NAME}` Application Logs For Stacktraces
This taskset investigates the logs, state and health of Kubernetes Prometheus operator.
Tasks:
Runs diagnostic checks against an AKS cluster.
Tasks:
- Check for Resource Health Issues Affecting AKS Cluster `AKS_CLUSTER` In Resource Group `AZ_RESOURCE_GROUP`
- Check Configuration Health of AKS Cluster `AKS_CLUSTER` In Resource Group `AZ_RESOURCE_GROUP`
- Check Network Configuration of AKS Cluster `AKS_CLUSTER` In Resource Group `AZ_RESOURCE_GROUP`
- Fetch Activities for AKS Cluster `AKS_CLUSTER` In Resource Group `AZ_RESOURCE_GROUP`
Generates a composite score for the health of an AKS cluster using the AZ CLI. Returns 1 if all checks pass, 0 if they all fail, and a value between 0 and 1 for partial success. Checks the upstream service for reported errors. Looks for Critical or Error activities within a specified time period. Checks the overall configuration for provisioning failures.
Tasks:
- Check for Resource Health Issues Affecting AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}`
- Fetch Activities for AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}`
- Check Configuration Health of AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}`
- Generate AKS Cluster Health Score
Triages issues related to a StatefulSet and its replicas.
Tasks:
- Check Readiness Probe Configuration for StatefulSet `STATEFULSET_NAME`
- Check Liveness Probe Configuration for StatefulSet `STATEFULSET_NAME`
- Troubleshoot StatefulSet Warning Events for `STATEFULSET_NAME`
- Check StatefulSet Event Anomalies for `STATEFULSET_NAME`
- Fetch StatefulSet Logs for `STATEFULSET_NAME`
- Get Related StatefulSet `STATEFULSET_NAME` Events
- Fetch Manifest Details for StatefulSet `STATEFULSET_NAME`
- List StatefulSets with Unhealthy Replica Counts In Namespace `NAMESPACE`
Checks the overall health of certificates in a namespace that are managed by cert-manager.
Tasks:
- Get Namespace Certificate Summary for Namespace `NAMESPACE`
- Find Unhealthy Certificates in Namespace `NAMESPACE`
- Find Failed Certificate Requests and Identify Issues for Namespace `NAMESPACE`
Counts the number of unhealthy cert-manager managed certificates in a namespace.
Tasks:
- Count Unready and Expired Certificates
This taskset restarts a resource with a given set of labels, typically used with other tasksets.
Tasks:
- Get Current Resource State with Labels `LABELS`
- Get Resource Logs with Labels `LABELS`
- Restart Resource with Labels `LABELS`
Evaluate cluster node health using kubectl.
Tasks:
- Check for Node Restarts in Cluster `CONTEXT`
Evaluate cluster node health using kubectl.
Tasks:
- Check for Node Restarts in Cluster `${CONTEXT}`
- Generate Namespace Score
Provides a list of tasks that can remediate configuration issues with manifests in GitHub-based GitOps repositories.
Tasks:
- Remediate Readiness and Liveness Probe GitOps Manifests in Namespace `NAMESPACE`
- Increase ResourceQuota for Namespace `NAMESPACE`
- Adjust Pod Resources to Match VPA Recommendation in `NAMESPACE`
- Expand Persistent Volume Claims in Namespace `NAMESPACE`
This taskset runs general troubleshooting checks against all applicable objects in a namespace. Looks for warning events, odd or frequent normal events, restarting containers and failed or pending pods.
Tasks:
- Inspect Warning Events in Namespace `NAMESPACE`
- Inspect Container Restarts In Namespace `NAMESPACE`
- Inspect Pending Pods In Namespace `NAMESPACE`
- Inspect Failed Pods In Namespace `NAMESPACE`
- Inspect Workload Status Conditions In Namespace `NAMESPACE`
- Get Listing Of Resources In Namespace `NAMESPACE`
- Check Event Anomalies in Namespace `NAMESPACE`
- Check Missing or Risky PodDisruptionBudget Policies in Namespace `NAMESPACE`
- Check Resource Quota Utilization in Namespace `NAMESPACE`
This SLI uses kubectl to score namespace health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, events, and pods not ready.
Tasks:
- Get Event Count and Score
- Get Container Restarts and Score
- Get NotReady Pods
- Generate Namespace Score
This codebundle runs a series of tasks to identify potential helm release issues related to ArgoCD managed Helm objects.
Tasks:
- Fetch all available ArgoCD Helm releases in namespace `NAMESPACE`
- Fetch Installed ArgoCD Helm release versions in namespace `NAMESPACE`