SLI

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-generic-codecollection


Runs an ad-hoc, user-provided command. If the command writes a non-empty string to stdout, a health score of 0 (unhealthy) is pushed; if there is no output, indicating no issues, a 1 is pushed. User commands should filter out expected or healthy content (e.g., with grep) and output only the errors found.

Tasks:
  • ${TASK_TITLE}
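
A minimal sketch of the scoring rule described above, assuming the command's stdout has already been captured as a string. The function name `stdout_health_score` is hypothetical, and treating whitespace-only output as "no output" is an assumption rather than documented behavior of the codebundle.

```python
def stdout_health_score(stdout: str) -> int:
    """Return 1 (healthy) when the command printed nothing, else 0 (unhealthy)."""
    # Assumption: output made up only of whitespace counts as "no output".
    return 1 if stdout.strip() == "" else 0
```

With this rule, a command such as a grep for error lines yields a score of 1 when nothing matches and 0 as soon as any line is printed.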

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-generic-codecollection


This SLI runs a user-provided cURL command and can push the result as a metric. Optional headers and post-processing commands are supported. If HEADERS is provided, the file is appended to the cURL command using -K. If POST_PROCESS is provided, it is appended as a pipe (|) to further process the output (e.g., jq).

Tasks:
  • ${TASK_TITLE}
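
The way the optional pieces attach to the command can be sketched roughly as follows. This is an illustration only, not the codebundle's implementation: the function name `build_curl_command` and its parameter names are hypothetical, and the example URL and file path are placeholders.

```python
import shlex
from typing import Optional


def build_curl_command(curl_command: str,
                       headers_file: Optional[str] = None,
                       post_process: Optional[str] = None) -> str:
    """Assemble one shell command line from the user-supplied pieces."""
    command = curl_command.strip()
    if headers_file:
        # curl's -K (--config) option reads extra options, such as headers, from a file.
        command += f" -K {shlex.quote(headers_file)}"
    if post_process:
        # The post-processing step receives curl's stdout through a pipe.
        command += f" | {post_process.strip()}"
    return command
```

For example, `build_curl_command("curl -s https://example.com/health", "/tmp/headers.cfg", "jq -r .status")` produces `curl -s https://example.com/health -K /tmp/headers.cfg | jq -r .status`.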

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-generic-codecollection


This taskset runs a user-provided kubectl command and pushes the result as a metric. The supplied command must produce a single, distinct metric. Command-line tools like jq are available.

Tasks:
  • ${TASK_TITLE}
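
One way to read the "single metric" requirement is that the command's output must reduce to exactly one value. The sketch below checks this under the assumption that the metric is numeric; the helper name `parse_single_metric` is hypothetical.

```python
def parse_single_metric(stdout: str) -> float:
    """Parse command output that is expected to contain exactly one numeric value."""
    # Ignore blank lines so a trailing newline does not count as a second value.
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    if len(lines) != 1:
        raise ValueError(f"expected exactly one value, got {len(lines)} lines")
    # float() raises ValueError when the remaining line is not a number.
    return float(lines[0])
```

Output such as `3` followed by a newline parses to `3.0`, while multi-line or non-numeric output raises an error instead of being pushed.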

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-generic-codecollection


Runs an ad-hoc, user-provided command. If the command writes a non-empty string to stdout, a health score of 0 (unhealthy) is pushed; if there is no output, indicating no issues, a 1 is pushed. User commands should filter out expected or healthy content (e.g., with grep) and output only the errors found.

Tasks:
  • ${TASK_TITLE}

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-generic-codecollection


Runs an ad-hoc, user-provided command and pushes the command's stdout as the health metric. If no output is produced, the resulting metric is empty; if the command produces output, that exact text is used as the metric. User commands should produce the desired health metric or numeric value themselves, for example by outputting "0" if unhealthy or "1" if healthy.

Tasks:
  • ${TASK_TITLE}

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-generic-codecollection


This SLI runs a user-provided Azure CLI command and pushes the result as a metric. The supplied command must produce a single, distinct metric. Command-line tools like jq are available.

Tasks:
  • ${TASK_TITLE}

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-generic-codecollection


This SLI runs a user-provided script/command with up to 10 configurable environment variables from secrets and pushes the result as a metric. Each environment variable is configured as an individual secret for maximum security and clarity. The supplied command must result in a distinct single metric.

Tasks:
  • ${TASK_TITLE}

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-generic-codecollection


Runs an ad-hoc, user-provided command. If the command writes a non-empty string to stdout, a health score of 0 (unhealthy) is pushed; if there is no output, indicating no issues, a 1 is pushed. User commands should filter out expected or healthy content (e.g., with grep) and output only the errors found.

Tasks:
  • ${TASK_TITLE}

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-generic-codecollection


This SLI runs a user-provided AWS CLI command and pushes the result as a metric. The supplied command must produce a single, distinct metric. Command-line tools like jq are available.

Tasks:
  • ${TASK_TITLE}

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-generic-codecollection


Runs a user-provided gcloud command and pushes the result as a metric to the RunWhen Platform. The supplied command must produce a single, distinct metric. Command-line tools like jq are available.

Tasks:
  • ${TASK_TITLE}

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-generic-codecollection


This SLI runs a user-provided script/command with flexible environment variable support and pushes the result as a metric. Supports SSH keys, git credentials, and arbitrary environment variables from secrets. The supplied command must result in a distinct single metric.

Tasks:
  • ${TASK_TITLE}

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-generic-codecollection


Runs an ad-hoc, user-provided command. If the command writes a non-empty string to stdout, a health score of 0 (unhealthy) is pushed; if there is no output, indicating no issues, a 1 is pushed. User commands should filter out expected or healthy content (e.g., with grep) and output only the errors found.

Tasks:
  • ${TASK_TITLE}

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-generic-codecollection


This SLI runs a user-provided curl command and can push the result as a metric. Command-line tools like jq are available. Accepts HEADERS as a secret.

Tasks:
  • ${TASK_TITLE}

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-generic-codecollection


Runs an ad-hoc, user-provided command. If the command writes a non-empty string to stdout, a health score of 0 (unhealthy) is pushed; if there is no output, indicating no issues, a 1 is pushed. User commands should filter out expected or healthy content (e.g., with grep) and output only the errors found.

Tasks:
  • ${TASK_TITLE}

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: aws-c7n-codecollection


Counts the number of S3 buckets in an Account that are insecure or unhealthy.

Tasks:
  • Count S3 Buckets With Public Access in AWS Account `${AWS_ACCOUNT_NAME}`

Icon 1 4 Troubleshooting Commands

Icon 2 Contributed by saurabh3460

Icon 2 Codecollection: aws-c7n-codecollection


Counts the number of EBS resources by identifying unattached volumes, unused and aged snapshots, and unencrypted volumes.

Tasks:
  • Check Unattached EBS Volumes in `${AWS_REGION}`
  • Check Unencrypted EBS Volumes in `${AWS_REGION}`
  • Check Unused EBS Snapshots in `${AWS_REGION}`
  • Generate EBS Score

Icon 1 4 Troubleshooting Commands

Icon 2 Contributed by saurabh3460

Icon 2 Codecollection: aws-c7n-codecollection


Count the number of EC2 instances that are stale or stopped

Tasks:
  • Check for stale AWS EC2 instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
  • Check for stopped AWS EC2 instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
  • Check for invalid AWS Auto Scaling Groups in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
  • Generate Health Score

Icon 1 5 Troubleshooting Commands

Icon 2 Contributed by saurabh3460

Icon 2 Codecollection: aws-c7n-codecollection


Count publicly accessible security groups, unused EIPs, unused ELBs, and VPCs with flow logs disabled

Tasks:
  • Check for publicly accessible security groups in AWS account `${AWS_ACCOUNT_ID}`
  • Check for unused Elastic IPs in AWS account `${AWS_ACCOUNT_ID}`
  • Check for unused ELBs in AWS account `${AWS_ACCOUNT_ID}`
  • Check for VPCs with Flow Logs disabled in AWS account `${AWS_ACCOUNT_ID}`
  • Generate Health Score

Icon 1 4 Troubleshooting Commands

Icon 2 Contributed by saurabh3460

Icon 2 Codecollection: aws-c7n-codecollection


Check AWS RDS instances that are unencrypted, publicly accessible, or have backups disabled.

Tasks:
  • Check for unencrypted RDS instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
  • Check for publicly accessible RDS instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
  • Check for disabled backup RDS instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
  • Generate Health Score

Icon 1 6 Troubleshooting Commands

Icon 2 Contributed by saurabh3460

Icon 2 Codecollection: aws-c7n-codecollection


Count AWS ACM certificates that are unused, expiring, expired, in a failed status, or pending validation.

Tasks:
  • Check for unused ACM certificates in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
  • Check for Expiring ACM certificates in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
  • Check for expired ACM certificates in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
  • Check for Failed Status ACM Certificates in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`
  • Check for Pending Validation ACM Certificates in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`
  • Generate Health Score

Icon 1 4 Troubleshooting Commands

Icon 2 Contributed by saurabh3460

Icon 2 Codecollection: aws-c7n-codecollection


Check AWS Monitoring Configuration Health

Tasks:
  • Check CloudWatch Log Groups Without Retention Period in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
  • Check if CloudTrail exists and is configured for multi-region in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`
  • Check CloudTrail Without CloudWatch Logs in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`
  • Generate Health Score

Icon 1 8 Troubleshooting Commands

Icon 2 Contributed by saurabh3460

Icon 2 Codecollection: azure-c7n-codecollection


Check Azure storage health by identifying unused disks, snapshots, and storage accounts

Tasks:
  • Count Azure Storage Accounts with Health Status of `Available` in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Unused Disks in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Unused Snapshots in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Unused Storage Accounts in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Storage Containers with Public Access in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Storage Account Misconfigurations in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Storage Account Changes with Critical/High Security Risk in resource group `${AZURE_RESOURCE_GROUP}`
  • Generate Health Score

Icon 1 11 Troubleshooting Commands

Icon 2 Contributed by saurabh3460

Icon 2 Codecollection: azure-c7n-codecollection


Count Azure virtual machines that are publicly accessible, stopped, or showing high or underutilized CPU or memory usage, along with unused network interfaces and unused public IPs.

Tasks:
  • Check Azure VM Health in resource group `${AZURE_RESOURCE_GROUP}`
  • Check for VMs With Public IP in resource group `${AZURE_RESOURCE_GROUP}`
  • Check for Stopped VMs in resource group `${AZURE_RESOURCE_GROUP}`
  • Check for VMs With High CPU Usage in resource group `${AZURE_RESOURCE_GROUP}`
  • Check for Underutilized VMs Based on CPU Usage in resource group `${AZURE_RESOURCE_GROUP}`
  • Check for VMs With High Memory Usage in resource group `${AZURE_RESOURCE_GROUP}`
  • Check for Underutilized VMs Based on Memory Usage in resource group `${AZURE_RESOURCE_GROUP}`
  • Check for Unused Network Interfaces in resource group `${AZURE_RESOURCE_GROUP}`
  • Check for Unused Public IPs in resource group `${AZURE_RESOURCE_GROUP}`
  • Check VMs Agent Status in resource group `${AZURE_RESOURCE_GROUP}`
  • Generate Health Score

Icon 1 10 Troubleshooting Commands

Icon 2 Contributed by saurabh3460

Icon 2 Codecollection: azure-c7n-codecollection


Count databases that are publicly accessible, without replication, without high availability configuration, with high CPU usage, high memory usage, high cache miss rate, low availability, and risky configuration changes in Azure

Tasks:
  • Score Database Availability in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Publicly Accessible Databases in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Databases Without Replication in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Databases Without High Availability in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Databases With High CPU Usage in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Databases With High Memory Usage in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Redis Caches With High Cache Miss Rate in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Databases With Health Issues in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Risky Database Configuration Changes in resource group `${AZURE_RESOURCE_GROUP}`
  • Generate Health Score

Icon 1 3 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-workspace-utils


Determines if any RunWhen CodeCollection or private runner components require image updates.

Tasks:
  • Check for CodeCollection Updates against ACR Registry`${REGISTRY_NAME}`
  • Check for RunWhen Local Image Updates against ACR Registry`${REGISTRY_NAME}`
  • Count Images Needing Update and Push Metric

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-workspace-utils


Determines if any RunWhen Local images have available updates in the private Azure Container Registry service.

Tasks:
  • Check for Available RunWhen Helm Images in ACR Registry`${REGISTRY_NAME}`

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Vui Le

Icon 2 Codecollection: rw-public-codecollection


Check GitLab latency by getting a list of repo names.

Tasks:
  • Check GitLab Latency With Get Repos

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


A gRPC curl SLI for querying and extracting data from a generic grpcurl call.

Tasks:
  • Run gRPCurl Command and Push Metric

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Checks an Artifactory instance health endpoint to determine its operational status. The response is parsed to determine if the service is healthy, resulting in a metric of 1 if it is, or 0 if not.

Tasks:
  • Check If Artifactory Endpoint Is Healthy

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Run a PromQL query against Prometheus instant query API, perform a provided transform, and return the result.

Tasks:
  • Querying Prometheus Instance And Pushing Aggregated Data
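
A rough sketch of the "perform a provided transform" step, assuming the response follows the standard Prometheus instant-query JSON shape (`data.result[*].value` holding a `[timestamp, "value"]` pair). The transform names and the function `transform_instant_vector` are hypothetical; the codebundle's actual transform options may differ.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical transform names, chosen for illustration only.
TRANSFORMS: Dict[str, Callable[[List[float]], float]] = {
    "max": max,
    "min": min,
    "sum": sum,
    "avg": mean,
}


def transform_instant_vector(response: dict, transform: str) -> float:
    """Reduce the samples of an instant-query response to one metric value."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    # Each sample carries its value as a string in the second slot of "value".
    values = [float(sample["value"][1]) for sample in response["data"]["result"]]
    if not values:
        raise ValueError("query returned no samples")
    return TRANSFORMS[transform](values)
```

Reducing to one number in this way is what allows a query that returns several series to be pushed as a single SLI metric.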

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Retrieve aggregate data via the kubectl top command.

Tasks:
  • Running Kubectl Top And Extracting Metric Data

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Shea Stewart

Icon 2 Codecollection: rw-public-codecollection


Queries Twitter to count the number of tweets within a specified time range for a specific user handle.

Tasks:
  • Query Twitter

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Retrieve the result of an AWS CloudWatch Metrics Insights query.

Tasks:
  • Running CloudWatch Metric Query And Pushing The Result

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Run arbitrary gcloud commands and parse their output (for example, JSON) for values to be submitted as a metric.

Tasks:
  • Run Gcloud CLI Command and Push metric

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Shea Stewart

Icon 2 Codecollection: rw-public-codecollection


Uses kubectl to query the state of an ingester ring and determine whether it is healthy. Returns 1 if healthy, 0 if unhealthy.

Tasks:
  • Determine Cortex Ingester Ring Health

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Vui Le

Icon 2 Codecollection: rw-public-codecollection


Check DNS latency for Google Resolver.

Tasks:
  • Check DNS latency for Google Resolver

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Shea Stewart

Icon 2 Codecollection: rw-public-codecollection


A curl SLI for querying and extracting data from a generic curl call. Supports jq. Should produce a single metric.

Tasks:
  • Run Curl Command and Push Metric

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


A REST SLI for querying and extracting data from a REST endpoint that needs an explicit OAuth2 flow, where an access token must be acquired with a bearer token.

Tasks:
  • Request Data From Rest Endpoint

Icon 1 8 Troubleshooting Commands

Icon 2 Contributed by Shea Stewart

Icon 2 Codecollection: rw-public-codecollection


Uses PromQL on the Ops Suite API to determine the health of a MongoDB database instance and pushes the result as an SLI metric. Produces a 1 for a healthy resource, or 0 for an unhealthy resource.

Tasks:
  • Get Access Token
  • Get Instance Status
  • Get Connection Utilization Rate
  • Get MongoDB Member State Health
  • Get MongoDB Replication Lag
  • Get MongoDB Queue Size
  • Get Assertion Rate
  • Generate MongoDB Score

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Uses kubectl (or equivalent) to query the state of a Patroni cluster and determine whether it is healthy.

Tasks:
  • Determine Patroni Health

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Check the health of a Kubernetes API server using kubectl. Returns 1 when OK, or a 0 in the case of an unhealthy API server.

Tasks:
  • Running Kubectl Check Against API Server

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


This codebundle sets up a monitor for a specific region and GCP Product, which is then periodically checked for ongoing incidents based on the history available at https://status.cloud.google.com/incidents.json filtered based on severity level.

Tasks:
  • Get Number of GCP Incidents Effecting My Workspace

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Check the health of a Vault server. The response code is used to determine if the service is healthy, resulting in a metric of 1 if it is, or 0 if not.

Tasks:
  • Check If Vault Endpoint Is Healthy

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Retrieve the number of TLS certificates managed by cert-manager that expire within a given window. The metric pushed is the number of certificates within the configured expiration window.

Tasks:
  • Inspect Certification Expiration Dates
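
The window count can be illustrated with a small sketch, assuming each certificate's expiry timestamp has already been extracted as a timezone-aware `datetime`. The function name `count_certs_in_window` is hypothetical, and counting already-expired certificates as "within the window" is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def count_certs_in_window(not_after_times: Iterable[datetime],
                          now: datetime,
                          window: timedelta) -> int:
    """Count certificates whose expiry falls at or before now + window."""
    deadline = now + window
    # Assumption: certificates that have already expired are counted as well.
    return sum(1 for not_after in not_after_times if not_after <= deadline)
```

With a 30-day window, a certificate expiring in 5 days and one that expired yesterday are both counted, while one expiring in 40 days is not.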

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


A REST SLI for querying and extracting data from a REST endpoint that needs an explicit OAuth2 flow, where the token acquisition is handled using basic auth.

Tasks:
  • Request Data From Rest Endpoint

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Paul Dittaro

Icon 2 Codecollection: rw-public-codecollection


Checks for unresolved incidents related to GitHub services and provides a count of ongoing incidents as a metric.

Tasks:
  • Get Number of Incidents Affecting GitHub

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Vui Le

Icon 2 Codecollection: rw-public-codecollection


Retrieve the count of all AWS accounts in an organization.

Tasks:
  • Get Count Of AWS Accounts In Organization

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Vui Le

Icon 2 Codecollection: rw-public-codecollection


Check Grafana server health.

Tasks:
  • Check Grafana Server Health

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Retrieve the number of results of a GCP Log Explorer query.

Tasks:
  • Running GCE Logging Query And Pushing Result Count Metric

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Monitors AWS cost and usage data for the latest billing period. Accepts one tag for continuous monitoring.

Tasks:
  • Get All Billing Sliced By Tags

Icon 1 4 Troubleshooting Commands

Icon 2 Contributed by Shea Stewart

Icon 2 Codecollection: rw-public-codecollection


This SLI uses kubectl to score namespace health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, events, and pods not ready.

Tasks:
  • Get Event Count and Score
  • Get Container Restarts and Score
  • Get NotReady Pods
  • Generate Namspace Score

Icon 1 5 Troubleshooting Commands

Icon 2 Contributed by Shea Stewart

Icon 2 Codecollection: rw-public-codecollection


Uses PromQL on the Ops Suite API to determine the health of a Kong managed ingress resource and pushes the result as an SLI metric. Produces a 1 for a healthy resource, or 0 for an unhealthy resource.

Tasks:
  • Get Access Token
  • Get HTTP Error Rate
  • Get Upstream Health
  • Get Request Latency Rate
  • Generate Kong Ingress Score

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Monitors the average run time of a GitHub Actions workflow file within a repo and returns the average in minutes.

Tasks:
  • Get Average Run Time For Workflow

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Measure HTTP latency against a given URL. The returned metric is the number of seconds the request took as a float value.

Tasks:
  • Check HTTP Latency to Well Known URL

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Retrieve binary result from an AWS CloudWatch Insights query. Pushes 0 (success) if logs are found (activity) or 1 if no logs were found in the time window.

Tasks:
  • Running CloudWatch Log Query And Pushing 1 If No Results Found

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Fetch the results of a Datadog metric timeseries and push the extracted value as an SLI metric.

Tasks:
  • Query Datadog Metrics

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Vui Le

Icon 2 Codecollection: rw-public-codecollection


Check Elasticsearch cluster health

Tasks:
  • Check Elasticsearch Cluster Health

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Check the status of an Uptime.com component for a given site. It compares the operational state of the component with the list of allowed states, resulting in a 1 when acceptable, and 0 when not.

Tasks:
  • Check If Vault Endpoint Is Healthy

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Shea Stewart

Icon 2 Codecollection: rw-public-codecollection


This codebundle runs a kubectl get command that produces a value and pushes the metric. Uses jmespath for filtering and allows calculations such as count, sum, avg on specified fields.

Tasks:
  • Running Kubectl get and push the metric

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Retrieve aggregate results from multiple AWS CloudWatch Metrics Insights queries run against tagged resources. This codebundle fetches a list of instance IDs filtered by tags, uses them to run a set of AWS metric queries against the CloudWatch Metrics Insights API, and pushes an aggregated/transformed value provided by the API as a metric.

Tasks:
  • Run CloudWatch Metric Query Across Set Of IDs And Push Metric

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Returns the number of events with matching messages as an SLI metric.

Tasks:
  • Get Number Of Matching Events

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Retrieve the number of detected AWS CloudFormation stack events over a given history

Tasks:
  • Fetch CloudFormation Stack Events

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Check the health of pods deployed by cert-manager.

Tasks:
  • Health Check cert-manager Pods

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by

Icon 2 Codecollection: rw-public-codecollection


Performs a metric query using a Google MQL statement on the Ops Suite API and pushes the result as an SLI metric.

Tasks:
  • Running GCP OpsSuite Metric Query

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Vui Le

Icon 2 Codecollection: rw-public-codecollection


Retrieve a Datadog instance's "System Load" metric.

Tasks:
  • Check Datadog System Load

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Queries the Sysdig data API with a PromQL query to fetch metric data.

Tasks:
  • Querying PromQL Endpoint And Pushing Metric Data

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


A general purpose REST SLI for querying and extracting data from a REST endpoint that uses a basic auth flow.

Tasks:
  • Request Data From Rest Endpoint

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Vui Le

Icon 2 Codecollection: rw-public-codecollection


Check availability of a GitLab server.

Tasks:
  • Check GitLab Server Status

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Vui Le

Icon 2 Codecollection: rw-public-codecollection


Ping a host and retrieve packet loss percentage.

Tasks:
  • Ping host and collect packet lost percentage
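
As a sketch of how the packet loss percentage might be extracted, the snippet below parses the summary line that common `ping` implementations print (for example `4 packets transmitted, 3 received, 25% packet loss, time 3005ms`). The helper name `parse_packet_loss` is hypothetical, and the exact output format depends on the platform's ping.

```python
import re

# Matches the percentage in a summary such as "25% packet loss" or "0.0% packet loss".
_LOSS_PATTERN = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output: str) -> float:
    """Return the packet loss percentage reported in ping's summary line."""
    match = _LOSS_PATTERN.search(ping_output)
    if match is None:
        raise ValueError("no packet loss summary found in ping output")
    return float(match.group(1))
```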

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Measures the maximum replica lag across a Patroni cluster.

Tasks:
  • Measure Patroni Member Lag

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Retrieve number of results from an AWS CloudWatch Insights query.

Tasks:
  • Running CloudWatch Log Query And Pushing The Count Of Results

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Checks that the current state of a daemonset is healthy and returns a score of either 1 (healthy) or 0 (unhealthy).

Tasks:
  • Health Check Daemonset

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Paul Dittaro

Icon 2 Codecollection: rw-public-codecollection


Retrieve the number of upcoming GitHub platform maintenance windows over a given window.

Tasks:
  • Get Scheduled and Active GitHub Maintenance Windows

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Run a PromQL query against Prometheus range query API, perform a provided transform, and return the result.

Tasks:
  • Querying Prometheus Instance And Pushing Aggregated Data

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Vui Le

Icon 2 Codecollection: rw-public-codecollection


Check health of Pingdom platform.

Tasks:
  • Check Pingdom Health

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


An SLI that monitors another SLI submitting a 0-1 health score and immediately triggers a taskset when that score falls below a threshold. While the monitored score is below the threshold, this SLI submits a 1 to denote that a signal was sent; it returns to 0 once the monitored SLI is healthy again.

Tasks:
  • Check If SLI Within Incident Threshold
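
The threshold behavior can be sketched as below. This is only an illustration: `check_sli` and the `trigger_taskset` callback are hypothetical names, and how the real codebundle reads the monitored SLI or starts the taskset is not shown in this listing.

```python
from typing import Callable


def check_sli(monitored_score: float,
              threshold: float,
              trigger_taskset: Callable[[], None]) -> int:
    """Return 1 and fire the taskset when the monitored score is below the threshold."""
    if monitored_score < threshold:
        # Hypothetical hook standing in for whatever starts the taskset.
        trigger_taskset()
        return 1
    # The monitored SLI is healthy, so no signal is sent.
    return 0
```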

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Vui Le

Icon 2 Codecollection: rw-public-codecollection


Check Jira latency when searching issues by current user.

Tasks:
  • Search Jira Issues By Current User

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Paul Dittaro

Icon 2 Codecollection: rw-public-codecollection


Check the status of the GitHub platform (https://www.githubstatus.com/) for a specified set of GitHub service components. The metric supplied is an aggregated percentage indicating the availability of the components, with 1 = 100% available.

Tasks:
  • Get Availability of GitHub or Individual GitHub Components
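
One plausible way to aggregate component availability into a 0-1 value is the fraction of selected components reporting an operational status. The sketch below assumes that interpretation; the function name `availability_score`, the component names, and the treatment of every non-`operational` status as unavailable are all assumptions.

```python
from typing import Dict, Iterable


def availability_score(component_status: Dict[str, str],
                       wanted: Iterable[str]) -> float:
    """Return the fraction of wanted components whose status is 'operational'."""
    names = list(wanted)
    if not names:
        raise ValueError("at least one component must be selected")
    # Assumption: missing components and any other status count as unavailable.
    operational = sum(1 for name in names
                      if component_status.get(name) == "operational")
    return operational / len(names)
```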

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Creates an ad hoc, one-shot job that mounts a PVC as a canary test. The job is polled for success before being torn down.

Tasks:
  • Run Canary Job

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Shea Stewart

Icon 2 Codecollection: rw-public-codecollection


Performs a metric query using a PromQL statement on the Ops Suite API and pushes the result as an SLI metric.

Tasks:
  • Run Prometheus Instant Query Against Google Prom API Endpoint

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


A general-purpose REST SLI for querying and extracting data from a REST endpoint that uses an implicit OAuth2 flow.

Tasks:
  • Request Data From Rest Endpoint

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Vui Le

Icon 2 Codecollection: rw-public-codecollection


Check GitHub latency by getting a list of repo names.

Tasks:
  • Check GitHub Latency With Get Repos

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Queries the Sysdig data API to fetch metric data.

Tasks:
  • Query Sysdig Metric Data And Pushing Metric

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Check whether an HTTP request against a URL fails or takes longer than a given latency window. A return of 1 is considered a success, while a 0 is a failure.

Tasks:
  • Checking HTTP URL Is Available And Timely
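
The pass/fail decision can be sketched as a pure function over an already-measured request. The name `http_availability_score` is hypothetical, and treating only 2xx and 3xx status codes as success is an assumption; the codebundle may define failure differently.

```python
from typing import Optional


def http_availability_score(status_code: Optional[int],
                            elapsed_seconds: float,
                            max_latency_seconds: float) -> int:
    """Return 1 when the request succeeded within the latency window, else 0."""
    if status_code is None:
        # No response at all, e.g. a connection error or timeout.
        return 0
    if not 200 <= status_code < 400:
        # Assumption: client and server error codes count as failures.
        return 0
    return 1 if elapsed_seconds <= max_latency_seconds else 0
```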

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by Jonathan Funk

Icon 2 Codecollection: rw-public-codecollection


Runs a postgres SQL query and pushes the returned query result as an SLI metric. During execution, the SQL query should be passed to a Kubernetes workload that has access to the psql binary. The workload will run the query and return the result from stdout.

Tasks:
  • Run Postgres Query And Return Result As Metric

Icon 1 4 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Generates a composite score for the health of an AKS cluster using the AZ CLI. Returns a 1 if all checks pass, 0 if they all fail, and a value between 0 and 1 for partial success. Checks the upstream service for reported errors, looks for Critical or Error activities within a specified time period, and checks the overall configuration for provisioning failures.

Tasks:
  • Check for Resource Health Issues Affecting AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}`
  • Fetch Activities for AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}`
  • Check Configuration Health of AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}`
  • Generate AKS Cluster Health Score
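
A composite score with these properties (1 when all checks pass, 0 when all fail, a fraction in between) can be sketched as the share of passing checks. This is an assumption about the aggregation, not the codebundle's confirmed formula; `composite_health_score` and the check names in the usage note are hypothetical.

```python
from typing import Mapping


def composite_health_score(check_results: Mapping[str, bool]) -> float:
    """Return the fraction of checks that passed, between 0.0 and 1.0."""
    if not check_results:
        raise ValueError("at least one check result is required")
    passed = sum(1 for ok in check_results.values() if ok)
    # Assumption: every check carries equal weight.
    return passed / len(check_results)
```

For instance, passing the resource health and configuration checks while failing the activities check would give a score of two thirds under this equal-weight assumption.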

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


This CodeBundle counts the outdated container images from a configured list. It compares upstream images with those in the registry and counts the ones that are outdated.

Tasks:
  • Count Outdated Images in Azure Container Registry `${ACR_REGISTRY}`

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Counts the number of unhealthy cert-manager managed certificates in a namespace.

Tasks:
  • Count Unready and Expired Certificates in Namespace `${NAMESPACE}`
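
To make the counting concrete, the sketch below walks the JSON that `kubectl get certificates.cert-manager.io -n <namespace> -o json` would return and counts certificates that are either not Ready or past their `status.notAfter` time. The function name is hypothetical, and treating "unhealthy" as "not Ready or expired" is an assumption based on the task title.

```python
from datetime import datetime, timezone


def count_unhealthy_certificates(cert_list, now=None):
    """Count Certificate objects that are not Ready or whose notAfter has passed."""
    now = now or datetime.now(timezone.utc)
    unhealthy = 0
    for cert in cert_list.get("items", []):
        status = cert.get("status", {})
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        expired = False
        not_after = status.get("notAfter")
        if not_after:
            # Timestamps are expected in the form 2025-01-31T12:00:00Z
            expiry = datetime.strptime(not_after, "%Y-%m-%dT%H:%M:%SZ")
            expired = expiry.replace(tzinfo=timezone.utc) <= now
        if not ready or expired:
            unhealthy += 1
    return unhealthy
```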

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-cli-codecollection


Monitor AWS Lambda Invocation Errors

Tasks:
  • Analyze AWS Lambda Invocation Errors in Region `${AWS_REGION}`

Icon 1 5 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


This SLI measures DNS health metrics for Azure environments including resolution success rates, latency measurements, private DNS zone health, and external DNS resolver availability. Provides binary scoring (0/1) for each metric and calculates an overall DNS health score. Supports multiple FQDNs, private/public DNS zones, forward lookup zones, and external resolver testing.

Tasks:
  • DNS Resolution Success Rate
  • DNS Query Latency
  • Private DNS Zone Health
  • External DNS Resolver Availability
  • Generate DNS Health Score

Icon 1 7 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Service Level Indicators for GitHub Actions Health Monitoring

Tasks:
  • Calculate Workflow Success Rate Across Specified Repositories
  • Calculate Organization Health Score Across Specified Organizations
  • Calculate Runner Availability Score Across Specified Organizations
  • Calculate Security Workflow Score Across Specified Repositories
  • Calculate Performance Score Across Specified Repositories
  • Calculate API Rate Limit Health Score
  • Generate Overall GitHub Actions Health Score

Icon 1 7 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Calculates Azure ACR health by checking reachability, SKU, pull/push ratio, and storage utilization.

Tasks:
  • Check ACR Reachability for Registry `${ACR_NAME}`
  • Check ACR Usage SKU Metric for Registry `${ACR_NAME}`
  • Check ACR Pull/Push Success Ratio for Registry `${ACR_NAME}`
  • Check ACR Storage Utilization for Registry `${ACR_NAME}`
  • Check ACR Network Configuration for Registry `${ACR_NAME}`
  • Check ACR Security Configuration
  • Generate Comprehensive ACR Health Score for Registry `${ACR_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`

Icon 1 6 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Queries the health status of an App Service, and returns 0 when it's not healthy, and 1 when it is.

Tasks:
  • Check for Resource Health Issues Affecting App Service `${APP_SERVICE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
  • Check App Service `${APP_SERVICE_NAME}` Health Check Metrics In Resource Group `${AZ_RESOURCE_GROUP}`
  • Check App Service `${APP_SERVICE_NAME}` Configuration Health In Resource Group `${AZ_RESOURCE_GROUP}`
  • Check Deployment Health of App Service `${APP_SERVICE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
  • Fetch App Service `${APP_SERVICE_NAME}` Activities In Resource Group `${AZ_RESOURCE_GROUP}`
  • Generate App Service Health Score for `${APP_SERVICE_NAME}` in resource group `${AZ_RESOURCE_GROUP}`

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Counts nodes that have been preempted within the defined time interval.

Tasks:
  • Count the number of nodes in active preempt operation in project `${GCP_PROJECT_ID}`

Icon 1 3 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Runs multiple Kubernetes and psql commands to report on the health of a postgres cluster. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Checks for database lag and backup health.

Tasks:
  • Check Patroni Database Lag in Namespace `${NAMESPACE}` on Host `${HOSTNAME}` using `patronictl`
  • Check Database Backup Status for Cluster `${OBJECT_NAME}` in Namespace `${NAMESPACE}`
  • Generate Namespace Score for Namespace `${NAMESPACE}`

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-cli-codecollection


Measures failing reconciliations for fluxcd

Tasks:
  • Health Check Flux Reconciliation

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Count the number of Cloud Functions in an unhealthy state for a GCP Project.

Tasks:
  • Count unhealthy GCP Cloud Functions in GCP Project `${GCP_PROJECT_ID}`

Icon 1 3 Troubleshooting Commands

Icon 2 Contributed by saurabh3460

Icon 2 Codecollection: rw-cli-codecollection


Checks Azure App Service Plan health by identifying availability issues and high capacity usage.

Tasks:
  • Count App Service Plans with Health Status of `Available` in resource group `${AZURE_RESOURCE_GROUP}`
  • Count App Service Plans with High Capacity Usage in resource group `${AZURE_RESOURCE_GROUP}`
  • Generate Health Score

Icon 1 2 Troubleshooting Commands

Icon 2 Contributed by akshayrw25

Icon 2 Codecollection: rw-cli-codecollection


This SLI monitors stacktrace health in Kubernetes workload application logs. Produces a value between 0 (stacktraces detected) and 1 (no stacktraces found). Focuses specifically on application error detection through stacktrace analysis.

Tasks:
  • Get Stacktrace Health Score for ${WORKLOAD_TYPE} `${WORKLOAD_NAME}`
  • Generate Stacktrace Health Score for `${WORKLOAD_NAME}`

Icon 1 2 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Evaluate cluster node health using kubectl.

Tasks:
  • Check for Node Restarts in Cluster `${CONTEXT}`
  • Generate Namespace Score in Kubernetes Cluster `${CONTEXT}`

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-cli-codecollection


Checks VM Scale Set key metrics and returns a 1 when healthy, or 0 when not healthy.

Tasks:
  • Check Scale Set `${VMSCALESET}` Key Metrics In Resource Group `${AZ_RESOURCE_GROUP}`

Icon 1 4 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Performs a health check on Azure Service Bus instances and the components using them, generating a report of issues and next steps.

Tasks:
  • Check for Resource Health Issues Service Bus `${SB_NAMESPACE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
  • Check Basic Connectivity for Service Bus `${SB_NAMESPACE_NAME}`
  • Check Critical Metrics for Service Bus `${SB_NAMESPACE_NAME}`
  • Generate Enhanced Service Bus Health Score

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-cli-codecollection


Measures the number of exception stacktraces present in an application's logs over a time period.

Tasks:
  • Tail `${CONTAINER_NAME}` Application Logs For Stacktraces
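
A simple way to count stacktraces in a chunk of log text is to count the lines that open one. The sketch below recognises only two opening markers, Python's `Traceback (most recent call last):` and Java's `Exception in thread "..."`. Both the marker list and the function name are assumptions for illustration, and the real codebundle may detect stacktraces differently.

```python
import re

# Assumed stacktrace openers: Python tracebacks and Java uncaught-exception headers.
STACKTRACE_START = re.compile(
    r'^(?:Traceback \(most recent call last\):|Exception in thread ".*")',
    re.MULTILINE,
)


def count_stacktraces(log_text):
    """Return the number of lines in log_text that begin a stacktrace."""
    return len(STACKTRACE_START.findall(log_text))
```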

Icon 1 3 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Counts the number of nodes above 90% CPU or Memory Utilization from kubectl top.

Tasks:
  • Identify High Utilization Nodes for Cluster `${CONTEXT}`
  • Identify Pods with Resource Limits Exceeding Node Capacity in Cluster `${CONTEXT}`
  • Generate Cluster Resource Health Score
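
The count described above can be sketched by parsing the table printed by `kubectl top nodes`, whose columns are NAME, CPU(cores), CPU%, MEMORY(bytes), and MEMORY%. The function name below is hypothetical, and reading "above 90%" as strictly greater than the threshold is an assumption.

```python
def count_high_utilization_nodes(top_output, threshold_percent=90):
    """Count nodes whose CPU% or MEMORY% exceeds threshold_percent."""
    count = 0
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        try:
            cpu = int(fields[2].rstrip("%"))
            mem = int(fields[4].rstrip("%"))
        except ValueError:
            continue  # e.g. "<unknown>" when metrics are unavailable for a node
        if cpu > threshold_percent or mem > threshold_percent:
            count += 1
    return count
```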

Icon 1 7 Troubleshooting Commands

Icon 2 Contributed by saurabh3460

Icon 2 Codecollection: rw-cli-codecollection


Checks Jenkins health, failed builds, failed tests, and long-running builds.

Tasks:
  • Check For Failed Build Logs in Jenkins Instance `${JENKINS_INSTANCE_NAME}`
  • Check For Long Running Builds in Jenkins Instance `${JENKINS_INSTANCE_NAME}`
  • Check For Recent Failed Tests in Jenkins Instance `${JENKINS_INSTANCE_NAME}`
  • Check For Jenkins Instance `${JENKINS_INSTANCE_NAME}` Health
  • Check For Long Queued Builds in Jenkins Instance `${JENKINS_INSTANCE_NAME}`
  • Check Jenkins Executor Utilization in Jenkins Instance `${JENKINS_INSTANCE_NAME}`
  • Generate Health Score

Icon 1 4 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


This SLI uses the GCP API or gcloud to score bucket health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for usage above a threshold and public buckets.

Tasks:
  • Fetch GCP Bucket Storage Utilization for `${PROJECT_IDS}`
  • Check GCP Bucket Security Configuration for `${PROJECT_IDS}`
  • Fetch GCP Bucket Storage Operations Rate for `${PROJECT_IDS}`
  • Generate Bucket Score in Project `${PROJECT_IDS}`

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


This taskset uses curl to validate the response code of the endpoint. Returns a score of 1 if healthy and 0 if unhealthy.

Tasks:
  • Validate HTTP URL Availability and Timeliness for ${URL}

Icon 1 3 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


This codebundle checks for unhealthy or suspended FluxCD Kustomization objects.

Tasks:
  • List Suspended FluxCD Kustomization objects in Namespace `${NAMESPACE}` in Cluster `${CONTEXT}`
  • List Unready FluxCD Kustomizations in Namespace `${NAMESPACE}` in Cluster `${CONTEXT}`
  • Generate FluxCD Kustomization Health Score for Namespace `${NAMESPACE}` in Cluster `${CONTEXT}`

Icon 1 7 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Calculates an SLI for GCP Vertex AI Model Garden health using the Google Cloud Monitoring Python SDK.

Required IAM Roles:
  • roles/monitoring.viewer (for metrics access)
  • roles/logging.privateLogViewer (for quick log health check)

Required Permissions:
  • monitoring.timeSeries.list
  • logging.privateLogEntries.list

Tasks:
  • Quick Vertex AI Log Health Check for `${GCP_PROJECT_ID}`
  • Calculate Error Rate Score for `${GCP_PROJECT_ID}`
  • Calculate Latency Performance Score for `${GCP_PROJECT_ID}`
  • Calculate Throughput Usage Score for `${GCP_PROJECT_ID}`
  • Discover All Deployed Models for `${GCP_PROJECT_ID}`
  • Check Service Availability Score for `${GCP_PROJECT_ID}`
  • Generate Final Vertex AI Model Garden Health Score for `${GCP_PROJECT_ID}`

Icon 1 6 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Queries the health status of a Function App, and returns 0 when it's not healthy, and 1 when it is.

Tasks:
  • Check for Resource Health Issues Affecting Function App `${FUNCTION_APP_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
  • Check Function App `${FUNCTION_APP_NAME}` Health Check Metrics In Resource Group `${AZ_RESOURCE_GROUP}`
  • Check Function App `${FUNCTION_APP_NAME}` Configuration Health In Resource Group `${AZ_RESOURCE_GROUP}`
  • Check Deployment Health of Function App `${FUNCTION_APP_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
  • Fetch Function App `${FUNCTION_APP_NAME}` Activities In Resource Group `${AZ_RESOURCE_GROUP}`
  • Generate Function App Health Score for `${FUNCTION_APP_NAME}` in resource group `${AZ_RESOURCE_GROUP}`

Icon 1 7 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Queries the health of an Azure Application Gateway, returning 1 when it's healthy and 0 when it's unhealthy.

Tasks:
  • Check for Resource Health Issues Affecting Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
  • Check Configuration Health of Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
  • Check Backend Pool Health for Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
  • Fetch Metrics for Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
  • Check SSL Certificate Health for Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
  • Check Logs for Errors with Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
  • Generate Application Gateway Health Score

Icon 1 7 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Runs diagnostic checks to check the health of APIM instances

Tasks:
  • Check for Resource Health Issues Affecting APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`
  • Fetch Key Metrics for APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`
  • Check Logs for Errors with APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`
  • Verify APIM Policy Configurations for `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`
  • Check APIM SSL Certificates for `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`
  • Inspect Dependencies and Related Resources for APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`
  • Generate APIM Health Score

Icon 1 5 Troubleshooting Commands

Icon 2 Contributed by Nbarola

Icon 2 Codecollection: rw-cli-codecollection


Calculates Azure VM health by checking disk, memory, uptime, and patch status.

Tasks:
  • Check Disk Utilization for VMs in Resource Group `${AZ_RESOURCE_GROUP}`
  • Check Memory Utilization for VMs in Resource Group `${AZ_RESOURCE_GROUP}`
  • Check Uptime for VMs in Resource Group `${AZ_RESOURCE_GROUP}`
  • Check Last Patch Status for VMs in Resource Group `${AZ_RESOURCE_GROUP}`
  • Generate Comprehensive VM Health Score

Icon 1 7 Troubleshooting Commands

Icon 2 Contributed by saurabh3460

Icon 2 Codecollection: rw-cli-codecollection


Measures Azure Key Vault health by checking availability metrics, configuration settings, expiring items (secrets/certificates/keys), log issues, and performance metrics.

Tasks:
  • Count Key Vault Resource Health in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Key Vault Availability in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Key Vault configuration in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Expiring Key Vault Items in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Key Vault Log Issues in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Key Vault Performance Metrics in resource group `${AZURE_RESOURCE_GROUP}`
  • Generate Comprehensive Key Vault Health Score

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


This SLI fetches the latest GitHub Actions workflow run artifact and pushes a metric based on a user-provided command.

Tasks:
  • Analyze artifact from GitHub Workflow `${WORKFLOW_NAME}` in repository `${GITHUB_REPO}` and push metric

Icon 1 8 Troubleshooting Commands

Icon 2 Contributed by Nbarola

Icon 2 Codecollection: rw-cli-codecollection


Checks Istio proxy sidecar injection status, high memory and CPU usage, warnings and errors in logs, certificate validity, and configuration, and verifies the Istio installation.

Tasks:
  • Verify Istio Sidecar Injection for Cluster `${CONTEXT}`
  • Check Istio Sidecar Resource Usage for Cluster `${CONTEXT}`
  • Validate Istio Installation in Cluster `${CONTEXT}`
  • Check Istio Controlplane Logs For Errors in Cluster `${CONTEXT}`
  • Fetch Istio Proxy Logs in Cluster `${CONTEXT}`
  • Verify Istio SSL Certificates in Cluster `${CONTEXT}`
  • Check Istio Configuration Health in Cluster `${CONTEXT}`
  • Generate Health Score for Cluster ${CONTEXT}

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-cli-codecollection


Monitors the status of EKS / Fargate in the given AWS region.

Tasks:
  • Check Amazon EKS Cluster Health Status in AWS Region `${AWS_REGION}`

Icon 1 6 Troubleshooting Commands

Icon 2 Contributed by saurabh3460

Icon 2 Codecollection: rw-cli-codecollection


Azure Data Factories health checks including resource health status, frequent pipeline errors, failed pipeline runs, and large data operations monitoring.

Tasks:
  • Identify Health Issues Affecting Data Factories in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Frequent Pipeline Errors in Data Factories in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Failed Pipelines in Data Factories in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Large Data Operations in Data Factories in resource group `${AZURE_RESOURCE_GROUP}`
  • Count Long Running Pipeline Runs in Data Factories in resource group `${AZURE_RESOURCE_GROUP}`
  • Generate Health Score

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-cli-codecollection


Measures the number of exception stacktraces present in an application's logs over a time period.

Tasks:
  • Measure Application Exceptions in `${NAMESPACE}`

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by jon-funk

Icon 2 Codecollection: rw-cli-codecollection


Monitors the health status of ElastiCache Redis in the AWS region.

Tasks:
  • Scan ElastiCaches in AWS Region `${AWS_REGION}`

Icon 1 6 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


Identifies issues affecting GKE clusters in a GCP project and creates a health score. A score of 1 is healthy; a score between 0 and 1 indicates unhealthy components.

Tasks:
  • Identify GKE Service Account Issues in GCP Project `${GCP_PROJECT_ID}`
  • Fetch GKE Recommendations for GCP Project `${GCP_PROJECT_ID}`
  • Fetch GKE Cluster Health for GCP Project `${GCP_PROJECT_ID}`
  • Check for Quota Related GKE Autoscaling Issues in GCP Project `${GCP_PROJECT_ID}`
  • Quick Node Instance Group Health Check for GCP Project `${GCP_PROJECT_ID}`
  • Generate GKE Cluster Health Score

Icon 1 4 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


This SLI uses kubectl to score namespace health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, events, and pods not ready.

Tasks:
  • Get Error Event Count within ${EVENT_AGE} and calculate Score
  • Get Container Restarts and Score in Namespace `${NAMESPACE}`
  • Get NotReady Pods in `${NAMESPACE}`
  • Generate Namespace Score in `${NAMESPACE}`

Icon 1 2 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


This SLI collects information about storage such as PersistentVolumes and PersistentVolumeClaims and generates an aggregated health score for the namespace: 1 is healthy, 0 is failed, and a value between 0 and 1 is degraded.

Tasks:
  • Fetch the Storage Utilization for PVC Mounts in Namespace `${NAMESPACE}`
  • Generate Namespace Score for Namespace `${NAMESPACE}`

Icon 1 6 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


This SLI uses kubectl to score deployment health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, critical log errors, pods not ready, deployment status, and recent events.

Tasks:
  • Get Container Restarts and Score for Deployment `${DEPLOYMENT_NAME}`
  • Get Critical Log Errors and Score for Deployment `${DEPLOYMENT_NAME}`
  • Get NotReady Pods Score for Deployment `${DEPLOYMENT_NAME}`
  • Get Deployment Replica Status and Score for `${DEPLOYMENT_NAME}`
  • Get Recent Warning Events Score for `${DEPLOYMENT_NAME}`
  • Generate Deployment Health Score for `${DEPLOYMENT_NAME}`

Icon 1 1 Troubleshooting Commands

Icon 2 Contributed by stewartshea

Icon 2 Codecollection: rw-cli-codecollection


This codebundle fetches the number of running pods with the set of provided labels, letting you measure the number of running pods.

Tasks:
  • Measure Number of Running Pods with Label in `${NAMESPACE}`
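
As an illustration of the measurement above, the sketch below counts pods in the `Running` phase from the JSON returned by `kubectl get pods -n <namespace> -l <labels> -o json`. The function name is hypothetical, and relying on `status.phase` alone (rather than container readiness) is an assumption.

```python
def count_running_pods(pod_list):
    """Count items in a kubectl pod list whose status.phase is "Running"."""
    return sum(
        1
        for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("phase") == "Running"
    )
```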