CodeCollection Registry

SLI

Metric from Azure CLI Command

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-generic-codecollection

Runs an ad-hoc, user-provided command. If the command writes a non-empty string to stdout, a health score of 0 (unhealthy) is pushed; if there is no output, indicating no issues, a 1 is pushed. User commands should filter out expected/healthy content (e.g., with grep) and output only the errors found.

Tasks:

${TASK_TITLE}
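
The scoring rule described above can be sketched in a few lines of Python. This is only an illustration, not the codebundle's implementation: the function name `health_score_from_command` is hypothetical, and treating whitespace-only output as "empty" is an assumption.

```python
import subprocess


def health_score_from_command(command: str) -> int:
    """Run a shell command and map its stdout to a 0/1 health score.

    Hypothetical helper for illustration only.
    """
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    # Assumption: whitespace-only output counts as "no output".
    # Any other output is treated as a found error, so the score is 0.
    return 0 if result.stdout.strip() else 1
```

A command that filters healthy lines with grep and prints nothing would score 1 under this sketch; any surviving error line would score 0.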

cURL CLI Command Metric with Headers

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-generic-codecollection

This SLI runs a user-provided cURL command and can push the result as a metric. Optional headers and post-processing commands are supported. If HEADERS is provided, the file is appended to the cURL command using -K. If POST_PROCESS is provided, it is appended as a pipe (|) to further process the output (e.g., jq).

Tasks:

${TASK_TITLE}
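
As a rough sketch of how the final command line might be assembled from these inputs: the helper name `build_curl_command` and its parameters are hypothetical, and the example only builds the command string rather than running it. The `-K` option is cURL's flag for reading options from a config file.

```python
import shlex
from typing import Optional


def build_curl_command(
    curl_cmd: str,
    headers_file: Optional[str] = None,
    post_process: Optional[str] = None,
) -> str:
    """Assemble a shell command from a cURL command and optional extras.

    Hypothetical helper; mirrors the description, not the actual codebundle.
    """
    command = curl_cmd
    if headers_file:
        # Append the headers file as a cURL config file via -K.
        command = f"{command} -K {shlex.quote(headers_file)}"
    if post_process:
        # Pipe the cURL output into the post-processing command (e.g. jq).
        command = f"{command} | {post_process}"
    return command
```

For instance, passing a cURL command, a headers file path, and `jq -r .status` would yield the cURL command followed by `-K <file> | jq -r .status`.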

Metric from Kubernetes CLI Command

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-generic-codecollection

This taskset runs a user-provided kubectl command and pushes the metric. The supplied command must result in a single distinct metric. Command line tools like jq are available.

Tasks:

${TASK_TITLE}

Metric from cURL CLI Command

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-generic-codecollection

Runs an ad-hoc, user-provided command. If the command writes a non-empty string to stdout, a health score of 0 (unhealthy) is pushed; if there is no output, indicating no issues, a 1 is pushed. User commands should filter out expected/healthy content (e.g., with grep) and output only the errors found.

Tasks:

${TASK_TITLE}

cURL CLI Command Metric with Headers

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-generic-codecollection

Runs an ad-hoc, user-provided command and pushes the command's stdout as the health metric. If no output is produced, the resulting metric is empty; if the command produces output, that exact text is used as the metric. User commands should output the desired health metric or numeric value, e.g., "0" if unhealthy or "1" if healthy.

Tasks:

${TASK_TITLE}

Metric from Azure CLI Command

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-generic-codecollection

This SLI runs a user-provided Azure CLI command and pushes the metric. The supplied command must result in a single distinct metric. Command line tools like jq are available.

Tasks:

${TASK_TITLE}

Environment Script Command Metric with Individual Secrets

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-generic-codecollection

This SLI runs a user-provided script/command with up to 10 configurable environment variables from secrets and pushes the result as a metric. Each environment variable is configured as an individual secret for maximum security and clarity. The supplied command must result in a distinct single metric.

Tasks:

${TASK_TITLE}

AWS CLI Command with Issue

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-generic-codecollection

Runs an ad-hoc, user-provided command. If the command writes a non-empty string to stdout, a health score of 0 (unhealthy) is pushed; if there is no output, indicating no issues, a 1 is pushed. User commands should filter out expected/healthy content (e.g., with grep) and output only the errors found.

Tasks:

${TASK_TITLE}

AWS CLI Command

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-generic-codecollection

This SLI runs a user-provided AWS CLI command and pushes the metric. The supplied command must result in a single distinct metric. Command line tools like jq are available.

Tasks:

${TASK_TITLE}

Metric from GCP CLI Command

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-generic-codecollection

Runs a user-provided gcloud command and pushes the metric to the RunWhen Platform. The supplied command must result in a single distinct metric. Command line tools like jq are available.

Tasks:

${TASK_TITLE}

Git Script Command Metric with Secrets

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-generic-codecollection

This SLI runs a user-provided script/command with flexible environment variable support and pushes the result as a metric. Supports SSH keys, git credentials, and arbitrary environment variables from secrets. The supplied command must result in a distinct single metric.

Tasks:

${TASK_TITLE}

Metric from Kubernetes CLI Command

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-generic-codecollection

Runs an ad-hoc, user-provided command. If the command writes a non-empty string to stdout, a health score of 0 (unhealthy) is pushed; if there is no output, indicating no issues, a 1 is pushed. User commands should filter out expected/healthy content (e.g., with grep) and output only the errors found.

Tasks:

${TASK_TITLE}

cURL CLI Command

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-generic-codecollection

This SLI runs a user-provided cURL command and can push the result as a metric. Command line tools like jq are available. Accepts HEADERS as a secret.

Tasks:

${TASK_TITLE}

Metric from GCP CLI Command

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-generic-codecollection

Runs an ad-hoc, user-provided command. If the command writes a non-empty string to stdout, a health score of 0 (unhealthy) is pushed; if there is no output, indicating no issues, a 1 is pushed. User commands should filter out expected/healthy content (e.g., with grep) and output only the errors found.

Tasks:

${TASK_TITLE}

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: aws-c7n-codecollection

Counts the number of S3 buckets in an Account that are insecure or unhealthy.

Tasks:

Count S3 Buckets With Public Access in AWS Account `${AWS_ACCOUNT_NAME}`

4 Troubleshooting Commands

Contributed by saurabh3460

Codecollection: aws-c7n-codecollection

Counts the number of EBS resources by identifying unattached volumes, unused and aged snapshots, and unencrypted volumes.

Tasks:

Check Unattached EBS Volumes in `${AWS_REGION}`
Check Unencrypted EBS Volumes in `${AWS_REGION}`
Check Unused EBS Snapshots in `${AWS_REGION}`
Generate EBS Score

4 Troubleshooting Commands

Contributed by saurabh3460

Codecollection: aws-c7n-codecollection

Count the number of EC2 instances that are stale or stopped

Tasks:

Check for stale AWS EC2 instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
Check for stopped AWS EC2 instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
Check for invalid AWS Auto Scaling Groups in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
Generate Health Score

AWS network health

5 Troubleshooting Commands

Contributed by saurabh3460

Codecollection: aws-c7n-codecollection

Count publicly accessible security groups, unused EIPs, unused ELBs, and VPCs with flow logs disabled

Tasks:

Check for publicly accessible security groups in AWS account `${AWS_ACCOUNT_ID}`
Check for unused Elastic IPs in AWS account `${AWS_ACCOUNT_ID}`
Check for unused ELBs in AWS account `${AWS_ACCOUNT_ID}`
Check for VPCs with Flow Logs disabled in AWS account `${AWS_ACCOUNT_ID}`
Generate Health Score

4 Troubleshooting Commands

Contributed by saurabh3460

Codecollection: aws-c7n-codecollection

Check AWS RDS instances that are unencrypted, publicly accessible, or have backups disabled.

Tasks:

Check for unencrypted RDS instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
Check for publicly accessible RDS instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
Check for disabled backup RDS instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
Generate Health Score

6 Troubleshooting Commands

Contributed by saurabh3460

Codecollection: aws-c7n-codecollection

Count AWS ACM certificates that are unused, expiring, expired, or in a failed status.

Tasks:

Check for unused ACM certificates in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
Check for Expiring ACM certificates in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
Check for expired ACM certificates in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
Check for Failed Status ACM Certificates in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`
Check for Pending Validation ACM Certificates in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`
Generate Health Score

AWS CloudWatch Logs health

4 Troubleshooting Commands

Contributed by saurabh3460

Codecollection: aws-c7n-codecollection

Check AWS Monitoring Configuration Health

Tasks:

Check CloudWatch Log Groups Without Retention Period in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`
Check if CloudTrail exists and is configured for multi-region in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`
Check CloudTrail Without CloudWatch Logs in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`
Generate Health Score

8 Troubleshooting Commands

Contributed by saurabh3460

Codecollection: azure-c7n-codecollection

Check Azure storage health by identifying unused disks, snapshots, and storage accounts

Tasks:

Count Azure Storage Accounts with Health Status of `Available` in resource group `${AZURE_RESOURCE_GROUP}`
Count Unused Disks in resource group `${AZURE_RESOURCE_GROUP}`
Count Unused Snapshots in resource group `${AZURE_RESOURCE_GROUP}`
Count Unused Storage Accounts in resource group `${AZURE_RESOURCE_GROUP}`
Count Storage Containers with Public Access in resource group `${AZURE_RESOURCE_GROUP}`
Count Storage Account Misconfigurations in resource group `${AZURE_RESOURCE_GROUP}`
Count Storage Account Changes with Critical/High Security Risk in resource group `${AZURE_RESOURCE_GROUP}`
Generate Health Score

Azure Virtual Machine Health

11 Troubleshooting Commands

Contributed by saurabh3460

Codecollection: azure-c7n-codecollection

Count Virtual machines that are publicly accessible, have high CPU usage, underutilized memory, stopped state, unused network interfaces, and unused public IPs in Azure

Tasks:

Check Azure VM Health in resource group `${AZURE_RESOURCE_GROUP}`
Check for VMs With Public IP in resource group `${AZURE_RESOURCE_GROUP}`
Check for Stopped VMs in resource group `${AZURE_RESOURCE_GROUP}`
Check for VMs With High CPU Usage in resource group `${AZURE_RESOURCE_GROUP}`
Check for Underutilized VMs Based on CPU Usage in resource group `${AZURE_RESOURCE_GROUP}`
Check for VMs With High Memory Usage in resource group `${AZURE_RESOURCE_GROUP}`
Check for Underutilized VMs Based on Memory Usage in resource group `${AZURE_RESOURCE_GROUP}`
Check for Unused Network Interfaces in resource group `${AZURE_RESOURCE_GROUP}`
Check for Unused Public IPs in resource group `${AZURE_RESOURCE_GROUP}`
Check VMs Agent Status in resource group `${AZURE_RESOURCE_GROUP}`
Generate Health Score

Azure Database Health

10 Troubleshooting Commands

Contributed by saurabh3460

Codecollection: azure-c7n-codecollection

Count databases that are publicly accessible, without replication, without high availability configuration, with high CPU usage, high memory usage, high cache miss rate, low availability, and risky configuration changes in Azure

Tasks:

Score Database Availability in resource group `${AZURE_RESOURCE_GROUP}`
Count Publicly Accessible Databases in resource group `${AZURE_RESOURCE_GROUP}`
Count Databases Without Replication in resource group `${AZURE_RESOURCE_GROUP}`
Count Databases Without High Availability in resource group `${AZURE_RESOURCE_GROUP}`
Count Databases With High CPU Usage in resource group `${AZURE_RESOURCE_GROUP}`
Count Databases With High Memory Usage in resource group `${AZURE_RESOURCE_GROUP}`
Count Redis Caches With High Cache Miss Rate in resource group `${AZURE_RESOURCE_GROUP}`
Count Databases With Health Issues in resource group `${AZURE_RESOURCE_GROUP}`
Count Risky Database Configuration Changes in resource group `${AZURE_RESOURCE_GROUP}`
Generate Health Score

RunWhen Platform Azure ACR Image Sync

3 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-workspace-utils

Determines if any RunWhen CodeCollection or private runner components require image updates.

Tasks:

Check for CodeCollection Updates against ACR Registry `${REGISTRY_NAME}`
Check for RunWhen Local Image Updates against ACR Registry `${REGISTRY_NAME}`
Count Images Needing Update and Push Metric

RunWhen Local Helm Update Check (ACR)

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-workspace-utils

Determines if any RunWhen Local images have available updates in the private Azure Container Registry service.

Tasks:

Check for Available RunWhen Helm Images in ACR Registry `${REGISTRY_NAME}`

GitLab Get Repo Latency

1 Troubleshooting Commands

Contributed by Vui Le

Codecollection: rw-public-codecollection

Check GitLab latency by getting a list of repo names.

Tasks:

Check GitLab Latency With Get Repos

gRPC cURL Unary

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

A gRPC curl SLI for querying and extracting data from a generic grpcurl call.

Tasks:

Run gRPCurl Command and Push Metric

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Checks an Artifactory instance health endpoint to determine its operational status. The response is parsed to determine if the service is healthy, resulting in a metric of 1 if it is, or 0 if not.

Tasks:

Check If Artifactory Endpoint Is Healthy

Prometheus Query (Instant) Metric

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Run a PromQL query against Prometheus instant query API, perform a provided transform, and return the result.

Tasks:

Querying Prometheus Instance And Pushing Aggregated Data

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Retrieve aggregate data via the kubectl top command.

Tasks:

Running Kubectl Top And Extracting Metric Data

Twitter Query Handle

1 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection

Queries Twitter to count the number of tweets within a specified time range for a specific user handle.

Tasks:

Query Twitter

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Retrieve the result of an AWS CloudWatch Metrics Insights query.

Tasks:

Running CloudWatch Metric Query And Pushing The Result

GCP GCloud Generic Metric

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Run arbitrary gcloud commands and parse their output for arbitrary values such as json to be submitted as a metric.

Tasks:

Run Gcloud CLI Command and Push metric

Cortex Metrics Ingester Health

1 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection

Uses kubectl to query the state of an ingester ring and determine if it's healthy. Returns 1 if healthy, 0 if unhealthy.

Tasks:

Determine Cortex Ingester Ring Health

1 Troubleshooting Commands

Contributed by Vui Le

Codecollection: rw-public-codecollection

Check DNS latency for Google Resolver.

Tasks:

Check DNS latency for Google Resolver

cURL Generic Metric

1 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection

A curl SLI for querying and extracting data from a generic curl call. Supports jq. Should produce a single metric.

Tasks:

Run Curl Command and Push Metric

REST Metric (Explicit OAuth2 with Bearer Token)

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

A REST SLI for querying and extracting data from a REST endpoint that needs an explicit OAuth2 flow, where an access token must be acquired with a bearer token.

Tasks:

Request Data From Rest Endpoint

MongoDB Health (GCP PromQL)

8 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection

Uses PromQL on the Ops Suite API to determine the health of a MongoDB database instance and pushes the result as an SLI metric. Produces a 1 for a healthy resource, or 0 for an unhealthy resource.

Tasks:

Get Access Token
Get Instance Status
Get Connection Utilization Rate
Get MongoDB Member State Health
Get MongoDB Replication Lag
Get MongoDB Queue Size
Get Assertion Rate
Generate MongoDB Score

Kubernetes Patroni Health Check

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Uses kubectl (or equivalent) to query the state of a patroni cluster and determine if it's healthy.

Tasks:

Determine Patroni Health

Kubernetes API Server Health

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Check the health of a Kubernetes API server using kubectl. Returns 1 when OK, or a 0 in the case of an unhealthy API server.

Tasks:

Running Kubectl Check Against API Server

GCP Service Status

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

This codebundle sets up a monitor for a specific region and GCP Product, which is then periodically checked for ongoing incidents based on the history available at https://status.cloud.google.com/incidents.json filtered based on severity level.

Tasks:

Get Number of GCP Incidents Affecting My Workspace

HashiCorp Vault Health

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Check the health of a Vault server. The response code is used to determine if the service is healthy, resulting in a metric of 1 if it is, or 0 if not.

Tasks:

Check If Vault Endpoint Is Healthy

Cert-manager Expirations

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Retrieve number of expired TLS certificates managed by cert-manager within a given window. The metric pushed is the number of certs within the configured expiration window.

Tasks:

Inspect Certification Expiration Dates

REST Metric (Explicit OAuth2 with BasicAuth)

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

A REST SLI for querying and extracting data from a REST endpoint that needs an explicit OAuth2 flow, where the token acquisition is handled using basic auth.

Tasks:

Request Data From Rest Endpoint

GitHub Status Incidents

1 Troubleshooting Commands

Contributed by Paul Dittaro

Codecollection: rw-public-codecollection

Checks for unresolved incidents related to GitHub services and provides a count of ongoing incidents as a metric.

Tasks:

Get Number of Incidents Affecting GitHub

AWS Organization Accounts

1 Troubleshooting Commands

Contributed by Vui Le

Codecollection: rw-public-codecollection

Retrieve the count of all AWS accounts in an organization.

Tasks:

Get Count Of AWS Accounts In Organization

1 Troubleshooting Commands

Contributed by Vui Le

Codecollection: rw-public-codecollection

Check Grafana server health.

Tasks:

Check Grafana Server Health

GCP Operations Suite Log Query

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Retrieve the number of results of a GCP Log Explorer query.

Tasks:

Running GCE Logging Query And Pushing Result Count Metric

AWS Billing Period Costs by Tag

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Monitors AWS cost and usage data for the latest billing period. Accepts one tag for continuous monitoring.

Tasks:

Get All Billing Sliced By Tags

Kubernetes Namespace Healthcheck

4 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection

This SLI uses kubectl to score namespace health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, events, and pods not ready.

Tasks:

Get Event Count and Score
Get Container Restarts and Score
Get NotReady Pods
Generate Namespace Score

Kong Ingress Health (GCP PromQL)

5 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection

Uses PromQL on the Ops Suite API to determine the health of a Kong managed ingress resource and pushes the result as an SLI metric. Produces a 1 for a healthy resource, or 0 for an unhealthy resource.

Tasks:

Get Access Token
Get HTTP Error Rate
Get Upstream Health
Get Request Latency Rate
Generate Kong Ingress Score

GitHub Actions Workflow Timing

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Monitors the average timing of a GitHub Actions workflow file within a repo and returns the average runtime in minutes.

Tasks:

Get Average Run Time For Workflow

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Measure HTTP latency against a given URL. The returned metric is the number of seconds the request took as a float value.

Tasks:

Check HTTP Latency to Well Known URL

AWS CloudWatch Log Query (Pass/Fail)

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Retrieve binary result from an AWS CloudWatch Insights query. Pushes 0 (success) if logs are found (activity) or 1 if no logs were found in the time window.

Tasks:

Running CloudWatch Log Query And Pushing 1 If No Results Found

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Fetch the results of a Datadog metric timeseries and push the extracted value as an SLI metric.

Tasks:

Query Datadog Metrics

ElasticSearch Health

1 Troubleshooting Commands

Contributed by Vui Le

Codecollection: rw-public-codecollection

Check Elasticsearch cluster health

Tasks:

Check Elasticsearch Cluster Health

Uptime.com Component Health

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Check the status of an Uptime.com component for a given site. It compares the operational state of the component with the list of allowed states, resulting in a 1 when acceptable, and 0 when not.

Tasks:

Check If Vault Endpoint Is Healthy

Kubernetes Workload Metric

1 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection

This codebundle runs a kubectl get command that produces a value and pushes the metric. Uses jmespath for filtering and allows calculations such as count, sum, avg on specified fields.

Tasks:

Running Kubectl get and push the metric

AWS CloudWatch Tag Metric Query

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Retrieve aggregate results from multiple AWS CloudWatch Metrics Insights queries run against tagged resources. This codebundle fetches a list of instance IDs filtered by tags, uses them to run a set of AWS metric queries against the CloudWatch Metrics Insights API, and pushes an aggregated/transformed value provided by the API as a metric.

Tasks:

Run CloudWatch Metric Query Across Set Of IDs And Push Metric

Kubernetes Event Query

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Returns the number of events with matching messages as an SLI metric.

Tasks:

Get Number Of Matching Events

AWS CloudFormation Event Rate

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Retrieve the number of detected AWS CloudFormation stack events over a given history

Tasks:

Fetch CloudFormation Stack Events

Cert-Manager Health Check

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Check the health of pods deployed by cert-manager.

Tasks:

Health Check cert-manager Pods

GCP Operations Suite Metric Query

1 Troubleshooting Commands

Contributed by

Codecollection: rw-public-codecollection

Performs a metric query using a Google MQL statement on the Ops Suite API and pushes the result as an SLI metric.

Tasks:

Running GCP OpsSuite Metric Query

Datadog System Load

1 Troubleshooting Commands

Contributed by Vui Le

Codecollection: rw-public-codecollection

Retrieve a Datadog instance's "System Load" metric.

Tasks:

Check Datadog System Load

Sysdig Monitor PromQL Metric

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Queries the Sysdig data API with a PromQL query to fetch metric data.

Tasks:

Querying PromQL Endpoint And Pushing Metric Data

REST Metric (Basic Auth)

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

A general purpose REST SLI for querying and extracting data from a REST endpoint that uses a basic auth flow.

Tasks:

Request Data From Rest Endpoint

GitLab Availability

1 Troubleshooting Commands

Contributed by Vui Le

Codecollection: rw-public-codecollection

Check availability of a GitLab server.

Tasks:

Check GitLab Server Status

Ping Host Availability

1 Troubleshooting Commands

Contributed by Vui Le

Codecollection: rw-public-codecollection

Ping a host and retrieve packet loss percentage.

Tasks:

Ping host and collect packet loss percentage

Kubernetes Patroni Lag Metric

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Measures the maximum replica lag across a Patroni cluster.

Tasks:

Measure Patroni Member Lag

AWS CloudWatch Log Query (Total Count)

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Retrieve number of results from an AWS CloudWatch Insights query.

Tasks:

Running CloudWatch Log Query And Pushing The Count Of Results

Kubernetes Daemonset Health Check

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Checks that the current state of a daemonset is healthy and returns a score of either 1 (healthy) or 0 (unhealthy).

Tasks:

Health Check Daemonset

GitHub Status Maintenance

1 Troubleshooting Commands

Contributed by Paul Dittaro

Codecollection: rw-public-codecollection

Retrieve the number of upcoming GitHub platform maintenances over a given window.

Tasks:

Get Scheduled and Active GitHub Maintenance Windows

Prometheus Query (Range) Metric

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Run a PromQL query against Prometheus range query API, perform a provided transform, and return the result.

Tasks:

Querying Prometheus Instance And Pushing Aggregated Data

1 Troubleshooting Commands

Contributed by Vui Le

Codecollection: rw-public-codecollection

Check health of Pingdom platform.

Tasks:

Check Pingdom Health

SLI Alert Threshold

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

An SLI that monitors another SLI submitting a 0-1 health score and immediately triggers a taskset when that health score falls below a threshold. When this SLI detects a rate below the threshold, it submits a 1 to denote that a signal was sent, then returns to 0 once the monitored SLI is healthy.

Tasks:

Check If SLI Within Incident Threshold
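
The signal is inverted relative to the monitored health score, which a minimal sketch can make concrete. The function name `alert_signal` and the strict "less than" comparison are assumptions for illustration; the codebundle may evaluate the rate over a window or compare differently.

```python
def alert_signal(monitored_score: float, threshold: float) -> int:
    """Return 1 when the monitored SLI is below the threshold, else 0.

    Hypothetical helper; the comparison rule is an assumption.
    """
    if not 0.0 <= monitored_score <= 1.0:
        raise ValueError("monitored_score is expected to be a 0-1 health score")
    # 1 means "a signal was sent"; 0 means the monitored SLI is healthy.
    return 1 if monitored_score < threshold else 0
```

With a threshold of 0.8, a monitored score of 0.5 would produce a 1 and a score of 0.95 would produce a 0.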

Jira Search Issue Latency

1 Troubleshooting Commands

Contributed by Vui Le

Codecollection: rw-public-codecollection

Check Jira latency when searching issues by current user.

Tasks:

Search Jira Issues By Current User

GitHub Service Status

1 Troubleshooting Commands

Contributed by Paul Dittaro

Codecollection: rw-public-codecollection

Check the status of the GitHub platform (https://www.githubstatus.com/) for a specified set of GitHub service components. The metric supplied is an aggregated percentage indicating the availability of the components, with 1 = 100% available.

Tasks:

Get Availability of GitHub or Individual GitHub Components
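
One plausible way to aggregate component availability is the fraction of selected components that report as operational. The sketch below assumes each component is a dict with a `status` field whose healthy value is `"operational"`; the function name `availability_score` and this aggregation rule are assumptions, not the codebundle's confirmed logic.

```python
from typing import Dict, Iterable


def availability_score(components: Iterable[Dict[str, str]]) -> float:
    """Return the fraction of components whose status is "operational".

    Hypothetical helper; the input shape and the rule are assumptions.
    """
    statuses = [component["status"] for component in components]
    if not statuses:
        raise ValueError("at least one component is required")
    operational = sum(1 for status in statuses if status == "operational")
    # 1.0 means every selected component is fully available.
    return operational / len(statuses)
```

Under this rule, two components with one in an outage would give 0.5.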

Kubernetes Synthetic PVC Test

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Creates an ad hoc one-shot job that mounts a PVC as a canary test; the job is polled for success before being torn down.

Tasks:

Run Canary Job

GCP Operations Suite Prometheus Query

1 Troubleshooting Commands

Contributed by Shea Stewart

Codecollection: rw-public-codecollection

Performs a metric query using a PromQL statement on the Ops Suite API and pushes the result as an SLI metric.

Tasks:

Run Prometheus Instant Query Against Google Prom API Endpoint

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

A general purpose REST SLI for querying and extracting data from a REST endpoint that uses an implicit OAuth2 flow.

Tasks:

Request Data From Rest Endpoint

GitHub API Latency

1 Troubleshooting Commands

Contributed by Vui Le

Codecollection: rw-public-codecollection

Check GitHub latency by getting a list of repo names.

Tasks:

Check GitHub Latency With Get Repos

Sysdig Monitor Metric

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Queries the Sysdig data API to fetch metric data.

Tasks:

Query Sysdig Metric Data And Pushing Metric

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Check if an HTTP request against a URL fails or times out of a given latency window. A return of 1 is considered a success, while a 0 is failure.

Tasks:

Checking HTTP URL Is Available And Timely

1 Troubleshooting Commands

Contributed by Jonathan Funk

Codecollection: rw-public-codecollection

Runs a postgres SQL query and pushes the returned query result as an SLI metric. During execution, the SQL query should be passed to a Kubernetes workload that has access to the psql binary. The workload will run the query and return the result from stdout.

Tasks:

Run Postgres Query And Return Result As Metric

Azure AKS Triage

4 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Generates a composite score for the health of an AKS cluster using the AZ CLI. Returns 1 if all checks pass, 0 if they all fail, and a value between 0 and 1 for partial success. Checks the upstream service for reported errors, looks for Critical or Error activities within a specified time period, and checks the overall configuration for provisioning failures.

Tasks:

Check for Resource Health Issues Affecting AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}`
Fetch Activities for AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}`
Check Configuration Health of AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}`
Generate AKS Cluster Health Score
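
A composite score with these endpoints (1 when everything passes, 0 when everything fails) can be sketched as the unweighted mean of binary check results. Equal weighting and the name `composite_score` are assumptions; the actual codebundle may weight its checks differently.

```python
from typing import Sequence


def composite_score(check_results: Sequence[bool]) -> float:
    """Average pass/fail check results into a 0-1 health score.

    Hypothetical helper; equal weighting of checks is an assumption.
    """
    if not check_results:
        raise ValueError("at least one check result is required")
    # True counts as 1 and False as 0, so the mean is the pass fraction.
    return sum(check_results) / len(check_results)
```

Three checks with one failure would give a score of two thirds under this sketch.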

Outdated Azure Container Registry Image Count

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This CodeBundle counts the number of outdated container images from a configured list. It compares upstream images with those in the registry and counts the ones that are outdated.

Tasks:

Count Outdated Images in Azure Container Registry `${ACR_REGISTRY}`

Kubernetes cert-manager Healthcheck

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Counts the number of unhealthy cert-manager managed certificates in a namespace.

Tasks:

Count Unready and Expired Certificates in Namespace `${NAMESPACE}`

AWS Lambda Health Monitor

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Monitor AWS Lambda Invocation Errors

Tasks:

Analyze AWS Lambda Invocation Errors in Region `${AWS_REGION}`

Azure DNS Health Metrics (Multi-Zone)

5 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This SLI measures DNS health metrics for Azure environments including resolution success rates, latency measurements, private DNS zone health, and external DNS resolver availability. Provides binary scoring (0/1) for each metric and calculates an overall DNS health score. Supports multiple FQDNs, private/public DNS zones, forward lookup zones, and external resolver testing.

Tasks:

DNS Resolution Success Rate
DNS Query Latency
Private DNS Zone Health
External DNS Resolver Availability
Generate DNS Health Score

GitHub Actions Health SLI

7 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Service Level Indicators for GitHub Actions Health Monitoring

Tasks:

Calculate Workflow Success Rate Across Specified Repositories
Calculate Organization Health Score Across Specified Organizations
Calculate Runner Availability Score Across Specified Organizations
Calculate Security Workflow Score Across Specified Repositories
Calculate Performance Score Across Specified Repositories
Calculate API Rate Limit Health Score
Generate Overall GitHub Actions Health Score

Azure ACR Health SLI

7 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Calculates Azure ACR health by checking reachability, SKU, pull/push ratio, and storage utilization.

Tasks:

Check ACR Reachability for Registry `${ACR_NAME}`
Check ACR Usage SKU Metric for Registry `${ACR_NAME}`
Check ACR Pull/Push Success Ratio for Registry `${ACR_NAME}`
Check ACR Storage Utilization for Registry `${ACR_NAME}`
Check ACR Network Configuration for Registry `${ACR_NAME}`
Check ACR Security Configuration
Generate Comprehensive ACR Health Score for Registry `${ACR_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`

Azure App Service Triage

6 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Queries the health status of an App Service, and returns 0 when it's not healthy, and 1 when it is.

Tasks:

Check for Resource Health Issues Affecting App Service `${APP_SERVICE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
Check App Service `${APP_SERVICE_NAME}` Health Check Metrics In Resource Group `${AZ_RESOURCE_GROUP}`
Check App Service `${APP_SERVICE_NAME}` Configuration Health In Resource Group `${AZ_RESOURCE_GROUP}`
Check Deployment Health of App Service `${APP_SERVICE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
Fetch App Service `${APP_SERVICE_NAME}` Activities In Resource Group `${AZ_RESOURCE_GROUP}`
Generate App Service Health Score for `${APP_SERVICE_NAME}` in resource group `${AZ_RESOURCE_GROUP}`

GCP Node Prempt List

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Counts nodes that have been preempted within the defined time interval.

Tasks:

Count the number of nodes in active preempt operation in project `${GCP_PROJECT_ID}`

Kubernetes Postgres Healthcheck

3 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Runs multiple Kubernetes and psql commands to report on the health of a postgres cluster. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Checks for database lag and backup health.

Tasks:

Check Patroni Database Lag in Namespace `${NAMESPACE}` on Host `${HOSTNAME}` using `patronictl`
Check Database Backup Status for Cluster `${OBJECT_NAME}` in Namespace `${NAMESPACE}`
Generate Namespace Score for Namespace `${NAMESPACE}`

Kubernetes Fluxcd Reconciliation Monitor

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Measures failing reconciliations for fluxcd

Tasks:

Health Check Flux Reconciliation

GCP Cloud Function Health

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Count the number of Cloud Functions in an unhealthy state for a GCP Project.

Tasks:

Count unhealthy GCP Cloud Functions in GCP Project `${GCP_PROJECT_ID}`

Azure App Service Plan

3 Troubleshooting Commands

Contributed by saurabh3460

Codecollection: rw-cli-codecollection

Checks Azure App Service Plan health by identifying availability issues and high capacity usage.

Tasks:

Count App Service Plans with Health Status of `Available` in resource group `${AZURE_RESOURCE_GROUP}`
Count App Service Plans with High Capacity Usage in resource group `${AZURE_RESOURCE_GROUP}`
Generate Health Score

Kubernetes Workload Stacktrace Health SLI

2 Troubleshooting Commands

Contributed by akshayrw25

Codecollection: rw-cli-codecollection

This SLI monitors stacktrace health in kubernetes workload application logs. Produces a value between 0 (stacktraces detected) and 1 (no stacktraces found). Focuses specifically on application error detection through stacktrace analysis.

Tasks:

Get Stacktrace Health Score for ${WORKLOAD_TYPE} `${WORKLOAD_NAME}`
Generate Stacktrace Health Score for `${WORKLOAD_NAME}`

Kubernetes Cluster Node Health

2 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Evaluate cluster node health using kubectl.

Tasks:

Check for Node Restarts in Cluster `${CONTEXT}`
Generate Namespace Score in Kubernetes Cluster `${CONTEXT}`

Azure Virtual Machine Scale Set Health

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Checks VM Scale Set key metrics and returns a 1 when healthy, or 0 when not healthy.

Tasks:

Check Scale Set `${VMSCALESET}` Key Metrics In Resource Group `${AZ_RESOURCE_GROUP}`

Azure Service Bus Health

4 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Performs a health check on Azure Service Bus instances and the components using them, generating a report of issues and next steps.

Tasks:

Check for Resource Health Issues Service Bus `${SB_NAMESPACE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
Check Basic Connectivity for Service Bus `${SB_NAMESPACE_NAME}`
Check Critical Metrics for Service Bus `${SB_NAMESPACE_NAME}`
Generate Enhanced Service Bus Health Score

Kubernetes Tail Application Logs

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Measures the number of exception stacktraces present in an application's logs over a time period.

Tasks:

Tail `${CONTAINER_NAME}` Application Logs For Stacktraces

Kubernetes Cluster Resource Health

3 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Counts the number of nodes above 90% CPU or Memory Utilization from kubectl top.

Tasks:

Identify High Utilization Nodes for Cluster `${CONTEXT}`
Identify Pods with Resource Limits Exceeding Node Capacity in Cluster `${CONTEXT}`
Generate Cluster Resource Health Score
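
As a rough illustration of the count described above, the following sketch parses text in the column layout printed by `kubectl top nodes` and counts nodes whose CPU% or MEMORY% exceeds 90. The sample output, node names, and the function name are hypothetical, and this is not taken from the codebundle itself.

```python
# Hypothetical sample in the column layout of `kubectl top nodes`.
SAMPLE = """\
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-a   1900m        95%    6000Mi          40%
node-b   300m         15%    14000Mi         93%
node-c   500m         25%    3000Mi          20%
"""


def count_high_utilization(top_output: str, threshold: int = 90) -> int:
    """Count nodes whose CPU% or MEMORY% is above the threshold."""
    count = 0
    # Skip the header row, then read the percentage columns.
    for line in top_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 5:
            continue
        try:
            cpu_pct = int(fields[2].rstrip("%"))
            mem_pct = int(fields[4].rstrip("%"))
        except ValueError:
            # Rows without numeric metrics are ignored in this sketch.
            continue
        if cpu_pct > threshold or mem_pct > threshold:
            count += 1
    return count


print(count_high_utilization(SAMPLE))  # 2
```

Here node-a exceeds the threshold on CPU and node-b on memory, so the count is 2.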

7 Troubleshooting Commands

Contributed by saurabh3460

Codecollection: rw-cli-codecollection

Check Jenkins health, failed builds, tests and long running builds

Tasks:

Check For Failed Build Logs in Jenkins Instance `${JENKINS_INSTANCE_NAME}`
Check For Long Running Builds in Jenkins Instance `${JENKINS_INSTANCE_NAME}`
Check For Recent Failed Tests in Jenkins Instance `${JENKINS_INSTANCE_NAME}`
Check For Jenkins Instance `${JENKINS_INSTANCE_NAME}` Health
Check For Long Queued Builds in Jenkins Instance `${JENKINS_INSTANCE_NAME}`
Check Jenkins Executor Utilization in Jenkins Instance `${JENKINS_INSTANCE_NAME}`
Generate Health Score

GCP Storage Bucket Health

4 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This SLI uses the GCP API or gcloud to score bucket health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for usage above a threshold and public buckets.

Tasks:

Fetch GCP Bucket Storage Utilization for `${PROJECT_IDS}`
Check GCP Bucket Security Configuration for `${PROJECT_IDS}`
Fetch GCP Bucket Storage Operations Rate for `${PROJECT_IDS}`
Generate Bucket Score in Project `${PROJECT_IDS}`

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This taskset uses curl to validate the response code of the endpoint. Returns a score of 1 if healthy, and 0 if unhealthy.

Tasks:

Validate HTTP URL Availability and Timeliness for ${URL}
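
The binary scoring for this entry can be sketched as a pure function over an already-measured response. This is an assumption-laden illustration: the expected status code, the latency limit, and the function and parameter names are hypothetical, since the listing only says that the response code is validated and that the task covers availability and timeliness.

```python
def http_health_score(
    status_code: int,
    latency_seconds: float,
    expected_status: int = 200,        # hypothetical default
    max_latency_seconds: float = 1.0,  # hypothetical default
) -> int:
    """Return 1 when the response is both available and timely, else 0."""
    available = status_code == expected_status
    timely = latency_seconds <= max_latency_seconds
    return 1 if available and timely else 0


print(http_health_score(200, 0.25))  # 1
print(http_health_score(200, 3.0))   # 0
print(http_health_score(503, 0.25))  # 0
```

Keeping the scoring separate from the request itself makes the rule easy to check without network access.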

Kubernetes FluxCD Kustomization Health

3 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This codebundle checks for unhealthy or suspended FluxCD Kustomization objects.

Tasks:

List Suspended FluxCD Kustomization objects in Namespace `${NAMESPACE}` in Cluster `${CONTEXT}`
List Unready FluxCD Kustomizations in Namespace `${NAMESPACE}` in Cluster `${CONTEXT}`
Generate FluxCD Kustomization Health Score for Namespace `${NAMESPACE}` in Cluster `${CONTEXT}`

GCP Vertex AI Model Garden Health SLI

7 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Calculates SLI for GCP Vertex AI Model Garden health using Google Cloud Monitoring Python SDK.

Required IAM Roles:
- roles/monitoring.viewer (for metrics access)
- roles/logging.privateLogViewer (for quick log health check)

Required Permissions:
- monitoring.timeSeries.list
- logging.privateLogEntries.list

Tasks:

Quick Vertex AI Log Health Check for `${GCP_PROJECT_ID}`
Calculate Error Rate Score for `${GCP_PROJECT_ID}`
Calculate Latency Performance Score for `${GCP_PROJECT_ID}`
Calculate Throughput Usage Score for `${GCP_PROJECT_ID}`
Discover All Deployed Models for `${GCP_PROJECT_ID}`
Check Service Availability Score for `${GCP_PROJECT_ID}`
Generate Final Vertex AI Model Garden Health Score for `${GCP_PROJECT_ID}`

Azure Function App Triage

6 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Queries the health status of a Function App, and returns 0 when it's not healthy, and 1 when it is.

Tasks:

Check for Resource Health Issues Affecting Function App `${FUNCTION_APP_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
Check Function App `${FUNCTION_APP_NAME}` Health Check Metrics In Resource Group `${AZ_RESOURCE_GROUP}`
Check Function App `${FUNCTION_APP_NAME}` Configuration Health In Resource Group `${AZ_RESOURCE_GROUP}`
Check Deployment Health of Function App `${FUNCTION_APP_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
Fetch Function App `${FUNCTION_APP_NAME}` Activities In Resource Group `${AZ_RESOURCE_GROUP}`
Generate Function App Health Score for `${FUNCTION_APP_NAME}` in resource group `${AZ_RESOURCE_GROUP}`

Azure Application Gateway Health

7 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Queries the health of an Azure Application Gateway, returning 1 when it's healthy and 0 when it's unhealthy.

Tasks:

Check for Resource Health Issues Affecting Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
Check Configuration Health of Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
Check Backend Pool Health for Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
Fetch Metrics for Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
Check SSL Certificate Health for Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
Check Logs for Errors with Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`
Generate Application Gateway Health Score

Azure APIM Health

7 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Runs diagnostic checks to check the health of APIM instances

Tasks:

Check for Resource Health Issues Affecting APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`
Fetch Key Metrics for APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`
Check Logs for Errors with APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`
Verify APIM Policy Configurations for `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`
Check APIM SSL Certificates for `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`
Inspect Dependencies and Related Resources for APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`
Generate APIM Health Score

Azure VM Health SLI

5 Troubleshooting Commands

Contributed by Nbarola

Codecollection: rw-cli-codecollection

Calculates Azure VM health by checking disk, memory, uptime, and patch status.

Tasks:

Check Disk Utilization for VMs in Resource Group `${AZ_RESOURCE_GROUP}`
Check Memory Utilization for VMs in Resource Group `${AZ_RESOURCE_GROUP}`
Check Uptime for VMs in Resource Group `${AZ_RESOURCE_GROUP}`
Check Last Patch Status for VMs in Resource Group `${AZ_RESOURCE_GROUP}`
Generate Comprehensive VM Health Score

Azure Key Vault Health

7 Troubleshooting Commands

Contributed by saurabh3460

Codecollection: rw-cli-codecollection

Counts Azure Key Vault health by checking availability metrics, configuration settings, expiring items (secrets/certificates/keys), log issues, and performance metrics

Tasks:

Count Key Vault Resource Health in resource group `${AZURE_RESOURCE_GROUP}`
Count Key Vault Availability in resource group `${AZURE_RESOURCE_GROUP}`
Count Key Vault configuration in resource group `${AZURE_RESOURCE_GROUP}`
Count Expiring Key Vault Items in resource group `${AZURE_RESOURCE_GROUP}`
Count Key Vault Log Issues in resource group `${AZURE_RESOURCE_GROUP}`
Count Key Vault Performance Metrics in resource group `${AZURE_RESOURCE_GROUP}`
Generate Comprehensive Key Vault Health Score

GitHub Actions Artifact Analysis

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This SLI fetches the latest GitHub Actions workflow run artifact and pushes a metric based on a user-provided command.

Tasks:

Analyze artifact from GitHub Workflow `${WORKFLOW_NAME}` in repository `${GITHUB_REPO}` and push metric

Kubernetes Istio System Health

8 Troubleshooting Commands

Contributed by Nbarola

Codecollection: rw-cli-codecollection

Checks Istio proxy sidecar injection status, high memory and CPU usage, warnings and errors in logs, certificate validity, and configuration, and verifies the Istio installation.

Tasks:

Verify Istio Sidecar Injection for Cluster `${CONTEXT}`
Check Istio Sidecar Resource Usage for Cluster `${CONTEXT}`
Validate Istio Installation in Cluster `${CONTEXT}`
Check Istio Controlplane Logs For Errors in Cluster `${CONTEXT}`
Fetch Istio Proxy Logs in Cluster `${CONTEXT}`
Verify Istio SSL Certificates in Cluster `${CONTEXT}`
Check Istio Configuration Health in Cluster `${CONTEXT}`
Generate Health Score for Cluster ${CONTEXT}

AWS EKS Health Scan

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Monitors the status of EKS / Fargate in the given AWS region.

Tasks:

Check Amazon EKS Cluster Health Status in AWS Region `${AWS_REGION}`

Azure Data factories Health

6 Troubleshooting Commands

Contributed by saurabh3460

Codecollection: rw-cli-codecollection

Azure Data Factories health checks including resource health status, frequent pipeline errors, failed pipeline runs, and large data operations monitoring.

Tasks:

Identify Health Issues Affecting Data Factories in resource group `${AZURE_RESOURCE_GROUP}`
Count Frequent Pipeline Errors in Data Factories in resource group `${AZURE_RESOURCE_GROUP}`
Count Failed Pipelines in Data Factories in resource group `${AZURE_RESOURCE_GROUP}`
Count Large Data Operations in Data Factories in resource group `${AZURE_RESOURCE_GROUP}`
Count Long Running Pipeline Runs in Data Factories in resource group `${AZURE_RESOURCE_GROUP}`
Generate Health Score

Kubernetes Application Monitor

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Measures the number of exception stacktraces present in an application's logs over a time period.

Tasks:

Measure Application Exceptions in `${NAMESPACE}`

AWS ElastiCache Health Monitor

1 Troubleshooting Commands

Contributed by jon-funk

Codecollection: rw-cli-codecollection

Monitors the health status of ElastiCache Redis in the AWS region.

Tasks:

Scan ElastiCaches in AWS Region `${AWS_REGION}`

GKE Cluster Health

6 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

Identifies issues affecting GKE Clusters in a GCP Project and creates a health score. A score of 1 is healthy; a score between 0 and 1 indicates unhealthy components.

Tasks:

Identify GKE Service Account Issues in GCP Project `${GCP_PROJECT_ID}`
Fetch GKE Recommendations for GCP Project `${GCP_PROJECT_ID}`
Fetch GKE Cluster Health for GCP Project `${GCP_PROJECT_ID}`
Check for Quota Related GKE Autoscaling Issues in GCP Project `${GCP_PROJECT_ID}`
Quick Node Instance Group Health Check for GCP Project `${GCP_PROJECT_ID}`
Generate GKE Cluster Health Score

Kubernetes Namespace Healthcheck

4 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This SLI uses kubectl to score namespace health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, events, and pods not ready.

Tasks:

Get Error Event Count within ${EVENT_AGE} and calculate Score
Get Container Restarts and Score in Namespace `${NAMESPACE}`
Get NotReady Pods in `${NAMESPACE}`
Generate Namespace Score in `${NAMESPACE}`

Kubernetes Persistent Volume Healthcheck

2 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This SLI collects information about storage such as PersistentVolumes and PersistentVolumeClaims and generates an aggregated health score for the namespace. 1 = Healthy, 0 = Failed, >0 <1 = Degraded

Tasks:

Fetch the Storage Utilization for PVC Mounts in Namespace `${NAMESPACE}`
Generate Namespace Score for Namespace `${NAMESPACE}`

Kubernetes Deployment Healthcheck

6 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This SLI uses kubectl to score deployment health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, critical log errors, pods not ready, deployment status, and recent events.

Tasks:

Get Container Restarts and Score for Deployment `${DEPLOYMENT_NAME}`
Get Critical Log Errors and Score for Deployment `${DEPLOYMENT_NAME}`
Get NotReady Pods Score for Deployment `${DEPLOYMENT_NAME}`
Get Deployment Replica Status and Score for `${DEPLOYMENT_NAME}`
Get Recent Warning Events Score for `${DEPLOYMENT_NAME}`
Generate Deployment Health Score for `${DEPLOYMENT_NAME}`

Kubernetes Labeled Pod Count

1 Troubleshooting Commands

Contributed by stewartshea

Codecollection: rw-cli-codecollection

This codebundle fetches the number of running pods with the set of provided labels, letting you measure the number of running pods.

Tasks:

Measure Number of Running Pods with Label in `${NAMESPACE}`