All Tasks
AlertManager Webhook Handler |
TaskSet | Run SLX Tasks with matching AlertManager Webhook commonLabels | |
Artifactory OK |
SLI | Check If Artifactory Endpoint Is Healthy | |
AWS Account Creation Notification |
TaskSet | Get The Recently Created AWS Accounts | |
SLI | Get Count Of AWS Accounts In Organization | |
AWS ACM health |
TaskSet | List Unused ACM Certificates in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`; List Expiring ACM Certificates in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`; List Expired ACM Certificates in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`; List Failed Status ACM Certificates in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`; List Pending Validation ACM Certificates in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}` | |
SLI | Check for unused ACM certificates in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`; Check for Expiring ACM certificates in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`; Check for expired ACM certificates in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`; Check for Failed Status ACM Certificates in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`; Check for Pending Validation ACM Certificates in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`; Generate Health Score | |
AWS Billing Period Costs by Tag |
SLI | Get All Billing Sliced By Tags | |
AWS CLI Command |
TaskSet | ${TASK_TITLE} | |
SLI | ${TASK_TITLE} | |
AWS CLI Command with Issue |
TaskSet | ${TASK_TITLE} | |
SLI | ${TASK_TITLE} | |
AWS CloudFormation Event Rate |
SLI | Fetch CloudFormation Stack Events | |
AWS CloudFormation Triage |
TaskSet | Get All Recent Stack Events | |
AWS CloudWatch Log Query (Pass/Fail) |
SLI | Running CloudWatch Log Query And Pushing 1 If No Results Found | |
AWS CloudWatch Log Query (Total Count) |
SLI | Running CloudWatch Log Query And Pushing The Count Of Results | |
AWS CloudWatch Logs health |
TaskSet | List CloudWatch Log Groups Without Retention Period in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`; Check CloudTrail Configuration in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`; Check for CloudTrail integration with CloudWatch Logs in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}` | |
SLI | Check CloudWatch Log Groups Without Retention Period in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`; Check if CloudTrail exists and is configured for multi-region in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`; Check CloudTrail Without CloudWatch Logs in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`; Generate Health Score | |
AWS CloudWatch Metric Query Dashboard |
TaskSet | Get CloudWatch MetricQuery Insights URL | |
AWS CloudWatch Overutilized EC2 Inspection |
TaskSet | Check For Overutilized Ec2 Instances | |
AWS CloudWatch Tag Metric Query |
SLI | Run CloudWatch Metric Query Across Set Of IDs And Push Metric | |
AWS Costs by Tag |
TaskSet | Get All Billing Sliced By Tags | |
AWS EBS Health |
TaskSet | List Unattached EBS Volumes in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`; List Unencrypted EBS Volumes in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`; List Unused EBS Snapshots in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}` | |
SLI | Check Unattached EBS Volumes in `${AWS_REGION}`; Check Unencrypted EBS Volumes in `${AWS_REGION}`; Check Unused EBS Snapshots in `${AWS_REGION}`; Generate EBS Score | |
AWS EC2 Health |
TaskSet | List stale AWS EC2 instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`; List stopped AWS EC2 instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`; List invalid AWS Auto Scaling Groups in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}` | |
SLI | Check for stale AWS EC2 instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`; Check for stopped AWS EC2 instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`; Check for invalid AWS Auto Scaling Groups in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`; Generate Health Score | |
AWS EC2 Security Check |
TaskSet | Check For Untagged instances; Check For Dangling Volumes; Check For Open Routes; Check For Overused Instances; Check For Underused Instances; Check For Underused Volumes; Check For Overused Volumes | |
AWS EKS Cluster Health |
TaskSet | Check EKS Fargate Cluster Health Status in AWS Region `${AWS_REGION}`; Check Amazon EKS Cluster Health Status in AWS Region `${AWS_REGION}`; Monitor EKS Cluster Health in AWS Region `${AWS_REGION}` | |
SLI | Check Amazon EKS Cluster Health Status in AWS Region `${AWS_REGION}` | |
AWS EKS Nodegroup Status Check |
TaskSet | Check EKS Nodegroup Status in `${EKS_CLUSTER_NAME}` | |
AWS ElastiCache Health Check |
TaskSet | Scan AWS Elasticache Redis Status in AWS Region `${AWS_REGION}` | |
SLI | Scan ElastiCaches in AWS Region `${AWS_REGION}` | |
AWS Lambda Health Check |
TaskSet | List Lambda Versions and Runtimes in AWS Region `${AWS_REGION}`; Analyze AWS Lambda Invocation Errors in Region `${AWS_REGION}`; Monitor AWS Lambda Performance Metrics in AWS Region `${AWS_REGION}` | |
SLI | Analyze AWS Lambda Invocation Errors in Region `${AWS_REGION}` | |
AWS network health |
TaskSet | List Publicly Accessible Security Groups in AWS account `${AWS_ACCOUNT_ID}`; List unused Elastic IPs in AWS account `${AWS_ACCOUNT_ID}`; List unused ELBs in AWS account `${AWS_ACCOUNT_ID}`; List VPCs with Flow Logs Disabled in AWS account `${AWS_ACCOUNT_ID}` | |
SLI | Check for publicly accessible security groups in AWS account `${AWS_ACCOUNT_ID}`; Check for unused Elastic IPs in AWS account `${AWS_ACCOUNT_ID}`; Check for unused ELBs in AWS account `${AWS_ACCOUNT_ID}`; Check for VPCs with Flow Logs disabled in AWS account `${AWS_ACCOUNT_ID}`; Generate Health Score | |
AWS RDS health |
TaskSet | List Unencrypted RDS Instances in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`; List Publicly Accessible RDS Instances in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}`; List RDS Instances with Backups Disabled in AWS Region `${AWS_REGION}` in AWS Account `${AWS_ACCOUNT_ID}` | |
SLI | Check for unencrypted RDS instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`; Check for publicly accessible RDS instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`; Check for disabled backup RDS instances in AWS Region `${AWS_REGION}` in AWS account `${AWS_ACCOUNT_ID}`; Generate Health Score | |
AWS S3 Bucket Info Report |
TaskSet | Check AWS S3 Bucket Storage Utilization | |
AWS S3 Health |
TaskSet | List S3 Buckets With Public Access in AWS Account `${AWS_ACCOUNT_NAME}` | |
SLI | Count S3 Buckets With Public Access in AWS Account `${AWS_ACCOUNT_NAME}` | |
AWS S3 Stale Check |
TaskSet | Create Report For Stale Buckets | |
AWS VM Triage |
TaskSet | Get Max VM CPU Utilization In Last 3 Hours; Get Lowest VM CPU Credits In Last 3 Hours; Get Max VM CPU Credit Usage In Last 3 hours; Get Max VM Memory Utilization In Last 3 Hours; Get Max VM Volume Usage In Last 3 Hours | |
aws-cloudwatch-metricquery |
SLI | Running CloudWatch Metric Query And Pushing The Result | |
Azure ACR Image Sync |
TaskSet | Sync Container Images into Azure Container Registry `${ACR_REGISTRY}` | |
SLI | Count Outdated Images in Azure Container Registry `${ACR_REGISTRY}` | |
Azure AKS Triage |
TaskSet | Check for Resource Health Issues Affecting AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Configuration Health of AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Network Configuration of AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}`; Fetch Activities for AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}` | |
SLI | Check for Resource Health Issues Affecting AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}`; Fetch Activities for AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Configuration Health of AKS Cluster `${AKS_CLUSTER}` In Resource Group `${AZ_RESOURCE_GROUP}`; Generate AKS Cluster Health Score | |
Azure APIM Health |
TaskSet | Gather APIM Resource Information for APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Check for Resource Health Issues Affecting APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Fetch Key Metrics for APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Check Logs for Errors with APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Verify APIM Policy Configurations for `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Check APIM SSL Certificates for `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Inspect Dependencies and Related Resources for APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}` | |
SLI | Check for Resource Health Issues Affecting APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Fetch Key Metrics for APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Check Logs for Errors with APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Verify APIM Policy Configurations for `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Check APIM SSL Certificates for `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Inspect Dependencies and Related Resources for APIM `${APIM_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Generate APIM Health Score | |
Azure App Service Operations |
TaskSet | Restart App Service `${APP_SERVICE_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Swap Deployment Slots for App Service `${APP_SERVICE_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Scale Up App Service `${APP_SERVICE_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Scale Down App Service `${APP_SERVICE_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}`; Scale Out Instances for App Service `${APP_SERVICE_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}` by ${SCALE_OUT_FACTOR}x; Scale In Instances for App Service `${APP_SERVICE_NAME}` in Resource Group `${AZ_RESOURCE_GROUP}` to 1/${SCALE_IN_FACTOR}; Redeploy App Service `${APP_SERVICE_NAME}` from Latest Source in Resource Group `${AZ_RESOURCE_GROUP}` | |
Azure App Service Triage |
TaskSet | Check for Resource Health Issues Affecting App Service `${APP_SERVICE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check App Service `${APP_SERVICE_NAME}` Health in Resource Group `${AZ_RESOURCE_GROUP}`; Fetch App Service `${APP_SERVICE_NAME}` Utilization Metrics In Resource Group `${AZ_RESOURCE_GROUP}`; Get App Service `${APP_SERVICE_NAME}` Logs In Resource Group `${AZ_RESOURCE_GROUP}`; Check Configuration Health of App Service `${APP_SERVICE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Deployment Health of App Service `${APP_SERVICE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Fetch App Service `${APP_SERVICE_NAME}` Activities In Resource Group `${AZ_RESOURCE_GROUP}`; Check Logs for Errors in App Service `${APP_SERVICE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}` | |
SLI | Check for Resource Health Issues Affecting App Service `${APP_SERVICE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check App Service `${APP_SERVICE_NAME}` Health Check Metrics In Resource Group `${AZ_RESOURCE_GROUP}`; Check App Service `${APP_SERVICE_NAME}` Configuration Health In Resource Group `${AZ_RESOURCE_GROUP}`; Check Deployment Health of App Service `${APP_SERVICE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Fetch App Service `${APP_SERVICE_NAME}` Activities In Resource Group `${AZ_RESOURCE_GROUP}`; Generate App Service Health Score for `${APP_SERVICE_NAME}` in resource group `${AZ_RESOURCE_GROUP}` | |
Azure Application Gateway Health |
TaskSet | Check for Resource Health Issues Affecting Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Configuration Health of Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Backend Pool Health for Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Fetch Log Analytics for Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Fetch Metrics for Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check SSL Certificate Health for Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Logs for Errors with Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; List Related Azure Resources for Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}` | |
SLI | Check for Resource Health Issues Affecting Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Configuration Health of Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Backend Pool Health for Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Fetch Metrics for Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check SSL Certificate Health for Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Logs for Errors with Application Gateway `${APP_GATEWAY_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Generate Application Gateway Health Score | |
Azure CLI Command |
TaskSet | ${TASK_TITLE} | |
SLI | ${TASK_TITLE} | |
Azure CLI Command with Issue |
TaskSet | ${TASK_TITLE} | |
SLI | ${TASK_TITLE} | |
Azure Data factories Health |
TaskSet | Check for Resource Health Issues Affecting Data Factories in resource group `${AZURE_RESOURCE_GROUP}`; List Frequent Pipeline Errors in Data Factories in resource group `${AZURE_RESOURCE_GROUP}`; List Failed Pipelines in Data Factories in resource group `${AZURE_RESOURCE_GROUP}`; Find Large Data Operations in Data Factories in resource group `${AZURE_RESOURCE_GROUP}`; Fetch Azure Data Factory Details in resource group `${AZURE_RESOURCE_GROUP}`; List Long Running Pipeline Runs in Data Factories in resource group `${AZURE_RESOURCE_GROUP}` | |
SLI | Identify Health Issues Affecting Data Factories in resource group `${AZURE_RESOURCE_GROUP}`; Count Frequent Pipeline Errors in Data Factories in resource group `${AZURE_RESOURCE_GROUP}`; Count Failed Pipelines in Data Factories in resource group `${AZURE_RESOURCE_GROUP}`; Count Large Data Operations in Data Factories in resource group `${AZURE_RESOURCE_GROUP}`; Count Long Running Pipeline Runs in Data Factories in resource group `${AZURE_RESOURCE_GROUP}`; Generate Health Score | |
Azure Database Health |
TaskSet | List Database Availability in resource group `${AZURE_RESOURCE_GROUP}`; List Publicly Accessible Databases in resource group `${AZURE_RESOURCE_GROUP}`; List Databases Without Replication in resource group `${AZURE_RESOURCE_GROUP}`; List Databases Without High Availability in resource group `${AZURE_RESOURCE_GROUP}`; List Databases With High CPU Usage in resource group `${AZURE_RESOURCE_GROUP}`; List All Databases With High Memory Usage in resource group `${AZURE_RESOURCE_GROUP}`; List Redis Caches With High Cache Miss Rate in resource group `${AZURE_RESOURCE_GROUP}`; List Database Resource Health in resource group `${AZURE_RESOURCE_GROUP}` | |
SLI | Score Database Availability in resource group `${AZURE_RESOURCE_GROUP}`; Count Publicly Accessible Databases in resource group `${AZURE_RESOURCE_GROUP}`; Count Databases Without Replication in resource group `${AZURE_RESOURCE_GROUP}`; Count Databases Without High Availability in resource group `${AZURE_RESOURCE_GROUP}`; Count Databases With High CPU Usage in resource group `${AZURE_RESOURCE_GROUP}`; Count Databases With High Memory Usage in resource group `${AZURE_RESOURCE_GROUP}`; Count Redis Caches With High Cache Miss Rate in resource group `${AZURE_RESOURCE_GROUP}`; Count Databases With Health Issues in resource group `${AZURE_RESOURCE_GROUP}`; Generate Health Score | |
Azure Function App Health |
TaskSet | Check for Resource Health Issues Affecting Function App `${FUNCTION_APP_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Function App `${FUNCTION_APP_NAME}` Health in Resource Group `${AZ_RESOURCE_GROUP}`; Fetch Function App `${FUNCTION_APP_NAME}` Plan Utilization Metrics In Resource Group `${AZ_RESOURCE_GROUP}`; Get Function App `${FUNCTION_APP_NAME}` Logs In Resource Group `${AZ_RESOURCE_GROUP}`; Check Configuration Health of Function App `${FUNCTION_APP_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Deployment Health of Function App `${FUNCTION_APP_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Fetch Function App `${FUNCTION_APP_NAME}` Activities In Resource Group `${AZ_RESOURCE_GROUP}`; Check Logs for Errors in Function App `${FUNCTION_APP_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}` | |
SLI | Check for Resource Health Issues Affecting Function App `${FUNCTION_APP_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Function App `${FUNCTION_APP_NAME}` Health Check Metrics In Resource Group `${AZ_RESOURCE_GROUP}`; Check Function App `${FUNCTION_APP_NAME}` Configuration Health In Resource Group `${AZ_RESOURCE_GROUP}`; Check Deployment Health of Function App `${FUNCTION_APP_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Fetch Function App `${FUNCTION_APP_NAME}` Activities In Resource Group `${AZ_RESOURCE_GROUP}`; Generate Function App Health Score for `${FUNCTION_APP_NAME}` in resource group `${AZ_RESOURCE_GROUP}` | |
Azure Internal LoadBalancer Triage |
TaskSet | Check Activity Logs for Azure Load Balancer `${AZ_LB_NAME}` | |
Azure Key Vault Health |
TaskSet | Check Key Vault Resource Health in resource group `${AZURE_RESOURCE_GROUP}` in Subscription `${AZURE_SUBSCRIPTION_NAME}`; Check Key Vault Availability in resource group `${AZURE_RESOURCE_GROUP}` in Subscription `${AZURE_SUBSCRIPTION_NAME}`; Check Key Vault Configuration in resource group `${AZURE_RESOURCE_GROUP}` in Subscription `${AZURE_SUBSCRIPTION_NAME}`; Check Expiring Key Vault Items in resource group `${AZURE_RESOURCE_GROUP}` in Subscription `${AZURE_SUBSCRIPTION_NAME}`; Check Key Vault Logs for Issues in resource group `${AZURE_RESOURCE_GROUP}` in Subscription `${AZURE_SUBSCRIPTION_NAME}`; Check Key Vault Performance Metrics in resource group `${AZURE_RESOURCE_GROUP}` in Subscription `${AZURE_SUBSCRIPTION_NAME}` | |
SLI | Count Key Vault Resource Health in resource group `${AZURE_RESOURCE_GROUP}` in Subscription `${AZURE_SUBSCRIPTION_NAME}`; Count Key Vault Availability in resource group `${AZURE_RESOURCE_GROUP}` in Subscription `${AZURE_SUBSCRIPTION_NAME}`; Count Key Vault configuration in resource group `${AZURE_RESOURCE_GROUP}` in Subscription `${AZURE_SUBSCRIPTION_NAME}`; Count Expiring Key Vault Items in resource group `${AZURE_RESOURCE_GROUP}` in Subscription `${AZURE_SUBSCRIPTION_NAME}`; Count Key Vault Log Issues in resource group `${AZURE_RESOURCE_GROUP}` in Subscription `${AZURE_SUBSCRIPTION_NAME}`; Count Key Vault Performance Metrics in resource group `${AZURE_RESOURCE_GROUP}` in Subscription `${AZURE_SUBSCRIPTION_NAME}`; Generate Comprehensive Key Vault Health Score | |
Azure Monitor Webhook Handler |
TaskSet | Start RunSession From Azure Monitor Webhook Details | |
Azure Service Bus Health |
TaskSet | Check for Resource Health Issues Service Bus `${SB_NAMESPACE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Configuration Health for Service Bus `${SB_NAMESPACE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Metrics for Service Bus `${SB_NAMESPACE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Queue Health for Service Bus `${SB_NAMESPACE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Topic Health for Service Bus `${SB_NAMESPACE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Log Analytics for Service Bus `${SB_NAMESPACE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Capacity and Quota Headroom for Service Bus `${SB_NAMESPACE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Geo-Disaster Recovery for Service Bus `${SB_NAMESPACE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Security Configuration for Service Bus `${SB_NAMESPACE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Discover Related Resources for Service Bus `${SB_NAMESPACE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Test Connectivity to Service Bus `${SB_NAMESPACE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Check Azure Monitor Alerts for Service Bus `${SB_NAMESPACE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}` | |
SLI | Check for Resource Health Issues Service Bus `${SB_NAMESPACE_NAME}` In Resource Group `${AZ_RESOURCE_GROUP}`; Generate Service Bus Health Score | |
Azure Storage Health |
TaskSet | Check Azure Storage Resource Health in resource group `${AZURE_RESOURCE_GROUP}`; List Unused Azure Disks in resource group `${AZURE_RESOURCE_GROUP}`; List Unused Azure Snapshots in resource group `${AZURE_RESOURCE_GROUP}`; List Unused Azure Storage Accounts in resource group `${AZURE_RESOURCE_GROUP}`; List Public Accessible Azure Storage Accounts in resource group `${AZURE_RESOURCE_GROUP}` | |
SLI | Count Azure Storage Accounts with Health Status of `Available` in resource group `${AZURE_RESOURCE_GROUP}`; Count Unused Disks in resource group `${AZURE_RESOURCE_GROUP}`; Count Unused Snapshots in resource group `${AZURE_RESOURCE_GROUP}`; Count Unused Storage Accounts in resource group `${AZURE_RESOURCE_GROUP}`; Count Public Accessible Storage Accounts in resource group `${AZURE_RESOURCE_GROUP}`; Generate Health Score | |
Azure Virtual Machine Health |
TaskSet | Check Azure VM Health in resource group `${AZURE_RESOURCE_GROUP}`; List VMs With Public IP in resource group `${AZURE_RESOURCE_GROUP}`; List for Stopped VMs in resource group `${AZURE_RESOURCE_GROUP}`; List VMs With High CPU Usage in resource group `${AZURE_RESOURCE_GROUP}`; List Underutilized VMs Based on CPU Usage in resource group `${AZURE_RESOURCE_GROUP}`; List VMs With High Memory Usage in resource group `${AZURE_RESOURCE_GROUP}`; List Underutilized VMs Based on Memory Usage in resource group `${AZURE_RESOURCE_GROUP}`; List Unused Network Interfaces in resource group `${AZURE_RESOURCE_GROUP}`; List Unused Public IPs in resource group `${AZURE_RESOURCE_GROUP}` | |
SLI | Check Azure VM Health in resource group `${AZURE_RESOURCE_GROUP}`; Check for VMs With Public IP in resource group `${AZURE_RESOURCE_GROUP}`; Check for VMs With High CPU Usage in resource group `${AZURE_RESOURCE_GROUP}`; Check for Stopped VMs in resource group `${AZURE_RESOURCE_GROUP}`; Check for Underutilized VMs Based on CPU Usage in resource group `${AZURE_RESOURCE_GROUP}`; Check for VMs With High Memory Usage in resource group `${AZURE_RESOURCE_GROUP}`; Check for Underutilized VMs Based on Memory Usage in resource group `${AZURE_RESOURCE_GROUP}`; Check for Unused Network Interfaces in resource group `${AZURE_RESOURCE_GROUP}`; Check for Unused Public IPs in resource group `${AZURE_RESOURCE_GROUP}`; Generate Health Score | |
Azure VM Scale Set Triage |
TaskSet | Check Scale Set `${VMSCALESET}` Key Metrics In Resource Group `${AZ_RESOURCE_GROUP}`; Fetch VM Scale Set `${VMSCALESET}` Config In Resource Group `${AZ_RESOURCE_GROUP}`; Fetch Activities for VM Scale Set `${VMSCALESET}` In Resource Group `${AZ_RESOURCE_GROUP}` | |
SLI | Check Scale Set `${VMSCALESET}` Key Metrics In Resource Group `${AZ_RESOURCE_GROUP}` | |
Cert-manager Expirations |
SLI | Inspect Certification Expiration Dates | |
Cert-Manager Health Check |
SLI | Health Check cert-manager Pods | |
Cortex Metrics Ingester Health |
TaskSet | Fetch Ingestor Ring Member List and Status | |
SLI | Determine Cortex Ingester Ring Health | |
cURL CLI Command |
TaskSet | ${TASK_TITLE} | |
SLI | ${TASK_TITLE} | |
cURL CLI Command with Headers |
TaskSet | ${TASK_TITLE} | |
SLI | ${TASK_TITLE} | |
cURL CLI Command with Issue |
TaskSet | ${TASK_TITLE} | |
SLI | ${TASK_TITLE} | |
cURL CLI Command with Issue and Headers |
TaskSet | ${TASK_TITLE} | |
SLI | ${TASK_TITLE} | |
cURL Generic Report |
TaskSet | Run Curl Command and Add to Report | |
SLI | Run Curl Command and Push Metric | |
cURL HTTP OK |
TaskSet | Check HTTP URL Availability and Timeliness for `${URL}` | |
SLI | Validate HTTP URL Availability and Timeliness for ${URL} | |
Datadog Metric |
SLI | Query Datadog Metrics | |
Datadog System Load |
SLI | Check Datadog System Load | |
Discord Send Message |
TaskSet | Send Chat Message | |
DNS Latency |
SLI | Check DNS latency for Google Resolver | |
Dynatrace Webhook Handler |
TaskSet | Start RunSession From Dynatrace Webhook Details | |
ElasticSearch Health |
SLI | Check Elasticsearch Cluster Health | |
GCP CLI Command |
TaskSet | ${TASK_TITLE} | |
SLI | ${TASK_TITLE} | |
GCP CLI Command with Issue |
TaskSet | ${TASK_TITLE} | |
SLI | ${TASK_TITLE} | |
GCP Cloud Function Health |
TaskSet | List Unhealthy Cloud Functions in GCP Project `${GCP_PROJECT_ID}`; Get Error Logs for Unhealthy Cloud Functions in GCP Project `${GCP_PROJECT_ID}` | |
SLI | Count unhealthy GCP Cloud Functions in GCP Project `${GCP_PROJECT_ID}` | |
GCP GCloud Generic Report |
TaskSet | Run Gcloud CLI Command and Push metric | |
SLI | Run Gcloud CLI Command and Push metric | |
GCP Gcloud Log Inspection |
TaskSet | Inspect GCP Logs For Common Errors in GCP Project `${GCP_PROJECT_ID}` | |
GCP Node Preempt List |
TaskSet | List all nodes in an active preempt operation for GCP Project `${GCP_PROJECT_ID}` within the last `${AGE}` hours | |
SLI | Count the number of nodes in active preempt operation in project `${GCP_PROJECT_ID}` | |
GCP Operations Suite Log Query |
SLI | Running GCE Logging Query And Pushing Result Count Metric | |
GCP Operations Suite Log Query Dashboard URL |
TaskSet | Get GCP Log Dashboard URL For Given Log Query | |
GCP Operations Suite Metric Query |
SLI | Running GCP OpsSuite Metric Query | |
GCP Operations Suite Prometheus Query |
SLI | Run Prometheus Instant Query Against Google Prom API Endpoint | |
GCP Service Status |
SLI | Get Number of GCP Incidents Affecting My Workspace | |
GCP Storage Bucket Health |
TaskSet | Fetch GCP Bucket Storage Utilization for `${PROJECT_IDS}`; Add GCP Bucket Storage Configuration for `${PROJECT_IDS}` to Report; Check GCP Bucket Security Configuration for `${PROJECT_IDS}`; Fetch GCP Bucket Storage Operations Rate for `${PROJECT_IDS}` | |
SLI | Fetch GCP Bucket Storage Utilization for `${PROJECT_IDS}`; Check GCP Bucket Security Configuration for `${PROJECT_IDS}`; Fetch GCP Bucket Storage Operations Rate for `${PROJECT_IDS}`; Generate Bucket Score in Project `${PROJECT_IDS}` | |
GitHub - Create Issue From RunSession |
TaskSet | Create GitHub Issue in Repository `${GITHUB_REPOSITORY}` from RunSession | |
GitHub Actions Artifact Analysis |
TaskSet | Analyze artifact from GitHub workflow `${WORKFLOW_NAME}` in repository `${GITHUB_REPO}` | |
SLI | Analyze artifact from GitHub Workflow `${WORKFLOW_NAME}` in repository `${GITHUB_REPO}` and push metric | |
GitHub Actions Health Monitoring |
TaskSet | Check Recent Workflow Failures Across Specified Repositories; Check Long Running Workflows Across Specified Repositories; Check Repository Health Summary for Specified Repositories; Check GitHub Actions Runner Health Across Specified Organizations; Check Security Workflow Status Across Specified Repositories; Check GitHub Actions Billing and Usage Across Specified Organizations; Check GitHub API Rate Limits | |
SLI | Calculate Workflow Success Rate Across Specified Repositories; Calculate Organization Health Score Across Specified Organizations; Calculate Runner Availability Score Across Specified Organizations; Calculate Security Workflow Score Across Specified Repositories; Calculate Performance Score Across Specified Repositories; Calculate API Rate Limit Health Score; Generate Overall GitHub Actions Health Score | |
GitHub Actions Workflow Timing |
SLI | Get Average Run Time For Workflow | |
GitHub API Latency |
TaskSet | Check Latency When Creating a New GitHub Issue | |
SLI | Check GitHub Latency With Get Repos | |
GitHub Service Status |
SLI | Get Availability of GitHub or Individual GitHub Components | |
GitHub Status Incidents |
SLI | Get Number of Incidents Affecting GitHub | |
GitHub Status Maintenance |
SLI | Get Scheduled and Active GitHub Maintenance Windows | |
GitLab Availability |
TaskSet | Check GitLab Server Status | |
SLI | Check GitLab Server Status | |
GitLab Get Repo Latency |
SLI | Check GitLab Latency With Get Repos | |
GKE Cluster Health |
TaskSet | Identify GKE Service Account Issues in GCP Project `${GCP_PROJECT_ID}`; Fetch GKE Recommendations for GCP Project `${GCP_PROJECT_ID}`; Fetch GKE Cluster Health for GCP Project `${GCP_PROJECT_ID}`; Check for Quota Related GKE Autoscaling Issues in GCP Project `${GCP_PROJECT_ID}`; Validate GKE Node Sizes for GCP Project `${GCP_PROJECT_ID}`; Fetch GKE Cluster Operations for GCP Project `${GCP_PROJECT_ID}` | |
SLI | Identify GKE Service Account Issues in GCP Project `${GCP_PROJECT_ID}`; Fetch GKE Recommendations for GCP Project `${GCP_PROJECT_ID}`; Fetch GKE Cluster Health for GCP Project `${GCP_PROJECT_ID}`; Check for Quota Related GKE Autoscaling Issues in GCP Project `${GCP_PROJECT_ID}`; Generate GKE Cluster Health Score | |
GKE Kong Ingress Host Triage |
TaskSet | Check If Kong Ingress HTTP Error Rate Violates HTTP Error Threshold in GCP Project `${GCP_PROJECT_ID}`; Check If Kong Ingress HTTP Request Latency Violates Threshold in GCP Project `${GCP_PROJECT_ID}`; Check If Kong Ingress Controller Reports Upstream Errors in GCP Project `${GCP_PROJECT_ID}` | |
GKE Nginx Ingress Host Triage |
TaskSet | Fetch Nginx HTTP Errors From GMP for Ingress `${INGRESS_OBJECT_NAME}`; Find Owner and Service Health for Ingress `${INGRESS_OBJECT_NAME}` | |
Google Chat Send Message |
TaskSet | Send Chat Message | |
Grafana Health |
SLI | Check Grafana Server Health | |
gRPC cURL Unary |
TaskSet | Create a new Jira Issue | |
SLI | Search Jira Issues By Current User | |
gRPC cURL Unary |
TaskSet | Run gRPCurl Command and Show Output | |
SLI | Run gRPCurl Command and Push Metric | |
HashiCorp Vault Health |
SLI | Check If Vault Endpoint Is Healthy | |
HTTP Latency |
SLI | Check HTTP Latency to Well Known URL | |
HTTP OK |
SLI | Checking HTTP URL Is Available And Timely | |
Jenkins Health |
TaskSet | List Failed Build Logs in Jenkins Instance `${JENKINS_INSTANCE_NAME}`; List Long Running Builds in Jenkins Instance `${JENKINS_INSTANCE_NAME}`; List Recent Failed Tests in Jenkins Instance `${JENKINS_INSTANCE_NAME}`; Check Jenkins Instance `${JENKINS_INSTANCE_NAME}` Health; List Long Queued Builds in Jenkins Instance `${JENKINS_INSTANCE_NAME}`; List Executor Utilization in Jenkins Instance `${JENKINS_INSTANCE_NAME}`; Fetch Jenkins Instance `${JENKINS_INSTANCE_NAME}` Logs and Add to Report | |
SLI | Check For Failed Build Logs in Jenkins Instance `${JENKINS_INSTANCE_NAME}`; Check For Long Running Builds in Jenkins Instance `${JENKINS_INSTANCE_NAME}`; Check For Recent Failed Tests in Jenkins Instance `${JENKINS_INSTANCE_NAME}`; Check For Jenkins Instance `${JENKINS_INSTANCE_NAME}` Health; Check For Long Queued Builds in Jenkins Instance `${JENKINS_INSTANCE_NAME}`; Check Jenkins Executor Utilization in Jenkins Instance `${JENKINS_INSTANCE_NAME}`; Generate Health Score | |
K8s Jaeger Query |
TaskSet | Query Traces in Jaeger for Unhealthy HTTP Response Codes in Namespace `${NAMESPACE}` | |
K8s OpenTelemetry Collector Health |
TaskSet | Query Collector Queued Spans in Namespace `${NAMESPACE}`; Check OpenTelemetry Collector Logs For Errors In Namespace `${NAMESPACE}`; Query OpenTelemetry Logs For Dropped Spans In Namespace `${NAMESPACE}` | |
Kong Ingress Health (GCP PromQL) |
||
SLI | Get Access Token Get HTTP Error Rate Get Upstream Health Get Request Latency Rate Generate Kong Ingress Score |
|
Kubeprometheus Operator Troubleshoot |
||
TaskSet | Check Prometheus Service Monitors in namespace `${NAMESPACE}` Check For Successful Rule Setup in Kubernetes Namespace `${NAMESPACE}` Verify Prometheus RBAC Can Access ServiceMonitors in Namespace `${PROM_NAMESPACE}` Inspect Prometheus Operator Logs for Scraping Errors in Namespace `${NAMESPACE}` Check Prometheus API Healthy in Namespace `${PROM_NAMESPACE}` |
|
Kubernetes API Server Health |
||
SLI | Running Kubectl Check Against API Server | |
Kubernetes Application Log Health |
||
TaskSet | Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Errors in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Stack Traces in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Connection Failures in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Timeout Errors in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Authentication and Authorization Failures in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Null Pointer and Unhandled Exceptions in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` for Log Anomalies in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Application Restarts and Failures in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Memory and CPU Resource Warnings in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Service Dependency Failures in Namespace `${NAMESPACE}` |
|
SLI | Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Errors in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Stack Traces in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Connection Failures in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Timeout Errors in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Authentication and Authorization Failures in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Null Pointer and Unhandled Exceptions in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` for Log Anomalies in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Application Restarts and Failures in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Memory and CPU Resource Warnings in Namespace `${NAMESPACE}` Scan ${WORKLOAD_TYPE} `${WORKLOAD_NAME}` Logs for Service Dependency Failures in Namespace `${NAMESPACE}` Generate Application Gateway Health Score |
|
Kubernetes Application Troubleshoot |
||
TaskSet | Get `${CONTAINER_NAME}` Application Logs from Workload `${WORKLOAD_NAME}` in Namespace `${NAMESPACE}` Scan `${CONTAINER_NAME}` Application For Misconfigured Environment Tail `${CONTAINER_NAME}` Application Logs For Stacktraces in Workload `${WORKLOAD_NAME}` |
|
SLI | Measure Application Exceptions in `${NAMESPACE}` | |
Kubernetes ArgoCD Application Health & Troubleshoot |
||
TaskSet | Fetch ArgoCD Application Sync Status & Health for `${APPLICATION}` Fetch ArgoCD Application Last Sync Operation Details for `${APPLICATION}` Fetch Unhealthy ArgoCD Application Resources for `${APPLICATION}` Scan For Errors in Pod Logs Related to ArgoCD Application `${APPLICATION}` Fully Describe ArgoCD Application `${APPLICATION}` |
|
Kubernetes ArgoCD HelmRelease TaskSet |
||
TaskSet | Fetch all available ArgoCD Helm releases in namespace `${NAMESPACE}` Fetch Installed ArgoCD Helm release versions in namespace `${NAMESPACE}` |
|
Kubernetes Artifactory Triage |
||
TaskSet | Check Artifactory Liveness and Readiness Endpoints in `${NAMESPACE}` | |
Kubernetes cert-manager Healthcheck |
||
TaskSet | Get Namespace Certificate Summary for Namespace `${NAMESPACE}` Find Unhealthy Certificates in Namespace `${NAMESPACE}` Find Failed Certificate Requests and Identify Issues for Namespace `${NAMESPACE}` |
|
SLI | Count Unready and Expired Certificates in Namespace `${NAMESPACE}` | |
Kubernetes CLI Command |
||
TaskSet | ${TASK_TITLE} | |
SLI | ${TASK_TITLE} | |
Kubernetes CLI Command |
||
TaskSet | ${TASK_TITLE} | |
SLI | ${TASK_TITLE} | |
Kubernetes Cluster Node Health |
||
TaskSet | Check for Node Restarts in Cluster `${CONTEXT}` within Interval `${INTERVAL}` | |
SLI | Check for Node Restarts in Cluster `${CONTEXT}` Generate Namespace Score in Kubernetes Cluster `${CONTEXT}` |
|
Kubernetes Cluster Resource Health |
||
TaskSet | Identify High Utilization Nodes for Cluster `${CONTEXT}` Identify Pods Causing High Node Utilization in Cluster `${CONTEXT}` Identify Pods with Resource Limits Exceeding Node Capacity in Cluster `${CONTEXT}` |
|
SLI | Identify High Utilization Nodes for Cluster `${CONTEXT}` Identify Pods with Resource Limits Exceeding Node Capacity in Cluster `${CONTEXT}` Generate Cluster Resource Health Score |
|
Kubernetes DaemonSet Health Check
||
SLI | Health Check DaemonSet | |
Kubernetes DaemonSet Triage |
||
TaskSet | Analyze Application Log Patterns for DaemonSet `${DAEMONSET_NAME}` in Namespace `${NAMESPACE}` Detect Log Anomalies for DaemonSet `${DAEMONSET_NAME}` in Namespace `${NAMESPACE}` Check Liveness Probe Configuration for DaemonSet `${DAEMONSET_NAME}` Check Readiness Probe Configuration for DaemonSet `${DAEMONSET_NAME}` in Namespace `${NAMESPACE}` Inspect DaemonSet Warning Events for `${DAEMONSET_NAME}` in Namespace `${NAMESPACE}` Fetch DaemonSet Workload Details For `${DAEMONSET_NAME}` in Namespace `${NAMESPACE}` Inspect DaemonSet Status for `${DAEMONSET_NAME}` in namespace `${NAMESPACE}` Check Node Affinity and Tolerations for DaemonSet `${DAEMONSET_NAME}` in Namespace `${NAMESPACE}` |
|
Kubernetes Decommission Workload
||
TaskSet | Generate Decommission Commands | |
Kubernetes Deployment Operations |
||
TaskSet | Restart Deployment `${DEPLOYMENT_NAME}` in Namespace `${NAMESPACE}` Force Delete Pods in Deployment `${DEPLOYMENT_NAME}` in Namespace `${NAMESPACE}` Rollback Deployment `${DEPLOYMENT_NAME}` in Namespace `${NAMESPACE}` to Previous Version Scale Down Deployment `${DEPLOYMENT_NAME}` in Namespace `${NAMESPACE}` Scale Up Deployment `${DEPLOYMENT_NAME}` in Namespace `${NAMESPACE}` by ${SCALE_UP_FACTOR}x Clean Up Stale ReplicaSets for Deployment `${DEPLOYMENT_NAME}` in Namespace `${NAMESPACE}` Scale Down Stale ReplicaSets for Deployment `${DEPLOYMENT_NAME}` in Namespace `${NAMESPACE}` |
|
Kubernetes Deployment Triage |
||
TaskSet | Analyze Application Log Patterns for Deployment `${DEPLOYMENT_NAME}` in Namespace `${NAMESPACE}` Detect Log Anomalies for Deployment `${DEPLOYMENT_NAME}` in Namespace `${NAMESPACE}` Check Deployment Log For Issues with `${DEPLOYMENT_NAME}` Fetch Deployments Logs for `${DEPLOYMENT_NAME}` in Namespace `${NAMESPACE}` and Add to Report Check Liveness Probe Configuration for Deployment `${DEPLOYMENT_NAME}` Check Readiness Probe Configuration for Deployment `${DEPLOYMENT_NAME}` in Namespace `${NAMESPACE}` Inspect Container Restarts for Deployment `${DEPLOYMENT_NAME}` in Namespace `${NAMESPACE}` Inspect Deployment Warning Events for `${DEPLOYMENT_NAME}` in Namespace `${NAMESPACE}` Fetch Deployment Workload Details For `${DEPLOYMENT_NAME}` in Namespace `${NAMESPACE}` Inspect Deployment Replicas for `${DEPLOYMENT_NAME}` in namespace `${NAMESPACE}` |
|
Kubernetes Event Query |
||
SLI | Get Number Of Matching Events | |
Kubernetes Flux Chaos Testing
||
TaskSet | Suspend the Flux Resource Reconciliation for `${FLUX_RESOURCE_NAME}` in Namespace `${FLUX_RESOURCE_NAMESPACE}` Select Random FluxCD Workload for Chaos Target in Namespace `${FLUX_RESOURCE_NAMESPACE}` Execute Chaos Command on `${TARGET_RESOURCE}` in Namespace `${TARGET_NAMESPACE}` Execute Additional Chaos Command on ${FLUX_RESOURCE_TYPE} `${FLUX_RESOURCE_NAME}` in Namespace `${FLUX_RESOURCE_NAMESPACE}` Resume Flux Resource Reconciliation in `${TARGET_NAMESPACE}` |
|
Kubernetes FluxCD HelmRelease TaskSet |
||
TaskSet | List all available FluxCD HelmReleases in Namespace `${NAMESPACE}` Fetch Installed FluxCD HelmRelease Versions in Namespace `${NAMESPACE}` Fetch Mismatched FluxCD HelmRelease Version in Namespace `${NAMESPACE}` Fetch FluxCD HelmRelease Error Messages in Namespace `${NAMESPACE}` Check for Available Helm Chart Updates in Namespace `${NAMESPACE}` |
|
Kubernetes FluxCD Kustomization TaskSet |
||
TaskSet | List All FluxCD Kustomization objects in Namespace `${NAMESPACE}` in Cluster `${CONTEXT}` List Suspended FluxCD Kustomization objects in Namespace `${NAMESPACE}` in Cluster `${CONTEXT}` List Unready FluxCD Kustomizations in Namespace `${NAMESPACE}` in Cluster `${CONTEXT}` |
|
SLI | List Suspended FluxCD Kustomization objects in Namespace `${NAMESPACE}` in Cluster `${CONTEXT}` List Unready FluxCD Kustomizations in Namespace `${NAMESPACE}` in Cluster `${CONTEXT}` Generate FluxCD Kustomization Health Score for Namespace `${NAMESPACE}` in Cluster `${CONTEXT}` |
|
Kubernetes FluxCD Reconciliation Report
||
TaskSet | Check FluxCD Reconciliation Health in Kubernetes Namespace `${FLUX_NAMESPACE}` | |
SLI | Health Check Flux Reconciliation | |
Kubernetes GitOps GitHub Remediation |
||
TaskSet | Remediate Readiness and Liveness Probe GitOps Manifests in Namespace `${NAMESPACE}` Increase ResourceQuota Limit for Namespace `${NAMESPACE}` in GitHub GitOps Repository Adjust Pod Resources to Match VPA Recommendation in `${NAMESPACE}` Expand Persistent Volume Claims in Namespace `${NAMESPACE}` |
|
Kubernetes Grafana Loki Health Check |
||
TaskSet | Check Loki Ring API for Unhealthy Shards in Kubernetes Cluster `${NAMESPACE}` Check Loki API Ready in Kubernetes Cluster `${NAMESPACE}` |
|
Kubernetes Image Check |
||
TaskSet | Check Image Rollover Times for Namespace `${NAMESPACE}` List Images and Tags for Every Container in Running Pods for Namespace `${NAMESPACE}` List Images and Tags for Every Container in Failed Pods for Namespace `${NAMESPACE}` List ImagePullBackOff Events and Test Path and Tags for Namespace `${NAMESPACE}` |
|
Kubernetes Ingress GCE & GCP HTTP Load Balancer Healthcheck |
||
TaskSet | Search For GCE Ingress Warnings in GKE Context `${CONTEXT}` Identify Unhealthy GCE HTTP Ingress Backends in GKE Namespace `${NAMESPACE}` Validate GCP HTTP Load Balancer Configurations in GCP Project `${GCP_PROJECT_ID}` Fetch Network Error Logs from GCP Operations Manager for Ingress Backends in GCP Project `${GCP_PROJECT_ID}` Review GCP Operations Logging Dashboard in GCP project `${GCP_PROJECT_ID}` |
|
Kubernetes Ingress Healthcheck |
||
TaskSet | Fetch Ingress Object Health in Namespace `${NAMESPACE}` Check for Ingress and Service Conflicts in Namespace `${NAMESPACE}` |
|
Kubernetes Istio System Health |
||
TaskSet | Verify Istio Sidecar Injection for Cluster `${CONTEXT}` Check Istio Sidecar Resource Usage for Cluster `${CONTEXT}` Validate Istio Installation in Cluster `${CONTEXT}` Check Istio Controlplane Logs For Errors in Cluster `${CONTEXT}` Fetch Istio Proxy Logs in Cluster `${CONTEXT}` Verify Istio SSL Certificates in Cluster `${CONTEXT}` Check Istio Configuration Health in Cluster `${CONTEXT}` |
|
SLI | Verify Istio Sidecar Injection for Cluster `${CONTEXT}` Check Istio Sidecar Resource Usage for Cluster `${CONTEXT}` Validate Istio Installation in Cluster `${CONTEXT}` Check Istio Controlplane Logs For Errors in Cluster `${CONTEXT}` Fetch Istio Proxy Logs in Cluster `${CONTEXT}` Verify Istio SSL Certificates in Cluster `${CONTEXT}` Check Istio Configuration Health in Cluster `${CONTEXT}` Generate Health Score for Cluster ${CONTEXT} |
|
Kubernetes Jenkins Healthcheck |
||
TaskSet | Query The Jenkins Kubernetes Workload HTTP Endpoint in Kubernetes StatefulSet `${STATEFULSET_NAME}` Query For Stuck Jenkins Jobs in Kubernetes Statefulset Workload `${STATEFULSET_NAME}` |
|
Kubernetes Labeled Pod Count |
||
SLI | Measure Number of Running Pods with Label in `${NAMESPACE}` | |
Kubernetes Namespace Chaos Engineering |
||
TaskSet | Kill Random Pods In Namespace `${NAMESPACE}` OOMKill Pods In Namespace `${NAMESPACE}` Mangle Service Selector In Namespace `${NAMESPACE}` Mangle Service Port In Namespace `${NAMESPACE}` Fill Random Pod Tmp Directory In Namespace `${NAMESPACE}` |
|
Kubernetes Namespace Inspection |
||
TaskSet | Inspect Warning Events in Namespace `${NAMESPACE}` Inspect Container Restarts In Namespace `${NAMESPACE}` Inspect Pending Pods In Namespace `${NAMESPACE}` Inspect Failed Pods In Namespace `${NAMESPACE}` Inspect Workload Status Conditions In Namespace `${NAMESPACE}` Get Listing Of Resources In Namespace `${NAMESPACE}` Check Event Anomalies in Namespace `${NAMESPACE}` Check Missing or Risky PodDisruptionBudget Policies in Namespace `${NAMESPACE}` Check Resource Quota Utilization in Namespace `${NAMESPACE}` |
|
SLI | Get Error Event Count within ${EVENT_AGE} and calculate Score Get Container Restarts and Score in Namespace `${NAMESPACE}` Get NotReady Pods in `${NAMESPACE}` Generate Namespace Score in `${NAMESPACE}` |
|
Kubernetes Namespace Troubleshoot |
||
TaskSet | Trace Namespace Errors Fetch Unready Pods Triage Namespace Object Condition Check Namespace Get All |
|
SLI | Get Event Count and Score Get Container Restarts and Score Get NotReady Pods Generate Namespace Score |
|
Kubernetes Patroni Health Check |
||
SLI | Determine Patroni Health | |
Kubernetes Patroni Lag Health |
||
TaskSet | Determine Patroni Health | |
SLI | Measure Patroni Member Lag | |
Kubernetes Persistent Volume Healthcheck |
||
TaskSet | Fetch Events for Unhealthy Kubernetes PersistentVolumeClaims in Namespace `${NAMESPACE}` List PersistentVolumeClaims in Terminating State in Namespace `${NAMESPACE}` List PersistentVolumes in Terminating State in Namespace `${NAMESPACE}` List Pods with Attached Volumes and Related PersistentVolume Details in Namespace `${NAMESPACE}` Fetch the Storage Utilization for PVC Mounts in Namespace `${NAMESPACE}` Check for RWO Persistent Volume Node Attachment Issues in Namespace `${NAMESPACE}` |
|
SLI | Fetch the Storage Utilization for PVC Mounts in Namespace `${NAMESPACE}` Generate Namespace Score for Namespace `${NAMESPACE}` |
|
Kubernetes Pod Resources Health |
||
TaskSet | Show Pods Without Resource Limit or Resource Requests Set in Namespace `${NAMESPACE}` Check Pod Resource Utilization with Top in Namespace `${NAMESPACE}` Identify VPA Pod Resource Recommendations in Namespace `${NAMESPACE}` Identify Overutilized Pods in Namespace `${NAMESPACE}` |
|
Kubernetes Postgres Healthcheck |
||
TaskSet | List Resources Related to Postgres Cluster `${OBJECT_NAME}` in Namespace `${NAMESPACE}` Get Postgres Pod Logs & Events for Cluster `${OBJECT_NAME}` in Namespace `${NAMESPACE}` Get Postgres Pod Resource Utilization for Cluster `${OBJECT_NAME}` in Namespace `${NAMESPACE}` Get Running Postgres Configuration for Cluster `${OBJECT_NAME}` in Namespace `${NAMESPACE}` Get Patroni Output and Add to Report for Cluster `${OBJECT_NAME}` in Namespace `${NAMESPACE}` Fetch Patroni Database Lag for Cluster `${OBJECT_NAME}` in Namespace `${NAMESPACE}` Check Database Backup Status for Cluster `${OBJECT_NAME}` in Namespace `${NAMESPACE}` Run DB Queries for Cluster `${OBJECT_NAME}` in Namespace `${NAMESPACE}` |
|
SLI | Check Patroni Database Lag in Namespace `${NAMESPACE}` on Host `${HOSTNAME}` using `patronictl` Check Database Backup Status for Cluster `${OBJECT_NAME}` in Namespace `${NAMESPACE}` Generate Namespace Score for Namespace `${NAMESPACE}` |
|
Kubernetes PostgreSQL Query |
||
TaskSet | Run Postgres Query And Results to Report | |
SLI | Run Postgres Query And Return Result As Metric | |
Kubernetes PostgreSQL Triage |
||
TaskSet | Get Standard Resources Describe Custom Resources Get Pod Logs & Events Get Pod Resource Utilization Get Running Configuration Get Patroni Output Run DB Queries |
|
Kubernetes Redis Healthcheck |
||
TaskSet | Ping `${DEPLOYMENT_NAME}` Redis Workload Verify `${DEPLOYMENT_NAME}` Redis Read Write Operation in Kubernetes |
|
Kubernetes Restart resource |
||
TaskSet | Get Current Resource State with Labels `${LABELS}` Get Resource Logs with Labels `${LABELS}` Restart Resource with Labels `${LABELS}` in `${CONTEXT}` |
|
Kubernetes Run Shell Command |
||
TaskSet | Running Kubectl And Adding Stdout To Report | |
Kubernetes Service Account Check |
||
TaskSet | Test Service Account Access to Kubernetes API Server in Namespace `${NAMESPACE}` | |
Kubernetes StatefulSet Triage |
||
TaskSet | Analyze Application Log Patterns for StatefulSet `${STATEFULSET_NAME}` in Namespace `${NAMESPACE}` Detect Log Anomalies for StatefulSet `${STATEFULSET_NAME}` in Namespace `${NAMESPACE}` Check Liveness Probe Configuration for StatefulSet `${STATEFULSET_NAME}` Check Readiness Probe Configuration for StatefulSet `${STATEFULSET_NAME}` in Namespace `${NAMESPACE}` Inspect StatefulSet Warning Events for `${STATEFULSET_NAME}` in Namespace `${NAMESPACE}` Fetch StatefulSet Workload Details For `${STATEFULSET_NAME}` in Namespace `${NAMESPACE}` Inspect StatefulSet Replicas for `${STATEFULSET_NAME}` in namespace `${NAMESPACE}` Check StatefulSet PersistentVolumeClaims for `${STATEFULSET_NAME}` in Namespace `${NAMESPACE}` |
|
Kubernetes Synthetic PVC Test |
||
SLI | Run Canary Job | |
Kubernetes Tail Application Logs |
||
TaskSet | Get `${CONTAINER_NAME}` Application Logs in Namespace `${NAMESPACE}` Tail `${CONTAINER_NAME}` Application Logs For Stacktraces |
|
SLI | Tail `${CONTAINER_NAME}` Application Logs For Stacktraces | |
Kubernetes Top |
||
SLI | Running Kubectl Top And Extracting Metric Data | |
Kubernetes Triage Deployment Replicas |
||
TaskSet | Fetch Logs Get Related Events Check Deployment Replicas |
|
Kubernetes Triage Patroni |
||
TaskSet | Get Patroni Status Get Pods Status Fetch Logs |
|
Kubernetes Triage StatefulSet |
||
TaskSet | Check StatefulSets Replicas Ready Get Events For The StatefulSet Get StatefulSet Logs Get StatefulSet Manifests Dump |
|
Kubernetes Troubleshoot Deployment |
||
TaskSet | Troubleshoot Resourcing Troubleshoot Events Troubleshoot PVC Troubleshoot Pods |
|
Kubernetes Vault Triage |
||
TaskSet | Fetch Vault CSI Driver Logs in Namespace `${NAMESPACE}` Get Vault CSI Driver Warning Events in `${NAMESPACE}` Check Vault CSI Driver Replicas Fetch Vault Pod Workload Logs in Namespace `${NAMESPACE}` with Labels `${LABELS}` Get Related Vault Events in Namespace `${NAMESPACE}` Fetch Vault StatefulSet Manifest Details in `${NAMESPACE}` Fetch Vault DaemonSet Manifest Details in Kubernetes Cluster `${NAMESPACE}` Verify Vault Availability in Namespace `${NAMESPACE}` and Context `${CONTEXT}` Check Vault StatefulSet Replicas in `${NAMESPACE}` |
|
Kubernetes Workload Chaos Engineering |
||
TaskSet | Test `${WORKLOAD_NAME}` High Availability in Namespace `${NAMESPACE}` OOMKill `${WORKLOAD_NAME}` Pod Mangle Service Selector For `${WORKLOAD_NAME}` in `${NAMESPACE}` Mangle Service Port For `${WORKLOAD_NAME}` in `${NAMESPACE}` Fill Tmp Directory Of Pod From `${WORKLOAD_NAME}` |
|
Kubernetes Workload Metric |
||
SLI | Running Kubectl get and push the metric | |
Loki Query via Grafana (Relative Times) |
||
TaskSet | ${TASK_TITLE} | |
Microsoft Teams Send Message |
||
TaskSet | Send a Message to an MS Teams Channel | |
MongoDB Health (GCP PromQL) |
||
SLI | Get Access Token Get Instance Status Get Connection Utilization Rate Get MongoDB Member State Health Get MongoDB Replication Lag Get MongoDB Queue Size Get Assertion Rate Generate MongoDB Score |
|
OpsGenie Create Alert |
||
TaskSet | Get Opsgenie System Info Create An Alert |
|
PagerDuty Webhook Handler |
||
TaskSet | Run SLX Tasks with matching PagerDuty Webhook Service ID | |
Ping Host Availability |
||
SLI | Ping host and collect packet loss percentage | |
Pingdom Health |
||
SLI | Check Pingdom Health | |
Prometheus Query (Instant) Metric |
||
SLI | Querying Prometheus Instance And Pushing Aggregated Data | |
Prometheus Query (Range) Metric |
||
SLI | Querying Prometheus Instance And Pushing Aggregated Data | |
rds-mysql-conn-count |
||
TaskSet | Run Bash File | |
SLI | Querying Prometheus Instance And Pushing Aggregated Data | |
REST Metric |
||
SLI | Request Data From Rest Endpoint | |
REST Metric (Basic Auth) |
||
SLI | Request Data From Rest Endpoint | |
REST Metric (Explicit OAuth2 with BasicAuth) |
||
SLI | Request Data From Rest Endpoint | |
REST Metric (Explicit OAuth2 with Bearer Token) |
||
SLI | Request Data From Rest Endpoint | |
RocketChat Send Message |
||
TaskSet | Send Chat Message | |
RunWhen Local Helm Update (ACR) |
||
TaskSet | Apply Available RunWhen Helm Images in ACR Registry `${REGISTRY_NAME}` | |
SLI | Check for Available RunWhen Helm Images in ACR Registry `${REGISTRY_NAME}` | |
RunWhen Platform Azure ACR Image Sync |
||
TaskSet | Sync CodeCollection Images to ACR Registry `${REGISTRY_NAME}` Sync RunWhen Local Image Updates to ACR Registry `${REGISTRY_NAME}` |
|
SLI | Check for CodeCollection Updates against ACR Registry `${REGISTRY_NAME}` Check for RunWhen Local Image Updates against ACR Registry `${REGISTRY_NAME}` Count Images Needing Update and Push Metric |
|
Slack - Send Issue Summary From RunSession |
||
TaskSet | Send Slack Notification to Channel `${SLACK_CHANNEL}` from RunSession | |
Slack Send Message |
||
TaskSet | Send Chat Message | |
SLI Alert Threshold |
||
SLI | Check If SLI Within Incident Threshold | |
Sysdig Monitor Metric |
||
SLI | Query Sysdig Metric Data And Pushing Metric | |
Sysdig Monitor PromQL Metric |
||
SLI | Querying PromQL Endpoint And Pushing Metric Data | |
Ternary Default Dashboard Metrics |
||
TaskSet | Fetch Ternary Report from Query | |
Terraform Cloud Workspace Lock Check |
||
TaskSet | Checking whether the Terraform Cloud Workspace `${TERRAFORM_WORKSPACE_NAME}` is in a locked state | |
Twitter Query Handle |
||
TaskSet | Query Twitter | |
SLI | Query Twitter | |
Uptime.com Component Health |
||
SLI | Check If Vault Endpoint Is Healthy | |
Web Triage |
||
TaskSet | Validate Platform Egress Perform Inspection On URL |