Kubernetes Incidents

ops0 automatically detects issues in your Kubernetes clusters and surfaces them as actionable incidents. Monitor pod failures, resource exhaustion, and deployment problems from one dashboard.

Detected Incident Types

  • CrashLoopBackOff - Container repeatedly crashing and restarting
  • OOMKilled - Container killed due to memory limit exceeded
  • ImagePullBackOff - Unable to pull container image from registry
  • Pending Pods - Pods stuck waiting for scheduling
  • Node NotReady - Nodes in unhealthy or unreachable state
  • Failed Deployments - Deployment rollout failures or stuck progress
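
As a rough illustration of how the pod-level incident types above can be recognized, the sketch below inspects a Pod object represented as a plain dict (the shape returned by the Kubernetes API as JSON). It is not ops0's detection code; the function name classify_pod and the order of the checks are assumptions made for this example.

```python
from typing import Optional

# Waiting reasons that indicate the image could not be pulled.
IMAGE_PULL_REASONS = {"ImagePullBackOff", "ErrImagePull"}


def classify_pod(pod: dict) -> Optional[str]:
    """Return an incident type for a Pod dict, or None if nothing matches.

    Hypothetical helper: the check order (OOMKilled first) is an assumption.
    """
    status = pod.get("status", {})
    for cs in status.get("containerStatuses", []):
        state = cs.get("state", {})
        last_state = cs.get("lastState", {})
        # A container may report OOMKilled in its current or previous state.
        terminated_reasons = {
            (state.get("terminated") or {}).get("reason"),
            (last_state.get("terminated") or {}).get("reason"),
        }
        waiting_reason = (state.get("waiting") or {}).get("reason")
        if "OOMKilled" in terminated_reasons:
            return "OOMKilled"
        if waiting_reason == "CrashLoopBackOff":
            return "CrashLoopBackOff"
        if waiting_reason in IMAGE_PULL_REASONS:
            return "ImagePullBackOff"
    # No container-level problem found; fall back to the pod phase.
    if status.get("phase") == "Pending":
        return "Pending Pods"
    return None
```

Node NotReady and Failed Deployments are detected from Node and Deployment objects rather than Pods, so they are outside the scope of this sketch.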

Severity Levels

Severity     Color   Description                   Example
────────     ─────   ───────────                   ───────
P1 Critical  Red     Immediate attention required  OOMKilled, Node NotReady
P2 High      Orange  Urgent, affecting production  CrashLoopBackOff
P3 Medium    Yellow  Should be addressed soon      High restart count, Pending pods
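
The mapping in this table can be written down as a small lookup, shown here as a sketch. Only the incident types the table names are included; types it does not mention (for example ImagePullBackOff) return None rather than a guessed severity. The names severity_for and most_urgent_first are hypothetical.

```python
from typing import List, Optional

# Incident type -> severity, taken from the Example column of the table.
SEVERITY_BY_TYPE = {
    "OOMKilled": "P1 Critical",
    "Node NotReady": "P1 Critical",
    "CrashLoopBackOff": "P2 High",
    "High restart count": "P3 Medium",
    "Pending Pods": "P3 Medium",
}

# Lower rank means more urgent.
SEVERITY_RANK = {"P1 Critical": 1, "P2 High": 2, "P3 Medium": 3}


def severity_for(incident_type: str) -> Optional[str]:
    """Return the severity for a type listed in the table, else None."""
    return SEVERITY_BY_TYPE.get(incident_type)


def most_urgent_first(incident_types: List[str]) -> List[str]:
    """Sort the known incident types so that P1 comes before P2 and P3."""
    known = [t for t in incident_types if t in SEVERITY_BY_TYPE]
    return sorted(known, key=lambda t: SEVERITY_RANK[SEVERITY_BY_TYPE[t]])
```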

Incident Status

Status        Description
──────        ───────────
Open          Newly detected, not yet triaged
Acknowledged  Team is aware and investigating
Resolved      Issue fixed or no longer occurring

Incident Details

When you click an incident, the detail panel shows:

Resource Information

Field             Description
─────             ───────────
Pod               Affected pod name
Namespace         Kubernetes namespace
Container         Container name (if applicable)
Node              Node the pod is running on
Restart Count     Number of container restarts
Last Termination  Reason for last container exit

Timeline

The timeline shows the event history leading up to the incident:

10:45:00  Container 'api' exited with code 1
10:45:30  Container restarted (attempt 1)
10:46:00  Container 'api' exited with code 1
10:46:30  Container restarted (attempt 2)
10:47:00  CrashLoopBackOff detected
10:47:00  Incident created (P2 High)

Captured Logs

Recent container logs captured at incident detection:

[10:44:55] Starting application...
[10:44:56] Connected to database
[10:44:58] Error: Connection pool exhausted
[10:44:58] Fatal: Unable to handle request
[10:44:59] Process exiting with code 1

Quick Actions

  • View Logs - Open container logs with search and filtering
  • Open Terminal - Exec into pod to investigate live
  • Describe Pod - Full kubectl describe output
  • View Events - Kubernetes events for the resource

Managing Incidents

Acknowledge

  1. Click Acknowledge on an open incident
  2. Optionally add a note about who's investigating
  3. Status changes to "Acknowledged"
  4. The incident shows the acknowledger's name and timestamp

Resolve

  1. Click Resolve on an incident
  2. Add resolution notes (what fixed it)
  3. Select resolution type:
    • Fixed - Issue was corrected
    • Not an issue - False positive or expected behavior
    • Auto-resolved - Issue cleared on its own
  4. Status changes to "Resolved"
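
The acknowledge and resolve steps amount to a small state machine over the three statuses. The sketch below models it under stated assumptions: the Incident class and its field names are hypothetical, and the rule that only an open incident can be acknowledged is inferred from step 1 of the Acknowledge flow rather than confirmed ops0 behavior.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

RESOLUTION_TYPES = {"Fixed", "Not an issue", "Auto-resolved"}


@dataclass
class Incident:
    """Hypothetical in-memory model of an incident's lifecycle."""
    status: str = "Open"
    acknowledged_by: Optional[str] = None
    acknowledged_at: Optional[datetime] = None
    resolution_type: Optional[str] = None
    resolution_notes: str = ""

    def acknowledge(self, user: str, at: datetime) -> None:
        # Assumption: only open incidents can be acknowledged.
        if self.status != "Open":
            raise ValueError(f"cannot acknowledge an incident in status {self.status}")
        self.status = "Acknowledged"
        self.acknowledged_by = user
        self.acknowledged_at = at

    def resolve(self, resolution_type: str, notes: str = "") -> None:
        if self.status == "Resolved":
            raise ValueError("incident is already resolved")
        if resolution_type not in RESOLUTION_TYPES:
            raise ValueError(f"unknown resolution type: {resolution_type}")
        self.status = "Resolved"
        self.resolution_type = resolution_type
        self.resolution_notes = notes
```

Resolving is allowed from either Open or Acknowledged in this sketch, since the Resolve flow above does not require a prior acknowledgement.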

Manual Incident Scanning

Trigger on-demand incident detection for a cluster:

  1. Navigate to cluster detail page
  2. Click "Scan for Incidents" button
  3. ops0 runs detection across all namespaces
  4. New incidents appear in list

When to Use:

  • After deploying changes to verify no new issues
  • Troubleshooting suspected problems
  • Validating incident auto-resolution

Scan Process:

Scanning production-eks for incidents...
✓ Checked 156 pods
✓ Checked 12 nodes
✓ Checked 35 deployments
Found 2 new incidents (1 Critical, 1 Warning)

AI-Powered Analysis

For each incident, ops0 can produce AI-powered root cause analysis:

Analysis Includes:

  • Probable root cause
  • Related configuration issues
  • Suggested remediation steps
  • Similar past incidents

Example Analysis:

Root Cause Analysis (AI-Generated)
───────────────────────────────────────────────
Incident: CrashLoopBackOff in api-gateway pod

Root Cause:
The pod is failing to start due to missing environment
variable DB_PASSWORD. The Secret 'api-gateway-secrets'
exists but is missing the 'db-password' key.

Evidence:
- Container logs show: "ERROR Missing required env: DB_PASSWORD"
- Secret 'api-gateway-secrets' was updated 45 minutes ago
- Previous version contained 'db-password' key
- No other pods in namespace are affected

Recommended Actions:
1. Verify Secret 'api-gateway-secrets' contains 'db-password' key
2. If key was removed, add it back with correct value
3. If key name changed, update Deployment env reference

Updating Analysis: AI analysis is created automatically for new incidents. To refresh:

  1. Click "Refresh Analysis" in incident panel
  2. Updated analysis appears within seconds

Incident Notes

Add investigation notes and findings to incidents:

Adding Notes:

  1. Open incident detail panel
  2. Click "Add Note" button
  3. Write note in markdown
  4. Click "Save"

Note Visibility:

  • All team members can view notes
  • Author and timestamp recorded
  • Notes preserved after incident resolution

Use Cases:

  • Document investigation steps
  • Share findings with team
  • Track remediation attempts
  • Link to related runbooks

Example Notes:

@sarah.chen 2024-01-15 10:50:00
Checked database pod - also in CrashLoopBackOff.
Appears to be PVC mount issue after node rotation.

@mike.jones 2024-01-15 11:05:00
Confirmed: PVC using wrong storage class.
Recreating with correct class now.

@sarah.chen 2024-01-15 11:15:00
Database pod recovered. API gateway auto-recovered
once DB became available. Closing incident.

Auto-Resolution

Incidents automatically resolve when the underlying issue clears:

Auto-Resolution Logic:

  • ops0 continuously monitors incident status
  • If pods become healthy and stable for 10 minutes, incident auto-resolves
  • Resolution type marked as "Auto-resolved"
  • Timeline shows auto-resolution event

Example Timeline:

10:47:00  Incident created (CrashLoopBackOff)
10:50:00  Acknowledged by @sarah.chen
11:15:00  Pod became healthy
11:25:00  Auto-resolved (stable for 10 minutes)
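
The 10-minute stability rule can be expressed as a single time comparison. This is a minimal sketch, assuming the caller tracks when the pod last became healthy and resets that value to None whenever the pod turns unhealthy again; should_auto_resolve and STABILITY_WINDOW are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Optional

# Stability window taken from the auto-resolution logic described above.
STABILITY_WINDOW = timedelta(minutes=10)


def should_auto_resolve(
    healthy_since: Optional[datetime],
    now: datetime,
    window: timedelta = STABILITY_WINDOW,
) -> bool:
    """Return True once the pod has stayed healthy for the full window."""
    if healthy_since is None:
        # The pod is currently unhealthy, so there is nothing to resolve.
        return False
    return now - healthy_since >= window
```

With the example timeline, a pod that became healthy at 11:15:00 qualifies at 11:25:00 and not before.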

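Manual vs Auto-Resolution:

  • Auto-resolved: ops0 closed the incident after the resource stayed healthy, whether the issue cleared on its own or a fix was applied
  • Fixed: A team member manually resolved the incident after correcting the issue
  • Not an issue: A team member manually resolved the incident as a false positive or expected behavior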

Filtering Incidents

Filter     Options
──────     ───────
Status     All, Open, Acknowledged, Resolved
Severity   All, Critical, High, Medium
Type       CrashLoopBackOff, OOMKilled, ImagePullBackOff, etc.
Namespace  Filter by Kubernetes namespace
Search     Pod name, namespace, or message content
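
To make the filter semantics concrete, the sketch below applies the same filters to a list of incident dicts. The dict keys (status, severity, type, namespace, pod, message), the function name, and the case-insensitive substring search are all assumptions for illustration; passing None for a filter corresponds to the "All" option.

```python
from typing import Dict, List, Optional


def filter_incidents(
    incidents: List[Dict[str, str]],
    status: Optional[str] = None,
    severity: Optional[str] = None,
    incident_type: Optional[str] = None,
    namespace: Optional[str] = None,
    search: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Return the incidents that match every filter that is set."""
    needle = search.lower() if search else None
    results = []
    for inc in incidents:
        if status and inc.get("status") != status:
            continue
        if severity and inc.get("severity") != severity:
            continue
        if incident_type and inc.get("type") != incident_type:
            continue
        if namespace and inc.get("namespace") != namespace:
            continue
        if needle:
            # Search covers pod name, namespace, and message content.
            fields = (inc.get("pod", ""), inc.get("namespace", ""), inc.get("message", ""))
            if not any(needle in f.lower() for f in fields):
                continue
        results.append(inc)
    return results
```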

Incident Metrics

The incidents page shows aggregate metrics:

Last 24 Hours
─────────────────────────────
Total Incidents:  12
Critical (P1):    1
High (P2):        3
Medium (P3):      8

MTTA (Acknowledge): 5 min
MTTR (Resolve):     45 min
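
MTTA and MTTR are averages of two durations per incident. The sketch below assumes both are measured from the detection time, which matches the 28-minute MTTR in the CrashLoopBackOff example later on this page (detected 10:47, resolved 11:15); the dict keys and function names are hypothetical.

```python
from datetime import datetime
from statistics import mean
from typing import Dict, List, Optional


def _mean_minutes(incidents: List[Dict[str, datetime]], end_key: str) -> Optional[float]:
    """Average minutes from detection to end_key, skipping incidents without it."""
    durations = [
        (inc[end_key] - inc["detected"]).total_seconds() / 60
        for inc in incidents
        if inc.get(end_key) is not None
    ]
    return mean(durations) if durations else None


def mtta_minutes(incidents: List[Dict[str, datetime]]) -> Optional[float]:
    """Mean time to acknowledge, in minutes."""
    return _mean_minutes(incidents, "acknowledged")


def mttr_minutes(incidents: List[Dict[str, datetime]]) -> Optional[float]:
    """Mean time to resolve, in minutes."""
    return _mean_minutes(incidents, "resolved")
```

Incidents that have not yet been acknowledged or resolved are left out of the respective average instead of being counted as zero.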

Related Resources

Each incident links to related Kubernetes resources:

  • Deployment - Parent deployment if pod is managed
  • ReplicaSet - Current ReplicaSet
  • Service - Services routing to the pod
  • ConfigMap/Secret - Mounted configurations

Click any related resource to view its details.

Best Practices

  • Acknowledge quickly - Shows the team is aware and prevents duplicate investigation
  • Add notes - Document findings during investigation for future reference
  • Resolve with details - Explain what fixed it to help with recurring issues
  • Set up alerts - Configure Slack/PagerDuty for critical incidents
  • Review patterns - Recurring incidents indicate systemic issues to address

Example: Investigating a CrashLoopBackOff

Incident Alert

INCIDENT #1247 - P2 High
─────────────────────────────────────
Type:       CrashLoopBackOff
Pod:        api-gateway-7d9f8c6b4d-2xkjp
Namespace:  production
Container:  api
Cluster:    production-eks
Detected:   2024-01-15 10:47:00 UTC

Incident Timeline

10:45:00  Container 'api' started
10:45:02  Readiness probe passed
10:45:15  Error: Database connection refused
10:45:15  Container exited with code 1
10:45:30  Container restarted (attempt 1)
10:45:32  Readiness probe passed
10:45:45  Error: Database connection refused
10:45:45  Container exited with code 1
10:46:00  Container restarted (attempt 2)
10:46:45  Container exited with code 1
10:47:00  CrashLoopBackOff detected
10:47:00  Incident #1247 created (P2 High)
10:47:01  Slack notification sent to #incidents

Captured Logs

[10:45:12] INFO  Starting API Gateway v2.3.1
[10:45:13] INFO  Loading configuration from /etc/config/app.yaml
[10:45:14] INFO  Connecting to database: postgresql://db.internal:5432/api
[10:45:15] ERROR Connection refused: postgresql://db.internal:5432/api
[10:45:15] ERROR Failed to establish database connection after 3 retries
[10:45:15] FATAL Cannot start without database connection, exiting
[10:45:15] INFO  Shutdown complete

Investigation

Step 1: Check Events

Type    Reason     Age   Message
────    ──────     ───   ───────
Normal  Scheduled  5m    Successfully assigned production/api-gateway-... to node-1
Normal  Pulling    5m    Pulling image "api-gateway:v2.3.1"
Normal  Pulled     5m    Successfully pulled image in 2s
Normal  Created    5m    Created container api
Normal  Started    5m    Started container api
Warning BackOff    2m    Back-off restarting failed container

Step 2: Check Database Pod

kubectl get pods -n production -l app=postgresql

NAME                         READY   STATUS             RESTARTS   AGE
postgresql-0                 0/1     CrashLoopBackOff   5          10m

Root Cause: The PostgreSQL pod was also crashing due to a PVC mount failure.

Resolution

Resolution Type:  Fixed
Resolution Notes: PostgreSQL pod was crashing due to PVC storage class
                  misconfiguration after node rotation. Recreated PVC
                  with correct storage class. Database pod recovered,
                  API gateway auto-recovered after database became available.

Resolved by:      @sarah.chen
Resolved at:      2024-01-15 11:15:00 UTC
MTTR:             28 minutes

Example: OOMKilled Incident

Incident Alert

INCIDENT #1248 - P1 Critical
─────────────────────────────────────
Type:       OOMKilled
Pod:        data-processor-5f8d9c7b2-kp3mn
Namespace:  analytics
Container:  processor
Cluster:    production-eks
Detected:   2024-01-15 14:22:00 UTC

Resource Details

Container: processor
─────────────────────────────────────
Memory Request:  512Mi
Memory Limit:    1Gi
Last Usage:      1Gi (100% of limit)
Exit Code:       137 (OOMKilled)

Pod Events:
Warning  OOMKilled  Container processor exceeded memory limit
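
Exit code 137 follows the common convention that a process killed by a signal exits with 128 plus the signal number, so 137 corresponds to signal 9 (SIGKILL). The sketch below decodes such codes on a POSIX system; signal_from_exit_code is a hypothetical helper name.

```python
import signal
from typing import Optional


def signal_from_exit_code(exit_code: int) -> Optional[str]:
    """Return the signal name for exit codes of the form 128 + N, else None."""
    if exit_code <= 128:
        return None
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        # The remainder is not a signal number known on this platform.
        return None
```

A SIGKILL exit alone does not prove an out-of-memory kill, since other things can send SIGKILL; the OOMKilled termination reason is what identifies the cause.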

Resolution

Increased the memory request and limit in the Deployment:

resources:
  requests:
    memory: "1Gi"    # was 512Mi
  limits:
    memory: "2Gi"    # was 1Gi

After the change was applied, the pod recovered without further OOM events.
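
To compare values such as 512Mi, 1Gi, and 2Gi numerically, the quantities need to be converted to bytes. The sketch below handles only plain integers with the common binary (Ki, Mi, Gi, Ti) and decimal (k, M, G, T) suffixes; Kubernetes accepts further forms that are not covered here, and parse_memory is a hypothetical helper name.

```python
import re

# Multipliers for the suffixes this sketch supports.
_MULTIPLIERS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}

_QUANTITY = re.compile(r"(\d+)(Ki|Mi|Gi|Ti|k|M|G|T)?")


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' to a number of bytes."""
    match = _QUANTITY.fullmatch(quantity.strip())
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, suffix = match.groups()
    # A missing suffix means the value is already in bytes.
    return int(number) * _MULTIPLIERS.get(suffix, 1)


def usage_percent(used: str, limit: str) -> float:
    """Return usage as a percentage of the limit."""
    return 100 * parse_memory(used) / parse_memory(limit)
```

Applied to this incident, usage of 1Gi is 100% of the old 1Gi limit and 50% of the new 2Gi limit.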