# Kubernetes Incidents
ops0 automatically detects issues in your Kubernetes clusters and surfaces them as actionable incidents. Monitor pod failures, resource exhaustion, and deployment problems from one dashboard.
## Detected Incident Types

### Severity Levels
| Severity | Color | Description | Example |
|---|---|---|---|
| P1 Critical | Red | Immediate attention required | OOMKilled, Node NotReady |
| P2 High | Orange | Urgent, affecting production | CrashLoopBackOff |
| P3 Medium | Yellow | Should be addressed soon | High restart count, Pending pods |
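As an illustration of the severity table above, a condition-to-severity mapping might look like the following sketch. The dictionary and function names are hypothetical, not ops0's actual implementation:

```python
# Illustrative mapping from detected Kubernetes conditions to ops0-style
# severity labels, based on the examples in the table above.
SEVERITY_BY_CONDITION = {
    "OOMKilled": "P1 Critical",
    "NodeNotReady": "P1 Critical",
    "CrashLoopBackOff": "P2 High",
    "HighRestartCount": "P3 Medium",
    "PendingPod": "P3 Medium",
}

def severity_for(condition: str) -> str:
    """Return the severity label for a condition, defaulting to P3 Medium."""
    return SEVERITY_BY_CONDITION.get(condition, "P3 Medium")

print(severity_for("OOMKilled"))  # P1 Critical
```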
### Incident Status
| Status | Description |
|---|---|
| Open | Newly detected, not yet triaged |
| Acknowledged | Team is aware and investigating |
| Resolved | Issue fixed or no longer occurring |
## Incident Details
When you click an incident, the detail panel shows:
### Resource Information
| Field | Description |
|---|---|
| Pod | Affected pod name |
| Namespace | Kubernetes namespace |
| Container | Container name (if applicable) |
| Node | Node the pod is running on |
| Restart Count | Number of container restarts |
| Last Termination | Reason for last container exit |
### Timeline

Shows the event history leading up to the incident:

```
10:45:00  Container 'api' exited with code 1
10:45:30  Container restarted (attempt 1)
10:46:00  Container 'api' exited with code 1
10:46:30  Container restarted (attempt 2)
10:47:00  CrashLoopBackOff detected
10:47:00  Incident created (P2 High)
```
### Captured Logs

Recent container logs captured at incident detection:

```
[10:44:55] Starting application...
[10:44:56] Connected to database
[10:44:58] Error: Connection pool exhausted
[10:44:58] Fatal: Unable to handle request
[10:44:59] Process exiting with code 1
```
### Quick Actions

## Managing Incidents

### Acknowledge
- Click Acknowledge on an open incident
- Optionally add a note about who's investigating
- Status changes to "Acknowledged"
- Shows acknowledger name and timestamp
### Resolve

- Click Resolve on an incident
- Add resolution notes (what fixed it)
- Select resolution type:
  - Fixed - Issue was corrected
  - Not an issue - False positive or expected behavior
  - Auto-resolved - Issue cleared on its own
- Incident moves to the resolved state
## Manual Incident Scanning
Trigger on-demand incident detection for a cluster:
- Navigate to cluster detail page
- Click "Scan for Incidents" button
- ops0 runs detection across all namespaces
- New incidents appear in list
When to Use:
- After deploying changes to verify no new issues
- Troubleshooting suspected problems
- Validating incident auto-resolution
Scan Process:

```
Scanning production-eks for incidents...
✓ Checked 156 pods
✓ Checked 12 nodes
✓ Checked 35 deployments

Found 2 new incidents (1 Critical, 1 Medium)
```
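A scan pass like the one above can be sketched as a walk over pod container statuses. The field names mirror the Kubernetes `PodStatus` shape, but the function and the restart threshold are illustrative assumptions, not ops0's actual detector:

```python
# Hypothetical sketch of an incident scan: flag containers stuck in a
# back-off state or with a high restart count. Not ops0's real logic.
def scan_pods(pods):
    incidents = []
    for pod in pods:
        for cs in pod.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting", {})
            reason = waiting.get("reason")
            if reason in ("CrashLoopBackOff", "ImagePullBackOff"):
                incidents.append((pod["name"], reason))
            elif cs.get("restartCount", 0) >= 5:  # assumed threshold
                incidents.append((pod["name"], "HighRestartCount"))
    return incidents

pods = [
    {"name": "api-1", "containerStatuses": [
        {"state": {"waiting": {"reason": "CrashLoopBackOff"}}, "restartCount": 4}]},
    {"name": "ok-1", "containerStatuses": [
        {"state": {"running": {}}, "restartCount": 0}]},
]
print(scan_pods(pods))  # [('api-1', 'CrashLoopBackOff')]
```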
## AI-Powered Analysis
For each incident, ops0 can produce AI-powered root cause analysis:
Analysis Includes:
- Probable root cause
- Related configuration issues
- Suggested remediation steps
- Similar past incidents
Example Analysis:

```
Root Cause Analysis (AI-Generated)
───────────────────────────────────────────────
Incident: CrashLoopBackOff in api-gateway pod

Root Cause:
The pod is failing to start due to missing environment
variable DB_PASSWORD. The Secret 'api-gateway-secrets'
exists but is missing the 'db-password' key.

Evidence:
- Container logs show: "ERROR Missing required env: DB_PASSWORD"
- Secret 'api-gateway-secrets' was updated 45 minutes ago
- Previous version contained 'db-password' key
- No other pods in namespace are affected

Recommended Actions:
1. Verify Secret 'api-gateway-secrets' contains 'db-password' key
2. If key was removed, add it back with correct value
3. If key name changed, update Deployment env reference
```
Updating Analysis: AI analysis is created automatically for new incidents. To refresh:
- Click "Refresh Analysis" in incident panel
- Updated analysis appears within seconds
## Incident Notes
Add investigation notes and findings to incidents:
Adding Notes:
- Open incident detail panel
- Click "Add Note" button
- Write note in markdown
- Click "Save"
Note Visibility:
- All team members can view notes
- Author and timestamp recorded
- Notes preserved after incident resolution
Use Cases:
- Document investigation steps
- Share findings with team
- Track remediation attempts
- Link to related runbooks
Example Notes:

```
@sarah.chen 2024-01-15 10:50:00
Checked database pod - also in CrashLoopBackOff.
Appears to be PVC mount issue after node rotation.

@mike.jones 2024-01-15 11:05:00
Confirmed: PVC using wrong storage class.
Recreating with correct class now.

@sarah.chen 2024-01-15 11:15:00
Database pod recovered. API gateway auto-recovered
once DB became available. Closing incident.
```
## Auto-Resolution
Incidents automatically resolve when the underlying issue clears:
Auto-Resolution Logic:
- ops0 continuously monitors incident status
- If pods become healthy and stable for 10 minutes, incident auto-resolves
- Resolution type marked as "Auto-resolved"
- Timeline shows auto-resolution event
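The "healthy and stable for 10 minutes" rule above reduces to a time comparison. This is a minimal sketch under that assumption; ops0's real check may track more state:

```python
from datetime import datetime, timedelta

# Illustrative auto-resolution rule: resolve once the pod has been
# continuously healthy for a full stability window (10 minutes per the doc).
STABLE_WINDOW = timedelta(minutes=10)

def should_auto_resolve(healthy_since, now, is_healthy=True):
    """True once the pod has stayed healthy for the whole stability window."""
    if not is_healthy or healthy_since is None:
        return False
    return now - healthy_since >= STABLE_WINDOW

healthy_at = datetime(2024, 1, 15, 11, 15)
print(should_auto_resolve(healthy_at, datetime(2024, 1, 15, 11, 25)))  # True
print(should_auto_resolve(healthy_at, datetime(2024, 1, 15, 11, 20)))  # False
```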
Example Timeline:

```
10:47:00  Incident created (CrashLoopBackOff)
10:50:00  Acknowledged by @sarah.chen
11:15:00  Pod became healthy
11:25:00  Auto-resolved (stable for 10 minutes)
```
Manual vs Auto-Resolution:
- Auto-resolved: Issue cleared on its own or fix was applied
- Fixed: Manually marked as resolved by team member
- Not an issue: False positive or expected behavior
## Filtering Incidents
| Filter | Options |
|---|---|
| Status | All, Open, Acknowledged, Resolved |
| Severity | All, Critical, High, Medium |
| Type | CrashLoopBackOff, OOMKilled, ImagePullBackOff, etc. |
| Namespace | Filter by Kubernetes namespace |
| Search | Pod name, namespace, or message content |
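Conceptually, the filters in the table compose as successive narrowing passes. The field names here are assumptions for illustration, not ops0's data model:

```python
# Illustrative composition of status/severity/search filters over an
# in-memory incident list, mirroring the filter table above.
def filter_incidents(incidents, status=None, severity=None, search=None):
    results = incidents
    if status and status != "All":
        results = [i for i in results if i["status"] == status]
    if severity and severity != "All":
        results = [i for i in results if i["severity"] == severity]
    if search:  # matches pod name or message content, case-insensitively
        needle = search.lower()
        results = [i for i in results
                   if needle in i["pod"].lower() or needle in i["message"].lower()]
    return results

incidents = [
    {"status": "Open", "severity": "High",
     "pod": "api-gateway-1", "message": "CrashLoopBackOff"},
    {"status": "Resolved", "severity": "Medium",
     "pod": "worker-2", "message": "Pending"},
]
print(len(filter_incidents(incidents, status="Open", search="api")))  # 1
```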
## Incident Metrics
The incidents page shows aggregate metrics:
```
Last 24 Hours
─────────────────────────────
Total Incidents:    12
Critical (P1):       1
High (P2):           3
Medium (P3):         8

MTTA (Acknowledge):  5 min
MTTR (Resolve):     45 min
```
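MTTA and MTTR are means over per-incident durations (creation to acknowledgment, and creation to resolution). A minimal sketch of that computation, with assumed field names:

```python
from datetime import datetime

# Illustrative MTTA/MTTR aggregation: mean minutes from incident creation
# to the given end timestamp, skipping incidents that lack one.
def mean_minutes(incidents, end_field):
    deltas = [(i[end_field] - i["created"]).total_seconds() / 60
              for i in incidents if i.get(end_field)]
    return sum(deltas) / len(deltas) if deltas else 0.0

incidents = [
    {"created": datetime(2024, 1, 15, 10, 47),
     "acknowledged": datetime(2024, 1, 15, 10, 50),
     "resolved": datetime(2024, 1, 15, 11, 15)},
]
print(mean_minutes(incidents, "acknowledged"))  # 3.0  (MTTA)
print(mean_minutes(incidents, "resolved"))      # 28.0 (MTTR)
```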
## Related Resources
Each incident links to related Kubernetes resources:
- Deployment - Parent deployment if pod is managed
- ReplicaSet - Current ReplicaSet
- Service - Services routing to the pod
- ConfigMap/Secret - Mounted configurations
Click any related resource to view its details.
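The Pod-to-Deployment link can be recovered from Kubernetes `ownerReferences` (Pod → ReplicaSet → Deployment). A sketch of that traversal, with plain dicts standing in for API objects:

```python
# Illustrative walk up the ownerReferences chain from a pod to its
# Deployment. The metadata shape matches Kubernetes objects; the lookup
# table is a stand-in for real API reads.
def find_deployment(pod, objects):
    """Follow ownerReferences from a pod up to its Deployment, if any."""
    current = pod
    while current:
        owners = current.get("metadata", {}).get("ownerReferences", [])
        deploy = next((o for o in owners if o["kind"] == "Deployment"), None)
        if deploy:
            return deploy["name"]
        rs = next((o for o in owners if o["kind"] == "ReplicaSet"), None)
        current = objects.get(rs["name"]) if rs else None
    return None

objects = {"api-gateway-7d9f8c6b4d": {"metadata": {"ownerReferences": [
    {"kind": "Deployment", "name": "api-gateway"}]}}}
pod = {"metadata": {"ownerReferences": [
    {"kind": "ReplicaSet", "name": "api-gateway-7d9f8c6b4d"}]}}
print(find_deployment(pod, objects))  # api-gateway
```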
## Example: Investigating a CrashLoopBackOff

### Incident Alert

```
INCIDENT #1247 - P2 High
─────────────────────────────────────
Type:       CrashLoopBackOff
Pod:        api-gateway-7d9f8c6b4d-2xkjp
Namespace:  production
Container:  api
Cluster:    production-eks
Detected:   2024-01-15 10:47:00 UTC
```
### Incident Timeline

```
10:45:00  Container 'api' started
10:45:02  Readiness probe passed
10:45:15  Error: Database connection refused
10:45:15  Container exited with code 1
10:45:30  Container restarted (attempt 1)
10:45:32  Readiness probe passed
10:45:45  Error: Database connection refused
10:45:45  Container exited with code 1
10:46:00  Container restarted (attempt 2)
10:46:45  Container exited with code 1
10:47:00  CrashLoopBackOff detected
10:47:00  Incident #1247 created (P2 High)
10:47:01  Slack notification sent to #incidents
```
### Captured Logs

```
[10:45:12] INFO  Starting API Gateway v2.3.1
[10:45:13] INFO  Loading configuration from /etc/config/app.yaml
[10:45:14] INFO  Connecting to database: postgresql://db.internal:5432/api
[10:45:15] ERROR Connection refused: postgresql://db.internal:5432/api
[10:45:15] ERROR Failed to establish database connection after 3 retries
[10:45:15] FATAL Cannot start without database connection, exiting
[10:45:15] INFO  Shutdown complete
```
### Investigation

Step 1: Check the pod's events.

```
Type     Reason     Age  Message
────     ──────     ───  ───────
Normal   Scheduled  5m   Successfully assigned production/api-gateway-... to node-1
Normal   Pulling    5m   Pulling image "api-gateway:v2.3.1"
Normal   Pulled     5m   Successfully pulled image in 2s
Normal   Created    5m   Created container api
Normal   Started    5m   Started container api
Warning  BackOff    2m   Back-off restarting failed container
```
Step 2: The logs point to the database, so check the database pod.

```
kubectl get pods -n production -l app=postgresql

NAME           READY   STATUS             RESTARTS   AGE
postgresql-0   0/1     CrashLoopBackOff   5          10m
```
Root Cause: The PostgreSQL pod was also crashing due to a PVC mount failure.
### Resolution

```
Resolution Type: Fixed

Resolution Notes: PostgreSQL pod was crashing due to PVC storage class
misconfiguration after node rotation. Recreated PVC with correct storage
class. Database pod recovered, API gateway auto-recovered after database
became available.

Resolved by: @sarah.chen
Resolved at: 2024-01-15 11:15:00 UTC
MTTR: 28 minutes
```
## Example: OOMKilled Incident

### Incident Alert

```
INCIDENT #1248 - P1 Critical
─────────────────────────────────────
Type:       OOMKilled
Pod:        data-processor-5f8d9c7b2-kp3mn
Namespace:  analytics
Container:  processor
Cluster:    production-eks
Detected:   2024-01-15 14:22:00 UTC
```
### Resource Details

```
Container: processor
─────────────────────────────────────
Memory Request:  512Mi
Memory Limit:    1Gi
Last Usage:      1Gi (100% of limit)
Exit Code:       137 (OOMKilled)
```

Pod Events:

```
Warning  OOMKilled  Container processor exceeded memory limit
```
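Exit code 137 is 128 + 9: the container was killed with SIGKILL, which is what the kernel OOM killer sends when the container hits its memory limit. A quick way to decode such exit codes:

```python
import signal

# Container exit codes above 128 encode a fatal signal as (128 + signum).
# 137 = 128 + SIGKILL(9), the signature of an OOM kill.
def decode_exit_code(code):
    if code > 128:
        return f"killed by signal {signal.Signals(code - 128).name}"
    return f"exited with code {code}"

print(decode_exit_code(137))  # killed by signal SIGKILL
print(decode_exit_code(1))    # exited with code 1
```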
### Resolution

Increased the memory request and limit in the deployment:

```yaml
resources:
  requests:
    memory: "1Gi"   # was 512Mi
  limits:
    memory: "2Gi"   # was 1Gi
```
Applied change and pod recovered without further OOM events.