ops0 automatically detects issues in your Kubernetes clusters and surfaces them as actionable incidents. Monitor pod failures, resource exhaustion, and deployment problems from one dashboard.
| Severity | Color | Description | Example |
|---|---|---|---|
| P1 Critical | Red | Immediate attention required | OOMKilled, Node NotReady |
| P2 High | Orange | Urgent, affecting production | CrashLoopBackOff |
| P3 Medium | Yellow | Should be addressed soon | High restart count, Pending pods |
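As a rough illustration of the table above, the mapping from incident type to severity can be written as a lookup. This is a minimal sketch, not ops0's implementation: the `Severity` enum, the `SEVERITY_BY_TYPE` keys, and the `classify` function are hypothetical names, and the fallback to P3 for types the table does not list is an assumption.

```python
from enum import Enum


class Severity(Enum):
    P1_CRITICAL = "P1 Critical"
    P2_HIGH = "P2 High"
    P3_MEDIUM = "P3 Medium"


# Mapping taken from the Example column of the severity table.
# The key spellings are hypothetical identifiers chosen for this sketch.
SEVERITY_BY_TYPE = {
    "OOMKilled": Severity.P1_CRITICAL,
    "NodeNotReady": Severity.P1_CRITICAL,
    "CrashLoopBackOff": Severity.P2_HIGH,
    "HighRestartCount": Severity.P3_MEDIUM,
    "PendingPod": Severity.P3_MEDIUM,
}


def classify(incident_type: str) -> Severity:
    """Return the severity for an incident type.

    Types not listed in the table fall back to P3 Medium (an assumption).
    """
    return SEVERITY_BY_TYPE.get(incident_type, Severity.P3_MEDIUM)
```

For example, `classify("CrashLoopBackOff")` returns `Severity.P2_HIGH`.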
| Status | Description |
|---|---|
| Open | Newly detected, not yet triaged |
| Acknowledged | Team is aware and investigating |
| Resolved | Issue fixed or no longer occurring |
When you click an incident, the detail panel shows:
| Field | Description |
|---|---|
| Pod | Affected pod name |
| Namespace | Kubernetes namespace |
| Container | Container name (if applicable) |
| Node | Node the pod is running on |
| Restart Count | Number of container restarts |
| Last Termination | Reason for last container exit |
The event timeline shows the history leading up to the incident:
10:45:00 Container 'api' exited with code 1
10:45:30 Container restarted (attempt 1)
10:46:00 Container 'api' exited with code 1
10:46:30 Container restarted (attempt 2)
10:47:00 CrashLoopBackOff detected
10:47:00 Incident created (P2 High)
Recent container logs captured at incident detection:
[10:44:55] Starting application...
[10:44:56] Connected to database
[10:44:58] Error: Connection pool exhausted
[10:44:58] Fatal: Unable to handle request
[10:44:59] Process exiting with code 1
Trigger on-demand incident detection for a cluster:
When to Use:
Scan Process:
Scanning production-eks for incidents...
✓ Checked 156 pods
✓ Checked 12 nodes
✓ Checked 35 deployments
Found 2 new incidents (1 Critical, 1 High)
For each incident, ops0 can produce AI-powered root cause analysis:
Analysis Includes:
Example Analysis:
Root Cause Analysis (AI-Generated)
───────────────────────────────────────────────
Incident: CrashLoopBackOff in api-gateway pod
Root Cause:
The pod is failing to start due to missing environment
variable DB_PASSWORD. The Secret 'api-gateway-secrets'
exists but is missing the 'db-password' key.
Evidence:
- Container logs show: "ERROR Missing required env: DB_PASSWORD"
- Secret 'api-gateway-secrets' was updated 45 minutes ago
- Previous version contained 'db-password' key
- No other pods in namespace are affected
Recommended Actions:
1. Verify Secret 'api-gateway-secrets' contains 'db-password' key
2. If key was removed, add it back with correct value
3. If key name changed, update Deployment env reference
Updating Analysis: AI analysis is created automatically for new incidents. To refresh:
Add investigation notes and findings to incidents:
Adding Notes:
Note Visibility:
Use Cases:
Example Notes:
@sarah.chen 2024-01-15 10:50:00
Checked database pod - also in CrashLoopBackOff.
Appears to be PVC mount issue after node rotation.
@mike.jones 2024-01-15 11:05:00
Confirmed: PVC using wrong storage class.
Recreating with correct class now.
@sarah.chen 2024-01-15 11:15:00
Database pod recovered. API gateway auto-recovered
once DB became available. Closing incident.
Incidents automatically resolve when the underlying issue clears:
Auto-Resolution Logic:
Example Timeline:
10:47:00 Incident created (CrashLoopBackOff)
10:50:00 Acknowledged by @sarah.chen
11:15:00 Pod became healthy
11:25:00 Auto-resolved (stable for 10 minutes)
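The timeline above suggests a simple rule: an incident is resolved once the pod has stayed healthy for a fixed window. The sketch below illustrates that rule only. The function and parameter names are hypothetical, and the 10-minute window is read off the example timeline rather than from a documented setting.

```python
from datetime import datetime, timedelta
from typing import Optional

# Stability window inferred from the example timeline (assumption).
STABLE_WINDOW = timedelta(minutes=10)


def should_auto_resolve(
    healthy_since: Optional[datetime],
    now: datetime,
    window: timedelta = STABLE_WINDOW,
) -> bool:
    """Return True once the pod has been continuously healthy for `window`.

    `healthy_since` is None while the pod is still unhealthy.
    """
    if healthy_since is None:
        return False
    return now - healthy_since >= window
```

With `healthy_since` at 11:15:00, the check stays false at 11:24:00 and becomes true at 11:25:00, matching the timeline.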
Manual vs Auto-Resolution:
| Filter | Options |
|---|---|
| Status | All, Open, Acknowledged, Resolved |
| Severity | All, Critical, High, Medium |
| Type | CrashLoopBackOff, OOMKilled, ImagePullBackOff, etc. |
| Namespace | Filter by Kubernetes namespace |
| Search | Pod name, namespace, or message content |
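To make the filter semantics concrete, here is a small sketch of how the filters in the table could combine. It assumes that filters are ANDed together and that search is a case-insensitive substring match over pod name, namespace, and message. Both are assumptions, and the `Incident` class and `filter_incidents` function are hypothetical, not part of any ops0 API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Incident:
    pod: str
    namespace: str
    kind: str        # e.g. "CrashLoopBackOff", "OOMKilled"
    severity: str    # "Critical", "High", or "Medium"
    status: str      # "Open", "Acknowledged", or "Resolved"
    message: str = ""


def filter_incidents(
    incidents: Iterable[Incident],
    *,
    status: Optional[str] = None,
    severity: Optional[str] = None,
    kind: Optional[str] = None,
    namespace: Optional[str] = None,
    search: Optional[str] = None,
) -> List[Incident]:
    """Keep incidents matching every filter that is set (None means 'All')."""
    needle = search.lower() if search else None
    matches = []
    for inc in incidents:
        if status and inc.status != status:
            continue
        if severity and inc.severity != severity:
            continue
        if kind and inc.kind != kind:
            continue
        if namespace and inc.namespace != namespace:
            continue
        # Search covers pod name, namespace, and message content.
        if needle and not any(
            needle in text.lower()
            for text in (inc.pod, inc.namespace, inc.message)
        ):
            continue
        matches.append(inc)
    return matches
```

Calling `filter_incidents(incidents, severity="Critical", namespace="analytics")` would return only Critical incidents in the `analytics` namespace.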
The incidents page shows aggregate metrics:
Last 24 Hours
─────────────────────────────
Total Incidents: 12
Critical (P1): 1
High (P2): 3
Medium (P3): 8
MTTA (Acknowledge): 5 min
MTTR (Resolve): 45 min
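MTTA and MTTR are averages of elapsed time per incident. The sketch below assumes both are measured from the detection timestamp (to acknowledgement for MTTA, to resolution for MTTR). The helper name `mean_minutes` is hypothetical, and the sample timestamps are illustrative values for a single incident.

```python
from datetime import datetime
from statistics import mean
from typing import Iterable, Tuple


def mean_minutes(pairs: Iterable[Tuple[datetime, datetime]]) -> float:
    """Mean elapsed minutes across (start, end) timestamp pairs."""
    return mean((end - start).total_seconds() / 60 for start, end in pairs)


# Illustrative timestamps for one incident.
detected = datetime(2024, 1, 15, 10, 47, 0)
acknowledged = datetime(2024, 1, 15, 10, 50, 0)
resolved = datetime(2024, 1, 15, 11, 15, 0)

mtta = mean_minutes([(detected, acknowledged)])  # 3.0 minutes
mttr = mean_minutes([(detected, resolved)])      # 28.0 minutes
```

Over a 24-hour window, the same helper would be called with one pair per incident that has been acknowledged or resolved.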
Each incident links to related Kubernetes resources:
Click any related resource to view its details.
INCIDENT #1247 - P2 High
─────────────────────────────────────
Type: CrashLoopBackOff
Pod: api-gateway-7d9f8c6b4d-2xkjp
Namespace: production
Container: api
Cluster: production-eks
Detected: 2024-01-15 10:47:00 UTC
10:45:00 Container 'api' started
10:45:02 Readiness probe passed
10:45:15 Error: Database connection refused
10:45:15 Container exited with code 1
10:45:30 Container restarted (attempt 1)
10:45:32 Readiness probe passed
10:45:45 Error: Database connection refused
10:45:45 Container exited with code 1
10:46:00 Container restarted (attempt 2)
10:46:45 Container exited with code 1
10:47:00 CrashLoopBackOff detected
10:47:00 Incident #1247 created (P2 High)
10:47:01 Slack notification sent to #incidents
[10:45:12] INFO Starting API Gateway v2.3.1
[10:45:13] INFO Loading configuration from /etc/config/app.yaml
[10:45:14] INFO Connecting to database: postgresql://db.internal:5432/api
[10:45:15] ERROR Connection refused: postgresql://db.internal:5432/api
[10:45:15] ERROR Failed to establish database connection after 3 retries
[10:45:15] FATAL Cannot start without database connection, exiting
[10:45:15] INFO Shutdown complete
Step 1: Check Events
Type     Reason     Age  Message
────     ──────     ───  ───────
Normal   Scheduled  5m   Successfully assigned production/api-gateway-... to node-1
Normal   Pulling    5m   Pulling image "api-gateway:v2.3.1"
Normal   Pulled     5m   Successfully pulled image in 2s
Normal   Created    5m   Created container api
Normal   Started    5m   Started container api
Warning  BackOff    2m   Back-off restarting failed container
Step 2: Check Database Pod
kubectl get pods -n production -l app=postgresql
NAME           READY   STATUS             RESTARTS   AGE
postgresql-0   0/1     CrashLoopBackOff   5          10m
Root Cause: The PostgreSQL pod was also crashing due to a PVC mount failure.
Resolution Type: Fixed
Resolution Notes: PostgreSQL pod was crashing due to PVC storage class
misconfiguration after node rotation. Recreated PVC
with correct storage class. Database pod recovered,
API gateway auto-recovered after database became available.
Resolved by: @sarah.chen
Resolved at: 2024-01-15 11:15:00 UTC
MTTR: 28 minutes
INCIDENT #1248 - P1 Critical
─────────────────────────────────────
Type: OOMKilled
Pod: data-processor-5f8d9c7b2-kp3mn
Namespace: analytics
Container: processor
Cluster: production-eks
Detected: 2024-01-15 14:22:00 UTC
Container: processor
─────────────────────────────────────
Memory Request: 512Mi
Memory Limit: 1Gi
Last Usage: 1Gi (100% of limit)
Exit Code: 137 (OOMKilled)
Pod Events:
Warning OOMKilled Container processor exceeded memory limit
Increased the memory request and limit in the deployment:
resources:
  requests:
    memory: "1Gi"  # was 512Mi
  limits:
    memory: "2Gi"  # was 1Gi
Applied change and pod recovered without further OOM events.
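The "100% of limit" figure in the resource details can be reproduced by converting Kubernetes memory quantities to bytes. This is a minimal sketch that handles only plain integers and the binary suffixes `Ki`, `Mi`, `Gi`, and `Ti`. The function names are hypothetical, and full quantity parsing (decimal suffixes, fractional values) is out of scope here.

```python
# Binary (power-of-two) suffixes used by Kubernetes memory quantities.
_BINARY_SUFFIXES = {
    "Ki": 1024,
    "Mi": 1024 ** 2,
    "Gi": 1024 ** 3,
    "Ti": 1024 ** 4,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '1Gi' to bytes.

    Only integer values with the suffixes above (or no suffix) are supported.
    """
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def usage_percent(usage: str, limit: str) -> float:
    """Memory usage as a percentage of the container's limit."""
    return 100 * parse_memory(usage) / parse_memory(limit)
```

With the values above, `usage_percent("1Gi", "1Gi")` is `100.0` before the change, and the same 1Gi of usage is `50.0` percent of the new 2Gi limit.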