# Kubernetes Incidents
ops0 automatically detects issues in your Kubernetes clusters and surfaces them as actionable incidents. Monitor pod failures, resource exhaustion, and deployment problems from one dashboard.
## Detected Incident Types

### Severity Levels
| Severity | Color | Description | Example |
|---|---|---|---|
| P1 Critical | Red | Immediate attention required | OOMKilled, Node NotReady |
| P2 High | Orange | Urgent, affecting production | CrashLoopBackOff |
| P3 Medium | Yellow | Should be addressed soon | High restart count, Pending pods |
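As an illustration of the severity table above, a condition-to-severity mapping might look like the following sketch. The dictionary and function names are hypothetical, not ops0's actual implementation:

```python
# Illustrative mapping from detected Kubernetes conditions to ops0-style
# severity labels, based on the examples in the table above.
SEVERITY_BY_CONDITION = {
    "OOMKilled": "P1 Critical",
    "NodeNotReady": "P1 Critical",
    "CrashLoopBackOff": "P2 High",
    "HighRestartCount": "P3 Medium",
    "PendingPod": "P3 Medium",
}

def severity_for(condition: str) -> str:
    """Return the severity label for a condition, defaulting to P3 Medium."""
    return SEVERITY_BY_CONDITION.get(condition, "P3 Medium")

print(severity_for("OOMKilled"))  # P1 Critical
```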
### Incident Status
| Status | Description |
|---|---|
| Open | Newly detected, not yet triaged |
| Acknowledged | Team is aware and investigating |
| Resolved | Issue fixed or no longer occurring |
## Incident Details
When you click an incident, the detail panel shows:
### Resource Information
| Field | Description |
|---|---|
| Pod | Affected pod name |
| Namespace | Kubernetes namespace |
| Container | Container name (if applicable) |
| Node | Node the pod is running on |
| Restart Count | Number of container restarts |
| Last Termination | Reason for last container exit |
### Timeline

Shows the event history leading up to the incident:

```
10:45:00  Container 'api' exited with code 1
10:45:30  Container restarted (attempt 1)
10:46:00  Container 'api' exited with code 1
10:46:30  Container restarted (attempt 2)
10:47:00  CrashLoopBackOff detected
10:47:00  Incident created (P2 High)
```
### Captured Logs

Recent container logs captured at incident detection:

```
[10:44:55] Starting application...
[10:44:56] Connected to database
[10:44:58] Error: Connection pool exhausted
[10:44:58] Fatal: Unable to handle request
[10:44:59] Process exiting with code 1
```
### Quick Actions

## Managing Incidents

### Acknowledge
- Click Acknowledge on an open incident
- Optionally add a note about who's investigating
- Status changes to "Acknowledged"
- Shows acknowledger name and timestamp
### Resolve

- Click Resolve on an incident
- Add resolution notes (what fixed it)
- Select resolution type:
  - Fixed - Issue was corrected
  - Not an issue - False positive or expected behavior
  - Auto-resolved - Issue cleared on its own
- Incident moves to the resolved state
## Manual Incident Scanning
Trigger on-demand incident detection for a cluster:
- Navigate to cluster detail page
- Click "Scan for Incidents" button
- ops0 runs detection across all namespaces
- New incidents appear in list
When to Use:
- After deploying changes to verify no new issues
- Troubleshooting suspected problems
- Validating incident auto-resolution
Scan Process:

```
Scanning production-eks for incidents...
✓ Checked 156 pods
✓ Checked 12 nodes
✓ Checked 35 deployments

Found 2 new incidents (1 Critical, 1 Medium)
```
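A scan pass like the one above can be sketched as a walk over pod container statuses. The field names mirror the Kubernetes `PodStatus` shape, but the function and the restart threshold are illustrative assumptions, not ops0's actual detector:

```python
# Hypothetical sketch of an incident scan: flag containers stuck in a
# back-off state or with a high restart count. Not ops0's real logic.
def scan_pods(pods):
    incidents = []
    for pod in pods:
        for cs in pod.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting", {})
            reason = waiting.get("reason")
            if reason in ("CrashLoopBackOff", "ImagePullBackOff"):
                incidents.append((pod["name"], reason))
            elif cs.get("restartCount", 0) >= 5:  # assumed threshold
                incidents.append((pod["name"], "HighRestartCount"))
    return incidents

pods = [
    {"name": "api-1", "containerStatuses": [
        {"state": {"waiting": {"reason": "CrashLoopBackOff"}}, "restartCount": 4}]},
    {"name": "ok-1", "containerStatuses": [
        {"state": {"running": {}}, "restartCount": 0}]},
]
print(scan_pods(pods))  # [('api-1', 'CrashLoopBackOff')]
```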
## AI-Powered Analysis
For each incident, ops0 can produce AI-powered root cause analysis:
Analysis Includes:
- Probable root cause
- Related configuration issues
- Suggested remediation steps
- Similar past incidents
Example Analysis:

```
Root Cause Analysis (AI-Generated)
───────────────────────────────────────────────
Incident: CrashLoopBackOff in api-gateway pod

Root Cause:
The pod is failing to start due to missing environment
variable DB_PASSWORD. The Secret 'api-gateway-secrets'
exists but is missing the 'db-password' key.

Evidence:
- Container logs show: "ERROR Missing required env: DB_PASSWORD"
- Secret 'api-gateway-secrets' was updated 45 minutes ago
- Previous version contained 'db-password' key
- No other pods in namespace are affected

Recommended Actions:
1. Verify Secret 'api-gateway-secrets' contains 'db-password' key
2. If key was removed, add it back with correct value
3. If key name changed, update Deployment env reference
```
Updating Analysis: AI analysis is created automatically for new incidents. To refresh:
- Click "Refresh Analysis" in incident panel
- Updated analysis appears within seconds
## Incident Notes
Add investigation notes and findings to incidents:
Adding Notes:
- Open incident detail panel
- Click "Add Note" button
- Write note in markdown
- Click "Save"
Note Visibility:
- All team members can view notes
- Author and timestamp recorded
- Notes preserved after incident resolution
Use Cases:
- Document investigation steps
- Share findings with team
- Track remediation attempts
- Link to related runbooks
Example Notes:

```
@sarah.chen 2024-01-15 10:50:00
Checked database pod - also in CrashLoopBackOff.
Appears to be PVC mount issue after node rotation.

@mike.jones 2024-01-15 11:05:00
Confirmed: PVC using wrong storage class.
Recreating with correct class now.

@sarah.chen 2024-01-15 11:15:00
Database pod recovered. API gateway auto-recovered
once DB became available. Closing incident.
```
## Auto-Resolution
Incidents automatically resolve when the underlying issue clears:
Auto-Resolution Logic:
- ops0 continuously monitors incident status
- If pods become healthy and stable for 10 minutes, incident auto-resolves
- Resolution type marked as "Auto-resolved"
- Timeline shows auto-resolution event
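The "healthy and stable for 10 minutes" rule above reduces to a time comparison. This is a minimal sketch under that assumption; ops0's real check may track more state:

```python
from datetime import datetime, timedelta

# Illustrative auto-resolution rule: resolve once the pod has been
# continuously healthy for a full stability window (10 minutes per the doc).
STABLE_WINDOW = timedelta(minutes=10)

def should_auto_resolve(healthy_since, now, is_healthy=True):
    """True once the pod has stayed healthy for the whole stability window."""
    if not is_healthy or healthy_since is None:
        return False
    return now - healthy_since >= STABLE_WINDOW

healthy_at = datetime(2024, 1, 15, 11, 15)
print(should_auto_resolve(healthy_at, datetime(2024, 1, 15, 11, 25)))  # True
print(should_auto_resolve(healthy_at, datetime(2024, 1, 15, 11, 20)))  # False
```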
Example Timeline:

```
10:47:00  Incident created (CrashLoopBackOff)
10:50:00  Acknowledged by @sarah.chen
11:15:00  Pod became healthy
11:25:00  Auto-resolved (stable for 10 minutes)
```
Manual vs Auto-Resolution:
- Auto-resolved: Issue cleared on its own or fix was applied
- Fixed: Manually marked as resolved by team member
- Not an issue: False positive or expected behavior
## Filtering Incidents
| Filter | Options |
|---|---|
| Status | All, Open, Acknowledged, Resolved |
| Severity | All, Critical, High, Medium |
| Type | CrashLoopBackOff, OOMKilled, ImagePullBackOff, etc. |
| Namespace | Filter by Kubernetes namespace |
| Search | Pod name, namespace, or message content |
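Conceptually, the filters in the table compose as successive narrowing passes. The field names here are assumptions for illustration, not ops0's data model:

```python
# Illustrative composition of status/severity/search filters over an
# in-memory incident list, mirroring the filter table above.
def filter_incidents(incidents, status=None, severity=None, search=None):
    results = incidents
    if status and status != "All":
        results = [i for i in results if i["status"] == status]
    if severity and severity != "All":
        results = [i for i in results if i["severity"] == severity]
    if search:  # matches pod name or message content, case-insensitively
        needle = search.lower()
        results = [i for i in results
                   if needle in i["pod"].lower() or needle in i["message"].lower()]
    return results

incidents = [
    {"status": "Open", "severity": "High",
     "pod": "api-gateway-1", "message": "CrashLoopBackOff"},
    {"status": "Resolved", "severity": "Medium",
     "pod": "worker-2", "message": "Pending"},
]
print(len(filter_incidents(incidents, status="Open", search="api")))  # 1
```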
## Incident Metrics
The incidents page shows aggregate metrics:
```
Last 24 Hours
─────────────────────────────
Total Incidents:    12
Critical (P1):       1
High (P2):           3
Medium (P3):         8

MTTA (Acknowledge):  5 min
MTTR (Resolve):     45 min
```
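MTTA and MTTR are means over per-incident durations (creation to acknowledgment, and creation to resolution). A minimal sketch of that computation, with assumed field names:

```python
from datetime import datetime

# Illustrative MTTA/MTTR aggregation: mean minutes from incident creation
# to the given end timestamp, skipping incidents that lack one.
def mean_minutes(incidents, end_field):
    deltas = [(i[end_field] - i["created"]).total_seconds() / 60
              for i in incidents if i.get(end_field)]
    return sum(deltas) / len(deltas) if deltas else 0.0

incidents = [
    {"created": datetime(2024, 1, 15, 10, 47),
     "acknowledged": datetime(2024, 1, 15, 10, 50),
     "resolved": datetime(2024, 1, 15, 11, 15)},
]
print(mean_minutes(incidents, "acknowledged"))  # 3.0  (MTTA)
print(mean_minutes(incidents, "resolved"))      # 28.0 (MTTR)
```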
## Related Resources
Each incident links to related Kubernetes resources:
- Deployment - Parent deployment if pod is managed
- ReplicaSet - Current ReplicaSet
- Service - Services routing to the pod
- ConfigMap/Secret - Mounted configurations
Click any related resource to view its details.
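The Pod-to-Deployment link can be recovered from Kubernetes `ownerReferences` (Pod → ReplicaSet → Deployment). A sketch of that traversal, with plain dicts standing in for API objects:

```python
# Illustrative walk up the ownerReferences chain from a pod to its
# Deployment. The metadata shape matches Kubernetes objects; the lookup
# table is a stand-in for real API reads.
def find_deployment(pod, objects):
    """Follow ownerReferences from a pod up to its Deployment, if any."""
    current = pod
    while current:
        owners = current.get("metadata", {}).get("ownerReferences", [])
        deploy = next((o for o in owners if o["kind"] == "Deployment"), None)
        if deploy:
            return deploy["name"]
        rs = next((o for o in owners if o["kind"] == "ReplicaSet"), None)
        current = objects.get(rs["name"]) if rs else None
    return None

objects = {"api-gateway-7d9f8c6b4d": {"metadata": {"ownerReferences": [
    {"kind": "Deployment", "name": "api-gateway"}]}}}
pod = {"metadata": {"ownerReferences": [
    {"kind": "ReplicaSet", "name": "api-gateway-7d9f8c6b4d"}]}}
print(find_deployment(pod, objects))  # api-gateway
```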
## Example: Investigating a CrashLoopBackOff

### Incident Alert

```
INCIDENT #1247 - P2 High
─────────────────────────────────────
Type:       CrashLoopBackOff
Pod:        api-gateway-7d9f8c6b4d-2xkjp
Namespace:  production
Container:  api
Cluster:    production-eks
Detected:   2024-01-15 10:47:00 UTC
```
### Incident Timeline

```
10:45:00  Container 'api' started
10:45:02  Readiness probe passed
10:45:15  Error: Database connection refused
10:45:15  Container exited with code 1
10:45:30  Container restarted (attempt 1)
10:45:32  Readiness probe passed
10:45:45  Error: Database connection refused
10:45:45  Container exited with code 1
10:46:00  Container restarted (attempt 2)
10:46:45  Container exited with code 1
10:47:00  CrashLoopBackOff detected
10:47:00  Incident #1247 created (P2 High)
10:47:01  Slack notification sent to #incidents
```
### Captured Logs

```
[10:45:12] INFO  Starting API Gateway v2.3.1
[10:45:13] INFO  Loading configuration from /etc/config/app.yaml
[10:45:14] INFO  Connecting to database: postgresql://db.internal:5432/api
[10:45:15] ERROR Connection refused: postgresql://db.internal:5432/api
[10:45:15] ERROR Failed to establish database connection after 3 retries
[10:45:15] FATAL Cannot start without database connection, exiting
[10:45:15] INFO  Shutdown complete
```
### Investigation

Step 1: Check the pod's events.

```
Type     Reason     Age  Message
────     ──────     ───  ───────
Normal   Scheduled  5m   Successfully assigned production/api-gateway-... to node-1
Normal   Pulling    5m   Pulling image "api-gateway:v2.3.1"
Normal   Pulled     5m   Successfully pulled image in 2s
Normal   Created    5m   Created container api
Normal   Started    5m   Started container api
Warning  BackOff    2m   Back-off restarting failed container
```
Step 2: The logs point to the database, so check the database pod.

```
kubectl get pods -n production -l app=postgresql

NAME           READY   STATUS             RESTARTS   AGE
postgresql-0   0/1     CrashLoopBackOff   5          10m
```
Root Cause: The PostgreSQL pod was also crashing due to a PVC mount failure.
### Resolution

```
Resolution Type: Fixed

Resolution Notes: PostgreSQL pod was crashing due to PVC storage class
misconfiguration after node rotation. Recreated PVC with correct storage
class. Database pod recovered, API gateway auto-recovered after database
became available.

Resolved by: @sarah.chen
Resolved at: 2024-01-15 11:15:00 UTC
MTTR: 28 minutes
```
## Example: OOMKilled Incident

### Incident Alert

```
INCIDENT #1248 - P1 Critical
─────────────────────────────────────
Type:       OOMKilled
Pod:        data-processor-5f8d9c7b2-kp3mn
Namespace:  analytics
Container:  processor
Cluster:    production-eks
Detected:   2024-01-15 14:22:00 UTC
```
### Resource Details

```
Container: processor
─────────────────────────────────────
Memory Request:  512Mi
Memory Limit:    1Gi
Last Usage:      1Gi (100% of limit)
Exit Code:       137 (OOMKilled)
```

Pod Events:

```
Warning  OOMKilled  Container processor exceeded memory limit
```
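Exit code 137 is 128 + 9: the container was killed with SIGKILL, which is what the kernel OOM killer sends when the container hits its memory limit. A quick way to decode such exit codes:

```python
import signal

# Container exit codes above 128 encode a fatal signal as (128 + signum).
# 137 = 128 + SIGKILL(9), the signature of an OOM kill.
def decode_exit_code(code):
    if code > 128:
        return f"killed by signal {signal.Signals(code - 128).name}"
    return f"exited with code {code}"

print(decode_exit_code(137))  # killed by signal SIGKILL
print(decode_exit_code(1))    # exited with code 1
```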
### Resolution

Increased the memory request and limit in the deployment:

```yaml
resources:
  requests:
    memory: "1Gi"   # was 512Mi
  limits:
    memory: "2Gi"   # was 1Gi
```
Applied change and pod recovered without further OOM events.