Use ops0's AI-assisted troubleshooting to quickly diagnose and resolve pod failures, resource issues, and cluster problems.
Your team gets paged at 3 AM because pods are crashing. You need to diagnose the failure and resolve it quickly.
This guide shows how ops0 accelerates Kubernetes debugging.
The ops0 agent continuously monitors your clusters and automatically detects issues:
| Incident Type | Detection Trigger |
|---|---|
| CrashLoopBackOff | Pod restart count > 3 within 10 minutes |
| OOMKilled | Container terminated due to memory limit |
| ImagePullBackOff | Failed to pull container image after 3 attempts |
| Pending | Pod stuck in Pending for > 5 minutes |
| Failed | Pod entered Failed state |
| High CPU | Container CPU > 90% for > 5 minutes |
| High Memory | Container memory > 85% of limit |
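
The triggers in the table can be read as simple threshold rules. The following is a minimal Python sketch of that logic, not ops0's actual implementation: `PodSample` and `detect_incidents` are hypothetical names chosen for illustration, and CPU usage is assumed to be measured as a percentage of the container's CPU limit, which the table does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PodSample:
    """Hypothetical snapshot of one pod; field names are illustrative only."""
    phase: str = "Running"                      # Pending, Running, Failed, ...
    restarts_last_10_min: int = 0
    last_termination_reason: Optional[str] = None
    image_pull_failures: int = 0
    pending_minutes: float = 0.0
    cpu_percent: float = 0.0                    # assumed: % of the CPU limit
    cpu_high_minutes: float = 0.0               # time spent above 90% CPU
    memory_percent_of_limit: float = 0.0


def detect_incidents(s: PodSample) -> List[str]:
    """Return the incident types from the table whose trigger matches."""
    incidents = []
    if s.restarts_last_10_min > 3:
        incidents.append("CrashLoopBackOff")
    if s.last_termination_reason == "OOMKilled":
        incidents.append("OOMKilled")
    if s.image_pull_failures >= 3:
        incidents.append("ImagePullBackOff")
    if s.phase == "Pending" and s.pending_minutes > 5:
        incidents.append("Pending")
    if s.phase == "Failed":
        incidents.append("Failed")
    if s.cpu_percent > 90 and s.cpu_high_minutes > 5:
        incidents.append("High CPU")
    if s.memory_percent_of_limit > 85:
        incidents.append("High Memory")
    return incidents


# A pod that restarted 5 times inside the 10-minute window.
print(detect_incidents(PodSample(restarts_last_10_min=5)))
```

Note that the comparisons are strict where the table says "greater than": a pod with exactly 3 restarts in 10 minutes does not trigger an incident in this sketch.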
When detected, incidents appear in your ops0 dashboard with AI analysis already started.
┌─────────────────────────────────────────────────────────────────┐
│ INCIDENT: CrashLoopBackOff                                      │
│ Pod: api-server-7d9f8b6c4-x2k9m                                 │
│ Namespace: production                                           │
│ Started: 3 minutes ago                                          │
│ Restarts: 5                                                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ AI Analysis                                                     │
│ ─────────────────────────────────────────────────────────────── │
│ The pod is crashing because it cannot connect to the database   │
│ at postgres.production.svc.cluster.local:5432. The connection   │
│ is timing out after 30 seconds.                                 │
│                                                                 │
│ Likely cause: The postgres service was deleted or renamed       │
│ in the last deployment (15 minutes ago).                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
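ops0 automatically analyzes multiple data sources, including container logs, Kubernetes events, and related resources, to identify the root cause. The analysis is organized into these sections: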
| Section | What It Contains |
|---|---|
| Root Cause | Primary reason for the failure |
| Evidence | Specific log lines, events, or configs that led to this conclusion |
| Impact | What's affected (other pods, services, endpoints) |
| Suggested Fixes | Actionable remediation steps |
Click on any evidence item to see the full context:
[2024-01-15 03:14:22] INFO Starting application...
[2024-01-15 03:14:22] INFO Connecting to database: postgres.production.svc.cluster.local:5432
[2024-01-15 03:14:52] ERROR Connection timeout after 30000ms
[2024-01-15 03:14:52] FATAL Unable to connect to database, shutting down
[2024-01-15 03:14:52] INFO Application terminated with exit code 1
ops0 highlights the relevant lines and lets you search/filter.
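
To make the highlighting concrete, here is a minimal Python sketch that picks out the ERROR and FATAL lines from the excerpt above. It assumes the `[timestamp] LEVEL message` layout shown in the example; the regular expression and the `relevant_lines` helper are illustrative and do not describe how ops0 parses logs internally.

```python
import re
from typing import List, Tuple

# Matches lines such as: [2024-01-15 03:14:52] ERROR Connection timeout ...
LOG_LINE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")

SAMPLE = """\
[2024-01-15 03:14:22] INFO Starting application...
[2024-01-15 03:14:22] INFO Connecting to database: postgres.production.svc.cluster.local:5432
[2024-01-15 03:14:52] ERROR Connection timeout after 30000ms
[2024-01-15 03:14:52] FATAL Unable to connect to database, shutting down
[2024-01-15 03:14:52] INFO Application terminated with exit code 1
"""


def relevant_lines(log_text: str,
                   levels: Tuple[str, ...] = ("ERROR", "FATAL")) -> List[Tuple[str, str, str]]:
    """Return (timestamp, level, message) for lines at the given levels."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match and match.group("level") in levels:
            hits.append((match.group("ts"), match.group("level"), match.group("msg")))
    return hits


for ts, level, msg in relevant_lines(SAMPLE):
    print(ts, level, msg)
```

Lines that do not match the assumed layout are skipped rather than raising an error, which keeps the filter usable on mixed output such as stack traces.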
TIMESTAMP  TYPE     REASON   MESSAGE
03:14:52   Warning  BackOff  Back-off restarting failed container
03:14:22   Normal   Pulled   Container image "api:v2.3.1" already present
03:14:22   Normal   Created  Created container api
03:14:22   Normal   Started  Started container api
03:12:05   Warning  BackOff  Back-off restarting failed container
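
A quick way to summarize an event list like this one is to count warnings by reason. The Python sketch below is only an illustration: it parses the whitespace-separated columns shown above (TIMESTAMP, TYPE, REASON, MESSAGE), and `warning_counts` is a hypothetical helper, not an ops0 or kubectl feature.

```python
from collections import Counter

EVENTS = """\
03:14:52 Warning BackOff Back-off restarting failed container
03:14:22 Normal Pulled Container image "api:v2.3.1" already present
03:14:22 Normal Created Created container api
03:14:22 Normal Started Started container api
03:12:05 Warning BackOff Back-off restarting failed container
"""


def warning_counts(events_text: str) -> Counter:
    """Count Warning events per reason; rows with missing columns are skipped."""
    counts = Counter()
    for line in events_text.splitlines():
        # The first three columns are single tokens; the rest is the message.
        parts = line.split(maxsplit=3)
        if len(parts) < 4:
            continue
        _timestamp, event_type, reason, _message = parts
        if event_type == "Warning":
            counts[reason] += 1
    return counts


print(warning_counts(EVENTS))
```

For the events above, the only warning reason is `BackOff`, seen twice, which matches the restart loop described in the incident.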
ops0 shows resources that might be involved:
ops0 provides specific remediation steps. For this example:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: production
spec:
  selector:
    app: postgres
  ports:
    - port: 5432
EOF
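Alternatively, if the service was renamed (for example, to postgresql), update the connection string in the deployment:
kubectl set env deployment/api-server -n production DATABASE_URL=postgresql://postgresql.production.svc.cluster.local:5432/app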
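
Both fixes depend on the hostname in the connection string matching a Service that actually exists. Kubernetes gives each Service a DNS name of the form `<service>.<namespace>.svc.<cluster-domain>`, where the cluster domain defaults to `cluster.local`. The Python sketch below builds that name and a connection URL so the two can be compared; `service_dns_name` and `database_url` are hypothetical helpers, and the `app` database name is taken from the command above.

```python
def service_dns_name(service: str, namespace: str,
                     cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def database_url(service: str, namespace: str,
                 port: int = 5432, database: str = "app") -> str:
    """Build a PostgreSQL connection URL that points at a Service."""
    host = service_dns_name(service, namespace)
    return f"postgresql://{host}:{port}/{database}"


# Hostname the crashing pod was trying to reach.
print(service_dns_name("postgres", "production"))
# Connection string for a Service named "postgresql" instead.
print(database_url("postgresql", "production"))
```

If the first name does not correspond to a Service in the namespace, the application's connection attempts will fail, which is the situation described in the example incident.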
For common fixes, ops0 provides a one-click option:
After applying the fix:
┌─────────────────────────────────────────────────────────────────┐
│ INCIDENT RESOLVED                                               │
├─────────────────────────────────────────────────────────────────┤
│ Duration: 8 minutes                                             │
│ Root Cause: Missing postgres service                            │
│ Resolution: Service restored via ops0 one-click fix             │
│ Resolved by: jane@company.com                                   │
└─────────────────────────────────────────────────────────────────┘

Common causes of CrashLoopBackOff:

| Common Cause | How to Identify | Fix |
|---|---|---|
| App crash on startup | Error in logs immediately after start | Fix application code or config |
| Missing config/secret | "file not found" or "env var not set" | Create the missing ConfigMap/Secret |
| Database connection | Connection timeout/refused | Check database service exists and is running |
| OOM during startup | OOMKilled in events | Increase memory limits |

Common causes of ImagePullBackOff:

| Common Cause | How to Identify | Fix |
|---|---|---|
| Image doesn't exist | "manifest unknown" | Check image tag exists in registry |
| Auth failure | "unauthorized" | Update imagePullSecrets |
| Registry unreachable | "connection refused" | Check network policies, firewall |
| Rate limited | "too many requests" | Wait, or use registry mirror |
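
The "How to Identify" column amounts to matching substrings in the image pull error message. A minimal Python sketch of that lookup follows. The `classify_pull_error` function is hypothetical, the patterns are the ones listed in the table, and the sample message is made up for illustration; real registry errors vary in wording.

```python
from typing import Optional, Tuple

# (substring to look for, common cause, suggested fix), taken from the table.
PULL_ERROR_PATTERNS = [
    ("manifest unknown", "Image doesn't exist", "Check image tag exists in registry"),
    ("unauthorized", "Auth failure", "Update imagePullSecrets"),
    ("connection refused", "Registry unreachable", "Check network policies, firewall"),
    ("too many requests", "Rate limited", "Wait, or use registry mirror"),
]


def classify_pull_error(message: str) -> Optional[Tuple[str, str]]:
    """Return (cause, fix) for the first matching pattern, or None."""
    lowered = message.lower()
    for pattern, cause, fix in PULL_ERROR_PATTERNS:
        if pattern in lowered:
            return cause, fix
    return None


# Illustrative message only, not copied from a real cluster.
print(classify_pull_error('Failed to pull image "api:v2.3.2": manifest unknown'))
```

Returning `None` for unrecognized messages makes it explicit when an error falls outside the four cases in the table and needs manual inspection.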

Common causes of Pending pods:

| Common Cause | How to Identify | Fix |
|---|---|---|
| Insufficient CPU/memory | "Insufficient cpu" in events | Scale cluster or reduce requests |
| No matching nodes | "node selector" mismatch | Update nodeSelector or add nodes |
| PVC pending | "waiting for volume" | Check StorageClass, PV availability |
| Taints not tolerated | "taints not tolerated" | Add tolerations to pod spec |
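
For the "Insufficient cpu" case, it helps to compare the pod's CPU request with what a node has left. Kubernetes expresses CPU either in cores (`2`, `0.5`) or in millicores (`500m`). The Python sketch below converts both forms to millicores and checks whether a request fits. `parse_cpu_millicores` and `fits_on_node` are hypothetical helpers, the numbers are invented for illustration, and the real scheduler considers more than CPU.

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as '500m', '2' or '0.5' to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def fits_on_node(request: str, allocatable: str, already_requested: str) -> bool:
    """Check whether a CPU request fits in the node's remaining allocatable CPU."""
    free = parse_cpu_millicores(allocatable) - parse_cpu_millicores(already_requested)
    return parse_cpu_millicores(request) <= free


# A node with 4 cores allocatable and 3500m already requested has 500m free.
print(fits_on_node("500m", "4", "3500m"))  # request fits exactly
print(fits_on_node("1", "4", "3500m"))     # a full core does not fit
```

Working in integer millicores avoids floating-point comparisons when a request exactly equals the free capacity.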

Common causes of OOMKilled:

| Common Cause | How to Identify | Fix |
|---|---|---|
| Memory leak | Gradual increase before kill | Fix application memory leak |
| Limit too low | Immediate kill on load | Increase memory limit |
| JVM heap issues | "OutOfMemoryError" in logs | Tune JVM heap settings |
Beyond automatic analysis, you can ask questions in natural language:
"Looking at the memory profile, the api-server container has grown from 256MB at startup to 1.8GB over the past 6 hours. The logs show repeated 'cache entry added' messages without corresponding evictions. This suggests the application's in-memory cache is growing unbounded.
Recommendation: Either set a max cache size in the application config, or add a cache TTL to expire old entries."
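
The reasoning in that answer can be checked with simple arithmetic. The Python sketch below estimates the growth rate from the two figures quoted (256MB at startup and 1.8GB six hours later, treated here as 1800MB) and extrapolates linearly to a memory limit. The 2048MB limit and both helper functions are assumptions for illustration: the example does not state the container's actual limit, and real memory growth is rarely perfectly linear.

```python
def growth_rate_mb_per_hour(start_mb: float, current_mb: float,
                            elapsed_hours: float) -> float:
    """Average memory growth over the observed window."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return (current_mb - start_mb) / elapsed_hours


def hours_until_limit(current_mb: float, limit_mb: float,
                      rate_mb_per_hour: float) -> float:
    """Linear estimate of time until the limit is reached; inf if not growing."""
    if rate_mb_per_hour <= 0:
        return float("inf")
    return max(0.0, (limit_mb - current_mb) / rate_mb_per_hour)


rate = growth_rate_mb_per_hour(start_mb=256, current_mb=1800, elapsed_hours=6)
remaining = hours_until_limit(current_mb=1800, limit_mb=2048, rate_mb_per_hour=rate)
print(f"growth: {rate:.1f} MB/hour, time to assumed 2048MB limit: {remaining:.2f} hours")
```

Under these assumptions the container grows by roughly 257MB per hour and would reach the assumed limit in just under an hour, which is why an unbounded cache like the one described tends to end in an OOMKilled restart.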
Get notified before issues impact customers: