# Debug Kubernetes Issues
Use ops0's AI-assisted troubleshooting to quickly diagnose and resolve pod failures, resource issues, and cluster problems.
## Scenario
Your team gets paged at 3 AM because pods are crashing. You need to:
- Quickly identify what's wrong
- Find the root cause without digging through multiple tools
- Get actionable fixes, not just error messages
- Resolve the issue before it impacts customers
This guide shows how ops0 accelerates Kubernetes debugging.
## Prerequisites

- An ops0 workspace with the Hive agent deployed in the cluster you want to debug
- `kubectl` access to the cluster, for applying fixes manually
## Understanding Automatic Incident Detection
ops0's Hive agent continuously monitors your clusters and automatically detects issues:
| Incident Type | Detection Trigger |
|---|---|
| CrashLoopBackOff | Pod restart count > 3 within 10 minutes |
| OOMKilled | Container terminated due to memory limit |
| ImagePullBackOff | Failed to pull container image after 3 attempts |
| Pending | Pod stuck in Pending for > 5 minutes |
| Failed | Pod entered Failed state |
| High CPU | Container CPU > 90% for > 5 minutes |
| High Memory | Container memory > 85% of limit |
When detected, incidents appear in your ops0 dashboard with AI analysis already started.
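Outside ops0, you can approximate the restart-count trigger from the table by hand. Here is a minimal sketch that flags pods whose `RESTARTS` column exceeds 3; the `kubectl get pods` output is inlined (with illustrative pod names) so it runs without a cluster:

```shell
# Sample output in the shape of `kubectl get pods -n production`,
# inlined so the sketch is runnable without cluster access.
cat <<'EOF' > /tmp/pods.txt
NAME                         READY   STATUS             RESTARTS   AGE
api-server-7d9f8b6c4-x2k9m   0/1     CrashLoopBackOff   5          3m
worker-6b7f9d5c8-p4q2r       1/1     Running            0          2d
EOF

# Flag pods over the CrashLoopBackOff restart threshold (> 3 restarts).
awk 'NR > 1 && $4 > 3 { print $1 }' /tmp/pods.txt
# → api-server-7d9f8b6c4-x2k9m
```

Note this only checks the lifetime restart count; ops0's trigger also scopes it to a 10-minute window, which needs restart timestamps rather than a single snapshot.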
## Step 1: View the Incident

### Incident Overview
```text
┌─────────────────────────────────────────────────────────────────┐
│ INCIDENT: CrashLoopBackOff                                      │
│ Pod: api-server-7d9f8b6c4-x2k9m                                 │
│ Namespace: production                                           │
│ Started: 3 minutes ago                                          │
│ Restarts: 5                                                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ AI Analysis                                                     │
│ ─────────────────────────────────────────────────────────────── │
│ The pod is crashing because it cannot connect to the database   │
│ at postgres.production.svc.cluster.local:5432. The connection   │
│ is timing out after 30 seconds.                                 │
│                                                                 │
│ Likely cause: The postgres service was deleted or renamed       │
│ in the last deployment (15 minutes ago).                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
## Step 2: Review AI Analysis
ops0 automatically analyzes multiple data sources to identify the root cause:
### What AI Examines

- Container logs
- Kubernetes events for the pod and its controller
- Related resources and recent deployments
### AI Summary Sections
| Section | What It Contains |
|---|---|
| Root Cause | Primary reason for the failure |
| Evidence | Specific log lines, events, or configs that led to this conclusion |
| Impact | What's affected (other pods, services, endpoints) |
| Suggested Fixes | Actionable remediation steps |
## Step 3: Explore the Evidence
Click on any evidence item to see the full context:
### Logs Tab

```text
[2024-01-15 03:14:22] INFO Starting application...
[2024-01-15 03:14:22] INFO Connecting to database: postgres.production.svc.cluster.local:5432
[2024-01-15 03:14:52] ERROR Connection timeout after 30000ms
[2024-01-15 03:14:52] FATAL Unable to connect to database, shutting down
[2024-01-15 03:14:52] INFO Application terminated with exit code 1
```

ops0 highlights the relevant lines and lets you search and filter.
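The same first-pass triage can be done by hand by piping pod logs through `grep`; a sketch using the log excerpt above, inlined so it runs without a cluster:

```shell
# Log excerpt from the incident above, inlined for a runnable example.
cat <<'EOF' > /tmp/api.log
[2024-01-15 03:14:22] INFO Starting application...
[2024-01-15 03:14:22] INFO Connecting to database: postgres.production.svc.cluster.local:5432
[2024-01-15 03:14:52] ERROR Connection timeout after 30000ms
[2024-01-15 03:14:52] FATAL Unable to connect to database, shutting down
[2024-01-15 03:14:52] INFO Application terminated with exit code 1
EOF

# With cluster access, the equivalent would be:
#   kubectl logs api-server-7d9f8b6c4-x2k9m -n production --previous | grep -E 'ERROR|FATAL'
grep -E 'ERROR|FATAL' /tmp/api.log
```

`--previous` matters for crash-looping pods: it reads the log of the container run that actually crashed, not the fresh restart.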
### Events Tab

```text
TIMESTAMP  TYPE     REASON   MESSAGE
03:14:52   Warning  BackOff  Back-off restarting failed container
03:14:22   Normal   Pulled   Container image "api:v2.3.1" already present
03:14:22   Normal   Created  Created container api
03:14:22   Normal   Started  Started container api
03:12:05   Warning  BackOff  Back-off restarting failed container
```
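Warning-type events usually carry the signal. With cluster access the filter is `kubectl get events -n production --field-selector type=Warning`; a local sketch over the event rows above:

```shell
# Event rows from the incident above (timestamp, type, reason, message),
# inlined so the filter is runnable without a cluster.
cat <<'EOF' > /tmp/events.txt
03:14:52 Warning BackOff Back-off restarting failed container
03:14:22 Normal Pulled Container image "api:v2.3.1" already present
03:14:22 Normal Created Created container api
03:14:22 Normal Started Started container api
03:12:05 Warning BackOff Back-off restarting failed container
EOF

# Keep only Warning-type events (second column).
awk '$2 == "Warning"' /tmp/events.txt
```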
### Related Resources

ops0 also surfaces resources that might be involved, such as the Services, Deployments, and ConfigMaps the pod references.
## Step 4: Apply the Fix
ops0 provides specific remediation steps. For this example there are two options, depending on whether the `postgres` service was deleted or renamed.

### Suggested Fix

If the service was deleted, recreate it:

```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: production
spec:
  selector:
    app: postgres
  ports:
  - port: 5432
EOF
```

If the service was renamed (for example, to `postgresql`), point the application at the new name instead:

```bash
kubectl set env deployment/api-server -n production \
  DATABASE_URL=postgresql://postgresql.production.svc.cluster.local:5432/app
```
### One-Click Apply

For common fixes, ops0 provides a one-click apply option.
## Step 5: Verify Resolution

After applying the fix, the incident view updates:

### Resolution Summary
```text
┌─────────────────────────────────────────────────────────────────┐
│                        INCIDENT RESOLVED                        │
├─────────────────────────────────────────────────────────────────┤
│ Duration: 8 minutes                                             │
│ Root Cause: Missing postgres service                            │
│ Resolution: Service restored via ops0 one-click fix             │
│ Resolved by: jane@company.com                                   │
└─────────────────────────────────────────────────────────────────┘
```
## Common Incident Types

### CrashLoopBackOff
| Common Cause | How to Identify | Fix |
|---|---|---|
| App crash on startup | Error in logs immediately after start | Fix application code or config |
| Missing config/secret | "file not found" or "env var not set" | Create the missing ConfigMap/Secret |
| Database connection | Connection timeout/refused | Check database service exists and is running |
| OOM during startup | OOMKilled in events | Increase memory limits |
### ImagePullBackOff
| Common Cause | How to Identify | Fix |
|---|---|---|
| Image doesn't exist | "manifest unknown" | Check image tag exists in registry |
| Auth failure | "unauthorized" | Update imagePullSecrets |
| Registry unreachable | "connection refused" | Check network policies, firewall |
| Rate limited | "too many requests" | Wait, or use registry mirror |
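The error strings in this table are distinctive enough to script a first-pass triage. A hypothetical helper (the function name is illustrative, not an ops0 API; the mapping follows the table above, using messages as they appear in `kubectl describe pod` output):

```shell
# Map a pull-error message to the likely fix from the table above.
# Illustrative helper only; message substrings mirror common registry errors.
diagnose_pull_error() {
  case "$1" in
    *"manifest unknown"*)   echo "check that the image tag exists in the registry" ;;
    *"unauthorized"*)       echo "update imagePullSecrets" ;;
    *"connection refused"*) echo "check network policies and firewall rules" ;;
    *"too many requests"*)  echo "rate limited: wait or use a registry mirror" ;;
    *)                      echo "unrecognized: inspect pod events" ;;
  esac
}

diagnose_pull_error "Failed to pull image api:v9.9.9: manifest unknown"
# → check that the image tag exists in the registry
```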
### Pending Pods
| Common Cause | How to Identify | Fix |
|---|---|---|
| Insufficient CPU/memory | "Insufficient cpu" in events | Scale cluster or reduce requests |
| No matching nodes | "node selector" mismatch | Update nodeSelector or add nodes |
| PVC pending | "waiting for volume" | Check StorageClass, PV availability |
| Taints not tolerated | "taints not tolerated" | Add tolerations to pod spec |
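For the taints case, the fix is a toleration in the pod spec. A hypothetical fragment matching a `dedicated=gpu:NoSchedule` taint (the key and value are illustrative; match them to what `kubectl describe node` reports):

```yaml
# Pod spec fragment: tolerate a node taint so the scheduler can place the pod.
# The taint key/value here are examples, not values from this incident.
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```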
### OOMKilled
| Common Cause | How to Identify | Fix |
|---|---|---|
| Memory leak | Gradual increase before kill | Fix application memory leak |
| Limit too low | Immediate kill on load | Increase memory limit |
| JVM heap issues | "OutOfMemoryError" in logs | Tune JVM heap settings |
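For the "limit too low" case, raising the memory limit is a small change in the container spec. The values below are illustrative only; size them to the workload's observed usage rather than copying them:

```yaml
# Container spec fragment: example request/limit values, tune per workload.
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"
```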
## Ask AI for Help

Beyond the automatic analysis, you can ask questions in natural language. For example, a question about rising memory usage on the api-server might return an answer like:

> Looking at the memory profile, the api-server container has grown from 256MB at startup to 1.8GB over the past 6 hours. The logs show repeated 'cache entry added' messages without corresponding evictions. This suggests the application's in-memory cache is growing unbounded.
>
> Recommendation: Either set a max cache size in the application config, or add a cache TTL to expire old entries.
### Example Questions
- "What changed in this namespace in the last hour?"
- "Why are requests to this service timing out?"
- "Which pods are using the most CPU in the cluster?"
- "Show me all pods that restarted today"
- "What's the difference between this deployment and the previous one?"
## Set Up Alerts
Get notified before issues impact customers: