
Debug Kubernetes Issues

Use ops0's AI-assisted troubleshooting to quickly diagnose and resolve pod failures, resource issues, and cluster problems.


Scenario

Your team gets paged at 3 AM because pods are crashing. You need to:

  • Quickly identify what's wrong
  • Find the root cause without digging through multiple tools
  • Get actionable fixes, not just error messages
  • Resolve the issue before it impacts customers

This guide shows how ops0 accelerates Kubernetes debugging.


Prerequisites

  • A Kubernetes cluster connected with the Hive agent (see the Setup guide)

Understanding Automatic Incident Detection

ops0's Hive agent continuously monitors your clusters and automatically detects issues:

Incident Type      Detection Trigger
CrashLoopBackOff   Pod restart count > 3 within 10 minutes
OOMKilled          Container terminated due to memory limit
ImagePullBackOff   Failed to pull container image after 3 attempts
Pending            Pod stuck in Pending for > 5 minutes
Failed             Pod entered Failed state
High CPU           Container CPU > 90% for > 5 minutes
High Memory        Container memory > 85% of limit

When detected, incidents appear in your ops0 dashboard with AI analysis already started.
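These triggers are simple threshold checks. A minimal sketch of the CrashLoopBackOff rule (restart count > 3) is shown below, run against sample data so it works without a cluster; on a live cluster the per-pod restart counts would come from `kubectl get pods`:

```shell
# Sample restart counts per pod (tab-separated); in a live cluster these
# come from `kubectl get pods -n production` output
printf 'api-server-7d9f8b6c4-x2k9m\t5\nworker-6c8d9f-abcde\t1\n' > /tmp/pods.tsv

# Flag pods whose restart count exceeds the threshold of 3
awk -F'\t' '$2 > 3 {print $1}' /tmp/pods.tsv
# → api-server-7d9f8b6c4-x2k9m
```

The agent additionally scopes the count to a 10-minute window; the sketch shows only the count threshold.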


Step 1: View the Incident

1. Click Kubernetes in the sidebar
2. Select your cluster
3. Click the Incidents tab
4. Click on the incident to open details

Incident Overview

┌─────────────────────────────────────────────────────────────────┐
│ INCIDENT: CrashLoopBackOff                                      │
│ Pod: api-server-7d9f8b6c4-x2k9m                                 │
│ Namespace: production                                           │
│ Started: 3 minutes ago                                          │
│ Restarts: 5                                                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ AI Analysis                                                     │
│ ─────────────────────────────────────────────────────────────── │
│ The pod is crashing because it cannot connect to the database   │
│ at postgres.production.svc.cluster.local:5432. The connection   │
│ is timing out after 30 seconds.                                 │
│                                                                 │
│ Likely cause: The postgres service was deleted or renamed       │
│ in the last deployment (15 minutes ago).                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Step 2: Review AI Analysis

ops0 automatically analyzes multiple data sources to identify the root cause:

What AI Examines

Pod Logs
Last 1000 lines, searching for errors, exceptions, and connection failures.
Events
Kubernetes events for the pod and related resources.
Resource Config
Deployment, ConfigMaps, Secrets referenced by the pod.
Recent Changes
Deployments, config changes, and image updates in the last 24 hours.

AI Summary Sections

Section           What It Contains
Root Cause        Primary reason for the failure
Evidence          Specific log lines, events, or configs that led to this conclusion
Impact            What's affected (other pods, services, endpoints)
Suggested Fixes   Actionable remediation steps

Step 3: Explore the Evidence

Click on any evidence item to see the full context:

Logs Tab

[2024-01-15 03:14:22] INFO  Starting application...
[2024-01-15 03:14:22] INFO  Connecting to database: postgres.production.svc.cluster.local:5432
[2024-01-15 03:14:52] ERROR Connection timeout after 30000ms
[2024-01-15 03:14:52] FATAL Unable to connect to database, shutting down
[2024-01-15 03:14:52] INFO  Application terminated with exit code 1

ops0 highlights the relevant lines and lets you search/filter.
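The same filtering can be reproduced from the command line. A small sketch using the log excerpt above as input (on a real cluster the heredoc would be replaced by `kubectl logs` with `--tail=1000`):

```shell
# Recreate the excerpt above; on a real cluster:
#   kubectl logs api-server-7d9f8b6c4-x2k9m -n production --tail=1000 > /tmp/pod.log
cat > /tmp/pod.log <<'EOF'
[2024-01-15 03:14:22] INFO  Starting application...
[2024-01-15 03:14:52] ERROR Connection timeout after 30000ms
[2024-01-15 03:14:52] FATAL Unable to connect to database, shutting down
EOF

# Keep only error-level lines, as the highlighted view does
grep -E '\] (ERROR|FATAL) ' /tmp/pod.log
```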

Events Tab

TIMESTAMP            TYPE      REASON              MESSAGE
03:14:52             Warning   BackOff             Back-off restarting failed container
03:14:22             Normal    Pulled              Container image "api:v2.3.1" already present
03:14:22             Normal    Created             Created container api
03:14:22             Normal    Started             Started container api
03:12:05             Warning   BackOff             Back-off restarting failed container

Related Resources

ops0 shows resources that might be involved:

Service: postgres (Not Found)
Referenced by api-server but doesn't exist in namespace
ConfigMap: api-config
Last modified: 15 minutes ago by john@company.com
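This evidence can be double-checked outside ops0 with kubectl (names are from this guide's example; assumes cluster access):

```shell
# A deleted service reports NotFound
kubectl get svc postgres -n production

# Inspect the recently modified ConfigMap
kubectl describe configmap api-config -n production
```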

Step 4: Apply the Fix

ops0 provides specific remediation steps. For this example:

Suggested Fix

Option 1: Restore the postgres service
The postgres service was deleted. Re-create it to restore connectivity.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: production
spec:
  selector:
    app: postgres
  ports:
    - port: 5432
EOF

Option 2: Update DATABASE_URL to new service name
If the database was renamed to 'postgresql', update the environment variable.

kubectl set env deployment/api-server -n production DATABASE_URL=postgresql://postgresql.production.svc.cluster.local:5432/app

One-Click Apply

For common fixes, ops0 provides a one-click option:

1. Click Apply Fix next to the suggested remediation
2. Review the changes that will be applied
3. Confirm to execute

Step 5: Verify Resolution

After applying the fix:

1. Watch the pod status in ops0 (auto-refreshes)
2. Confirm the pod enters the Running state
3. The incident automatically resolves when the pod is stable for 5 minutes
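The same check can be made from the command line (deployment name from this example; assumes cluster access):

```shell
# Watch pod status until it settles in Running
kubectl get pods -n production -w

# Or wait on the deployment as a whole
kubectl rollout status deployment/api-server -n production
```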

Resolution Summary

┌─────────────────────────────────────────────────────────────────┐
│ INCIDENT RESOLVED                                               │
├─────────────────────────────────────────────────────────────────┤
│ Duration: 8 minutes                                             │
│ Root Cause: Missing postgres service                            │
│ Resolution: Service restored via ops0 one-click fix             │
│ Resolved by: jane@company.com                                   │
└─────────────────────────────────────────────────────────────────┘

Common Incident Types

CrashLoopBackOff

Common Cause            How to Identify                          Fix
App crash on startup    Error in logs immediately after start    Fix application code or config
Missing config/secret   "file not found" or "env var not set"    Create the missing ConfigMap/Secret
Database connection     Connection timeout/refused               Check database service exists and is running
OOM during startup      OOMKilled in events                      Increase memory limits
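For manual triage, two kubectl commands cover most of these causes (pod name from this guide's example; assumes cluster access):

```shell
# Logs from the previous, crashed container instance --
# the current instance may not have failed yet
kubectl logs api-server-7d9f8b6c4-x2k9m -n production --previous

# Events, last state, and exit code (137 typically indicates OOMKilled)
kubectl describe pod api-server-7d9f8b6c4-x2k9m -n production
```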

ImagePullBackOff

Common Cause           How to Identify         Fix
Image doesn't exist    "manifest unknown"      Check image tag exists in registry
Auth failure           "unauthorized"          Update imagePullSecrets
Registry unreachable   "connection refused"    Check network policies, firewall
Rate limited           "too many requests"     Wait, or use registry mirror
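For the auth-failure case, a registry credential can be created and attached as an imagePullSecret. A sketch (registry host, secret name, and credentials are placeholders; assumes cluster access):

```shell
# Create a docker-registry secret with the registry credentials
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=<user> \
  --docker-password=<password> \
  -n production

# Reference it from the deployment's pod spec
kubectl patch deployment api-server -n production --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/imagePullSecrets","value":[{"name":"regcred"}]}]'
```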

Pending Pods

Common Cause              How to Identify                Fix
Insufficient CPU/memory   "Insufficient cpu" in events   Scale cluster or reduce requests
No matching nodes         "node selector" mismatch       Update nodeSelector or add nodes
PVC pending               "waiting for volume"           Check StorageClass, PV availability
Taints not tolerated      "taints not tolerated"         Add tolerations to pod spec
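For the taint case, a toleration goes under the Deployment's pod template. A manifest fragment, assuming a hypothetical `dedicated=batch:NoSchedule` taint:

```yaml
# Fragment of a Deployment manifest; the taint key/value are illustrative
spec:
  template:
    spec:
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "batch"
          effect: "NoSchedule"
```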

OOMKilled

Common Cause      How to Identify                Fix
Memory leak       Gradual increase before kill   Fix application memory leak
Limit too low     Immediate kill on load         Increase memory limit
JVM heap issues   "OutOfMemoryError" in logs     Tune JVM heap settings

Ask AI for Help

Beyond automatic analysis, you can ask questions in natural language:

You:
"Why is this pod using so much memory?"
AI:

"Looking at the memory profile, the api-server container has grown from 256MB at startup to 1.8GB over the past 6 hours. The logs show repeated 'cache entry added' messages without corresponding evictions. This suggests the application's in-memory cache is growing unbounded.

Recommendation: Either set a max cache size in the application config, or add a cache TTL to expire old entries."

Example Questions

  • "What changed in this namespace in the last hour?"
  • "Why are requests to this service timing out?"
  • "Which pods are using the most CPU in the cluster?"
  • "Show me all pods that restarted today"
  • "What's the difference between this deployment and the previous one?"

Set Up Alerts

Get notified before issues impact customers:

1. Go to Settings > Integrations > Slack (or PagerDuty)
2. Configure a channel for Kubernetes alerts
3. Set severity thresholds (e.g., only alert on production)

Next Steps