# Debug Kubernetes Issues
Use ops0's AI-assisted troubleshooting to quickly diagnose and resolve pod failures, resource issues, and cluster problems.
## Scenario
Your team gets paged at 3 AM because pods are crashing. You need to:
- Quickly identify what's wrong
- Find the root cause without digging through multiple tools
- Get actionable fixes, not just error messages
- Resolve the issue before it impacts customers
This guide shows how ops0 accelerates Kubernetes debugging.
## Prerequisites

- An ops0 workspace with the Hive agent deployed in the cluster you want to debug
- `kubectl` access to the cluster, for applying fixes manually
## Understanding Automatic Incident Detection
ops0's Hive agent continuously monitors your clusters and automatically detects issues:
| Incident Type | Detection Trigger |
|---|---|
| CrashLoopBackOff | Pod restart count > 3 within 10 minutes |
| OOMKilled | Container terminated due to memory limit |
| ImagePullBackOff | Failed to pull container image after 3 attempts |
| Pending | Pod stuck in Pending for > 5 minutes |
| Failed | Pod entered Failed state |
| High CPU | Container CPU > 90% for > 5 minutes |
| High Memory | Container memory > 85% of limit |
When detected, incidents appear in your ops0 dashboard with AI analysis already started.
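Outside ops0, you can approximate the restart-count trigger from the table by hand. Here is a minimal sketch that flags pods whose `RESTARTS` column exceeds 3; the `kubectl get pods` output is inlined (with illustrative pod names) so it runs without a cluster:

```shell
# Sample output in the shape of `kubectl get pods -n production`,
# inlined so the sketch is runnable without cluster access.
cat <<'EOF' > /tmp/pods.txt
NAME                         READY   STATUS             RESTARTS   AGE
api-server-7d9f8b6c4-x2k9m   0/1     CrashLoopBackOff   5          3m
worker-6b7f9d5c8-p4q2r       1/1     Running            0          2d
EOF

# Flag pods over the CrashLoopBackOff restart threshold (> 3 restarts).
awk 'NR > 1 && $4 > 3 { print $1 }' /tmp/pods.txt
# → api-server-7d9f8b6c4-x2k9m
```

Note this only checks the lifetime restart count; ops0's trigger also scopes it to a 10-minute window, which needs restart timestamps rather than a single snapshot.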
## Step 1: View the Incident

### Incident Overview
```text
┌─────────────────────────────────────────────────────────────────┐
│ INCIDENT: CrashLoopBackOff                                      │
│ Pod: api-server-7d9f8b6c4-x2k9m                                 │
│ Namespace: production                                           │
│ Started: 3 minutes ago                                          │
│ Restarts: 5                                                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ AI Analysis                                                     │
│ ─────────────────────────────────────────────────────────────── │
│ The pod is crashing because it cannot connect to the database   │
│ at postgres.production.svc.cluster.local:5432. The connection   │
│ is timing out after 30 seconds.                                 │
│                                                                 │
│ Likely cause: The postgres service was deleted or renamed       │
│ in the last deployment (15 minutes ago).                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
## Step 2: Review AI Analysis
ops0 automatically analyzes multiple data sources to identify the root cause:
### What AI Examines

- Container logs
- Kubernetes events for the pod and its controller
- Related resources and recent deployments
### AI Summary Sections
| Section | What It Contains |
|---|---|
| Root Cause | Primary reason for the failure |
| Evidence | Specific log lines, events, or configs that led to this conclusion |
| Impact | What's affected (other pods, services, endpoints) |
| Suggested Fixes | Actionable remediation steps |
## Step 3: Explore the Evidence
Click on any evidence item to see the full context:
### Logs Tab

```text
[2024-01-15 03:14:22] INFO Starting application...
[2024-01-15 03:14:22] INFO Connecting to database: postgres.production.svc.cluster.local:5432
[2024-01-15 03:14:52] ERROR Connection timeout after 30000ms
[2024-01-15 03:14:52] FATAL Unable to connect to database, shutting down
[2024-01-15 03:14:52] INFO Application terminated with exit code 1
```

ops0 highlights the relevant lines and lets you search and filter.
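The same first-pass triage can be done by hand by piping pod logs through `grep`; a sketch using the log excerpt above, inlined so it runs without a cluster:

```shell
# Log excerpt from the incident above, inlined for a runnable example.
cat <<'EOF' > /tmp/api.log
[2024-01-15 03:14:22] INFO Starting application...
[2024-01-15 03:14:22] INFO Connecting to database: postgres.production.svc.cluster.local:5432
[2024-01-15 03:14:52] ERROR Connection timeout after 30000ms
[2024-01-15 03:14:52] FATAL Unable to connect to database, shutting down
[2024-01-15 03:14:52] INFO Application terminated with exit code 1
EOF

# With cluster access, the equivalent would be:
#   kubectl logs api-server-7d9f8b6c4-x2k9m -n production --previous | grep -E 'ERROR|FATAL'
grep -E 'ERROR|FATAL' /tmp/api.log
```

`--previous` matters for crash-looping pods: it reads the log of the container run that actually crashed, not the fresh restart.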
### Events Tab

```text
TIMESTAMP  TYPE     REASON   MESSAGE
03:14:52   Warning  BackOff  Back-off restarting failed container
03:14:22   Normal   Pulled   Container image "api:v2.3.1" already present
03:14:22   Normal   Created  Created container api
03:14:22   Normal   Started  Started container api
03:12:05   Warning  BackOff  Back-off restarting failed container
```
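Warning-type events usually carry the signal. With cluster access the filter is `kubectl get events -n production --field-selector type=Warning`; a local sketch over the event rows above:

```shell
# Event rows from the incident above (timestamp, type, reason, message),
# inlined so the filter is runnable without a cluster.
cat <<'EOF' > /tmp/events.txt
03:14:52 Warning BackOff Back-off restarting failed container
03:14:22 Normal Pulled Container image "api:v2.3.1" already present
03:14:22 Normal Created Created container api
03:14:22 Normal Started Started container api
03:12:05 Warning BackOff Back-off restarting failed container
EOF

# Keep only Warning-type events (second column).
awk '$2 == "Warning"' /tmp/events.txt
```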
### Related Resources

ops0 also surfaces resources that might be involved, such as the Services, Deployments, and ConfigMaps the pod references.
## Step 4: Apply the Fix
ops0 provides specific remediation steps. For this example there are two options, depending on whether the `postgres` service was deleted or renamed.

### Suggested Fix

If the service was deleted, recreate it:

```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: production
spec:
  selector:
    app: postgres
  ports:
  - port: 5432
EOF
```

If the service was renamed (for example, to `postgresql`), point the application at the new name instead:

```bash
kubectl set env deployment/api-server -n production \
  DATABASE_URL=postgresql://postgresql.production.svc.cluster.local:5432/app
```
### One-Click Apply

For common fixes, ops0 provides a one-click apply option.
## Step 5: Verify Resolution

After applying the fix, the incident view updates:

### Resolution Summary
```text
┌─────────────────────────────────────────────────────────────────┐
│                        INCIDENT RESOLVED                        │
├─────────────────────────────────────────────────────────────────┤
│ Duration: 8 minutes                                             │
│ Root Cause: Missing postgres service                            │
│ Resolution: Service restored via ops0 one-click fix             │
│ Resolved by: jane@company.com                                   │
└─────────────────────────────────────────────────────────────────┘
```
## Common Incident Types

### CrashLoopBackOff
| Common Cause | How to Identify | Fix |
|---|---|---|
| App crash on startup | Error in logs immediately after start | Fix application code or config |
| Missing config/secret | "file not found" or "env var not set" | Create the missing ConfigMap/Secret |
| Database connection | Connection timeout/refused | Check database service exists and is running |
| OOM during startup | OOMKilled in events | Increase memory limits |
### ImagePullBackOff
| Common Cause | How to Identify | Fix |
|---|---|---|
| Image doesn't exist | "manifest unknown" | Check image tag exists in registry |
| Auth failure | "unauthorized" | Update imagePullSecrets |
| Registry unreachable | "connection refused" | Check network policies, firewall |
| Rate limited | "too many requests" | Wait, or use registry mirror |
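The error strings in this table are distinctive enough to script a first-pass triage. A hypothetical helper (the function name is illustrative, not an ops0 API; the mapping follows the table above, using messages as they appear in `kubectl describe pod` output):

```shell
# Map a pull-error message to the likely fix from the table above.
# Illustrative helper only; message substrings mirror common registry errors.
diagnose_pull_error() {
  case "$1" in
    *"manifest unknown"*)   echo "check that the image tag exists in the registry" ;;
    *"unauthorized"*)       echo "update imagePullSecrets" ;;
    *"connection refused"*) echo "check network policies and firewall rules" ;;
    *"too many requests"*)  echo "rate limited: wait or use a registry mirror" ;;
    *)                      echo "unrecognized: inspect pod events" ;;
  esac
}

diagnose_pull_error "Failed to pull image api:v9.9.9: manifest unknown"
# → check that the image tag exists in the registry
```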
### Pending Pods
| Common Cause | How to Identify | Fix |
|---|---|---|
| Insufficient CPU/memory | "Insufficient cpu" in events | Scale cluster or reduce requests |
| No matching nodes | "node selector" mismatch | Update nodeSelector or add nodes |
| PVC pending | "waiting for volume" | Check StorageClass, PV availability |
| Taints not tolerated | "taints not tolerated" | Add tolerations to pod spec |
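For the taints case, the fix is a toleration in the pod spec. A hypothetical fragment matching a `dedicated=gpu:NoSchedule` taint (the key and value are illustrative; match them to what `kubectl describe node` reports):

```yaml
# Pod spec fragment: tolerate a node taint so the scheduler can place the pod.
# The taint key/value here are examples, not values from this incident.
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```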
### OOMKilled
| Common Cause | How to Identify | Fix |
|---|---|---|
| Memory leak | Gradual increase before kill | Fix application memory leak |
| Limit too low | Immediate kill on load | Increase memory limit |
| JVM heap issues | "OutOfMemoryError" in logs | Tune JVM heap settings |
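For the "limit too low" case, raising the memory limit is a small change in the container spec. The values below are illustrative only; size them to the workload's observed usage rather than copying them:

```yaml
# Container spec fragment: example request/limit values, tune per workload.
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"
```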
## Ask AI for Help

Beyond the automatic analysis, you can ask questions in natural language. For example, a question about rising memory usage on the api-server might return an answer like:

> Looking at the memory profile, the api-server container has grown from 256MB at startup to 1.8GB over the past 6 hours. The logs show repeated 'cache entry added' messages without corresponding evictions. This suggests the application's in-memory cache is growing unbounded.
>
> Recommendation: Either set a max cache size in the application config, or add a cache TTL to expire old entries.
### Example Questions
- "What changed in this namespace in the last hour?"
- "Why are requests to this service timing out?"
- "Which pods are using the most CPU in the cluster?"
- "Show me all pods that restarted today"
- "What's the difference between this deployment and the previous one?"
## Set Up Alerts
Get notified before issues impact customers: