Kubernetes

Monitor, manage, and troubleshoot your Kubernetes clusters with AI-powered insights.

What is Kubernetes in ops0?

ops0 provides a unified view of all your Kubernetes clusters with real-time monitoring, incident detection, and AI-assisted troubleshooting. Connect clusters via the ops0 agent and get instant visibility into workloads, resources, and issues.

Visibility

Keep all clusters, workloads, and health signals in one operator view.

Detection

Surface incidents, rollouts, and resource pressure before they turn into outages.

Resolution

Move from symptom to root cause with logs, events, dependencies, and AI guidance.

Key Features

Cluster Dashboard

Real-time view of cluster health, node status, resource usage, and running workloads.

Incident Detection

Automatic detection of pod crashes, resource pressure, failed deployments, and misconfigurations.

Resource Graph

Visual dependency graph with incident severity overlays (P1/P2/P3).

AI Troubleshooting

Root cause analysis and remediation suggestions with workload context attached.

Deploy to Cluster

Deploy Helm charts and manifests with per-file configuration and planning.

Cost Analysis

Per-namespace cost breakdown with CPU, memory, GPU, PV, and network costs.

Resource Graph

The Kubernetes resource graph provides a visual dependency map for a cluster:

Navigate to a cluster and click Resource Graph
See deployments, services, pods, configmaps, secrets, and their connections
Incidents are overlaid on the graph with severity indicators (P1 Critical, P2 High, P3 Medium)
Click any node to view resource details and drill into logs or events
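The idea behind a dependency graph can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how ops0 builds its graph: the `owner` and `uses` fields, the `api-gateway-config` ConfigMap name, and the helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical, simplified records. "owner" and "uses" are illustrative
# fields, not Kubernetes API fields or an ops0 data model.
RESOURCES = [
    {"kind": "Deployment", "name": "api-gateway"},
    {
        "kind": "Pod",
        "name": "api-gateway-7d9f8c6b4d-2xkjp",
        "owner": ("Deployment", "api-gateway"),
        "uses": [("ConfigMap", "api-gateway-config"), ("Secret", "api-gateway-secrets")],
    },
    {"kind": "Service", "name": "api-gateway", "uses": [("Pod", "api-gateway-7d9f8c6b4d-2xkjp")]},
    {"kind": "ConfigMap", "name": "api-gateway-config"},
    {"kind": "Secret", "name": "api-gateway-secrets"},
]


def build_edges(resources):
    """Map each (kind, name) node to the set of nodes it depends on."""
    edges = defaultdict(set)
    for res in resources:
        node = (res["kind"], res["name"])
        edges[node].update(res.get("uses", []))
        if "owner" in res:
            # An owner depends on the objects it manages.
            edges[res["owner"]].add(node)
    return edges


def impacted_by(edges, target):
    """Return every node that directly or transitively depends on target."""
    dependents = defaultdict(set)
    for node, deps in edges.items():
        for dep in deps:
            dependents[dep].add(node)
    seen, stack = set(), [target]
    while stack:
        for parent in dependents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


EDGES = build_edges(RESOURCES)
IMPACTED = impacted_by(EDGES, ("Secret", "api-gateway-secrets"))
```

Walking the edges in reverse from a broken Secret gives the set of resources worth highlighting, which is the kind of question a graph view answers visually.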

Deploy to Cluster

Deploy Helm charts or Kubernetes manifests directly from ops0:

Feature	Description
Per-file configuration	Configure each manifest or Helm values file individually
Helm support	Deploy and manage Helm releases
kubectl apply	Apply raw manifests to the cluster
Deployment planning	Preview changes before applying
Outputs	View deployment outputs and applied resource status
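To make the planning step concrete, here is a minimal sketch of what a change preview could compute. It assumes manifests are already parsed into dictionaries and keyed by kind, namespace, and name; the `plan` function and the sample manifests are hypothetical and do not describe ops0's actual planner.

```python
def plan(current, desired):
    """Classify each desired manifest as create, update, or unchanged.

    Both arguments map (kind, namespace, name) to a parsed manifest dict.
    Resources present only in `current` are ignored here, mirroring a plain
    apply that does not prune.
    """
    actions = {}
    for key, manifest in desired.items():
        if key not in current:
            actions[key] = "create"
        elif current[key] != manifest:
            actions[key] = "update"
        else:
            actions[key] = "unchanged"
    return actions


# Hypothetical cluster state and desired state.
CURRENT = {
    ("Deployment", "api-gateway", "api-gateway"): {"spec": {"replicas": 3}},
    ("Service", "api-gateway", "api-gateway"): {"spec": {"port": 80}},
}
DESIRED = {
    ("Deployment", "api-gateway", "api-gateway"): {"spec": {"replicas": 5}},
    ("Service", "api-gateway", "api-gateway"): {"spec": {"port": 80}},
    ("ConfigMap", "api-gateway", "api-gateway-config"): {"data": {"LOG_LEVEL": "info"}},
}

PLAN = plan(CURRENT, DESIRED)
```

Reviewing a summary like this before applying makes it easier to spot an unintended update.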

How It Works

Connect

Install the ops0 agent in your cluster

Monitor

View real-time cluster status and metrics

Detect

Get alerted when incidents occur

Resolve

Use AI to diagnose and fix issues

Cluster Status

Clusters show real-time health status:

Status	Meaning
Healthy	All major cluster checks are passing and workloads are behaving normally
Warning	Minor issues are present, such as resource pressure or degraded rollouts
Critical	Immediate operator action is likely required
Offline	The cluster is not connected or has stopped reporting
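One way to read the table is as a "worst signal wins" rollup. The sketch below is an assumption made for illustration only: the `cluster_status` function, its inputs, and the 120-second reporting timeout are hypothetical and are not documented ops0 behavior.

```python
def cluster_status(seconds_since_report, open_severities, offline_after=120):
    """Roll reporting age and open incident severities up into one status.

    `offline_after` is a hypothetical timeout in seconds; `open_severities`
    is an iterable of severity names for incidents that are still open.
    """
    if seconds_since_report is None or seconds_since_report > offline_after:
        return "Offline"
    severities = set(open_severities)
    if "Critical" in severities:
        return "Critical"
    if "Warning" in severities:
        return "Warning"
    return "Healthy"
```

For example, `cluster_status(10, ["Info", "Warning"])` evaluates to `"Warning"`, while a cluster that has never reported (`None`) is treated as `"Offline"` regardless of incidents.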

Incident Severities

Incidents are categorized by severity:

Severity	Typical meaning
Critical	Service down, data-loss risk, security exposure, or widespread workload failure
Warning	Degraded performance, rollout risk, or resource pressure that needs review
Info	Changes, scaling events, or other signals that are useful context but not urgent
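If you export incidents for your own triage tooling, the three severities give a natural sort order. The following sketch assumes each incident is a dictionary with `severity` and `age_minutes` fields; both field names and the `triage` helper are hypothetical.

```python
# Lower rank means more urgent.
SEVERITY_RANK = {"Critical": 0, "Warning": 1, "Info": 2}


def triage(incidents):
    """Order incidents by severity first, then newest first within a severity."""
    return sorted(
        incidents,
        key=lambda inc: (SEVERITY_RANK[inc["severity"]], inc["age_minutes"]),
    )


# Sample incidents modeled on the examples later on this page.
INCIDENTS = [
    {"severity": "Info", "title": "HPA scaled api-gateway 3→5", "age_minutes": 60},
    {"severity": "Warning", "title": "High memory usage on node-7", "age_minutes": 15},
    {"severity": "Critical", "title": "CrashLoopBackOff: api-gateway", "age_minutes": 2},
]

ORDERED = triage(INCIDENTS)
```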

Supported Resources

ops0 monitors the standard Kubernetes resource types, plus any custom resources installed in the cluster:

Category	Resources
Workloads	Deployments, StatefulSets, DaemonSets, Jobs, CronJobs, ReplicaSets, Pods
Networking	Services, Ingress, NetworkPolicies, Endpoints
Storage	PersistentVolumes, PersistentVolumeClaims, StorageClasses, ResourceQuotas
Config	ConfigMaps, Secrets, ServiceAccounts, Certificates
Scaling	HorizontalPodAutoscalers, PodDisruptionBudgets, LimitRanges
RBAC	Roles, ClusterRoles, RoleBindings, ClusterRoleBindings
Namespaces	Create, view, and delete namespaces
Custom	All CRDs installed in the cluster

Quick Start

Connect a Cluster

Install the ops0 agent and connect your first cluster.

View Incidents

Monitor and respond to cluster issues with more context.

Explore Resource Graph

Visualize resource relationships and dependencies.

Deploy Manifests

Deploy Kubernetes configurations with ops0.

Example: Investigating a Production Incident

Here's how ops0 helps you troubleshoot a production issue:

1. Incident Detected

Critical Incident • 2 minutes ago

CrashLoopBackOff: api-gateway-7d9f8c6b4d-2xkjp

Cluster: production-eks • Namespace: api-gateway

2. View in Resource Graph

The Resource Graph highlights the affected pod and its dependencies:

Pod api-gateway-7d9f8c6b4d-2xkjp shows red border
Connected Service, ConfigMap, and Secret are visible
Related Deployment shows warning status

3. AI Analysis

Root Cause Analysis:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The pod is failing to start due to a missing environment
variable DB_PASSWORD. The Secret 'api-gateway-secrets'
exists but is missing the 'db-password' key.

Last successful deployment: 3 hours ago
Recent change: Secret 'api-gateway-secrets' was updated
              45 minutes ago (removed db-password key)

Recommended Actions:
1. Add 'db-password' key to Secret 'api-gateway-secrets'
2. Or update Deployment to reference correct Secret key
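The same kind of check can be reproduced by hand. The sketch below assumes the Deployment wires `DB_PASSWORD` through `env[].valueFrom.secretKeyRef` and that both objects are already loaded as dictionaries (for example from `kubectl get ... -o json`). The `missing_secret_keys` helper, the container name, and the `api-token` key are hypothetical; this is not ops0's analysis engine.

```python
def missing_secret_keys(deployment, secrets):
    """List (container, env var, secret, key) tuples whose Secret key is absent.

    `secrets` maps Secret name to the parsed Secret object.
    """
    missing = []
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        for env in container.get("env", []):
            ref = env.get("valueFrom", {}).get("secretKeyRef")
            if not ref:
                continue  # literal value or another source, nothing to check
            data = secrets.get(ref["name"], {}).get("data", {})
            if ref["key"] not in data:
                missing.append((container["name"], env["name"], ref["name"], ref["key"]))
    return missing


# Hypothetical objects matching the incident described above.
DEPLOYMENT = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "api-gateway",
        "env": [
            {"name": "LOG_LEVEL", "value": "info"},
            {"name": "DB_PASSWORD", "valueFrom": {
                "secretKeyRef": {"name": "api-gateway-secrets", "key": "db-password"}}},
        ],
    }]}}}
}
SECRETS = {"api-gateway-secrets": {"data": {"api-token": "c2VjcmV0"}}}

MISSING = missing_secret_keys(DEPLOYMENT, SECRETS)
```

Here `MISSING` contains one entry pointing at the `db-password` key. Once that key is restored in the Secret, the function returns an empty list.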

4. Resolution

After fixing the Secret:

Resolved • Just now

Pod successfully started

3/3 replicas running • All health checks passing

Example: Cluster Dashboard View

┌─────────────────────────────────────────────────────────┐
│  production-eks                          ● Healthy      │
│  AWS EKS 1.28 • us-east-1 • 12 nodes                   │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Nodes          Pods           CPU        Memory        │
│  ━━━━━━━━━━━    ━━━━━━━━━━━    ━━━━━━━    ━━━━━━━       │
│  12/12 Ready    156/200        42%        61%           │
│  ● ● ● ● ●      ████████░░     ████░░░    ██████░       │
│  ● ● ● ● ●                                              │
│  ● ●                                                    │
│                                                         │
│  Recent Incidents                                       │
│  ─────────────────────────────────────────────────────  │
│  ● Warning   High memory usage on node-7    15m ago     │
│  ● Info      HPA scaled api-gateway 3→5     1h ago      │
│  ● Resolved  CrashLoop fixed                2h ago      │
│                                                         │
│  Top Namespaces by Pod Count                            │
│  ─────────────────────────────────────────────────────  │
│  api-gateway      ████████████████████  45              │
│  web-frontend     ████████████          28              │
│  backend          ████████              19              │
│  monitoring       ██████                14              │
│                                                         │
└─────────────────────────────────────────────────────────┘