
Events

Monitor cluster-wide Kubernetes events to track resource state changes, failures, and system-level activities across all namespaces.

What are Kubernetes Events?

Events are time-limited records of state changes and notable occurrences in your cluster. They provide visibility into what's happening behind the scenes with your resources.

Event Characteristics:

  • Automatic: Kubernetes creates events automatically for resource state changes
  • Temporary: Events expire after 1 hour by default (retained longer in ops0)
  • Informational: Describe what happened, when, and why
  • Diagnostic: Essential for troubleshooting pod failures and scheduling issues
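Outside the ops0 UI, the same events are exposed by the Kubernetes API and can be listed with kubectl (a minimal sketch, assuming kubectl is already configured against your cluster):

```shell
# List recent events across all namespaces, oldest first
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp
```

Note that this only shows events still within the cluster's retention window (1 hour by default).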

Event Types

  • Normal: Expected state transitions (pod scheduled, image pulled, container started)
  • Warning: Issues and errors (failed health checks, evictions, scheduling failures)

Viewing Events

Access cluster-wide events from the Kubernetes overview:

Event List

Column      Description
Type        Normal or Warning
Reason      Event reason code (Scheduled, Pulling, Failed, etc.)
Object      Affected resource (pod, node, deployment, etc.)
Namespace   Kubernetes namespace
Message     Human-readable event description
Count       How many times this event occurred
First Seen  When event first occurred
Last Seen   Most recent occurrence

Common Event Reasons

Normal Events (Success)

Reason             Object Type  Meaning
Scheduled          Pod          Pod assigned to node successfully
Pulling            Pod          Downloading container image
Pulled             Pod          Container image downloaded successfully
Created            Pod          Container created
Started            Pod          Container started successfully
Killing            Pod          Container being terminated gracefully
ScalingReplicaSet  Deployment   Replica count adjusted
SuccessfulCreate   ReplicaSet   New pod created successfully

Warning Events (Failures)

Reason            Object Type  Meaning                                Action Needed
Failed            Pod          Container exited with non-zero code    Check container logs for errors
BackOff           Pod          Container restarting after crash       Investigate crash cause (logs, events)
FailedScheduling  Pod          Cannot find node meeting requirements  Check resource requests, node capacity, taints/tolerations
FailedMount       Pod          Volume mount failed                    Verify PVC exists and is bound
ImagePullBackOff  Pod          Cannot pull container image            Check image name, registry credentials
Unhealthy         Pod          Readiness/liveness probe failed        Check probe configuration and app health endpoint
Evicted           Pod          Pod removed due to resource pressure   Increase node capacity or reduce resource usage
FailedKillPod     Pod          Could not terminate pod                Check node status, may require manual cleanup

Filtering Events

Narrow the event list to find specific issues:

By Type

Normal Events Only:

Filter: Type = Normal
Shows: Successful operations, routine activities

Warning Events Only:

Filter: Type = Warning
Shows: Errors, failures, issues requiring attention

By Resource Type

Filter: Object Kind = Pod
Shows: Events related to pods only

Filter: Object Kind = Node
Shows: Node-level events (NotReady, DiskPressure, etc.)

By Namespace

Filter: Namespace = production
Shows: Events in production namespace only

Filter: Namespace = kube-system
Shows: System component events

By Reason

Filter: Reason = FailedScheduling
Shows: All pod scheduling failures

Filter: Reason = BackOff
Shows: All container crash loops

By Time Range

Filter: Last 1 hour
Filter: Last 24 hours
Filter: Last 7 days
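The type, reason, and object filters above can be approximated at the command line with kubectl field selectors (a sketch; event objects support only a limited set of selectable fields, and the pod name is an example):

```shell
# Warning events only
kubectl get events --all-namespaces --field-selector type=Warning

# Events for one pod in one namespace
kubectl get events -n production \
  --field-selector involvedObject.kind=Pod,involvedObject.name=my-pod

# All scheduling failures in a namespace
kubectl get events -n production --field-selector reason=FailedScheduling
```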

Investigating Common Issues

Pod Won't Start: FailedScheduling

Event:

Type:    Warning
Reason:  FailedScheduling
Message: 0/5 nodes are available: 3 Insufficient memory,
         2 node(s) had taints that pod didn't tolerate.

Investigation Steps:

Check Resource Requests

The pod is requesting more memory than any node has available.

Solution: Reduce resources.requests.memory or add nodes with more capacity.

Check Node Taints

Nodes have taints preventing pod scheduling.

Solution: Add tolerations to pod spec or remove taints from nodes.

Example Fix:

spec:
  containers:
  - name: app
    resources:
      requests:
        memory: "512Mi"  # Was 8Gi (too large)
  tolerations:  # Add if nodes are tainted
  - key: "dedicated"
    operator: "Equal"
    value: "production"
    effect: "NoSchedule"
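To confirm which constraint is actually blocking the pod, node capacity and taints can be inspected directly (a sketch; output format varies by kubectl version):

```shell
# Show taints and allocatable resources for each node
kubectl describe nodes | grep -E -A 2 'Taints|Allocatable'

# One line per node with its taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```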

Container Crash Loop: BackOff

Event:

Type:    Warning
Reason:  BackOff
Message: Back-off restarting failed container
Count:   15

Investigation Steps:

Check Container Logs

View the logs to see why the container is crashing.

Command: Click pod → Logs tab

Check Exit Code

Identify how container exited.

  • Exit Code 1: Application error
  • Exit Code 137: OOMKilled (out of memory)
  • Exit Code 143: SIGTERM (graceful shutdown)

Fix Root Cause

  • Application error: Fix code bug
  • OOMKilled: Increase memory limit
  • Missing dependency: Fix configuration or add init container
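The exit-code mapping above can be captured in a small shell helper for triage scripts (a sketch; the function name is illustrative, and codes outside the three listed above vary by application):

```shell
# Map a container exit code to a likely cause (based on the codes above)
explain_exit_code() {
  case "$1" in
    1)   echo "Application error" ;;
    137) echo "OOMKilled (out of memory)" ;;        # 128 + 9 (SIGKILL)
    143) echo "SIGTERM (graceful shutdown)" ;;      # 128 + 15 (SIGTERM)
    *)   echo "Unknown: check application docs" ;;
  esac
}

explain_exit_code 137   # prints: OOMKilled (out of memory)
```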

Image Pull Failed: ImagePullBackOff

Event:

Type:    Warning
Reason:  Failed
Message: Failed to pull image "myapp:v2.0.1":
         rpc error: code = Unknown desc = Error response
         from daemon: pull access denied for myapp,
         repository does not exist or may require 'docker login'

Common Causes:

Cause                         Solution
Image name typo               Verify image name and tag are correct
Private registry              Add imagePullSecrets to pod spec
Image doesn't exist           Push image to registry or fix tag
Registry credentials expired  Update secret with new credentials

Adding Image Pull Secret:

spec:
  imagePullSecrets:
  - name: registry-credentials
  containers:
  - name: app
    image: myregistry.com/myapp:v2.0.1
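The registry-credentials secret referenced above can be created with kubectl (the registry server, username, and password are placeholders for your own registry):

```shell
# Create a docker-registry secret in the pod's namespace
kubectl create secret docker-registry registry-credentials \
  --docker-server=myregistry.com \
  --docker-username=<user> \
  --docker-password=<password>
```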

Node Issues: NodeNotReady

Event:

Type:    Warning
Reason:  NodeNotReady
Object:  Node/ip-10-0-1-45
Message: Node ip-10-0-1-45 status is now: NodeNotReady

Investigation:

Check Node Conditions

View node detail page for conditions: DiskPressure, MemoryPressure, PIDPressure.

Check Node Logs

SSH to node (if accessible) and check kubelet logs:

journalctl -u kubelet -f

Common Fixes

  • DiskPressure: Clean up unused images and logs
  • MemoryPressure: Increase node size or evict pods
  • Network issues: Check CNI plugin status
  • Kubelet crash: Restart kubelet service
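While investigating, a common remediation flow is to fence off the node so workloads move elsewhere (a sketch; the node name matches the example event above):

```shell
# Stop new pods landing on the node
kubectl cordon ip-10-0-1-45

# Evict existing pods (DaemonSet pods are skipped; emptyDir data is lost)
kubectl drain ip-10-0-1-45 --ignore-daemonsets --delete-emptydir-data

# Re-enable scheduling once the node recovers
kubectl uncordon ip-10-0-1-45
```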

Event Patterns for Troubleshooting

Deployment Rollout Stuck

Event Sequence:

10:30:00  Normal   ScalingReplicaSet  Scaled up replica set to 3
10:30:05  Normal   SuccessfulCreate   Created pod: app-new-abc12
10:30:10  Warning  FailedScheduling   0/5 nodes available
10:30:15  Warning  FailedScheduling   0/5 nodes available
...

Diagnosis: New pods cannot schedule due to insufficient resources.

Fix: Increase cluster capacity or reduce resource requests.
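Rollout progress and the blocking pod's events can also be checked from the CLI (the deployment and pod names come from the example sequence above):

```shell
# Watch the rollout; reports if new pods never become available
kubectl rollout status deployment/app --timeout=120s

# Show just the Events section for the stuck pod
kubectl describe pod app-new-abc12 | sed -n '/Events:/,$p'
```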

Persistent Volume Issues

Event Sequence:

10:25:00  Normal   Scheduled       Pod assigned to node-3
10:25:05  Normal   Pulling         Pulling image (success)
10:25:20  Normal   Pulled          Image pulled
10:25:25  Warning  FailedMount     MountVolume.SetUp failed:
                                   PVC "data-pvc" not found

Diagnosis: PVC doesn't exist or is in different namespace.

Fix: Create PVC or fix PVC name in pod spec.
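To verify the claim exists and is bound in the pod's namespace (a sketch; "data-pvc" comes from the event above, and the namespace is an assumption):

```shell
# Check the PVC exists in the same namespace as the pod and is Bound
kubectl get pvc data-pvc -n production

# If it's missing, look everywhere to catch a namespace mismatch
kubectl get pvc --all-namespaces | grep data-pvc
```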

Liveness Probe Killing Healthy Pods

Event Sequence:

10:40:00  Normal   Started          Container started
10:40:30  Warning  Unhealthy        Liveness probe failed:
                                    HTTP probe failed with statuscode: 500
10:40:40  Warning  Unhealthy        Liveness probe failed (2nd time)
10:40:50  Warning  Unhealthy        Liveness probe failed (3rd time)
10:40:51  Normal   Killing          Killing container due to liveness probe failure
10:41:00  Normal   Pulled           Container image pulled
10:41:05  Normal   Started          Container restarted

Diagnosis: Liveness probe too aggressive or app has slow startup.

Fix: Increase initialDelaySeconds or adjust probe threshold.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 60  # Wait longer for app to start
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3  # Allow 3 failures before killing

Event Retention

Kubernetes Default: Events expire after 1 hour.

ops0 Retention: Events retained for 30 days for historical troubleshooting.

Benefits:

  • Investigate issues that occurred outside the 1-hour window
  • Compare event patterns across deployments
  • Track recurring issues over time

Using Events with Incidents

Events automatically feed into incident detection:

Event Occurs

Kubernetes emits a warning event (e.g., BackOff from a crash-looping container).

Incident Created

ops0 detects event pattern and creates incident.

Events Attached

All related events attached to incident for context.

Timeline Built

Event timeline shows sequence leading to incident.

Example Incident Timeline:

Incident #1247: CrashLoopBackOff in api-gateway
─────────────────────────────────────────────────
10:45:00  Normal   Started          Container started
10:45:15  Warning  Unhealthy        Readiness probe failed
10:45:15  Normal   Killing          Container exited (code 1)
10:45:30  Normal   Started          Container restarted (attempt 1)
10:45:45  Normal   Killing          Container exited (code 1)
10:46:00  Normal   Started          Container restarted (attempt 2)
10:46:15  Normal   Killing          Container exited (code 1)
10:46:30  Warning  BackOff          Back-off restarting failed container
10:47:00  -        IncidentCreated  Incident #1247 opened

Best Practices

Event Monitoring Tips
  • Check events first - Before diving into logs, check events for high-level context
  • Filter by Warning - Focus on warning events to find issues quickly
  • Watch for patterns - Recurring events indicate systemic issues vs one-off failures
  • Combine with logs - Events show what happened, logs show why
  • Set up alerts - Configure Slack/PagerDuty notifications for critical event patterns
  • Review after deployments - Check for unusual events after rolling out changes