Incident Management

When a deployment fails or infrastructure behaves unexpectedly, ops0 automatically creates an incident record. An AI model analyzes the failure using the full deployment log, plan output, and policy check results, then suggests a step-by-step remediation runbook to help your team resolve the issue quickly.

What Triggers an Incident

Trigger	Description
Apply failure	`terraform apply` exits with a non-zero status code
Policy block	A deployment is blocked by a critical policy violation
Drift detected	Live infrastructure state diverges significantly from the IaC state
Plan timeout	The plan stage exceeds the configured timeout threshold
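
The trigger table can be read as a small classification step. The sketch below is illustrative only: the `DeploymentEvent` fields, the 600-second default timeout, and the order in which conditions are checked are all assumptions, not ops0's documented behavior.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Trigger(Enum):
    APPLY_FAILURE = "apply failure"
    POLICY_BLOCK = "policy block"
    DRIFT = "drift"
    TIMEOUT = "timeout"


@dataclass
class DeploymentEvent:
    # Hypothetical event shape; every field name here is an assumption.
    apply_exit_code: Optional[int] = None    # None if apply never ran
    critical_policy_violation: bool = False
    significant_drift: bool = False
    plan_seconds: float = 0.0
    plan_timeout_seconds: float = 600.0      # assumed threshold


def classify_trigger(event: DeploymentEvent) -> Optional[Trigger]:
    """Return the trigger that would open an incident, or None if nothing fired."""
    # Check order is an assumption: stages are tested in pipeline order.
    if event.plan_seconds > event.plan_timeout_seconds:
        return Trigger.TIMEOUT
    if event.critical_policy_violation:
        return Trigger.POLICY_BLOCK
    if event.apply_exit_code is not None and event.apply_exit_code != 0:
        return Trigger.APPLY_FAILURE
    if event.significant_drift:
        return Trigger.DRIFT
    return None
```

A successful apply with no drift returns `None`, meaning no incident would be created.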

Incident Lifecycle

Detected

ops0 detects the triggering event — a failed apply, a policy block, significant drift, or a timeout. An incident record is created immediately.

Analyzing

The AI engine receives the full deployment log, plan diff, policy check results, and Terraform error output. Analysis typically completes within 30 seconds.

Open

The incident moves to Open status once analysis is complete. Your team is notified and the incident is visible in the Incidents tab. Action is required.

Resolved

A team member follows the remediation steps, fixes the underlying issue, and re-deploys successfully. They then mark the incident as Resolved.

Closed

After resolution is confirmed and the deployment succeeds, the incident is marked Closed and archived in the incident history.
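
If you mirror incident state in your own tooling, the five stages above can be modeled as a forward-only state machine. This is a minimal sketch under that assumption; the `Status` enum and `advance` function are hypothetical names, and ops0 may allow transitions (such as reopening) that are not described on this page.

```python
from enum import Enum


class Status(Enum):
    DETECTED = "detected"
    ANALYZING = "analyzing"
    OPEN = "open"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Allowed forward transitions, read off the lifecycle stages above.
TRANSITIONS = {
    Status.DETECTED: {Status.ANALYZING},
    Status.ANALYZING: {Status.OPEN},
    Status.OPEN: {Status.RESOLVED},
    Status.RESOLVED: {Status.CLOSED},
    Status.CLOSED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the lifecycle does not allow the move."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

For example, `advance(Status.OPEN, Status.RESOLVED)` succeeds, while `advance(Status.OPEN, Status.CLOSED)` raises `ValueError` because the Resolved stage was skipped.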

Viewing Incidents

Open the IaC Project

Navigate to the relevant project from the ops0 dashboard.

Click the Incidents Tab

The Incidents tab shows all active and historical incidents for the project.

Review the Incident List

Each row shows severity, current status, trigger type, and when the incident was created.

Click an Incident

Open the incident detail page to view the AI analysis, runbook, notes, and related deployment.

Incident List Columns

Column	Description
Severity	`critical`, `high`, `medium`, or `low` based on the nature of the failure
Status	`analyzing`, `open`, `resolved`, or `closed`
Trigger	What caused the incident (apply failure, policy block, drift, timeout)
Created	Timestamp when the incident was detected
Duration	Time elapsed since detection (for open incidents) or time to resolution (for resolved ones)

Incident Detail

Clicking an incident opens the detail page, which contains the following sections.

Error Summary

The raw Terraform or apply output that caused the failure, displayed with syntax highlighting. Long outputs are paginated but fully downloadable as a text file.

Error: Error creating S3 bucket: BucketAlreadyOwnedByYou: 
  Your previous request to create the named bucket succeeded 
  and you already own it.
  
  with aws_s3_bucket.assets,
  on main.tf line 12, in resource "aws_s3_bucket" "assets":
  12: resource "aws_s3_bucket" "assets" {

AI Analysis

The AI analysis section contains three parts:

Part	Description
What went wrong	A plain-language explanation of the failure, written for engineers who may not be familiar with the specific error
Root cause hypothesis	The AI's best assessment of the underlying cause — configuration error, state drift, permission issue, provider bug, etc.
Contributing factors	Secondary issues identified in the logs that may have contributed to the primary failure or may recur alongside it

Recommended Actions

A numbered list of concrete steps the AI recommends to resolve the incident. Actions are ordered by priority — address them in sequence to avoid compounding the problem.

Example recommended actions for a state conflict error:

1. Run `terraform state list` to confirm the conflicting resource exists in state
2. If the resource was created outside Terraform, import it: `terraform import aws_s3_bucket.assets my-bucket-name`
3. If the resource is orphaned, remove it from state: `terraform state rm aws_s3_bucket.assets`
4. Re-run the deployment after the state is consistent

Runbook

The auto-generated runbook provides a detailed, step-by-step resolution guide tailored to the specific error type. Unlike the recommended actions (which are high-level), the runbook includes:

Exact CLI commands to run
What to check in the AWS Console or equivalent cloud provider UI
How to verify the fix before re-deploying
Rollback steps if the fix does not work

Notes

Team members can add freeform notes to an incident at any time. Notes are timestamped and attributed to the author. Use notes to:

Record what was tried and the outcome
Share context from an on-call handoff
Link to external tickets or post-mortems
Document the confirmed root cause after resolution

Related Deployment

Every incident links back to the deployment that triggered it. Click View Deployment to open the full deployment detail, including the plan diff, cost estimate, and complete log output.

Resolving an Incident

Review the AI Analysis

Read the error summary, root cause hypothesis, and recommended actions in the incident detail page.

Follow the Runbook

Work through the auto-generated runbook step by step. Add notes to the incident as you go to keep the team informed.

Fix the Code or Configuration

Correct the Terraform code, update variables, import missing state, or resolve the underlying infrastructure issue as directed by the runbook.

Re-deploy

Trigger a new deployment from the project. If it succeeds, the incident can be resolved.

Mark Resolved

Return to the incident and click Mark as Resolved. Add a final note describing what fixed the issue. The incident moves to Resolved status, and is marked Closed and archived once the resolution is confirmed.

Adding Notes to an Incident

Notes are the primary collaboration tool during incident response. To add a note:

1. Open the incident detail page
2. Scroll to the Notes section
3. Type your note in the text area (Markdown formatting is supported)
4. Click Add Note

Notes are immediately visible to all team members with access to the project. There is no limit on the number of notes per incident.

Markdown in Notes

Notes support full markdown formatting including code blocks, bullet lists, bold, and links. Use code blocks to share Terraform output, AWS CLI responses, or configuration snippets with your team.

Incident Severity

ops0 assigns severity automatically based on the trigger type and the scope of impact:

Severity	Assigned When
Critical	Apply failure on a production project, or policy block on a security-critical policy
High	Apply failure on a staging project, or drift detected on a production project
Medium	Plan timeout, policy block on a non-critical policy, or drift on staging
Low	Informational failures, drift on development environments

Severity can be manually adjusted on the incident detail page if the automatic classification does not match the actual impact.
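
The severity table can be expressed as a lookup function. The sketch below follows the table row by row; the function name, parameter names, and string values are hypothetical, and any combination the table does not list (for example, an apply failure on a development project) falls through to `low` as an assumption.

```python
def assign_severity(trigger: str, environment: str,
                    security_critical_policy: bool = False) -> str:
    """Map a trigger and environment to a severity, following the table above."""
    if trigger == "apply failure":
        if environment == "production":
            return "critical"
        if environment == "staging":
            return "high"
    elif trigger == "policy block":
        # The table distinguishes security-critical from non-critical policies.
        return "critical" if security_critical_policy else "medium"
    elif trigger == "drift":
        if environment == "production":
            return "high"
        if environment == "staging":
            return "medium"
    elif trigger == "timeout":
        return "medium"
    # Assumption: anything not covered by the table is treated as low.
    return "low"
```

Because severity can be adjusted manually, treat a function like this as a starting classification rather than a final verdict.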

Incident Statistics

The Incidents tab includes an aggregate statistics panel at the top of the page:

Metric	Description
Total Incidents	All-time count of incidents for the project
Open Incidents	Incidents currently in `open` or `analyzing` status
Mean Time to Resolve (MTTR)	Average time from detection to resolution across all closed incidents
Common Failure Patterns	The top recurring error types identified by the AI across all incidents

Use the common failure patterns view to identify systemic issues in your Terraform code or cloud account configuration that are causing repeated incidents.
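
To make the metric definitions concrete, here is a minimal sketch of how they could be computed from a list of incident records. The dictionary keys (`status`, `created_at`, `resolved_at`, `error_type`) are hypothetical, and the pattern count simply tallies an error-type label rather than reproducing the AI's own grouping.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def open_incident_count(incidents: List[dict]) -> int:
    """Count incidents in `open` or `analyzing` status."""
    return sum(1 for inc in incidents if inc["status"] in ("open", "analyzing"))


def mean_time_to_resolve(incidents: List[dict]) -> Optional[timedelta]:
    """Average detection-to-resolution time across closed incidents."""
    durations = [
        inc["resolved_at"] - inc["created_at"]
        for inc in incidents
        if inc["status"] == "closed"
    ]
    if not durations:
        return None  # no closed incidents yet, so MTTR is undefined
    return sum(durations, timedelta()) / len(durations)


def common_failure_patterns(incidents: List[dict],
                            top_n: int = 3) -> List[Tuple[str, int]]:
    """Return the most frequent error types with their counts."""
    return Counter(inc["error_type"] for inc in incidents).most_common(top_n)
```

With two closed incidents that took one hour and three hours to resolve, `mean_time_to_resolve` returns a `timedelta` of two hours.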

AI Analysis Quality

How AI Analysis Works

AI analysis uses the full deployment log, plan output, and policy check results as context. The more descriptive your Terraform resource names, variable names, and output descriptions are, the more accurate and actionable the analysis will be. Generic names like `resource1` or `var1` reduce the AI's ability to identify the specific component that caused the failure.

The AI engine does not have access to your cloud account or live infrastructure — analysis is based entirely on the logs and outputs captured during the deployment. For issues that require inspecting live state (such as unexpected resource configurations or permission boundaries), the runbook will direct you to the relevant console or CLI commands to gather that information manually.

Notifications

ops0 can notify your team when an incident is created, escalated, or resolved. Configure notification channels in Settings → Notifications:

Channel	Trigger
Email	New incident created, incident resolved
Slack	New incident created, severity escalated, incident resolved
PagerDuty	Critical severity incidents only
Webhook	All incident lifecycle events (for custom integrations)
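
The channel table amounts to a routing rule per event. The sketch below is one possible reading of it: the event names are hypothetical, PagerDuty is assumed to fire when a critical incident is created or when an incident is escalated to critical, and the webhook is assumed to receive every event including escalations.

```python
from typing import List


def channels_for(event: str, severity: str) -> List[str]:
    """Return the channels that would be notified for an incident event."""
    # Assumption: the webhook channel receives every event.
    channels = ["webhook"]
    if event in ("created", "resolved"):
        channels.append("email")
    if event in ("created", "severity_escalated", "resolved"):
        channels.append("slack")
    # Assumption: PagerDuty pages only for critical incidents, on creation
    # or on escalation to critical.
    if severity == "critical" and event in ("created", "severity_escalated"):
        channels.append("pagerduty")
    return channels
```

For example, a newly created critical incident reaches all four channels, while a `closed` event reaches only the webhook.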

Filtering and Searching Incidents

The incident list supports filtering by:

Status: open, resolved, closed, analyzing
Severity: critical, high, medium, low
Trigger type: apply failure, policy block, drift, timeout
Date range: filter by when incidents were created or resolved

Use the search bar to find incidents by keyword — ops0 searches across error summaries, AI analysis text, and notes.
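
If you export incidents and want to reproduce these filters offline, the logic can be sketched as follows. The record keys are hypothetical, only the creation date is filtered (not the resolution date), and keyword search is assumed to be a case-insensitive substring match; ops0's actual search behavior may differ.

```python
from datetime import datetime
from typing import Iterable, List, Optional


def filter_incidents(
    incidents: Iterable[dict],
    status: Optional[str] = None,
    severity: Optional[str] = None,
    trigger: Optional[str] = None,
    created_from: Optional[datetime] = None,
    created_to: Optional[datetime] = None,
    keyword: Optional[str] = None,
) -> List[dict]:
    """Return incidents matching every filter that was supplied."""
    results = []
    for inc in incidents:
        if status and inc["status"] != status:
            continue
        if severity and inc["severity"] != severity:
            continue
        if trigger and inc["trigger"] != trigger:
            continue
        if created_from and inc["created_at"] < created_from:
            continue
        if created_to and inc["created_at"] > created_to:
            continue
        if keyword:
            # Search across error summary, AI analysis text, and notes.
            haystack = " ".join(
                [inc.get("error_summary", ""), inc.get("analysis", "")]
                + list(inc.get("notes", []))
            ).lower()
            if keyword.lower() not in haystack:
                continue
        results.append(inc)
    return results
```

Filters combine with AND semantics in this sketch, so `filter_incidents(incidents, status="open", severity="critical")` returns only incidents that are both open and critical.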

Related

Deployments

View deployment history and understand the apply lifecycle

Drift Detection

Detect when live infrastructure diverges from your IaC state

Policies

Enforce security and compliance policies that can block deployments
