When a deployment fails or infrastructure behaves unexpectedly, ops0 automatically creates an incident record. An AI model analyzes the failure using the full deployment log, plan output, and policy check results, then suggests a step-by-step remediation runbook to help your team resolve the issue quickly.
| Trigger | Description |
|---|---|
| Apply failure | `terraform apply` exits with a non-zero status code |
| Policy block | A deployment is blocked by a critical policy violation |
| Drift detected | Live infrastructure state diverges significantly from the IaC state |
| Plan timeout | The plan stage exceeds the configured timeout threshold |
ops0 detects the triggering event — a failed apply, a policy block, significant drift, or a timeout. An incident record is created immediately.
The AI engine receives the full deployment log, plan diff, policy check results, and Terraform error output. Analysis typically completes within 30 seconds.
The incident moves to Open status once analysis is complete. Your team is notified and the incident is visible in the Incidents tab. Action is required.
A team member follows the remediation steps, fixes the underlying issue, and re-deploys successfully. They then mark the incident as Resolved.
After resolution is confirmed and the deployment succeeds, the incident is marked Closed and archived in the incident history.
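The lifecycle above can be read as a small state machine. The following is a minimal sketch, not ops0's implementation: the `Status` enum, `ALLOWED_TRANSITIONS`, and `advance` are hypothetical names, and the sketch assumes incidents only move forward (the text does not say whether a resolved incident can be reopened).

```python
from enum import Enum


class Status(Enum):
    ANALYZING = "analyzing"
    OPEN = "open"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Forward-only transitions, following the lifecycle steps described above.
# Assumption: no reopening and no skipping of stages.
ALLOWED_TRANSITIONS = {
    Status.ANALYZING: {Status.OPEN},
    Status.OPEN: {Status.RESOLVED},
    Status.RESOLVED: {Status.CLOSED},
    Status.CLOSED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the lifecycle does not allow the move."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


# Walk one incident through the full lifecycle.
status = Status.ANALYZING
for step in (Status.OPEN, Status.RESOLVED, Status.CLOSED):
    status = advance(status, step)
```

Keeping the transitions in a table rather than in scattered conditionals makes it easy to see at a glance which moves are legal.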
Navigate to the relevant project from the ops0 dashboard.
The Incidents tab shows all active and historical incidents for the project.
Each row shows severity, current status, trigger type, and when the incident was created.
Open the incident detail page to view the AI analysis, runbook, notes, and related deployment.
| Column | Description |
|---|---|
| Severity | critical, high, medium, or low based on the nature of the failure |
| Status | analyzing, open, resolved, or closed |
| Trigger | What caused the incident (apply failure, policy block, drift, timeout) |
| Created | Timestamp when the incident was detected |
| Duration | Time elapsed since detection (for open incidents) or time to resolution (for resolved ones) |
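
The Duration column can be thought of as one calculation with two cases. This is an illustrative sketch only; `incident_duration` and its parameters are hypothetical and not part of any ops0 API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def incident_duration(
    created: datetime,
    resolved: Optional[datetime] = None,
    now: Optional[datetime] = None,
) -> timedelta:
    """Time to resolution if resolved, otherwise time elapsed since detection."""
    if resolved is not None:
        return resolved - created
    # Open incident: measure against the current time (UTC assumed).
    now = now or datetime.now(timezone.utc)
    return now - created
```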
Clicking an incident opens the detail page, which contains the following sections.
The raw Terraform or apply output that caused the failure, displayed with syntax highlighting. Long outputs are paginated but fully downloadable as a text file.
```
Error: Error creating S3 bucket: BucketAlreadyOwnedByYou:
Your previous request to create the named bucket succeeded
and you already own it.
with aws_s3_bucket.assets,
on main.tf line 12, in resource "aws_s3_bucket" "assets":
12: resource "aws_s3_bucket" "assets" {
```
The AI analysis section contains three parts:
| Part | Description |
|---|---|
| What went wrong | A plain-language explanation of the failure, written for engineers who may not be familiar with the specific error |
| Root cause hypothesis | The AI's best assessment of the underlying cause — configuration error, state drift, permission issue, provider bug, etc. |
| Contributing factors | Any secondary issues identified in the logs that may have contributed to or will recur alongside the primary failure |
A numbered list of concrete steps the AI recommends to resolve the incident. Actions are ordered by priority — address them in sequence to avoid compounding the problem.
Example recommended actions for a state conflict error:
1. `terraform state list` to confirm the conflicting resource exists in state
2. `terraform import aws_s3_bucket.assets my-bucket-name`
3. `terraform state rm aws_s3_bucket.assets`

The auto-generated runbook provides a detailed, step-by-step resolution guide tailored to the specific error type. Unlike the recommended actions (which are high-level), the runbook includes:
Team members can add freeform notes to an incident at any time. Notes are timestamped and attributed to the author. Use notes to:
Every incident links back to the deployment that triggered it. Click View Deployment to open the full deployment detail, including the plan diff, cost estimate, and complete log output.
Read the error summary, root cause hypothesis, and recommended actions in the incident detail page.
Work through the auto-generated runbook step by step. Add notes to the incident as you go to keep the team informed.
Correct the Terraform code, update variables, import missing state, or resolve the underlying infrastructure issue as directed by the runbook.
Trigger a new deployment from the project. If it succeeds, the incident can be resolved.
Return to the incident and click Mark as Resolved. Add a final note describing what fixed the issue. The incident moves to Resolved status, then to Closed once the resolution is confirmed.
Notes are the primary collaboration tool during incident response. To add a note:
Notes are immediately visible to all team members with access to the project. There is no limit on the number of notes per incident.
Notes support full markdown formatting including code blocks, bullet lists, bold, and links. Use code blocks to share Terraform output, AWS CLI responses, or configuration snippets with your team.
ops0 assigns severity automatically based on the trigger type and the scope of impact:
| Severity | Assigned When |
|---|---|
| Critical | Apply failure on a production project, or policy block on a security-critical policy |
| High | Apply failure on a staging project, or drift detected on a production project |
| Medium | Plan timeout, policy block on a non-critical policy, or drift on staging |
| Low | Informational failures, drift on development environments |
Severity can be manually adjusted on the incident detail page if the automatic classification does not match the actual impact.
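The severity table can be expressed as a short rule function. This is a sketch under stated assumptions, not ops0's actual classifier: the function name, the trigger strings (`apply_failure`, `policy_block`, `drift`, `plan_timeout`), and the environment strings are hypothetical, and any combination the table does not list (for example, an apply failure on a development project) falls back to `low` purely as an assumption.

```python
def assign_severity(
    trigger: str,
    environment: str,
    security_critical_policy: bool = False,
) -> str:
    """Map a trigger and environment to a severity, following the table above."""
    # Critical: production apply failure, or a security-critical policy block.
    if trigger == "apply_failure" and environment == "production":
        return "critical"
    if trigger == "policy_block" and security_critical_policy:
        return "critical"
    # High: staging apply failure, or drift on production.
    if trigger == "apply_failure" and environment == "staging":
        return "high"
    if trigger == "drift" and environment == "production":
        return "high"
    # Medium: plan timeout, non-critical policy block, or drift on staging.
    if trigger in ("plan_timeout", "policy_block"):
        return "medium"
    if trigger == "drift" and environment == "staging":
        return "medium"
    # Low: drift on development, informational failures, and (assumption)
    # anything the table does not cover.
    return "low"
```

Checking the rules from most to least severe means an incident that matches several rows receives the highest applicable severity.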
The Incidents tab includes an aggregate statistics panel at the top of the page:
| Metric | Description |
|---|---|
| Total Incidents | All-time count of incidents for the project |
| Open Incidents | Incidents currently in open or analyzing status |
| Mean Time to Resolve (MTTR) | Average time from detection to resolution across all closed incidents |
| Common Failure Patterns | The top recurring error types identified by the AI across all incidents |
Use the common failure patterns view to identify systemic issues in your Terraform code or cloud account configuration that are causing repeated incidents.
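To make the two aggregate metrics concrete, here is a minimal sketch of how MTTR and a top-N list of failure patterns could be computed from exported incident data. The function names and input shapes are hypothetical; how ops0 groups errors into patterns is not described, so the sketch simply counts identical error-type labels.

```python
from collections import Counter
from datetime import timedelta
from typing import List, Optional, Tuple


def mean_time_to_resolve(durations: List[timedelta]) -> Optional[timedelta]:
    """Average detection-to-resolution time, or None if there is no data yet."""
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def common_failure_patterns(error_types: List[str], top: int = 3) -> List[Tuple[str, int]]:
    """Most frequent error-type labels with their counts, most common first."""
    return Counter(error_types).most_common(top)


mttr = mean_time_to_resolve([timedelta(minutes=30), timedelta(minutes=90)])
patterns = common_failure_patterns(
    ["state_conflict", "permission", "state_conflict", "timeout", "state_conflict", "permission"],
    top=2,
)
```

In this example `mttr` is one hour and `patterns` lists `state_conflict` (3 occurrences) ahead of `permission` (2).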
AI analysis uses the full deployment log, plan output, and policy check results as context. The more descriptive your Terraform resource names, variable names, and output descriptions are, the more accurate and actionable the analysis will be. Generic names like resource1 or var1 reduce the AI's ability to identify the specific component that caused the failure.
The AI engine does not have access to your cloud account or live infrastructure — analysis is based entirely on the logs and outputs captured during the deployment. For issues that require inspecting live state (such as unexpected resource configurations or permission boundaries), the runbook will direct you to the relevant console or CLI commands to gather that information manually.
ops0 can notify your team when an incident is created, escalated, or resolved. Configure notification channels in Settings → Notifications:
| Channel | Trigger |
|---|---|
| Email | New incident created, incident resolved |
| Slack | New incident created, severity escalated, incident resolved |
| PagerDuty | Critical severity incidents only |
| Webhook | All incident lifecycle events (for custom integrations) |
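
The routing rules for the Slack, PagerDuty, and Webhook rows can be sketched as a single lookup. Everything named here is hypothetical: the event strings (`created`, `escalated`, `resolved`, `closed`) and the `channels_for` function are illustrative and not an ops0 API. The sketch also assumes PagerDuty fires whenever the incident's current severity is critical, which the table does not spell out.

```python
from typing import List


def channels_for(event: str, severity: str) -> List[str]:
    """Return the channels that would be notified for a lifecycle event."""
    # Webhook receives every lifecycle event.
    channels = ["webhook"]
    # Slack: new incident, severity escalation, and resolution.
    if event in {"created", "escalated", "resolved"}:
        channels.append("slack")
    # PagerDuty: critical incidents only (assumed: on creation or escalation).
    if severity == "critical" and event in {"created", "escalated"}:
        channels.append("pagerduty")
    return channels
```

For a custom integration, the webhook channel is the only one that sees every event, so it is the natural place to attach your own routing logic.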
The incident list supports filtering by:
- Status: open, resolved, closed, analyzing
- Severity: critical, high, medium, low

Use the search bar to find incidents by keyword. ops0 searches across error summaries, AI analysis text, and notes.
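
The same filtering and keyword search can be reproduced on exported incident data. This is a sketch only: the `Incident` dataclass and `filter_incidents` function are hypothetical, and the search is a plain case-insensitive substring match, which may differ from how ops0 searches.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set


@dataclass
class Incident:
    status: str
    severity: str
    error_summary: str = ""
    analysis: str = ""
    notes: List[str] = field(default_factory=list)


def filter_incidents(
    incidents: Iterable[Incident],
    statuses: Optional[Set[str]] = None,
    severities: Optional[Set[str]] = None,
    keyword: Optional[str] = None,
) -> List[Incident]:
    """Keep incidents matching every filter that was supplied."""
    needle = keyword.lower() if keyword else None
    results = []
    for inc in incidents:
        if statuses and inc.status not in statuses:
            continue
        if severities and inc.severity not in severities:
            continue
        if needle:
            # Search error summary, AI analysis text, and notes together.
            haystack = " ".join([inc.error_summary, inc.analysis, *inc.notes]).lower()
            if needle not in haystack:
                continue
        results.append(inc)
    return results


incidents = [
    Incident("open", "critical", error_summary="BucketAlreadyOwnedByYou"),
    Incident("resolved", "low", notes=["Imported the bucket into state"]),
    Incident("open", "medium", analysis="Plan timed out waiting for provider"),
]
open_bucket = filter_incidents(incidents, statuses={"open"}, keyword="bucket")
```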