Incident response
Incidents
Incidents collect monitor failures, telemetry SLO breaches, dependency context, responder activity, and post-incident follow-up in one workflow.
Incident lifecycle
- Open: AImonitoring detects a confirmed failure or reliability breach.
- Acknowledged: a responder accepts ownership and starts investigation.
- Notes: responders add timeline context, findings, mitigation steps, and decisions.
- Resolved: a responder resolves the incident manually, or it closes automatically once recovery conditions are met.
- Review: teams create a post-incident review with timeline, impact, root cause, and action items.
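
The lifecycle above can be read as a small state machine. The sketch below is illustrative only: `IncidentState`, `ALLOWED_TRANSITIONS`, and `advance` are hypothetical names, not part of AImonitoring's API, and the transition rules are assumptions inferred from the list. Notes are treated as an activity that can happen in any state, not as a state of their own.

```python
from enum import Enum


class IncidentState(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"
    REVIEW = "review"


# Assumed transitions. OPEN -> RESOLVED models automatic closure after
# recovery before anyone acknowledges; the real product may allow other paths.
ALLOWED_TRANSITIONS = {
    IncidentState.OPEN: {IncidentState.ACKNOWLEDGED, IncidentState.RESOLVED},
    IncidentState.ACKNOWLEDGED: {IncidentState.RESOLVED},
    IncidentState.RESOLVED: {IncidentState.REVIEW},
    IncidentState.REVIEW: set(),
}


def advance(current: IncidentState, target: IncidentState) -> IncidentState:
    """Return the target state if the move is allowed, else raise ValueError."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Keeping the rules in one table makes it easy to see which moves are permitted and to adjust them if the actual workflow differs.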
Incident context
- Affected service, linked monitors, and current service health.
- Correlation group and related incidents where available.
- Dependency context for upstream causes and downstream impact.
- SLO burn alerts and telemetry context when telemetry triggered the incident.
- Plain-language summaries to speed initial triage.
- Incident command fields for severity, commander assignment, and communications channel.
- Recent deployment and GitHub webhook context when a mapped repository changes near incident start.
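
As one way to picture how deployment context might be matched to an incident, the sketch below selects deployments that landed shortly before the incident started. The `Deployment` type, the `deployments_near_start` function, and the 30-minute lookback are all assumptions made for illustration; the documentation does not state how "near incident start" is defined.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deployment:
    repository: str        # hypothetical: name of the mapped repository
    deployed_at: datetime  # hypothetical: time the change was deployed


def deployments_near_start(
    incident_start: datetime,
    deployments: list,
    lookback: timedelta = timedelta(minutes=30),  # assumed window
) -> list:
    """Return deployments within `lookback` before the incident start."""
    earliest = incident_start - lookback
    return [d for d in deployments if earliest <= d.deployed_at <= incident_start]
```

For example, with an incident starting at 12:00, a deployment at 11:45 would be returned, while deployments at 10:00 or 12:05 would not.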
Responder actions
- Acknowledge an incident to stop secondary escalation logic from treating it as unattended.
- Add notes for investigation evidence and decisions.
- Resolve incidents only after the customer-facing impact has ended.
- Update incident command so responders can see the current commander, severity, and bridge or channel.
- Create post-incident action items for prevention, detection, response, or communication gaps.
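
The effect of acknowledgement on escalation can be sketched as a simple check. This is a minimal illustration under stated assumptions: `needs_secondary_escalation` and the 15-minute `unattended_after` threshold are hypothetical, and the actual escalation timing comes from the configured escalation policy.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_secondary_escalation(
    opened_at: datetime,
    acknowledged_at: Optional[datetime],
    now: datetime,
    unattended_after: timedelta = timedelta(minutes=15),  # assumed threshold
) -> bool:
    """Return True if the incident is unacknowledged past the threshold."""
    if acknowledged_at is not None:
        # An acknowledged incident has an owner, so it is not unattended.
        return False
    return now - opened_at >= unattended_after
```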
AI-assisted reviews
- Post-incident reviews can generate an AI-assisted draft from timeline, dependency, SLO, deployment, and correlation evidence.
- Drafts include summary, impact, probable root cause, what went well, what could improve, and suggested action items.
- If AI is not configured, AImonitoring falls back to deterministic evidence-based review text so the workflow still works.
- Published reviews remain controlled by responders; AI drafts do not publish automatically.
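
The fallback behavior described above might look roughly like the following. Every name here (`ReviewDraft`, `deterministic_summary`, `build_draft`, and the `ai_generate` callable) is hypothetical and stands in for whatever AImonitoring does internally; the point is only that a draft is always produced and is never published automatically.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReviewDraft:
    body: str
    source: str              # "ai" or "deterministic"
    published: bool = False  # drafts always start unpublished


def deterministic_summary(evidence: dict) -> str:
    """Build review text directly from evidence, in a stable order."""
    lines = []
    for category in sorted(evidence):
        for item in evidence[category]:
            lines.append(f"{category}: {item}")
    return "\n".join(lines)


def build_draft(
    evidence: dict,
    ai_generate: Optional[Callable[[dict], str]] = None,
) -> ReviewDraft:
    """Use the AI generator when configured; otherwise fall back."""
    if ai_generate is None:
        return ReviewDraft(body=deterministic_summary(evidence), source="deterministic")
    return ReviewDraft(body=ai_generate(evidence), source="ai")
```

In both branches the draft stays unpublished, so a responder has to publish it explicitly.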
Auditability
- Acknowledgements, notes, manual resolution, review creation, review publishing, and action-item updates are audit logged.
- Incident event timelines are kept with the incident record.
- Alert delivery logs show notification attempts and outcomes.
Related documentation
- Routing and maintenance: create escalation policies, on-call schedules, temporary overrides, secondary escalation, and planned maintenance windows.
- Services and SLOs: model owned services, link monitors, define dependencies, and track service-level objectives.
- Audit log: review events for access, configuration, API keys, incidents, post-incident reviews, and team management.