Services and SLOs
Services connect monitors, owners, dependencies, incidents, and SLOs so reliability is organized around business systems rather than isolated checks.
Service catalog
- Create a service for each product surface, API, workflow, or internal platform dependency.
- Assign criticality tiers such as customer critical, important, standard, or internal.
- Assign an owner team so responders know who is responsible.
- Link monitors that represent the service's availability and correctness.
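As a rough illustration of the catalog fields above, the sketch below models a service record and reports which fields are still missing. The `Service` and `Tier` types, the field names, and `catalog_gaps` are hypothetical and used only for illustration. They are not the AImonitoring data model or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    # Criticality tiers named in the list above.
    CUSTOMER_CRITICAL = "customer critical"
    IMPORTANT = "important"
    STANDARD = "standard"
    INTERNAL = "internal"


@dataclass
class Service:
    # Hypothetical record shape, assumed for this example only.
    name: str
    tier: Tier
    owner_team: str = ""
    monitor_ids: list[str] = field(default_factory=list)


def catalog_gaps(service: Service) -> list[str]:
    """Return the catalog fields that are still missing for a service."""
    gaps = []
    if not service.owner_team:
        gaps.append("owner team")
    if not service.monitor_ids:
        gaps.append("linked monitors")
    return gaps


checkout = Service("Checkout API", Tier.CUSTOMER_CRITICAL, owner_team="payments")
print(catalog_gaps(checkout))
```

A check like this could run in review to catch services that responders cannot route or that have no monitors backing their availability.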
Dependencies and blast radius
- Add an upstream dependency for each service this service needs to stay healthy.
- Set the dependency type to sync API, async queue, database, third-party, or internal service.
- Use criticality to separate hard blockers from lower-risk dependencies.
- Review downstream impact to see which services may be affected by a failing dependency.
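Downstream impact can be thought of as a walk over the dependency graph in the reverse direction. The sketch below is one minimal way to compute it, assuming dependencies are available as a plain mapping from each service to its upstream services. The service names and the `downstream_impact` helper are hypothetical, and this is not necessarily how the product computes blast radius.

```python
from collections import deque

# Hypothetical dependency map: service -> upstream services it needs.
UPSTREAM = {
    "checkout": ["payments-api", "cart-db"],
    "payments-api": ["card-processor"],
    "order-emails": ["checkout"],
    "cart-db": [],
    "card-processor": [],
}


def downstream_impact(upstream, failing):
    """Return every service that depends, directly or transitively, on `failing`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in upstream.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first search outward from the failing dependency.
    affected, queue = set(), deque([failing])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


print(sorted(downstream_impact(UPSTREAM, "card-processor")))
```

In this example a failing `card-processor` reaches `payments-api` directly and `checkout` and `order-emails` transitively, which is the kind of list a responder would want during triage.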
SLO configuration
- Name the SLO after the user-facing reliability promise, such as Checkout availability.
- Set a target percentage that reflects the business promise.
- Set latency thresholds when slow responses should count against reliability.
- Set a window in days for the rolling measurement period.
- Review budget consumed and budget remaining to understand reliability risk.
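The relationship between target, latency threshold, and budget can be sketched with simple arithmetic. The example below assumes an event-based SLO in which an event counts as bad if it failed or, when a latency threshold is set, was slower than that threshold. The function names and the event tuple shape are hypothetical, and the product may calculate budget differently.

```python
def count_bad(events, latency_threshold_ms=None):
    """Count events that failed or, if a threshold is set, were too slow.

    `events` is an iterable of (ok, latency_ms) pairs (assumed shape).
    """
    bad = 0
    for ok, latency_ms in events:
        too_slow = latency_threshold_ms is not None and latency_ms > latency_threshold_ms
        if not ok or too_slow:
            bad += 1
    return bad


def error_budget(target_pct, total_events, bad_events):
    """Return (allowed_bad, consumed_fraction, remaining_fraction) for one window."""
    # The budget is the share of events the target allows to be bad.
    allowed_bad = total_events * (100.0 - target_pct) / 100.0
    if allowed_bad == 0:
        raise ValueError("no error budget: target is 100% or the window has no events")
    consumed = bad_events / allowed_bad
    return allowed_bad, consumed, 1.0 - consumed


# A 99.9% target over 1,000,000 events allows roughly 1,000 bad events.
allowed, consumed, remaining = error_budget(99.9, 1_000_000, 250)
print(f"allowed={allowed:.0f} consumed={consumed:.0%} remaining={remaining:.0%}")
```

A remaining fraction below zero means the window has already exceeded its budget.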
Burn-rate alerts
- Telemetry-backed burn alerts detect fast error burn, slow error burn, and latency burn.
- Burn alerts can open service incidents when reliability is degrading.
- Burn alerts are most reliable when services receive consistent telemetry and linked monitor data.
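Burn rate is commonly defined as the observed bad-event rate divided by the rate the SLO target allows, so a burn rate of 1 would use up the budget exactly at the end of the window. The sketch below shows that calculation and a simple fast versus slow classification. The thresholds of 14.4 (short window) and 6 (long window) are illustrative defaults often used with a 30-day window in SRE practice. They are assumptions here, not the thresholds AImonitoring uses, and the function names are hypothetical.

```python
def burn_rate(bad_events, total_events, target_pct):
    """Return how many times faster than allowed the budget is being spent."""
    if total_events == 0:
        return 0.0  # No traffic in the window, so nothing was burned.
    error_rate = bad_events / total_events
    budget_rate = (100.0 - target_pct) / 100.0
    return error_rate / budget_rate


def classify_burn(short_window_rate, long_window_rate,
                  fast_threshold=14.4, slow_threshold=6.0):
    """Classify a burn as 'fast', 'slow', or 'none' (thresholds are assumptions)."""
    if short_window_rate >= fast_threshold:
        return "fast"
    if long_window_rate >= slow_threshold:
        return "slow"
    return "none"


# 20 bad events out of 1,000 against a 99.9% target burns about 20x too fast.
short = burn_rate(20, 1000, 99.9)
long_ = burn_rate(30, 10_000, 99.9)
print(classify_burn(short, long_))
```

The same calculation applies to latency burn if bad events are counted as responses slower than the SLO's latency threshold.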
Related documentation
Monitors
Create HTTP, TCP, ping, heartbeat, and AI-agent synthetic monitors with thresholds and regions.
Telemetry
Ingest OTLP JSON logs, metrics, and traces, and inspect trace detail inside AImonitoring.
Incidents
Acknowledge, investigate, route, resolve, and review service incidents.
Team and access management
Invite users, assign organization roles, and manage service team membership.