Observability analytics
Observability analytics turns raw monitor, telemetry, deployment, SLO, and incident signals into prioritized operational risk views.
What analytics shows
- Average service health score across modeled services.
- Unhealthy services based on incidents, SEV1 impact, SLO burn, telemetry error rate, and p95 latency.
- Breached SLO burn alerts and maximum burn rate per service.
- Open incident counts, SEV1 counts, resolved incident counts, and mean time to resolve.
- Deployment correlations, surfaced when GitHub events occur close to an incident's start time.
- Root-cause groups from correlated incidents and dependency context.
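To make the incident and burn-rate figures above concrete, here is a minimal sketch of how such numbers could be derived. The `Incident` record, its field names, and both helper functions are hypothetical, not AImonitoring's schema or API. The burn-rate formula is the common SRE definition (observed error rate divided by the error budget), which may differ from how the product calculates it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record shape, used only for illustration.
    service: str
    severity: str                            # e.g. "SEV1", "SEV2"
    started_at: datetime
    resolved_at: Optional[datetime] = None   # None while the incident is open


def incident_summary(incidents: list[Incident]) -> dict:
    """Count open, SEV1, and resolved incidents and compute mean time to resolve."""
    resolved = [i for i in incidents if i.resolved_at is not None]
    durations = [i.resolved_at - i.started_at for i in resolved]
    # Mean time to resolve is only defined when something has been resolved.
    mttr = sum(durations, timedelta()) / len(durations) if durations else None
    return {
        "open": len(incidents) - len(resolved),
        # Assumption: SEV1 count covers open and resolved incidents alike.
        "sev1": sum(1 for i in incidents if i.severity == "SEV1"),
        "resolved": len(resolved),
        "mttr": mttr,
    }


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Burn rate = observed error rate / error budget, where budget = 1 - SLO target."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget
```

Under this definition, a service with a 99.9% target and a 0.2% observed error rate burns its budget at roughly twice the sustainable pace.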
How to use it
- Review the lowest-scoring services first during operational planning.
- Use burn signals to decide whether reliability work should interrupt feature work.
- Use deployment correlations to check whether a recent change may have triggered an incident.
- Use root-cause groups to understand whether multiple incidents are symptoms of one upstream issue.
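The deployment-correlation check described above can be sketched as a simple time-window match. The `Deployment` record, the `correlated_deployments` helper, and the 30-minute default window are all assumptions made for illustration; the source does not say how close an event must be to count as "near". The sketch also only considers deployments that happened before the incident started, on the reasoning that a triggering change has to precede the incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    # Hypothetical record shape, used only for illustration.
    repo: str
    deployed_at: datetime


def correlated_deployments(
    incident_start: datetime,
    deployments: list[Deployment],
    window: timedelta = timedelta(minutes=30),  # assumed window, not a product default
) -> list[Deployment]:
    """Return deployments made within `window` before the incident start, most recent first."""
    candidates = [
        d for d in deployments
        if timedelta(0) <= incident_start - d.deployed_at <= window
    ]
    return sorted(candidates, key=lambda d: d.deployed_at, reverse=True)
```

A match here is only a lead to investigate: closeness in time suggests a possible trigger but does not establish that the change caused the incident.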
Data required
- Services should be modeled with owner teams and tiers.
- Telemetry improves error-rate and latency scoring.
- SLO alert states improve burn-rate prioritization.
- GitHub repository mappings and verified webhooks improve deployment correlation.
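As a rough way to audit the prerequisites listed above, the following sketch reports which inputs a service is missing. The `ServiceRecord` fields and the `analytics_gaps` helper are hypothetical names chosen for this example; they do not mirror AImonitoring's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceRecord:
    # Hypothetical fields mirroring the data requirements listed above.
    name: str
    owner_team: Optional[str] = None
    tier: Optional[str] = None
    has_telemetry: bool = False
    has_slo_alerts: bool = False
    github_repo: Optional[str] = None
    webhook_verified: bool = False


def analytics_gaps(service: ServiceRecord) -> list[str]:
    """List the inputs this service is missing, in the order the requirements are listed."""
    gaps = []
    if not service.owner_team:
        gaps.append("owner team")
    if not service.tier:
        gaps.append("tier")
    if not service.has_telemetry:
        gaps.append("telemetry")
    if not service.has_slo_alerts:
        gaps.append("SLO alert states")
    if not service.github_repo:
        gaps.append("GitHub repository mapping")
    if not service.webhook_verified:
        gaps.append("verified webhook")
    return gaps
```

An empty result means the service supplies every input named in this section; each listed gap corresponds to a view that would be less informative for that service.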
Related documentation
Services and SLOs
Model owned services, link monitors, define dependencies, and track service-level objectives.
Telemetry
Ingest OTLP JSON logs, metrics, and traces, and inspect trace detail inside AImonitoring.
Integrations
Connect AImonitoring to incident response, workflow, deployment, telemetry, and automation providers.
Incidents
Acknowledge, investigate, route, resolve, and review service incidents.