
Verifying a rollback restored service in Datadog

After the rollback or forward fix has deployed, the engineer waits five minutes and then watches the same Datadog monitor for baseline error rate, latency and SLO burn. They also check the service map for downstream consumers still affected by queued errors.

Tags: datadog, verification, slo-burn-rate, service-map, rollback
Standard operating procedure
Step-by-step instructions to reproduce this pattern.
Step 1 (Datadog): Reopen the original monitor detail page from the PagerDuty incident link.

Reusing the PagerDuty link rather than navigating fresh keeps the time range and scope identical to the triage step, which is what makes a true before-and-after comparison possible. A different scope or filter set produces an apples-to-oranges comparison that has tripped the team up before.

Expected: The Datadog monitor detail page opens with the same scope and a freshly extended time range.

Step 2 (Datadog): Set the time range to 'Last 30m'.

30 minutes captures the spike, the rollback deploy and the recovery in a single view. Shorter ranges hide the spike and longer ranges flatten the recovery to the point that small lingering issues become invisible.

Expected: The metric graph shows the spike in the left third, the deploy marker in the middle and the recovery in the right third.
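If you want to pull the same data the graph shows rather than eyeball it, Datadog's v1 metrics query API takes a query string plus a from/to window in epoch seconds. A minimal sketch in Python, assuming a trace-based error-rate query and DD_API_KEY / DD_APP_KEY environment variables; the metric names and the service:api-gateway tag are placeholders for whatever query the monitor actually evaluates.

    import os
    import time

    import requests

    # Placeholder query: errors divided by hits for the affected service.
    ERROR_RATE_QUERY = (
        "sum:trace.http.request.errors{service:api-gateway}.as_rate()"
        " / sum:trace.http.request.hits{service:api-gateway}.as_rate()"
    )

    def fetch_error_rate_points(query, window_s=30 * 60):
        """Return [timestamp_ms, value] points for the last window_s seconds."""
        now = int(time.time())
        resp = requests.get(
            "https://api.datadoghq.com/api/v1/query",
            headers={
                "DD-API-KEY": os.environ["DD_API_KEY"],
                "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
            },
            params={"from": now - window_s, "to": now, "query": query},
            timeout=30,
        )
        resp.raise_for_status()
        series = resp.json().get("series", [])
        return series[0]["pointlist"] if series else []

    points = fetch_error_rate_points(ERROR_RATE_QUERY)
    print(f"{len(points)} datapoints over the last 30 minutes")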

Step 3 (Datadog): Wait at least 5 minutes after the deploy completion time before declaring recovery.

The metric pipeline has roughly a 1 to 2 minute aggregation lag, the rolling deploy itself takes 1 to 3 minutes to reach all instances, and CDN cache TTLs add further delay. Declaring success at minute 2 is regularly wrong. The waiting period feels long during an incident, but it is the cheapest insurance against a re-page.

Expected: At least 5 minutes have elapsed between the deploy success and your verification check.
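The waiting arithmetic is simple but easy to fumble mid-incident. A small sketch that turns the deploy completion timestamp into an earliest safe verification time; the timestamp is a made-up example, take the real one from your deploy tool.

    from datetime import datetime, timedelta, timezone

    # Made-up deploy completion time; read the real one from your deploy tool.
    deploy_completed_at = datetime(2026, 5, 5, 18, 10, tzinfo=timezone.utc)

    # 5 minutes covers the 1-2 min aggregation lag plus the 1-3 min rolling deploy.
    earliest_check = deploy_completed_at + timedelta(minutes=5)
    now = datetime.now(timezone.utc)

    if now < earliest_check:
        remaining = (earliest_check - now).total_seconds()
        print(f"Too early: wait another {int(remaining // 60)} min {int(remaining % 60)} s")
    else:
        print("At least 5 minutes since deploy completion; safe to run the verification checks")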

Step 4 (Datadog): Confirm the error rate is below the monitor's alert threshold and trending down or flat.

Below threshold is necessary but not sufficient on its own: the rate must also be trending down or flat. An error rate that is below threshold but rising is a partial recovery and likely to re-breach. Treat that case as not yet recovered.

Expected: The error rate is below the threshold line and the slope is non-positive over the last 5 minutes.
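To make "below threshold with a non-positive slope" concrete, here is a sketch that applies both checks to the pointlist returned by the query sketch in step 2. The 5-minute window and ordinary least-squares slope are one reasonable reading of the rule, not the monitor's own evaluation, and the threshold value is whatever the monitor alerts on.

    def recovered(points, threshold):
        """points: [timestamp_ms, error_rate] pairs; threshold: the monitor's alert threshold."""
        if not points:
            return False

        # Keep only the last 5 minutes of non-null datapoints.
        cutoff_ms = max(t for t, _ in points) - 5 * 60 * 1000
        recent = [(t, v) for t, v in points if t >= cutoff_ms and v is not None]
        if len(recent) < 2:
            return False  # not enough data to judge a trend

        below_threshold = all(v < threshold for _, v in recent)

        # Ordinary least-squares slope of error rate against time.
        n = len(recent)
        mean_t = sum(t for t, _ in recent) / n
        mean_v = sum(v for _, v in recent) / n
        num = sum((t - mean_t) * (v - mean_v) for t, v in recent)
        den = sum((t - mean_t) ** 2 for t, _ in recent)
        slope = num / den if den else 0.0

        return below_threshold and slope <= 0

    # Example with an assumed 2% alert threshold.
    print(recovered(points, threshold=0.02))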

Step 5 (Datadog): Open the linked SLO from the service overview and check the burn rate widget.

The instantaneous burn rate is what matters for closing the incident. If it is still above 1, the service is consuming error budget faster than the SLO target allows, even if the raw error rate looks fine. Wait for the burn rate to drop below 1 before resolving the page.

Expected: The SLO burn rate widget shows a current value below 1 and trending toward baseline.
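The burn rate widget boils down to a simple ratio: how fast errors are consuming budget relative to what the SLO target allows. A sketch of that formula, using the 0.4 and 1.2 values that appear in the supporting actions below as worked examples; the SLO target and error fractions are illustrative.

    def burn_rate(observed_error_fraction, slo_target):
        """Burn rate = observed error fraction / error budget fraction.
        Below 1 means error budget is being consumed slower than the SLO allows."""
        error_budget = 1.0 - slo_target
        return observed_error_fraction / error_budget

    # A 99.9% availability SLO leaves a 0.1% error budget.
    print(burn_rate(0.0004, 0.999))  # ~0.4 -> safe to resolve
    print(burn_rate(0.0012, 0.999))  # ~1.2 -> hold off resolving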

Step 6 (Datadog): Open the service map view and inspect the immediate downstream consumers.

Click the affected service in the map and follow the outbound edges. Each downstream consumer should show its own error rate and latency. If a downstream is still red, the rollback fixed the source but the consumer is still draining queued errors and may need a separate restart or queue purge. Note any affected downstream so you can decide whether to declare a partial or a full recovery.

Expected: All immediate downstream services show error rates back at baseline, or any still affected are noted for follow up.
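If you prefer to script the downstream sweep, the same fetch_error_rate_points helper from the step 2 sketch works here. The service names are taken from the supporting actions below and are placeholders, as is the assumed baseline error rate.

    BASELINE_ERROR_RATE = 0.001  # assumed baseline; use your service's normal rate

    # Placeholder list of immediate downstream consumers read off the service map.
    DOWNSTREAM_SERVICES = ["checkout-service", "payments-service", "billing-worker"]

    still_affected = []
    for svc in DOWNSTREAM_SERVICES:
        query = (
            f"sum:trace.http.request.errors{{service:{svc}}}.as_rate()"
            f" / sum:trace.http.request.hits{{service:{svc}}}.as_rate()"
        )
        points = fetch_error_rate_points(query, window_s=5 * 60)
        latest = next((v for _, v in reversed(points) if v is not None), None)
        if latest is None or latest > BASELINE_ERROR_RATE:
            still_affected.append(svc)

    print("Needs follow-up:", still_affected or "none - full recovery")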

Related patterns
How this pattern connects to other patterns in the library.
Supporting actions
Actions that provide evidence for this pattern.
Verified api-gateway recovery, error rate flat, SLO burn 0.4
Checked checkout-service downstream after rollback, all green
Waited 6 min post-deploy before declaring billing-worker recovered
Spotted lingering errors on payments-service, downstream of api-gateway
SLO burn still 1.2 on checkout-service, held off resolving for 4 more min
Metadata
Timestamps and identifiers.
Evidence: Observed 27 times across 5 connections
Applications: Datadog
First seen: 2 Feb 2026, 14:01
Last seen: 5 May 2026, 18:32