Incident Response

How engineering handles on-call pages, debugs incidents and ships rollbacks across PagerDuty, Datadog and GitHub.

First response

Acknowledging a PagerDuty page

When the on-call engineer receives a page in PagerDuty, they open the incident detail page in a browser, click Acknowledge to stop the escalation timer, then add a status update on the incident timeline naming the service they are about to investigate. They keep the PagerDuty tab open in the background and pivot to the linked monitor in a new tab rather than closing PagerDuty.
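
The same acknowledge-and-annotate step can be scripted against the PagerDuty REST API when the responder is already at a terminal. The sketch below is a minimal example, assuming a PD_API_TOKEN environment variable, a placeholder incident ID and responder email, and using an incident note to stand in for the timeline status update.

import os
import requests

PAGERDUTY_API = "https://api.pagerduty.com"
HEADERS = {
    "Authorization": f"Token token={os.environ['PD_API_TOKEN']}",  # placeholder env var
    "From": "oncall@example.com",  # PagerDuty requires the acting user's login email
    "Content-Type": "application/json",
}
incident_id = "PXXXXXX"  # hypothetical incident ID from the page

# Acknowledge: stops the escalation timer, same as the Acknowledge button.
requests.put(
    f"{PAGERDUTY_API}/incidents/{incident_id}",
    headers=HEADERS,
    json={"incident": {"type": "incident_reference", "status": "acknowledged"}},
).raise_for_status()

# Note on the timeline naming the service about to be investigated.
requests.post(
    f"{PAGERDUTY_API}/incidents/{incident_id}/notes",
    headers=HEADERS,
    json={"note": {"content": "Investigating error rate spike on checkout-api."}},
).raise_for_status()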

First response

Escalating a page to the secondary on-call

When the on-call engineer cannot start triage within the first few minutes (already paged on a parallel incident, away from a workstation or judges the affected service to be outside their area), they open the PagerDuty schedule that covers the service to confirm who is genuinely on the secondary slot at the current time, then use Reassign rather than Add Responder so the secondary is paged with a phone call rather than a passive notification.
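
A sketch of the same hand-off done against the PagerDuty API: confirm who currently holds the secondary slot on the schedule, then reassign the incident to them. The schedule ID, incident ID, token variable and email below are all placeholders.

import os
import requests

PAGERDUTY_API = "https://api.pagerduty.com"
HEADERS = {
    "Authorization": f"Token token={os.environ['PD_API_TOKEN']}",
    "From": "oncall@example.com",
    "Content-Type": "application/json",
}
incident_id = "PXXXXXX"            # hypothetical incident ID
secondary_schedule_id = "PSCHED2"  # hypothetical secondary schedule for the service

# Confirm who is genuinely on the secondary slot right now.
oncalls = requests.get(
    f"{PAGERDUTY_API}/oncalls",
    headers=HEADERS,
    params={"schedule_ids[]": secondary_schedule_id},
).json()["oncalls"]
secondary = oncalls[0]["user"]

# Reassign (not Add Responder) so the secondary is paged at full urgency.
requests.put(
    f"{PAGERDUTY_API}/incidents/{incident_id}",
    headers=HEADERS,
    json={"incident": {
        "type": "incident_reference",
        "assignments": [{"assignee": {"id": secondary["id"], "type": "user_reference"}}],
    }},
).raise_for_status()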

Investigation

Triaging a page from the linked Datadog monitor

After acknowledging a PagerDuty page, the engineer follows the Datadog monitor linked from the incident detail page and starts triage on the monitor's detail page. They expand the time range from the default 15 minutes to 1 hour, switch to the Logs Sample tab to read the actual error messages, then toggle the deployment markers overlay before deciding whether the issue is a code change, an upstream failure or a traffic event.
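
The same widened one-hour window of error messages can also be pulled from the Datadog Logs Search API when the engineer prefers a terminal. The sketch below assumes a hypothetical service:checkout-api tag and DD_API_KEY / DD_APP_KEY environment variables.

import os
import requests

DD_HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}

# Pull the last hour of error logs for the paging service, newest first.
resp = requests.post(
    "https://api.datadoghq.com/api/v2/logs/events/search",
    headers=DD_HEADERS,
    json={
        "filter": {"query": "service:checkout-api status:error", "from": "now-1h", "to": "now"},
        "sort": "-timestamp",
        "page": {"limit": 25},
    },
)
for event in resp.json().get("data", []):
    attrs = event["attributes"]
    print(attrs["timestamp"], str(attrs.get("message", ""))[:120])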

Investigation

Correlating a Datadog spike with a recent deploy

The engineer pivots from the Datadog monitor to the affected service's APM page, lines up the spike start time against the deployment markers panel and notes the deployed version tag closest to the spike. They then open the matching service repository in GitHub and go to the Releases page rather than the Pull Requests list, identify the previous and current release tags spanning the spike, then construct or follow the compare URL between them.
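The compare step reduces to a URL between two tags. Below is a small sketch with a hypothetical acme/checkout-api repository and made-up version tags, showing both the browser URL and the equivalent GitHub compare API call.

import os
import requests

owner, repo = "acme", "checkout-api"              # hypothetical repository
previous_tag, current_tag = "v1.42.0", "v1.43.0"  # tags read off the deploy markers

# The compare URL the engineer opens in the browser.
print(f"https://github.com/{owner}/{repo}/compare/{previous_tag}...{current_tag}")

# The equivalent API call; its commits list covers the same range.
resp = requests.get(
    f"https://api.github.com/repos/{owner}/{repo}/compare/{previous_tag}...{current_tag}",
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
             "Accept": "application/vnd.github+json"},
)
for commit in resp.json()["commits"]:
    print(commit["sha"][:8], commit["commit"]["message"].splitlines()[0])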

Investigation

Identifying the suspect PR with GitHub Compare

With the GitHub compare view open between two release tags, the engineer reads the list of merged PRs in the range, prioritises ones with infrastructure or feature flag labels and opens the top two or three suspects in new tabs. They check each PR's labels, the size of the diff and the deploy job history before settling on which PR to revert or hotfix.
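
Reading the PR list out of the compare range can also be scripted. The sketch below maps each commit in the range back to the PR that merged it and floats label-flagged ones to the top; the repository, tags and label names are assumptions, not fixed conventions.

import os
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
           "Accept": "application/vnd.github+json"}
owner, repo = "acme", "checkout-api"                 # hypothetical repository
previous_tag, current_tag = "v1.42.0", "v1.43.0"     # hypothetical release tags
SUSPECT_LABELS = {"infrastructure", "feature-flag"}  # assumed label names

commits = requests.get(
    f"{GITHUB_API}/repos/{owner}/{repo}/compare/{previous_tag}...{current_tag}",
    headers=HEADERS,
).json()["commits"]

suspects = {}
for commit in commits:
    # Map each commit in the range back to the PR that merged it.
    for pr in requests.get(
            f"{GITHUB_API}/repos/{owner}/{repo}/commits/{commit['sha']}/pulls",
            headers=HEADERS).json():
        labels = {label["name"] for label in pr["labels"]}
        suspects[pr["number"]] = (bool(labels & SUSPECT_LABELS), pr["title"], labels)

# Infrastructure / feature-flag PRs float to the top of the suspect list.
for number, (flagged, title, labels) in sorted(suspects.items(), key=lambda kv: not kv[1][0]):
    print("*" if flagged else " ", f"#{number}", title, sorted(labels))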

Remediation

Rolling back by reverting the merge commit

With a suspect PR identified, the engineer uses GitHub's Revert button on the original PR to generate a clean revert PR, prefixes the title with [hotfix] and requests review from the original PR author and one other on-call engineer. They merge only once CI passes, watch the deployment through to completion, and leave a comment on the original PR explaining the revert rather than closing it.
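
The review request and the explanatory comment map onto two GitHub API calls; a sketch with hypothetical PR numbers, reviewer logins and repository follows. The revert PR itself still comes from the Revert button in the UI.

import os
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
           "Accept": "application/vnd.github+json"}
owner, repo = "acme", "checkout-api"  # hypothetical repository
revert_pr, original_pr = 1302, 1297   # hypothetical PR numbers
reviewers = ["original-author", "secondary-oncall"]  # placeholder GitHub logins

# Request review on the revert PR from the original author and another on-call.
requests.post(
    f"{GITHUB_API}/repos/{owner}/{repo}/pulls/{revert_pr}/requested_reviewers",
    headers=HEADERS,
    json={"reviewers": reviewers},
).raise_for_status()

# Explain the revert on the original PR instead of closing it.
requests.post(
    f"{GITHUB_API}/repos/{owner}/{repo}/issues/{original_pr}/comments",
    headers=HEADERS,
    json={"body": f"Reverted in #{revert_pr} during incident PXXXXXX; "
                  "please re-land once the root cause is addressed."},
).raise_for_status()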

Remediation

Shipping a forward fix instead of rolling back

When a clean revert is not possible (a database migration has already run, the suspect PR sits on top of unrelated changes that should not be reverted, or the revert itself fails CI), the engineer branches a small fix from main and names it hotfix/<incident-id>. They make the minimum change needed to restore service, get a thumbs-up from the secondary on-call posted on the PagerDuty status update rather than a formal GitHub review, and merge with the [hotfix] label so the deploy pipeline runs the abridged test suite.
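
One way a deploy pipeline could key the abridged suite off that label is sketched below as a Python pipeline step. The label name ("hotfix"), the PR_NUMBER variable, the repository and the test paths are all assumptions about a hypothetical setup, not a documented pipeline.

import os
import subprocess
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
           "Accept": "application/vnd.github+json"}
owner, repo = "acme", "checkout-api"  # hypothetical repository
pr_number = os.environ["PR_NUMBER"]   # assumed to be exported by the pipeline

labels = {
    label["name"]
    for label in requests.get(
        f"{GITHUB_API}/repos/{owner}/{repo}/issues/{pr_number}/labels",
        headers=HEADERS).json()
}

# Hotfix PRs run only the smoke tests; everything else runs the full suite.
suite = ["pytest", "tests/smoke"] if "hotfix" in labels else ["pytest", "tests"]
raise SystemExit(subprocess.call(suite))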

Verification

Verifying a rollback restored service in Datadog

After the rollback or forward fix has deployed, the engineer returns to the same Datadog monitor and APM service page used during triage, sets the time range to the last 30 minutes and waits at least 5 minutes after the deploy completes before checking that error rate, latency and the SLO burn rate have returned to baseline. They also check the affected service's downstream consumers in the service map view to make sure queued errors have not propagated.
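
The baseline check can be repeated from the Datadog metrics query API over the same 30-minute window. The metric name below is a hypothetical error-rate metric for the affected service, not a fixed convention.

import os
import time
import requests

DD_HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}

now = int(time.time())
resp = requests.get(
    "https://api.datadoghq.com/api/v1/query",
    headers=DD_HEADERS,
    params={
        "from": now - 30 * 60,  # the same last-30-minutes window used in the UI
        "to": now,
        # Hypothetical error-rate metric for the affected service.
        "query": "sum:trace.flask.request.errors{service:checkout-api}.as_rate()",
    },
)
series = resp.json().get("series", [])
points = series[0]["pointlist"] if series else []
print("latest error rate:", points[-1][1] if points else "no data yet")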

Resolution

Resolving an incident and recording the timeline

Once metrics have recovered, the engineer returns to the PagerDuty incident, posts a final status update naming the cause and the remediation, attaches the relevant GitHub PR numbers and the Datadog dashboard URL in the resolution note, tags the incident with the affected service and a root cause classification, then clicks Resolve. They wait at least 15 minutes after metric recovery before resolving, to allow time for the metric to flap back if the fix is not actually durable.
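
The resolution note and the Resolve click map onto two PagerDuty API calls. The sketch below uses a placeholder incident ID, PR references and dashboard URL, and is meant to run only after the 15-minute soak.

import os
import requests

PAGERDUTY_API = "https://api.pagerduty.com"
HEADERS = {
    "Authorization": f"Token token={os.environ['PD_API_TOKEN']}",
    "From": "oncall@example.com",
    "Content-Type": "application/json",
}
incident_id = "PXXXXXX"  # hypothetical incident ID

# Final note: cause, remediation, PR references and the dashboard link.
requests.post(
    f"{PAGERDUTY_API}/incidents/{incident_id}/notes",
    headers=HEADERS,
    json={"note": {"content":
        "Cause: regression in acme/checkout-api#1297. Remediation: reverted in #1302. "
        "Dashboard: https://app.datadoghq.com/dashboard/abc-123 (placeholder)."}},
).raise_for_status()

# Resolve only after metrics have held at baseline for at least 15 minutes.
requests.put(
    f"{PAGERDUTY_API}/incidents/{incident_id}",
    headers=HEADERS,
    json={"incident": {"type": "incident_reference", "status": "resolved"}},
).raise_for_status()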

Investigation

Silencing a noisy Datadog monitor during investigation

When a known issue is firing repeated pages on the same Datadog monitor and a fix is already in flight, the engineer mutes the specific monitor scope (rather than editing the monitor configuration) for a fixed 30 minute window, adding a mute reason that links the active PagerDuty incident. They never mute indefinitely and they never widen the mute scope to the whole monitor.
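
A scoped 30-minute mute can be expressed as a Datadog downtime limited to the one monitor and scope, which also carries the mute reason. The sketch below assumes the v1 downtimes API, a made-up monitor ID and scope tag, and a placeholder incident URL in the reason.

import os
import time
import requests

DD_HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}

monitor_id = 1234567  # hypothetical monitor ID
now = int(time.time())

# Fixed 30-minute window, one scope only, with the incident linked in the reason.
requests.post(
    "https://api.datadoghq.com/api/v1/downtime",
    headers=DD_HEADERS,
    json={
        "monitor_id": monitor_id,
        "scope": ["service:checkout-api"],  # never widened to the whole monitor
        "start": now,
        "end": now + 30 * 60,               # never indefinite
        "message": "Known issue, fix in flight: https://acme.pagerduty.com/incidents/PXXXXXX",
    },
).raise_for_status()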
