Webhooks Explained: HMAC Signing, Idempotency, and Retries

A payment processor fires a charge.succeeded webhook to your server. Your server is mid-deploy, returns 502. The processor retries 30 seconds later — but by then your replicas are healthy, you process the charge, and respond 200. Three days later you discover you also processed the same charge during the original delivery (it actually succeeded; only the response was lost), so you've credited the customer twice. The customer is happy; your accountant is not.

Webhooks are the nervous system of modern integrations — but most tutorials show only the happy path. The interesting parts are signing, idempotency, and retries, which is where production deployments either survive or quietly corrupt data.

What a Webhook Actually Is

A webhook is an HTTP POST request that one system sends to another whenever something interesting happens. The receiver exposes a public URL; the sender hits it with a JSON payload describing the event.

POST /webhooks/stripe HTTP/1.1
Host: api.example.com
Content-Type: application/json
Stripe-Signature: t=1715200800,v1=a3f2c1...

{
  "id": "evt_1Nxxxxxx",
  "type": "charge.succeeded",
  "data": { "object": { "amount": 5000, "currency": "usd" } }
}

The semantics are deceptively simple: "we'll tell you when stuff happens, you handle it." The complexity hides in the questions: what if the receiver is down, what if the request is forged, what if it's a replay of an old event, what if the receiver processes it twice? GitHub's webhook documentation is a good baseline for what a complete webhook system looks like in production.

Webhooks vs Polling

The classic alternative to webhooks is polling: the client repeatedly asks the server "anything new?" Polling is simpler to reason about (the client controls timing, retry logic is trivial, no public endpoint needed) but expensive in two ways.

First, resource cost: polling 100,000 clients every 30 seconds is 200,000 requests per minute, mostly returning "nothing new." Webhooks scale with actual events, not with the number of subscribers. For a service that has occasional events (a payment provider, a CI system), this is orders of magnitude cheaper.

Second, latency: polling delivers events with up to one polling interval of delay. A 30-second poll means events arrive 0-30 seconds late. Webhooks arrive within seconds.
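
The arithmetic behind both costs is easy to check. A minimal sketch, assuming evenly spaced polls and events that arrive at random moments; the function names are illustrative and not taken from any library:

```javascript
// Requests per minute generated by `clients` pollers, each polling every `intervalSeconds`.
function pollingRequestsPerMinute(clients, intervalSeconds) {
  return clients * (60 / intervalSeconds);
}

// An event lands somewhere between two polls, so it waits between 0 and one
// full interval; with uniformly random arrivals the average wait is half of it.
function pollingDelaySeconds(intervalSeconds) {
  return { worst: intervalSeconds, average: intervalSeconds / 2 };
}

console.log(pollingRequestsPerMinute(100000, 30)); // 200000
console.log(pollingDelaySeconds(30)); // { worst: 30, average: 15 }
```

Shortening the interval improves latency and raises request volume in the same proportion, which is the trade-off webhooks avoid.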

The tradeoff: polling clients don't need a public endpoint and don't have to handle malicious traffic. Webhooks require both. For a mobile app that goes offline, polling (or a push channel like SSE/WebSockets) is often the better choice. See the WebSockets vs REST vs GraphQL post for the full comparison.

Anatomy of a Webhook Delivery

A well-designed webhook delivery carries metadata in headers, not just the payload:

POST /webhooks/orders HTTP/1.1
Content-Type: application/json
X-Webhook-Id: evt_1NxxxxFB1Yyyyyyy
X-Webhook-Timestamp: 1715200800
X-Webhook-Event: order.created
X-Webhook-Signature: sha256=a3f2c1...
X-Webhook-Delivery-Attempt: 1

Each header earns its keep: Webhook-Id is a unique identifier the receiver uses for idempotency; Timestamp lets the receiver reject suspiciously old deliveries (replay indicator); Event type routes to the right handler without parsing the body; Signature is the HMAC verifying authenticity; Delivery-Attempt signals retries so the receiver can be extra careful.
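
As a rough sketch of how a receiver might use these headers, the snippet below reads the metadata and routes on the event type before touching the body. The header names follow the example delivery above; `parseDeliveryHeaders`, `route`, and the handler table are hypothetical names for illustration.

```javascript
// Pull delivery metadata out of the headers.
// Node lowercases incoming header names, so lookups use lowercase keys.
function parseDeliveryHeaders(headers) {
  const attempt = Number(headers['x-webhook-delivery-attempt'] ?? 1);
  return {
    id: headers['x-webhook-id'],
    timestamp: Number(headers['x-webhook-timestamp']),
    eventType: headers['x-webhook-event'],
    signature: headers['x-webhook-signature'],
    attempt,
    isRetry: attempt > 1,
  };
}

// Hypothetical handlers keyed by event type. A Map avoids accidental matches
// on inherited object properties such as "constructor".
const handlers = new Map([
  ['order.created', (meta) => `created:${meta.id}`],
]);

// Route on the header alone; the body has not been parsed yet.
function route(headers) {
  const meta = parseDeliveryHeaders(headers);
  const handler = handlers.get(meta.eventType);
  return handler ? handler(meta) : 'ignored';
}

console.log(route({ 'x-webhook-id': 'evt_1', 'x-webhook-event': 'order.created' })); // created:evt_1
```

In a real receiver the signature check would run before routing; it is left out here to keep the focus on the metadata.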

When debugging integrations, the HTTP Request Builder lets you craft test deliveries without waiting for real events, and the JSON Formatter makes large nested webhook bodies (Stripe events can be 5KB+) much easier to scan.

HMAC Signing — How to Implement It

Anyone with your webhook URL can POST to it. Without verification, an attacker could fabricate a charge.succeeded event and trick you into shipping product. HMAC signatures solve this.

The sender holds a shared secret. For each delivery, it computes:

signed_payload = timestamp + "." + raw_body
signature = HMAC_SHA256(secret, signed_payload)

The signature is sent in a header. The receiver recomputes it and compares.
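
The sender side of that computation might look like the sketch below. It assumes the `sha256=` prefix and the header names from the example delivery earlier; `signDelivery` and the secret value are hypothetical.

```javascript
const crypto = require('crypto');

// Sender side: compute the signature header value for one delivery.
// `rawBody` must be the exact string (or Buffer) that goes on the wire.
function signDelivery(secret, timestamp, rawBody) {
  const digest = crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.`) // "<timestamp>." prefix
    .update(rawBody) // followed by the raw body bytes
    .digest('hex');
  return `sha256=${digest}`;
}

// Serialize once, then sign and send that same string.
const body = JSON.stringify({ id: 'evt_123', type: 'order.created' });
const headers = {
  'X-Webhook-Timestamp': '1715200800',
  'X-Webhook-Signature': signDelivery('whsec_example', 1715200800, body),
};
console.log(headers);
```

Serializing the body once and signing that exact string avoids any mismatch between what was signed and what was sent.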

Here's a verification function in Node:

import crypto from 'crypto';

function verifyWebhook(req, secret) {
  const timestamp = req.headers['x-webhook-timestamp'];
  const signature = req.headers['x-webhook-signature'];
  const rawBody = req.rawBody; // Buffer of the unparsed request body

  // A delivery with a missing header or body can never verify
  if (!timestamp || !signature || !rawBody) return false;

  // Reject non-numeric timestamps; otherwise the age would be NaN and
  // the comparison below would silently pass
  const sentAt = Number(timestamp);
  if (!Number.isInteger(sentAt)) return false;

  // Reject deliveries older than 5 minutes (replay protection)
  const age = Math.floor(Date.now() / 1000) - sentAt;
  if (age > 300) return false;

  // Hash the exact bytes received: "<timestamp>." followed by the raw body
  const expected = crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.`)
    .update(rawBody)
    .digest('hex');

  // Constant-time comparison to prevent timing attacks
  const sigBuf = Buffer.from(signature.replace(/^sha256=/, ''));
  const expBuf = Buffer.from(expected);
  if (sigBuf.length !== expBuf.length) return false;
  return crypto.timingSafeEqual(sigBuf, expBuf);
}

Three things matter here that are commonly screwed up:

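Verify against the raw body, not the parsed JSON. Re-serializing the parsed object can change whitespace or key order, breaking the signature. Express does not populate req.rawBody on its own; capture the bytes yourself, for example with the verify callback of express.json() or by mounting express.raw() on the webhook route.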
Include the timestamp in the signed payload. This binds the signature to a specific time; an attacker can't replay a captured request hours later.
Use constant-time comparison (crypto.timingSafeEqual, not ===). String equality leaks information through response timing — an attacker can guess byte-by-byte from the latency. The Hash Generator is useful for sanity-checking your HMAC computation against a known input.
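
The first point is easy to demonstrate. The sketch below uses a made-up payload and secret to show that parsing and re-serializing changes the bytes, and therefore the HMAC. It also includes one way to collect the raw body in a plain Node http server, as a framework-free alternative to the Express approach.

```javascript
const crypto = require('crypto');

function hmacHex(secret, data) {
  return crypto.createHmac('sha256', secret).update(data).digest('hex');
}

// Collect the raw request body as a Buffer; parse JSON only after verifying.
// Works with any readable stream, such as http.IncomingMessage.
function readRawBody(req) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    req.on('data', (chunk) => chunks.push(chunk));
    req.on('end', () => resolve(Buffer.concat(chunks)));
    req.on('error', reject);
  });
}

// The sender signed these exact bytes (note the spaces after the colons).
const rawBody = Buffer.from('{"type": "charge.succeeded", "amount": 5000}');
const secret = 'whsec_example'; // hypothetical secret

// Parsing and re-serializing drops the whitespace, so the bytes differ.
const reserialized = JSON.stringify(JSON.parse(rawBody.toString('utf8')));

console.log(reserialized === rawBody.toString('utf8')); // false
console.log(hmacHex(secret, rawBody) === hmacHex(secret, reserialized)); // false
```

Both comparisons print false: the two strings describe the same JSON value, but only the original bytes match what the sender signed.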

Stripe's webhook signature docs describe the canonical implementation; copy it for your own service.

Idempotency Keys

Networks fail. The sender posts a webhook, the receiver processes it, but the response is lost. The sender retries; the receiver processes the same event twice. If the handler creates a database record, charges a card, or sends an email, you've now done it twice.

Idempotency means: doing the operation once produces the same result as doing it ten times. The receiver achieves this by tracking which events it has already processed.

The pattern:

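async function handleWebhook(event) {
  // Try to insert event ID; fails (unique constraint) if already seen
  const inserted = await db.insertEvent({
    id: event.id,
    receivedAt: new Date(),
    processed: false,
  });

  if (!inserted) {
    // Already seen this event
    const existing = await db.getEvent(event.id);
    if (existing.processed) {
      return; // No-op: already done
    }
    // Else: a previous attempt failed or is still running;
    // the row lock below decides which delivery does the work
  }

  // Do the actual work in a transaction
  await db.transaction(async (tx) => {
    // Hypothetical helper (e.g. SELECT ... FOR UPDATE): locks the event row so
    // concurrent retries run one at a time, and returns its current state
    const row = await tx.lockEvent(event.id);
    if (row.processed) return; // Another delivery finished while we waited

    await processEvent(event, tx);
    await tx.markEventProcessed(event.id);
  });
}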

The unique constraint on the event ID is the critical primitive — it makes "have I seen this before?" an atomic database check. Don't rely on application-level "if exists, skip" checks; they have race conditions when two retries arrive simultaneously.

For high-throughput systems where keeping every event ID forever isn't practical, a TTL of 7 days is usually enough — most senders won't retry past that.
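
To make the TTL idea concrete, here is an in-memory stand-in for the event table. It only illustrates the claim-and-expire logic; `SeenEvents` is a hypothetical class, and a real deployment would keep the unique constraint in the database and delete old rows on a schedule.

```javascript
// In-memory stand-in for a table with a unique constraint on event ID.
class SeenEvents {
  constructor(ttlMs) {
    this.ttlMs = ttlMs;
    this.seenAt = new Map(); // event ID -> time first seen (ms)
  }

  // Returns true the first time an ID is claimed, false for duplicates.
  claim(eventId, now = Date.now()) {
    this.prune(now);
    if (this.seenAt.has(eventId)) return false;
    this.seenAt.set(eventId, now);
    return true;
  }

  // Forget IDs older than the TTL so the store does not grow forever.
  prune(now = Date.now()) {
    for (const [id, firstSeen] of this.seenAt) {
      if (now - firstSeen > this.ttlMs) this.seenAt.delete(id);
    }
  }
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;
const seen = new SeenEvents(SEVEN_DAYS_MS);
console.log(seen.claim('evt_1', 0)); // true: first delivery
console.log(seen.claim('evt_1', 1000)); // false: retry of the same event
console.log(seen.claim('evt_1', SEVEN_DAYS_MS + 1)); // true: the ID has expired
```

The last call shows the cost of a TTL: once an ID expires, a very late retry is treated as new, so the TTL has to exceed the sender's retry window.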

Retry Strategy and Exponential Backoff

When the receiver doesn't respond 2xx, the sender retries. A naive "retry every 30 seconds forever" works fine for a single failed delivery — but if 10,000 webhooks fail simultaneously (a customer's server crashed), the sender will then thunder-herd 10,000 retries every 30 seconds until they recover. That's a denial-of-service attack on your own customer.

Exponential backoff with jitter is the standard:

function nextRetryDelay(attempt) {
  const baseDelaySeconds = Math.min(2 ** attempt, 3600); // cap at 1 hour
  const jitter = baseDelaySeconds * 0.25 * (Math.random() * 2 - 1);
  return Math.round(baseDelaySeconds + jitter);
}
// Attempt 1: ~2s, Attempt 2: ~4s, Attempt 3: ~8s, ..., Attempt 10: ~1024s
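
To see how long such a schedule keeps trying, it helps to sum the delays. The sketch below uses the same base delay as nextRetryDelay with the jitter left out so the numbers are deterministic; `retrySchedule` and `attemptsToCover` are hypothetical helper names.

```javascript
// Same base delay as nextRetryDelay, without the random jitter.
function baseDelaySeconds(attempt) {
  return Math.min(2 ** attempt, 3600); // cap at 1 hour
}

// Delay before each attempt and the total time elapsed once it fires.
function retrySchedule(maxAttempts) {
  const schedule = [];
  let elapsed = 0;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const delay = baseDelaySeconds(attempt);
    elapsed += delay;
    schedule.push({ attempt, delay, elapsed });
  }
  return schedule;
}

// Smallest number of attempts whose cumulative delay covers `windowSeconds`.
function attemptsToCover(windowSeconds) {
  let elapsed = 0;
  let attempt = 0;
  while (elapsed < windowSeconds) {
    attempt += 1;
    elapsed += baseDelaySeconds(attempt);
  }
  return attempt;
}

console.log(retrySchedule(3).map((s) => s.elapsed)); // [ 2, 6, 14 ]
console.log(attemptsToCover(24 * 3600)); // 34
```

With these parameters the delay reaches the one-hour cap at attempt 12, and covering a 24-hour outage takes 34 attempts.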

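Real providers differ widely, so check the documentation for each one. Stripe retries live-mode deliveries for up to 3 days with exponential backoff. GitHub, by contrast, does not automatically redeliver failed deliveries; you redeliver them manually or through its API. The Cloudflare blog on retry patterns is a good primer on why jitter matters.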

The receiver's job is to be predictable: if you can process the event, return 2xx; if you can't (handler crashed, downstream service down), return 5xx so the sender retries. Never return 2xx if you haven't durably persisted the event — that's a guarantee you can't take back.

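A 4xx response (especially 400, 401, 403, 410) signals a permanent failure, such as a failed signature check or a deleted endpoint, and some senders stop retrying when they see one. Behavior varies by provider (some retry on any non-2xx response), so check the sender's documentation. Either way, returning 4xx during a transient problem is dangerous: a sender that honors it will drop the event.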
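
One way to keep these rules in a single place is a small mapping from the handling outcome to a status code. This is a sketch under assumptions: the two error classes are hypothetical, and the specific codes are a reasonable choice, not a standard.

```javascript
// Hypothetical error types a receiver might throw while handling a delivery.
class InvalidSignatureError extends Error {}
class MalformedPayloadError extends Error {}

// Map the outcome of handling a delivery to an HTTP status code.
// `error` is undefined when the event was durably persisted.
function statusFor(error) {
  if (!error) return 200; // persisted: safe to acknowledge
  if (error instanceof InvalidSignatureError) return 401; // retrying won't help
  if (error instanceof MalformedPayloadError) return 400; // retrying won't help
  return 503; // anything else is treated as transient so the sender retries
}

console.log(statusFor()); // 200
console.log(statusFor(new InvalidSignatureError('bad signature'))); // 401
console.log(statusFor(new Error('database unavailable'))); // 503
```

Defaulting unknown errors to 5xx is deliberate: a bug in the handler then causes extra retries, not silently dropped events.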

Replay Attack Prevention

An attacker who captures a legitimate webhook delivery (network sniffing, proxy logs, exposed CI artifacts) can replay it later if the receiver doesn't actively defend against this. Even with HMAC, the signature stays valid forever — the secret hasn't changed.

Two layers of defense:

Timestamp validation. As shown in the verification function above, reject deliveries with timestamps older than 5 minutes (or whatever your tolerance is). The timestamp is part of the signed payload, so an attacker can't tamper with it without invalidating the signature.

Idempotency tracking. Even if a replay sneaks through the timestamp check, the unique-event-ID-in-database check rejects it as a duplicate. This is why the idempotency layer matters even for security, not just reliability.
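
The two layers can be sketched together as one check. The snippet assumes the signature has already been verified (so the timestamp can be trusted) and uses a Set as a stand-in for the unique-constraint table; `checkDelivery` is a hypothetical name.

```javascript
const TOLERANCE_SECONDS = 300;

// `seenIds` stands in for the database table with a unique event ID column.
function checkDelivery(delivery, nowSeconds, seenIds) {
  // Layer 1: the signed timestamp must be recent.
  if (nowSeconds - delivery.timestamp > TOLERANCE_SECONDS) return 'stale';
  // Layer 2: the event ID must not have been processed before.
  if (seenIds.has(delivery.id)) return 'duplicate';
  seenIds.add(delivery.id);
  return 'accepted';
}

const seen = new Set();
const delivery = { id: 'evt_1', timestamp: 1715200800 };
console.log(checkDelivery(delivery, 1715200810, seen)); // accepted
console.log(checkDelivery(delivery, 1715200820, seen)); // duplicate (replay inside the window)
console.log(checkDelivery(delivery, 1715204400, seen)); // stale (replay an hour later)
```

The second call is the case the timestamp check alone would miss: a replay that arrives inside the 5-minute window.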

For high-security contexts (financial transactions), some teams add a third layer: nonces. The sender includes a single-use random value; the receiver tracks used nonces and rejects repeats. This is overkill for most use cases.

The JWT Decoder is a useful adjacent tool — JWTs use similar timestamp-and-signature primitives, and the decoder helps you understand what claims are typically embedded in signed tokens.

Testing Webhooks Locally

The chicken-and-egg problem: the sender needs a public URL, but you're developing on localhost. Three solutions:

ngrok opens a tunnel from a public hostname to your local port. ngrok http 3000 gives you a URL like https://abc123.ngrok.io to register as your webhook URL.

Cloudflare Tunnel is the production-friendly alternative: it is free and integrates with Cloudflare Access if you want to restrict access. For a quick test, run cloudflared tunnel --url http://localhost:3000.

Provider replay tools like stripe listen --forward-to localhost:3000/webhook and gh webhook forward skip the tunnel entirely — they run a local agent that pulls events from the provider's queue and POSTs them to your local server, avoiding any public exposure of your dev machine.

For testing without a real provider, capture a real payload as a fixture, generate a valid signature with the Hash Generator (HMAC-SHA256), and POST with curl. The JSON Validator is fast for triaging malformed events in production logs.
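
If you would rather script that step, the sketch below signs a fixture and prints a ready-to-paste curl command. The URL, secret, and header names are assumptions carried over from the earlier examples, and `buildCurlCommand` is a hypothetical helper.

```javascript
const crypto = require('crypto');

// Build a curl command that posts a fixture with a fresh, valid signature.
// Note: the body is wrapped in single quotes, so a fixture that itself
// contains a single quote would need extra shell escaping.
function buildCurlCommand({ url, secret, body, timestamp }) {
  const signature = crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.${body}`)
    .digest('hex');
  return [
    `curl -X POST '${url}'`,
    `-H 'Content-Type: application/json'`,
    `-H 'X-Webhook-Timestamp: ${timestamp}'`,
    `-H 'X-Webhook-Signature: sha256=${signature}'`,
    `--data-binary '${body}'`,
  ].join(' \\\n  ');
}

const fixture = JSON.stringify({ id: 'evt_test_1', type: 'order.created' });
console.log(
  buildCurlCommand({
    url: 'http://localhost:3000/webhooks/orders',
    secret: 'whsec_example',
    body: fixture,
    timestamp: Math.floor(Date.now() / 1000),
  })
);
```

Using --data-binary matters here: it sends the body exactly as given, so the bytes on the wire match the bytes that were signed.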

FAQ

What HTTP status code should my webhook handler return?

200 (or 204) when you've successfully and durably persisted the event. 4xx (especially 400 for bad payload, 401 for failed signature) when retrying won't help. 5xx for transient failures where you want the sender to retry. The most common bug: returning 200 before persistence is durable, then crashing — the sender thinks delivery succeeded, the event is lost.

How do I prevent processing the same webhook twice?

Track event IDs in a database with a unique constraint. On delivery, attempt to insert the ID; if the insert fails (already exists), skip the work. The unique constraint provides atomicity — application-level "if exists, skip" checks have race conditions when retries arrive concurrently. Most webhook providers send a unique event ID specifically so receivers can do this.

Why does Stripe send a timestamp in the signature?

To prevent replay attacks. Without the timestamp, an attacker who captures a valid signed webhook can replay it forever — the signature remains valid. By including the timestamp in the signed payload, the receiver can reject any delivery with a stale timestamp, limiting the replay window to (typically) 5 minutes.

What's the difference between a webhook and a callback URL?

In casual usage, nothing — both refer to a URL the server calls back to. More precisely, "callback URL" often refers to a one-time redirect target (OAuth's redirect_uri, Stripe's checkout success URL), while "webhook" refers to repeated event-triggered POSTs over the lifetime of an integration.

How long should I retry failed webhook deliveries for?

Provider policies vary widely: Stripe retries for up to 3 days, Slack's Events API gives up after a handful of attempts within a few minutes, and GitHub does not retry automatically at all. The right answer depends on the criticality of the event and your customer's typical recovery window. 24-72 hours covers most planned and unplanned outages. Longer retention adds queue management complexity for diminishing returns.

Can I use webhooks instead of polling for everything?

No. Webhooks require the receiver to have a public, always-on endpoint — incompatible with mobile apps, browser-based clients, or systems behind strict firewalls. Webhooks also lose events if the receiver is unreachable past the retry window. Polling (or persistent push channels like WebSockets) is the better fit for clients that go offline.

Why does my HMAC verification fail even when the secret is right?

Almost always one of: (1) you're hashing the parsed-then-reserialized JSON instead of the raw body bytes, (2) wrong character encoding on the secret or payload, (3) you forgot to include the timestamp in the signed payload, (4) the header name is mistyped. Capture the wire-level request to inspect the raw bytes the sender signed.

Should webhook handlers do work synchronously or queue it?

Queue it. Return 200 as fast as possible after durably persisting the raw event, then process asynchronously. This decouples handler reliability from downstream services — if your email provider is slow, your webhook handler shouldn't time out and trigger sender retries. Stripe explicitly recommends this pattern.
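
The shape of that pattern, reduced to an in-memory sketch: the endpoint only stores the raw event and acknowledges, and a separate worker does the slow part. The array here stands in for a durable table or message queue, and `receive` and `drain` are hypothetical names.

```javascript
// In-memory stand-ins: a real system would use a database table or a message queue.
const queue = [];
const processed = [];

// Webhook endpoint: persist the raw event, then acknowledge immediately.
function receive(rawBody) {
  queue.push(rawBody); // stands in for a durable write
  return 200;
}

// Worker: runs separately, so slow downstream calls never delay the 200.
// A real worker would also retry or dead-letter events whose handler throws.
function drain(handler) {
  while (queue.length > 0) {
    const rawBody = queue.shift();
    handler(JSON.parse(rawBody));
  }
}

const status = receive('{"id":"evt_1","type":"order.created"}');
console.log(status, queue.length); // 200 1
drain((event) => processed.push(event.id));
console.log(processed); // [ 'evt_1' ]
```

The acknowledgement depends only on the write succeeding, which is what keeps a slow email provider or a failing downstream API from triggering sender retries.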