
How One Renamed Field Took Down Three Services for Four Hours

A realistic incident postmortem of a field rename that cascaded through a microservices architecture. The change wasn't wrong — it was uninformed.

This is a story about a single pull request, three lines of JSON, and four hours that nobody at Meridian SaaS wants to relive. The names are fictional. The architecture is not. If you run microservices, you have probably lived some version of this already.

The Setup

Meridian is a mid-size B2B SaaS company. Their platform runs on 12 microservices, maintained by a team of about 40 engineers spread across five squads. The user service is the oldest and most central — it handles authentication, profile data, and serves as the source of truth for user identity across the platform. Fourteen internal services and three external partner integrations consume its API.

The user service had been accumulating technical debt for two years. Field names were inconsistent. The created timestamp used a different format than every other service. Address fields were flat — street, city, state, zip — sitting alongside unrelated profile data. Everyone agreed it needed cleanup. Nobody agreed on when.

The Change

On a Friday morning, a senior engineer on the platform team opened a pull request titled "Clean up user service response schema." The diff was tidy. The intent was good. Here is what the response looked like before:

{
  "id": "usr_8a3f2c",
  "name": "Alice Chen",
  "email": "alice@example.com",
  "created": "2024-01-15 09:30:00",
  "street": "742 Evergreen Terrace",
  "city": "Springfield",
  "state": "IL",
  "zip": "62704"
}

And after:

{
  "id": "usr_8a3f2c",
  "displayName": "Alice Chen",
  "email": "alice@example.com",
  "createdAt": "2024-01-15T09:30:00Z",
  "address": {
    "street": "742 Evergreen Terrace",
    "city": "Springfield",
    "state": "IL",
    "zip": "62704"
  }
}

The PR got two approvals within the hour. Unit tests passed. Integration tests passed — because the integration test suite mocked the user service response, and the mocks had been updated in the same PR. The CI pipeline was green across the board.

It merged at 4:07 PM on Friday. Continuous deployment pushed it to production at 4:12 PM.

The Cascade

T+0 (4:12 PM) — The new user service is live. No alerts fire. Health checks pass. The service is returning 200s with well-formed JSON. From its own perspective, everything is perfect.

T+28 minutes (4:40 PM) — A QA engineer on the mobile team pings the #mobile-dev channel: "Are user profiles broken for anyone else? I am seeing 'undefined' where the user's name should be." The mobile app reads response.name to render profile cards. That field no longer exists. There is no error — just a silent undefined that flows through to the UI. The mobile team assumes it is a client-side bug and starts investigating their own code.

T+55 minutes (5:07 PM) — The partner integration team receives the first of what will become 47 support tickets from three external API consumers. A fintech partner's webhook handler is failing validation because it expects a top-level street field. Their error messages are cryptic — "Required field missing in user payload" — and their on-call engineer cannot reproduce it locally because their staging environment still points at Meridian's staging user service, which has not been updated.

T+1 hour 20 minutes (5:32 PM) — The billing service begins logging warnings. It joins user records with billing accounts using a combination of id and created to handle edge cases with migrated accounts. The created field is gone, replaced by createdAt in a different format. The join logic does not fail — it falls through to a secondary matching strategy that uses email alone, which is less precise. Two invoices are silently routed to the wrong accounts.
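
To make that failure mode concrete, here is a minimal sketch of how a silent fallthrough like this can happen. It is a hypothetical reconstruction, not Meridian's billing code: the function name match_account, the account fields user_id and account_id, and the strict flag are all assumptions made for illustration.

```python
def match_account(user, accounts, strict=False):
    """Match a user payload to a billing account (hypothetical sketch).

    Primary strategy: id + created. Fallback: email alone.
    """
    if strict and "created" not in user:
        # Fail loudly instead of quietly degrading to the weaker strategy.
        raise KeyError("user payload is missing 'created'")

    # .get() returns None once the field has been renamed, so the
    # primary match below can never succeed and nothing raises.
    created = user.get("created")
    for account in accounts:
        if account["user_id"] == user["id"] and account["created"] == created:
            return account

    # Fallback: email alone, which is ambiguous when accounts share an email.
    for account in accounts:
        if account["email"] == user["email"]:
            return account
    return None
```

With strict left off, a payload that carries createdAt instead of created skips the primary match without any error and takes whichever account shares the email first. With strict on, the same payload raises immediately, which turns a silent misrouting into a visible failure.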

T+1 hour 48 minutes (6:00 PM) — A billing team member notices revenue numbers look off on an internal dashboard. They open a ticket. Nobody connects it to the user service change yet.

T+2 hours 5 minutes (6:17 PM) — The mobile team has ruled out a client-side bug. They trace the issue to the user service response and realize the name field is missing. Someone asks in #engineering: "Did anything change in the user service today?" The PR is found. Incident is declared.

T+2 hours 30 minutes (6:42 PM) — The incident commander pulls in engineers from four squads. The scope of the damage is becoming clear but is hard to quantify. The question on everyone's mind: what else reads from the user service? Nobody has a definitive answer. There is no service dependency map. There is no contract registry. Engineers start grepping across repositories to find references to the changed fields.

T+3 hours 15 minutes (7:27 PM) — Root cause is confirmed. The full list of affected consumers is still being assembled. The team debates: roll back the user service, or patch the consumers? Rolling back is faster but will revert other changes that shipped this week. They decide to add backward-compatible aliases — return both name and displayName, both created and createdAt, and flatten the address fields back to the top level alongside the new nested object.
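
A minimal sketch of what such an alias layer might look like, assuming the response is handled as a plain dictionary and that the old created value was always UTC (the before and after samples show the same wall-clock time). The function name add_legacy_aliases is hypothetical; the story does not show the actual hotfix.

```python
from datetime import datetime


def add_legacy_aliases(user: dict) -> dict:
    """Return a copy of the new-format response with the old fields restored."""
    out = dict(user)

    # displayName -> name: serve both spellings.
    if "displayName" in user:
        out["name"] = user["displayName"]

    # createdAt -> created: re-emit the old "YYYY-MM-DD HH:MM:SS" format.
    # Assumes createdAt is always UTC with a trailing "Z" and no fractions.
    if "createdAt" in user:
        parsed = datetime.strptime(user["createdAt"], "%Y-%m-%dT%H:%M:%SZ")
        out["created"] = parsed.strftime("%Y-%m-%d %H:%M:%S")

    # address.* -> top level: copy each nested field back up, without
    # overwriting any top-level key that already exists.
    for key, value in user.get("address", {}).items():
        out.setdefault(key, value)

    return out
```

The new fields stay in place, so consumers can migrate one at a time and the aliases can be removed once nothing reads them.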

T+4 hours 3 minutes (8:15 PM) — The hotfix is deployed. Mobile profiles render correctly. Partner webhook handlers stop failing. The billing mismatch is identified and flagged for manual correction on Monday.

Engineers start going home. The Slack channel stays active until midnight.

The Postmortem

The following Monday, the team ran a blameless postmortem. They identified five contributing factors:

  • No contract awareness. The user service had no machine-readable API contract. There was no OpenAPI spec, no schema registry, nothing that explicitly declared "these are the fields consumers depend on." The contract existed only implicitly — in the code of every consumer.
  • Mocks masked the break. Integration tests used mocked responses that were updated in the same PR as the schema change. The tests verified that the user service returned the new format correctly. They never verified that consumers could handle it.
  • No consumer notification. Three external partners depended on the old field names. None were notified before the change went live. The partnership agreement included a vague commitment to "reasonable notice of breaking changes" but no mechanism to deliver it.
  • No diff-stage detection. The code review focused on whether the new schema was cleaner — and it was. Nobody asked the question: "Who depends on the fields we are removing?" There was no tooling to surface that information during review.
  • Friday afternoon deploy. The timing reduced the number of engineers available to respond and extended the resolution time by at least an hour.
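
The first two factors point at the same gap: consumer expectations were never written down anywhere a test could read them. A minimal sketch of one way to close it, assuming each consumer declares the response fields it reads and the provider's real output (not a mock) is checked against those declarations. The registry contents come from the fields named in the timeline; the consumer labels and function names are hypothetical.

```python
# Hypothetical registry: each consumer lists the response paths it reads.
# Dotted paths such as "address.city" address nested fields.
CONSUMER_EXPECTATIONS = {
    "mobile-app": ["name"],
    "partner-webhook": ["street"],
    "billing-service": ["id", "created", "email"],
}


def has_path(payload, dotted_path):
    """Return True if every segment of dotted_path exists in payload."""
    node = payload
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True


def broken_consumers(response, expectations):
    """Map each consumer to the expected paths missing from response."""
    broken = {}
    for consumer, paths in expectations.items():
        missing = [p for p in paths if not has_path(response, p)]
        if missing:
            broken[consumer] = missing
    return broken
```

Run against the new response format, this flags mobile-app (name), partner-webhook (street), and billing-service (created). Run against the old format, it returns an empty result. A check like this only has value if it runs against the provider's actual output in CI, because a mock updated in the same PR will always agree with itself.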

The Cost

The final tally was sobering:

  • 40+ engineering hours spent on incident response, hotfix development, and consumer-side patches across four teams
  • 3 partner escalations, including one that reached VP level and triggered a contract review
  • 2 billing errors that required manual correction and customer apology emails
  • 1 week of follow-up work to build the backward-compatible response layer that should have existed before the change shipped

The original PR took 45 minutes to write and review.

The Lesson

Here is what the postmortem did not conclude: that the change was wrong. The rename from name to displayName was a genuine improvement. Nesting address fields was the right structural decision. Standardizing the timestamp format was overdue.

The problem was not the change. The problem was that the change was uninformed. The engineer who wrote the PR had no way to know — at the moment of writing it — that 14 internal services and 3 external integrations depended on the exact field names being removed. The reviewers who approved it had no way to see the blast radius. The CI pipeline had no way to flag it.

The information needed to prevent this incident existed. It was scattered across a dozen codebases, embedded in field accessors, JSON parsing logic, and webhook handler configurations. But none of it was surfaced at the one moment when it mattered: when the PR was open and a human was deciding whether to merge.

Catching It Before It Ships

This class of incident is not an edge case. Field renames, type changes, removed properties, restructured response objects — these are among the most common causes of integration failures in microservice architectures. They are also among the most preventable, because the change is visible in the diff long before it reaches production.

The fix is not "stop making schema changes." The fix is knowing who depends on what you are changing, and knowing it before you merge — not four hours after you deploy.

Automated detection of API contract changes at the pull request stage turns a four-hour incident into a two-minute review comment. The engineer sees which fields changed, understands the severity, and can coordinate with consumers before the change ships. The partners get notified. The billing service gets updated. And Friday evening stays uneventful.
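
As a rough illustration of the idea, and not a description of how any particular tool works, a diff-stage check can be as simple as comparing a sample response from before and after the change and reporting every field path that disappeared or changed type. The function names below are hypothetical.

```python
def flatten(payload, prefix=""):
    """Map each leaf path (e.g. 'address.city') to the type name of its value."""
    paths = {}
    for key, value in payload.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths.update(flatten(value, prefix=f"{path}."))
        else:
            paths[path] = type(value).__name__
    return paths


def breaking_changes(before, after):
    """List removed paths and type changes; added paths are not breaking."""
    old, new = flatten(before), flatten(after)
    findings = []
    for path in sorted(old):
        if path not in new:
            findings.append(f"removed: {path}")
        elif old[path] != new[path]:
            findings.append(f"type changed: {path} ({old[path]} -> {new[path]})")
    return findings
```

Run against the before and after responses from the user service PR, this reports six removals: city, created, name, state, street, and zip. It would not catch a format change inside a field that keeps its name and type, such as a timestamp string switching formats, so a real check would need value-level rules as well.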

The best time to catch a breaking change is when it is still just a diff.
