Source linked

40 систем, 2,5 года, 6 ошибок, которые почти убили миграцию

hackernoon.com@systems_wire3 hours ago·Systems Engineering·1 comments

Команда, которая перенесла 40 старых систем на EKS, обнаружила, что добыча баз данных съела 45% бюджета программы, а не предполагаемого 20% - и это была всего лишь одна из шести ошибок планирования, которые почти затопили в первый год.

ekspactlegacy system migrationpostmortemplatform engineeringmicroservices

Forty-something legacy systems, two and a half years, one new platform on EKS — and the first year was a near-disaster we were lucky to recover from. By any external measure the program shipped and the lights stayed on. Internally, the team spent twelve months learning lessons that more careful planning should have prevented. Here are the six mistakes that cost us, and what we'd do differently.

Starting Easy Taught Us Nothing

Our first migration target: a small read-only lookup service. It went smoothly, the team celebrated, the executive sponsor sent a "great job" email. Perfect, right? Wrong. That easy win taught us nothing about database extraction, CDC pipelines, staff retraining, or the API contract drift that would later chew up six months of year two. Pick your first migration for what it teaches, not for what it ships. A win that doesn't surface the platform's weaknesses is a delayed disaster. We should have started with a medium-complexity system — one with a database, a downstream integration, and a non-trivial user-facing change — even if it took 3x longer.

Verbal Agreements and Missing Contract Tests Added Months

Everyone agreed at kickoff: "No new features in the legacy while we migrate." It was in the slides, in the steering committee notes, in nobody's engineering policy. For nine months, teams added features to the legacy because velocity felt higher and customers were happier. Every feature added during migration became one we had to re-migrate, costing roughly four months of program time. Fix: a written policy with VP sign-off, posted in the wiki, enforced at the pull-request level via a CI check that blocked changes to certain legacy directories without an explicit override. Verbal agreements lose to weekly delivery pressure; written ones survive it.

Unit and integration tests were in place. Contract tests weren't. When we cut over the customer service, three callers that depended on a deprecated field broke silently. The field returned empty strings instead of nulls. Nobody caught it for nine days, and a handful of downstream batch jobs produced wrong data for that entire window. After that, every consumer-provider relationship was codified in Pact (or a homegrown equivalent). Every change to a service's response had to pass consumer tests before merging. It added overhead. It also stopped the next nine field-shape regressions from reaching production.

The Database Work Was 45% of the Program, Not the Estimated 20%

Services were the visible work. Database extraction is where the real time went. Twelve of the forty legacy apps wrote to the same customer table. Extracting "customer" as a service meant migrating every one of those write paths through the new API — each with its own ORM, assumptions, and undocumented invariants. We estimated the database work at 20% of the program; it was closer to 45%. Fix in hindsight: budget database extraction at parity with the service work, not as a tail item. Do not try to fork the database until every writer has been migrated for at least two release cycles. Premature forking caused two of our worst incidents.

The API Gateway Deserved a Product Owner

Our gateway was set up by the platform team and handed off to nobody. No owner, no roadmap, no backlog. When teams asked for new routing behaviors, request transforms, or auth modes, they got "the platform team is busy" or "we'll get to it." Migrations stalled waiting on the gateway. In month twelve we made the gateway a product with a dedicated owner, a roadmap, and an on-call rotation. Migration throughput roughly doubled the following quarter. Shared infrastructure that everyone depends on needs a product owner with real authority, or it becomes the bottleneck the whole program waits on.

Cutting Over Isn't Decommissioning

Every time we routed traffic to a new service, the team celebrated, marked the migration done in the tracker, and moved on. The legacy code kept running — for an average of eight months past cutover. Nobody owned deletion, and because every cutover had a footgun, nobody wanted to be the first to disarm. The original "we'll save infrastructure cost" pitch didn't materialize until late year two, when we finally instituted a "delete the legacy code 90 days after cutover" policy with an executive sign-off requirement to extend.

What we would do differently on day one: pick a medium-complexity first migration that surfaces platform weaknesses early. Codify the "no new features in legacy" rule as a CI policy on day one. Require contract tests before any cutover; treat them as part of the platform. Budget database extraction at parity with service work, not 20%. Make the API gateway a product with a named owner. Decommission within 90 days; track it like a delivery metric.

By month 28, deploy cadence had gone from quarterly to multiple times per day, and the customer-facing wins were real. None of that erases the first year of learning the hard way. The hardest part of modernization isn't the technology — it is the discipline to make agreements stick.


Source: We Replaced 40 Legacy Systems: Here's What We Got Wrong
Domain: hackernoon.com

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.