
Refactoring Without Rewriting: The Strangler Fig Pattern in Practice

The instinct when looking at a 12-year-old monolith is "rewrite it." The data on rewrites is brutal — most miss every deadline by 2–3×, half never ship. The Strangler Fig pattern is how we actually replace legacy systems for clients. This is the playbook with three real examples.

Codecanis Admin

11 min read
Year-two view of a Strangler Fig migration — the new system carries 78% of traffic, the legacy monolith handles the rest.

Every engineering team that inherits a 12-year-old monolith reaches the same point eventually. The codebase has accreted features faster than it has shed complexity, every change takes longer than it should, and someone on the team says the words: "we should just rewrite it." The conversation that follows is usually the most expensive one a CTO can have.

The data on rewrites is brutal. Most miss every deadline by 2–3×. Roughly half never reach feature parity with the system they were meant to replace. A non-trivial fraction are abandoned partway through after consuming a year or more of engineering capacity, leaving the team with two systems to maintain instead of one. Joel Spolsky's "things you should never do" essay was published in 2000 and the lesson keeps landing every five years on a new generation of engineering leaders.

The good news: there's a better pattern, and it's been understood since Martin Fowler named it in 2004. The Strangler Fig is how we actually replace legacy systems for clients — and it has shipped successfully on every engagement where we've used it. This post walks through the four steps, three real client case studies (anonymised), and the pitfalls that derail teams who think they're following the pattern but aren't.

Why Rewrites Fail

Three structural reasons rewrites collapse, beyond the obvious "it's hard":

  • The data problem: A rewrite has to move data from the old schema to the new one. The old schema has 12 years of edge cases, undocumented foreign keys, fields that meant one thing before 2017 and another thing after. Every migration script becomes an archaeological expedition. Most rewrites underestimate this by an order of magnitude.
  • The scope problem: The team rebuilding the system also wants to fix everything they always hated about it. Now you're not rewriting — you're inventing a new product that happens to overlap with the old one. Scope creep is structural, not a discipline failure.
  • The morale problem: A rewrite is a multi-year commitment with no visible value delivery until the very end. Engineers leave. Product leaders lose patience. Boards lose patience. The pressure to "just ship something" causes shortcuts that compromise the rewrite's only justification — that the new system would be clean.

The Strangler Fig avoids all three by changing the shape of the work. Instead of a multi-year capacity-consuming megaproject, it becomes a series of incremental, individually shippable migrations, each of which delivers measurable value on its own.

The Four Steps of the Strangler Fig Pattern

Step 1: Identify the Seam

A seam is a place in the system where you can intercept traffic — an HTTP request, a queue message, a function call — and route it conditionally to either the old or the new implementation. The most common seam is the HTTP layer, but message queues, scheduled jobs, and even database triggers can be seams in the right context.

The seam has to be:

  • Cheap to add: If adding the routing layer requires touching the entire codebase, you've picked the wrong seam.
  • Observable: You need to see the traffic going through it: request volume, latency, and error rates, broken down by route.
  • Reversible: One config change should swing 100% of traffic back to the legacy implementation in an emergency (see the kill-switch sketch below).
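
In practice, the reversibility requirement can be as small as an environment-variable override layered on top of the per-route traffic shares. A minimal sketch (the STRANGLER_KILL_SWITCH name is our invention; use whatever your config system provides):

import os

def effective_share(route_share: float) -> float:
    """Emergency brake: setting STRANGLER_KILL_SWITCH=1 in the proxy's
    environment forces every request back to the legacy implementation,
    whatever the per-route shares say."""
    if os.environ.get("STRANGLER_KILL_SWITCH") == "1":
        return 0.0
    return route_share

The routing layer in the next step would then consult should_route_to_new(request, effective_share(share)) rather than the raw share, so one config change and a restart (or a hot reload) swings everything back.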

Step 2: Build the Routing Layer

A simple reverse proxy in front of the legacy system, capable of routing per-request based on rules — a path prefix, a feature flag, a user ID, a percentage. Start with everything routed to legacy. Add routing rules as you replace slices.

from flask import Flask, request
import hashlib
import requests

app = Flask(__name__)

LEGACY_URL = "http://legacy.internal:8080"
NEW_URL    = "http://new-service.internal:9090"

# Routing table: which paths are handled by the new system.
# Start empty. Add entries as you migrate slices.
NEW_ROUTES = {
    "/api/v1/invoices": 1.0,   # 100% of invoices to new system
    "/api/v1/customers": 0.10, # 10% canary to new system
}

# Catch-all route: Flask needs both decorators to match "/" as well as
# every sub-path.
@app.route("/", defaults={"path": ""}, methods=["GET", "POST", "PUT", "DELETE"])
@app.route("/<path:path>", methods=["GET", "POST", "PUT", "DELETE"])
def route(path):
    full_path = "/" + path
    target = LEGACY_URL

    for prefix, share in NEW_ROUTES.items():
        if full_path.startswith(prefix):
            if should_route_to_new(request, share):
                target = NEW_URL
            break

    resp = requests.request(
        method=request.method,
        url=target + request.full_path,
        headers={k: v for k, v in request.headers if k != "Host"},
        data=request.get_data(),
        allow_redirects=False,
    )
    # Strip hop-by-hop headers: requests has already decoded the body,
    # so the upstream Content-Encoding/Content-Length no longer apply.
    excluded = {"content-encoding", "content-length",
                "transfer-encoding", "connection"}
    headers = [(k, v) for k, v in resp.headers.items()
               if k.lower() not in excluded]
    return resp.content, resp.status_code, headers


def should_route_to_new(req, share: float) -> bool:
    """Sticky routing by user_id so the same user sees consistent behaviour.

    Uses a stable hash rather than Python's built-in hash(), which is
    salted per process and would bucket users differently on each proxy
    instance.
    """
    user_id = req.headers.get("X-User-Id", "")
    if not user_id:
        return False
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 1000
    return bucket / 1000 < share

In production we typically use Envoy, Nginx, or a service mesh (Istio, Linkerd) rather than a hand-rolled Flask proxy — but the principle is identical. The routing layer is the foundation; everything else builds on it.

Step 3: Replace One Slice

Pick one slice — one bounded context, one URL prefix, one feature domain — and reimplement it in the new system. Run both implementations side by side for at least two weeks with shadow logging: traffic goes to the legacy system as normal, but every request is also sent to the new system and the responses are compared, with discrepancies logged.

def shadow_request(req, legacy_response):
    """Fire the request at the new system; log if the response differs.

    Two cautions: shadowing non-idempotent requests (POST/PUT/DELETE)
    executes their side effects twice, so point the new system at a
    sandbox datastore or shadow only reads. And run this from a
    background worker so shadowing never adds latency to the live path.
    """
    try:
        new_response = requests.request(
            method=req.method,
            url=NEW_URL + req.full_path,
            headers={k: v for k, v in req.headers if k != "Host"},
            data=req.get_data(),
            timeout=5,
        )
    except Exception as e:
        log_shadow_diff(req, "shadow_error", str(e))
        return

    if new_response.status_code != legacy_response.status_code:
        log_shadow_diff(req, "status_mismatch", {
            "legacy": legacy_response.status_code,
            "new": new_response.status_code,
        })

    if not bodies_equivalent(new_response.content, legacy_response.content):
        log_shadow_diff(req, "body_mismatch", {
            "legacy_hash": hashlib.sha256(legacy_response.content).hexdigest()[:12],
            "new_hash": hashlib.sha256(new_response.content).hexdigest()[:12],
        })


def bodies_equivalent(new_body: bytes, legacy_body: bytes) -> bool:
    """Byte-for-byte comparison. Real systems normalise benign
    differences (timestamps, element ordering, trace IDs) first."""
    return new_body == legacy_body


def log_shadow_diff(req, kind: str, detail):
    app.logger.warning("shadow diff [%s] %s %s: %s",
                       kind, req.method, req.full_path, detail)
Only after shadow logging shows the new implementation is matching the legacy implementation on 99.9%+ of real traffic do you start routing real traffic to it — and even then, you start with a small canary (5%, then 25%, then 100%), with automated rollback on error rate thresholds.
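
The ramp itself can be automated. Here is a minimal sketch of the canary loop; error_rate() is a hypothetical helper you would wire to your metrics store, and in practice this logic usually lives in deployment tooling (Argo Rollouts, Flagger, or a Prometheus alert driving a config change) rather than application code:

import time

CANARY_STEPS = [0.05, 0.25, 1.0]   # 5% -> 25% -> 100%
ERROR_BUDGET = 0.001               # roll back above 0.1% errors
SOAK_SECONDS = 3600                # hold each step for an hour

def set_route_share(prefix: str, share: float) -> None:
    # Pushes the new share into the routing table from Step 2
    # (or into your external config store).
    NEW_ROUTES[prefix] = share

def error_rate(prefix: str) -> float:
    # Hypothetical: read the new system's error rate for this
    # prefix from Prometheus, Datadog, or similar.
    raise NotImplementedError("wire to your metrics store")

def ramp_canary(prefix: str):
    for share in CANARY_STEPS:
        set_route_share(prefix, share)
        deadline = time.time() + SOAK_SECONDS
        while time.time() < deadline:
            if error_rate(prefix) > ERROR_BUDGET:
                set_route_share(prefix, 0.0)   # swing back to legacy
                raise RuntimeError(f"canary for {prefix} rolled back at {share:.0%}")
            time.sleep(30)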

Step 4: Decommission

This is the step teams skip and pay for. Once 100% of traffic for a slice is routed to the new system and has run cleanly for at least a month, delete the legacy code for that slice. Don't leave it in place "just in case." Dead code rots in interesting ways — someone will eventually accidentally invoke it from a scheduled job, an integration test, or a forgotten admin script.

Removing the legacy code is the only thing that makes the migration real. Without that step, you have two systems forever.

Case Study 1: SaaS Billing System

Client: B2B SaaS, mid-stage growth. Legacy: a Ruby on Rails 4 billing engine, 9 years old, handling invoice generation, payment retries, and dunning. Why migrate: the team couldn't ship a new pricing model (usage-based metering) without rewriting half the billing engine, and the existing engineers wouldn't touch the billing code after two production incidents in 18 months.

What we did: Built a new billing service in Go. Identified the seam at the API gateway — every billing API call goes through one Rails controller. Routed traffic at the URL prefix level (/billing/v2/...) to the new service while leaving /billing/v1/... on Rails. Migrated one customer at a time onto the v2 endpoints over six months.

Outcome: The new pricing model shipped in month seven. The legacy billing engine was fully decommissioned by month eleven. No customer experienced an incorrect invoice during the migration. Engineering velocity on billing features improved roughly 3× post-migration.

Case Study 2: Legacy CRM

Client: financial services firm, 800 sales staff. Legacy: a custom Windows Forms CRM written in 2009, running on a Citrix farm because nobody dared touch the deployment story. Why migrate: the CRM was the system of record for a large book of client portfolios; the codebase had a bus factor of one (the original architect was 18 months from retirement).

What we did: Built a web-based replacement, but didn't try to replace the whole thing at once. Identified the seam at the screen level — each CRM screen became a candidate for migration. Built a hybrid wrapper application that could display either legacy Windows Forms screens (via remote app) or new web screens in the same workspace, with shared session state.

Outcome: Migration ran 22 months. Sales staff saw a progressively more modern application without any "rip and replace" day. The original architect retired six months in, having documented enough to make the remaining work tractable. Legacy CRM was decommissioned in month 26.

Case Study 3: Custom Odoo Deployment

Client: industrial wholesaler. Legacy: a heavily customised Odoo 11 deployment, with 280,000 lines of in-house Python customisations and a fork of Odoo core that had drifted significantly from upstream. Why migrate: Odoo 11 was end-of-life, the customisations were no longer maintainable, and the team wanted to consolidate onto modern Odoo with cleaner extension patterns.

What we did: This one was harder — Odoo's monolithic architecture doesn't expose obvious HTTP seams for every business operation. We identified seams at the data layer instead: built event sourcing on top of the legacy system using PostgreSQL logical replication, and reimplemented business processes on a new Odoo 17 instance that consumed the same event stream. Customer-facing screens migrated last.
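
For the curious, tapping PostgreSQL logical replication from Python looks roughly like the sketch below, using psycopg2's replication support. The DSN, slot name, and publish_event() sink are our inventions for illustration; it also assumes the replication slot was created beforehand:

import psycopg2
import psycopg2.extras

def publish_event(payload: str) -> None:
    """Hypothetical sink: forward the decoded change to whatever event
    stream (Kafka topic, outbox table) the new system consumes."""
    print(payload)

def consume(msg):
    # msg.payload is the decoded WAL change; translate it into a domain
    # event, then acknowledge so the slot can discard WAL up to here.
    publish_event(msg.payload)
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

conn = psycopg2.connect(
    "dbname=odoo host=legacy-db.internal user=replicator",  # assumed DSN
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
# Assumes a slot created beforehand, e.g.
#   SELECT pg_create_logical_replication_slot('strangler_events', 'test_decoding');
cur.start_replication(slot_name="strangler_events", decode=True)
cur.consume_stream(consume)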

Outcome: 14-month migration. The new system handled 78% of production traffic by month 12 and 100% by month 14. Twelve of the original customisations were eliminated entirely as native Odoo 17 features now covered them.

Pitfalls That Derail Strangler Fig Migrations

  • Skipping the shadow phase: Teams under pressure cut shadow logging short. They learn about the bugs from production customers instead.
  • Choosing the wrong seam: If the seam requires invasive changes to the legacy code, you're not strangling — you're refactoring the legacy system, which defeats the purpose.
  • Never decommissioning: Two systems forever, double the operational burden. Schedule the decommission as a hard milestone, not a "we'll get to it" task.
  • Migrating the easy slices first and stalling on the hard ones: Tempting but dangerous. Migrate the highest-risk or highest-value slice first, when team energy is high.
  • Underinvesting in the routing layer: The routing layer is production-critical infrastructure. It needs the same SLOs, monitoring, and on-call attention as the systems it fronts.

Key Takeaways

  • Big-bang rewrites fail predictably — on data migration, scope creep, and morale.
  • The Strangler Fig pattern replaces a megaproject with a series of incrementally shippable migrations.
  • Identify a clean seam, build a robust routing layer, replace one slice at a time, and actually decommission the old code.
  • Shadow logging is non-negotiable — it's how you catch behavioural divergence before customers do.
  • Schedule the decommission as a hard milestone; otherwise you'll maintain both systems forever.
  • The routing layer is production-critical infrastructure and deserves matching investment in observability and reliability.