Client
National multichannel retailer
Sector
Retail and Consumer
Engagement
Cloud platform engineering, integration and SRE enablement
What the client needed
Rapid channel growth had outpaced the legacy commerce stack, driving checkout instability and seasonal risk.
How we worked
- Re-architected key commerce services around high-traffic customer journeys.
- Introduced SLOs and automated reliability controls in CI/CD pipelines.
- Modernised integration contracts between ecommerce, stock and fulfilment systems.
- Established a joint business-technology release readiness cadence.
Measured results
All details are anonymised in line with our standard confidentiality terms.
- Checkout availability improved to 99.96 percent during peak campaigns.
- Deployment frequency increased 4x, with a lower rollback rate.
- Platform unit cost per order reduced by 27 percent.
Context and constraints
The retailer had grown quickly across channels. What began as a website with a modest store estate had become a genuine omnichannel business, with click and collect, returns to store, marketplace listings, an app, and a customer base that expected to move between these touchpoints without friction. The commerce platform underneath, however, had been sized and shaped for a simpler world. It coped on ordinary days but struggled when demand spiked, and the most visible symptom was checkout instability: during promotions and seasonal peaks, the very moments when revenue and reputation were most at stake, customers encountered errors, slow responses, and abandoned baskets.
The constraints shaped everything we did. The business could not pause trading to rebuild, and the calendar was unforgiving: peak periods arrived on fixed dates whether or not the platform was ready. The existing stack was tightly coupled, so a problem in one area could cascade into checkout, and the catalogue, pricing, inventory, and order systems were entangled in ways that made change risky. There was also a commercial reality: leadership needed confidence before peak, not a promise of improvement at some distant point, so we had to deliver tangible stability gains within a single trading cycle while laying foundations for the longer term.
We also inherited a sceptical organisation. Previous attempts to scale had added capacity without addressing the underlying coupling, so the team had learned, reasonably, to distrust big-bang fixes. Earning trust meant showing results quickly and being honest about what we did not yet know.
The approach in depth
Our first move was to make the problem observable. Checkout instability had been discussed in terms of anecdotes and dashboards that measured the wrong things, so we instrumented the critical journeys end to end and established what actually failed under load, and why. You cannot scale what you cannot see, and the early telemetry quickly revealed that a handful of synchronous dependencies on the checkout path were responsible for a disproportionate share of failures when traffic surged.
With evidence in hand, we adopted a two-track approach. The first track was tactical hardening of the existing checkout to survive the next peak: removing or making asynchronous the riskiest synchronous calls, adding sensible timeouts and circuit breakers, introducing caching where data was read far more often than it changed, and load-testing against realistic peak profiles rather than optimistic averages. The second track was structural: decoupling the platform along clear domain boundaries so that inventory, pricing, catalogue, and order capabilities could scale and fail independently rather than dragging each other down.
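To illustrate the circuit-breaker idea mentioned above, here is a minimal Python sketch. It is not the client's implementation: the class name, thresholds and fallback mechanism are all assumptions made for the example, and the timeout itself is assumed to be enforced by the wrapped call (for instance by raising an exception when a dependency is too slow).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical breaker: stops calling a failing dependency for a cool-off period."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures  # illustrative threshold, not a recommendation
        self.reset_after = reset_after    # cool-off in seconds, also illustrative
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While open, skip the dependency entirely until the cool-off elapses.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-off elapsed: allow one trial call (half-open state).
            # A single further failure will re-open the breaker.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

The point of the pattern is that a slow or failing dependency costs the checkout path a bounded amount of time and then nothing at all until it recovers, with the fallback (a cached value or a sensible default) keeping the customer journey moving.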
Delivery phases and sequencing
We sequenced deliberately so that the business saw stability improvements before the structural work matured. The first phase was the tactical hardening, delivered and proven through load testing well ahead of the trading peak. This bought breathing room and, just as importantly, rebuilt confidence by showing that checkout could now absorb surges that previously caused failures.
The second phase introduced the domain boundaries incrementally, using an anti-corruption layer so the new services could coexist with the legacy core during transition. We extracted inventory availability first, because stale or inconsistent stock data was a frequent cause of failed orders, then pricing and promotions, which carried their own seasonal load. Order capture and orchestration followed, moving towards an event-driven model so that a slow downstream system could no longer block a customer from completing a purchase. Each extraction was a thin, end-to-end slice rather than a speculative rebuild, which kept risk contained and value visible.
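An anti-corruption layer of the kind described here can be pictured as a small translation step between the legacy data shape and the new domain model. The Python sketch below is purely illustrative: the legacy field names (`SKU_CD`, `QTY_OH`, `QTY_RSV`) and the `Availability` type are hypothetical and do not come from the client's systems.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Availability:
    """New-domain view of stock (hypothetical model for this example)."""
    sku: str
    sellable_units: int

    @property
    def in_stock(self) -> bool:
        return self.sellable_units > 0


def translate_legacy_stock(record: Mapping[str, str]) -> Availability:
    """Map a legacy stock record (assumed field names) to the new model."""
    on_hand = int(record["QTY_OH"])
    reserved = int(record.get("QTY_RSV", "0"))
    # Legacy data may over-reserve; the new model never exposes negative stock.
    sellable = max(on_hand - reserved, 0)
    return Availability(sku=record["SKU_CD"].strip(), sellable_units=sellable)
```

Keeping legacy quirks such as padded codes, string-typed quantities and over-reservation inside one translation function means the extracted service never has to know about them, which is what allows old and new to coexist during transition.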
Throughout, we ran progressive rollouts: shadow traffic, then a small percentage of real traffic, then full cutover, with automated rollback if key indicators degraded. This let us learn in production safely rather than betting everything on a single release.
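The gating logic behind a progressive rollout can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the indicator names, the thresholds and the `observe` callback are hypothetical, the shadow-traffic stage is omitted, and a real pipeline would observe each stage over a time window rather than taking a single reading.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Indicators:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float   # 95th percentile response time


def run_rollout(stages: Sequence[int],
                observe: Callable[[int], Indicators],
                max_error_rate: float = 0.01,
                max_p95_ms: float = 800.0) -> int:
    """Advance through real-traffic percentages and return the final percentage.

    Returns 0 if any stage breaches a threshold (rollback), else the last stage.
    Threshold defaults are illustrative assumptions only.
    """
    for percent in stages:
        seen = observe(percent)
        if seen.error_rate > max_error_rate or seen.p95_latency_ms > max_p95_ms:
            return 0  # roll back: route all traffic to the previous version
    return stages[-1] if stages else 0
```

Called as `run_rollout([5, 25, 100], observe)`, the function stops at the first unhealthy stage, so a regression is exposed to only a small share of customers before traffic returns to the previous version.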
Architecture and technology: decisions and trade-offs
The central decision was to separate the read-heavy customer-facing paths from the write-heavy transactional ones. We introduced caching and read models tuned for the browsing and basket experience, accepting carefully bounded staleness in exchange for resilience and speed, while keeping the order path authoritative and consistent. Inventory presented the sharpest trade-off: customers want to know what is in stock, but absolute real-time accuracy across every channel is costly and brittle. We chose a model that provided fast, mostly accurate availability for browsing, backed by a firm check at the point of order, which gave customers a good experience without the risk of overselling.
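The availability trade-off can be made concrete with a short Python sketch: a read model that tolerates bounded staleness for browsing, paired with an authoritative check when the order is placed. Everything named here is an assumption for illustration. The `AvailabilityReadModel` class, the 30-second bound and the in-memory dictionary standing in for the authoritative stock store are hypothetical, and a production system would need the check and decrement to be atomic.

```python
import time
from typing import Callable, Dict, Tuple


class AvailabilityReadModel:
    """Serves cached stock levels for browsing, refreshed after max_age seconds."""

    def __init__(self, load: Callable[[str], int], max_age: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.load = load          # reads the authoritative store
        self.max_age = max_age    # staleness bound in seconds (illustrative)
        self.clock = clock
        self._cache: Dict[str, Tuple[float, int]] = {}

    def browse(self, sku: str) -> int:
        cached = self._cache.get(sku)
        if cached is not None and self.clock() - cached[0] <= self.max_age:
            return cached[1]  # possibly stale, but never older than max_age
        units = self.load(sku)
        self._cache[sku] = (self.clock(), units)
        return units


def place_order(sku: str, quantity: int, stock: Dict[str, int]) -> bool:
    """Firm check against the authoritative store at the point of order."""
    if stock.get(sku, 0) < quantity:
        return False  # browsing may have shown stale stock; refuse to oversell
    stock[sku] -= quantity
    return True
```

Browsing traffic, which dominates at peak, never touches the authoritative store more than once per staleness window per item, while the order path remains the single place where correctness is enforced.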
We favoured managed cloud services and horizontal scaling for the components most exposed to peak, and we made autoscaling behaviour explicit and tested rather than assumed. We deliberately avoided rewriting parts of the system that worked acceptably, concentrating effort where the evidence pointed. The cost of this pragmatism was a period of coexistence between old and new, mediated by the anti-corruption layer, which we accepted as a sensible price for delivering stability on the calendar the business actually faced.
Measurable outcomes
Beyond the headline figures reported above, we avoid quoting precise numbers that belong to the client, but the pattern of results was consistent with what we typically see when coupling is the root cause of instability. Checkout survived the subsequent peak without the failures that had previously characterised it, basket abandonment attributable to errors fell, and the business was able to run promotions with confidence rather than apprehension. Decoupling inventory and pricing meant that a problem in one area no longer threatened the ability to take orders, which changed the operational character of peak from anxious vigilance to routine monitoring.
- Checkout hardened to absorb seasonal surges without cascading failures.
- Synchronous dependencies on the order path removed or made asynchronous.
- Inventory, pricing, and order capabilities decoupled to scale and fail independently.
- Realistic peak load testing established and run before each trading cycle.
- Progressive rollouts with automated rollback reduced release risk.
- Promotions run with confidence rather than apprehension.
Lessons learned
The clearest lesson was that adding capacity does not fix coupling. The earlier attempts to scale by throwing resources at the problem had failed precisely because a single slow dependency could still stall checkout regardless of how much hardware sat behind it. Identifying and breaking those couplings delivered far more resilience than raw capacity ever did.
A second lesson concerned sequencing against a fixed calendar. By delivering tactical hardening first and structural change second, we met the immediate commercial need without compromising the longer-term architecture, and we rebuilt the organisation's trust in the process.
The final lesson was the value of observability as a foundation rather than an afterthought: once the team could see what was happening on the critical paths, decisions became evidence-led, debates shortened, and the whole programme moved faster and more calmly.
Talk to us about a similar engagement. Email sales@halfteck.com.