The “Not so good” Case: Rescuing a Burning Platform
Before we began working with our customer, they had successfully established their B2B digital commerce business and, together with their agency partner, scaled it to a low 9-digit revenue range.
At a certain point, the digital commerce platform began to suffer from sudden slowness and stability issues.
Initially, these occurred monthly, then weekly. When the platform crashed 3 times in a single week, Senior Management asked us to take over and stabilise this “burning platform”.
A team consisting of a Software Architect (SAP hybris), an Infrastructure Architect, an Operational Analyst and a Service Delivery Manager executed the first stage of our Rapid Stabilisation Approach. This entailed a 2-day health check in which we interviewed key stakeholders and reviewed business cases, the organisational setup (in this case of both the customer and the supporting agency), process and system documentation, and incident, problem and Root Cause Analysis reports.
Additionally, we reviewed critical areas of the digital commerce platform's source code and performed a high-level dependency check.
It quickly became evident that the agency partner had limited experience in professional service operations, and that our customer had overlooked the “technical debt reduction needs” brought forward by the agency partner.
As a result, only very basic monitoring existed, there was no alerting, and no SLA had been agreed.
The infrastructure contained several Single Points of Failure and the applications had hard dependencies on surrounding systems (pricing).
Furthermore, deployment processes and configuration management were immature and poorly managed, and therefore error-prone.
Service Management / Incident reports hardly existed and no one was able to report on KPIs and platform quality.
We convinced the customer to install our suite of tools for monitoring, alerting and service management within 2 days. In addition, we set up a “hyper care” team that actively managed the platform and learned about its peak-load behaviour on the fly.
We executed a quick Knowledge Transfer according to our Interview-Style Rapid Stabilisation Approach. In addition, we repurposed Non-Production Environments to scale out the platform, developed rolling-restart protocols and decoupled the digital commerce platform from the pricing engine.
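The case does not describe how the decoupling from the pricing engine was implemented. As an illustration only, one common pattern for softening such a hard dependency is to serve the last known price and stop calling the remote system for a while once it has failed repeatedly. The sketch below assumes this pattern; the class name DecoupledPriceClient, the fetch_price callable and all thresholds are hypothetical and not taken from the project.

```python
import time
from typing import Callable, Dict, Optional


class PricingUnavailable(Exception):
    """Raised when neither a live nor a cached price can be served."""


class DecoupledPriceClient:
    """Hypothetical wrapper around a pricing engine call (illustrative only).

    Serves the last known price when the pricing engine fails and stops
    calling it for a cooldown period after repeated failures.
    """

    def __init__(
        self,
        fetch_price: Callable[[str], float],  # assumed remote call, may raise
        failure_threshold: int = 3,           # assumed value
        cooldown_seconds: float = 30.0,       # assumed value
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_price = fetch_price
        self._failure_threshold = failure_threshold
        self._cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._cache: Dict[str, float] = {}
        self._consecutive_failures = 0
        self._open_until: Optional[float] = None

    def _circuit_open(self) -> bool:
        # While the circuit is open, the pricing engine is not called at all.
        return self._open_until is not None and self._clock() < self._open_until

    def get_price(self, sku: str) -> float:
        if not self._circuit_open():
            try:
                price = self._fetch_price(sku)
            except Exception:
                self._consecutive_failures += 1
                if self._consecutive_failures >= self._failure_threshold:
                    # Too many failures in a row: pause remote calls.
                    self._open_until = self._clock() + self._cooldown_seconds
            else:
                # Success: reset the failure state and refresh the cache.
                self._consecutive_failures = 0
                self._open_until = None
                self._cache[sku] = price
                return price
        # Fallback path: serve the last known price if there is one.
        if sku in self._cache:
            return self._cache[sku]
        raise PricingUnavailable(sku)
```

With a wrapper of this kind, a slow or failing pricing engine degrades price freshness instead of taking the shop down. Whether stale prices are acceptable, and for how long, is a business decision that would have to be agreed with the customer.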
The platform was stabilised in the following week: downtimes were first reduced to <5 minutes and then prevented completely.
After completing this first phase of the Rapid Stabilisation Initiative (Survive), we developed an improvement plan in a second phase (Stay Live) that took around 6 weeks.
In the improvement execution project, we set up a completely new infrastructure free of single points of failure, including a completely new configuration management approach to create the infrastructure automatically.
In addition, we introduced lightweight, but professional Service Management Processes, helped to identify/analyse key solution problems and established a technical debt backlog.
This second phase took an additional 4 months; subsequently, unplanned yearly downtime was reduced to <10 hours.
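To put the downtime figure in perspective, it can be converted into an availability percentage. The short calculation below is a sketch that assumes a 24/7 service window and a 365-day year; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming 24/7 operation and no leap year


def availability_percent(downtime_hours_per_year: float) -> float:
    """Convert yearly unplanned downtime in hours into an availability percentage."""
    if not 0 <= downtime_hours_per_year <= HOURS_PER_YEAR:
        raise ValueError("downtime must be between 0 and 8760 hours")
    return 100.0 * (1 - downtime_hours_per_year / HOURS_PER_YEAR)


# Upper bound from the case: less than 10 hours of unplanned downtime per year.
print(f"{availability_percent(10):.3f}%")  # prints 99.886%
```

Under these assumptions, less than 10 hours of unplanned downtime per year corresponds to an availability of better than roughly 99.89%.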