The Good Case: Stabilising through Corona Peak Load
Our customer was experiencing traffic at 5-10 times its typical level. Visitor numbers, transactions, and page requests to the Digital Commerce & Content Platform were peaking. This unexpected “peak season” was a result of the global lockdown during the coronavirus pandemic.
Although most parts of the platform had been built and tested very thoroughly, the platform could not always cope “on its own” with high regional loads.
As best-in-class quality of service was an “absolute must-have” for our customer, we spent a lot of time on observability concerns during the initial go-live and the last holiday season. We had built up solid, fine-grained monitoring and alerting capabilities for both the infrastructure and the application.
Through intense performance and endurance tests, followed by tuning activities (towards lower peak-level targets), we had gathered detailed knowledge about the platform’s behaviour during peak load times and were able to trace traffic and watch known bottlenecks.
As our sensors/monitoring tools detected anomalies (initially without customer impact), our teams were able to quickly react to these incidents and could mitigate a number of outages.
Nevertheless, the platform went offline when a change (a new feature) was applied: the system became unstable and crashed.
Because our 24/7 first- and second-level support team (which is foundational to the Managed Services provided by Mindcurv) reacted quickly, even though the incidents occurred after office hours, a workaround (scaling up memory and CPU) was identified, and the platform was up and running again after only 14 minutes.
Later, together with other parties, we identified the real root cause: an issue in the application that had been introduced with the last feature release.
Our customer believed in the importance of professional platform operations and had agreed to build up a 24/7 support organization with strict SLAs. They also had proper monitoring and alerting solutions in place and had invested thoroughly in architecture reviews, performance tests and optimisation.
Therefore, we could familiarise ourselves with the platform behaviour under load, and were able to mitigate and resolve this critical “P1” incident quickly and reduce platform downtimes to a minimum.
On any single day, >5m EUR in revenue is generated on this platform, so it was vital for the customer to be back online quickly. Customers (in this case grocery shoppers) will soon look for alternatives if they cannot transact immediately. A downtime of 4 hours could have resulted in a revenue loss of >2m EUR, and potentially the loss of thousands of customers who might turn to a competitor for their grocery needs.
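To put the figures above into proportion, the arithmetic can be sketched in a few lines of Python. This is a rough illustration only, not the customer's actual revenue model: it takes the ">5m EUR" per day as exactly 5m EUR, and the function names and the even spread of revenue over 24 hours are assumptions made for this example. Read this way, the >2m EUR figure corresponds to a 4-hour window that carries about 40% of a day's revenue, i.e. an outage during peak trading hours.

```python
MINUTES_PER_DAY = 24 * 60


def uniform_loss_eur(daily_revenue_eur: float, downtime_minutes: float) -> float:
    """Revenue lost if sales were spread evenly over the day (simplifying assumption)."""
    return daily_revenue_eur * downtime_minutes / MINUTES_PER_DAY


def implied_revenue_share(loss_eur: float, daily_revenue_eur: float) -> float:
    """Share of a day's revenue that a given loss represents."""
    return loss_eur / daily_revenue_eur


# Assumption: ">5m EUR" per day is treated as exactly 5m EUR (a lower bound).
daily_revenue = 5_000_000

loss_14_min = uniform_loss_eur(daily_revenue, 14)        # the actual outage
loss_4_h = uniform_loss_eur(daily_revenue, 4 * 60)       # the hypothetical 4-hour outage
peak_share = implied_revenue_share(2_000_000, daily_revenue)

print(f"14 min, even spread: {loss_14_min:,.0f} EUR")
print(f"4 h, even spread:    {loss_4_h:,.0f} EUR")
print(f"Share of daily revenue implied by a 2m EUR loss: {peak_share:.0%}")
```

Under the even-spread assumption, the 14-minute outage costs a small fraction of what a 4-hour outage would, which is the practical value of the fast workaround described above.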