1…2…3…Freeze…Peak



I haven’t written anything for ages as I’ve been the busiest I have ever been at my current employer – a large E-Commerce Website – as a result of preparation (planned and unplanned) for its largest ever “Peak” weekend, my company’s name for the Black Friday weekend.

Sorry, this is a long read, so grab a coffee, skip to the Summary, or don't read it at all!

I’ve started writing this on the train coming back from London, where I was required to provide extra support for the Payments Platform (Domain) – where I ply my trade – a group of Microservices and Legacy components supporting Payment processing for the Website and sitting very close to the Orders Domain.

I expected to be bleary-eyed and wired to the hilt, full of coffee and adrenaline, as my colleagues and I worked like mad to pull out all the stops to make our giant, incredible machine work properly. But no: as our CIO put it, everything just worked (well, 99% of the time – more on this later).


The freeze part of the title refers to the fact that, in the weeks leading up to Peak, as we tried to ensure the stability of our software, only the most critical releases were allowed out. I, like many of my peers, do not think this is great. It is costly in many ways; not least because all of the feature development still going on actively within the company is shelved and starts to gather dust, and because software that isn't being released is accruing debt on many fronts. However, I can't be too hasty in judgement: this way of working comes from something, somewhere along the way, biting my company really hard during this crucial trading period, so the mentality is understandable if not immediately completely excusable. Hopefully, this will get better as the company gains more maturity, confidence and control over its systems.

Application Insights

As part of the preparations it became clear that our services did not have sufficient monitoring capabilities. Services were performance tested in quasi-production-like environments, so we had some idea about how they might perform, but only an idea, and very little in the way of visibility into operational health. Measuring and keeping track of both would have been impossible without telemetry and logging of some sort.

We already had some monitoring available via dashboards powered by Grafana. A number of dashboards exist showing Order flow and, as a result, part of an Order's path through Payments, but only at a high level. With various counters representing the number of Orders at a particular status (e.g. arriving from the Website, currently being billed, or shipped to the warehouse), these dashboards are used to help generalise the performance and health of the backend systems as Orders wing their way through the Website, through Order and Payment processing and much more, and then, eventually, out to the Warehouse(s).
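For a rough idea of what sits behind a dashboard like that, here is a minimal sketch of a per-status Order counter, assuming a Prometheus-style backend scraping the service via the prom-client npm package; the actual pipeline feeding our Grafana dashboards isn't described here and may well be different.

```typescript
// Minimal sketch: a counter of Orders at each processing status, exposed for
// a metrics backend to scrape so that Grafana can chart it. The backend and
// all of the names here are assumptions, not our real pipeline.
import * as http from "http";
import { Counter, register } from "prom-client";

// One counter, labelled by status (e.g. received, billing, shipped).
const ordersByStatus = new Counter({
  name: "orders_total",
  help: "Orders observed at each processing status",
  labelNames: ["status"],
});

// Called by order-processing code as an Order moves between statuses.
export function recordOrderStatus(status: "received" | "billing" | "shipped"): void {
  ordersByStatus.inc({ status });
}

// Expose /metrics for the backend to scrape; Grafana then queries the backend.
http
  .createServer(async (_req, res) => {
    res.setHeader("Content-Type", register.contentType);
    res.end(await register.metrics());
  })
  .listen(9464);
```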

If problems start occurring, signalled by thresholds being exceeded or not met (represented by numbers and traffic-light boxes), focussed investigations can take place in a particular area. The problem with this is that, in between the different statuses, there is a vast swathe of software services in locations all over the world: New-World Microservices, Legacy systems, stuff in the sky, Messaging, RESTful APIs, document DBs, SQL DBs, NoSQL DBs, edge caching, and hardware on premise. Application Support have lots of fun trying to diagnose errors and, where they are at a loss, they hand over to software development teams/developers, who mostly run around in blind panic at the same time.

More recently, with the release of updated versions of the Website (which is multi-platform/device), New Relic was enlisted to help us see how the various APIs consumed by the Website were behaving. This has helped to start building a more fine-grained picture of the operational behaviour of the APIs (and, by association, the Websites): various HTTP statuses can be obtained, and detailed analysis of response times, payload sizes and customer locations can be gleaned, but that's about it in terms of insight into what an API is actually doing. There is yet another monitoring tool, Kibana (sitting atop the ELK stack), which is used for some of our Azure-based services; largely this gives transparency to telemetry just like New Relic. My money's on New Relic in this area though.

So performance counters, custom and otherwise, were conveniently placed at the door of our APIs, on the third-party calls within them, and in and around the messaging infrastructure used in the backend processing for our services. Relevant logging was added too, to ensure that in the event of failure or exceptional behaviour we had some trace of what was going on. Having all of this information, but not at your fingertips, is a bit of a nightmare.
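To make that concrete, here is a hedged sketch of the sort of wrapper involved: time a third-party call, record the duration and outcome as a metric, and leave a trace in the logs on failure. The metric and log sinks are deliberately left as parameters, because the real services used their own performance counters and logging infrastructure.

```typescript
// Illustrative only: time a third-party call, record duration and outcome,
// and log when it fails. The sinks passed in are placeholders.
async function instrumentedCall<T>(
  name: string,
  call: () => Promise<T>,
  recordMetric: (name: string, ms: number, success: boolean) => void,
  log: (message: string) => void
): Promise<T> {
  const started = Date.now();
  try {
    const result = await call();
    recordMetric(name, Date.now() - started, true);
    return result;
  } catch (err) {
    recordMetric(name, Date.now() - started, false);
    log(`${name} failed after ${Date.now() - started}ms: ${String(err)}`);
    throw err;
  }
}

// Hypothetical usage around a payment service provider call:
// const response = await instrumentedCall(
//   "psp.authorise",
//   () => fetch("https://psp.example.com/authorise", { method: "POST" }),
//   (n, ms, ok) => console.log(`${n} ${ms}ms success=${ok}`),
//   console.error
// );
```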

Consolidating all of this into one single place is achieved to dramatic effect using Application Insights hosted in Azure. Simply add the appropriate libraries to the service, hook it up to an App Insights resource in Azure, and boom, you are presented with a bewildering array of options for presenting your telemetry. Coupled with this, you can also leverage Analytics to perform queries on all of the data collected, and herein lies a problem: making sense of all of this information is hard. Certainly there are some headline acts that are easily noticeable, but more subtle problems, like a semi-dodgy third-party call or reasonably flaky database writes, can only be teased out after some considered tweaking of queries and so on. Also, without reasonable SLAs from the business in most places, and given that a message-queueing system is used, what alerts we should set up, and when they should fire, is a question we are still asking ourselves.
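As a sketch of how little hooking up is involved, here is the Node.js Application Insights SDK being wired in and a custom dependency call being tracked. Our services are not Node, so treat this as an illustration of the pattern rather than what we actually deployed; the instrumentation key and the dependency details are placeholders.

```typescript
// Sketch of "add the library and hook it up", using the Node.js Application
// Insights SDK for illustration. The key and values below are placeholders.
import * as appInsights from "applicationinsights";

appInsights
  .setup("InstrumentationKey=00000000-0000-0000-0000-000000000000")
  .setAutoCollectRequests(true)      // incoming HTTP requests
  .setAutoCollectDependencies(true)  // outgoing HTTP/SQL calls
  .setAutoCollectExceptions(true)
  .start();

const client = appInsights.defaultClient;

// Custom telemetry around a third-party call, so a semi-dodgy dependency
// shows up alongside the auto-collected data and can be queried in Analytics.
client.trackDependency({
  target: "psp.example.com",   // hypothetical payment service provider
  name: "Authorise",
  data: "POST /authorise",
  duration: 231,               // milliseconds
  resultCode: 200,
  success: true,
  dependencyTypeName: "HTTP"
});
```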

Long nights and Instability

Getting to a good place requires a lot of pain to be had, and boy did we experience our fair share in the run-up to Peak. With SAN migrations causing mayhem with our messaging, Windows Cluster failovers misbehaving and Orders being deleted in Azure, there was plenty of time to practice and gain experience in solving problems quickly while still meeting “customer promise”, the term used to describe meeting the cutoff for various delivery options, which is not taken lightly.

Graph Watching First Blood Part III

So the culmination of all the Black Friday prep was a series of graph-watching sessions, with eyes trained on graphs rolling around in multi-tabbed browsers and lots of suggestions and conjecture about what this trend or that trend was saying. I was personally involved in four of the five days of Peak support: remotely on Thursday, Black Friday and some of Saturday, and on premise (by demand) on Cyber Monday. Certainly at the beginning of the event focus was very high and everyone was apprehensive about what might occur and what problems might need to be addressed. Previous years' Peaks had seen all manner of catastrophes, including site outages and various other problems.


   
The biggest problem this year, then, from a Payments perspective, was that a Legacy component dealing with timeouts from a Payment Service Provider couldn't reliably recover and compensate, and so Orders would be left in a state where we didn't know whether they had been billed. Simply retrying those Orders meant risking a cancellation, as the legacy component is pretty dumb, so a quick support tool was knocked up and, like a scene from Lost, a button was pushed every so often to make sure these Orders were processed properly.
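For what it's worth, the general pattern behind that kind of recovery looks something like the sketch below: check with the provider what actually happened before deciding whether an Order is safe to retry. This is purely illustrative of the idea; every function name here is hypothetical, and the real support tool is not described in this post.

```typescript
// Illustrative sketch only: before retrying an Order whose billing attempt
// timed out, ask the payment service provider what actually happened, so a
// blind retry can't double-bill or trigger an unwanted cancellation.
// checkPspStatus, markBilled and retryBilling are all hypothetical.
type PspStatus = "authorised" | "declined" | "unknown";

async function reconcileTimedOutOrder(
  orderId: string,
  checkPspStatus: (orderId: string) => Promise<PspStatus>,
  markBilled: (orderId: string) => Promise<void>,
  retryBilling: (orderId: string) => Promise<void>
): Promise<void> {
  const status = await checkPspStatus(orderId);
  if (status === "authorised") {
    // Payment actually went through: just bring our records up to date.
    await markBilled(orderId);
  } else if (status === "declined") {
    // Definitely not billed: safe to retry.
    await retryBilling(orderId);
  } else {
    // Still ambiguous: leave it for the next pass (or manual investigation)
    // rather than risk a duplicate charge or a cancellation.
    console.warn(`Order ${orderId} is still in an unknown billing state`);
  }
}
```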

The highlight, though, is that everything generally worked. On Cyber Monday, due to a failure in a voucher service, Orders dropped off a cliff. Fixing the problem and spreading the word to our persistent and patient customers meant that between 9pm and 10pm an average of 55 orders a second were being processed, and all of this would land on Payments' doorstep. The system coped admirably, and the backlog of messages, of which there were up to about 40k at one point (spread across our system), was slowly but surely dealt with.

Summary

The Payments Platform now has a lot of data regarding its capabilities during spiky load periods, a greater operational understanding, and has breathed a big sigh of relief at having been able to support Black Friday. Thing is, it's going to be even bigger next year, and we are just about to start preparations to make our Platform even more resilient, scalable and available. Looking forward to it. We will have a lot more control of our systems come this time next year, and the maturity and experience to put what we have learnt to good use.

Up next… Reactive Extensions, or maybe even HAL (Hypertext Application Language).



