1…2…3…Freeze…Peak
I haven’t written anything for ages because I have been the busiest I have ever been at my current employer – a large e-commerce website – as a result of preparation (planned and unplanned) for its largest ever “Peak” weekend, my company’s name for the Black Friday weekend.
Sorry, this is a long read, so grab a coffee or skip to the Summary – or don’t read it at all!
I’ve started writing this on the train back from London, where I was required to provide extra support for the Payments Platform (Domain) – where I ply my trade – a group of microservices and legacy components that support payment processing for the Website and sit very close to the Orders Domain.
I expected to be bleary-eyed and wired to the hilt, full of coffee and adrenaline, as my colleagues and I pulled out all the stops to make our giant, incredible machine work properly. But no: as our CIO put it, everything just worked (well, 99% of the time – more on this later).
The freeze part of the title refers to the fact that, in the weeks leading up to Peak, as we tried to ensure the stability of our software, only the most critical releases were allowed out. I, like many of my peers, do not think this is great. It is costly in many ways: not least, all of the feature development still going on within the company is shelved and starts to gather dust, and not releasing means the software accrues debt on many fronts. However, I can’t be too hasty in judgement; this way of working comes from something, somewhere along the way, biting my company really hard during this crucial trading period, so the mentality is understandable if not immediately, completely excusable. Hopefully this will improve as the company gains more maturity, confidence and control over its systems.
Application Insights
As part of the preparations it became clear that our services did not have sufficient monitoring. Services were performance tested in quasi-production-like environments, so we had some idea of how they might perform – but only an idea. We had some expectations about performance but very little to help us see operational health, and measuring and keeping track of either would have been impossible without telemetry and logging of some sort.
We already had some monitoring available via dashboards powered by Grafana. A number of dashboards show Order flow and, as a result, part of an Order’s path through Payments, but only at a high level. Various counters represent the number of Orders at a particular status – e.g. arriving from the Website, currently being billed, shipped to the warehouse – and these are used to help generalise the performance and health of the backend systems as Orders wing their way through the Website, through Order and Payment processing and much more, and then, eventually, out to the warehouse(s).
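As a rough illustration of the kind of plumbing behind those counters (entirely hypothetical – the metric names and the StatsD transport here are assumptions, not a description of our actual setup), emitting a counter per status transition is all a dashboard like this really needs:

```typescript
// Hypothetical sketch: pushing order-status counters over the StatsD UDP
// line protocol so a Grafana dashboard can graph the rate of Orders at
// each status. Host, port and metric names are invented for illustration.
import { createSocket } from "node:dgram";

const socket = createSocket("udp4");
const STATSD_HOST = "statsd.internal"; // hypothetical collector
const STATSD_PORT = 8125;

type OrderStatus = "received" | "billing" | "billed" | "sent_to_warehouse";

function recordStatusTransition(status: OrderStatus): void {
  // StatsD counter line: <metric>:<value>|c
  const line = Buffer.from(`orders.status.${status}:1|c`);
  socket.send(line, STATSD_PORT, STATSD_HOST);
}

recordStatusTransition("billing");
```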
If problems start occurring – some threshold, represented by numbers and traffic-light boxes, being exceeded or not met – focused investigations can take place in a particular area. The problem with this is that between the different statuses sits a vast swathe of software services in locations all over the world: new-world microservices, legacy systems, stuff in the sky, messaging, RESTful APIs, document DBs, SQL DBs, NoSQL DBs, edge caching and on-premise hardware. Application Support have lots of fun trying to diagnose errors and, where they are at a loss, they hand over to the software development teams/developers, who mostly run around in blind panic at the same time.
More recently, with the release of updated versions of the Website (which is multi-platform/device), New Relic was enlisted to help see how the various APIs consumed by the Website were behaving, and this has started to build a more fine-grained picture of the operational behaviour of the APIs (and, by association, the Website): various HTTP statuses can be obtained and detailed analysis of response times, payload sizes and customer locations can be gleaned, but that’s about it in terms of insight into what an API is actually doing. There is yet another monitoring tool, Kibana (sitting atop the ELK stack), which is used for some of our Azure-based services; largely this gives transparency to telemetry just like New Relic. My money’s on New Relic in this area though.
So, performance counters, custom and otherwise, were conveniently placed at the door of our APIs, on the 3rd-party calls within them, and in and around the messaging infrastructure used in the backend processing for our services. Relevant logging was added too, to ensure that in the event of failure or exceptional behaviour we had some trace of what was going on. Having all of this information, but not at your fingertips, is a bit of a nightmare.
Consolidating it all into one single place is achieved to dramatic effect using Application Insights hosted in Azure. Simply add the appropriate libraries to the service, hook it up to an Application Insights resource in Azure, and boom – you are presented with a bewildering array of options for viewing your telemetry. Coupled with this you can also leverage Analytics to run queries over all of the data collected, and herein lies a problem: making sense of all of this information is hard. Certainly there are some headline acts that are easily noticeable, but more subtle problems – a semi-dodgy 3rd-party call, or reasonably flaky database writes – can only be eked out after some considered tweaking of queries and the like. Also, without reasonable SLAs from the business in most places, and given that a message queueing system is used, which alerts should fire and when is a question we are still asking ourselves.
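For flavour, here is a minimal sketch of that wiring – shown with the Node.js applicationinsights SDK for brevity; the connection string and the metric/dependency names are placeholders, and the equivalent hooks exist in the SDKs for other stacks:

```typescript
// Minimal sketch of wiring a service into Application Insights using the
// Node.js "applicationinsights" SDK. Connection string, metric and
// dependency names are placeholders, not our real ones.
import * as appInsights from "applicationinsights";

appInsights
  .setup("<your-connection-string>")
  .setAutoCollectRequests(true)      // incoming HTTP requests at the API "door"
  .setAutoCollectDependencies(true)  // outbound calls to 3rd parties
  .setAutoCollectExceptions(true)
  .start();

const client = appInsights.defaultClient;

// A custom metric, e.g. the depth of a backend message queue.
client.trackMetric({ name: "payments.queue.depth", value: 42 });

// A hand-rolled dependency record for a call the auto-collector can't see.
client.trackDependency({
  target: "psp.example.com",
  name: "AuthorisePayment",
  data: "authorise",
  duration: 231,          // ms
  resultCode: 0,
  success: true,
  dependencyTypeName: "PSP",
});
```

The Analytics queries mentioned above then run over the requests, dependencies and customMetrics tables that this populates.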
Long nights and Instability
Getting to a good place involves a lot of pain, and boy did we experience our fair share in the run-up to Peak. With SAN migrations causing mayhem with our messaging, Windows cluster failovers misbehaving and deleted Orders in Azure, there was plenty of time to practise solving problems quickly while still meeting “customer promise” – the term used for meeting the cut-off for various delivery options, which is not taken lightly.
Graph Watching First Blood Part III
So the culmination of all the Black Friday prep was a series of graph-watching sessions: eyes trained on graphs rolling around in multi-tabbed browsers, with lots of suggestion and conjecture about what this trend or that trend was saying. I was personally involved in 4 of the 5 days of Peak support: remotely on Thursday, Black Friday and some of Saturday, and on premise (by demand) on Cyber Monday. Certainly at the beginning of the event focus was very high, and everyone was apprehensive about what might occur and what problems might need to be addressed. The previous year’s Peak had seen all manner of catastrophes, including site outages.
The biggest problem this year, from a Payments perspective, was that a legacy component dealing with timeouts from a Payment Service Provider couldn’t reliably recover and compensate, so Orders would be left in a state where we didn’t know whether they had been billed. Simply retrying those Orders risked a cancellation, as the legacy component is pretty dumb, so a quick support tool was knocked up and, like a scene from Lost, a button was pushed every so often to make sure they were processed properly.
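Presumably the safe way to handle such Orders is to ask the PSP what actually happened before deciding whether to retry. A rough, entirely hypothetical sketch of that idea (the PSP client, statuses and order store are invented for illustration – the real thing was a quick internal tool):

```typescript
// Hypothetical "check before retry" for Orders stuck in an unknown billing
// state. All types here are invented for illustration.
type BillingState = "BILLED" | "NOT_BILLED" | "UNKNOWN";

interface PspClient {
  // Ask the Payment Service Provider what actually happened to a payment.
  queryPaymentStatus(orderId: string): Promise<BillingState>;
}

interface OrderStore {
  markBilled(orderId: string): Promise<void>;
  resubmitForBilling(orderId: string): Promise<void>;
}

async function reconcileStuckOrder(
  orderId: string,
  psp: PspClient,
  orders: OrderStore
): Promise<void> {
  const state = await psp.queryPaymentStatus(orderId);
  if (state === "BILLED") {
    // The payment went through before the timeout: don't retry, or we risk
    // a duplicate charge or a cancellation downstream.
    await orders.markBilled(orderId);
  } else if (state === "NOT_BILLED") {
    // Safe to push the Order back through billing.
    await orders.resubmitForBilling(orderId);
  }
  // UNKNOWN: leave it alone and check again later.
}
```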
The highlight, though, is that everything generally worked. On Cyber Monday, due to a failure in a voucher service, Orders dropped off a cliff. Fixing the problem and spreading the word to our persistent and patient customers meant that between 9 and 10pm an average of 55 orders a second were being processed, all of which landed on Payments’ doorstep. The system coped admirably, and a backlog of messages – up to about 40k at one point, spread across our system – was slowly but surely dealt with.
Summary
The Payments Platform now has a lot of data about its capabilities during spiky load periods, a greater operational understanding, and has breathed a big sigh of relief at having been able to support Black Friday. Thing is, it’s going to be even bigger next year, and we are just about to start preparations to make our Platform even more resilient, scalable and available. I’m looking forward to it. We will have a lot more control over our systems by this time next year, and the maturity and experience to put what we have learnt to good use.
Up next… Reactive Extensions, or maybe even HAL (Hypertext Application Language)