Messaging and transient issues.
A small post, but something worth considering.
Recently, while discussing a situation where a MongoDB replica set was in the process of failing over, concern was raised about writing data whilst a new primary was being elected.
This is a transient issue, and similar problems - such as temporary server outages and routing failures - are too.
They take a little time to resolve, but should do so fairly quickly. In the meantime, there are a few options available whilst the transient issue sorts itself out:
- Do nothing. In this case, give up and find another job, you lazy hacker.
- Let the process fall over, report the failure to users and let them try again via a button click. A user's experience might be sullied - in the opinion of some - but it could still be a reasonable way to recover (this really depends on what the stakeholders/business think). It might not be acceptable for important information which must be stored and cannot be optionally retried.
- Employ a retry mechanism: simply loop a reasonable number of times until we get success, or employ an exponential back-off to give the system time to recover. I have done the former using, as someone I know put it, some funky "AOP" shit. However, I wouldn't recommend the AOP approach for transaction management or bounded retries, because eventual consistency still may never occur, and AOP, certainly in a lot of the frameworks I have used, is complete magic. I had trouble explaining to some of the less experienced - and even seasoned - members of my team what that "Transaction" attribute above my DAL method was, how it was set up on the IoC container, and how code was executed before and after the method in an execution pipeline.
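The bounded-retry-with-back-off idea needs no AOP magic at all; it fits in a plain function. Here is a minimal sketch in Python (the `flaky_write` operation and all names are hypothetical, just to show the shape of the pattern):

```python
import random
import time

def retry(operation, attempts=5, base_delay=0.1, max_delay=5.0):
    """Call operation() until it succeeds, waiting with exponential
    back-off (plus a little jitter) between attempts, up to `attempts` tries."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # retries exhausted: let the failure propagate
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))

# Hypothetical flaky operation: fails twice (e.g. no primary yet), then succeeds.
calls = {"n": 0}
def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("no primary elected yet")
    return "ok"

print(retry(flaky_write, base_delay=0.01))  # prints "ok" after two retries
```

The jitter matters in practice: if many clients fail at the same moment, identical back-off schedules would have them all retry in lockstep and hammer the recovering server together.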
- Use a durable messaging framework like NServiceBus. The command or event (message) which was sent will fail and, if second-level retries are enabled, will be retried a specified number of times with exponential back-off. If still unsuccessful after those retries, the message is placed on an error queue and the relevant administrators are notified, or are at least reporting on this stuff. The exception or underlying problem is noted and hopefully fixed, and the message in the error queue is then replayed, bringing everything back into a consistent state. And all of this with the user completely unaware that four data centers were nuked.
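NServiceBus automates all of that for you. For readers not on .NET, the retry-then-error-queue-then-replay flow can be sketched generically; this is an illustration of the pattern only, not the NServiceBus API, and every name below is made up:

```python
from collections import deque

class Endpoint:
    """Toy message endpoint: retry a handler a few times, then park the
    message on an error queue for later replay, rather than losing it."""

    def __init__(self, handler, retries=3):
        self.handler = handler
        self.retries = retries
        self.error_queue = deque()  # durable storage in a real system

    def process(self, message):
        for _ in range(self.retries):
            try:
                self.handler(message)
                return True
            except Exception:
                continue  # a real endpoint would back off between attempts
        self.error_queue.append(message)  # parked, not lost
        return False

    def replay_errors(self):
        """After the underlying fault is fixed, replay parked messages."""
        parked, self.error_queue = self.error_queue, deque()
        for message in parked:
            self.process(message)

# Hypothetical usage: the handler fails until the 'outage' ends.
state = {"down": True}
def store_order(message):
    if state["down"]:
        raise ConnectionError("datastore unavailable")

endpoint = Endpoint(store_order)
endpoint.process({"order": 1})   # fails, lands on the error queue
state["down"] = False            # fault fixed by an administrator
endpoint.replay_errors()         # parked message replayed successfully
```

The key property is that failure is an operational event (a queue an admin watches) rather than lost data or an error the user has to deal with.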
That is all.