I spend a lot of time in large organizations that have spent an absolute fortune on highly-available systems, yet when those systems fail over (just as they were designed to do), most of the applications in the building break.
Why? Because the developers have assumed that nothing ever breaks and have written their code far too optimistically. Did they do their jobs?
Is it possible for their next layer of code to deal with, say, a server disappearing for a few seconds? Of course it is. But it’s not going to happen by accident. And coping with that kind of interruption is even more important in a cloud-based world.
There was a question about deadlocks again recently on one of our local mailing lists. Can you deal with deadlocks?
Again though, none of this is automatic. But allowing for (and perhaps even expecting) failure is one of the differences between building enterprise-level code and building toy code.
Plan for failure and be pleasantly surprised when it doesn’t happen often. But don’t plan for perfection or you’ll be disappointed.
While it is possible to handle deadlocks within T-SQL code, I prefer to catch them in the next layer of code (let’s call it client code here), as there are other types of errors that should be retried at that level anyway.
Applications should have retry logic to cope with things like:
- Deadlock (error 1205)
- Snapshot concurrency violations (error 3960)
- Server disconnection (can be due to network issues, fail-over of HA-based systems, etc.)
- Various resource issues on the server
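As a rough illustration of that list, client code can decide up front which failures are worth another attempt. The sketch below is only an assumption about how that might look: `Action`, `RETRYABLE_ERROR_NUMBERS`, and `classify` are hypothetical names, and the set of error numbers is deliberately limited to the two named above. Real code would extend it with whichever resource-related errors its own server actually reports.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    FAIL = "fail"


# Only the two SQL Server error numbers named in the list above.
# This set is illustrative, not exhaustive.
RETRYABLE_ERROR_NUMBERS = {
    1205,  # chosen as a deadlock victim
    3960,  # snapshot concurrency violation
}


def classify(error_number=None, connection_lost=False):
    """Decide whether a failed attempt is worth retrying.

    error_number: the server error number, if one was returned.
    connection_lost: True when the client lost its connection
    (network issue, failover, and so on).
    """
    if connection_lost:
        return Action.RETRY
    if error_number in RETRYABLE_ERROR_NUMBERS:
        return Action.RETRY
    # Anything not recognized is treated as permanent.
    return Action.FAIL
```

Defaulting unknown errors to `FAIL` is a design choice in this sketch: it avoids blindly repeating work that can never succeed.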
It’s important to get into the habit of assuming that a transaction that you need to apply to the DB might work, rather than assuming that it will work. Always apply it via logic like:
- While we haven’t applied the transaction to the server, and while the retry time/count hasn’t expired, let’s try to make it happen.
- If an error occurs, depending upon the error, we might back off for a while and try again.
- For things like deadlocks, it’s good to have some sort of exponential back-off with a random component.
- Some errors are pointless to retry (e.g., an insert that caused a primary key violation probably isn’t ever going to work).
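The steps above could be sketched roughly as follows. This is a minimal example under stated assumptions, not a definitive implementation: `TransientError` is a hypothetical exception that the data layer would raise for retryable failures such as deadlocks, `run_with_retry` and its parameters are made-up names, and it limits retries by count only (a time limit could be added in the same loop). The `operation` callable is assumed to run the whole transaction, so that it is safe to run again from the start.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical: raised by the data layer for errors worth retrying."""


def run_with_retry(operation, max_attempts=5, base_delay=0.1,
                   max_delay=5.0, sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or the attempts run out.

    Only TransientError is retried. Any other exception (for example,
    one representing a primary key violation) propagates immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted, let the caller see it
            # Exponential back-off: the ceiling doubles on each attempt,
            # capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random component: sleep somewhere between 0 and the ceiling
            # so that competing clients do not retry in lockstep.
            sleep(ceiling * rng())
```

A caller would wrap its unit of work, for instance `run_with_retry(lambda: apply_order(order))`, where `apply_order` is a hypothetical function that opens a connection, runs the transaction, and commits. The `sleep` and `rng` parameters exist only so the delays can be replaced in tests.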
Once you build it that way, things become much more robust and resilient. The user should generally be unaware of these issues, apart from a slight processing delay.