DevOps Zone is brought to you in partnership with:

I have been working for almost two years now on infrastructure and deployment automation, exploring programmatic solutions to traditional systems administration problems and configuration management. I'm fanatical about testing, the scientific method and building good tools to support awesome   Oliver is a DZone MVB and is not an employee of DZone and has posted 29 posts at DZone. You can read more from them at their website. View Full User Profile

Efficient Retry/Backoff Mechanisms

10.17.2013
| 5843 views |
  • submit to reddit

In day to day life as a systems engineer there aren’t too many opportunities to build solutions based purely on theoretical studies or research papers – generally solutions are built for you and you “just” need to make them work, tie up the spaghetti, shave the yak and so on. Entering the world of programming and software development though, a lot of new problems come up that you can find your own solution to. On one hand this can be extremely liberating – speaking from experience, having for the first time in my career being employed as a software engineer for just over a year now. You can let your creativity go wild, dream up whatever crazy solution and then implement it and see if it works.

On the other hand, there is a vast body of existing work out there to build on, almost exclusively by people smarter than you are and who have put a lot of effort and thought into solving a specific problem. Some good examples in recent times are Paxos and similarly Raft consensus algorithms, F1 Distributed Database, BigTable, Map/Reduce and many many more. A great blog that I try to keep on top of is Werner Vogels’ All Things Distributed, where he maintains a mix of AWS-related announcement-style posts and musings on the various research papers he is reading while on flights between engagements. Needless to say that the Google Research Papers are another valuable source of information, as well as a lot of things coming out of the various SIG* organizations (e.g. SIGOPS, SIGGRAPH).

But back to my point. Having recently coded up a few systems which hurl copious amounts of requests at various AWS services, I very quickly found out that you WILL get errors and need to handle them in a smart way. Not retrying is not an option, as is retrying indefinitely with no limit. Linearly-spaced retries with a limit tend to also suffer from abusing the target system beyond what it can tolerate, before giving up entirely and perhaps prematurely. Once you start to implement a system with exponential backoff similar to how most multi-access medium algorithms work (think Ethernet with CSMA/CD), you are finally starting to get somewhere.

Someone at $EMPLOYER recently linked to this paper which details backoff mechanisms utilising a Fibonacci sequence. It is very interesting work, and taking the time to read and understand such research allows you to build stronger, more resilient software with characteristics far exceeding what you could have come up with on your own. This in itself I find fascinating and addictive. I strongly recommend giving it a read and utilizing this approach to building software.



Published at DZone with permission of Oliver Hookins, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)