
An early believer in the ability of Java to deliver "enterprise-grade" software, Andrew Phillips quickly focused on the development of high-throughput, resilient and scalable J2EE applications. Specializing in concurrency and high-performance development, Andrew gained substantial experience of the intricacies, complexity and challenges of enterprise application environments while working for a succession of multinationals. Continuously focused on effectively integrating promising new developments in the Java space into corporate software development, Andrew joined XebiaLabs in March 2009, where he is a member of the development team of their deployment automation product Deployit. Amongst others, he also contributes to Multiverse, an open-source Java STM implementation, and jclouds, a leading Java cloud library.

Embracing Downtime: Why 99.999…% Availability is Not Always Better

12.09.2010

A couple of weeks ago, my ever-active colleagues Marco Mulder and Serge Beaumont organised an nlscrum meetup about "Combining Scrum and Operations", with presentations by Jeroen Bekaert and devopsdays organiser Patrick Debois.

Unfortunately, I was late and only managed to catch the tail end of Patrick's well-delivered talk explaining how Dev/ops can become Devops. Thankfully, the lively open space discussions that followed provided plenty of interesting insights, comments and general food for thought.

One recurring theme that particularly struck me was the comment, uttered with regret by many in Operations, that they would very much like to help and coordinate with the development teams but were inevitably too busy keeping the production environment up and running.
In other words, helping prepare for new releases might be desirable, but achieving the five nines, or whatever SLA Operations has committed to[1], will always be paramount.

This is a fallacy! Indeed, one of the core realisations of the "Devops mindset", to me, is that 99.999...% uptime is not an end in itself, but a means to an end: delivering the greatest business value possible. And aiming for the highest possible availability may not be the best way to go about it![2]

For instance, imagine a day's downtime in production costs $500k, and you have a new feature coming up for release that is estimated to bring in an extra $1m per day. Then for every day by which you can speed up the release, you can afford almost two days of downtime![3]
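
To make that arithmetic explicit, here is a minimal sketch using the hypothetical figures above (the class name and the idea of putting this in code are mine, not from the original post): the downtime you can "afford" is simply the extra revenue gained by releasing earlier, divided by the cost of a day's downtime.

    // Back-of-the-envelope trade-off calculation with the hypothetical figures above.
    public class DowntimeTradeOff {
        public static void main(String[] args) {
            double downtimeCostPerDay = 500000.0;   // assumed cost of a day's downtime
            double extraRevenuePerDay = 1000000.0;  // assumed extra revenue once the feature ships
            int daysReleasedEarlier = 1;

            // Revenue gained by shipping earlier, expressed in days of downtime it could pay for.
            double affordableDowntimeDays =
                    (daysReleasedEarlier * extraRevenuePerDay) / downtimeCostPerDay;

            System.out.printf("Releasing %d day(s) earlier pays for %.1f day(s) of downtime%n",
                    daysReleasedEarlier, affordableDowntimeDays);
        }
    }

The raw ratio comes out at exactly two days; the "almost" presumably leaves some margin, since (as footnote 3 admits) the figures are only illustrative.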

The point is: the ability to maintain a stable current environment cannot be considered independently of the ability to rapidly deliver change. Rather, they need to be balanced against each other to determine which combination will likely deliver the greatest value. This is a decision only the business owner or customer can make. And naturally, the balance needs to be continuously monitored and updated in light of new requirements and experience.

[Image: devops - the future]

There is a residual belief that the tasks and responsibilities of developers and Operations are sufficiently different that they can't possibly benefit from each other's input. But whether it's the effects of placing nodes of a distributed system in different segments of the production network, or how the sharding and replication strategies of the database affect query performance, or even just knowing which version (and vendor!) of the JVM and container will be supported in production when the application goes live[4] - developers need Operations input, and the earlier, the better.
And only developers can add the internal health checks, debugging and tracing information, integration points for monitoring tools etc. that can mean the difference between a five-minute fix and a week's frustrated log trawling for the support team. It's revealing to see how quickly this crucial, yet often neglected, feature of an application is improved if developers are also responsible for support - generally, the first callout at three in the morning makes a world of difference.[5]
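
As a rough illustration of the kind of internal health check meant here (my own sketch, not code from the post; the dependency checks are placeholders), a simple servlet can report whether the application's critical dependencies are reachable, so monitoring tools and the support team can poll the application instead of trawling logs:

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Hypothetical health check endpoint: lets Operations and monitoring tools ask the
    // application itself whether its critical dependencies are reachable.
    public class HealthCheckServlet extends HttpServlet {

        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            boolean databaseUp = pingDatabase();        // e.g. "SELECT 1" with a short timeout
            boolean messagingUp = pingMessageBroker();  // e.g. open and close a connection

            boolean healthy = databaseUp && messagingUp;
            response.setStatus(healthy ? HttpServletResponse.SC_OK
                                       : HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            response.setContentType("text/plain");
            response.getWriter().printf("database=%s%nmessaging=%s%n", databaseUp, messagingUp);
        }

        // Placeholder checks - in a real application these would probe the actual
        // production dependencies and could include timing or tracing information.
        private boolean pingDatabase() { return true; }
        private boolean pingMessageBroker() { return true; }
    }

Wired up at, say, /health, something like this is trivially polled by whatever monitoring tooling Operations already runs.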

It goes without saying that the acceptable balance between stability and change will differ from customer to customer, and from application to application. Globally shared infrastructure can cause problems here, because it's hard to meet the requirements of the most demanding application without forcing all the others to pay the price.
In other words, modularity is an important goal architecturally, and if you're interacting with shared infrastructure it should be tunable to your requirements. Amazon's Dynamo and, indeed, most of the cloud and distributed platforms out there exemplify this trend. But I'd like to defer a detailed discussion of the technical implications to a later blog[6].
My colleague Robert van Loghem and I will also be talking about this and related topics in our upcoming webinar (plug!).
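
To give a flavour of what "tunable to your requirements" can look like in practice (a hedged sketch of the Dynamo-style idea of per-request quorums; the class and numbers are assumptions of mine, not details from the post): with N replicas of a piece of data, each caller chooses how many replicas must acknowledge a write (W) and answer a read (R), trading consistency against latency and availability.

    // Sketch of Dynamo-style tunable consistency: R + W > N means any read quorum
    // overlaps the most recent write quorum; smaller values favour latency and
    // availability at the cost of potentially stale reads.
    public class QuorumSettings {
        private final int replicas; // N

        public QuorumSettings(int replicas) {
            this.replicas = replicas;
        }

        public boolean readsSeeLatestWrite(int readQuorum, int writeQuorum) {
            return readQuorum + writeQuorum > replicas;
        }

        public static void main(String[] args) {
            QuorumSettings cluster = new QuorumSettings(3);
            // A billing application might insist on overlapping quorums...
            System.out.println("R=2, W=2: " + cluster.readsSeeLatestWrite(2, 2)); // true
            // ...while a product catalogue accepts possibly stale reads for lower latency.
            System.out.println("R=1, W=1: " + cluster.readsSeeLatestWrite(1, 1)); // false
        }
    }

The point is that different applications sharing the same infrastructure can each pick the balance that suits them, rather than all paying for the strictest requirement.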

Going back to the nlscrum meetup, the takeaway message for me was clear: setting up two independent entities, Development and Operations, giving them opposing goals (delivering change on the one hand, ensuring stability on the other) and expecting them to fight it out when the inevitable conflict happens is not the way to best deliver business value. We should be looking to organise our teams and activities to deliver the balance between new features and running systems that is most appropriate for a given application.
And we can only do that if we first go to the customer, explain that there is a trade-off to be made and work together to make it!

Addendum: in the unexpectedly long time it's taken me to finish off this post, my colleague Gero Vermaas described a client scenario that featured a real-life version of this challenge. It's good to see the client finally came round to accepting the concept, hopefully with the expected positive results!

Footnotes
  1. Too often without drawing on actual day-to-day experience, a point made by Patrick.
  2. Of course, rushing inadequately tested, unstable software out just to release a feature on a certain date usually isn't a good way to go about it, either. This post is not supposed to be "Ops-bashing"; it's just that reducing the "feature frequency" is far less controversial, in most organisations, than even considering reduced stability.
  3. The relative magnitude of the two figures is not particularly realistic, for sure. It's just for example's sake.
  4. Don't laugh! I've seen it happen too often, to clever and experienced developers, to believe this is only an isolated problem.
  5. Quite a few big companies are adopting this model for all their applications. A number of attendees at the nlscrum meeting also reported positive experiences with this approach.
  6. Or even "blog series", who knows.

 

From http://blog.xebia.com/2010/12/08/embracing-downtime-why-99-999-availability-is-not-always-better/

Published at DZone with permission of Andrew Phillips, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Loren Kratzke replied on Thu, 2010/12/09 - 2:03pm

Most interesting is your comment about opposing goals of the two teams: delivering change on the one hand, ensuring stability on the other. Around here, ops sometimes hobbles dev with bureaucracy created to minimize their work effort, so dev fights back by taking over systems previously managed by ops to minimize our work effort - one permission at a time. It's a jungle.

But the requirement for 99.999% uptime is the result of NOT having brilliant ideas every day that double revenue overnight. The reality on the ground is that you are trying to protect existing revenue (through stability and uptime) while trying to increase revenue (through creative and dev).

Our customer is our CEO. You don't tell him anything. He tells you how it's going to be. 'nuff said about that.

Also, downtime is more expensive than the time you are down. It's not like turning a faucet off and then on again. It may take hours, days, or longer for revenue to return to pre-downtime levels depending upon your business model.

It is totally hypothetical - your case where being down for a day will push a revenue-doubling product out a day sooner. Of course anybody would be a fool not to take that offer, but I have never seen this situation arise. It does make one think though.

Andrew Phillips replied on Tue, 2010/12/14 - 8:36am

@Loren: It is totally hypothetical - your case where being down for a day will push a revenue-doubling product out a day sooner.

Oh, sure. And of course I wasn't trying to suggest that switching off your servers will somehow get your features developed faster. The scenario I had in mind was more like:

 "Your application is going distributed, you're having a meeting to thrash out the new architecture and have asked someone from Operations to join in to give input on latencies in the production network. A couple of minutes into the meeting, a colleague from Ops rushes in to say that database replication in production is slow. Q: Does the Ops guy/girl stay or go?"

In most cases we've seen, it's a foregone conclusion that the Ops guy/girl goes, even if we're not talking about an "all servers down" event. The aim of the article was simply to point out that decisions like these are part of a trade-off that needs to be considered, and shouldn't be knee-jerk reactions.

 

Loren Kratzke replied on Wed, 2010/12/22 - 10:43am in response to: Andrew Phillips

Fair enough. Just keeping you on your toes! Good article.
