Leaving university, I thought I'd be a developer happily knocking out code but was always drawn to tech support. How do we help people use our tools better? Now, I mostly specialize in consulting and introducing new customers to our tools. I am a Tech Evangelist and Lead Consultant at Urbancode. Eric is a DZone MVB and is not an employee of DZone and has posted 77 posts at DZone.

Breaking my Production Website: A Post-Mortem

05.19.2012

The Really Short Story

I had a failed deployment. I know how to handle complex deployments, but I didn’t follow my own advice. In the future, I should change my automation so that it is easier to do the right thing than to mess up in this way. I should also use higher-bandwidth communication when discussing complex deployments with my developers.


The Longer Story


DEPLOYMENT GOAL / DEVELOPMENT / TESTING

We recently put the finishing touches on a new white paper, “Deployment Automation Basics”, and wanted to post it to the website. Unfortunately, we discovered that the backend CMS for that type of new content had been broken in our move from anthillpro.com to urbancode.com in the summer. As a retired developer, I punted the bug fix work over to an active developer (Mike). Mike figured out the problem, made the required back-end and front-end changes, and delivered them to the Test environment quickly. I tested the behavior there, and after an iteration or two we had something that would work well. My Test environment was now correct. All I had to do was promote what was in that environment to Production and all would be well.

Planning prod deploy

At this point, I asked Mike for the scope of the changes over instant message. I learned that the secure content upload was missing and a number of configuration changes were required. As I read through the list, the changes sounded like they were contained to the white-paper management system. Mike agreed that there were exactly two impacted components:

  • Urbancode-com-content (the website content)
  • Urbancode-com-app (the backend system) via a change to its build-time dependency LC-CMS (content management widgets).

Using AnthillPro, I was able to easily determine which version of each component was currently in Test, and sanity check a number of things:

  • Deployments of those apps had targeted Test between the point where my tests were failing and the point where they passed.
  • The developer had actually made changes impacting those components.
  • No other source code changes fed into those components during that time.
  • The developer’s other source code changes during that time were not at all related to the website – he’d actually been working on uDeploy.

So my instructions of “Move the two components and white paper uploads will work” looked extremely reasonable and roughly represented my deployment plan.

The Production Deployment

Updating both the front and back-end concurrently felt unsafe to me. I’d never done it. So I started with the simple deployment I do several times a week. I pushed updated content out the door. It is a simple secondary process in AnthillPro executed against the version currently in Test. It takes 3 minutes across the WAN so I checked email. When I got my “deployment complete” instant message, I wrapped up the email I was reading and checked Prod.

Disaster.

Even before I got to the white-paper area. Disaster. No website at urbancode.com. Just a stack-trace. Rational thought left me, and sheer animal terror set in. Rollback!

Over the years, I’ve demoed executing a simple rollback in AnthillPro dozens of times. I quickly looked up the previous production version of the content, and re-pushed it. Website was back up three minutes later and working perfectly. Our total outage was under five minutes. Given the role and traffic loads of our site, that qualifies as “Bad, but not tragic.”

Still, what the !@#$ happened?

Post-Mortem and success

A politer version of that question went to my developer when he was back in the office. It turns out that the back-end changes actually impacted the whole site, not just the white-paper area. I should have pushed the back-end first, then the front. This, it turns out, is always the expected order when both elements change. Changing both components is quite rare for us, though, and I’d never been responsible for a migration where that took place.

We executed the deployments in the correct order that day with perfect success.
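
The ordering rule from this post-mortem (back-end first, then front-end, whenever both change) is small enough to write down as data. What follows is only a minimal sketch in Python 3.9+, using the standard `graphlib` module. The `DEPENDS_ON` mapping and the `deployment_order` function are hypothetical names made up for illustration; they are not part of AnthillPro or uDeploy, and the dependency direction is simply the one described above.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what must be deployed before it.
# The component names come from the post; the mapping itself is an assumption.
DEPENDS_ON = {
    "Urbancode-com-app": set(),                      # back-end: no prerequisites
    "Urbancode-com-content": {"Urbancode-com-app"},  # content relies on the back-end
}


def deployment_order(changed):
    """Return the changed components in an order that respects DEPENDS_ON."""
    # static_order() yields every node after all of its predecessors.
    full_order = list(TopologicalSorter(DEPENDS_ON).static_order())
    # A partial deployment is a subset of the full plan, kept in the same order.
    return [component for component in full_order if component in changed]


if __name__ == "__main__":
    print(deployment_order({"Urbancode-com-content", "Urbancode-com-app"}))
```

A content-only push still yields a one-item plan, so the frequent simple deployment keeps working; the ordering only matters when both components are in the changed set.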

Lessons Learned:

  • As I preach in “Mastering Complex Application Deployments”, the whole deployment process, including all components, should be defined once, with partial deployments executing a subset of it.
  • We should have migrated this deployment from AnthillPro (which promotes components / builds) to uDeploy (which deploys the whole system). We ate our own dog food, but the wrong flavor.
  • Mike was in Cleveland while I was in Denver. Since we couldn’t sit and talk about this change, we should have had a phone or Skype conversation rather than instant message. We could have talked through the release a little better and caught the order dependency I had missed.

At the end of the day, intellectually knowing what to do isn’t enough. Doing things correctly always needs to be more natural and easier than doing things wrong. A “standard operating procedure” would have helped encourage the communication that was lacking, and moving the dependency knowledge from our heads into our automation would have prevented the outage outright.
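
One way to move that dependency knowledge into automation is a guard that runs before any promotion to Prod. The sketch below is a hedged illustration in Python, not how AnthillPro or uDeploy implement it. It assumes each environment can report a build number per component; the function names, the `DEPENDS_ON` mapping, and all build numbers are invented for the example.

```python
# Hypothetical dependency knowledge: component -> components it relies on.
DEPENDS_ON = {"Urbancode-com-content": ["Urbancode-com-app"]}


def unmet_dependencies(component, test_env, prod_env, depends_on=DEPENDS_ON):
    """List dependencies whose Prod build differs from the build validated in Test."""
    return [
        dep
        for dep in depends_on.get(component, [])
        if prod_env.get(dep) != test_env.get(dep)
    ]


def check_promotion(component, test_env, prod_env):
    """Raise before deploying if the component was tested against builds Prod lacks."""
    unmet = unmet_dependencies(component, test_env, prod_env)
    if unmet:
        raise RuntimeError(
            f"Refusing to promote {component}: deploy {', '.join(unmet)} first"
        )


if __name__ == "__main__":
    # Made-up inventories: component -> build number in each environment.
    test_env = {"Urbancode-com-app": 42, "Urbancode-com-content": 108}
    prod_env = {"Urbancode-com-app": 41, "Urbancode-com-content": 107}
    try:
        check_promotion("Urbancode-com-content", test_env, prod_env)
    except RuntimeError as err:
        print(err)
```

With a check like this in place, the content-only push described above would have stopped with a message naming the back-end, instead of replacing the site with a stack trace.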

Published at DZone with permission of Eric Minick, author and DZone MVB. (source)

Comments

Ximon Eighteen replied on Sun, 2012/05/20 - 8:46am

Hi Eric,

I'm missing something in your lessons learned section. Isn't the main lesson learned here "use the same upgrade procedure for prod as was shown to work in your prod-like test instance" ?

Thanks,

Ximon 

Eric Minick replied on Mon, 2012/05/21 - 1:05pm

To some extent, that's correct.

The trick was that we were updating the 2 parts of our site in our stage environment pretty frequently and usually independently. I did use the same upgrade procedure for the web content in each environment. But my procedure assumed independence, which was incorrect.

This post on DZone is a little delayed from when we originally posted this on the blog. The good news is that we did the most important part of a post-mortem: We actually implemented the change to broaden our definition of the release procedure to be aware of the dependency. No errors of this type since, and no extra work on our part.
