DevOps Zone is brought to you in partnership with:

Geoff Papilion has made a living running infrastructure for the past 15 years. He is currently employeed at Wikia.com, scaling the infrastructure to 1.5 billion request per day. Geoffrey is a DZone MVB and is not an employee of DZone and has posted 26 posts at DZone. You can read more from them at their website. View Full User Profile

The Challenge of Small Ops: Part 1

07.09.2012
| 3466 views |
  • submit to reddit

I missed a open session at DevOps days, and I’m really disappointed that I did after hearing feedback from one of the conference participant. He said many people in the session we’re advocating for eliminating operations in the small scale.

I realize that the world is changing, and that operations teams need to adjust to the changing world. We have skilled development teams, cloud services, APIs for almost everything you need to build a moderately sized web service. Its no wonder that some smaller(and some large) organizations are beginning to question the need for ops teams. So, I’m going to write a series of articles discussing the challenges that exist in operations of small and medium sized teams, and how an operations expert can help solve these issues.

Ops as the Firefighter

In my discussion with a fellow conference-goer about this topic, when he said the general feeling was that you push responsibility to your vendors and eliminate ops, I suggested that perhaps we should think of ops as a firefighter.

Small towns still have firefighting teams, and they may be volunteers, but I’ll bet they were trained by a professional. You should think of an Operations Engineer as your companies trainer. You should lean on them for the knowledge that can only be gained working in an operational environment.

On-Call

Failure is the only constant for web services, and your should expect them to happen. You will need to respond to failures in a calm and organized manner, but this is likely too much for a single individual. You’ll need a better approach.

A mid-level or senior operations engineer should be able to develop an on-call schedule for you. They should be able to identify how many engineers you need on-call in order to meet any SLA response requirement. In addition they can train your engineers how to respond, and make sure any procedure is followed that you might owe to customers. They can make everyone more effective in an emergency.

Vendor Management

Amazon, Heroku, and their friends all provide excellent reliable platforms, but from time to time they fail. Vendors typically like to restrict communications to as few people as possible, since it makes it easier for them to communicate. If you’re not careful you may find yourself spreading responsibility for vendors across your organization, as individuals add new vendors.

I believe it makes more sense to consolidate the knowledge in an operations engineer. An operations engineer is used to seeing vendors fail, and will understand the workflow required to report and escalate a problem. They understand how to read your vendors SLA, and hold them accountable to failures. Someone else can fill this role, but this person needs to be available at all hours, since failure occur randomly, and they will need to understand how to talk to the NOC on the other end.

The Advocate

Your platform provides a service, and you have customers that rely on you. Your engineering team often becomes focused on individual systems, and repairing failures in those systems. It is useful if someone plays the role of the advocate for the service, and I think operations is a perfect fit. A typical ops engineer will be able to determine if the service is still failing, and push for a resolution within the organization. They are generally familiar with the parts of the service and who is responsible for them.

Published at DZone with permission of Geoffrey Papilion, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)