DevOps Zone is brought to you in partnership with:

Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j building sophisticated solutions to challenging data problems. When he's not with customers Mark is a developer on Neo4j and writes his experiences of being a graphista on a popular blog at http://markhneedham.com/blog. He tweets at @markhneedham. Mark is a DZone MVB and is not an employee of DZone and has posted 553 posts at DZone. You can read more from them at their website. View Full User Profile

Lessons from Supporting Production Code

07.23.2013
| 3899 views |
  • submit to reddit

Until I started working on the uSwitch energy website around 8 months ago I had not really done any support of a production system so I learnt some interesting lessons in my time there.

Look at the new code first

We had our application wired up to Airbrake so whenever a user did anything which resulted in an exception being thrown we received a report with the stack trace, environment variables and which page they were on.

When trying to work out what had happened I initially started from scratch and tried to work backwards from the source and create a scenario in my head of what they might have done to get that error.

After a few times of doing this it became clear that there was a reasonable chance that if a user was experiencing a problem it was probably because of some new code that we’d just introduced.

We therefore tweaked our bug hunting algorithm to initially check code that had been changed recently and only after ruling that out did we work back from first principles.

It may never have worked

Sometimes it became clear that new code wasn’t to blame but it seemed implausible that the error could have actually happened.

There was a tendency to assume that the user must be deliberately doing something to make the application break but it soon became clear that they had just managed to hit a code path that had not been hit before.

Even if you’ve done extensive testing on a system users still seem to find paths through the code that haven’t failed previously so it seems best to just assume that is going to happen at some stage.

Log all the things

As I mentioned earlier we were using a 3rd party service to collect errors and other helpful information which was really useful for helping us find the root cause of problems.

The type of logging that you need varies so for a product like neo4j as well as logging exceptions we also log system information and memory settings.

Obviously I’m quite new to this type of work so I’m sure others will have useful bits of advice to share as well.

Published at DZone with permission of Mark Needham, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)