
Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j, building sophisticated solutions to challenging data problems. When he's not with customers, Mark is a developer on Neo4j and writes about his experiences as a graphista on a popular blog at http://markhneedham.com/blog. He tweets at @markhneedham. Mark is a DZone MVB (not an employee of DZone) and has posted 534 posts at DZone.

Thoughts on the Investigation Stage of the Nygard Big Data Model

10.11.2012

Earlier this year Michael Nygard wrote an extremely detailed post about his experiences in the world of big data projects, and included in it was the following diagram, which I've found very useful.

[Figure: Nygard's Big Data Model (shamelessly borrowed by me because it's awesome)]

Ashok and I have been doing some work in this area, helping one of our clients make sense of and visualise some of their data, and we realised retrospectively that we had been acting very much in the investigation stage of the model.

In particular, Nygard makes the following suggestions about the way we work when we're in this mode:

We don’t want to invest in fully automated machine learning and feedback. That will follow once we validate a hypothesis and want to integrate it into our routine operation.

Ad-hoc analysis refers to human-based data exploration. This can be as simple as spreadsheets with line graphs…the key aspect is that most of the tools are interactive. Questions are expressed as code, but that code is usually just “one shot” and is not meant for production operations.

In our case we weren't doing anything as complicated as machine learning: most of our work involved figuring out the relationships between things, how best to model them, and which visualisations would best describe what we were seeing.

We didn't TDD any of the code and we copy/pasted a lot. When we had a longer-running query we didn't try to optimise it there and then; instead, we ran it once, saved the results to a file, and then loaded the file into the UI.
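As a minimal sketch of that "run once, save to a file" approach, here's roughly what it might look like in Python with the official Neo4j driver (the file name, connection details, and query below are illustrative assumptions, not the actual code we used):

    import json
    import os

    from neo4j import GraphDatabase

    CACHE_FILE = "query_results.json"  # hypothetical cache location

    def fetch_results(uri, auth, query):
        """Run an expensive query once and save the rows to a file;
        later calls read the file instead of hitting the database."""
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE) as f:
                return json.load(f)

        driver = GraphDatabase.driver(uri, auth=auth)
        try:
            with driver.session() as session:
                rows = [record.data() for record in session.run(query)]
        finally:
            driver.close()

        with open(CACHE_FILE, "w") as f:
            json.dump(rows, f)
        return rows

    # e.g. fetch_results("bolt://localhost:7687", ("neo4j", "secret"),
    #                    "MATCH (n) RETURN n.name AS name LIMIT 100")

The point isn't the caching code itself but the mindset: in investigation mode it's fine for the query to be slow and for the "cache invalidation strategy" to be deleting the file by hand.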

We were able to work in iterations of two to three hours, during which we tried to answer a question (or more than one if we had time), then showed our client what we'd managed to do and decided where we wanted to go next.

To start with we did all this with a subset of the actual data set, and then once we were on the right track we loaded in the rest of the data.
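One cheap way of doing that, sketched below with hypothetical names (and assuming the raw data arrives as CSV, which may not match your situation), is to cap how many rows you read while investigating and lift the cap once the model looks right:

    import csv
    from itertools import islice

    SAMPLE_SIZE = 1000  # hypothetical cut-off for the investigation phase

    def sample_rows(path, limit=SAMPLE_SIZE):
        """Yield only the first `limit` rows of a CSV so early modelling
        experiments stay fast; pass limit=None to read the whole file."""
        with open(path, newline="") as f:
            for row in islice(csv.DictReader(f), limit):
                yield row

Once the model and visualisations look right, switching to the full data set is just a matter of passing limit=None.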

We can easily get distracted by the difficulties of loading large amounts of data before checking whether what we’re doing makes sense.

We iterated through 4 or 5 different ideas before we got to one that allowed us to explore an area which hadn't previously been looked at.

Now that we've done that, we're rewriting the application from scratch, still using the same ideas as the initial prototype, but this time making sure the queries can run on the fly and bringing the code a bit closer to production quality!

We've moved into the implementation stage of the model for this avenue, although if I understand the model correctly it would be OK to go back into investigation mode if we want to do some discovery work with other parts of the data.

I'm probably quoting this model way too much to the people I talk to about this type of work, but I think it's really good, so nice work, Mr Nygard!

Published at DZone with permission of Mark Needham, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)