
Cloud Computing / Big Data / Open Data / Linked Data Consultant & Analyst. Podcasting with tech execs. Analyst for GigaOM Pro. Based in the UK, working world-wide. Paul is a DZone MVB and is not an employee of DZone and has posted 35 posts at DZone.

Find the Data, Aggregate the Data, Make the Data Useful

05.05.2013

I was in New York in March, taking part in GigaOM’s Structure:Data event. As usual on these trips, I spent the day before the event walking around the city, soaking up some air, getting rained on, using coffee to stay awake, and meeting with a number of local companies. Of the companies I met that day, one stood out. And this week, that same company was recognised by others when it won TechCrunch Disrupt NY. That company was Enigma.

Enigma pulls data from tens of thousands of public data sets, and then offers up an interface that makes it pretty straightforward to trawl through the whole lot in search of the data points that you actually need. As the company’s Marc DaCosta introduced it, a “search and discovery platform for public data.”

At present, everything in Enigma is publicly available data. It’s mostly from the USA right now, and is acquired by a combination of screen scraping and crawling of .gov sites… and calling government agencies to request that they ship CDs of offline material. DaCosta stresses that the data is all, theoretically, available to any US citizen, but that it’s not really that easy for them to access. Even with the advent of central sites like data.gov, primary data remains spread across a multitude of portals and web pages, stored in a dizzying array of machine-readable formats… and (all too often) as PDFs full of tabular data. DaCosta finishes his preamble by stressing the need for “infrastructure to acquire, index and search public data.”
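To make the “acquire, index and search” idea concrete, here is a toy sketch in Python. It is in no way Enigma’s implementation: the two sample data sets, their column names and the function names are all invented for illustration, and a real system would have to cope with scraping, PDFs and far messier data. It simply parses some tabular text, builds a keyword index over every cell, and looks rows up across data sets.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for two public data sets.
# Names, columns and values are invented for illustration only.
SAMPLE_DATASETS = {
    "lobbying_filings": (
        "registrant,client,amount\n"
        "Acme Advisors,Globex,50000\n"
        "Beta Group,Initech,12000\n"
    ),
    "federal_contracts": (
        "vendor,agency,value\n"
        "Globex,Department of Energy,250000\n"
        "Initech,NASA,90000\n"
    ),
}


def acquire(raw_sources):
    """Parse each CSV string into a list of row dictionaries."""
    return {
        name: list(csv.DictReader(io.StringIO(text)))
        for name, text in raw_sources.items()
    }


def build_index(datasets):
    """Map each lowercased token to the (dataset, row number) pairs containing it."""
    index = defaultdict(set)
    for name, rows in datasets.items():
        for row_number, row in enumerate(rows):
            for value in row.values():
                for token in value.lower().split():
                    index[token].add((name, row_number))
    return index


def search(index, datasets, query):
    """Return (dataset, row) pairs whose rows contain every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return [(name, datasets[name][row_number]) for name, row_number in sorted(hits)]


if __name__ == "__main__":
    datasets = acquire(SAMPLE_DATASETS)
    index = build_index(datasets)
    for name, row in search(index, datasets, "globex"):
        print(name, row)
```

Even in this tiny form, a single query for one company name surfaces rows from otherwise unrelated data sets, which is the kind of cross-source discovery the article describes.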

The team’s Disrupt pitch and the subsequent back and forth with the judges provide a pretty good overview. It is worth 12 minutes of your time to watch over on TechCrunch.

During the Q&A, Freestyle Capital’s Dave Samuel raised the same concern I did during my meeting with the Enigma team. The set of possibilities with public data is extremely large, but still finite. Far more so than with Google, there are real limitations to the sorts of questions that it makes sense to ask. Presented with a search box, users struggle to understand what’s possible and what’s sensible. Simple tricks would go a long way toward helping here, such as highlighting today’s most popular queries, or providing sets of sample queries as Wolfram Alpha does. Users need to learn how to work with aggregations of data such as this, and the onus is on sites like Enigma to help grow their potential market through lightweight and accessible education.

In many ways, though, I think the biggest opportunity for Enigma doesn’t lie in their website at all. The real opportunity lies in their API, and in licensing it to third parties who will construct entirely new Enigma-powered vertical applications or integrate Enigma data into existing software such as an investor’s due diligence systems.

In and of itself, the data Enigma has gathered is interesting, but neither indispensable nor unique. The real value comes from offering easy, reliable, comprehensive and cost-effective means to get this data integrated into existing workflows, and that’s where the API needs to shine. There’s also an opportunity to encourage and enable companies to upload their own data into the platform, letting them combine it with the public data already there. This is, apparently, on the road map. I would imagine that this corporate data will initially only be available to the company that owns it, but there is a whole set of other opportunities around sharing data with supply chain partners, and even making it available to anyone.

Nice (and fast!) as the website is once you understand how it works, I tend to see it far more as a shop window than as a revenue-generating service in its own right. The Enigma team disagrees, seeing subscription-based access to the website as one of their two main products. We shall see who proves more right in due course!

Finally, for now, the company faces a real challenge in scaling. It needs to pull in more data from inside the US, and it needs to broaden its coverage outside the US. Its data acquisition processes, although highly automated, remain pretty labour-intensive. And its data teams need to understand the data they’re working with. Some government data is rigorously documented, whereas other data sets are almost unintelligible. Growth, as the Enigma team recognises all too well, requires far more than simply pointing their crawlers at some new web domains.

March and April were busy with travel. May’s pretty quiet on the travel front, with plenty of opportunity to get some proper work done before heading to Brussels at the end of the month. The next trip States-side looks like being to San Francisco in June. I wonder who I’ll meet there… and what they’ll go on to win?

Published at DZone with permission of Paul Miller, author and DZone MVB. (source)

