Big Data/BI Zone is brought to you in partnership with:

Mitch Pronschinske is a Senior Content Analyst at DZone. That means he writes and searches for the finest developer content in the land so that you don't have to. He often eats peanut butter and bananas, likes to make his own ringtones, enjoys card and board games, and is married to an underwear model. Mitch is a DZone Zone Leader and has posted 2569 posts at DZone. You can read more from them at their website. View Full User Profile

Dedupe, Merge, and Purge: the Art of Normalization

12.10.2011
| 4360 views |
  • submit to reddit
Here is the abstract and video from the session by Tyler Bell and Leo Polovets’s talk from Strata NYC 2011

Big Noise always accompanies Big Data, especially when extracting entities from the tangle of duplicate, partial, fragmented and heterogeneous information we call the Internet. The ~17m physical businesses in the US, for example, are found on over 1 billion webpages and endpoints across 5 million domains and applications. Organizing such a disparate collection of pages into a canonical set of things requires a combination of distributed data processing and human-based domain knowledge. This presentation stresses the importance of entity resolution within a business context and provides real-world examples and pragmatic insight into the process of canonicalization.