
A First Failed Attempt at Natural Language Processing

12.05.2012
One of the things I find fascinating about dating websites is that people's profiles are almost identical, so I thought it would be an interesting exercise to grab some of the free text that people write about themselves and show just how similar it is.

I’d been talking to Matt Biddulph about some Natural Language Processing (NLP) work he’d been doing, and he wrote up a list of libraries, articles and books that he’d found useful.

I started out by plugging the text into one of the many NLP libraries that Matt listed with the vague idea that it would come back with something useful.

I’m not sure exactly what I was expecting the result to be, but after five or six hours of playing around with different libraries I’d got nowhere, so I parked the problem without really knowing where I’d gone wrong.

Last week I came across a paper titled “That’s What She Said: Double Entendre Identification” whose authors wanted to work out when a sentence could legitimately be followed by the phrase “that’s what she said”.

While the subject matter is a bit risqué, I found reading about the way the authors went about solving their problem very interesting, and it helped me see some of the mistakes I’d made.

Vague problem statement

Unfortunately I didn’t do a good job of working out exactly what problem I wanted to solve – my problem statement was too general.

In the paper the authors narrowed down their problem space by focusing on a specific set of words that are typically used in double entendres, and then worked out the sentence structure that the targeted sentences were likely to have.
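To make that narrowing-down idea concrete, here's a minimal sketch (not the paper's actual implementation): keep only the sentences that contain words from a hand-picked target set and work on those. The target words below are invented placeholders.

```python
# Sketch only: filter sentences down to the ones containing words from a
# hand-picked target set, so later analysis has a much smaller space to
# deal with. The target words are made-up placeholders.
TARGET_WORDS = {"hard", "big", "wet"}

def candidate_sentences(sentences):
    for sentence in sentences:
        words = {word.strip(".,!?").lower() for word in sentence.split()}
        if words & TARGET_WORDS:
            yield sentence

print(list(candidate_sentences([
    "That was really hard to get right.",
    "I like cooking and long walks.",
])))
# ['That was really hard to get right.']
```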

Instead of defining my problem more specifically, I plugged the text into Mallet, morpha-stemmer and Stanford Core NLP and tried to cluster the most popular words.
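For reference, this is roughly the kind of thing I was trying to do, sketched here with scikit-learn rather than the libraries above, and with invented profile texts: vectorise the profiles, cluster them, and look at the top terms in each cluster.

```python
# Rough sketch of "cluster the popular words" using scikit-learn; the
# profile texts are invented and this is not the pipeline I actually ran.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

profiles = [
    "I love travelling and trying new restaurants",
    "Travelling, good food and live music",
    "I enjoy hiking and the outdoors",
]

vectorizer = CountVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(profiles)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(matrix)

# Print the three highest-weighted terms for each cluster centroid.
terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(kmeans.cluster_centers_):
    top = centroid.argsort()[::-1][:3]
    print(i, [terms[j] for j in top])
```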

That didn’t really work because people use slightly different words to describe the same thing, so I ended up looking at Yawni – a wrapper around WordNet, which groups words into sets of cognitive synonyms (synsets).
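Yawni itself is a Java library; the equivalent idea in Python, using NLTK's WordNet interface, looks something like this: look up the synsets a word belongs to, so that "film" and "movie" can be treated as the same concept.

```python
# Look up the synsets (sets of cognitive synonyms) for a word via NLTK's
# WordNet corpus reader. Requires: nltk.download("wordnet")
from nltk.corpus import wordnet

for synset in wordnet.synsets("movie"):
    print(synset.name(), synset.lemma_names())
# e.g. movie.n.01 ['movie', 'film', 'picture', 'moving_picture', ...]
```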

In hindsight a more successful approach might have been to find the common words that people tend to use in these types of profiles and then work from there.
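Something along those lines could be as simple as counting word frequencies across the profiles, ignoring the obvious stop words, and starting from whatever terms keep showing up. The stop word list and profile texts here are just illustrative.

```python
# Count the most common words across a set of profile texts.
from collections import Counter

STOP_WORDS = {"i", "and", "the", "to", "a", "of", "my"}  # tiny illustrative list

def common_words(profiles, n=10):
    counts = Counter()
    for text in profiles:
        words = (word.strip(".,!?").lower() for word in text.split())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)

print(common_words([
    "I love travelling and good food",
    "Good food, travelling and live music",
]))
```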

No Theory

I recently wrote about how I’ve been learning about neural networks by switching between theory and practice, but with NLP I didn’t bother reading any theory and assumed I could get away with just plugging some data into one of the libraries.

I now realise that was a mistake: when the libraries didn’t work as I’d hoped, I didn’t know what to do because I wasn’t sure what they were supposed to be doing in the first place!

My next step should probably be to understand how text gets converted into vectors, then move on to tf-idf and see whether I have a better idea of how to solve my problem.
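As a starting point, here's a small sketch of that step using scikit-learn's TfidfVectorizer on some invented profile text; each row of the resulting matrix is a profile and each column a term, with higher scores for terms that are frequent in one profile but rare across the rest.

```python
# Turn free text into tf-idf weighted vectors; the profiles are invented.
from sklearn.feature_extraction.text import TfidfVectorizer

profiles = [
    "I love travelling and trying new restaurants",
    "Travelling, good food and live music",
    "I enjoy hiking and the outdoors",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(profiles)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```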

Published at DZone with permission of Mark Needham, author and DZone MVB.
