Simplistic noun-phrase chunking with POS tags in Java
I needed to extract noun phrases from text. This is generally done using Part-of-Speech (POS) tags. OpenNLP has both a POS tagger and a noun-phrase chunker. However, it's really slow!
I decided to look into alternatives, and chanced upon QTag.
QTag is a "freely available, language independent POS-Tagger. It is implemented in Java, and has been successfully tested on Mac OS X, Linux, and Windows."
It's way faster than OpenNLP for POS tagging, though I haven't benchmarked its accuracy.
Here's my really simplistic but adequate implementation of noun-phrase chunking using QTag.
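// Needs java.io.IOException, java.util.ArrayList and java.util.List,
// plus the Qtag class from the QTag jar.

// Shared tagger, created lazily on first use. It must be static
// because chunkQtag() is static.
private static Qtag qt;

public static List<String> chunkQtag(String str) throws IOException {
    List<String> result = new ArrayList<String>();
    if (qt == null) {
        qt = new Qtag("lib/english");
        qt.setOutputFormat(2);
    }
    String[] split = str.split("\n");
    for (String line : split) {
        // The parsing below expects one token<TAB>tag pair per line.
        String s = qt.tagLine(line, true);
        String lastTag = null;
        String lastToken = null;
        StringBuilder accum = new StringBuilder();
        for (String token : s.split("\n")) {
            String[] s2 = token.split("\t");
            if (s2.length < 2) continue;
            String tag = s2[1];
            // Keep adjectives, nouns, "??" tags (presumably words the tagger
            // could not classify), "of" after a noun, and "the" after "of".
            if (tag.equals("JJ")
                    || tag.startsWith("NN")
                    || tag.startsWith("??")
                    || (lastTag != null && lastTag.startsWith("NN") && s2[0].equalsIgnoreCase("of"))
                    || (lastToken != null && lastToken.equalsIgnoreCase("of") && s2[0].equalsIgnoreCase("the"))
            ) {
                accum.append(s2[0]).append("-");
            } else {
                // Any other tag ends the current phrase.
                if (accum.length() > 0) {
                    accum.deleteCharAt(accum.length() - 1); // drop trailing "-"
                    result.add(accum.toString());
                    accum = new StringBuilder();
                }
            }
            lastTag = tag;
            lastToken = s2[0];
        }
        // Flush a phrase that runs to the end of the line.
        if (accum.length() > 0) {
            accum.deleteCharAt(accum.length() - 1);
            result.add(accum.toString());
        }
    }
    return result;
}
The method returns a list of noun phrases, with the words of each phrase joined by hyphens.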
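To see the chunking rule in isolation, here's a minimal sketch that runs the same conditions over tokens that are already tagged, so it doesn't need QTag at all. The class name ChunkSketch, the {token, tag} array representation, and the sample sentence with its tags are assumptions made up for illustration; they are not actual QTag output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChunkSketch {

    // Each entry is a hypothetical {token, tag} pair, using tags such as
    // "JJ" and "NN" like the ones checked in the method above.
    public static List<String> chunk(List<String[]> tagged) {
        List<String> phrases = new ArrayList<>();
        List<String> current = new ArrayList<>();
        String lastTag = null;
        String lastToken = null;
        for (String[] pair : tagged) {
            String token = pair[0];
            String tag = pair[1];
            // Same conditions as the QTag-based method: adjectives, nouns,
            // "??" tags, "of" after a noun, and "the" after "of".
            boolean keep = tag.equals("JJ")
                    || tag.startsWith("NN")
                    || tag.startsWith("??")
                    || (lastTag != null && lastTag.startsWith("NN") && token.equalsIgnoreCase("of"))
                    || (lastToken != null && lastToken.equalsIgnoreCase("of") && token.equalsIgnoreCase("the"));
            if (keep) {
                current.add(token);
            } else if (!current.isEmpty()) {
                // Any other tag closes the phrase collected so far.
                phrases.add(String.join("-", current));
                current.clear();
            }
            lastTag = tag;
            lastToken = token;
        }
        // Flush a phrase that runs to the end of the input.
        if (!current.isEmpty()) {
            phrases.add(String.join("-", current));
        }
        return phrases;
    }

    public static void main(String[] args) {
        // Hand-tagged sample sentence (assumed tags, not produced by a tagger).
        List<String[]> tagged = Arrays.asList(
                new String[] {"The", "DT"},
                new String[] {"quick", "JJ"},
                new String[] {"fox", "NN"},
                new String[] {"ran", "VBD"},
                new String[] {"to", "TO"},
                new String[] {"the", "DT"},
                new String[] {"edge", "NN"},
                new String[] {"of", "IN"},
                new String[] {"the", "DT"},
                new String[] {"forest", "NN"});
        System.out.println(chunk(tagged)); // prints [quick-fox, edge-of-the-forest]
    }
}
```

One thing the sketch makes easy to spot: because "of" and "the" are accepted based only on the previous token, a phrase can end in a dangling "of" or "of the" when no noun follows. That's the "simplistic" part, and it applies to the QTag version too.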