My passion is building crawlers and search engines. In particular, I specialize in building vertical search engines like Indeed.com, Homethinking.com, Bright.com and Enormo.com (all companies I've worked with). I've also worked on products such as Atlassian Jira and Confluence to improve their search capabilities. Kelvin has posted 22 posts at DZone. You can read more from them at their website. View Full User Profile

Simplistic noun-phrase chunking with POS tags in Java

06.20.2012
| 7769 views |
  • submit to reddit

I needed to extract Noun-Phrases from text. The way this is generally done is using Part-of-Speech (POS) tags. OpenNLP has a both a POS-tagger as well as a Noun-Phrase chunker. However, it's really really really slow!

I decided to look into alternatives, and chanced upon QTag.

QTag is a "freely available, language independent POS-Tagger. It is implemented in Java, and has been successfully tested on Mac OS X, Linux, and Windows."

It's waaay faster than OpenNLP for POS-tagging, though I haven't done any benchmarks as to a accuracy.

Here's my really simplistic but adequate implementation of noun-phrase chunking using QTag.

private Qtag qt;
  public static List<String> chunkQtag(String str) throws IOException {
    List<String> result = new ArrayList<String>();
    if (qt == null) {
      qt = new Qtag("lib/english");
      qt.setOutputFormat(2);
    }

    String[] split = str.split("\n");
    for (String line : split) {
      String s = qt.tagLine(line, true);
      String lastTag = null;
      String lastToken = null;
      StringBuilder accum = new StringBuilder();
      for (String token : s.split("\n")) {
        String[] s2 = token.split("\t");
        if (s2.length < 2) continue;
        String tag = s2[1];

        if (tag.equals("JJ")
            || tag.startsWith("NN")
            || tag.startsWith("??")
            || (lastTag != null && lastTag.startsWith("NN") && s2[0].equalsIgnoreCase("of"))
            || (lastToken != null && lastToken.equalsIgnoreCase("of") && s2[0].equalsIgnoreCase("the"))
            ) {
          accum.append(s2[0]).append("-");
        } else {
          if (accum.length() > 0) {
            accum.deleteCharAt(accum.length() - 1);
            result.add(accum.toString());
            accum = new StringBuilder();
          }
        }
        lastTag = tag;
        lastToken = s2[0];
      }
      if (accum.length() > 0) {
        accum.deleteCharAt(accum.length() - 1);
        result.add(accum.toString());
      }
    }
    return result;
  }

The method returns a list of noun phrases. 

Published at DZone with permission of its author, Kelvin Tan. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Tags: