Rafał Kuć is a team leader and software developer, currently working as a software architect and Solr and Lucene specialist. He focuses mainly on Java, but is open to any tool or programming language that helps him reach his goal more easily and quickly. Rafał is also one of the founders of the solr.pl site, where he shares his knowledge and helps people with their problems. Rafał is a DZone MVB, is not an employee of DZone, and has published 75 posts at DZone.

Document language identification

02.06.2012

One of the features of the latest Solr version (3.5) is the ability to identify the language of a document while it is being indexed. In today's entry we will see how Apache Solr works together with Apache Tika to identify the language of documents.

At the beginning

You should remember that the described functionality was introduced in Solr 3.5.

Assumptions

We will use two fields to identify the document language: title and body. The detected language will be stored in the lang field.

Index structure

The structure of our index is of course simplified and contains only the fields needed for the test. The field definition part of the schema.xml file looks like this:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="text_ws" indexed="true" stored="true" />
<field name="body" type="text_ws" indexed="true" stored="true" />
<field name="lang" type="string" indexed="true" stored="true" />

All the fields are marked as stored="true" for simplicity.

Update request processor configuration

In order to use the language identification feature we need to configure a Solr update request processor chain. We will use the processor that relies on Apache Tika (there is a second implementation based on http://code.google.com/p/language-detection/). To configure it, we add the following to the solrconfig.xml file:

<updateRequestProcessorChain name="langid">
  <processor name="langid" class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,body</str>
      <str name="langid.langField">lang</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

The other parameters of TikaLanguageIdentifierUpdateProcessorFactory are described on the Apache Solr wiki: http://wiki.apache.org/solr/LanguageDetection.

Additional libraries

For the update request processor to work we need some additional libraries. From the dist directory of the Apache Solr distribution we copy apache-solr-langid-3.5.0.jar to a new directory, tikaLib (for example), which we create at the same level as the webapps directory. Then we add the following line to the solrconfig.xml file:

<lib dir="../tikaLib/" regex="apache-solr-langid-\d.*\.jar" />
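The backslashes in the regex attribute matter. The short Python check below is only an illustration: Solr itself uses Java regular expressions, and the sketch assumes that the pattern has to match the whole file name. It shows that a pattern with \d accepts the jar copied above, while the same pattern with the backslashes lost (so that "d" is a literal letter) does not. The helper name matches_whole_name is made up for this example.

```python
import re

# The jar copied from the dist directory of the Solr 3.5.0 distribution.
JAR_NAME = "apache-solr-langid-3.5.0.jar"

# Pattern with the digit class and an escaped dot before "jar".
escaped = re.compile(r"apache-solr-langid-\d.*\.jar")
# The same pattern with the backslashes lost: "d" is now a literal letter.
unescaped = re.compile(r"apache-solr-langid-d.*.jar")


def matches_whole_name(pattern, file_name):
    """Return True when the pattern matches the complete file name."""
    return pattern.fullmatch(file_name) is not None


print(matches_whole_name(escaped, JAR_NAME))    # True
print(matches_whole_name(unescaped, JAR_NAME))  # False
```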

The next library we need is the Tika jar with all the goodies (tika-app-1.0.jar), which can be downloaded from http://tika.apache.org/. We place it in the same tikaLib directory and then add the following entry to the solrconfig.xml file:

<lib dir="../tikaLib/" regex="tika-app-1.0.jar" />

Test documents

For testing purposes I prepared three documents: the first in English, the second in Polish and the third in German. Their content was taken from Wikipedia. They look as follows:

tika_en.xml

<add>
<doc>
  <field name="id">1</field>
  <field name="title">Water</field>
  <field name="body">Water is a chemical substance with the chemical formula H2O. A water molecule contains one oxygen and two hydrogen atoms connected by covalent bonds. Water is a liquid at ambient conditions, but it often co-exists on Earth with its solid state, ice, and gaseous state (water vapor or steam). Water also exists in a liquid crystal state near hydrophilic surfaces.[1][2] Under nomenclature used to name chemical compounds, Dihydrogen monoxide is the scientific name for water, though it is almost never used.</field>
</doc>
</add>

tika_pl.xml

<add>
<doc>
  <field name="id">2</field>
  <field name="title">Woda</field>
  <field name="body">Woda (tlenek wodoru; nazwa systematyczna IUPAC: oksydan) – związek chemiczny o wzorze H2O, występujący w warunkach standardowych w stanie ciekłym. W stanie gazowym wodę określa się mianem pary wodnej, a w stałym stanie skupienia – lodem. Słowo woda jako nazwa związku chemicznego może się odnosić do każdego stanu skupienia.</field>
</doc>
</add>

tika_de.xml

<add>
<doc>
  <field name="id">3</field>
  <field name="title">Wasser</field>
  <field name="body">Wasser (H2O) ist eine chemische Verbindung aus den Elementen Sauerstoff (O) und Wasserstoff (H). Wasser ist die einzige chemische Verbindung auf der Erde, die in der Natur in allen drei Aggregatzuständen vorkommt. Die Bezeichnung Wasser wird dabei besonders für den flüssigen Aggregatzustand verwendet. Im festen (gefrorenen) Zustand spricht man von Eis, im gasförmigen Zustand von Wasserdampf.</field>
</doc>
</add>

More testing

To index the data I used the following shell commands:

curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary @tika_pl.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary @tika_en.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary @tika_de.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary '<commit/>' -H 'Content-type:application/xml'

Note the additional update.chain=langid parameter added to each request. This parameter tells Solr which update request processor chain to use when indexing the data. In the example we told Solr to use the langid chain we defined in solrconfig.xml.
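If you prefer a script to separate curl calls, the same requests can be sketched in Python with the standard library only. This is a minimal sketch, assuming the same address, chain name and file names as in the commands above; the function names (build_update_url, post_xml, index_files) are made up for this example, and there is no error handling.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Same update handler address as in the curl commands.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_url(chain, base_url=SOLR_UPDATE_URL):
    """Append the update.chain parameter that selects the processor chain."""
    return base_url + "?" + urlencode({"update.chain": chain})


def post_xml(payload, chain="langid"):
    """POST an XML update message (bytes) through the chain; return the HTTP status."""
    request = Request(
        build_update_url(chain),
        data=payload,
        headers={"Content-type": "application/xml"},
    )
    with urlopen(request) as response:
        return response.status


def index_files(paths, chain="langid"):
    """Send every file through the chain and finish with a commit."""
    for path in paths:
        with open(path, "rb") as handle:
            post_xml(handle.read(), chain)
    post_xml(b"<commit/>", chain)


# Example call (needs a running Solr instance and the three test files):
# index_files(["tika_pl.xml", "tika_en.xml", "tika_de.xml"])
```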

Indexed data

So let’s have a look at the indexed data. We will do that by running the following query: q=*:*&indent=true.

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="indent">true</str>
    <str name="q">*:*</str>
  </lst>
</lst>
<result name="response" numFound="3" start="0">
  <doc>
    <str name="body">Woda (tlenek wodoru; nazwa systematyczna IUPAC: oksydan) – związek chemiczny o wzorze H2O, występujący w warunkach standardowych w stanie ciekłym. W stanie gazowym wodę określa się mianem pary wodnej, a w stałym stanie skupienia – lodem. Słowo woda jako nazwa związku chemicznego może się odnosić do każdego stanu skupienia.</str>
    <str name="id">2</str>
    <str name="lang">pl</str>
    <str name="title">Woda</str>
  </doc>
  <doc>
    <str name="body">Water is a chemical substance with the chemical formula H2O. A water molecule contains one oxygen and two hydrogen atoms connected by covalent bonds. Water is a liquid at ambient conditions, but it often co-exists on Earth with its solid state, ice, and gaseous state (water vapor or steam). Water also exists in a liquid crystal state near hydrophilic surfaces.[1][2] Under nomenclature used to name chemical compounds, Dihydrogen monoxide is the scientific name for water, though it is almost never used.</str>
    <str name="id">1</str>
    <str name="lang">en</str>
    <str name="title">Water</str>
  </doc>
  <doc>
    <str name="body">Wasser (H2O) ist eine chemische Verbindung aus den Elementen Sauerstoff (O) und Wasserstoff (H). Wasser ist die einzige chemische Verbindung auf der Erde, die in der Natur in allen drei Aggregatzuständen vorkommt. Die Bezeichnung Wasser wird dabei besonders für den flüssigen Aggregatzustand verwendet. Im festen (gefrorenen) Zustand spricht man von Eis, im gasförmigen Zustand von Wasserdampf.</str>
    <str name="id">3</str>
    <str name="lang">de</str>
    <str name="title">Wasser</str>
  </doc>
</result>
</response>

As you can see, Solr, with the help of Tika, was able to identify the languages of the indexed documents. Of course, let’s not be too optimistic: mistakes happen, especially with multi-language documents, but that is understandable.
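With more than a handful of documents you will probably want to check the detected languages in a script rather than by eye. The sketch below parses a shortened copy of the response shown above (the body and title fields are left out) and builds a mapping from document id to detected language. The function name languages_by_id is made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response shown above (body and title fields omitted).
SAMPLE_RESPONSE = """
<response>
<result name="response" numFound="3" start="0">
  <doc>
    <str name="id">2</str>
    <str name="lang">pl</str>
  </doc>
  <doc>
    <str name="id">1</str>
    <str name="lang">en</str>
  </doc>
  <doc>
    <str name="id">3</str>
    <str name="lang">de</str>
  </doc>
</result>
</response>
"""


def languages_by_id(response_xml):
    """Map each document id to the value of its lang field (None if missing)."""
    root = ET.fromstring(response_xml.strip())
    detected = {}
    for doc in root.iter("doc"):
        # Collect the <str name="..."> children of the document into a dict.
        fields = {field.get("name"): field.text for field in doc.findall("str")}
        detected[fields["id"]] = fields.get("lang")
    return detected


print(languages_by_id(SAMPLE_RESPONSE))  # {'2': 'pl', '1': 'en', '3': 'de'}
```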

To sum up

Remember that the language identification feature is not perfect and can make mistakes. Also remember that the longer the documents, the better it works. The remaining problem is that we can’t reliably identify the language at query time, because queries are usually too short, but that is not a problem specific to Solr and Tika. You can deal with it by identifying your user, for example by their web browser settings or the place they are located in.
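One way to identify the user by their web browser is the Accept-Language header sent with each request. The sketch below is only one possible approach, not something Solr provides: it picks the preferred language from that header and turns it into a filter query on the lang field. The list of supported languages, the default language and the function names are assumptions made for this example.

```python
from urllib.parse import urlencode

# Assumption: the index holds only the languages of the three test documents.
SUPPORTED_LANGS = ("en", "pl", "de")


def preferred_language(accept_language, supported=SUPPORTED_LANGS, default="en"):
    """Pick the supported language with the highest q-value from the header."""
    candidates = []
    for position, item in enumerate(accept_language.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        quality = 1.0  # q defaults to 1 when it is not given
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        primary = tag.split("-")[0]  # "pl-PL" -> "pl"
        if primary in supported and quality > 0:
            # Sort by descending quality, then by position in the header.
            candidates.append((-quality, position, primary))
    return min(candidates)[2] if candidates else default


def select_params(user_query, accept_language):
    """Build query parameters that restrict results to the user's language."""
    lang = preferred_language(accept_language)
    return urlencode({"q": user_query, "fq": "lang:" + lang, "indent": "true"})


print(select_params("title:Woda", "pl-PL,pl;q=0.9,en;q=0.8"))
# q=title%3AWoda&fq=lang%3Apl&indent=true
```

The resulting string can be appended to the address of the select handler. Using a filter query keeps the language restriction separate from the user's own query.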

Source:  http://solr.pl/en/2012/01/23/document-language-identification/



Published at DZone with permission of Rafał Kuć, author and DZone MVB.
