Rafał Andrzejewski is a team leader and software developer who has taken part in many Solr and Lucene projects. He recently joined the solr.pl site as a blogger, where he shares his knowledge about search engine oriented topics. Rafał is a DZone MVB, is not an employee of DZone, and has posted 8 posts at DZone.

“Car sale application” – schema.xml designing to gain what we really need (part 1)

06.09.2011

One of the fundamental pieces of Solr’s configuration is the schema.xml file. It is a kind of connector between what we need and what Solr understands. If we want a search engine that returns the results we really expect, it is very important to design the schema.xml file properly.
This is the first in a series of articles which will hopefully show you how to design the schema.xml file and how to handle and modify all of its components.

Requirements specification

Imagine we would like to use Solr to provide our car sale website with a search engine. The functional part of our website is, at the beginning, rather primitive and uses only a small piece of each car’s information:

  • make
  • model
  • year of production
  • price
  • engine size
  • mileage
  • colour
  • damaged

We would like to design a simple schema file that makes it possible to index data from the fields above. But before we open the schema.xml file and start typing, let’s answer seven fundamental questions about our fields:

1. What is the field type ?

Let’s determine the type of every field:

  • make – text field
  • model – text field
  • year of production – integer field
  • price – float field
  • engine size – integer field
  • mileage – integer field
  • colour – text field
  • damaged – boolean field

So what ?

So we will need some basic type definitions like string, boolean, int, float.

2. Is the field used in the search process?

We would like to use the data from some fields in order to enable our search engine to find the proper documents (car sale announcements). To accomplish that we are going to use 3 fields: make, model, year of production.

So what ?

So we will need to create another field type, which will contain some filters to make finding the documents easy and efficient. We will also create a field of this new type, into which we will copy all the data from the make, model and year of production fields.

3. Is the field used in faceting or sorting operations?

On our website we would like to sort search results using 4 fields: model, year of production, price and mileage. We would also like to be able to facet on these fields: make, model, year of production and colour.

So what ?

When we create a field type for fields used for sorting or faceting, we need to remember that this type cannot contain tokenizers or filters that split the field value into multiple tokens. We still want the values to be lowercased, though, so that letter case does not influence the sorting and faceting results. That is another field type we will need to create.

4. Is the field used to filter search results?

We would like to have the possibility to filter search results using ranges on fields: year of production, price, engine size and mileage.

So what ?

So let’s use the field types which will accelerate range queries.

5. Are there any fields which are not mentioned in the questions number 2, 3 or 4 ?

There is a field “damaged” which is not supposed to be involved in any of the mentioned operations.

So what?

So we will set the value of the “indexed” attribute to “false”.

6. Is the field required ?

We assume that 3 fields are required: make, model and year of production. We don’t want documents in the index (car sale announcements available in the search process) that have no values in those fields.

So what ?

So we will set the value of the “required” attribute to “true”.

7. Do we need to retrieve the information from the field in the original state?

We would like to retrieve the information from all of the fields mentioned in the requirements specification and present them directly on the website.

So what?

So we will set the value of the “stored” attribute to “true”.

Let’s add field type definitions

We’ve answered our questions and come to some conclusions, so let’s add field types to the schema file:

We add the solr.StrField type, which is not analysed and can be used for example as the type for the unique document key.

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

Add the boolean type:

<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>

Now the numerical types. Remember that we need types that can help us to accelerate range queries. So let’s use the tint and tfloat types:

<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>

Now let’s create the textual type, which will be a definition type for the catch-all field used for searching. For now, the type with the whitespace tokenizer and the lowercase filter will be just fine:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

And last, but not least, the type for the sortable/facetable fields. What we need is the type that lowercases the entire field value, keeping it as a single token. KeywordTokenizer does no actual tokenizing, so it is the ideal tokenizer for our need. The TrimFilterFactory removes any leading or trailing whitespace:

<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldType>


Time to add field definitions

Document id:

<field name="id" type="string" indexed="true" stored="true" required="true" />

Make and model:

<field name="make" type="text" indexed="false" stored="true" required="true" />
<field name="model" type="text" indexed="false" stored="true" required="true" />

Now why is the value of the “indexed” attribute set to “false” ? As far as we know, we need those fields for search, sort and facet operations. That’s true … but notice that for searching purposes we will copy the data from those fields to one catch-all field:

<field name="content" type="text" indexed="true" stored="false" multiValued="true"/>

and for sorting/faceting purposes we will copy the data to two further fields of the type “lowercase”:

<field name="make_sort" type="lowercase" indexed="true" stored="false" />
<field name="model_sort" type="lowercase" indexed="true" stored="false" />

So the fields make and model will not take part in the operations themselves, and we can set the “indexed” attribute to “false” to keep the index small.

The rest of the fields:

<field name="year" type="tint" indexed="true" stored="true" required="true" />
<field name="price" type="tfloat" indexed="true" stored="true" />
<field name="engine_size" type="tint" indexed="true" stored="true" />
<field name="mileage" type="tint" indexed="true" stored="true" />
<field name="colour" type="lowercase" indexed="true" stored="true" />

Remember the “false” value of the “indexed” attribute on the “damaged” field:

<field name="damaged" type="boolean" indexed="false" stored="true" />


copyField – let’s index the same data differently

We have already mentioned copying field values several times, so now let’s define the copy fields.

Fields used for searching are copied to the catch-all “content” field. There is more than one source field, which is why the “content” field definition has the multiValued attribute set to “true”:

<copyField source="make" dest="content"/>
<copyField source="model" dest="content"/>
<copyField source="year" dest="content"/>

Copying the sortable/facetable fields:

<copyField source="make" dest="make_sort"/>
<copyField source="model" dest="model_sort"/>
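
To see how the field and copyField definitions work together, here is a minimal sketch of a document in Solr’s XML update format. All values are invented for illustration, and the engine size is assumed to be given in cubic centimetres. Only the source fields are sent; the content, make_sort and model_sort fields are not supplied by the client, because Solr fills them through the copyField rules above.

```xml
<!-- Hypothetical car sale announcement; all values are invented for illustration -->
<add>
  <doc>
    <field name="id">1</field>
    <!-- required fields -->
    <field name="make">Ford</field>
    <field name="model">Focus</field>
    <field name="year">2006</field>
    <!-- optional fields (engine_size assumed to be in cubic centimetres) -->
    <field name="price">4500.0</field>
    <field name="engine_size">1600</field>
    <field name="mileage">120000</field>
    <field name="colour">Silver</field>
    <field name="damaged">false</field>
  </doc>
</add>
```

After indexing such a document, make_sort would hold the single token “ford”, while content would hold the tokens “ford”, “focus” and “2006”.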


Anything else ?

We shall add 3 more elements to the schema:

The unique key of the document:

<uniqueKey>id</uniqueKey>

Default search field:

<defaultSearchField>content</defaultSearchField>

Default query parser operator. Let’s set it to “AND”.

<solrQueryParser defaultOperator="AND"/>
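
In later Solr releases the defaultSearchField and solrQueryParser elements were deprecated in favour of the df and q.op request parameters. If you are on such a version, a rough equivalent could be set as handler defaults in solrconfig.xml. This is only a sketch: the handler name /select and the choice to put these values in the defaults list are assumptions, not part of the setup described in this post.

```xml
<!-- solrconfig.xml (sketch): per-handler equivalents of the two schema elements above -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- default search field, same role as <defaultSearchField> -->
    <str name="df">content</str>
    <!-- default operator, same role as <solrQueryParser defaultOperator="AND"/> -->
    <str name="q.op">AND</str>
  </lst>
</requestHandler>
```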

It’s done! The schema.xml configuration file is ready and looks like this:

<?xml version="1.0" encoding="UTF-8" ?>

<schema name="carsale" version="1.2">

  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>

    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.TrimFilterFactory" />
      </analyzer>
    </fieldType>

  </types>

  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="make" type="text" indexed="false" stored="true" required="true" />
    <field name="model" type="text" indexed="false" stored="true" required="true" />
    <field name="make_sort" type="lowercase" indexed="true" stored="false" />
    <field name="model_sort" type="lowercase" indexed="true" stored="false" />
    <field name="year" type="tint" indexed="true" stored="true" required="true" />
    <field name="price" type="tfloat" indexed="true" stored="true" />
    <field name="engine_size" type="tint" indexed="true" stored="true" />
    <field name="mileage" type="tint" indexed="true" stored="true" />
    <field name="colour" type="lowercase" indexed="true" stored="true" />
    <field name="damaged" type="boolean" indexed="false" stored="true" />
    <field name="content" type="text" indexed="true" stored="false" multiValued="true"/>

  </fields>

  <uniqueKey>id</uniqueKey>

  <defaultSearchField>content</defaultSearchField>

  <solrQueryParser defaultOperator="AND"/>

  <copyField source="make" dest="content"/>
  <copyField source="model" dest="content"/>
  <copyField source="make" dest="make_sort"/>
  <copyField source="model" dest="model_sort"/>
  <copyField source="year" dest="content"/>

</schema>
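
With the schema complete, the faceting and sorting requirements from question 3 map onto concrete fields. As an illustration only, the following solrconfig.xml fragment sketches a search handler that facets on those fields by default. The handler name /carsearch, the default sort order and the decision to configure faceting as defaults (rather than passing the parameters with each request) are all assumptions.

```xml
<!-- solrconfig.xml (sketch): hypothetical handler using the fields defined in the schema -->
<requestHandler name="/carsearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="facet">true</str>
    <!-- make and model are faceted through their lowercased copies -->
    <str name="facet.field">make_sort</str>
    <str name="facet.field">model_sort</str>
    <str name="facet.field">year</str>
    <str name="facet.field">colour</str>
    <!-- hypothetical default ordering: cheapest cars first -->
    <str name="sort">price asc</str>
  </lst>
</requestHandler>
```

A range filter from question 4 would then be passed with the individual request, for example fq=price:[1000 TO 5000].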


The end

In today’s post we have created a simple schema.xml file that allows us to index data and support the search functionality of our car sale website. But we still want to develop the website, which will surely affect the schema … and not only the schema. In the next “car sale” related post we will face some new requirements and make further modifications.


 

References
Published at DZone with permission of Rafał Andrzejewski, author and DZone MVB. (source)
