Rafał Andrzejewski is a team leader and software developer who has taken part in many Solr and Lucene projects. He recently joined the solr.pl site as a blogger, where he shares his knowledge of search-engine topics. Rafał is a DZone MVB and is not an employee of DZone and has posted 8 posts at DZone. You can read more from them at their website.

“Car sale application” – Result Grouping, let’s group some search results (part 6)

07.06.2011

In today’s post we will add new functionality to our car sale application: grouping of search results. Imagine a user who searches for “audi a4” advertisements and wants the results grouped by the car’s year of production, with 2-3 results in every group. And how about range grouping, for example by mileage ranges? Today we accept the challenge.

New functionality request parameters description

Result grouping functionality has been available since Solr 3.3. Let’s get to know the request parameters we will need:

  • group – turn on and off result grouping
  • group.field – the name of the field used to group search results. The field used for grouping (year of production in our case) must be single-valued and of the string/text type
  • group.query – a query whose matching documents form one group; repeating the parameter lets us group results by ranges, for example mileage ranges
  • group.limit – the number of results to return for each group


These four basic parameters allow us to achieve what we want.
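To make the parameters concrete, here is a minimal Python sketch that assembles them into a request URL. The base URL assumes a local Solr instance and the helper name build_group_by_field_url is hypothetical; neither comes from the application itself.

```python
from urllib.parse import urlencode

# Assumed base URL of a local Solr instance; adjust to your deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_group_by_field_url(query, field, limit, fields):
    """Return a select URL that groups results by a single field."""
    params = [
        ("q", query),
        ("group", "true"),            # turn result grouping on
        ("group.field", field),       # field to group by
        ("group.limit", str(limit)),  # max documents returned per group
        ("fl", ",".join(fields)),     # fields to return for each document
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_group_by_field_url(
    "audi a4", "year_group", 2, ["id", "mileage", "make", "model", "year"]
)
print(url)
```

urlencode percent-encodes the commas in the fl value (as %2C), which Solr decodes back to the plain comma-separated list.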

schema.xml changes

We may need to change schema.xml to make sure that the group field is of the proper type (“string” or “text”). We would like to group our search results by the “year” field, so let’s recall how its definition looks right now:

<field name="year" type="tint" indexed="true" stored="true" required="true" />
 

The field is of integer type. In order to be able to group results using this field, we create another “year” field, let’s call it “year_group”, which will have the string type:

<field name="year_group" type="string" indexed="true" stored="false" />
 

and copy the content of the “year” field to the new field called “year_group”:

<copyField source="year" dest="year_group"/>
 

That’s practically all the changes we need to make in our schema.xml configuration file.

Some sample data

Let’s now create some sample data in order to test the new functionality. We assume that we have some samples of Audi A4 car data. Two of them are year 2002, another two 2003 and the last one is 2006. Additionally, one of them has the mileage below 100 000 km, three of them have the mileage in the range between 100 000 km and 199 999 km and the last one has the mileage over 200 000 km:

<add>
   <doc>
      <field name="id">1</field>
      <field name="make">Audi</field>
      <field name="model">A4</field>
      <field name="year">2002</field>
      <field name="price">22700</field>
      <field name="engine_size">1900</field>
      <field name="mileage">197000</field>
      <field name="colour">green</field>
      <field name="damaged">false</field>
      <field name="city">Koszalin</field>
      <field name="loc">54.12,16.11</field>
   </doc>
   <doc>
      <field name="id">2</field>
      <field name="make">Audi</field>
      <field name="model">A4</field>
      <field name="year">2003</field>
      <field name="price">27800</field>
      <field name="engine_size">1900</field>
      <field name="mileage">220000</field>
      <field name="colour">black</field>
      <field name="damaged">false</field>
      <field name="city">Bialystok</field>
      <field name="loc">53.08,23.09</field>
   </doc>
   <doc>
      <field name="id">3</field>
      <field name="make">Audi</field>
      <field name="model">A4</field>
      <field name="year">2002</field>
      <field name="price">21300</field>
      <field name="engine_size">1900</field>
      <field name="mileage">125000</field>
      <field name="colour">black</field>
      <field name="damaged">false</field>
      <field name="city">Szczecin</field>
      <field name="loc">53.25,14.35</field>
   </doc>
   <doc>
      <field name="id">4</field>
      <field name="make">Audi</field>
      <field name="model">A4</field>
      <field name="year">2003</field>
      <field name="price">30300</field>
      <field name="engine_size">1900</field>
      <field name="mileage">150000</field>
      <field name="colour">red</field>
      <field name="damaged">false</field>
      <field name="city">Gdansk</field>
      <field name="loc">54.21,18.40</field>
   </doc>
   <doc>
      <field name="id">5</field>
      <field name="make">Audi</field>
      <field name="model">A4</field>
      <field name="year">2006</field>
      <field name="price">32100</field>
      <field name="engine_size">1900</field>
      <field name="mileage">9900</field>
      <field name="colour">red</field>
      <field name="damaged">false</field>
      <field name="city">Swidnik</field>
      <field name="loc">52.15,21.00</field>
   </doc>
</add>
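If the sample data lives in code rather than in a hand-written file, the same kind of update message can be generated. The sketch below is only an illustration: the helper name build_add_xml is hypothetical, and each car lists just a subset of the fields for brevity (a real schema may require the rest).

```python
import xml.etree.ElementTree as ET


def build_add_xml(cars):
    """Serialize a list of dicts into a Solr <add> update message."""
    add = ET.Element("add")
    for car in cars:
        doc = ET.SubElement(add, "doc")
        for name, value in car.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Two of the five sample cars, with a shortened field list.
cars = [
    {"id": 1, "make": "Audi", "model": "A4", "year": 2002, "mileage": 197000},
    {"id": 5, "make": "Audi", "model": "A4", "year": 2006, "mileage": 9900},
]
xml_payload = build_add_xml(cars)
print(xml_payload)
```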

Let’s create queries

Using the parameters described at the beginning of the article, we create the “audi A4” query, which will show us some search results grouped by the year of production:

?q=audi+a4&group=true&group.field=year_group&group.limit=2&fl=id,mileage,make,model,year
 

As we can see, we have limited the results in every group to at most 2. In the response we want only the fields that help us clearly identify the documents: id, mileage, make, model and year. The response looks like this:

<lst name="grouped">
  <lst name="year_group">
    <int name="matches">5</int>
    <arr name="groups">
      <lst>
        <str name="groupValue">2002</str>
        <result name="doclist" numFound="2" start="0">
          <doc>
            <str name="id">1</str>
            <str name="make">Audi</str>
            <int name="mileage">197000</int>
            <str name="model">A4</str>
            <int name="year">2002</int>
          </doc>
          <doc>
            <str name="id">3</str>
            <str name="make">Audi</str>
            <int name="mileage">125000</int>
            <str name="model">A4</str>
            <int name="year">2002</int>
          </doc>
        </result>
      </lst>
      <lst>
        <str name="groupValue">2003</str>
        <result name="doclist" numFound="2" start="0">
          <doc>
            <str name="id">2</str>
            <str name="make">Audi</str>
            <int name="mileage">220000</int>
            <str name="model">A4</str>
            <int name="year">2003</int>
          </doc>
          <doc>
            <str name="id">4</str>
            <str name="make">Audi</str>
            <int name="mileage">150000</int>
            <str name="model">A4</str>
            <int name="year">2003</int>
          </doc>
        </result>
      </lst>
      <lst>
        <str name="groupValue">2006</str>
        <result name="doclist" numFound="1" start="0">
          <doc>
            <str name="id">5</str>
            <str name="make">Audi</str>
            <int name="mileage">9900</int>
            <str name="model">A4</str>
            <int name="year">2006</int>
          </doc>
        </result>
      </lst>
    </arr>
  </lst>
</lst>

Let’s analyse the response. We have 5 matches:

<int name="matches">5</int>

The response has been split into 3 independent groups:

  1. <str name="groupValue">2002</str>

     where we have two (numFound="2") 2002 cars

  2. <str name="groupValue">2003</str>

     where we have two (numFound="2") 2003 cars

  3. <str name="groupValue">2006</str>

     where we have one (numFound="1") 2006 car

That’s correct!
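The same check can be done programmatically. Below is a minimal Python sketch that reads the grouped section and lists each group value with its numFound and document ids. The SAMPLE string is a trimmed copy of the response above, and the function name summarize_groups is hypothetical; in a full Solr response the "grouped" element is nested inside the response root, so it would have to be located first.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the grouped section of the response (two groups only).
SAMPLE = """
<lst name="grouped">
  <lst name="year_group">
    <int name="matches">5</int>
    <arr name="groups">
      <lst>
        <str name="groupValue">2002</str>
        <result name="doclist" numFound="2" start="0">
          <doc><str name="id">1</str></doc>
          <doc><str name="id">3</str></doc>
        </result>
      </lst>
      <lst>
        <str name="groupValue">2006</str>
        <result name="doclist" numFound="1" start="0">
          <doc><str name="id">5</str></doc>
        </result>
      </lst>
    </arr>
  </lst>
</lst>
"""


def summarize_groups(grouped_xml, field):
    """Return (matches, [(group value, numFound, [doc ids])]) for one group field."""
    root = ET.fromstring(grouped_xml)
    by_field = root.find("lst[@name='%s']" % field)
    matches = int(by_field.find("int[@name='matches']").text)
    groups = []
    for group in by_field.findall("arr[@name='groups']/lst"):
        value = group.find("str[@name='groupValue']").text
        doclist = group.find("result[@name='doclist']")
        ids = [d.find("str[@name='id']").text for d in doclist.findall("doc")]
        groups.append((value, int(doclist.get("numFound")), ids))
    return matches, groups


matches, groups = summarize_groups(SAMPLE, "year_group")
print(matches, groups)
```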

Now let’s create a query which will group our search results by mileage ranges. We assume that we have 3 ranges:

  1. 0 km to 99999 km
  2. 100000 km to 199999 km
  3. 200000 km and above

Query:

?q=audi+a4&group=true&group.query=mileage:[0+TO+99999]&group.query=mileage:[100000+TO+199999]&group.query=mileage:[200000+TO+*]&group.limit=3&fl=id,mileage,make,model,year
 

and response:

<lst name="grouped">
  <lst name="mileage:[0 TO 99999]">
    <int name="matches">5</int>
    <result name="doclist" numFound="1" start="0">
      <doc>
        <str name="id">5</str>
        <str name="make">Audi</str>
        <int name="mileage">9900</int>
        <str name="model">A4</str>
        <int name="year">2006</int>
      </doc>
    </result>
  </lst>
  <lst name="mileage:[100000 TO 199999]">
    <int name="matches">5</int>
    <result name="doclist" numFound="3" start="0">
      <doc>
        <str name="id">1</str>
        <str name="make">Audi</str>
        <int name="mileage">197000</int>
        <str name="model">A4</str>
        <int name="year">2002</int>
      </doc>
      <doc>
        <str name="id">3</str>
        <str name="make">Audi</str>
        <int name="mileage">125000</int>
        <str name="model">A4</str>
        <int name="year">2002</int>
      </doc>
      <doc>
        <str name="id">4</str>
        <str name="make">Audi</str>
        <int name="mileage">150000</int>
        <str name="model">A4</str>
        <int name="year">2003</int>
      </doc>
    </result>
  </lst>
  <lst name="mileage:[200000 TO *]">
    <int name="matches">5</int>
    <result name="doclist" numFound="1" start="0">
      <doc>
        <str name="id">2</str>
        <str name="make">Audi</str>
        <int name="mileage">220000</int>
        <str name="model">A4</str>
        <int name="year">2003</int>
      </doc>
    </result>
  </lst>
</lst>

Again we have 5 matches. The first group contains a car with a mileage of 9900 km, the second group contains cars with mileages of 197000 km, 125000 km and 150000 km, and the third group contains a car with a mileage of 220000 km. We achieved what we wanted. Mission accomplished.
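Writing one group.query per range by hand gets tedious when the ranges change, so here is a minimal Python sketch that derives them from a list of lower bounds. The helper name mileage_range_queries is hypothetical, and the sketch assumes integer bounds with the last range open-ended, as in our three ranges.

```python
from urllib.parse import urlencode


def mileage_range_queries(boundaries, field="mileage"):
    """Turn sorted lower bounds into Solr range queries; the last range is open-ended."""
    queries = []
    for i, low in enumerate(boundaries):
        # Each range ends just below the next lower bound; the last one uses "*".
        high = boundaries[i + 1] - 1 if i + 1 < len(boundaries) else "*"
        queries.append("%s:[%s TO %s]" % (field, low, high))
    return queries


queries = mileage_range_queries([0, 100000, 200000])
params = [("q", "audi a4"), ("group", "true")]
params += [("group.query", q) for q in queries]  # one parameter per range
params += [("group.limit", "3"), ("fl", "id,mileage,make,model,year")]
query_string = urlencode(params)
print(queries)
print(query_string)
```

The encoded query string looks different from the hand-written one because urlencode percent-encodes the colons and brackets, but it decodes to the same parameter values.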

The end

Yet another piece of functionality, this time search results grouping, has now been added to our car sale application. We will surely see what the users’ opinions will be :)

References
Published at DZone with permission of Rafał Andrzejewski, author and DZone MVB. (source)



Comments

David Martin replied on Wed, 2011/07/06 - 9:16am

Nice article, thanks for sharing your knowledge.

I have a few questions, if you can spare a few more minutes on this topic:

1- How can I know how many groups there are (let's take your first example: how can I know that there are 3 groups, each corresponding to one or more docs)?

2- What if I want one and only one document per group (group.limit=1) but I want it to be the one that has the lowest price? Is this possible, and if yes, how?

3- Let's say now that you have many different brands (and not only Audi) and you want to group the results based on the brand AND have a subgrouping field like the mileage. Is this possible, and if yes, can you tell me how?

 

Thanks

Rafał Andrzejewski replied on Wed, 2011/07/06 - 2:46pm in response to: David Martin

Hi Martin,

1. You can simply add another request parameter called group.ngroups and set its value to "true", for example:

http://localhost:8983/solr/select?q=audi+a4&group=true&group.field=year_group&
group.limit=2&fl=id,mileage,make,model,year&group.ngroups=true

Then the response contains an element whose name attribute is "ngroups"; its value shows how many groups were created:

<lst name="grouped">
    <lst name="year_group">
      <int name="matches">5</int>
      <int name="ngroups">3</int>
      <arr name="groups">
    <lst>
       ...
</lst>

2. Yes, it is possible. As you mentioned, first add group.limit=1 and then add another request parameter called group.sort with the field and order by which you want the results within each group to be sorted. For example:

?q=audi+a4&group=true&group.field=year_group&group.limit=1&
fl=id,mileage,make,model,year,price&group.sort=price+asc

In the response, for example for the 2003 group, you will get:

<lst>
  <str name="groupValue">2003</str>
  <result name="doclist" numFound="2" start="0">
    <doc>
      <str name="id">2</str>
      <str name="make">Audi</str>
      <int name="mileage">220000</int>
      <str name="model">A4</str>
      <float name="price">27800.0</float>
      <int name="year">2003</int>
    </doc>
  </result>
</lst>

3. No, unfortunately it is not possible (not yet, hopefully :)

David Martin replied on Thu, 2011/07/07 - 10:16am

Thanks for your answer. Field grouping is clearly a must have in an index, and I'm glad to see this at least in solr/lucene core (not as some patches anymore...)
