Rafał Kuć is a team leader and software developer, currently working as a software architect and Solr and Lucene specialist. He focuses mainly on Java, but is open to any tool or programming language that helps him reach his goal more easily and quickly. Rafał is also one of the founders of the solr.pl site, where he shares his knowledge and helps people with their problems. Rafał is a DZone MVB, is not an employee of DZone, and has posted 72 posts at DZone.

Simple Photo Search with Solr and Tika

02.21.2012

Recently we had a chance to help with a non-commercial project that included search as one of its parts. One of the assumptions, although not a key one, was photo search functionality, so that users could find pictures quickly and accurately. Because the search had to work with the metadata of JPEG files, the idea was simple: use Apache Solr with Apache Tika.

Assumptions

The assumptions were quite simple: the user should be able to find photos by their file name, author and other data available in EXIF, like aperture, shutter speed, focal length or ISO value. Another requirement was that Solr should take care of extracting the metadata from the JPEG files, so this was definitely something we wanted to use Solr Cell for.

Index structure

The index structure was very simple and contained only the most needed fields. The fields section of the schema.xml file looked as follows:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text" indexed="true" stored="true" />
<field name="author" type="text" indexed="true" stored="true" />
<field name="iso" type="text" indexed="true" stored="true" multiValued="true" />
<field name="iso_string" type="text" indexed="true" stored="true" multiValued="true" />
<field name="aperture" type="double" indexed="true" stored="true" />
<field name="exposure" type="string" indexed="true" stored="true" />
<field name="exposure_time" type="double" indexed="true" stored="true" />
<field name="focal" type="string" indexed="true" stored="true" />
<field name="focal_35" type="string" indexed="true" stored="true" />
<dynamicField name="ignored_*" type="string" indexed="false" stored="false" multiValued="true" />

The dynamic field was added to ignore the data we weren’t interested in. In addition, a copyField definition (not shown in the listing above) copies the value of the iso field to the iso_string field to enable faceting.
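
To illustrate what the iso_string field is for, here is a minimal Python sketch that builds the parameters of a facet request. It is not part of the original project: the function name facet_params is made up for this example, and the match-all query with rows=0 is only one common way to ask for facet counts alone.

```python
from urllib.parse import urlencode

def facet_params(field):
    """Build query-string parameters that ask Solr only for facet counts.

    `facet_params` is a hypothetical helper for this article; `facet` and
    `facet.field` are standard Solr request parameters.
    """
    return urlencode({
        "q": "*:*",            # match every photo
        "rows": 0,             # no documents needed, only the counts
        "facet": "true",
        "facet.field": field,  # field to count values for
    })

# Parameters for counting photos per ISO value.
iso_facets = facet_params("iso_string")
```

The resulting string can be appended to the select URL of the photos core, for example with curl.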

Solr configuration

The following handler definition was added to solrconfig.xml file:

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
 <lst name="defaults">
  <str name="uprefix">ignored_</str>
  <str name="lowernames">true</str>
  <str name="captureAttr">true</str>
  <str name="fmap.stream_name">name</str>
  <str name="fmap.artist">author</str>
  <str name="fmap.exif_isospeedratings">iso</str>
  <str name="fmap.exif_fnumber">aperture</str>
  <str name="fmap.exposure_time">exposure</str>
  <str name="fmap.exif_exposuretime">exposure_time</str>
  <str name="fmap.focal_length">focal</str>
  <str name="fmap.focal_length_35">focal_35</str>
 </lst>
</requestHandler>

A few words about the configuration. The uprefix parameter tells Solr which prefix to add to the names of fields that are not defined in the schema. In the above case, such fields will be prefixed with ignored_. That means they will be matched by the dynamic field and thus neither indexed nor stored (indexed="false" and stored="false"). The lowernames parameter with the value of true causes all field names to be lowercased, with characters other than letters and digits replaced by underscores. The captureAttr parameter tells Solr to index the attributes of the XHTML elements produced by Tika into separate fields. The remaining parameters define the mapping between the fields returned by Tika and the fields in the index. For example, fmap.exif_fnumber with the value of aperture tells Solr to place the value of the Tika exif_fnumber field in the aperture index field.
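
The naming rules above can be modelled in a few lines of Python. This is only a rough sketch of the documented order (lowernames first, then the fmap rules, then uprefix for fields the schema does not know), not Solr's actual code. The function name target_field and the raw metadata names used in the example calls are assumptions made for illustration.

```python
# Explicit fields from the schema.xml listing in this article.
SCHEMA_FIELDS = {
    "id", "name", "author", "iso", "iso_string", "aperture",
    "exposure", "exposure_time", "focal", "focal_35",
}

# The fmap.* entries from the handler defaults.
FMAP = {
    "stream_name": "name",
    "artist": "author",
    "exif_isospeedratings": "iso",
    "exif_fnumber": "aperture",
    "exposure_time": "exposure",
    "exif_exposuretime": "exposure_time",
    "focal_length": "focal",
    "focal_length_35": "focal_35",
}

def target_field(raw_name, uprefix="ignored_", lowernames=True):
    """Approximate which index field a metadata name ends up in."""
    name = raw_name
    if lowernames:
        # Lowercase letters and digits, turn everything else into "_".
        name = "".join(ch.lower() if ch.isalnum() else "_" for ch in name)
    # Apply an explicit mapping if one exists (a single pass, no chaining).
    name = FMAP.get(name, name)
    # Fields unknown to the schema get the prefix and are then caught
    # by the ignored_* dynamic field.
    if name not in SCHEMA_FIELDS:
        name = uprefix + name
    return name

# Illustrative inputs; the exact raw names Tika emits may differ.
aperture_field = target_field("exif:FNumber")
exposure_field = target_field("Exposure Time")
ignored_field = target_field("Content-Type")
```

Here the first two names land in the aperture and exposure fields, while the third has no mapping and no schema field, so it receives the ignored_ prefix.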

Additional, needed libraries

In order for the above configuration to work we need some additional libraries (similar to the ones described for language identification). From the dist directory of the Solr distribution we copy the apache-solr-cell-3.5.0.jar file to the tikaLib directory, which should be created at the same level as the webapps directory of the Solr deployment (of course, this is only an example). Next we add the following line to the solrconfig.xml file:

<lib dir="../tikaLib/" />

The above tells Solr to include all the libraries from the given directory. Next we need to copy all the jar files from the contrib/extraction/ directory of the Solr distribution to the newly created tikaLib directory. No additional solrconfig.xml changes are needed.

Data indexation

The assumption was that there would be about 10,000 new photos a week to index. Those photos would be stored in a shared file system location. A simple bash script was responsible for choosing the files that needed to be indexed, and it ran the following command for each file:

curl "http://solrmaster:8983/solr/photos/update/extract?literal.id=9926&commit=true" -F "myfile=@Wisla_2011_10_10.JPG"

The above command sends a file named Wisla_2011_10_10.JPG to the /update/extract handler and tells Solr to run a commit after processing it. In addition, the unique id of the document is set with the literal.id parameter.
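
The original bash script is not shown, so the following Python sketch only illustrates how such a loop could look. The function names, the rule for picking files by extension and the use of the file name as the document id are all assumptions made for this example, not the project's actual logic.

```python
import subprocess
from pathlib import Path
from urllib.parse import urlencode

SOLR_EXTRACT_URL = "http://solrmaster:8983/solr/photos/update/extract"

def build_curl_command(photo, doc_id, commit=False):
    """Return the curl argument list that posts one photo to Solr Cell."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    url = SOLR_EXTRACT_URL + "?" + urlencode(params)
    return ["curl", url, "-F", "myfile=@" + str(photo)]

def index_directory(directory):
    """Post every JPEG in a directory, committing only with the last one."""
    photos = sorted(
        p for p in Path(directory).iterdir()
        if p.suffix.lower() in {".jpg", ".jpeg"}
    )
    for position, photo in enumerate(photos):
        is_last = position == len(photos) - 1
        # Hypothetical id scheme: the file name without its extension.
        command = build_curl_command(photo, photo.stem, commit=is_last)
        subprocess.run(command, check=True)

# Rebuilds the command shown above for a single file.
example_command = build_curl_command(Path("Wisla_2011_10_10.JPG"), "9926", commit=True)
```

Unlike the command in the article, this sketch commits only once per batch. With thousands of files a week that is usually cheaper than a commit per file, although whether it matters depends on the deployment.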

Queries

In addition to some standard filtering by author or other attributes of the photo, it was also desired for the search to work. Yeah, just work :) We decided that if we were the users of the application, we would want fields like author or file name to be the important ones. So we decided to start with the following query:

q=jan+kowalski+wisla&qf=name^100+author^1000+iso+aperture+exposure_time+focal&defType=dismax

As you can see, the query is simple. Two fields in the index are more valuable than the others: the name of the photo and its author. Their importance was set by adding query-time boosts. The rest of the fields have no explicit boost, so the default boost of 1 applies.
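
As a small illustration, the same parameters can be generated from a table of field boosts. The helper name dismax_params is made up for this example. Note that urlencode escapes the ^ character as %5E, which is the URL-safe form of the boost syntax used above.

```python
from urllib.parse import urlencode

def dismax_params(user_query, boosts):
    """Build dismax parameters from a mapping of field name to boost."""
    # A boost of 1 is the default, so the field is listed without "^1".
    qf = " ".join(
        field if boost == 1 else "{}^{}".format(field, boost)
        for field, boost in boosts.items()
    )
    return urlencode({"q": user_query, "qf": qf, "defType": "dismax"})

photo_query = dismax_params(
    "jan kowalski wisla",
    {
        "name": 100,
        "author": 1000,
        "iso": 1,
        "aperture": 1,
        "exposure_time": 1,
        "focal": 1,
    },
)
```

Keeping the boosts in one mapping makes it easy to adjust them later, once real user behavior shows which fields matter most.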

To sum up

The described deployment is really simple. The application works, and so does the search :) The next steps will be JVM and Solr tuning. One of the most important things will be to look at user behavior and tune the searches to make the search experience as good as possible. But let’s leave that for another solr.pl post.


Source:  http://solr.pl/en/2012/02/20/simple-photo-search/

Published at DZone with permission of Rafał Kuć, author and DZone MVB.


Comments

Goel Yatendra replied on Thu, 2012/03/15 - 3:45pm

You can analyze the results returned by Solr and also the queries that returned zero results. The simplest way is to analyze only the logs and the queries that return zero results. You could also use some search analytics software to see your users' behavior. Solr doesn’t have something like that out of the box, but there are some tools out there.
