Rafal Kuc is a team leader and software developer. Right now he is a software architect and Solr and Lucene specialist. Mainly focused on Java, but open on every tool and programming language that will make the achievement of his goal easier and faster. Rafal is also one of the founders of solr.pl site where he tries to share his knowledge and help people with their problems. Rafał is a DZone MVB and is not an employee of DZone and has posted 75 posts at DZone. You can read more from them at their website. View Full User Profile

Lucene and Solr's CheckIndex to the Rescue!

09.22.2011
| 7736 views |
  • submit to reddit

While using Lucene and Solr we are used to a very high reliability. However, there may come a day when Solr will inform us that our index is corrupted, and we need to do something about it. Is the only way to repair the index to restore it from the backup or do full indexation? No – there is hope in the form of CheckIndex tool.

What is CheckIndex ?

CheckIndex is a tool available in the Lucene library, which allows you to check the files and create new segments that do not contain problematic entries. This means that this tool, with little loss of data is able to repair a broken index, and thus save us from having to restore the index from the backup (of course if we have it) or do the full indexing of all documents that were stored in Solr.

Where do I start?

Please note that, according to what we find in Javadocs, this tool is experimental and may change in the future. Therefore, before starting to work with it we should create a copy of the index. In addition, it is worth knowing that the tool analyzes the index byte by byte, and thus for large indexes the time of analysis and repair may be large. It is important not to run the tool with the -fix option at the moment when it is used by Solr or other applications based on the Lucene library. Finally, be aware that the launch of the tool in repairing mode may result in removal of some or all documents that are stored in the index.

How to run it

To run the utility, go to the directory where the Lucene library files are located and run the following command:
java -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex INDEX_PATH -fix

In my case, it looked as follows:

java -cp lucene-core-2.9.3.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex E:\Solr\solr\data\index\ -fix
After a while I got the following information:
Opening index @ E:Solrsolrdataindex

Segments file=segments_2 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
1 of 1: name=_0 docCount=19
compound=false
hasProx=true
numFiles=11
size (MB)=0,018
diagnostics = {os.version=6.1, os=Windows 7, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=flush, os.arch=x86, java.version=1.6.0_23, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields..............OK [15 fields]
test: field norms.........OK [15 fields]
test: terms, freq, prox...OK [900 terms; 1517 terms/docs pairs; 1707 tokens]
test: stored fields.......OK [232 total field count; avg 12,211 fields per doc]
test: term vectors........OK [3 total vector count; avg 0,158 term/freq vector fields per doc]

No problems were detected with this index.

It mean that the index is correct and there was no need for any corrective action. Additionally, you can learn some interesting things about the index ;)

Broken index

But what happens in the case of the broken index? There is only one way to see it – let’s try. So, I broke one of the index files and ran the CheckIndex tool. The following appeared on the console after I’ve run the CheckIndex tool:


Opening index @ E:Solrsolrdataindex

Segments file=segments_2 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
1 of 1: name=_0 docCount=19
compound=false
hasProx=true
numFiles=11
size (MB)=0,018
diagnostics = {os.version=6.1, os=Windows 7, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=flush, os.arch=x86, java.version=1.6.0_23, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........FAILED
WARNING: fixIndex() would remove reference to this segment; full exception:
org.apache.lucene.index.CorruptIndexException: did not read all bytes from file "_0.fnm": read 150 vs size 152
at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:370)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:119)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:652)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:605)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:491)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

WARNING: 1 broken segments (containing 19 documents) detected
WARNING: 19 documents will be lost

NOTE: will write new segments file in 5 seconds; this will remove 19 docs from the index. THIS IS YOUR LAST CHANCE TO CTRL+C!
5...
4...
3...
2...
1...
Writing...
OK
Wrote new segments file "segments_3"

As you can see, all the 19 documents that were in the index have been removed. This is an extreme case, but you should realize that this tool might work like this.

The end

If you remember about the basisc assumptions associated with the use of the CheckIndex tool you may find yourself in a situation when this tool will come in handy and you will not have to ask yourself a question like “When was the last backup was made?”


References
Published at DZone with permission of Rafał Kuć, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)