There are three things I’ve been meaning to do for a long time now…
- Learn more about Solr’s DataImportHandler
- Build a kick-ass search system for the ISFDB (for those of you who don’t know, I’m kind of a fiend about collecting antique sci-fi paperbacks and magazines)
- Write more blogs about Solr
So this year, my New Year’s resolution is to do all three at the same time.
One of my frequent annoyances about most tech articles you can find online is that they are either too simplistic or too vague. Authors who “show the code” typically wind up using trivial, simplistic examples so that the code can all be explained in the article w/o it turning into a novel. Authors who want to talk about “bigger”, more interesting concepts tend to be hand-wavy, and either don’t show the code or don’t have the time/energy to go into it all in detail. In both cases, most authors gloss over the problems they encountered along the way and just show finished products.
My goal for this year is to write a series of blog posts that don’t suffer from either of those problems. I’m going to try to spend a few hours every week (or every other week) adding code to a public repository on GitHub, which will serve as the foundation for an article I’ll write that week on how I made incremental progress improving the existing setup in some way. You, the inquisitive reader, will not only be able to “see the code” behind each article, but you can follow along with every commit and see all of the individual progress, warts and all.
This week, being week one, is all about bootstrapping. What you’ll find if you check out the blog_1 tag is a README.txt file, a fairly straightforward build.xml file (used by “ant”), and some very simplistic Solr configuration files for indexing some of the ISFDB data. If you want to try running all of this yourself, take a look at the README.txt file for instructions on how to fetch all the ISFDB data and load it into your own MySQL server, and then how to index it with Solr using the included configs. If you have problems, or are confused by the process, please post them here in the comments, and I’ll look into improving the steps in future iterations.
As I mentioned, the Solr configs are extremely simplistic:
- One Solr document for each author+title pair (a title in ISFDB might be a novel or a short story, etc… but each title may have multiple authors, and it might have appeared in multiple publications — we’ll worry about publications later).
- Every column in the DB is being indexed as a solr field using a dynamicField on “*”
- All of the solr fields use a simple “string” field type
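For illustration, a schema along these lines would give that “everything is a string” behavior. This is only a sketch, not the actual schema.xml from the blog_1 tag: the attribute values (indexed, stored, sortMissingLast) are assumptions, and the surrounding schema elements are left out.

```xml
<!-- Sketch only: attribute values are assumptions, not the repository's real schema.xml -->

<!-- A plain, untokenized string type: values are indexed exactly as they arrive -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- The catch-all: any column name coming out of the DB matches "*"
     and becomes a Solr field of type "string" -->
<dynamicField name="*" type="string" indexed="true" stored="true"/>
```

The appeal of the catch-all is that no column has to be declared up front; the cost is that nothing is analyzed, so only exact matches on whole values will hit.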
In future articles we’ll look at iteratively improving on these configs, but for now I want to point out some of the pain points I encountered just with this simple setup. Like I said before: warts and all! (Unfortunately, because this was my “bootstrapping” of the code base, I screwed up and only committed to git once, at the end of getting everything set up, so I can’t point out each of the mistakes as I made them. I’ll try harder to preserve my screw-ups next time.)
- DIH Configuration is awkward VooDoo – DataImportHandler uses its own config file to drive all of the behavior around where to find the “entities” that you want to index. There is an open bug about the fact that you have to specify this config file in some very peculiar ways (that are very different from how most things in Solr work), otherwise it won’t work properly. I got bitten by this bug when I tried to specify my “config” property as an “invariant” param instead of a “default” (because it’s not something that I want anyone to change at request time). The result was an ever so confusing “DataImportHandler started. Not Initialized. No commands can be run”, which went away once I did some digging and discovered that bug.
- DIH Really Wants a Primary Key – Solr doesn’t require that you have a uniqueKey field (your index can just be a pile of unstructured documents if you want), and DataImportHandler’s docs say that having a primary key is optional, but when I tried setting up my index w/o one, I got some painful RuntimeExceptions trying to initialize the DIH.
- MySQL ‘0000-00-00’ Dates – The DataImportHandler FAQ has a helpful tip about dealing with MySQL databases that contain bogus dates and how to tweak your JDBC URL to convertToNull … but … before I realized that I had that problem, I was getting some really confusing errors…
[java] WARNING: Error reading data
[java] java.sql.SQLException: Value '#2386Meredith Price#######191##0##69Price#2296#2173#2386#1#2173The Dark Angel#####
[java] 0000-00-00#NOVEL##61#0##20#0nk_M._Robinson#4016#http://www.imdb.com/name/nm0732631/#1##177Robinson#2295#2172#1266#1#2172#The Dark Beyond the Stars#####
[java] 1991-07-00#NOVEL6http://en.wikipedia.org/wiki/The_Dark_Beyond_the_Stars#538#0##228#0,http://en.wikipedia.org/wiki/The_Dark_Design#647#0##330#03339#0#8.8#1775#0766#0##370#0les_of_Shadow_Valley#4471137995##181#0' can not be represented as java.sql.Date
[java] at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1075)
… some error message, huh? Fortunately the “can not be represented as java.sql.Date” part, combined with a little poking in the database to inspect the record for “Meredith Price”, caused the “0000-00-00” part of her record to jump out at me, and I remembered seeing that issue in the FAQ. But I honestly have no idea why a bogus date would cause the JDBC driver to try to interpret the entire row as one Date object. Fortunately, I don’t think this is Solr’s fault.
Update: I modified the error message after the initial post to replace the gibberish non-characters with ‘#’ … they were making the feed readers cry in pain.
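To make the first and third pain points concrete, here is a rough sketch of the two config fragments involved. Neither is copied from the blog_1 tag: the handler path /dataimport, the file name data-config.xml, the database name isfdb, the host, and the credentials are all placeholders assumed for illustration. The parts that come from the discussion above are that “config” sits under “defaults” rather than “invariants”, and that the JDBC URL tells the MySQL driver to convertToNull.

```xml
<!-- solrconfig.xml (sketch; handler name and file name are assumed) -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <!-- "config" goes under "defaults"; putting it under "invariants" is
       what led to the "Not Initialized" message described above -->
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```

```xml
<!-- data-config.xml (sketch; host, database, user and password are placeholders) -->
<dataConfig>
  <!-- zeroDateTimeBehavior=convertToNull asks MySQL Connector/J to hand back
       NULL for '0000-00-00' dates instead of throwing a SQLException -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/isfdb?zeroDateTimeBehavior=convertToNull"
              user="YOUR_USER"
              password="YOUR_PASSWORD"/>
  <!-- the document/entity definitions are omitted here -->
</dataConfig>
```

If more parameters ever get added to that URL, remember that the separating ampersand has to be written as `&amp;` inside an XML attribute.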
Once I got past those problems, everything worked really nicely. On my laptop it took under 2 minutes for DIH to index the 658842 documents in the database, and using the Schema Browser I can already see some interesting trends in the data. (Asimov, Isaac is surprisingly not the most prolific author in ISFDB; apparently it’s Bleiler, Everett Franklin, but that’s evidently because he was an editor on thousands of titles written by other authors, and I didn’t exclude that type of relationship.)
Ok, that’s all for this week … My plan for next time is to talk about iteratively improving the schema, but who knows what I’ll think of between now and then. If you have any requests or questions, please post them in the comments.