
Doug has been engrossed in programming since his parents first bought him an Apple IIe computer in 4th grade. Throughout his early career, Doug proved his flexibility and ingenuity in crafting solutions in a variety of environments. Doug’s most recent work has been in the telecom industry developing tools to analyze large amounts of network traffic using C++ and Python. Doug loves learning and synthesizing this knowledge into code and blog articles. Doug is a DZone MVB and is not an employee of DZone and has posted 36 posts at DZone. You can read more from them at their website.

Escaping Solr Query Characters In Python


I’ve been working on some Python Solr client code. One area where bugs have cropped up is query terms that need to be escaped before being passed to Solr. For example, how do you send Solr a query term with a “:” in it? Or a “(”?

It turns out that generally you just put a \ in front of the character you want to escape. So to search for “:” in the “funnychars” field, you would send q=funnychars:\:.
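As a rough sketch of what that looks like from Python 3 (the use of `urllib.parse.urlencode` here is an assumption for illustration; a real Solr client library may build the request differently), note that the backslash is part of the query value and gets URL-encoded along with everything else:

```python
from urllib.parse import urlencode

# "funnychars" is the example field from the text; the term is an escaped colon.
field = "funnychars"
escaped_term = "\\:"  # a backslash followed by a colon

# URL-encode the q parameter the way an HTTP client typically would.
query_string = urlencode({"q": field + ":" + escaped_term})
print(query_string)  # q=funnychars%3A%5C%3A
```

Solr decodes the parameter back to `funnychars:\:` before parsing the query.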

PHP programmer Mats Lindh has solved this pretty well using str_replace, a convenient, general-purpose string replacement function that lets you do batch string replacement. For example, you can do:

$matches = array("Mary","lamb","fleece");
$replacements = array("Harry","dog","fur");
str_replace($matches, $replacements,"Mary had a little lamb, its fleece was white as snow");

Python doesn’t quite have str_replace. There is translate, which (at least on Python 2 byte strings) does single-character-to-single-character batch replacement. That can’t be used for escaping because the destination values are strings (i.e. “\:”), not single characters. There’s a general-purpose “str_replace” drop-in replacement at this Stack Overflow question:

s = "Mary had a little lamb, its fleece was white as snow"
edits = [("Mary", "Harry"), ("lamb", "dog"), ("fleece", "fur")]  # etc.
for search, replace in edits:
    s = s.replace(search, replace)

You’ll notice that this algorithm requires multiple passes through the string, one per search/replace pair. This is because earlier search/replaces may impact later ones. For example, what if edits was this:

edits = [("Mary", "Harry"), ("Harry", "Larry"), ("Larry", "Barry")]

First our str_replace will replace Mary with Harry in pass 1, then Harry with Larry in pass 2, etc.
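A small sketch makes the chaining visible. The `str_replace` helper name below is just a wrapper around the loop from the Stack Overflow answer, introduced here for illustration:

```python
def str_replace(s, edits):
    """Apply each (search, replace) pair in order, one pass per pair."""
    for search, replace in edits:
        s = s.replace(search, replace)
    return s

edits = [("Mary", "Harry"), ("Harry", "Larry"), ("Larry", "Barry")]
result = str_replace("Mary had a little lamb", edits)
print(result)  # Barry had a little lamb
```

Each pass sees the output of the previous one, so Mary ends up as Barry rather than Harry.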

It turns out that escaping characters is a narrower string replacement case that can be done more efficiently without too much complication. The only rule that can interact with the others is the one that escapes \ itself: the other rules insert \ characters, and we wouldn’t want those to be escaped a second time.
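To see why the order matters for \, here is a minimal sketch with just two rules (the function names are made up for this illustration):

```python
def escape_colon_then_backslash(term):
    # Wrong order: the backslash inserted for ':' gets escaped again.
    term = term.replace(':', r'\:')
    return term.replace('\\', r'\\')

def escape_backslash_then_colon(term):
    # Right order: escape existing backslashes before inserting new ones.
    term = term.replace('\\', r'\\')
    return term.replace(':', r'\:')

print(escape_colon_then_backslash('a:b'))  # a\\:b  (double escaped)
print(escape_backslash_then_colon('a:b'))  # a\:b
```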

Aside from this caveat, all the escaping rules can be processed in a single pass through the string, which is what my solution below does, and it performs a heck of a lot faster:

# These rules are all independent; the order of
# escaping doesn't matter
escapeRules = {'+': r'\+',
               '-': r'\-',
               '&': r'\&',
               '|': r'\|',
               '!': r'\!',
               '(': r'\(',
               ')': r'\)',
               '{': r'\{',
               '}': r'\}',
               '[': r'\[',
               ']': r'\]',
               '^': r'\^',
               '~': r'\~',
               '*': r'\*',
               '?': r'\?',
               ':': r'\:',
               '"': r'\"',
               ';': r'\;',
               ' ': r'\ '}

def escapedSeq(term):
    """ Yield the next string based on the
        next character (either this char
        or escaped version) """
    for char in term:
        if char in escapeRules:
            yield escapeRules[char]
        else:
            yield char

def escapeSolrArg(term):
    """ Apply escaping to the passed in query terms
        escaping special characters like : , etc"""
    term = term.replace('\\', r'\\')  # escape \ first
    return "".join([nextStr for nextStr in escapedSeq(term)])
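For comparison, on Python 3 a similar single-pass escape can be sketched with `str.translate`, since `str.maketrans` called with a single dict accepts replacement strings of any length. This is an alternative sketch rather than the code above; the names `SPECIAL_CHARS`, `ESCAPE_TABLE`, and `escape_solr_arg_translate` are made up for illustration. Because every character of the original term is looked up exactly once, the backslash can simply be one more entry in the table:

```python
# Characters to escape, taken from the rules above, plus the backslash itself.
SPECIAL_CHARS = '\\+-&|!(){}[]^~*?:"; '

# With a single dict argument, str.maketrans maps each character to a
# replacement string of any length (Python 3).
ESCAPE_TABLE = str.maketrans({ch: '\\' + ch for ch in SPECIAL_CHARS})

def escape_solr_arg_translate(term):
    """Hypothetical alternative: escape every special character in one pass."""
    return term.translate(ESCAPE_TABLE)

print(escape_solr_arg_translate('title:(foo bar)'))  # title\:\(foo\ bar\)
print(escape_solr_arg_translate('a\\b'))             # a\\b
```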

Aside from being a good general solution to this problem, in some basic benchmarks this turns out to be about 5 orders of magnitude faster than doing it the naive way! Pretty cool, but you’ll probably rarely notice the difference. Nevertheless, it could matter in specialized cases if you are automatically constructing and batching large/gnarly queries that require a lot of work to escape.
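If you want to measure this yourself, a rough `timeit` sketch might look like the following. It assumes “the naive way” means the multi-pass replace loop, uses an arbitrary made-up sample query, and both function names are hypothetical. Timings depend heavily on the Python version and the input, so the ratio you see may differ substantially from the figure quoted above:

```python
import timeit

# Same special characters as the escaping rules in the article.
ESCAPE_RULES = {ch: '\\' + ch for ch in '+-&|!(){}[]^~*?:"; '}

def escape_multi_pass(term):
    """Naive approach (assumed): one str.replace pass per rule."""
    term = term.replace('\\', r'\\')  # escape \ first
    for char, escaped in ESCAPE_RULES.items():
        term = term.replace(char, escaped)
    return term

def escape_single_pass(term):
    """Single pass over the characters, looking each one up in the rules."""
    term = term.replace('\\', r'\\')  # escape \ first
    return "".join(ESCAPE_RULES.get(char, char) for char in term)

sample = 'title:(foo AND "bar baz") +qux^2 ' * 100  # arbitrary test input

for func in (escape_multi_pass, escape_single_pass):
    seconds = timeit.timeit(lambda: func(sample), number=200)
    print(func.__name__, seconds)
```

Both functions should produce identical output, so it is worth asserting that before trusting any timing comparison.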

Anyway, enjoy! I’d love to hear what you think!

Published at DZone with permission of Doug Turnbull, author and DZone MVB. (source)
