Uwe is committer and PMC member of Apache Lucene and Solr. His main focus is on development of Lucene Java. He implemented fast numerical search and is maintaining the new attribute-based text analysis API. He studied Physics at the University of Erlangen-Nuremberg and works as managing director for SD DataSolutions GmbH in Bremen, Germany, a company that provides consulting and support for Apache Lucene and Solr. Uwe is a DZone MVB and is not an employee of DZone.

The Policeman's Horror: Default Locales, Default Charsets, and Default Timezones

Time for a tool to prevent any effects coming from them!

Have you ever tried to run software downloaded from the net on a computer with the Turkish locale? Most of you probably never have. And if you ask Turkish IT specialists, they will tell you: “It is better to configure your computer using any other locale, but not tr_TR”. If you have no clue what I am talking about, maybe this article gives you a hint: “A Cellphone’s Missing Dot Kills Two People, Puts Three More in Jail”.

What you see in lots of software is so-called case-insensitive matching of keywords like parameter names or function names. In most cases this is implemented by lowercasing or uppercasing the input text and comparing it with a list of already lowercased/uppercased items. This works fine in most cases, wherever you are in the world, except Turkey! Because most programmers don’t care about running their software in Turkey, they do not test their software under the Turkish locale.

But what happens with the case-insensitive matching if running in Turkey? Let’s take an example:

A user enters “BILLY” in the search field of your application. The application then uses the approach presented before, lowercases “BILLY”, and compares it to an internal table (e.g. our search index, parameter table, function table,...). So we search in this table for “billy”. So far so good, this works perfectly in the USA, Germany, Kenya, almost everywhere, except Turkey. What happens in the Turkish locale when we lowercase “BILLY”? After reading the above article, you might expect it: The "BILLY".toLowerCase() statement in Java returns “bılly” (note the dotless i: 'ı' U+0131). You can try this out on your local machine without reconfiguring it to use the Turkish locale, just try the following Java code:
assertEquals("bılly", "BILLY".toLowerCase(new Locale("tr", "TR")));
The same happens vice versa: if you uppercase an ‘i’, it becomes an I with a dot (‘İ’ U+0130). This is really serious: millions of lines of code out there, in Java and other languages, ignore the fact that the String.toLowerCase() and String.toUpperCase() methods can optionally take a defined Locale (more about that later). Some examples from projects I am involved in:

  • Try to run an XSLT stylesheet using Apache XALAN-XSLTC (or Java 5’s internal XSLT interpreter) in the Turkish locale. It will fail with “unknown instruction”, because XALAN-XSLTC compiles the XSLT to Java Bytecode and somehow lowercases a virtual machine opcode before compiling it with BCEL (see XALANJ-2420, BCEL bug #38787).
  • The HTML SAX parser NekoHTML uses locale-less uppercasing/lowercasing to normalize charset names and element names. I opened a bug report (issue #3544334).
  • If you use PHP as your favourite scripting language, which is not case sensitive for class names and other language constructs, it will throw a compile error once you try to call a function with an “i” in it (see PHP bug #18556). Unfortunately it is unlikely that this serious bug will be fixed in PHP 5.3 or 5.4!
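
The “BILLY” lookup failure described earlier can be reproduced in a few lines. This is only a minimal sketch, assuming Java 7 or later (for Locale.forLanguageTag); the class name TurkishLocaleDemo and its one-entry keyword table are made up for illustration and are not taken from any of the projects listed above:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class TurkishLocaleDemo {
    // Hypothetical keyword table, stored in lowercase as in the scenario above.
    static final Set<String> KEYWORDS = new HashSet<>();
    static {
        KEYWORDS.add("billy");
    }

    // Mimics the naive lookup: lowercases the input with the given locale
    // (in real code this would silently be the JVM's default locale).
    static boolean isKnown(String input, Locale locale) {
        return KEYWORDS.contains(input.toLowerCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        System.out.println(isKnown("BILLY", Locale.US)); // true
        System.out.println(isKnown("BILLY", turkish));   // false, the key became "b\u0131lly"
        // Uppercasing 'i' under the Turkish locale yields U+0130 (I with dot).
        System.out.println("i".toUpperCase(turkish).equals("\u0130")); // true
    }
}
```

The non-ASCII characters are written as \u escapes on purpose, so the example does not depend on the encoding the compiler assumes for the source file.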

The question is now: How to solve this?

The most correct way is not to lowercase at all! For case-insensitive comparison, Unicode defines “case folding”, a canonical form of text in which the upper/lower case distinction of every character is normalized away. Case-folded text may no longer be readable (this depends on the implementation, but in most cases it still is); it only guarantees that two case-folded strings can be compared with each other in a case-insensitive way. Unfortunately Java does not offer a function to produce this string, but ICU4J does (see UCharacter#foldCase). Java does offer something very convenient, though: String.equalsIgnoreCase(String), which compares the strings character by character, independent of the default locale!

But in lots of cases you cannot use this fantastic method, because you want to look up such strings in a HashMap or other dictionary. Without modifying HashMap to use equalsIgnoreCase, this would never work. So we are back at lowercasing! As mentioned before, you can pass a locale to String.toLowerCase(), so the naive approach would be to tell Java that we are in the US or using the English language: String.toLowerCase(Locale.US) or String.toLowerCase(Locale.ENGLISH). Both produce identical results, but this is still not consistent. What happens if the US government decides to lowercase/uppercase like in Turkey? OK, so don’t use Locale.US (it is also too US-centric). Locale.ENGLISH is fine and very generic, but languages also change over the years (who knows?), and we want to be language-invariant! If you are using Java 6, there is a much better constant: Locale.ROOT. You should use this constant for our lowercase example: String.toLowerCase(Locale.ROOT).
You should start now and do a global search/replace on all your Java projects (if you do not rely on language specific presentation of text)! REALLY!
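
To make the recommendation concrete, here is a small sketch of a case-insensitive lookup table that normalizes its keys with Locale.ROOT. The KeywordTable class and its methods are hypothetical names chosen for this example, and Java 7 or later is assumed:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class KeywordTable {
    private final Map<String, Integer> table = new HashMap<>();

    // Normalizes keys without consulting the JVM's default locale.
    private static String normalize(String key) {
        return key.toLowerCase(Locale.ROOT);
    }

    public void put(String key, int value) {
        table.put(normalize(key), value);
    }

    public Integer get(String key) {
        return table.get(normalize(key));
    }

    public static void main(String[] args) {
        // Simulate a machine configured with the Turkish locale.
        Locale.setDefault(Locale.forLanguageTag("tr-TR"));
        KeywordTable keywords = new KeywordTable();
        keywords.put("billy", 1);
        System.out.println(keywords.get("BILLY"));             // 1
        System.out.println("BILLY".equalsIgnoreCase("billy")); // true
    }
}
```

Because both put and get go through the same normalize method, the table behaves identically no matter which default locale the JVM was started with.
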
String.toLowerCase is not the only example of “automatic default locale usage” in the Java API. There are also things like transforming dates or numbers to strings. If you use the Formatter class and run it in another country, String.format("%f", 15.5f) may not always use a period (‘.’) as decimal separator; most Germans will know this. Passing a specific locale here helps in most cases. If you are writing a GUI in the English language, pass Locale.ENGLISH everywhere, otherwise the text output of numbers or dates may not match the language of your GUI! If you want Formatter to behave in an invariant way, use Locale.ROOT, too (then it will for sure format numbers with a period and no comma for thousands, just like Float.toString(float) does).
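
The decimal separator effect is easy to see by passing the locale explicitly. A minimal sketch; the class name FormatLocaleDemo and its helper method are invented for this example:

```java
import java.util.Locale;

public class FormatLocaleDemo {
    // Formats the value with "%f" using an explicit locale instead of the default one.
    static String format(Locale locale, float value) {
        return String.format(locale, "%f", value);
    }

    public static void main(String[] args) {
        System.out.println(format(Locale.GERMANY, 15.5f)); // 15,500000
        System.out.println(format(Locale.ROOT, 15.5f));    // 15.500000
        System.out.println(Float.toString(15.5f));         // 15.5
    }
}
```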

A second problem affecting lots of software comes from two other system-wide configurable defaults: the default charset/encoding and the default timezone. If you open a text file with FileReader or convert an InputStream to a Reader with InputStreamReader, Java automatically assumes that the input is in the default platform encoding. This may be fine if you want the text to be parsed using the defaults of the operating system, but if you ship a text file together with your software package (maybe as a resource in your JAR file) and then accidentally read it using the platform’s default charset... it’ll break your app! So my second recommendation:
Always pass a character set to any method converting bytes to strings (like InputStream <=> Reader, String.getBytes(),...). If you wrote the text file and ship it together with your app, only you know its encoding!
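
As an illustration of this recommendation, the following sketch decodes the same bytes once with the charset they were written in and once with a different one. It assumes Java 8 or later (for StandardCharsets and UncheckedIOException); the class name CharsetDemo and the sample text are made up:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Reads the first line of the stream, decoding the bytes with an explicit charset.
    static String firstLine(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // German "Gruesse" with u-umlaut and sharp s, encoded as UTF-8 (7 bytes).
        byte[] utf8 = "Gr\u00fc\u00dfe".getBytes(StandardCharsets.UTF_8);

        String right = firstLine(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        String wrong = firstLine(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 5
        System.out.println(wrong.length()); // 7, each UTF-8 byte became its own character
    }
}
```

The second call stands in for what happens when the platform default is not the encoding of the shipped file: no exception is thrown, the text is just silently wrong.
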
For timezones, similar examples can be found.
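
One such example, as a minimal sketch: formatting the same instant gives different text depending on the timezone, so the timezone should be set explicitly instead of being inherited from the machine. The class name TimeZoneDemo and the date pattern are chosen only for illustration:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class TimeZoneDemo {
    // Formats the date with an explicit locale and an explicit timezone.
    static String format(Date date, TimeZone zone) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm", Locale.ROOT);
        fmt.setTimeZone(zone);
        return fmt.format(date);
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L);
        System.out.println(format(epoch, TimeZone.getTimeZone("UTC")));           // 1970-01-01 00:00
        System.out.println(format(epoch, TimeZone.getTimeZone("Europe/Berlin"))); // 1970-01-01 01:00
    }
}
```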

How this affects Apache Lucene!

Apache Lucene is a full-text search engine and deals with text from different languages all the time; Apache Solr is an enterprise search server on top of Lucene and deals with input documents in lots of different charsets and languages. It is therefore essential for a search library like Lucene to be as independent of local machine settings as possible. A library must make explicit what input it wants to have. So we require charsets and locales in all public and private APIs (or we only take e.g. java.io.Reader instead of InputStream if we expect text coming in), so the user must take care.

Robert Muir and I reviewed the source code of Apache Lucene and Solr for the coming version 4.0 (an alpha version is already available on Lucene’s homepage, documentation is here). We have done this quite often, but whenever a new piece of code is committed to the source tree, it may happen that undefined locales, charsets, or similar things appear again. In most cases it is not the fault of the committer; it happens because auto-complete in the IDE automatically lists possible methods and parameters to the developer, and often you select the easiest variant (like String.toLowerCase()).

Using default locales, charsets and timezones is in my opinion a big design issue in programming languages like Java. If there are locale-sensitive methods, those methods should take a locale; if you convert a byte[] stream to a char[] stream, a charset must be given. Automatically falling back to defaults is a no-go in the server environment.
If a developer is interested in using the default locale of the user’s computer, he can always explicitly give the locale or charset. In our example this would be String.toLowerCase(Locale.getDefault()). This is more verbose, but it is obvious what the developer intends to do.

My proposal is to ban all those default charset and locale methods / classes in the Java API by deprecating them as soon as possible, so users stop using them implicitly!

Robert’s and my intention is to automatically fail the nightly builds (or compilation on the developer’s machine) when somebody uses one of the above methods in Lucene’s or Solr’s source code. We looked at different solutions like PMD or FindBugs, but both tools are too sloppy to handle that in a consistent way (PMD does not have any “default charset” method detection and FindBugs has only a very short list of method signatures). In addition, both PMD and FindBugs are very slow and often fail to correctly detect all problems. For Lucene builds we only need a tool that looks into the bytecode of all generated Java classes of Apache Lucene and Solr and fails the build if any signature that violates our requirements is found.

A new Tool for the Policeman

I started to hack a tool as a custom ANT task using ASM 4.0 (Lightweight Java Bytecode Manipulation Framework). The idea was to provide a list of method signatures, field names and plain class names that should fail the build once bytecode accesses them in any way. A first version of this task was published in issue LUCENE-4199; later improvements added support for fields (LUCENE-4202) and a sophisticated signature expansion to also catch calls to subclasses of the given signatures (LUCENE-4206).

In the meantime, Robert worked on the list of “forbidden” APIs. This is what came out in the first version:
Using this easily extensible list, saved in a text file (UTF-8 encoded!), you can invoke my new ANT task (after registering it with <taskdef/>) very easily. This is taken from Lucene/Solr’s build.xml:
<taskdef resource="lucene-solr.antlib.xml">
  <classpath>
    <pathelement location="${custom-tasks.dir}/build/classes/java" />
    <fileset dir="${custom-tasks.dir}/lib" includes="asm-debug-all-4.0.jar" />
  </classpath>
</taskdef>

<!-- the task's element name is missing from this listing; "forbidden-apis" is assumed here -->
<forbidden-apis>
  <classpath refid="additional.dependencies"/>
  <apiFileSet dir="${custom-tasks.dir}/forbiddenApis">
    <include name="jdk.txt" />
    <include name="jdk-deprecated.txt" />
    <include name="commons-io.txt" />
  </apiFileSet>
  <fileset dir="${basedir}/build" includes="**/*.class" />
</forbidden-apis>
The classpath given is used to look up the API signatures (provided as apiFileSet). Classpath is only needed if signatures are coming from 3rd party libraries. The inner fileset should list all class files to be checked. For running the task you also need asm-all-4.0.jar available in the task’s classpath.

If you are interested, take the source code, it is open source and released as part of the tool set shipped with Apache Lucene & Solr: Source, API lists (revision number 1360240).

At the moment we are investigating other opportunities brought by that tool:
  • We want to ban System.out/err or things like horrible Eclipse-like try...catch...printStackTrace() auto-generated Exception stubs. We can just ban those fields from the java.lang.System class and of course, Throwable#printStackTrace().
  • Using optimized Lucene-provided replacements for JDK API calls. This can be enforced by failing on the JDK signatures.
  • Failing the build on deprecated calls to Java’s API. We can of course print warnings for deprecations, but failing the build is better. And: We use deprecation annotations in Lucene’s own library code, so javac-generated warnings don’t help. We can use the list of deprecated stuff from JDK Javadocs to trigger the failures.
I hope other projects take a similar approach to scan their binary/source code and free it from system-dependent API calls, which are not predictable for production systems in the server environment.

Thanks to Robert Muir and Dawid Weiss for help and suggestions!
Published at DZone with permission of Uwe Schindler, author and DZone MVB. (source)