Mitch Pronschinske is the Lead Research Analyst at DZone. Researching and compiling content for DZone's research guides is his primary job. He likes to make his own ringtones, watches cartoons/anime, enjoys card and board games, and plays the accordion. Mitch is a DZone Zone Leader and has posted 2576 posts at DZone.

Apache Tika 1.0 Solidifies Position in Content and Metadata Detection and Analysis

11.09.2011

The 1.0 release of Apache Tika, a collection of Java libraries for the detection and extraction of structured text and metadata, has been five years in the making, according to Chris Mattmann, Apache Tika Vice President and Senior Computer Scientist at NASA Jet Propulsion Laboratory. "From a toolkit perspective," he says, "it's easy to integrate, and provides maximum functionality with little configuration."

Apache Tika began in 2007 as a subproject of the Lucene search library. In 2010 it became a Top Level Project because of its value for search engines and for other applications that manage a variety of files containing text. Tika can be used to extract text and metadata from a variety of sources, including:

  • HTML
  • XML
  • Microsoft Office documents (OLE2 and OOXML)
  • OpenDocument Formats
  • PDF
  • ePub
  • RTF
  • Java class files and archives
  • Compressed and packaged files
  • Outlook and mbox mailboxes 
  • Text associated with audio files
  • Text associated with image and video files


Apache Tika has repeatedly been battle-tested in repositories exceeding 500 million documents. In web crawling, it has proven invaluable for extracting text and metadata from the many formats found on the web. Tika is usable not only through flexible Java interfaces, but also from the command line, through RESTful web services, and from Python, .NET and C++. It also includes a GUI for exploring content interactively.
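
To make the idea of content detection concrete, here is a minimal Java sketch of signature ("magic byte") sniffing, one common way to guess a file's type from its first few bytes. It is not Tika's implementation or API: the class MagicSniffer, its detect method, and the media-type labels it returns are hypothetical choices for illustration, and only a handful of well-known signatures are included.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: guesses a media type from leading bytes. */
final class MagicSniffer {

    // Well-known file signatures, checked in order.
    private static final byte[][] MAGIC = {
        "%PDF-".getBytes(StandardCharsets.US_ASCII),                  // PDF header
        "{\\rtf".getBytes(StandardCharsets.US_ASCII),                 // RTF header
        {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
         (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1},                // OLE2 container
        {'P', 'K', 0x03, 0x04},                                       // ZIP local file header
    };

    // Labels are illustrative assumptions; index i corresponds to MAGIC[i].
    private static final String[] TYPES = {
        "application/pdf",
        "application/rtf",
        "application/x-ole-storage",
        "application/zip",
    };

    /** Returns the label of the first matching signature, or a generic fallback. */
    static String detect(byte[] head) {
        for (int i = 0; i < MAGIC.length; i++) {
            if (startsWith(head, MAGIC[i])) {
                return TYPES[i];
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample)); // prints application/pdf
    }
}
```

A signature check alone is coarse: OOXML, OpenDocument, ePub and JAR files are all ZIP containers, so this sketch reports each of them as plain ZIP, and telling them apart requires looking inside the archive.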

New Features

The highlights of 1.0 include a cleaned-up API, the removal of retrotranslated Java 1.4 support, and improved OSGi integration. Here's the complete list of new features:

  • API: All methods, classes and interfaces that were marked as deprecated in Tika 0.10 have been removed to clean up the API (TIKA-703). You may need to adjust and recompile client code accordingly. The declared OSGi package versions are now 1.0, and will thus not resolve for client bundles that still refer to 0.x versions (TIKA-565).
  • Configuration: The context class loader of the current thread is no longer used as the default for loading configured parser and detector classes. You can still pass an explicit class loader to the configuration mechanism to get the previous behaviour. (TIKA-565).
  • OSGi: The tika-core bundle will now automatically pick up and use any available Parser and Detector services when deployed to an OSGi environment. The tika-parsers bundle provides such services for all the supported file formats for which the upstream parser library is available. If you don't want to track all the parser libraries as separate OSGi bundles, you can use the tika-bundle bundle that packages tika-parsers together with all its upstream dependencies. (TIKA-565).
  • RTF: Hyperlinks in RTF documents are now extracted as an <a href="...">...</a> element (TIKA-632). The RTF parser is also now more robust when a document contains more closing }'s than opening {'s (TIKA-733).
  • MS Word: From Word (.doc) documents, Tika now extracts the optional hyphen as a Unicode zero-width space (U+200B), and the non-breaking hyphen as a Unicode non-breaking hyphen (U+2011). (TIKA-711).
  • Outlook: Tika can now also process attachments in Outlook messages. (TIKA-396).
  • MS Office: Performance of extracting embedded office docs was improved. (TIKA-753).
  • PDF: The PDF parser now extracts paragraphs within each page (TIKA-742) and can now optionally extract text from PDF annotations (TIKA-738). There's also an option to enable (the default) or disable auto-space insertion (TIKA-724).
  • Language detection: Tika can now detect Belarusian, Catalan, Esperanto, Galician, Lithuanian (TIKA-582), Romanian, Slovak, Slovenian, and Ukrainian (TIKA-681).
  • Java: Tika no longer ships retrotranslated Java 1.4 binaries along with the normal ones that work with Java 5 and higher. (TIKA-744).
  • OpenOffice documents: header/footer text is now extracted for text, presentation and spreadsheet documents (TIKA-736).
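
As an illustration of the MS Word item in the list above, the following Java sketch shows the kind of character mapping it describes: an optional hyphen becomes a zero-width space (U+200B) and a non-breaking hyphen becomes U+2011. This is not Tika's code. The class WordHyphens is hypothetical, and the two input control codes (0x1F for the optional hyphen, 0x1E for the non-breaking hyphen) are an assumption about how the binary .doc format stores these characters.

```java
/** Hypothetical helper, not Tika's code: normalizes Word's special hyphen characters. */
final class WordHyphens {

    // Assumed control codes for special hyphens in binary .doc text runs.
    static final char DOC_NON_BREAKING_HYPHEN = '\u001E';
    static final char DOC_OPTIONAL_HYPHEN = '\u001F';

    // Unicode replacements named in the release note.
    static final char UNICODE_NON_BREAKING_HYPHEN = '\u2011';
    static final char UNICODE_ZERO_WIDTH_SPACE = '\u200B';

    /** Returns a copy of the text run with the special hyphens mapped to Unicode. */
    static String normalize(String run) {
        StringBuilder out = new StringBuilder(run.length());
        for (int i = 0; i < run.length(); i++) {
            char c = run.charAt(i);
            switch (c) {
                case DOC_NON_BREAKING_HYPHEN:
                    out.append(UNICODE_NON_BREAKING_HYPHEN);
                    break;
                case DOC_OPTIONAL_HYPHEN:
                    out.append(UNICODE_ZERO_WIDTH_SPACE);
                    break;
                default:
                    out.append(c); // every other character passes through unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "co" + DOC_OPTIONAL_HYPHEN + "operate";
        String clean = normalize(raw);
        // The length is unchanged because each special character maps to one character.
        System.out.println(clean.length() == raw.length());
    }
}
```

Mapping the optional hyphen to a zero-width space keeps the potential line-break position without leaving a visible hyphen in the extracted text, which matters when that text is later indexed or searched.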