Text extraction tools are essential for obtaining the textual content and metadata of computer files for use in a wide variety of applications, including search and natural language processing tools. Techniques and tools for evaluating text extraction tools are largely missing from academia and industry. This talk will focus on recent improvements to Apache Tika’s tika-eval module to help integrators evaluate content extraction at scale. The tika-eval module was initially developed for a single batch mode on a single vm. In recent months, however, Apache Tika has refactored this module to allow for easier scaling within Apache Solr and other large scale processing frameworks. This talk will offer an overview of the techniques used to identify potential extraction problems — including garbled text without ground truth; and the talk will show how the tika-eval module can be used at scale to identify potential problems with content extraction.
Evaluating Content/Text Extraction at Scale with Apache Tika Tim Allison
September 12, 2019