FeatherCast

The voice of The Apache Software Foundation

Evaluating Content/Text Extraction at Scale with Apache Tika - Tim Allison

September 12, 2019
timothyarthur

Text extraction tools are essential for obtaining the textual content and metadata of computer files for use in a wide variety of applications, including search and natural language processing tools. Techniques and tools for evaluating text extraction tools are largely missing from academia and industry. This talk will focus on recent improvements to Apache Tika’s tika-eval module to help integrators evaluate content extraction at scale. The tika-eval module was initially developed to run in a single batch mode on a single VM. In recent months, however, Apache Tika has refactored this module to allow for easier scaling within Apache Solr and other large-scale processing frameworks. This talk will offer an overview of the techniques used to identify potential extraction problems, such as garbled text, when no ground truth is available, and will show how the tika-eval module can be used at scale to identify potential problems with content extraction.
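To give a feel for what spotting garbled text without ground truth can look like, here is a minimal sketch of one common heuristic: measure what fraction of the extracted tokens are very common words in the expected language, and flag documents where that fraction is unusually low. This is an illustration only, not tika-eval's actual implementation; the word list, the function names `common_word_ratio` and `looks_garbled`, and the 0.1 threshold are all assumptions made for the example.

```python
import re

# Hypothetical, tiny stand-in for a per-language list of common words.
# A real evaluation would use a much larger list for each language.
COMMON_WORDS = frozenset({
    "the", "of", "and", "to", "a", "in", "is", "for",
    "that", "with", "on", "this", "be", "are", "as", "it",
})

# Runs of Unicode letters (word characters minus digits and underscore).
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def common_word_ratio(text, common_words=COMMON_WORDS):
    """Return the fraction of alphabetic tokens found in common_words."""
    tokens = [token.lower() for token in TOKEN_PATTERN.findall(text)]
    if not tokens:
        # No alphabetic tokens at all: treat as zero overlap.
        return 0.0
    hits = sum(1 for token in tokens if token in common_words)
    return hits / len(tokens)


def looks_garbled(text, threshold=0.1):
    """Flag text whose common-word ratio falls below an assumed threshold."""
    return common_word_ratio(text) < threshold


if __name__ == "__main__":
    clean = "This talk is for the integrators that are in need of a tool"
    garbled = "xq zzkj qwpzx vbnm"
    print(looks_garbled(clean), looks_garbled(garbled))
```

Because the check needs only the extracted text and a word list, it can be computed independently per document, which is what makes this kind of measure practical to run over a large corpus. Any threshold would need tuning against real data, and empty extractions are flagged here by design.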
