Apache Tika detects and extracts metadata and text from a huge range of file formats and types. From Search to Big Data, single file to internet scale, if you’ve got files, Tika can help you get out useful information!
Apache Tika has been around for nearly 10 years now, and with the passage of all that time, plus the new 2.0 release, a lot has changed. Not only has there been a huge increase in the number of supported formats, but the ways of using Tika have expanded, and some of the philosophies on the best way to handle things have altered with experience. Tika has also gained support for a wide range of programming languages and, more recently, for working at Big Data scale.
Whether you’re an old hand with Tika looking to know what’s hot or different with 2.0, or someone new looking to learn more about the power of Tika, this talk will have something in it for you!
What’s With the 1s and 0s? Making Sense of Binary Data at Scale with Tika and Friends – Nick Burch
Large amount of unknown data seeks helpful tools to identify itself and generate content!
With one or two files, you can take time to manually identify them, and get out their contents. With thousands of files, or an internet’s worth, this won’t scale, even with mechanical turks! Luckily, there are open source tools and programs out there to help.
First we’ll look at how we can work out what a given blob of 1s and 0s actually is, be it textual or binary. We’ll then see how to extract common metadata from it, along with text, embedded resources, images, and maybe even the kitchen sink! We’ll see how Apache Tika can do all of this for you, along with alternate and additional tools. Finally, we’ll look at how to roll this all out on a Big Data scale.
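To give a flavour of that first step, here is a minimal sketch of magic-number sniffing in plain Python. It is not Tika's implementation: the signature table, the `sniff_type` name and the 512-byte sample size are all assumptions made for illustration, and a real detector uses a far larger and more careful rule set.

```python
import codecs

# Hypothetical, deliberately tiny signature table (illustration only).
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff_type(blob: bytes) -> str:
    """Guess a media type from the leading bytes of a blob."""
    # 1. Known binary formats usually announce themselves in the first few bytes.
    for magic, media_type in MAGIC_SIGNATURES:
        if blob.startswith(magic):
            return media_type

    # 2. Otherwise, decide between "looks like text" and "unknown binary".
    sample = blob[:512]  # assumed sample size
    if not sample or b"\x00" in sample:
        return "application/octet-stream"
    try:
        # final=False tolerates a multi-byte character cut off at the sample edge.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return "application/octet-stream"
    return "text/plain"
```

Note that a ZIP signature only narrows things down: many document formats are ZIP containers, so a real detector has to look inside the archive before it can name the actual format.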