Apache Big Data Seville 2016 – What’s With the 1s and 0s? Making Sense of Binary Data at Scale with Tika and Friends – Nick Burch

A large amount of unknown data seeks helpful tools to identify itself and generate content!

With one or two files, you can take the time to identify them manually and extract their contents. With thousands of files, or the internet’s worth, this won’t scale, even with mechanical turks! Luckily, there are open source tools and programs out there to help.

First we’ll look at how we can work out what a given blob of 1s and 0s actually is, be it textual or binary. We’ll then see how to extract common metadata from it, along with text, embedded resources, images, and maybe even the kitchen sink! We’ll see how Apache Tika can do all of this for you, along with alternate and additional tools. Finally, we’ll look at how to roll this all out on a Big Data scale.
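To give a flavour of the first step, here is a minimal sketch of identifying a blob by its leading "magic" bytes. It is not how Tika itself is implemented, and it is not taken from the talk: the `SIGNATURES` table and the `sniff` function are hypothetical names made up for illustration, the table is deliberately tiny, and the text fallback is a crude heuristic.

```python
import codecs

# Hypothetical, deliberately tiny signature table: (magic bytes, media type).
# Real detectors carry far larger tables and also consider file names,
# offsets other than zero, and container contents.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff(data: bytes) -> str:
    """Guess a media type from the first bytes of a blob (illustrative only)."""
    if not data:
        return "application/octet-stream"

    # Step 1: compare the start of the blob against known binary signatures.
    for magic, media_type in SIGNATURES:
        if data.startswith(magic):
            return media_type

    # Step 2: crude text check on the first 512 bytes. An incremental decoder
    # is used so a multi-byte character cut off at the 512-byte boundary
    # does not count as invalid.
    head = data[:512]
    if b"\x00" in head:
        return "application/octet-stream"
    try:
        codecs.getincrementaldecoder("utf-8")().decode(head)
    except UnicodeDecodeError:
        return "application/octet-stream"
    return "text/plain"
```

For example, `sniff(b"%PDF-1.4\n")` yields `application/pdf`, while plain ASCII bytes fall through to `text/plain`. The ZIP entry hints at why first bytes alone are not enough: formats such as .docx are themselves ZIP containers, so a serious detector has to look inside the container as well.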

More information about this talk
