FOSDEM 2017 interview with Holden Karau from the Apache Spark community.
February 4, 2017
A short interview about Apache Mesos from FOSDEM 2017
January 24, 2017
What 50 Years of Data Science Leaves Out – Sean Owen
(Recording is incomplete.)
We’re told “data science” is the key to unlocking the value in big data, but nobody seems to agree on just what it is — engineering, statistics, both? David Donoho’s paper “50 Years of Data Science” offers one of the best criticisms of the hype around data science from a statistics perspective, and proposes that data science is not new, if it’s anything at all. This talk will examine these points, and respond with an engineer’s counterpoints, in search of a better understanding of data science.
January 24, 2017
Scio, a Scala DSL for Apache Beam – Robert Gruener
Learn about Scio, a Scala DSL for Apache Beam. Beam introduces a simple, unified programming model for both batch and streaming data processing, while Scio brings it much closer to the high-level API many data engineers are familiar with. We will cover the design and implementation of the framework, including features like type-safe BigQuery and a REPL. There will also be a live coding demo.
January 24, 2017
Apache CouchDB 2.0 Sync Deep Dive – Jan Lehnardt
This talk takes a deep dive below the magic and explains how to build robust sync systems, whether you want to use CouchDB or build your own.
The talk will go through the components of a successful data sync system and the trade-offs you can make to solve your particular problems.
Reliable data sync, from Big Data to Mobile.
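One core ingredient of such a sync system can be sketched in a few lines of plain Python: a per-database sequence number, a changes feed, and a checkpoint so replication can resume after interruption. This is a conceptual model with illustrative names, not CouchDB’s actual replication protocol (which also handles revision trees and conflicts):

```python
# Conceptual sketch of one-way, checkpointed sync (names are illustrative,
# not CouchDB's real API). Every write bumps a sequence number; a replicator
# reads the changes feed from its last checkpoint and resumes after failure.

class Database:
    def __init__(self):
        self.docs = {}       # doc_id -> document body
        self.changes = []    # ordered log of (seq, doc_id)
        self.seq = 0

    def put(self, doc_id, body):
        self.seq += 1
        self.docs[doc_id] = body
        self.changes.append((self.seq, doc_id))

    def changes_since(self, since):
        return [(s, d) for (s, d) in self.changes if s > since]


def replicate(source, target, checkpoint=0):
    """Copy all changes after `checkpoint`; return the new checkpoint."""
    for seq, doc_id in source.changes_since(checkpoint):
        target.put(doc_id, source.docs[doc_id])
        checkpoint = seq
    return checkpoint
```

An interrupted replication simply restarts from the stored checkpoint rather than rescanning the whole database — one of the trade-offs the talk covers.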
January 24, 2017
Introduction to Apache Beam – Jean-Baptiste Onofré & Dan Halperin
Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines. The same Beam pipelines work in batch and streaming, and on a variety of open source and private cloud big data processing backends including Apache Flink, Apache Spark, and Google Cloud Dataflow. This talk will introduce Apache Beam’s programming model and mechanisms for efficient execution. The speakers will show how to build Beam pipelines, and demo how to use it to execute the same code across different runners.
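The separation described above — one declarative pipeline, many execution backends — can be illustrated with a toy model in plain Python (this is not Beam’s actual API; it only sketches the idea that the runner, not the pipeline, decides how work is executed):

```python
# Toy illustration (not Beam's real API) of the unified-model idea: a
# pipeline is a declarative chain of transforms, and the *runner* decides
# how to execute it -- eagerly in-process here, but the same chain could in
# principle be handed to a distributed backend like Flink, Spark, or Dataflow.

def pipeline(transforms, runner):
    """Build a pipeline from a transform chain and hand it to a runner."""
    return runner(transforms)


def direct_runner(transforms):
    """A trivial 'runner' that applies each transform in-process."""
    def run(data):
        for t in transforms:
            data = t(data)
        return list(data)
    return run


# The same transform chain stays unchanged no matter which runner executes it.
word_pairs = pipeline(
    [lambda lines: (w for line in lines for w in line.split()),
     lambda words: [(w, 1) for w in words]],
    runner=direct_runner,
)
```

Swapping `direct_runner` for a distributed one is the portability the talk demos: the pipeline code never changes, only the execution engine.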
January 23, 2017
Parquet Format in Practice & Detail – Uwe L. Korn
Apache Parquet is among the most commonly used column-oriented data formats in the big data processing space. It leverages various techniques to store data in a CPU- and I/O-efficient way. Furthermore, it can push analytical queries down to the I/O layer to avoid loading irrelevant data chunks. With implementations in Java and C++, Parquet is also a natural choice for exchanging data between different technology stacks.
As part of this talk, a general introduction to the format and its techniques will be given. Their benefits and some of the inner workings will be explained to give a better understanding of how Parquet achieves its performance. At the end, benchmarks comparing the new C++ & Python implementation with other formats will be shown.
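The push-down idea mentioned above can be sketched in plain Python: store each column in chunks that carry min/max statistics, and skip any chunk whose statistics prove it cannot satisfy the predicate. This is a conceptual model only; Parquet’s real format layers encodings, compression, dictionaries, and page indexes on top of it:

```python
# Conceptual model of columnar chunk statistics and predicate push-down
# (illustrative only; not Parquet's actual on-disk layout).

def make_chunks(values, chunk_size):
    """Split a column into chunks, each carrying min/max statistics."""
    chunks = []
    for i in range(0, len(values), chunk_size):
        part = values[i:i + chunk_size]
        chunks.append({"min": min(part), "max": max(part), "values": part})
    return chunks


def scan_greater_than(chunks, threshold):
    """Return matching values, skipping chunks the statistics rule out."""
    hits, chunks_read = [], 0
    for chunk in chunks:
        if chunk["max"] <= threshold:
            continue                      # whole chunk pruned via stats alone
        chunks_read += 1
        hits.extend(v for v in chunk["values"] if v > threshold)
    return hits, chunks_read
```

Because the statistics live with the chunk metadata, pruning happens before any values are decoded — that is what “push-down to the I/O layer” buys.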
January 23, 2017
Apache Kudu: A Distributed, Columnar Data Store for Fast Analytics – Mike Percy
The Hadoop ecosystem has recently made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems like Apache Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems like Apache HBase, applications can achieve millisecond-scale random access to arbitrarily-sized datasets. However, gaps remain when scans and random access are both required.
This talk will investigate the trade-offs between real-time random access and fast analytic performance from the perspective of storage engine internals. It will also describe Apache Kudu, a new addition to the open source Hadoop ecosystem with out-of-the-box Apache Spark integration, which fills the gap described above and provides a new option for achieving fast scans and fast random access from a single API.
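The trade-off at the heart of the talk can be illustrated with a minimal sketch (not Kudu’s engine, which uses a far more sophisticated columnar, log-structured design): keeping rows ordered by primary key gives logarithmic point lookups via binary search and range scans as cheap contiguous slices, in one structure:

```python
import bisect

# Conceptual sketch of serving both access patterns from one layout
# (illustrative only, not Kudu's implementation): sorted-by-key storage
# supports O(log n) random access *and* sequential range scans.

class SortedStore:
    def __init__(self, rows):
        self.rows = sorted(rows)          # (key, value) pairs, sorted by key

    def get(self, key):
        """Point lookup: binary search on the sorted key order."""
        i = bisect.bisect_left(self.rows, (key,))
        if i < len(self.rows) and self.rows[i][0] == key:
            return self.rows[i][1]
        return None

    def scan(self, lo, hi):
        """Range scan [lo, hi): a contiguous slice, read sequentially."""
        i = bisect.bisect_left(self.rows, (lo,))
        j = bisect.bisect_left(self.rows, (hi,))
        return self.rows[i:j]
```

The hard part, which the sketch ignores entirely, is keeping this ordering under concurrent writes — exactly where the storage-engine internals discussed in the talk come in.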
January 23, 2017
Shared Memory Layer and Faster SQL for Spark Applications – Dmitriy Setrakyan
In this presentation we will talk about the need to share state in memory across different Spark jobs or applications, and Apache Ignite as the technology that makes this possible. We will dive into the importance of in-memory file systems, shared in-memory RDDs with Apache Ignite, and the need to index data in memory for fast SQL execution. We will also present a hands-on demo showing the advantages and disadvantages of each approach, and discuss the requirements of storing data off-heap in order to achieve large horizontal and vertical scale for applications using Spark and Ignite.
January 23, 2017
What’s With the 1s and 0s? Making Sense of Binary Data at Scale with Tika and Friends – Nick Burch
Large amount of unknown data seeks helpful tools to identify itself and extract its contents!
With one or two files, you can take time to manually identify them, and get out their contents. With thousands of files, or the internet’s worth, this won’t scale, even with mechanical turks! Luckily, there are open source tools and programs out there to help.
First we’ll look at how we can work out what a given blob of 1s and 0s actually is, be it textual or binary. We’ll then see how to extract common metadata from it, along with text, embedded resources, images, and maybe even the kitchen sink! We’ll see how Apache Tika can do all of this for you, along with alternate and additional tools. Finally, we’ll look at how to roll this all out on a Big Data scale.
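At its simplest, the identification step works by comparing a file’s leading “magic” bytes against known signatures. A hand-rolled sketch in plain Python follows (Tika itself combines magic bytes, filename globs, and content heuristics, and knows vastly more types than this toy table):

```python
# Minimal magic-byte sniffer (conceptual sketch, not Tika's detector).
# Each entry maps a leading byte signature to a MIME type.

MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-",             "application/pdf"),
    (b"PK\x03\x04",        "application/zip"),  # also OOXML, JAR, ODF, ...
    (b"\xff\xd8\xff",      "image/jpeg"),
]


def detect(data: bytes) -> str:
    """Guess a MIME type from leading bytes; fall back to a text check."""
    for sig, mime in MAGIC:
        if data.startswith(sig):
            return mime
    try:
        data[:512].decode("utf-8")        # decodes cleanly -> treat as text
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"
```

The ZIP entry hints at why real detectors need more than magic bytes: a `.docx`, a JAR, and a plain ZIP all start with the same signature, so Tika also peeks inside the container to tell them apart.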