What 50 Years of Data Science Leaves Out – Sean Owen
(Recording is incomplete.)
We’re told “data science” is the key to unlocking the value in big data, but nobody seems to agree on just what it is: engineering, statistics, or both? David Donoho’s paper “50 Years of Data Science” offers one of the best criticisms of the hype around data science from a statistics perspective, and proposes that data science is not new, if it’s anything at all. This talk will examine these points and respond with an engineer’s counterpoints, in search of a better understanding of data science.
Scio, a Scala DSL for Apache Beam – Robert Gruener
Learn about Scio, a Scala DSL for Apache Beam. Beam introduces a simple, unified programming model for both batch and streaming data processing, while Scio brings it much closer to the high-level API many data engineers are familiar with. We will cover the design and implementation of the framework, including features like type-safe BigQuery and a REPL. There will also be a live coding demo.
Introduction to Apache Beam – Jean-Baptiste Onofré & Dan Halperin
Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines. The same Beam pipelines work in batch and streaming, and on a variety of open source and private cloud big data processing backends including Apache Flink, Apache Spark, and Google Cloud Dataflow. This talk will introduce Apache Beam’s programming model and mechanisms for efficient execution. The speakers will show how to build Beam pipelines and demonstrate how to execute the same code across different runners.
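To make the "same pipeline, batch and streaming" idea concrete, here is a minimal sketch in plain Python. It does not use the Beam SDK: the names `build_pipeline`, `run_batch`, and `run_streaming` are hypothetical, and the two "runners" are toy stand-ins that assume events arrive in timestamp order. It only illustrates that the transforms are defined once and reused by both execution modes.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# In this sketch a "transform" is just a function from one iterable to another.
Transform = Callable[[Iterable], Iterable]


def split_words(lines: Iterable[str]) -> Iterator[str]:
    """Split each input line into lowercase words."""
    for line in lines:
        yield from line.lower().split()


def count_per_element(words: Iterable[str]) -> Iterable[Tuple[str, int]]:
    """Count how often each word occurs."""
    return Counter(words).items()


def build_pipeline() -> List[Transform]:
    """Define the word-count pipeline once, independent of how it is run."""
    return [split_words, count_per_element]


def run_batch(pipeline: List[Transform], lines: Iterable[str]) -> Dict[str, int]:
    """Toy batch runner: apply every transform to a bounded input."""
    data: Iterable = lines
    for transform in pipeline:
        data = transform(data)
    return dict(data)


def run_streaming(
    pipeline: List[Transform],
    events: Iterable[Tuple[int, str]],
    window_size: int,
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Toy streaming runner: group (timestamp, line) events into fixed
    windows and emit one result per window as soon as the window closes.
    Assumes events arrive in timestamp order."""
    current_window, buffer = None, []
    for timestamp, line in events:
        window = timestamp // window_size
        if current_window is not None and window != current_window:
            yield current_window, run_batch(pipeline, buffer)
            buffer = []
        current_window = window
        buffer.append(line)
    if buffer:
        yield current_window, run_batch(pipeline, buffer)


if __name__ == "__main__":
    pipeline = build_pipeline()
    print(run_batch(pipeline, ["to be or not", "to be"]))
    events = [(0, "to be"), (3, "or not"), (12, "to be")]
    for window, counts in run_streaming(pipeline, events, window_size=10):
        print(window, counts)
```

In Beam itself, windowing, out-of-order data, and runner selection are handled by the model and the chosen runner rather than by hand-written loops like these.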
Apache Parquet is among the most commonly used column-oriented data formats in the big data processing space. It uses various techniques to store data in a CPU- and I/O-efficient way. Furthermore, it can push analytical queries down to the I/O layer to avoid loading irrelevant data chunks. With several Java implementations and a C++ implementation, Parquet is also a good choice for exchanging data between different technology stacks.
As part of this talk, a general introduction to the format and its techniques will be given. Their benefits and some of the inner workings will be explained to give a better understanding of how Parquet achieves its performance. At the end, benchmarks comparing the new C++ and Python implementation with other formats will be shown.
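The push-down idea mentioned above can be sketched in a few lines of plain Python. This is not the Parquet file format or any Parquet library API: `RowGroup`, `write_row_groups`, and `scan_greater_than` are hypothetical names, and the sketch assumes integer columns only. It shows the general technique of storing values column by column in chunks, recording min/max statistics per chunk at write time, and using those statistics to skip chunks that cannot match a filter.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """A chunk of rows stored column by column, with per-column statistics."""
    columns: Dict[str, List[int]]        # column name -> contiguous values
    stats: Dict[str, Tuple[int, int]]    # column name -> (min, max)


def write_row_groups(rows: List[Dict[str, int]], rows_per_group: int) -> List[RowGroup]:
    """Split rows into groups and pivot each group into columnar layout."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        # Statistics are computed once, at write time.
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append(RowGroup(columns, stats))
    return groups


def scan_greater_than(
    groups: List[RowGroup],
    filter_column: str,
    threshold: int,
    project_column: str,
) -> Tuple[List[int], int]:
    """Return project_column values where filter_column > threshold,
    together with the number of row groups that actually had to be read."""
    result, groups_read = [], 0
    for group in groups:
        _, maximum = group.stats[filter_column]
        if maximum <= threshold:
            continue  # no row in this group can match, so skip it entirely
        groups_read += 1
        # Only the two columns involved are touched; other columns stay unread.
        filter_values = group.columns[filter_column]
        project_values = group.columns[project_column]
        result.extend(p for f, p in zip(filter_values, project_values) if f > threshold)
    return result, groups_read


if __name__ == "__main__":
    rows = [{"id": i, "price": i * 10} for i in range(6)]
    groups = write_row_groups(rows, rows_per_group=2)
    values, groups_read = scan_greater_than(groups, "id", 3, "price")
    print(values, groups_read)  # [40, 50] 1
```

With six rows in groups of two, only the last group is read for the filter `id > 3`; the first two are skipped based on their stored maxima alone.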
Apache Kudu: A Distributed, Columnar Data Store for Fast Analytics – Mike Percy
The Hadoop ecosystem has recently made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems like Apache Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems like Apache HBase, applications can achieve millisecond-scale random access to arbitrarily-sized datasets. However, gaps remain when scans and random access are both required.
This talk will investigate the trade-offs between real-time random access and fast analytic performance from the perspective of storage engine internals. It will also describe Apache Kudu, a new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark, which fills the gap described above by providing fast scans and fast random access from a single API.
Shared Memory Layer and Faster SQL for Spark Applications – Dmitriy Setrakyan
In this presentation we will talk about the need to share state in memory across different Spark jobs or applications, and about Apache Ignite as the technology that makes it possible. We will dive into the importance of in-memory file systems and shared in-memory RDDs with Apache Ignite, as well as the need to index data in memory for fast SQL execution. We will also present a hands-on demo showing the advantages and disadvantages of one approach over another. Finally, we will discuss the requirements for storing data off-heap in order to achieve large horizontal and vertical scale for applications using Spark and Ignite.