FeatherCast

The voice of The Apache Software Foundation

Apache Big Data Seville 2016 – What 50 Years of Data Science Leaves Out – Sean Owen

January 24, 2017
rbowen

What 50 Years of Data Science Leaves Out – Sean Owen

(Recording is incomplete.)

We’re told “data science” is the key to unlocking the value in big data, but nobody seems to agree on just what it is: engineering, statistics, or both? David Donoho’s paper “50 Years of Data Science” offers one of the best criticisms of the hype around data science from a statistics perspective, and proposes that data science is not new, if it’s anything at all. This talk will examine these points and respond with an engineer’s counterpoints, in search of a better understanding of data science.

More information about this talk

Apache Big Data Seville 2016 – Scio, a Scala DSL for Apache Beam – Robert Gruener

January 24, 2017
rbowen

Scio, a Scala DSL for Apache Beam – Robert Gruener

Learn about Scio, a Scala DSL for Apache Beam. Beam introduces a simple, unified programming model for both batch and streaming data processing, while Scio brings it much closer to the high-level API many data engineers are familiar with. We will cover the design and implementation of the framework, including features like type-safe BigQuery and a REPL. There will also be a live coding demo.
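
For a sense of what that high-level API looks like, here is a minimal word-count sketch in the shape of Scio’s public examples. It is not code from the talk; the argument names and the exact Scio version are assumptions.

```scala
import com.spotify.scio._

// Minimal word count, following the shape of Scio's own published examples.
object WordCount {
  def main(cmdlineArgs: Array[String]): Unit = {
    // ContextAndArgs parses --input=... and --output=... style arguments.
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.textFile(args("input"))
      .flatMap(_.split("""[^a-zA-Z']+""").filter(_.nonEmpty))
      .countByValue
      .map { case (word, count) => s"$word: $count" }
      .saveAsTextFile(args("output"))

    sc.close() // newer Scio releases use sc.run() instead
  }
}
```

Compared with the equivalent pipeline written against Beam’s Java SDK, the transforms read like ordinary Scala collection operations, which is the gap Scio is meant to close.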

More information about this talk

Apache Big Data Seville 2016 – Apache CouchDB 2.0 Sync Deep Dive – Jan Lehnardt

January 24, 2017
rbowen

Apache CouchDB 2.0 Sync Deep Dive – Jan Lehnardt

This talk takes a deep dive below the magic and explains how to build robust sync systems, whether you want to use CouchDB or build your own.

The talk will go through the components of a successful data sync system and the trade-offs you can make to solve your particular problems.

Reliable data sync, from Big Data to Mobile.
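
For a concrete picture of the primitive the talk digs beneath: CouchDB replication is driven over plain HTTP. The sketch below is only an illustration of that building block, not material from the recording; the hosts, database names and credentials are placeholders, and it uses the JDK’s built-in HTTP client (Java 11+) from Scala.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.charset.StandardCharsets
import java.util.Base64

object TriggerReplication {
  def main(args: Array[String]): Unit = {
    // POST a source/target pair to /_replicate to start a (continuous) sync;
    // a document in the /_replicator database makes it persistent instead.
    val body =
      """{
        |  "source": "http://127.0.0.1:5984/mydb",
        |  "target": "http://replica.example.com:5984/mydb",
        |  "continuous": true
        |}""".stripMargin

    val auth = Base64.getEncoder
      .encodeToString("admin:password".getBytes(StandardCharsets.UTF_8))

    val request = HttpRequest.newBuilder(URI.create("http://127.0.0.1:5984/_replicate"))
      .header("Content-Type", "application/json")
      .header("Authorization", s"Basic $auth")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body()) // e.g. {"ok":true, ...}
  }
}
```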

More information about this talk

Apache Big Data Seville 2016 – Introduction to Apache Beam – Jean-Baptiste Onofré & Dan Halperin

January 24, 2017
rbowen

Introduction to Apache Beam – Jean-Baptiste Onofré & Dan Halperin

Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines. The same Beam pipelines work in batch and streaming, and on a variety of big data processing backends, both open source and cloud-hosted, including Apache Flink, Apache Spark, and Google Cloud Dataflow. This talk will introduce Apache Beam’s programming model and mechanisms for efficient execution. The speakers will show how to build Beam pipelines, and demo how to use them to execute the same code across different runners.
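
As a companion to the abstract, here is a minimal sketch of a Beam pipeline written against the Java SDK from Scala. It is not code from the talk; the input/output paths and transform names are invented and the exact Beam version is assumed. The point it illustrates is the one the speakers make: the runner is chosen through pipeline options, while the pipeline definition stays the same.

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.{Count, MapElements, SimpleFunction}
import org.apache.beam.sdk.values.KV

// A named top-level function class stays serializable when a runner ships it to workers.
class FormatCounts extends SimpleFunction[KV[String, java.lang.Long], String] {
  override def apply(kv: KV[String, java.lang.Long]): String =
    s"${kv.getKey}: ${kv.getValue}"
}

object LineCount {
  def main(args: Array[String]): Unit = {
    // The runner is picked at launch time, e.g. --runner=FlinkRunner,
    // --runner=SparkRunner or --runner=DataflowRunner; the pipeline below does not change.
    val options = PipelineOptionsFactory.fromArgs(args: _*).withValidation().create()
    val pipeline = Pipeline.create(options)

    pipeline
      .apply("ReadLines", TextIO.read().from("input.txt"))
      .apply("CountDistinctLines", Count.perElement[String]())
      .apply("Format", MapElements.via(new FormatCounts))
      .apply("WriteCounts", TextIO.write().to("line-counts"))

    pipeline.run().waitUntilFinish()
  }
}
```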

More information about this talk

Apache Big Data Seville 2016 – Parquet Format in Practice & Detail – Uwe L. Korn

January 23, 2017
rbowen

Parquet Format in Practice & Detail – Uwe L. Korn

Apache Parquet is among the most commonly used column-oriented data formats in the big data processing space. It leverages various techniques to store data in a CPU- and I/O-efficient way. Furthermore, it can push analytical queries down to the I/O layer, avoiding the loading of irrelevant data chunks. With implementations in Java and C++, Parquet is also the perfect choice for exchanging data between different technology stacks.

As part of this talk, a general introduction to the format and its techniques will be given. Their benefits and some of the inner workings will be explained to give a better understanding of how Parquet achieves its performance. At the end, benchmarks comparing the new C++ & Python implementation with other formats will be shown.
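
To make column projection and predicate push-down tangible, here is a small reader-side sketch. It uses Apache Spark’s Scala API as one common Parquet consumer (not the new C++/Python implementation the benchmarks cover), and the file path and column names are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ParquetPushdown {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-pushdown-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Column projection: only `user_id` and `amount` are decoded from the file.
    // Predicate push-down: row groups whose min/max statistics rule out
    // `amount > 100` are skipped instead of being loaded and filtered in memory.
    val bigSpenders = spark.read
      .parquet("events.parquet")
      .select("user_id", "amount")
      .filter($"amount" > 100)

    bigSpenders.explain() // the physical plan lists the filters pushed into the Parquet scan
    bigSpenders.show()

    spark.stop()
  }
}
```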

More information about this talk
