FeatherCast

The voice of The Apache Software Foundation

Apache Big Data Seville 2016 – What 50 Years of Data Science Leaves Out – Sean Owen

January 24, 2017
rbowen

What 50 Years of Data Science Leaves Out – Sean Owen

(Recording is incomplete.)

We’re told “data science” is the key to unlocking the value in big data, but nobody seems to agree just what it is: engineering, statistics, or both? David Donoho’s paper “50 Years of Data Science” offers one of the best criticisms of the hype around data science from a statistics perspective, and proposes that data science is not new, if it’s anything at all. This talk will examine these points, and respond with an engineer’s counterpoints, in search of a better understanding of data science.

More information about this talk

Apache Big Data Seville 2016 – Scio, a Scala DSL for Apache Beam – Robert Gruener

January 24, 2017
rbowen

Scio, a Scala DSL for Apache Beam – Robert Gruener

Learn about Scio, a Scala DSL for Apache Beam. Beam introduces a simple, unified programming model for both batch and streaming data processing, while Scio brings it much closer to the high-level API many data engineers are familiar with. We will cover the design and implementation of the framework, including features like type-safe BigQuery and a REPL. There will also be a live coding demo.
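
To give a flavor of the DSL, here is a minimal word-count sketch in the style of Scio’s API; the input/output paths are placeholders, and the exact entry-point calls have varied across Scio releases:

```scala
import com.spotify.scio._

// Minimal Scio word count; "input" and "output" arguments are placeholder paths.
object WordCount {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)
    sc.textFile(args("input"))
      .flatMap(_.split("""\W+""").filter(_.nonEmpty))
      .countByValue
      .map { case (word, count) => s"$word\t$count" }
      .saveAsTextFile(args("output"))
    sc.run() // older Scio releases used sc.close() here
  }
}
```

Note how close this reads to Spark’s or Scalding’s collection API, which is exactly the familiarity the talk highlights.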

More information about this talk

Apache Big Data Seville 2016 – Apache CouchDB 2.0 Sync Deep Dive – Jan Lehnardt

January 24, 2017
rbowen

Apache CouchDB 2.0 Sync Deep Dive – Jan Lehnardt

This talk takes a deep dive below the magic and explains how to build robust sync systems, whether you want to use CouchDB or build your own.

The talk will go through the components of a successful data sync system and the trade-offs you can make to solve your particular problems.

Reliable data sync, from Big Data to Mobile.
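
To make those components concrete, here is a rough sketch of one round of CouchDB-style replication; all of the helpers are hypothetical stubs standing in for HTTP calls and checkpoint storage, not CouchDB’s actual client API:

```scala
// One round of CouchDB-style replication, structured as the protocol's steps.
// All helpers are hypothetical stubs (???), not a real client library.
object SyncSketch {
  case class Change(id: String, seq: Long, revs: Seq[String])

  def loadCheckpoint(source: String, target: String): Long = ???
  def changesSince(source: String, since: Long): Seq[Change] = ???       // source's changes feed
  def revsDiff(target: String, changes: Seq[Change]): Seq[Change] = ???  // ask target what it lacks
  def fetchDocs(source: String, missing: Seq[Change]): Seq[String] = ??? // pull those revisions
  def bulkDocs(target: String, docs: Seq[String]): Unit = ???            // write, preserving revision history
  def saveCheckpoint(source: String, target: String, seq: Long): Unit = ???

  def replicate(source: String, target: String): Unit = {
    val since   = loadCheckpoint(source, target)   // resume where the last run stopped
    val changes = changesSince(source, since)      // 1. what changed on the source?
    val missing = revsDiff(target, changes)        // 2. which revisions does the target lack?
    val docs    = fetchDocs(source, missing)       // 3. pull exactly those revisions
    bulkDocs(target, docs)                         // 4. push them to the target
    if (changes.nonEmpty)                          // 5. checkpoint, so retries resume, not restart
      saveCheckpoint(source, target, changes.map(_.seq).max)
  }
}
```

The checkpoint step is where many of the trade-offs live: how often you save it determines how much work an interrupted sync has to redo.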

More information about this talk

Apache Big Data Seville 2016 – Introduction to Apache Beam – Jean-Baptiste Onofré & Dan Halperin

January 24, 2017
rbowen

Introduction to Apache Beam – Jean-Baptiste Onofré & Dan Halperin

Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines. The same Beam pipelines work in batch and streaming, and on a variety of open source and private cloud big data processing backends including Apache Flink, Apache Spark, and Google Cloud Dataflow. This talk will introduce Apache Beam’s programming model and mechanisms for efficient execution. The speakers will show how to build Beam pipelines, and demo how to use it to execute the same code across different runners.
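
As a small illustration of that portability, here is a sketch of a trivial pipeline against the Beam Java SDK, written in Scala; the 2.x-style TextIO calls are assumed, and the paths are placeholders:

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.options.PipelineOptionsFactory

// The pipeline code never names a backend; the runner is chosen at launch time,
// e.g. --runner=FlinkRunner, --runner=SparkRunner, or --runner=DataflowRunner.
object CopyPipeline {
  def main(args: Array[String]): Unit = {
    val options = PipelineOptionsFactory.fromArgs(args: _*).withValidation().create()
    val p = Pipeline.create(options)

    p.apply("ReadLines", TextIO.read().from("/tmp/input/*"))    // placeholder path
     .apply("WriteLines", TextIO.write().to("/tmp/output/out")) // placeholder path

    p.run().waitUntilFinish()
  }
}
```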

More information about this talk

Apache Big Data Seville 2016 – Parquet Format in Practice & Detail – Uwe L. Korn

January 23, 2017
rbowen

Parquet Format in Practice & Detail – Uwe L. Korn

Apache Parquet is among the most commonly used column-oriented data formats in the big data processing space. It leverages various techniques to store data in a CPU- and I/O-efficient way. Furthermore, it can push down analytical queries to the I/O layer to avoid loading irrelevant data chunks. With several Java implementations and a C++ implementation, Parquet is also a perfect choice for exchanging data between different technology stacks.

As part of this talk, a general introduction to the format and its techniques will be given. Their benefits and some of the inner workings will be explained to give a better understanding of how Parquet achieves its performance. At the end, benchmarks comparing the new C++ & Python implementation with other formats will be shown.
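
For a small taste of the pushdown behaviour described above, here is a sketch using Spark’s built-in Parquet support; paths and column names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object ParquetDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-demo").getOrCreate()

    // Columnar layout: only the columns we select are read from disk...
    val events = spark.read.parquet("/data/events.parquet") // placeholder path
      .select("userId", "eventTime")
      // ...and this predicate is pushed down to the Parquet reader, which can
      // skip whole row groups using the column statistics stored in the file.
      .filter("eventTime >= '2016-01-01'")

    events.write.parquet("/data/recent-events.parquet") // placeholder path
    spark.stop()
  }
}
```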

More information about this talk

Apache Big Data Seville 2016 – Apache Kudu: A Distributed, Columnar Data Store for Fast Analytics – Mike Percy

January 23, 2017
rbowen

Apache Kudu: A Distributed, Columnar Data Store for Fast Analytics – Mike Percy

The Hadoop ecosystem has recently made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems like Apache Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems like Apache HBase, applications can achieve millisecond-scale random access to arbitrarily sized datasets. However, gaps remain when scans and random access are both required.

This talk will investigate the trade-offs between real-time random access and fast analytic performance from the perspective of storage engine internals. It will also describe Apache Kudu, a new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark, which fills the gap described above and provides a new option for achieving fast scans and fast random access from a single API.
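
For a sense of what that single API looks like from Spark, here is a sketch using the kudu-spark integration; the `.kudu` DataFrame reader shortcut reflects that era of the connector, and the master address, table, and columns are placeholders:

```scala
import org.apache.kudu.spark.kudu._
import org.apache.spark.sql.SparkSession

object KuduScan {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kudu-scan").getOrCreate()

    // Read a Kudu table as a DataFrame; master address and table name are placeholders.
    val metrics = spark.read
      .options(Map("kudu.master" -> "kudu-master:7051", "kudu.table" -> "metrics"))
      .kudu // implicit reader provided by the kudu-spark import above

    // The same table supports fast scans for analytics...
    metrics.groupBy("metric").avg("value").show()
    // ...and predicate-driven lookups for near-real-time access.
    metrics.filter("host = 'web01'").show()

    spark.stop()
  }
}
```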

More information about this talk

Apache Big Data Seville 2016 – Shared Memory Layer and Faster SQL for Spark Applications – Dmitriy Setrakyan

January 23, 2017
rbowen

Shared Memory Layer and Faster SQL for Spark Applications – Dmitriy Setrakyan

In this presentation we will talk about the need to share state in memory across different Spark jobs or applications, and about Apache Ignite as the technology that makes it possible. We will dive into the importance of in-memory file systems and shared in-memory RDDs with Apache Ignite, as well as the need to index data in memory for fast SQL execution. We will also present a hands-on demo showing the advantages and disadvantages of one approach over another, and discuss the requirements of storing data off-heap in order to achieve large horizontal and vertical scale of applications using Spark and Ignite.
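
A minimal sketch of the shared-RDD idea, assuming Ignite’s Spark integration of that era (IgniteContext, fromCache, savePairs); the XML config path is a placeholder, and the SQL assumes the cache is configured with Integer as an indexed type:

```scala
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

object SharedStateDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ignite-shared-rdd"))

    // Wrap the SparkContext; "ignite-config.xml" is a placeholder Spring config
    // that would define the cache and its indexed types.
    val ic = new IgniteContext[Int, Int](sc, "ignite-config.xml")

    // An IgniteRDD is a live view over an Ignite cache: pairs written here
    // outlive this job and are visible to other Spark applications.
    val shared = ic.fromCache("sharedNumbers")
    shared.savePairs(sc.parallelize(1 to 100000).map(i => (i, i)))

    // Indexed in-memory SQL over the shared state.
    val big = shared.sql("select _val from Integer where _val > ?", 50000: Integer)
    println(big.count())
  }
}
```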

More information about this talk

Apache Big Data Seville 2016 – What’s With the 1s and 0s? Making Sense of Binary Data at Scale with Tika and Friends – Nick Burch

January 23, 2017
rbowen

What’s With the 1s and 0s? Making Sense of Binary Data at Scale with Tika and Friends – Nick Burch

Large amounts of unknown data seek helpful tools to identify themselves and give up their contents!

With one or two files, you can take the time to identify them manually and get out their contents. With thousands of files, or an internet’s worth, this won’t scale, even with Mechanical Turks! Luckily, there are open source tools and programs out there to help.

First we’ll look at how to work out what a given blob of 1s and 0s actually is, be it textual or binary. We’ll then see how to extract common metadata from it, along with text, embedded resources, images, and maybe even the kitchen sink! We’ll see how Apache Tika can do all of this for you, along with alternate and additional tools. Finally, we’ll look at how to roll this all out at Big Data scale.
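
A minimal sketch of both steps using Tika’s Java API from Scala; the file path is a placeholder:

```scala
import java.io.{File, FileInputStream}
import org.apache.tika.Tika
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.sax.BodyContentHandler

object WhatIsThisFile {
  def main(args: Array[String]): Unit = {
    val file = new File("mystery-blob.bin") // placeholder path

    // Step 1: work out what the blob actually is (magic bytes, name hints, etc.).
    println(s"Detected type: ${new Tika().detect(file)}")

    // Step 2: extract text and metadata in one pass with the auto-detecting parser.
    val handler  = new BodyContentHandler()
    val metadata = new Metadata()
    val stream   = new FileInputStream(file)
    try new AutoDetectParser().parse(stream, handler, metadata)
    finally stream.close()

    metadata.names().sorted.foreach(n => println(s"$n = ${metadata.get(n)}"))
    println(handler.toString.take(200)) // first bit of the extracted text
  }
}
```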

More information about this talk
