What 50 Years of Data Science Leaves Out – Sean Owen
(Recording is incomplete.)
We’re told “data science” is the key to unlocking the value in big data, but nobody seems to agree just what it is — engineering, statistics, both? David Donoho’s paper “50 Years of Data Science” offers one of the best criticisms of the hype around data science from a statistics perspective, and proposes that data science is not new, if it’s anything at all. This talk will examine these points and respond with an engineer’s counterpoints, in search of a better understanding of data science.
Scio, a Scala DSL for Apache Beam – Robert Gruener
Learn about Scio, a Scala DSL for Apache Beam. Beam introduces a simple, unified programming model for both batch and streaming data processing, while Scio brings it much closer to the high-level API many data engineers are familiar with. We will cover the design and implementation of the framework, including features like type-safe BigQuery and a REPL. There will also be a live coding demo.
Introduction to Apache Beam – Jean-Baptiste Onofré & Dan Halperin
Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines. The same Beam pipelines work in batch and streaming, and on a variety of open source and private cloud big data processing backends including Apache Flink, Apache Spark, and Google Cloud Dataflow. This talk will introduce Apache Beam’s programming model and mechanisms for efficient execution. The speakers will show how to build Beam pipelines, and demo how to use it to execute the same code across different runners.
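The heart of Beam’s pitch above is that one pipeline definition runs unchanged over bounded (batch) and unbounded (streaming) sources. As a toy illustration of that idea — this is plain Python, not Beam’s actual SDK, and the function names are made up for this sketch — the same chain of transforms can consume a finite list and a generator alike:

```python
# Toy illustration (not the real Beam API) of the unified model:
# the same pipeline of transforms runs over a bounded (batch) source
# and a generator standing in for an unbounded (streaming) source.

def flat_map(fn, source):
    """Apply fn to each element, flattening the results (like Beam's FlatMap)."""
    for element in source:
        yield from fn(element)

def count_per_key(pairs):
    """Sum values per key (like a GroupByKey followed by a combiner)."""
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts

def word_count(lines):
    """One pipeline definition, reused for both kinds of input."""
    words = flat_map(lambda line: line.split(), lines)
    pairs = ((w, 1) for w in words)
    return count_per_key(pairs)

batch_source = ["beam beam", "flink spark beam"]

def streaming_source():
    # In real life this could be an endless socket or Kafka topic.
    yield "beam beam"
    yield "flink spark beam"

print(word_count(batch_source))
print(word_count(streaming_source()))
```

In Beam proper, the runner (Flink, Spark, Cloud Dataflow) supplies the execution strategy; the pipeline code stays the same, which is the portability point the talk makes.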
Apache Parquet is among the most commonly used column-oriented data formats in the big data processing space. It leverages various techniques to store data in a CPU- and I/O-efficient way. Furthermore, it can push analytical queries down to the I/O layer, avoiding the loading of non-relevant data chunks. With Java and C++ implementations, Parquet is also a perfect choice for exchanging data between different technology stacks.
As part of this talk, a general introduction to the format and its techniques will be given. Their benefits and some of the inner workings will be explained to give a better understanding of how Parquet achieves its performance. At the end, benchmarks comparing the new C++ and Python implementation with other formats will be shown.
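Two of the techniques the abstract alludes to are columnar layout and predicate push-down via per-chunk statistics. A minimal pure-Python sketch of those ideas (not Parquet’s actual format or API — chunk size, function names, and data are invented for illustration):

```python
# Toy sketch of two Parquet ideas: columnar layout, and per-chunk
# min/max statistics used to skip whole chunks during a filtered scan.

rows = [(1, 10), (2, 35), (3, 7), (4, 90), (5, 60), (6, 88)]

# Column-oriented storage: one list per column instead of one tuple per row,
# so a query touching one column never reads the others.
ids    = [r[0] for r in rows]
values = [r[1] for r in rows]

CHUNK = 3  # stand-in for a Parquet row group

def chunk_stats(col, size):
    """Min/max per chunk — the statistics Parquet keeps in its metadata."""
    return [(min(col[i:i + size]), max(col[i:i + size]))
            for i in range(0, len(col), size)]

def scan_greater_than(col, size, threshold):
    """Read only chunks whose max exceeds the threshold (push-down)."""
    hits, chunks_read = [], 0
    for idx, (lo, hi) in enumerate(chunk_stats(col, size)):
        if hi <= threshold:
            continue                      # whole chunk skipped: no I/O needed
        chunks_read += 1
        start = idx * size
        hits += [v for v in col[start:start + size] if v > threshold]
    return hits, chunks_read

hits, chunks_read = scan_greater_than(values, CHUNK, 50)
print(hits, chunks_read)   # only the second chunk is actually read
```

Here the first chunk’s maximum (35) rules it out for the predicate `value > 50`, so half the column is never read — the same mechanism, at file scale, behind Parquet’s I/O savings.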
Apache Kudu: A Distributed, Columnar Data Store for Fast Analytics – Mike Percy
The Hadoop ecosystem has recently made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems like Apache Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems like Apache HBase, applications can achieve millisecond-scale random access to arbitrarily-sized datasets. However, gaps remain when scans and random access are both required.
This talk will investigate the trade-offs between real-time random access and fast analytic performance from the perspective of storage engine internals. It will also describe Apache Kudu, a new addition to the open source Hadoop ecosystem with out-of-the-box Apache Spark integration, which fills the gap described above to provide a new option for fast scans and fast random access from a single API.
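The gap the abstract describes is that column stores scan fast but look up slowly, while key-value stores do the reverse. A toy sketch of serving both access patterns from one structure — emphatically not Kudu’s actual design, just an illustration of the combination: columnar arrays for scans plus a primary-key index for point lookups:

```python
# Toy illustration of combining fast scans and fast random access
# in one store (a simplification, not Kudu's real storage engine).

class TinyTable:
    def __init__(self):
        self.keys, self.metrics = [], []   # column-oriented storage
        self.index = {}                    # primary key -> row position

    def insert(self, key, metric):
        self.index[key] = len(self.keys)
        self.keys.append(key)
        self.metrics.append(metric)

    def lookup(self, key):
        """Random access: one index probe, then one position per column."""
        pos = self.index[key]
        return self.keys[pos], self.metrics[pos]

    def scan_sum(self):
        """Analytic scan: touches only the 'metrics' column."""
        return sum(self.metrics)

t = TinyTable()
for k, m in [("a", 5), ("b", 7), ("c", 11)]:
    t.insert(k, m)

print(t.lookup("b"))   # point lookup without scanning
print(t.scan_sum())    # column scan without touching keys
```

Real engines pay for this combination with extra machinery (compactions, delta stores, replication), which is exactly the internals trade-off the talk digs into.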
Shared Memory Layer and Faster SQL for Spark Applications – Dmitriy Setrakyan
In this presentation we will talk about the need to share state in memory across different Spark jobs or applications, and about Apache Ignite as the technology that makes it possible. We will dive into the importance of in-memory file systems and shared in-memory RDDs with Apache Ignite, as well as the need to index data in memory for fast SQL execution. A hands-on demo will show the advantages and disadvantages of one approach over another. We will also discuss the requirements of storing data off-heap in order to achieve large horizontal and vertical scale in applications using Spark and Ignite.
What’s With the 1s and 0s? Making Sense of Binary Data at Scale with Tika and Friends – Nick Burch
Large amounts of unknown data seek helpful tools to identify themselves and give up their contents!
With one or two files, you can take the time to identify them manually and get out their contents. With thousands of files, or an internet’s worth, this won’t scale, even with mechanical turks! Luckily, there are open source tools and programs out there to help.
First we’ll look at how we can work out what a given blob of 1s and 0s actually is, be it textual or binary. We’ll then see how to extract common metadata from it, along with text, embedded resources, images, and maybe even the kitchen sink! We’ll see how Apache Tika can do all of this for you, along with alternate and additional tools. Finally, we’ll look at how to roll this all out on a Big Data scale.
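Working out what a blob of bytes is typically starts with magic numbers: known byte signatures at the start of a file. A minimal sketch of that core idea in plain Python (not Tika’s actual API — Tika’s detectors also use filename hints, container inspection, and far more signatures):

```python
# Toy magic-number detector: match the first bytes of a blob against
# known signatures, then fall back to a crude text/binary heuristic.

SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-",             "application/pdf"),
    (b"PK\x03\x04",        "application/zip"),  # also .docx, .jar, ...
]

def detect(blob: bytes) -> str:
    for magic, mime in SIGNATURES:
        if blob.startswith(magic):
            return mime
    # Fallback: mostly printable bytes -> treat as plain text.
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in blob)
    if blob and printable / len(blob) > 0.95:
        return "text/plain"
    return "application/octet-stream"

print(detect(b"%PDF-1.7 ..."))     # matched by signature
print(detect(b"hello, world\n"))   # caught by the text heuristic
```

The ZIP entry hints at why real detection is harder than this sketch: a `.docx` and a `.jar` share the same magic bytes, so Tika must look inside the container to tell them apart.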
Hands On! Deploying Apache Hadoop Spark Cluster with HA, Monitoring, and Logging in AWS – Andrew Mcleod & Peter Vander Giessen
This is a hands-on, workshop-style session where attendees will learn how to deploy complex workloads such as a 10-node Hadoop Spark cluster complete with HA, logging, and monitoring, and then scale the cluster as needs grow. Attendees will also learn how to deploy other workloads, such as connecting Apache Kafka or Apache Zeppelin into the solution, or trying the latest cloud-native Kubernetes. We will then run a sample TeraSort, Spark job, and PageRank benchmark to get familiar with the cluster. An AWS controller will be provided for folks who don’t have cloud access.
No prior knowledge is needed, but if you want to get a head start, install the Juju client by following the docs at http://jujucharms.com/get-started
Developers are a possible attack vector for targeted attacks seeking to infiltrate malicious code into enterprises.
The speaker performed a network traffic analysis with the Bro Network Security Monitor (bro.org), backed by an ELK stack, while compiling Apache Bigtop, a big data distribution containing Apache Hadoop, Spark, HBase, Hive, Flink, et al.
While there are no obvious traces of malicious code within the traffic, there are many findings of possible attack vectors, such as insecurely configured critical software infrastructure servers, usage of private repositories, and insecure protocols.
The analysis showed that many compile jobs download and run executables from untrusted sources. The author will briefly explain how these weaknesses can be exploited and give recommendations on how to resolve these issues.
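One standard mitigation for the download-and-run pattern the analysis flags is to verify each fetched artifact against a checksum published over a trusted channel before executing it. A minimal standard-library sketch — the artifact bytes and function name here are invented for illustration, and this is a general technique, not the talk’s specific recommendation list:

```python
# Verify a downloaded build artifact against a published SHA-256 digest
# before running it, instead of executing untrusted bytes blindly.
import hashlib
import hmac

def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Return True only if the artifact's digest matches the expected one."""
    actual = hashlib.sha256(data).hexdigest()
    # Constant-time comparison, as recommended for digest checks.
    return hmac.compare_digest(actual, expected_sha256.lower())

# Hypothetical artifact; in practice the digest would come from the
# project's release page or a signed checksums file.
artifact = b"#!/bin/sh\necho build step\n"
published_digest = hashlib.sha256(artifact).hexdigest()

print(verify_artifact(artifact, published_digest))               # intact
print(verify_artifact(artifact + b"tampered", published_digest)) # modified
```

Checksums only help if the digest itself arrives over a trusted (e.g. TLS-protected or signature-verified) channel — which ties back to the abstract’s findings about insecure protocols in the build pipeline.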