How to Secure Apache Spark? – Neelesh Srinivas Salian
Security has been a crucial component of the Big Data ecosystem. The need to protect data from exploits and vulnerabilities is evident in the strong push for cybersecurity and secure clusters across businesses and industries alike. Spark itself has been a major analytic backbone of that infrastructure. Similar to the evolution of the security infrastructure on Hadoop, we see Spark's security growing as well. How does one secure Spark without much hassle? This talk focuses on the steps needed to set up security and discusses the potential issues on Spark Core, Streaming, and other components. The speaker has been helping large enterprise customers set up their infrastructure and ensure that it remains secure.
Apache Kudu: A Distributed, Columnar Data Store for Fast Analytics – Mike Percy
The Hadoop ecosystem has recently made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems like Apache Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems like Apache HBase, applications can achieve millisecond-scale random access to arbitrarily-sized datasets. However, gaps remain when scans and random access are both required.
This talk will investigate the trade-offs between real-time random access and fast analytic performance from the perspective of storage engine internals. It will also describe Apache Kudu, a new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark, which fills the gap described above by providing fast scans and fast random access from a single API.
Shared Memory Layer and Faster SQL for Spark Applications – Dmitriy Setrakyan
In this presentation we will talk about the need to share state in memory across different Spark jobs or applications, and about Apache Ignite as the technology that makes it possible. We will dive into the importance of in-memory file systems and shared in-memory RDDs with Apache Ignite, as well as the need to index data in memory for fast SQL execution. We will also present a hands-on demo showing the advantages and disadvantages of one approach over another. Finally, we will discuss the requirements for storing data off-heap in order to achieve large horizontal and vertical scale in applications using Spark and Ignite.
Apache Pig is a popular scripting platform for processing and analyzing large data sets in the Hadoop ecosystem. With its open architecture and backend neutrality, Pig scripts can currently run on MapReduce and Tez. Apache Spark is an open-source cluster computing framework for data analytics that has gained significant momentum recently. Besides offering performance advantages, Spark is also a more natural fit for the query plan produced by Pig. Pig on Spark enables improved ETL performance while also supporting users who intend to standardize on Spark as the execution engine.