Apache Kudu: A Distributed, Columnar Data Store for Fast Analytics – Mike Percy
The Hadoop ecosystem has recently made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems like Apache Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems like Apache HBase, applications can achieve millisecond-scale random access to arbitrarily-sized datasets. However, gaps remain when scans and random access are both required.
This talk will investigate the trade-offs between real-time random access and fast analytic performance from the perspective of storage engine internals. It will also describe Apache Kudu, a new addition to the open source Hadoop ecosystem with out-of-the-box Apache Spark integration, which fills the gap described above by providing fast scans and fast random access from a single API.
Hands On! Deploying Apache Hadoop Spark Cluster with HA, Monitoring, and Logging in AWS – Andrew Mcleod & Peter Vander Giessen
This is a hands-on, workshop-style session where attendees will learn how to deploy complex workloads such as a 10-node Hadoop Spark cluster complete with HA, logging, and monitoring. We can then scale the cluster from there as needed. Attendees will also learn how to deploy other workloads, such as connecting Apache Kafka or Apache Zeppelin into the solution, or trying the latest cloud-native Kubernetes. We will then run sample TeraSort, Spark job, and PageRank benchmarks to get familiar with the cluster. An AWS controller will be provided for folks who don’t have cloud access.
No prior knowledge is needed, but if you want to get a head start, install the Juju client by following the docs at http://jujucharms.com/get-started
Developers are a possible vector for targeted attacks that aim to infiltrate malicious code into enterprises.
The speaker performed a network traffic analysis with the Bro Network Security Monitor (bro.org), backed by an ELK Stack, while compiling Apache Bigtop, a big data distribution containing Apache Hadoop, Spark, HBase, Hive, Flink et al.
While there are no obvious traces of malicious code in the traffic, there are many findings of possible attack vectors, such as insecurely configured critical software infrastructure servers, usage of private repositories, and insecure protocols.
The analysis showed that many compile jobs download and run executables from untrusted sources. The author will briefly explain how these weaknesses can be exploited and will give recommendations on how to resolve these issues.
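One general mitigation for builds that download executables is to pin a known checksum and refuse to run anything that does not match it. The sketch below is an illustration only, not the speaker's recommendation: the function names `sha256_hex` and `verify_artifact` are hypothetical, and it assumes the expected SHA-256 digest was obtained from a trusted channel.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Check downloaded bytes against a pinned digest.

    Hypothetical helper: expected_sha256 is assumed to come from a
    trusted source (for example, a digest committed alongside the build).
    """
    actual = sha256_hex(data)
    # Normalise the pinned value, then compare in constant time.
    return hmac.compare_digest(actual, expected_sha256.strip().lower())


if __name__ == "__main__":
    payload = b"example build dependency"  # stands in for a downloaded file
    pinned = sha256_hex(payload)           # in practice, pinned ahead of time
    print(verify_artifact(payload, pinned))
    print(verify_artifact(payload + b" tampered", pinned))
```

A build step would call `verify_artifact` on the downloaded bytes and abort before executing them if it returns `False`.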
Create a Hadoop Cluster and Migrate 39PB Data Plus 150000 Jobs/Day – Stuart Pook
Criteo had a Hadoop cluster with 39 PB of raw storage, 13404 CPUs, 105 TB of RAM, 40 TB of data imported per day, and >100000 jobs per day. This cluster was critical for both storage and compute, but had no backups. This talk describes: 0/ the different options considered when deciding how to protect our data and compute capacity 1/ the criteria established for the 800 new computers and comparison tests between suppliers’ hardware 2/ the non-blocking network infrastructure with 10 Gb/s endpoints scalable to 5000 machines 3/ the installation and configuration, using Chef, of a cluster on new hardware 4/ the problems encountered in moving our jobs and data from the old CDH4 cluster to the new CDH5 cluster 600 km away 5/ running the two clusters in parallel and feeding both with data 6/ failover plans 7/ operational issues 8/ the performance of the 16800-core, 200 TB RAM, and 60 PB disk CDH5 cluster.
Apache Hadoop is used to run jobs that execute tasks over multiple machines, with complex dependencies between tasks. At scale, there can be 10’s to 1000’s of tasks running over 100’s to 1000’s of machines, which increases the challenge of making sense of their performance. Pipelines of such jobs that logically run a business workflow add another level of complexity. No wonder that the question of why Hadoop jobs run slower than expected remains a perennial source of grief for developers. In this talk, we will draw on our experience in debugging and analyzing Hadoop jobs to describe some methodical approaches to this problem and present current and new tracing and tooling ideas that can help semi-automate parts of it.
Apache Big Data is just a few weeks away. John Mertic will be keynoting about the challenges of deploying Apache Hadoop in the real world. I spoke with him about what he’ll be talking about, and about his day job at ODPi.