Hands On! Deploying Apache Hadoop Spark Cluster with HA, Monitoring, and Logging in AWS – Andrew Mcleod & Peter Vander Giessen
This is a hands-on, workshop-style session in which attendees will learn how to deploy complex workloads such as a 10-node Hadoop/Spark cluster complete with HA, logging, and monitoring. We can then scale the cluster from there as needs dictate. Attendees will also learn how to deploy other workloads, such as connecting Apache Kafka or Apache Zeppelin into the solution, or trying the latest cloud-native Kubernetes. We will then run a sample TeraSort, Spark job, and PageRank benchmark to get familiar with the cluster. An AWS controller will be provided for folks who don’t have cloud access.
No prior knowledge is needed, but if you want to get a head start, install the Juju client by following the docs at http://jujucharms.com/get-started
Developers are a possible attack vector for targeted attacks to infiltrate malicious code into enterprises.
The speaker performed a network traffic analysis with the Bro Network Security Monitor (bro.org), backed by an ELK stack, while compiling Apache Bigtop, a Big Data distribution containing Apache Hadoop, Spark, HBase, Hive, Flink, et al.
While there are no obvious traces of malicious code within the traffic, there are many findings of possible attack vectors, such as insecurely configured critical software infrastructure servers, the use of private repositories, and insecure protocols.
The analysis showed that many compile jobs download and run executables from untrusted sources. The speaker will briefly explain how these weaknesses can be exploited and will give recommendations on how to resolve these issues.
The Myth of the Big Data Silver Bullet – Why Requirements Still Matter – Nick Burch
We’ve all heard the hype – Big Data will solve all your storage, processing, and analytics problems effortlessly! As Big Data moves along the adoption cycle, there’s a wider range of possible technologies and platforms you could use, but sadly picking the right one still remains crucial to success. Some who move beyond the buzzwords to deploy Big Data find things really do work well, but others rapidly run into issues. The difference usually isn’t the technologies or the vendors per se, but their appropriateness to the requirements, which aren’t always clear up-front…
This session won’t tell you what Big Data solution you need. Instead, we’ll cover some of the pitfalls, and help you with the questions towards working out your requirements in time for your Big Data system to be a success!
SASI, Cassandra on the Full Text Search Ride! – DuyHai Doan
Apache Cassandra is a scalable database with high-availability features, but this scalability comes with severe limitations in terms of querying capabilities.
Since the introduction of SASI in Cassandra 3.4, those limitations belong to the past. Now you can create indices on your columns and benefit from full text search capabilities with the introduction of the new `LIKE '%term%'` syntax.
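As a minimal sketch of the syntax (the `albums` table and `title` column are hypothetical names for illustration), a SASI index created in `CONTAINS` mode is what enables the leading-wildcard `LIKE` queries:

```sql
-- Hypothetical table; SASI requires Cassandra 3.4 or later.
CREATE CUSTOM INDEX albums_title_sasi ON albums (title)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'CONTAINS' };

-- CONTAINS mode allows substring (full text style) searches.
SELECT * FROM albums WHERE title LIKE '%love%';
```

With the default `PREFIX` mode, only `LIKE 'term%'` queries would be served; `CONTAINS` trades a larger index for the `%term%` form.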
To illustrate how SASI works, we’ll use a database of 100,000 albums and artists. We’ll also show how SASI can help accelerate analytics scenarios with Apache Spark using SparkSQL predicate push-down.
We will also highlight some use cases where SASI is not a good fit and should be avoided (there is no magic, sorry).
When working with Big Data and IoT systems, we often feel the need for a common query language. System-specific languages usually require longer adoption time and are harder to integrate into existing stacks.
To fill this gap, some NoSQL vendors are building SQL access to their systems. Building a SQL engine from scratch is a daunting job, and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allows you to integrate a SQL parser, a cost-based optimizer, and JDBC with your NoSQL system. We will walk through the process of building a SQL access layer for Apache Geode (an in-memory data grid). I will share my experience, pitfalls, and technical considerations, such as balancing between SQL/RDBMS semantics and the design choices and limitations of the data system.
Hopefully this will enable you to add SQL capabilities to your preferred NoSQL data system.
A thorough introduction to CouchDB 2.0, the five-years-in-the-making final delivery of the larger CouchDB vision.
Apache CouchDB 2.0 finally puts the C back in CouchDB: Cluster Of Unreliable Commodity Hardware. With a production-proven implementation of the Amazon Dynamo paper, CouchDB now has high-availability, multi-machine clustering as well as scaling options built in, making it ready for Big Data solutions that benefit from CouchDB’s unique multi-master replication.
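The Dynamo design mentioned above rests on a simple quorum rule: a read is guaranteed to see the latest write when the read and write quorums together exceed the number of replicas. A minimal sketch in Python (the parameter values shown are illustrative; three replicas with majority quorums is a common Dynamo-style default):

```python
def quorum_is_consistent(n: int, r: int, w: int) -> bool:
    """Dynamo-style quorum check.

    n -- number of replicas holding a copy of each document
    r -- replicas that must answer a read
    w -- replicas that must acknowledge a write

    A read overlaps the most recent write whenever R + W > N,
    because at least one replica is in both quorums.
    """
    return r + w > n

# Three replicas with majority read/write quorums: consistent reads.
print(quorum_is_consistent(n=3, r=2, w=2))  # True

# Single-replica quorums favor latency but allow stale reads.
print(quorum_is_consistent(n=3, r=1, w=1))  # False
```

Lowering `r` and `w` trades consistency for availability and latency, which is exactly the dial a multi-master cluster exposes.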
Multi-Tenant Machine Learning with Apache Aurora and Apache Mesos – Stephan Erb
Data scientists care about statistics and fast iteration cycles for their experiments. They should not be concerned with technicalities like hardware failures, tenant isolation, or low cluster utilization. In order to shield its data scientists from these matters, Blue Yonder is using Apache Aurora.
When adopting Aurora, our goal was to run multiple machine learning projects on the same physical cluster. This talk will go into the details of this adoption process and highlight key engineering decisions we have made. Particular focus will be on the multi-tenancy and oversubscription features of Apache Aurora and Apache Mesos, its underlying resource manager.
Audience members will learn about the fundamentals of both Apache projects and how those can be assembled into a capable machine learning platform.
Native and Distributed Machine Learning with Apache Mahout – Suneel Marthi
Data scientists love tools like R and Scikit-Learn because they are declarative and offer convenient, intuitive syntax for analysis tasks, but these tools are limited by local memory. Mahout offers similar features with near-seamless distributed execution.
In this talk, we will look at Mahout-Samsara’s distributed linear algebra capabilities and demonstrate them by building a classification algorithm for the popular ‘Eigenfaces’ problem using the Samsara DSL from an Apache Zeppelin notebook. We will demonstrate how a simple classification algorithm may be prototyped and executed, and show the performance of the Samsara DSL with GPU acceleration. This will demonstrate how ML algorithms built with the Samsara DSL are automatically parallelized and optimized to execute on Apache Flink and Apache Spark without the developer having to deal with the underlying semantics of the execution engine.
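For readers unfamiliar with the Eigenfaces problem itself: it is principal component analysis over a matrix of flattened face images, with the principal components ("eigenfaces") serving as a reduced basis for classification. A minimal single-machine sketch in plain NumPy (random data stands in for real face images; the Samsara DSL distributes the equivalent linear algebra):

```python
import numpy as np

# Toy stand-in for face images: 20 "images" of 8x8 pixels, flattened.
rng = np.random.default_rng(42)
faces = rng.random((20, 64))

# Eigenfaces are the principal components of the mean-centered data.
mean_face = faces.mean(axis=0)
centered = faces - mean_face

# SVD yields the eigenfaces as the right singular vectors.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = vt[:10]  # keep the top 10 components

# Project each face into the reduced "face space"; a classifier
# (e.g. nearest neighbor) then operates on these weight vectors.
weights = centered @ eigenfaces.T
print(weights.shape)  # (20, 10)
```

In Samsara, `faces` would be a distributed row matrix and the decomposition a distributed operation, with the engine (Flink or Spark) handling partitioning behind the DSL.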