The Myth of the Big Data Silver Bullet – Why Requirements Still Matter – Nick Burch
We’ve all heard the hype – Big Data will solve all your storage, processing and analytics problems effortlessly! As Big Data moves along the adoption cycle, there’s a wider range of possible technologies and platforms you could use, but sadly picking the right one remains crucial to success. Some who move beyond the buzzwords to deploy Big Data find things really do work well, but others rapidly run into issues. The difference usually isn’t the technologies or the vendors per se, but their appropriateness to the requirements, which aren’t always clear up-front…
This session won’t tell you what Big Data solution you need. Instead, we’ll cover some of the pitfalls, and help you with the questions you need to ask to work out your requirements in time for your Big Data system to be a success!
When working with Big Data and IoT systems, we often feel the need for a common query language. System-specific languages usually require longer adoption times and are harder to integrate into existing stacks.
To fill this gap, some NoSQL vendors are building SQL access to their systems. Building a SQL engine from scratch is a daunting job, and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allows you to integrate a SQL parser, a cost-based optimizer, and JDBC with your NoSQL system. We will walk through the process of building a SQL access layer for Apache Geode (an in-memory data grid). I will share my experience, pitfalls and technical considerations, such as balancing SQL/RDBMS semantics against the design choices and limitations of the underlying data system.
Hopefully this will enable you to add SQL capabilities to your preferred NoSQL data system.
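To make the idea of such a SQL access layer concrete: real Calcite adapters implement Java interfaces such as ScannableTable, and the planner can push projections and filters down to the storage layer. The following stdlib-only Python toy sketches that adapter contract in a language-neutral way; the "regions" table, its columns, and the query are made up for illustration and are not Geode's or Calcite's actual API.

```python
# Toy sketch of the contract a Calcite-backed SQL layer relies on:
# the engine hands the storage layer a projection and a pushed-down
# predicate, and the storage layer returns only the matching rows,
# trimmed to the requested columns.
def scan(rows, projection, predicate=None):
    for row in rows:
        if predicate is None or predicate(row):
            yield {col: row[col] for col in projection}

# Hypothetical key-value regions, stood in for a NoSQL store's data.
regions = [
    {"name": "emea", "entries": 120, "memory_mb": 512},
    {"name": "apac", "entries": 80, "memory_mb": 256},
]

# Roughly: SELECT name, entries FROM regions WHERE entries > 100
result = list(scan(regions, ["name", "entries"],
                   lambda r: r["entries"] > 100))
```

Pushing the predicate into `scan` rather than filtering afterwards is the point of adapter-level optimization: the data system does the filtering where the data lives.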
A thorough introduction to CouchDB 2.0, the five-years-in-the-making final delivery of the larger CouchDB vision.
Apache CouchDB 2.0 finally puts the C back in C.O.U.C.H.D.B.: Cluster Of Unreliable Commodity Hardware. With a production-proven implementation of the Amazon Dynamo paper, CouchDB now has high-availability, multi-machine clustering as well as scaling options built in, making it ready for Big Data solutions that benefit from CouchDB’s unique multi-master replication.
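The core availability trick from the Dynamo paper can be sketched in a few lines: with n replicas, writes going to w nodes and reads querying r nodes, choosing r + w > n guarantees every read quorum overlaps the latest write quorum. This stdlib-only toy illustrates that quorum-overlap idea only; it is not CouchDB's implementation, and the class and parameter names are made up.

```python
import random

class QuorumStore:
    """Toy Dynamo-style quorum store: n replicas, write quorum w, read quorum r."""

    def __init__(self, n=5, w=3, r=3):
        assert r + w > n, "read and write quorums must overlap"
        self.nodes = [{} for _ in range(n)]  # each node is its own key/value map
        self.w, self.r = w, r

    def put(self, key, value, version):
        # Write to w nodes chosen at random, simulating which replicas
        # happened to be reachable for this request.
        for node in random.sample(self.nodes, self.w):
            node[key] = (version, value)

    def get(self, key):
        # Read from r random nodes and keep the highest-versioned reply;
        # quorum overlap guarantees at least one node has the newest write.
        replies = [node[key] for node in random.sample(self.nodes, self.r)
                   if key in node]
        return max(replies)[1] if replies else None
```

Even when successive writes land on different subsets of nodes, a read always sees the latest version, because any r nodes intersect the last w-node write set.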
Multi-Tenant Machine Learning with Apache Aurora and Apache Mesos – Stephan Erb
Data scientists care about statistics and fast iteration cycles for their experiments. They should not be concerned with technicalities like hardware failures, tenant isolation, or low cluster utilization. In order to shield its data scientists from these matters, Blue Yonder is using Apache Aurora.
When adopting Aurora, our goal was to run multiple machine learning projects on the same physical cluster. This talk will go into the details of this adoption process and highlight key engineering decisions we have made. Particular focus will be on the multi-tenancy and oversubscription features of Apache Aurora and Apache Mesos, its underlying resource manager.
Audience members will learn about the fundamentals of both Apache projects and how those can be assembled into a capable machine learning platform.
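For a flavor of what running a workload on such a platform looks like, here is a minimal job definition in Aurora's Python-based `.aurora` configuration DSL. The job, role, and command names are hypothetical and the resource figures illustrative; `Process`, `Task`, `Resources`, `Job` and the `GB` constant are supplied by Aurora's DSL, so this fragment is evaluated by Aurora rather than run as a standalone script.

```python
# Hypothetical .aurora config for a batch model-training job.
train = Process(name='train_model', cmdline='python train.py')

train_task = Task(
    processes=[train],
    resources=Resources(cpu=2.0, ram=4*GB, disk=8*GB))

jobs = [Job(
    cluster='devcluster',    # which Aurora/Mesos cluster to target
    role='data-science',     # the tenant this job is accounted against
    environment='prod',
    name='train_model',
    task=train_task)]
```

The `role` field is where multi-tenancy enters: resource quota and accounting are attached to the role, so several teams can share one physical cluster with enforced isolation.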
Native and Distributed Machine Learning with Apache Mahout – Suneel Marthi
Data scientists love tools like R and scikit-learn since they are declarative and offer convenient, intuitive syntax for analysis tasks, but these tools are limited by local memory. Mahout offers similar features with near-seamless distributed execution.
In this talk, we will look at Mahout-Samsara’s distributed linear algebra capabilities and demonstrate them by building a classification algorithm for the popular ‘Eigenfaces’ problem using the Samsara DSL from an Apache Zeppelin notebook. We will demonstrate how a simple classification algorithm may be prototyped and executed, and show the performance of the Samsara DSL with GPU acceleration. This will demonstrate how ML algorithms built with the Samsara DSL are automatically parallelized and optimized to execute on Apache Flink and Apache Spark without the developer having to deal with the underlying semantics of the execution engine.
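The Eigenfaces technique itself is plain linear algebra, which is why it maps so naturally onto a DSL like Samsara. As a stdlib-only illustration of the underlying math (not Mahout's API), the sketch below centers a set of tiny made-up 4-pixel "face" vectors, finds the top principal component ("eigenface") by power iteration, projects faces onto it, and classifies by nearest class centroid; the data and class labels are invented for the example.

```python
def centered(faces, mean):
    return [[x - m for x, m in zip(f, mean)] for f in faces]

def covariance(xc):
    d, n = len(xc[0]), len(xc)
    return [[sum(f[i] * f[j] for f in xc) / n for j in range(d)]
            for i in range(d)]

def power_iteration(cov, iters=100):
    # Repeatedly apply the covariance matrix and renormalize to converge
    # on its dominant eigenvector -- the top "eigenface".
    v = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(len(v)))
             for i in range(len(v))]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Tiny made-up 4-pixel "images" in two classes.
train = [([1.0, 1.0, 0.0, 0.0], "A"), ([0.9, 1.1, 0.1, 0.0], "A"),
         ([1.1, 0.9, 0.0, 0.1], "A"), ([0.0, 0.1, 1.0, 0.9], "B"),
         ([0.1, 0.0, 0.9, 1.1], "B"), ([0.0, 0.0, 1.1, 1.0], "B")]
faces = [f for f, _ in train]
mean = [sum(col) / len(faces) for col in zip(*faces)]
eigenface = power_iteration(covariance(centered(faces, mean)))

def project(face):
    # Coordinate of a face along the top eigenface, after centering.
    return sum((x - m) * e for x, m, e in zip(face, mean, eigenface))

centroids = {}
for label in ("A", "B"):
    vals = [project(f) for f, l in train if l == label]
    centroids[label] = sum(vals) / len(vals)

def classify(face):
    p = project(face)
    return min(centroids, key=lambda label: abs(p - centroids[label]))
```

In Samsara the centering, covariance, and decomposition steps become one-line distributed matrix expressions; the point here is only that the algorithm is a handful of linear-algebra operations.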
Unified Benchmarking of Big Data Platforms – Axel-Cyrille Ngonga Ngomo
Which Big Data platform should I use for my problem? This remains one of the most important questions for practitioners. In this talk, we will present HOBBIT (http://project-hobbit.eu), a universal benchmarking platform for Big Data. The platform provides a unified approach to benchmarking Big Data frameworks. Mimicking algorithms trained on real data ensure that the datasets used for benchmarking resemble real data but are open for all to use, thereby circumventing the issues that come with using company-bound data. The core of the platform implements industry-relevant KPIs gathered from more than 70 Big-Data-driven organizations. The results are generated in machine-readable formats so that they can be analyzed and used to improve tools and frameworks. In the talk, I will present the architecture of the framework and some preliminary results.
Building and Running a Solr-as-a-Service for IBM Watson – Shai Erera
Running a managed Solr service brings fun challenges with it, for both the users and the service itself. Users typically do not have access to all components of the Solr system (e.g. the ZK ensemble, the actual nodes that Solr runs on, etc.). On the other hand, the service must ensure high availability at all times, and handle what are often user-driven tasks, such as version upgrades, taking nodes offline for maintenance and more.
In this talk I will describe how we tackled these challenges to build a managed Solr service on the cloud, which currently hosts a few thousand Solr clusters. I will focus on the infrastructure that we chose to run the Solr clusters on, as well as how we ensure high availability, cluster balancing and version upgrades.
Large Scale Open Source Data Processing Pipelines at Trivago – Clemens Valiente
trivago is processing roughly 7 billion events per day with an architecture that is entirely open source – from producing the data to its visualization in dashboards and reports. This talk will explain the idea behind the pipeline, highlight a particular business use case and share the experience and engineering challenges from two years in production. Clemens Valiente will furthermore show the different tools, frameworks and systems used, with Kafka for data ingestion, Hadoop and Hive for processing, and Impala for querying as the main focus. The successful implementation of this large-scale data processing pipeline fundamentally transformed the way trivago was able to approach its business.