FeatherCast

The voice of The Apache Software Foundation

Apache Big Data Seville 2016 – The Myth of the Big Data Silver Bullet – Why Requirements Still Matter – Nick Burch

January 19, 2017
rbowen

The Myth of the Big Data Silver Bullet – Why Requirements Still Matter – Nick Burch

We’ve all heard the hype – Big Data will solve all your storage, processing and analytic problems effortlessly! As Big Data moves along the adoption cycle, there’s a wider range of possible technologies and platforms you could use, but sadly picking the right one still remains crucial to success. Some moving beyond the buzzwords to deploy Big Data find things really do work well, but others rapidly run into issues. The difference usually isn’t the technologies or the vendors per-se, but their appropriateness to the requirements, which aren’t always clear up-front…

This session won’t tell you what Big Data solution you need. Instead, we’ll cover some of the pitfalls, and help you with the questions towards working out your requirements in time for your Big Data system to be a success!

More information about this talk

Apache Big Data Seville 2016 – User Defined Functions and Materialized Views in Cassandra 3.0 – DuyHai Doan

January 19, 2017
rbowen

User Defined Functions and Materialized Views in Cassandra 3.0 – DuyHai Doan

Cassandra is evolving at a very fast pace and keeps introducing new features that close the gap with the traditional SQL world, but they are always designed with a distributed approach in mind.

First, we’ll take a look at the recent user-defined functions and show how they can improve your application performance and enrich your analytics use cases.

Next, a tour of materialized views, a major improvement that drastically changes the way people model data in Cassandra and makes developers’ lives easier!
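The appeal of materialized views is that the database, not the application, keeps a second copy of the data queryable by a different key. As a rough illustration of that bookkeeping (a toy in plain Python, not Cassandra’s implementation or CQL; all names here are made up), consider a base table keyed by user id with a view keyed by email:

```python
# Toy sketch of what a materialized view saves you from doing by hand:
# a second index ("by_email") that every write to the base table must keep
# in sync. In Cassandra 3.0 you would declare this with CREATE MATERIALIZED
# VIEW and the server maintains it; here the upsert does the bookkeeping.

class TableWithView:
    def __init__(self):
        self.base = {}        # base table: user_id -> row
        self.by_email = {}    # "materialized view": email -> row

    def upsert(self, user_id, email, name):
        old = self.base.get(user_id)
        if old is not None:
            # an update can change the view key, so the stale view entry
            # must be removed before the new one is written
            self.by_email.pop(old["email"], None)
        row = {"user_id": user_id, "email": email, "name": name}
        self.base[user_id] = row
        self.by_email[email] = row

t = TableWithView()
t.upsert(1, "a@example.com", "Ada")
t.upsert(1, "a@new.example.com", "Ada")  # email changed; the view follows
```

The tricky part the toy hints at is exactly what makes server-maintained views valuable in a distributed store: updates that change the view key require a read-then-delete-then-write, which is easy to get wrong in application code.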

More information about this talk

Apache Big Data Seville 2016 – Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Other NoSQL Data Systems – Christian Tzolov

January 19, 2017
rbowen

Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Other NoSQL Data Systems – Christian Tzolov

When working with Big Data and IoT systems, we often feel the need for a common query language. System-specific languages usually require longer adoption time and are harder to integrate with existing stacks.

To fill this gap, some NoSQL vendors are building SQL access to their systems. Building a SQL engine from scratch is a daunting job, and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allows you to integrate a SQL parser, a cost-based optimizer, and JDBC with your NoSQL system. We will walk through the process of building a SQL access layer for Apache Geode (an in-memory data grid). I will share my experience, pitfalls, and technical considerations, like balancing between SQL/RDBMS semantics and the design choices and limitations of the data system.

Hopefully this will enable you to add SQL capabilities to your preferred NoSQL data system.
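The core idea of such an adapter is to parse SQL and push predicates down into the data store, rather than scanning everything client-side. A minimal sketch of that pattern (a toy in Python with made-up names — not Calcite’s Java API, and the regex stands in for a real SQL parser):

```python
# Toy sketch of a SQL-over-NoSQL adapter: parse a tiny WHERE clause and
# evaluate the predicate "inside" the store layer, so only matching rows
# cross the boundary. Calcite does this for real with a full parser and
# a cost-based optimizer.
import re

STORE = {  # stands in for a Geode region / NoSQL table
    "k1": {"city": "Seville", "pop": 690000},
    "k2": {"city": "Madrid", "pop": 3200000},
}

def query(sql):
    # supports only: SELECT * FROM t WHERE <col> = '<value>'
    m = re.fullmatch(r"SELECT \* FROM t WHERE (\w+) = '([^']*)'", sql)
    if not m:
        raise ValueError("unsupported SQL in this toy parser")
    col, val = m.groups()
    # "pushdown": filter at the storage layer instead of in the client
    return [row for row in STORE.values() if str(row.get(col)) == val]

rows = query("SELECT * FROM t WHERE city = 'Seville'")
```

In a real adapter the interesting work is exactly what the toy skips: deciding which parts of a query the backend can execute natively and which must fall back to the generic engine.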

More information about this talk

Apache Big Data Seville 2016 – Introducing Apache CouchDB 2.0 – Jan Lehnardt

January 19, 2017
rbowen

Introducing Apache CouchDB 2.0 – Jan Lehnardt

A thorough introduction to CouchDB 2.0, the five-years-in-the-making final delivery of the larger CouchDB vision.
Apache CouchDB 2.0 finally puts the C back in CouchDB: Cluster Of Unreliable Commodity Hardware. With a production-proven implementation of the Amazon Dynamo paper, CouchDB now has high-availability, multi-machine clustering as well as scaling options built in, making it ready for Big Data solutions that benefit from CouchDB’s unique multi-master replication.

Apache Big Data Seville 2016 – Multi-Tenant Machine Learning with Apache Aurora and Apache Mesos – Stephan Erb

January 13, 2017
rbowen

Multi-Tenant Machine Learning with Apache Aurora and Apache Mesos – Stephan Erb

Data scientists care about statistics and fast iteration cycles for their experiments. They should not be concerned with technicalities like hardware failures, tenant isolation, or low cluster utilization. In order to shield its data scientists from these matters, Blue Yonder is using Apache Aurora.

When adopting Aurora, our goal was to run multiple machine learning projects on the same physical cluster. This talk will go into the details of this adoption process and highlight key engineering decisions we have made. Particular focus will be on the multi-tenancy and oversubscription features of Apache Aurora and Apache Mesos, its underlying resource manager.

Audience members will learn about the fundamentals of both Apache projects and how those can be assembled into a capable machine learning platform.

More information about this talk

Women in Big Data Luncheon & Program – ApacheCon Seville

January 13, 2017
rbowen

The Women’s Luncheon from ApacheCon Seville:

Luncheon Agenda

1:50pm – WiBD Overview – Anna Marchon

2:00pm – Keynote: Tina Rosario, Global VP, Enterprise Data Management at SAP

2:30pm – Keynote: Marina Alekseeva, GM of the Intel Software and Service Group in Russia

3:00pm – Networking

More information about this talk

Apache Big Data Seville 2016 – Native and Distributed Machine Learning with Apache Mahout – Suneel Marthi

January 13, 2017
rbowen

Native and Distributed Machine Learning with Apache Mahout – Suneel Marthi

Data scientists love tools like R and Scikit-Learn, since they are declarative and offer convenient, intuitive syntax for analysis tasks, but they are limited by local memory. Mahout offers similar features with near-seamless distributed execution.

In this talk, we will look at Mahout-Samsara’s distributed linear algebra capabilities and demonstrate the same by building a classification algorithm for the popular ‘Eigenfaces’ problem using the Samsara DSL from an Apache Zeppelin notebook. We will demonstrate how a simple classification algorithm may be prototyped and executed, and show the performance using Samsara DSL with GPU acceleration. This will demonstrate how ML algorithms built with Samsara DSL are automatically parallelized and optimized to execute on Apache Flink and Apache Spark without the developer having to deal with the underlying semantics of the execution engine.
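The key property described above is that the algorithm is written once, declaratively, and the engine decides how to run it. As a very loose illustration of that separation (a toy in plain Python with invented names — Samsara’s actual DSL is Scala and its optimizer targets Spark and Flink), here is a tiny “logical plan” for the A′A computation common in such pipelines:

```python
# Toy sketch of a declarative linear-algebra DSL: the user states *what* to
# compute (here, A-transpose times A); an execute() step decides *how*.
# A real optimizer like Samsara's would rewrite the plan and dispatch it to
# a distributed engine instead of this local evaluator.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

# "logical plan": A' * A, written once, independent of any engine
plan = ("t_times_self", [[1, 2], [3, 4]])

def execute(plan, engine="local"):
    op, a = plan
    if op == "t_times_self":
        # a real optimizer would fuse the transpose and the multiply
        # into one pass; the "engine" argument hints at where a backend
        # (Spark, Flink, GPU) would be selected
        return matmul(transpose(a), a)
    raise ValueError(op)

result = execute(plan)  # [[10, 14], [14, 20]]
```

The point is the division of labor: because the plan is data, the optimizer is free to parallelize it without the author ever touching engine semantics, which is exactly the claim the talk makes for Samsara.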

More information about this talk

Apache Big Data Seville 2016 – Unified Benchmarking of Big Data Platforms – Axel-Cyrille Ngonga Ngomo

January 12, 2017
rbowen

Unified Benchmarking of Big Data Platforms – Axel-Cyrille Ngonga Ngomo

Which Big Data platform should I use for my problem? This remains one of the most important questions for practitioners. In this talk, we will present HOBBIT (http://project-hobbit.eu), a universal benchmarking platform for Big Data. The platform provides a unified approach to benchmarking Big Data frameworks. Mimicking algorithms generated from real data ensure that the datasets used for benchmarking resemble real data but are open for all to use, thereby circumventing the issues that come about when using company-bound data. The core of the platform implements industry-relevant KPIs gathered from more than 70 Big-Data-driven organizations. The results are generated in machine-readable formats so as to ensure that they can be analyzed and used for improving tools and frameworks. In the talk, I will present the architecture of the framework and some preliminary results.
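The “mimicking” idea can be sketched in miniature (a toy in Python, not HOBBIT’s actual generators; the data and names are invented): learn summary statistics from a private dataset, then publish only a synthetic dataset drawn from those statistics.

```python
# Toy illustration of mimicking data generation: the private measurements
# stay private; only the synthetic sample (with similar mean and spread)
# would be released for open benchmarking.
import random
import statistics

private_latencies = [12.1, 9.8, 15.3, 11.0, 13.7, 10.4, 14.9, 12.6]

mu = statistics.mean(private_latencies)
sigma = statistics.stdev(private_latencies)

rng = random.Random(42)  # seeded so benchmark runs are reproducible
synthetic = [max(0.0, rng.gauss(mu, sigma)) for _ in range(1000)]
```

Real mimicking algorithms model far richer structure than a single Gaussian (correlations, schema, value distributions), but the contract is the same: the benchmark dataset resembles the company-bound original without containing it.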

More information about this talk

Apache Big Data Seville 2016 – Building and Running a Solr-as-a-Service for IBM Watson – Shai Erera

January 12, 2017
rbowen

Building and Running a Solr-as-a-Service for IBM Watson – Shai Erera

Running a managed Solr service brings fun challenges with it, for both the users and the service itself. Users typically do not have access to all components of the Solr system (e.g. the ZooKeeper ensemble, the actual nodes that Solr runs on, etc.). On the other hand, the service must ensure high availability at all times, and handle what are often user-driven tasks such as version upgrades, taking nodes offline for maintenance, and more.

In this talk I will describe how we tackle these challenges to build a managed Solr service on the cloud, which currently hosts a few thousand Solr clusters. I will focus on the infrastructure that we chose to run the Solr clusters on, as well as how we ensure high availability, cluster balancing, and version upgrades.

More information about this talk

Apache Big Data Seville 2016 – Large Scale Open Source Data Processing Pipelines at Trivago – Clemens Valiente

January 12, 2017
rbowen

Large Scale Open Source Data Processing Pipelines at Trivago – Clemens Valiente

trivago processes roughly 7 billion events per day with an architecture that is entirely open source – from producing the data to visualizing it in dashboards and reports. This talk will explain the idea behind the pipeline, highlight a particular business use case, and share the experience and engineering challenges from two years in production. Clemens Valiente will furthermore show the different tools, frameworks, and systems used, with a focus on Kafka for data ingestion, Hadoop and Hive for processing, and Impala for querying. The successful implementation of this large-scale data processing pipeline fundamentally transformed the way trivago was able to approach its business.

More information about this talk
