Interactive Analytics at Scale in Apache Hive Using Druid – Jesús Camacho Rodríguez
Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability and efficiency have made it a popular choice to power user-facing analytic applications. However, it does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid indexing and querying capabilities using Apache Hive. In particular, our solution allows to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that brings benefits to both systems alike, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries.
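As a sketch of how this integration surfaces in Hive SQL (table, column, and data source names below are illustrative, not from the talk), a Druid data source can be registered as an external table, populated from a Hive query, and then queried with plain SQL:

```sql
-- Register an existing Druid data source as an external Hive table
-- (data source name "wikiticker" is illustrative)
CREATE EXTERNAL TABLE druid_wiki
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");

-- Index the results of a complex Hive query into Druid via CTAS;
-- the __time column becomes Druid's timestamp dimension
CREATE TABLE druid_pageviews
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "DAY")
AS
SELECT CAST(view_time AS timestamp) AS `__time`, page, user_id, views
FROM pageviews_raw;

-- Standard SQL over the Druid-backed table; Calcite rewrites this
-- into a native Druid JSON query where possible
SELECT page, SUM(views) FROM druid_pageviews GROUP BY page;
```

The storage handler is what lets Calcite recognize that the scan targets Druid, so that filters, projections, and aggregations can be pushed down into the generated Druid JSON query rather than executed in Hive.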
Hadoop, Hive, Spark and Object Stores – Steve Loughran
Cloud deployments of Apache Hadoop are becoming more commonplace. Yet Hadoop and its applications do not integrate that well with cloud storage, a mismatch that starts right down at the file IO operations.
This talk looks at how to make use of cloud object stores in Hadoop applications, including Hive and Spark. It will go from the foundational “what’s an object store?” to the practical “what should I avoid?” and the timely “what’s new in Hadoop?”, the latter covering the improved S3 support in Hadoop 2.8+.
I’ll explore the details of benchmarking and improving object store IO in Hive and Spark, showing what developers can do in order to gain performance improvements in their own code, and equally, what they must avoid.
Finally, I’ll look at ongoing work, especially “S3Guard” and what its fast and consistent file metadata operations promise.
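To give a flavor of the configuration involved (the values below are placeholders, and exact property availability depends on the Hadoop version), the S3A connector and S3Guard are wired up through Hadoop's core-site.xml:

```xml
<!-- core-site.xml: S3A connector settings (values are placeholders) -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
<!-- Stream output to S3 as it is written, rather than in close() -->
<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
</property>
<!-- S3Guard: consistent listing metadata backed by DynamoDB -->
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
```

With these set, Hive and Spark jobs can read and write `s3a://bucket/path` URIs in place of HDFS paths, which is exactly where the IO behaviors the talk discusses come into play.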
An Overview on Optimization in Apache Hive: Past, Present, Future – Jesús Camacho Rodríguez
Apache Hive has been continuously evolving to support a broad range of use cases, bringing it beyond its batch processing roots to its current support for interactive queries with LLAP. However, the development of its execution internals is not sufficient to guarantee efficient performance, since poorly optimized queries can create a bottleneck in the system. Hence, each release of Hive has included new features for its optimizer aimed to generate better plans and deliver improvements to query execution. In this talk, we present the development of the optimizer since its initial release. We describe its current state and how Hive leverages the latest Apache Calcite features to generate the most efficient execution plans. We show numbers demonstrating the improvements brought to Hive performance, and we discuss future directions for the next-generation Hive optimizer.
Apache Hive is the most commonly used SQL interface for Hadoop. To meet users’ data warehousing needs it must scale to petabytes of data, provide the necessary SQL, and perform in interactive time. The Hive community has produced a 2.0 release of Hive that includes significant improvements. These include:
* LLAP, a daemon layer that enables sub-second response time.
* Using HBase to store Hive’s metadata, resulting in significantly reduced planning time.
* Using Apache Calcite to build a cost-based optimizer.
* Adding procedural SQL.
* Improvements in using Spark as an engine for Hive execution.
This talk will cover the use cases these changes enable and the architectural changes being made in Hive to build these features, and will share performance test results showing how these improvements speed up Hive.
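To make the LLAP point concrete (the session settings below are a sketch, and the `customers` table is illustrative), a session opts into running query fragments in the long-lived LLAP daemons:

```sql
-- Route query fragments to the long-lived LLAP daemons
SET hive.execution.engine=tez;
SET hive.execution.mode=llap;
SET hive.llap.execution.mode=all;

-- With hot data held in the daemons' in-memory columnar cache,
-- short scans and aggregations can return in interactive time
SELECT state, COUNT(*) FROM customers GROUP BY state;
```

Because the daemons persist across queries, both JVM startup cost and cold reads of frequently accessed data are avoided, which is what makes sub-second response time feasible.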