Interactive Analytics at Scale in Apache Hive Using Druid – Jesús Camacho Rodríguez
Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability and efficiency have made it a popular choice to power user-facing analytic applications. However, it does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid indexing and querying capabilities using Apache Hive. In particular, our solution allows to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that brings benefits to both systems alike, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries.