FeatherCast

The voice of The Apache Software Foundation

Building BigData Query Optimization with Apache Calcite – Best Practices from Alibaba MaxCompute – Haisheng Yuan

September 12, 2019
timothyarthur

 

MaxCompute is Alibaba's large-scale, distributed big data platform, which provides exabyte-scale storage capacity and massive computing power across tens of thousands of commodity machines. The system supports a SQL-like declarative language for advanced query and analysis on web-scale data sets. Millions of jobs covering hundreds of petabytes of data are processed every day, powering mission-critical businesses within Alibaba, including e-commerce, mobile payment, and logistics.

The query optimizer plays a key part in determining the optimal execution plan. We first give a general introduction to the overall architecture of MaxCompute, then explain how MaxCompute leverages Apache Calcite to build an efficient and robust query optimizer. We will also discuss the physical operators that MaxCompute creates in order to adapt to Apache Calcite, and the improvements that have been made to Calcite, such as IN list optimization, outer join null skew optimization, and logical relational node preprocessing. We further introduce our data partitioning techniques, including hash and range partitioning, to support advanced query parallelism. Finally, we will talk about HBO (Historical Based Optimization) for regular ETL tasks.
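
For readers unfamiliar with the two partitioning schemes named above, the following is a minimal sketch of the general idea only. It is not MaxCompute's or Calcite's implementation; the function names `hash_partition` and `range_partition`, the CRC32 hash, and the boundary values are all assumptions chosen for illustration.

```python
from bisect import bisect_right
from typing import Sequence
from zlib import crc32


def hash_partition(key: str, num_partitions: int) -> int:
    """Hypothetical helper: pick a partition by hashing the key.

    CRC32 is used only because it is deterministic across runs;
    a real engine would use its own hash function.
    """
    return crc32(key.encode("utf-8")) % num_partitions


def range_partition(key: int, boundaries: Sequence[int]) -> int:
    """Hypothetical helper: pick a partition from sorted split points.

    With boundaries [100, 200], keys below 100 go to partition 0,
    keys in [100, 200) go to partition 1, and keys of 200 or more
    go to partition 2.
    """
    return bisect_right(boundaries, key)


if __name__ == "__main__":
    # Equal keys always land in the same hash partition, which is what
    # lets a join or aggregation on that key run in parallel.
    print(hash_partition("user_42", 8))
    # Range partitioning keeps neighbouring keys together, which helps
    # ordered operations such as sorts and range scans.
    print([range_partition(k, [100, 200]) for k in (5, 100, 199, 200)])
```

Hash partitioning spreads keys evenly when they are well distributed, while range partitioning preserves key order across partitions; which one suits a given operator depends on the query.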
