FeatherCast

The voice of The Apache Software Foundation

Building BigData Query Optimization with Apache Calcite – Best Practices from Alibaba MaxCompute (Haisheng Yuan)

September 12, 2019
timothyarthur

 

MaxCompute is Alibaba's large-scale distributed big data platform, which provides exabyte-scale storage capacity and massive computing power across tens of thousands of commodity machines. The system supports a SQL-like declarative language for advanced query and analysis on web-scale data sets. Millions of jobs covering hundreds of petabytes of data are processed every day, powering mission-critical businesses within Alibaba, including e-commerce, mobile payment, and logistics.

The query optimizer plays a key part in determining the optimal execution plan. We first give a general introduction to the overall architecture of MaxCompute, then describe how MaxCompute leverages Apache Calcite to build an efficient and robust query optimizer. We will also discuss the physical operators that MaxCompute creates in order to adapt to Apache Calcite, and the improvements that have been made to Calcite, such as IN list optimization, outer join null skew optimization, and logical relational node preprocessing. We further introduce our data partitioning techniques, including hash and range partitioning, which support advanced query parallelism. Finally, we will talk about HBO (Historical Based Optimization) for regular ETL tasks.
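For readers unfamiliar with the two partitioning schemes named in the abstract, here is a minimal generic sketch of how a row key might be mapped to a partition under each. It is not taken from MaxCompute or Calcite; the function names `hash_partition` and `range_partition`, the CRC32 hash, and the example boundaries are all assumptions chosen for illustration.

```python
import zlib
from bisect import bisect_right
from typing import Sequence


def hash_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition by hashing (hypothetical helper).

    CRC32 is used because it is deterministic across runs, unlike
    Python's built-in hash() for strings.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def range_partition(key: int, boundaries: Sequence[int]) -> int:
    """Map a key to a partition by sorted split points (hypothetical helper).

    `boundaries` must be sorted ascending. Partition i holds keys k with
    boundaries[i-1] <= k < boundaries[i], so there are
    len(boundaries) + 1 partitions in total.
    """
    return bisect_right(boundaries, key)


# Illustrative split points, not real data.
boundaries = [100, 200, 300]

# Equal keys always land in the same hash partition, which is what lets
# a join or aggregation on that key run independently per partition.
same = hash_partition("user_42", 8) == hash_partition("user_42", 8)

# Range partitioning keeps neighbouring keys together, which is useful
# for sorted output and range predicates.
parts = [range_partition(k, boundaries) for k in (50, 100, 250, 300, 999)]
```

Hash partitioning tends to spread distinct keys evenly but loses ordering, while range partitioning preserves ordering but depends on well-chosen boundaries to avoid skew.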
