The voice of The Apache Software Foundation

Using Relational Cache to Boost Apache Spark SQL

Daoyuan Wang

September 12, 2019

Apache Spark has become a popular engine for data analysis in industry, providing a handy SQL interface and processing data from a variety of data sources. Many of our customers use Apache Spark as an interactive query platform to meet their business requirements, yet one of the most common complaints we hear is that Spark is not quite "interactive" once the data grows large. In traditional database systems, materialized views accelerate query processing through pre-computation and query plan rewriting. We apply a similar method to Spark: users can create flexible caches from a query or a cube definition, and user queries are rewritten at runtime to use the pre-computed results. The cached data can be persisted to any external data source that Spark supports, or kept in memory, and it is updated automatically when new data is ingested.

In this talk, we'd like to take a deep dive into our design and show real-world performance gains from our customers. We plan to contribute this feature to the community.
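To make the idea of pre-computation plus query rewriting concrete, here is a minimal sketch in plain Python. It is not Spark code and not the design presented in the talk: the `SALES` data, the `RelationalCache` class, and the `run_query` helper are all hypothetical names invented for illustration. The sketch assumes a single `SUM` aggregate, which can be rolled up safely from a finer grouping to a coarser one.

```python
from collections import defaultdict

# Hypothetical base table: a few sales rows (illustrative data only).
SALES = [
    {"region": "east", "day": 1, "amount": 10},
    {"region": "east", "day": 2, "amount": 5},
    {"region": "west", "day": 1, "amount": 7},
]


def aggregate(rows, group_by, measure):
    """Compute SUM(measure) grouped by the given columns over the base rows."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[col] for col in group_by)
        totals[key] += row[measure]
    return dict(totals)


class RelationalCache:
    """Toy stand-in for a pre-computed aggregate (a materialized view)."""

    def __init__(self, rows, group_by, measure):
        self.group_by = tuple(group_by)
        self.measure = measure
        # Pre-computation happens once, when the cache is created.
        self.data = aggregate(rows, self.group_by, measure)

    def can_answer(self, group_by, measure):
        # A SUM over fewer grouping columns can be derived from this cache.
        return measure == self.measure and set(group_by) <= set(self.group_by)

    def answer(self, group_by):
        # Roll the cached sums up to the coarser grouping the query asks for.
        positions = [self.group_by.index(col) for col in group_by]
        totals = defaultdict(int)
        for key, value in self.data.items():
            totals[tuple(key[i] for i in positions)] += value
        return dict(totals)


def run_query(rows, caches, group_by, measure):
    """Use a matching cache if one exists, otherwise scan the base rows."""
    for cache in caches:
        if cache.can_answer(group_by, measure):
            return cache.answer(group_by), "cache"
    return aggregate(rows, group_by, measure), "base"


cache = RelationalCache(SALES, ["region", "day"], "amount")
result, source = run_query(SALES, [cache], ["region"], "amount")
print(source, result)
```

In this toy version the "rewrite" is just a lookup that decides whether a cached aggregate covers the query. A real engine would have to make that decision on the query plan, and would also need to handle filters, joins, and aggregates that cannot be rolled up this way.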
