Apache Spark has become a popular engine for data analysis in industry, providing a handy SQL interface and connectors to a wide range of data sources. Many of our customers use Apache Spark as an interactive query platform to meet their business requirements, and one of the most common complaints we hear is that Spark is no longer quite 'interactive' once data grows big. Traditional database systems use materialized views to accelerate query processing through pre-computation and query plan rewriting. We adopt a similar approach in Spark: users can create flexible caches from a query or cube definition, and user queries are rewritten at runtime to make use of the pre-computed results. The cached data can be persisted to any external data source that Spark supports, or kept in memory, and is updated automatically as new data is ingested.

In this talk, we'd like to take a deep dive into our design and show real-world performance gains from our customers' workloads. This feature will be contributed to the community.
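The exact relational cache DDL is not shown in this abstract, so as a rough illustration the sketch below uses only stock Spark SQL: it materializes an aggregate by hand and then issues, by hand, the rewritten query that the relational cache would produce automatically. The table names (sales, sales_by_region) and schema are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object RelationalCacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RelationalCacheSketch")
      .master("local[*]")
      .getOrCreate()

    // A hypothetical large fact table of sales events.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS sales (region STRING, product STRING, amount DOUBLE)
      USING parquet
    """)

    // Pre-compute an aggregate once and persist it; this plays the role
    // of the cache, here stored as a plain Parquet table. The relational
    // cache feature would manage this step (and its refresh) itself.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS sales_by_region
      USING parquet
      AS SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    """)

    // What the runtime plan rewrite achieves: this query over the big
    // fact table...
    val original = spark.sql(
      "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

    // ...is answered from the much smaller pre-computed table instead,
    // skipping the scan and aggregation of the fact table.
    val rewritten = spark.sql("SELECT region, total FROM sales_by_region")

    rewritten.show()
    spark.stop()
  }
}
```

With the relational cache in place, only the original query is submitted; matching it against cache definitions and substituting the pre-computed plan happens inside the optimizer, transparently to the user.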
Using Relational Cache to Boost Apache Spark SQL
Daoyuan Wang
September 12, 2019