FeatherCast

The voice of The Apache Software Foundation

Apache Big Data Seville 2016 – Parquet Format in Practice & Detail – Uwe L. Korn

January 23, 2017
rbowen


Apache Parquet is among the most commonly used column-oriented data formats in the big data processing space. It uses several techniques to store data in a CPU- and I/O-efficient way. It can also push analytical queries down to the I/O layer, so that irrelevant data chunks are never loaded. With several Java implementations and a C++ implementation, Parquet is also well suited to exchanging data between different technology stacks.
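To make the push-down idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Parquet's on-disk layout or any library's API: the `RowGroup` class, the `scan` function, and the sample data are all hypothetical names invented for illustration. It relies only on the fact that Parquet keeps min/max statistics for each column chunk, which lets a reader skip chunks that cannot match a filter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """Toy stand-in for a row group: values stored per column, plus min/max stats."""
    columns: Dict[str, List[int]]
    stats: Dict[str, Tuple[int, int]] = field(init=False)

    def __post_init__(self) -> None:
        # Statistics are computed once at "write" time, as a file footer would hold them.
        self.stats = {name: (min(v), max(v)) for name, v in self.columns.items()}


def scan(row_groups: List[RowGroup], column: str, lower: int, wanted: str):
    """Return values of `wanted` for rows where `column` >= lower.

    A row group is skipped entirely when its max for `column` is below `lower`,
    so its values are never touched. Only the two named columns are read.
    """
    result = []
    groups_read = 0
    for group in row_groups:
        _, col_max = group.stats[column]
        if col_max < lower:
            continue  # the statistics prove that no row in this group can match
        groups_read += 1
        for key, value in zip(group.columns[column], group.columns[wanted]):
            if key >= lower:
                result.append(value)
    return result, groups_read


# Hypothetical sample data: three row groups of three rows each.
groups = [
    RowGroup({"year": [2010, 2011, 2012], "sales": [5, 7, 9]}),
    RowGroup({"year": [2013, 2014, 2015], "sales": [11, 13, 15]}),
    RowGroup({"year": [2016, 2017, 2018], "sales": [17, 19, 21]}),
]

values, groups_read = scan(groups, "year", 2015, "sales")
print(values, groups_read)  # [15, 17, 19, 21] 2
```

The first row group is never read because its largest `year` is 2012. In a real file the saving is in I/O: the skipped chunk is not fetched from disk at all.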

This talk gives a general introduction to the format and its techniques. It explains their benefits and some of the inner workings to give a better understanding of how Parquet achieves its performance. It closes with benchmarks comparing the new C++ & Python implementation with other formats.

More information about this talk
