
Schema-Managed Big Data Column Access Control and Use Cases

Mohammad Islam, Xinli Shang, Pavi Subenderan

September 12, 2019

Motivation: Access control via encryption improves security coverage compared with traditional enforcement in the access path, because encryption can block unauthorized access regardless of the path used to reach the data. Finer-grained access control at the column level is needed because in a typical big dataset only a few columns are sensitive and need protection, and different columns can have different sensitivities and different sets of eligible readers.

Design: With the encryption features in columnar file formats such as Apache Parquet, column access control via encryption becomes possible. Adopting these features in existing analytic frameworks that use Apache Hive, Apache Spark, Presto, etc. is a challenge, however, because significant changes to those query engines are needed to control the encryption.

To avoid massive changes in existing frameworks, the Apache Parquet community designed a schema-controlled column encryption mechanism. A system architect can use the schema of a data table to define the sensitivity of a column; that setting is propagated through the stack and eventually triggers encryption of that column in the Apache Parquet writing layer. The solution applies to many analytic frameworks through transparent plug-in invocation, which avoids massive changes in the frameworks and makes Apache Parquet encryption easy to adopt. The mechanism can also be extended to support Apache ORC.

Use Cases:

1. HDFS ingestion pipelines with Spark and Hudi encrypt sensitive columns automatically with schema-controlled crypto settings.

In this use case, we will talk about how to use the schema to control column encryption in Apache Parquet, and how the pipeline automatically encrypts columns once the schema sets their sensitivity.

2. Column access control scalability and performance analysis for analytic frameworks with Apache Hive and Presto on Apache Parquet.

Like Apache Spark, Apache Hive and Presto are popular query engines. We will show our analysis of the scalability and performance of column access control in an analytic framework.

Summary: In this talk, we will present the motivation and design of schema-controlled column encryption at the file format level, and how it can ease the adoption of encryption features in popular query engines such as Apache Hive, Apache Spark and Presto. Use cases will be presented to show how to use schema-controlled column encryption in analytic pipelines and frameworks.
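To make the schema-controlled idea from the Design section concrete, here is a minimal sketch in plain Python. It is not the Apache Parquet implementation: the `Column` class, the `encrypt_key_id` metadata field, the key names, and both helper functions are hypothetical, invented only for illustration. It shows the one step the abstract describes, namely that a per-column sensitivity tag set in the schema is enough to derive which columns the writing layer should encrypt and with which key. The `key:col,col;key:col` string form is likewise an assumption about how such settings might be passed down the stack.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Column:
    """A table column. `encrypt_key_id` is hypothetical schema metadata."""
    name: str
    dtype: str
    encrypt_key_id: Optional[str] = None  # None means the column is not sensitive


def column_keys_from_schema(schema: List[Column]) -> Dict[str, List[str]]:
    """Group sensitive columns by the key that should encrypt them."""
    keys: Dict[str, List[str]] = {}
    for col in schema:
        if col.encrypt_key_id is not None:
            keys.setdefault(col.encrypt_key_id, []).append(col.name)
    return keys


def to_property_string(column_keys: Dict[str, List[str]]) -> str:
    """Serialize the grouping as 'key:col,col;key:col' (assumed format)."""
    return ";".join(f"{key}:{','.join(cols)}" for key, cols in column_keys.items())


# Hypothetical table: only three of five columns are tagged as sensitive.
schema = [
    Column("trip_id", "string"),
    Column("rider_email", "string", encrypt_key_id="pii_key"),
    Column("fare", "double"),
    Column("card_number", "string", encrypt_key_id="payment_key"),
    Column("rider_phone", "string", encrypt_key_id="pii_key"),
]

column_keys = column_keys_from_schema(schema)
crypto_setting = to_property_string(column_keys)
```

In this sketch the query engine never needs to know about encryption: it only carries the schema, and a plug-in at the writing layer would call something like `column_keys_from_schema` to decide what to encrypt. Columns without a tag are left out of the result and would be written in plaintext.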
