FeatherCast

The voice of The Apache Software Foundation

Combining schema-on-read and schema-provisioning in Apache Drill - Aman Sinha

September 12, 2019
timothyarthur

The data generated from IoT devices, machine logs and similar sources is often semi-structured or unstructured. This poses a challenge for traditional schema-on-write systems that require a fixed schema up-front for querying. Modern analytic applications often have ad hoc usage patterns and require tremendous flexibility in how this data is consumed. Further, they demand that the data be made available for querying soon after it lands in their data platform, which may be a distributed file system, a NoSQL database or something similar.

In this talk I will first describe how Apache Drill's innovative schema-on-read capability, built into a distributed SQL query engine, meets the demands of such applications. Raw data in various formats can be queried directly from the distributed file system or from other data sources such as NoSQL databases without defining a schema up-front. The Drill query planner (in conjunction with Apache Calcite) supports the 'ANY' type for columns, which allows type validation to be deferred until run time. During execution, readers produce 'RecordBatches' in which all rows of one RecordBatch share the same schema, but the schema may change across batches. This provides the core foundation for the schema-on-read capability. Downstream SQL operators in the query pipeline perform run-time Java code generation, producing schema-specific code; this code is compiled and executed on the JVM, and the operator uses it to produce its output batches.

In cases where Drill is not able to infer the schema correctly, or when there are ambiguities, Drill has introduced schema provisioning to complement schema-on-read. I will describe the functionality to define column types and nullability, specify column formats and default values, and control which columns are projected and their projection order. The schema can be specified declaratively as part of the query or in a separate file. Our expectation is that these complementary strategies will meet the demands of modern analytic applications that require schema flexibility in addition to performance and scalability.
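A minimal sketch in Drill SQL of the two complementary approaches described above. The paths, table and column names (dfs.`/data/iot/sensor_logs`, dfs.tmp.`sensor_logs`, device_id, temperature, reported_at) are hypothetical, and the CREATE SCHEMA statement assumes the provided-schema feature introduced in Drill 1.16; exact options may vary by version, so check the Drill documentation for your release.

-- Schema-on-read: query raw JSON files in a directory directly; Drill
-- infers column types while reading, no schema is declared up-front.
SELECT t.device_id, t.temperature
FROM dfs.`/data/iot/sensor_logs` t
WHERE t.temperature > 40;

-- Schema provisioning: declare column types, nullability, formats and
-- default values for a file-based table so that ambiguous or missing
-- values resolve predictably (provided-schema syntax, Drill 1.16+).
CREATE OR REPLACE SCHEMA (
  device_id INT NOT NULL,
  temperature DOUBLE DEFAULT '0.0',
  reported_at TIMESTAMP FORMAT 'yyyy-MM-dd HH:mm:ss'
) FOR TABLE dfs.tmp.`sensor_logs`;

-- A schema can also be supplied inline for a single query via the table()
-- function, e.g. schema => 'inline=(device_id int not null, ...)'.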
