The voice of The Apache Software Foundation

Managing Trillions of Rows with Aplomb (well, actually with Drill) Ted Dunning

September 12, 2019

‘Ingesting lots of data isn’t very hard any more. Ingesting it on a critical schedule, within strict time bounds while minimizing the risk of bogus data showing up is much harder. In practice, grownup data ingestion and access requires the following capabilities * Incoming data can be fully ingested into our working dataset but hidden from users until all quality checks are completedn * Individual batches of data can be released atomicallyn * Any indexing updates should also appear appear atomicallyn * Expiring data should disappear atomically either according to ingest batch or precise time bounds Apache Drill provides several capabilities that make it much easier to meet these goals. You can handle large volumes of data while allowing in-situ quality controls and while controlling the visibility of unverified data. I will describe a worked example that shows how Drill helps make this happen. (with aplomb)n’


Leave a Reply

Powered by WordPress.com.
%d bloggers like this: