FeatherCast

The voice of The Apache Software Foundation

Managing Trillions of Rows with Aplomb (well, actually with Drill) - Ted Dunning

September 12, 2019
timothyarthur

‘Ingesting lots of data isn’t very hard any more. Ingesting it on a critical schedule, within strict time bounds, while minimizing the risk of bogus data showing up is much harder. In practice, grown-up data ingestion and access requires the following capabilities:

* Incoming data can be fully ingested into our working dataset but hidden from users until all quality checks are completed
* Individual batches of data can be released atomically
* Any indexing updates should also appear atomically
* Expiring data should disappear atomically, either according to ingest batch or precise time bounds

Apache Drill provides several capabilities that make it much easier to meet these goals. You can handle large volumes of data while allowing in-situ quality controls and while controlling the visibility of unverified data. I will describe a worked example that shows how Drill helps make this happen (with aplomb).’
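The abstract does not say how the talk implements these requirements, so the following is only a minimal Python sketch of one common way to get hide-then-release behaviour on a file system: stage each batch in a directory that readers never scan, run the quality checks there, and publish with a single directory rename. The directory names (`_staging`, `live`, `_expired`), the function names and the `check` callback are all hypothetical, and nothing here uses or represents Drill's own API.

```python
import os
import shutil
from pathlib import Path


def ingest_batch(root, batch_id, rows, check):
    """Stage a batch out of sight, validate it, then publish it with one rename.

    Assumed layout (hypothetical): readers only query <root>/live, so anything
    under <root>/_staging is invisible to them.
    """
    root = Path(root)
    staging = root / "_staging" / batch_id
    live = root / "live" / batch_id

    # 1. Fully ingest the data, but only into the hidden staging area.
    staging.mkdir(parents=True)
    (staging / "data.csv").write_text("\n".join(rows) + "\n")

    # 2. Run quality checks in place; a failing batch simply stays hidden.
    if not all(check(row) for row in rows):
        return False

    # 3. Release the whole batch at once. A rename within one file system
    #    is a single step, so readers see all of the batch or none of it.
    live.parent.mkdir(parents=True, exist_ok=True)
    os.rename(staging, live)
    return True


def expire_batch(root, batch_id):
    """Make a published batch disappear in one step, then clean it up."""
    root = Path(root)
    trash = root / "_expired" / batch_id
    trash.parent.mkdir(parents=True, exist_ok=True)
    os.rename(root / "live" / batch_id, trash)  # the batch vanishes for readers here
    shutil.rmtree(trash)                        # slow deletion happens out of sight
```

The design choice in this sketch is that visibility changes only through renames, so slow work (writing, checking, deleting) always happens outside the tree that queries read. Whether a rename is truly atomic depends on the underlying file system, which is an assumption to verify for any real deployment.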

 
