FeatherCast

The voice of The Apache Software Foundation

Apache Hudi (Incubating): the past, present and future of efficient data lake architectures - Vinoth Chandar, Balaji Varadarajan

September 12, 2019
timothyarthur

Apache Hudi is a newly incubating project at the ASF. Originally created at Uber to power its vast big data lake, Hudi provides key features such as atomic writes, snapshot isolation, incremental views, rollbacks, point-in-time restores, file size management and many more. To date, the big data community has been polarized between batch and streaming systems when balancing data freshness against scale. Hudi addresses a combined need for speed and scale that does not fit naturally into existing batch or streaming data processing architectures, by supporting continuous ingestion and asynchronous compaction of row and columnar data. In this talk, we will briefly discuss the history of the project: the motivating use cases and the architectural underpinnings that spurred the need for such a system. We will examine a blueprint for a reliable, state-of-the-art data lake architecture and explain how Hudi’s current capabilities play a central role in it. We will also share hands-on recipes for leveraging Hudi in your organization for popular tasks such as data ingestion and ETL. We will dedicate the remainder of the talk to the roadmap ahead, touching on areas such as external indexing and storage layout organization, as well as advanced topics such as building real-time ML feature stores and composing safe multi-stream ETL joins.
