FeatherCast

The voice of The Apache Software Foundation

Optimizing Big Data Pipelines with Apache Nemo (Incubating) John Youngseok Yang

September 12, 2019
timothyarthur

Optimizing scheduling and communication behaviors of big data pipelines for resource and data characteristics is crucial for achieving high performance. For example, optimizing for geographically-distributed resources, cheap transient resources, disk-based large data shuffle, and skewed data have recently received a lot of interest and attention. The new incubating Apache Nemo project aims to make it easy to express and enforce such optimizations by providing a policy interface that transforms an intermediate representation (IR) of data processing applications. Apache Nemo executes the optimized IR on distributed computers while enforcing the specified scheduling and communication behaviors, and at the same time maintaining the correct application semantics. Apache Nemo is closely integrated with existing Apache Big Data projects. Apache Nemo currently supports optimizing Apache Beam and Apache Spark applications, and uses Apache REEF to run on Apache Hadoop YARN and Apache Mesos. In this talk I will describe how to develop and compose new optimization policies on Apache Nemo, how the optimizations are actually enforced in the distributed execution, and the roadmap to further improve Apache Nemo.

Leave a Reply

Required fields are marked *.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at WordPress.com.
%d bloggers like this: