Optimizing the scheduling and communication behaviors of big data pipelines for resource and data characteristics is crucial for achieving high performance. For example, optimizations for geographically distributed resources, cheap transient resources, disk-based large data shuffle, and skewed data have recently received a lot of attention. The new incubating Apache Nemo project aims to make it easy to express and enforce such optimizations by providing a policy interface that transforms an intermediate representation (IR) of data processing applications. Apache Nemo executes the optimized IR on distributed computers, enforcing the specified scheduling and communication behaviors while maintaining the correct application semantics. Apache Nemo is closely integrated with existing Apache big data projects: it currently supports optimizing Apache Beam and Apache Spark applications, and uses Apache REEF to run on Apache Hadoop YARN and Apache Mesos. In this talk I will describe how to develop and compose new optimization policies on Apache Nemo, how the optimizations are actually enforced in the distributed execution, and the roadmap for further improving Apache Nemo.
Optimizing Big Data Pipelines with Apache Nemo (Incubating)
John Youngseok Yang
September 12, 2019