FeatherCast

The voice of The Apache Software Foundation

Portable Spark Runner: Running Beam Pipelines Written in Python and Go with Spark (Kyle Weaver, Ismaël Mejía)

September 12, 2019
timothyarthur

Apache Spark is one of the most popular open source analytics engines for large-scale data processing. Spark is a mature system and, thanks to its support for multiple resource managers such as Hadoop YARN, Mesos, and Kubernetes, it has become a common choice for both batch and streaming workloads in industry. Apache Beam has included a Spark runner since its inception so that users can execute Beam pipelines on Spark, but until recently that runner could only execute pipelines written in Java. In this talk we will introduce Beam's portability framework and explain how we adapted it to the existing Spark runner translation to make the Spark runner portable. We will show you how to execute Beam pipelines written in Python and Go on Spark and invite you to try the new portable Spark runner. We will also mention the use case of TensorFlow Extended (TFX), the end-to-end platform for data validation, data transformation, and ML model analysis. Finally, we will discuss ongoing work and some future plans for the portable runner.
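To make the idea of running a Python pipeline on Spark concrete, here is a minimal sketch of how a submission through Beam's portable runner might look. It is not taken from the talk. It assumes a Beam Spark job server is already listening on `localhost:8099` (the endpoint is an assumption), and the helper names `portable_spark_args` and `run_word_lengths` are hypothetical, introduced only for illustration. The three pipeline options shown (`--runner`, `--job_endpoint`, `--environment_type`) are standard Beam Python options.

```python
def portable_spark_args(job_endpoint="localhost:8099", environment_type="LOOPBACK"):
    """Build command-line style options for Beam's portable runner.

    job_endpoint is where the Spark job server is assumed to be listening.
    environment_type controls where the Python SDK harness runs; LOOPBACK
    runs it inside the submitting process, which is convenient for local tests.
    """
    return [
        "--runner=PortableRunner",
        f"--job_endpoint={job_endpoint}",
        f"--environment_type={environment_type}",
    ]


def run_word_lengths(words, job_endpoint="localhost:8099"):
    """Submit a tiny pipeline that pairs each word with its length."""
    # Imported lazily so the option helper above works without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(portable_spark_args(job_endpoint))
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(words)
            | "PairWithLength" >> beam.Map(lambda word: (word, len(word)))
            # With LOOPBACK the workers run locally, so output appears here.
            | "Print" >> beam.Map(print)
        )
```

With a job server running, calling `run_word_lengths(["beam", "spark", "python"])` would hand the pipeline to the job server, which translates it into a Spark job. For a real cluster, an environment type other than `LOOPBACK` would likely be needed so that Spark executors can start their own SDK workers.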
