Apache Arrow and Apache Beam deserve to be together. It is a stated goal of both projects to provide mechanisms for performing data processing tasks across many languages. To accomplish this, both projects define a common binary “model” and provide multiple language implementations of that model.
However, they are approaching the overarching problem of cross-language data processing from two very different angles: Arrow is primarily concerned with moving columnar data across language boundaries with minimal overhead and providing optimized computation primitives to operate on that data, while Beam seeks to enable users to write scalable data pipelines in their language of choice and execute them on any distributed data-processing system. There is significant overlap between these two problem spaces, and we should consider using these models together for the benefit of both projects.
In this talk, I will describe both of these projects at a high-level, and present a vision for how they can be used together. I will also discuss some of the concerns around using a batched columnar data format in the Beam model, along with potential solutions.
In this session we will illustrate how Sky moved from an on-premise, batch-focused Hadoop based analytical system, to a streaming-oriented analytics pipeline deployed using Apache Beam and Google Cloud Dataflow. We will go through the architecture of the system, including the design decision made along the way, as well as learnings from the implementation and deployment process. We will end the session with an end-to-end demo and code examples!