One of the key components in any data processing system is IO connectors. These fundamental blocks allow to read and write data, which is stored in different type of sources, in a unified and distributed way. In this sense, Apache Beam is not an exception – it provides a rich API to develop a new connector with your favorite SDK and easily integrate it with Beam runners. In this talk we are going to show you how to write your own IO connector (in Java). We will see what are the differences between bounded and unbounded sources, how to implement efficient sources and sinks and where we need to pay more attention during the development, what kind of API your connector should provide to users and how it can be tested. To achieve this we will rely on examples from existing Beam connectors. We will also give a brief overview of a rather recent feature/pattern in Beam IO – Composable IO connectors and the Splittable DoFn API. We will discuss the advantages of modular IO API design and some new IO design patterns allowed by this style. This talk will be interesting for people, who consider to write their own IO connectors or want to contribute to existing ones, as well as for Beam users, who wish to understand how existing Beam connectors work under the hood.
Developing new IO connectors in Apache Beam Alexey Romanenko Ismaël Mejía
September 12, 2019