At LinkedIn, we have a large and ever-expanding corpus of big data processing code written using batch processing frameworks like Pig, Hive, and Spark. These data processing pipelines produce derived data artifacts like metrics, dimensions, and features, and generally run at daily or hourly end-to-end latencies. For a subset of these artifacts, there often arises a need to produce them at a faster cadence, say every minute or even continuously. At that point, the common practice used to be that the developer would rewrite the same derivation logic using a stream processing framework (Apache Samza in the case of LinkedIn). This causes two problems:
1. The time to go from a pipeline that generates data at an hourly grain to a pipeline that generates the same data continuously is governed by the developer time needed to understand the batch code and rewrite it in the streaming framework; it typically takes weeks to a month to fully operationalize such a pipeline.
2. The rewritten pipeline may not be exactly the same as the batch pipeline because of subtle differences in the logic or UDFs across the two code bases. Over time, these pipelines can continue to diverge, especially if different teams own the batch and streaming pipelines.
In this talk, we present our solution to this batch-stream divide: a system that can auto-generate streaming code from batch logic and employ the Lambda architecture to transparently deliver merged results to data consumers. Technically, our Lambda architecture is very similar to the traditional Lambda architecture; however, our users only need to maintain a single code base, written as batch logic. This code base serves as the single source of truth for what users want to compute. If they need fresher results, they simply turn on a streaming flag.
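To illustrate the single-source-of-truth idea (not our system's actual generated code), here is a minimal Apache Beam sketch: the derivation logic lives in one transform, and only the IO and windowing change when a streaming flag is set. The class name, Kafka topic, and input path below are hypothetical.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class PageViewCounts {

  // The derivation logic, written once. Both the batch and the streaming
  // pipeline apply this same transform, so there is a single source of truth.
  static class CountByPage
      extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
    @Override
    public PCollection<KV<String, Long>> expand(PCollection<String> pages) {
      return pages.apply(Count.perElement());
    }
  }

  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    Pipeline p = Pipeline.create(options);

    PCollection<String> pages;
    if (options.isStreaming()) {
      // Streaming mode: unbounded Kafka input, windowed into one-minute panes.
      pages = p
          .apply(KafkaIO.<String, String>read()
              .withBootstrapServers("localhost:9092")
              .withTopic("page-views")
              .withKeyDeserializer(StringDeserializer.class)
              .withValueDeserializer(StringDeserializer.class)
              .withoutMetadata())
          .apply(Values.<String>create())
          .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));
    } else {
      // Batch mode: bounded input read from files.
      pages = p.apply(TextIO.read().from("/data/page-views/*"));
    }

    pages.apply(new CountByPage());
    p.run().waitUntilFinish();
  }
}
```

In our system, users do not hand-write this wiring; the streaming counterpart is generated from the batch script, but the effect is the same: one definition of the logic, two execution modes.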
We will deep-dive into the technical details of this system, which is built using open source components: Apache Beam and Apache Calcite. We’ll also share our experience using this system to auto-migrate batch scripts in our unified metrics platform into near-realtime pipelines running on Apache Samza, as well as our plans to share this work with the community.
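As a rough sketch of the general approach (the talk covers our actual implementation), plain Apache Calcite APIs can parse a query into an engine-agnostic relational plan, which can then be translated into a Beam pipeline and executed on a batch or streaming runner such as Samza. The toy query below validates against an empty schema; a real setup would register the tables the batch script reads.

```java
import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelRoot;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public class SqlToRelPlan {
  public static void main(String[] args) throws Exception {
    // Empty root schema; a real setup would register input tables here.
    SchemaPlus rootSchema = Frameworks.createRootSchema(true);
    FrameworkConfig config = Frameworks.newConfigBuilder()
        .defaultSchema(rootSchema)
        .build();
    Planner planner = Frameworks.getPlanner(config);

    // Parse, validate, and convert to a relational plan.
    SqlNode parsed = planner.parse("SELECT 1 + 2 AS answer");
    SqlNode validated = planner.validate(parsed);
    RelRoot relRoot = planner.rel(validated);

    // The relational plan is engine-agnostic; it can be lowered to a Beam
    // pipeline and run in either batch or streaming mode.
    System.out.println(RelOptUtil.toString(relRoot.rel));
  }
}
```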