FeatherCast

The voice of The Apache Software Foundation

Creating a Stream Data Pipeline on Google Cloud Platform using Apache Beam Shuichi Suzuki

September 13, 2019
timothyarthur

We built a scalable and flexible stream data pipeline for our microservices on Google Cloud Platform (GCP), using Cloud Pub/Sub, Google Cloud Storage, BigQuery, and Cloud Dataflow, using Apache Beam. The stream data pipeline is working on the production system for Mercari, one of the biggest C2C e-commerce services in Japan. The pipeline currently accepts logs from 5+ microservices, and the number will increase soon.
Our microservice architecture is based on the following three concepts:
1. Split the log collection and data processing phases to keep the system simple.
2. Use stream processing in order to achieve low latency.
3. Don’t just accumulate raw data—support structured output that is easier to use.
Based on these concepts, we built a data pipeline in GCP.
For each microservice, we will provide a Cloud Pub/Sub “Ramp” to send logs to. Cloud Pub/Sub can have messages that contain an optional byte array in their payloads. The entire message that is ingested into the Ramp uses Cloud Dataflow streaming processing to collect them in the “RawDataHub,” a Cloud Pub/Sub topic for collection. When this happens, the PubsubMessage payload is not changed at all; the metadata necessary for subsequent processing (destination and schema information, data necessary for pipeline metrics, etc.) is provided in the PubsubMessage’s attribute map. This Dataflow job does not do processing for each service or topic—it treats all messages uniformly.
The raw data from RawDataHub is then output to two independent Cloud DataFlow streaming processes: “RawDataLake”(the infrastructure is in Google Cloud Storage, or “GCS”) and “StructuredDataHub”, another Cloud Pub/Sub topic. The StructuredDataHub has structured avro records.
The structured data from StructuredDataHub is then sent to more two independent Cloud DataFlow streaming processes: “StructuredDataLake” on GCS and “Data WareHouse” (Google BigQuery).

Leave a Reply

Required fields are marked *.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at WordPress.com.
%d bloggers like this: