FeatherCast

The voice of The Apache Software Foundation

Lessons Learned from Leveraging Real-Time Power Consumption Data with Apache Kudu - Masahiro Ito

September 12, 2019
timothyarthur

IoT is spreading across many industries, and sensor devices generate large amounts of data in real time. This data is used for visualization, machine learning, and other kinds of analysis. Apache Kudu is a data store for such use cases, which require both fast inserts and efficient scans. Apache HBase is also often used to handle time series data such as sensor data, and we evaluated it for smart meter data a few years ago. HBase is well suited to storing sensor data because of its excellent insert performance, but it is less suited to big data analysis because its scan performance is comparatively weak. Kudu, on the other hand, is suited to both fast inserts and large scans.

We evaluated Kudu's performance on power consumption data collected every second from sensor devices. In this performance test, we confirmed that a 4-node Kudu cluster can store and query, in real time, the data generated by 1 million sensor devices. To handle this volume of data in real time, it is important to increase the memory allocation, reduce the tablet size, and define the order of the primary key columns so as to reduce the compaction load. This presentation introduces the differences between Kudu and other data stores (HBase, HDFS), the use cases suitable for Kudu, and the tuning know-how obtained through the performance test.
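To make the scale and the primary key point concrete, here is a minimal Python sketch. It is an illustration only, not the speaker's benchmark or Kudu's actual compaction logic. The even split across nodes, the small simulated sensor count, and the helper names (`arrival_stream`, `out_of_order_fraction`) are assumptions introduced here. The abstract does not say which key order was chosen, so the sketch only shows how two candidate orders differ in how often a newly arriving key sorts before keys that were already written.

```python
# Ingest arithmetic from the abstract: 1 million sensors, one reading per second.
SENSORS = 1_000_000
NODES = 4

rows_per_second = SENSORS * 1                    # 1,000,000 rows/s cluster-wide
rows_per_node = rows_per_second // NODES         # assumes an even split, ignores replication
rows_per_day = rows_per_second * 24 * 60 * 60    # 86,400,000,000 rows/day


def arrival_stream(num_sensors, num_seconds):
    """Yield (timestamp, sensor_id) readings in arrival order, second by second."""
    for ts in range(num_seconds):
        for sensor in range(num_sensors):
            yield ts, sensor


def out_of_order_fraction(keys):
    """Fraction of keys that sort before the largest key seen so far."""
    max_seen = None
    late = 0
    total = 0
    for key in keys:
        total += 1
        if max_seen is not None and key < max_seen:
            late += 1          # key lands inside the already-written key range
        else:
            max_seen = key     # key extends the end of the key range
    return late / total


# Small simulated workload (hypothetical sizes, chosen to keep the example fast).
readings = list(arrival_stream(num_sensors=100, num_seconds=10))

time_first = [(ts, sensor) for ts, sensor in readings]      # key = (timestamp, sensor_id)
sensor_first = [(sensor, ts) for ts, sensor in readings]    # key = (sensor_id, timestamp)

frac_time_first = out_of_order_fraction(time_first)         # 0.0
frac_sensor_first = out_of_order_fraction(sensor_first)     # 0.891

print(rows_per_node, rows_per_day)
print(frac_time_first, frac_sensor_first)
```

In this toy model, a timestamp-leading key always appends to the end of the key range, while a sensor-leading key scatters most inserts across ranges that were already written. How that translates into compaction load on a real cluster depends on partitioning and other settings that the abstract does not describe.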
