The voice of The Apache Software Foundation

Hadoop Submarine Ecosystem: Bringing Machine Learning and Big Data world (YARN and Kubernetes) together

Wangda Tan, Kequi Hu

September 13, 2019

Data scientists focus on developing ML models with frameworks such as TensorFlow, MXNet, Caffe, and XGBoost, and do not dive deep into the complexities of the compute and storage needed to run ML/DL jobs. Today, most ETL-processed data is stored in HDFS and the cloud, and leveraging this data to design strong ML models is a big challenge for a data scientist. In the Big Data ecosystem, most ETL and batch jobs run on Spark and Hive, which process and ingest data into the same data stores. Data scientists find it challenging to use the output of these big data workloads effectively when developing an ML model. Hadoop Submarine (https://hadoop.apache.org/submarine/) helps bring these two worlds together and provides seamless integration across them. The ecosystem around Hadoop Submarine helps users design and run ML workloads from a notebook. Integrations with notebooks such as Zeppelin and workflow schedulers such as Azkaban help users consume data from Spark or Hive and run ML jobs with ease on any compute cluster. In this deep-dive session, we will demo the simplicity of Submarine by running distributed deep learning and machine learning applications on YARN and Kubernetes as simply as running them locally. We will also showcase the community effort in developing Submarine’s ecosystem, which eases integration with Zeppelin and Azkaban. The Submarine project can easily be launched in the same cluster to run DL/ML jobs without any additional upgrades or the complexity of maintaining separate machines.
