Apache Hadoop YARN is the modern distributed operating system for big data applications. It morphed the Hadoop compute layer into a common resource-management platform that can host a wide variety of applications. Many organizations build their applications on top of Hadoop with YARN, without having to repeatedly deal with resource management, isolation, multi-tenancy, and similar issues themselves. The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.

In this talk, we’ll start with the current status of Apache Hadoop 3.x: how it is used today in deployments large and small. We’ll then move on to the exciting present and future of Hadoop 3.x: features that are further strengthening Hadoop as the primary resource-management platform as well as the storage system for enterprise data centers. We’ll discuss the current status as well as the future promise of features and initiatives for both YARN and HDFS in Hadoop 3.x. For YARN 3.x, these include powerful container placement; global scheduling; support for machine learning (Spark) and deep learning (TensorFlow) workloads through GPU and FPGA scheduling and isolation support; extreme scale with YARN federation; containerized apps on YARN; native support for long-running services (alongside applications) without any changes; seamless application and service upgrades; powerful scheduling features such as application priorities and intra-queue preemption across applications; and operational enhancements, including insights through Timeline Service V2, a new web UI, and better queue management.

Also, HDFS 3.0 made erasure coding generally available (GA), which doubles the storage efficiency of data and thus reduces the cost of storage for enterprise use cases. HDFS added support for multiple standby NameNodes for better availability. For better reliability of metadata and easier operations, JournalNodes have been enhanced to sync the edit log segments to protect against rolling failures.

Disk balancing within a DataNode was another important feature, added to ensure that disks are evenly utilized in a DataNode. This also ensures better aggregate throughput and prevents lopsided utilization when disks are added or replaced in a DataNode. The HDFS team is currently driving the Ozone initiative, which lays the foundation for the next generation of storage architecture for HDFS, where data blocks are organized in Storage Containers for higher scale and better handling of small objects in HDFS. The Ozone project also includes an object store implementation to support new use cases. Finally, since more and more users are planning to upgrade from 2.x to 3.x to get all the benefits mentioned above, we will also briefly talk about upgrade guidance from Hadoop 2.x to 3.x.
Apache Hadoop 3.x State of The Union and Upgrade Guidance
Anu Engineer, Suma Shivaprasad
September 12, 2019