Description: |
Computing frameworks running at extreme scale in cloud environments provide efficient, high-performance computing services to a wide range of domains. These cloud computing frameworks build scalable, reliable, and highly available data pipelines for many academic, scientific, and industrial services. While processing large volumes of data from diverse sources, data analytics generates a large amount of intermediate data inside these frameworks. This enormous data volume makes it challenging for the frameworks to handle data with high performance and efficiency. Data orchestration based on memory and high-performance storage devices has therefore become a key concern in optimizing the performance of these cloud computing frameworks. The increasing data scale and the complexity of the cloud environment make it difficult to run applications quickly and efficiently. Existing computing clusters can fetch data from different cloud infrastructure, including common storage, high-performance storage devices, and high-speed fabric interconnects. However, it remains challenging to provide the corresponding data orchestration for existing computing frameworks. First, computing frameworks access the underlying persistent storage layer built on diverse storage devices and memory, and the rapid evolution of storage devices poses new challenges for these frameworks to utilize advanced devices efficiently. Second, most existing computing frameworks rely on an intermediate data layer for temporary storage; however, providing an efficient, high-performance storage layer for large-scale computing frameworks, such as intermediate data storage and shuffle data storage, is still challenging. Data imbalance and small-data storage introduce further challenges, requiring new hardware devices and appropriate data orchestration designs. Consequently, the evolution of hardware devices calls for a new data orchestration paradigm for cloud computing frameworks. This thesis addresses the above challenges, proposes novel mechanisms and solutions for building efficient, high-performance data orchestration for big data frameworks, and makes the following contributions: (1) It studies representative workloads for big data processing frameworks under different storage technologies and design choices, and explores the I/O bottlenecks of in-memory big data frameworks on high-performance computing clusters equipped with non-volatile memory. (2) It designs and explores architectural foundations for running in-memory big data frameworks in a hybrid cloud environment with fast fabric interconnects between geo-distributed data centers. (3) It proposes a disaggregated memory pool abstraction based on persistent memory and Remote Direct Memory Access (RDMA) to improve computing resource efficiency and the performance of intermediate storage in big data frameworks. (4) It provides a novel in-transit shuffle mechanism that is lightweight and compatible with modern in-memory big data frameworks. The proposed mechanisms and solutions have been implemented, deployed, and evaluated on high-performance clusters and in real computing environments, including academic clusters at Rutgers and large-scale production systems in the information technology industry. |