Efficient Modeling for DNN Hardware Resiliency Assessment

Author: Mahmoud, Karim
Language: English
Year of publication: 2025
Subject:
Document type: Thesis
Description: Deep neural network (DNN) hardware accelerators are critical enablers of the current resurgence in machine learning technologies. Adopting machine learning in safety-critical systems imposes additional reliability requirements on hardware design. Addressing these requirements mandates an accurate assessment of the impact caused by permanent faults in the processing engines (PEs). Carrying out this reliability assessment early in the design process allows potential reliability concerns to be addressed when design revisions are less costly. However, the large size of modern DNN hardware and the complexity of the DNN applications running on it present barriers to efficient reliability evaluation before proceeding with the design implementation. Considering these barriers, this dissertation proposes two methodologies to assess fault resiliency in integer arithmetic units in DNN hardware. Using the information from the data streaming patterns of the DNN accelerators, which are known before the register-transfer level (RTL) implementation, the first methodology enables fault injection experiments to be carried out in PE units at the pre-RTL stage during architectural design space exploration. This is achieved in a DNN simulation framework that captures the mapping between a model's operations and the hardware's arithmetic units. This facilitates a fault resiliency comparison of state-of-the-art DNN accelerators comprising thousands of PE units. The second methodology introduces accurate and efficient modelling of the impact of permanent faults in integer multipliers. It avoids the need for computationally intensive circuit models, e.g., netlists, to inject faults in integer arithmetic units, thus scaling the fault resiliency assessment to accelerators with thousands of PE units with negligible simulation time overhead.
As a first step, we formally analyze the impact of permanent faults affecting the internal nodes of two integer multiplier architectures. This analysis indicates that, for most internal faults, the impact on the output is independent of the operands involved in the arithmetic operation. As the second step, we develop a statistical fault injection approach based on the likelihood of a fault being triggered in the applications that run on the target DNN hardware. By modelling the impact of faults in internal nodes of arithmetic units using fault-free operations, fault injection campaigns run three orders of magnitude faster than those using arithmetic circuit models in the same simulation environment. The experiments also show that the proposed method's accuracy is on par with that of using netlists to model the arithmetic circuitry in which faults are injected. Using the proposed methods, one can conduct fault assessment experiments for various DNN models and hardware architectures, examining how sensitive the DNN accelerator's reliability is to DNN model-related and hardware architecture-related features. In addition to understanding the impact of permanent hardware faults on the accuracy of DNN models running on defective hardware, the outcomes of these experiments can yield valuable insights for designers seeking to balance fault criticality and performance, thereby facilitating the development of more reliable DNN hardware in the future.
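The idea of replacing a circuit-level fault model with a fault-free operation plus a correction can be illustrated with a toy case. The sketch below is not the dissertation's model: it assumes an unsigned array multiplier whose product is the sum of partial-product bits `a_i & b_j` weighted by `2**(i+j)`, and a single stuck-at fault on one of those AND-gate outputs. All function names are hypothetical. Under those assumptions, the faulty product equals the fault-free product plus a single shifted correction term, and the fault is triggered only when the healthy bit differs from the stuck value.

```python
def bitlevel_multiply(a, b, width, fault=None):
    """Circuit-style reference: sum every partial-product bit individually.

    fault is (i, j, stuck_value) and forces the AND-gate output a_i & b_j
    to stuck_value. This fault site is an illustrative assumption.
    """
    total = 0
    for i in range(width):
        for j in range(width):
            bit = ((a >> i) & 1) & ((b >> j) & 1)
            if fault is not None and (i, j) == (fault[0], fault[1]):
                bit = fault[2]
            total += bit << (i + j)
    return total


def functional_multiply(a, b, fault):
    """Same fault, modelled with one fault-free multiply plus a correction."""
    i, j, stuck = fault
    healthy_bit = ((a >> i) & 1) & ((b >> j) & 1)
    # The correction is +2**(i+j), -2**(i+j), or 0 when the fault is masked.
    return a * b + (stuck - healthy_bit) * (1 << (i + j))


def trigger_rate(operand_pairs, fault):
    """Fraction of operand pairs for which the fault changes the product."""
    i, j, stuck = fault
    pairs = list(operand_pairs)
    hits = sum(
        1 for a, b in pairs
        if (((a >> i) & 1) & ((b >> j) & 1)) != stuck
    )
    return hits / len(pairs)
```

In a statistical campaign, `trigger_rate` would be evaluated over operand pairs sampled from the target workload rather than over all possible operands; that sampling step is left out here because the abstract does not describe it.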
Thesis
Doctor of Philosophy (PhD)
The reliability of Deep Neural Network (DNN) hardware has become critical in recent years, especially for the adoption of machine learning in safety-critical applications. Evaluating the reliability of DNN hardware early in the design process enables addressing potential reliability concerns before committing to full implementation. However, the large size and complexity of DNN hardware impose challenges in evaluating its reliability in an efficient manner. In this dissertation, two novel methodologies are proposed to address these challenges. The first methodology introduces an efficient method to describe the mapping of operations of DNN applications to the processing engines of a target DNN hardware architecture in a high-performance computing DNN simulation environment. This approach allows for assessing the fault resiliency of large hardware architectures, incorporating thousands of processing engines while using fewer simulation resources compared to existing methods. The second methodology introduces an accurate and efficient approach to modelling the impact of permanent faults in integer arithmetic units of DNN hardware during inference. By leveraging the special characteristics of integer arithmetic units, this method achieves fault assessment at negligible computational overhead relative to running DNN inference in the fault-free mode in state-of-the-art DNN frameworks.
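As a rough illustration of the first methodology, the sketch below maps the multiply-accumulate operations of a small integer matrix product onto a grid of processing engines and injects a permanent fault into one of them. The dataflow is an assumption, not taken from the dissertation: an output-stationary array in which output element `(r, c)` is computed by PE `(r % pe_rows, c % pe_cols)`. The fault is modelled as a stuck-at-1 bit on the multiplier output, and operands are assumed non-negative. All names are hypothetical.

```python
def pe_for_output(r, c, pe_rows, pe_cols):
    """Assumed output-stationary mapping from an output element to a PE."""
    return (r % pe_rows, c % pe_cols)


def matmul_with_faulty_pe(A, B, pe_rows, pe_cols, faulty_pe=None, stuck_bit=0):
    """Integer matrix product where one PE has a stuck-at-1 product bit.

    A and B are lists of lists of non-negative ints. With faulty_pe=None
    the result is the fault-free product.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for r in range(n):
        for c in range(m):
            on_faulty_pe = pe_for_output(r, c, pe_rows, pe_cols) == faulty_pe
            acc = 0
            for t in range(k):
                product = A[r][t] * B[t][c]
                if on_faulty_pe:
                    product |= 1 << stuck_bit  # permanent stuck-at-1 fault
                acc += product
            C[r][c] = acc
    return C


def affected_outputs(n, m, pe_rows, pe_cols, faulty_pe):
    """Output coordinates whose computation passes through the faulty PE."""
    return [
        (r, c)
        for r in range(n)
        for c in range(m)
        if pe_for_output(r, c, pe_rows, pe_cols) == faulty_pe
    ]
```

Because the mapping is known from the dataflow alone, the set of outputs exposed to a given faulty PE can be listed without any RTL or netlist, which is what makes this kind of experiment possible during architectural exploration.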
Database: Networked Digital Library of Theses & Dissertations