A 7-nm Four-Core Mixed-Precision AI Chip With 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling

Autor:	Jinwook Oh, Alyssa Herbert, Marcel Schaal, Zhibin Ren, Ching Zhou, Siyu Koswatta, Naigang Wang, Matthew Cohen, Vidhi Zalani, Howard M. Haynie, Matthew M. Ziegler, Sae Kyu Lee, Brian W. Curran, Monodeep Kar, Martin Lutz, Xin Zhang, Robert Casatuta, Vijayalakshmi Srinivasan, Nianzheng Cao, Sunil Shukla, Pong-Fei Lu, Leland Chang, Michael A. Guillorn, Bruce M. Fleischer, Michael R. Scheuermann, Joel Abraham Silberman, Kerstin Schelm, Vinay Velji Shah, Chia-Yu Chen, Kailash Gopalakrishnan, Swagath Venkataramani, Hung Tran, Mingu Kang, Wei Wang, Jungwook Choi, Scot H. Rider, Jinwook Jung, James J. Bonanno, Radhika Jain, Li Yulong, Xiao Sun, Silvia Melitta Mueller, Kyu-hyoun Kim, Ankur Agrawal
Rok vydání:	2022
Předmět:	Power management Computer science business.industry Deep learning Inference Bandwidth throttling Chip Power budget Artificial intelligence Electrical and Electronic Engineering business Electrical efficiency Computer hardware Efficient energy use
Zdroj:	IEEE Journal of Solid-State Circuits. 57:182-197
ISSN:	1558-173X 0018-9200
Popis:	Reduced precision computation is a key enabling factor for energy-efficient acceleration of deep learning (DL) applications. This article presents a 7-nm four-core mixed-precision artificial intelligence (AI) chip that supports four compute precisions--FP16, Hybrid-FP8 (HFP8), INT4, and INT2--to support diverse application demands for training and inference. The chip leverages cutting-edge algorithmic advances to demonstrate leading-edge power efficiency for 8-bit floating-point (FP8) training and INT4 inference without model accuracy degradation. A new HFP8 format combined with separation of the floating- and fixed-point pipelines and aggressive circuit/architecture optimization enables performance improvements while maintaining high compute utilization. A high-bandwidth ring protocol enables efficient data communication, while power management using workload-aware clock throttling maximizes performance within a given power budget. The AI chip demonstrates 3.58-TFLOPS/W peak energy efficiency and 26.2-TFLOPS peak performance for HFP8 iso-accuracy training, and 16.9-TOPS/W peak energy efficiency and 104.9-TOPS peak performance for INT4 iso-accuracy inference.
Databáze:	OpenAIRE
Externí odkaz:	https://explore.openaire.eu/search/publication?articleId=doi_________::2998556a11fb9c2dbdc4775ef363b4bb https://doi.org/10.1109/jssc.2021.3120113 Zobrazit plný text záznamu