Reconfigurable computing: a promising microchip architecture for artificial intelligence
As the feature size of current semiconductor processes approaches its physical limits, it is becoming increasingly difficult to improve the performance and power consumption of integrated circuits through process technology advances alone. Extending Moore's law through architectural innovation, and continuing to gain in performance, power and cost, has become a hot topic of international research. For example, DARPA's Electronics Resurgence Initiative lists architectural innovation as one of its three key research directions. Reconfigurable computing is a circuit architecture technology that has risen to prominence in recent years. It concerns not only the exploration of circuit architectures but also carries a deep theoretical background involving the fusion of knowledge from different disciplines, making it a new direction that spans microelectronics, circuits and systems, computer science and software. More than a decade ago, Chinese researchers recognized the great potential of reconfigurable computing and, supported by key projects of the 863 Program and the National Science and Technology Major Project, carried out long-term research that produced a series of results with significant international impact.
In the newly published issue 2, 2020 of the Journal of Semiconductors, Professor Shaojun Wei of Tsinghua University reviews the recent development of reconfigurable computing technology, analyzing the different technical routes in the field of reconfigurable computing chips together with their respective advantages and shortcomings, and discussing in detail the efficient computation and application prospects of reconfigurable computing in the field of artificial intelligence chips. Professor Wei also offers an outlook on the future development of reconfigurable computing technology.
Abstract
Today, integrated circuit technology is approaching its physical limits. From the performance and energy-consumption perspective, reconfigurable computing is regarded as one of the most promising technologies for future computing systems, offering excellent computing and energy efficiency. In terms of computing performance, in contrast to the stagnating single-thread performance of general-purpose processors (GPPs), reconfigurable computing can customize hardware according to application requirements, achieving higher performance and lower energy consumption. In economic terms, a microchip based on reconfigurable computing technology has post-silicon reconfigurability and can be applied in different fields, better amortizing its non-recurring engineering (NRE) cost. High computing and energy efficiency, together with unique reconfigurability, make reconfigurable computing one of the most important technologies for artificial intelligence microchips.
1. What is reconfigurable computing?
Different from the traditional time-domain-only programming model, reconfigurable computing performs computation on an architecture that is programmable in both the temporal and spatial domains. Its connotation and implementation have evolved with the progress of semiconductor technology and target applications. The field-programmable gate array (FPGA), born in the 1980s, is a typical reconfigurable microchip. It was originally developed for logic emulation but soon became a widely used device, because its reconfigurability makes it possible to implement various algorithms. By eliminating the instruction fetch and decode stages of GPPs, FPGAs are far more energy-efficient than GPPs.
However, because of the large configuration context caused by the FPGA's fine-grained architecture, together with its static reconfiguration mechanism, the computing and energy efficiency of FPGAs are far from ideal. For example, the look-up table (LUT) structure means that about 95% of the logic is used for defining functions rather than computing them, so most of the energy consumption is not directly related to computation. Furthermore, the static programming model means that a function can be realized only after a complete circuit design has been loaded into the FPGA. As a result, a 10-million-gate FPGA can typically realize a circuit design of only several hundred thousand gates.
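As a concrete illustration of the LUT mechanism mentioned above, the following Python sketch models a 4-input LUT: 16 configuration bits define the logic function, while evaluation is just an index lookup. The `Lut4` class is a toy model for illustration only, not any real FPGA primitive.

```python
# Minimal model of a k-input FPGA look-up table (LUT), here k = 4.
# The 2^k configuration bits define the function ("definition" cost);
# evaluating the function is only an array index lookup.
class Lut4:
    def __init__(self, truth_table_bits):
        assert len(truth_table_bits) == 16  # 2^4 configuration bits
        self.bits = truth_table_bits

    def eval(self, a, b, c, d):
        # The four inputs select one stored bit.
        index = (d << 3) | (c << 2) | (b << 1) | a
        return self.bits[index]

# Configure the LUT as a 4-input AND: only input pattern 1111 outputs 1.
and4 = Lut4([0] * 15 + [1])
print(and4.eval(1, 1, 1, 1))  # 1
print(and4.eval(1, 0, 1, 1))  # 0
```

Reconfiguring the same hardware to a different function means only rewriting the 16 stored bits, which is why the definition overhead dominates the silicon.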
Recently, with the emergence of artificial intelligence (AI), FPGAs have been used to implement various AI algorithms. However, their low programming efficiency and static reconfiguration characteristics show that, to truly realize AI applications, especially terminal-side applications that demand both high energy efficiency and high flexibility, a new microchip architecture must be found.
Coarse-grained reconfigurable architectures (CGRAs) are another implementation of the reconfigurable computing concept. Through redundant deployment of computing resources, the arithmetic logic, memory subsystem and interconnect of a CGRA can be flexibly customized according to application requirements, improving computing and energy efficiency. The emerging dynamic reconfigurable computing technology enables real-time dynamic configuration of a CGRA driven by the software (application), which may make it an ideal AI microchip architecture.
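To make the contrast with bit-level FPGA configuration concrete, here is a minimal, purely illustrative Python model of a coarse-grained processing element (PE) whose word-level operation is selected by a small context word and can be swapped at run time. All names (`PE`, `OPS`) are hypothetical and do not correspond to any specific CGRA.

```python
# Toy model of one PE in a coarse-grained reconfigurable array.
# Instead of thousands of LUT bits, a PE is configured by a short
# context word that selects a word-level operation.
OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": lambda a, b: max(a, b),
}

class PE:
    def __init__(self):
        self.op = "add"  # default context

    def configure(self, op):
        # Dynamic reconfiguration: swap the context word at run time,
        # without reloading a whole bitstream.
        self.op = op

    def execute(self, a, b):
        return OPS[self.op](a, b)

pe = PE()
pe.configure("mul")
print(pe.execute(3, 4))  # 12
pe.configure("max")      # retarget the same PE to a different kernel
print(pe.execute(3, 4))  # 4
```

Because the context is a few bits rather than a full bitstream, such an array can be retargeted in a cycle-level time scale, which is the basis of dynamic reconfigurable computing.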
2. Why is reconfigurable computing suitable for AI applications?
Modern AI applications, such as computer vision and speech recognition, are based on the computation of artificial neural networks (NNs), which are characterized by complex computation over massive data and parameters together with frequent layer-to-layer communication. Although AI technology has made great progress, AI algorithms are still evolving, and one artificial NN (algorithm) typically fits only one application. An ideal AI microchip must therefore adapt to the continuous evolution of algorithms, support different artificial NNs on demand, and switch between them flexibly. By enabling customization of the computation pattern, computing architecture and memory hierarchy, microchips based on reconfigurable computing technology can efficiently support different NNs with high-throughput computation and communication. Many studies have achieved impressive performance on diverse NNs by reconfiguring data paths to minimize the energy consumed by data movement.
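To see why NN workloads are described as massive but regular, one can count the multiply-accumulate (MAC) operations of a single convolutional layer. The shapes below are those commonly cited for AlexNet's first layer and are used here only as an assumed example.

```python
# Back-of-envelope MAC count for one convolutional layer, illustrating
# the scale of computation an AI microchip must sustain.
def conv_macs(out_h, out_w, out_c, k_h, k_w, in_c):
    # One MAC per (output pixel, output channel, kernel element, input channel).
    return out_h * out_w * out_c * k_h * k_w * in_c

# Assumed AlexNet conv1 shape: 55x55x96 outputs, 11x11 kernels, 3 input channels.
macs = conv_macs(out_h=55, out_w=55, out_c=96, k_h=11, k_w=11, in_c=3)
print(macs)  # 105415200 -- over 100 million MACs for a single layer
```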
3. What is the recent progress?
Recently, reconfigurable computing has made remarkable progress in accelerating AI applications. First, a practical NN is usually composed of several kinds of layers, such as convolutional and fully-connected layers, and efficient computation must be supported on all of them to realize end-to-end AI applications. Most AI processors therefore use reconfigurable computing units, rather than independent hardware resources per layer type, to support the various layers and improve the overall performance of the entire network.

Second, memory access, especially DRAM access, is the bottleneck of AI acceleration. For example, supporting the 724M MACs of AlexNet requires nearly 3000M DRAM accesses, and one DRAM access consumes up to 200× the energy of one MAC. Four dataflows, namely weight stationary, output stationary, no local reuse and row stationary, have been proposed to improve data reuse and reduce memory accesses. Every time a piece of data is moved from an energy-expensive memory level to a cheaper one, it should be reused as much as possible to minimize subsequent accesses to the expensive level; this is the goal of an optimized dataflow. The challenge, however, is that the capacity of these low-cost memories is limited, so different dataflows must be explored to maximize reuse under these constraints. Unlike application-specific integrated circuits (ASICs), which support specialized processing dataflows, more and more processors adopt reconfigurable architectures that dispatch one of the four dataflows for each AI application, maximizing data reuse and significantly improving overall flexibility.

Third, processors usually implement AI applications with quantization, that is, converting data from floating point to fixed point. The ultimate goal is to minimize the error between the data reconstructed from the quantization levels and the original data, and sometimes also to reduce the number of operations.
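The reuse idea behind the stationary dataflows can be sketched as plain loop nests. In the weight-stationary ordering below, each weight is fetched once from the outer (expensive) level and reused across the whole input batch; in the output-stationary ordering, each partial sum stays in a local accumulator until it is complete. This is a loop-level sketch of the concept under assumed matrix shapes, not any particular accelerator's implementation.

```python
import numpy as np

def weight_stationary(W, X):
    # W: (M, K) weights; X: (K, N) batch of N input vectors.
    M, K = W.shape
    N = X.shape[1]
    Y = np.zeros((M, N))
    for m in range(M):
        for k in range(K):
            w = W[m, k]              # weight fetched ONCE from the costly level
            for n in range(N):       # ...and reused across the whole batch
                Y[m, n] += w * X[k, n]
    return Y

def output_stationary(W, X):
    M, K = W.shape
    N = X.shape[1]
    Y = np.zeros((M, N))
    for m in range(M):
        for n in range(N):
            acc = 0.0                # partial sum stays in a local register
            for k in range(K):
                acc += W[m, k] * X[k, n]
            Y[m, n] = acc            # written back only once, when complete
    return Y

# Both orderings compute the same product; only the memory traffic differs.
W = np.random.rand(4, 8)
X = np.random.rand(8, 2)
assert np.allclose(weight_stationary(W, X), W @ X)
assert np.allclose(output_stationary(W, X), W @ X)
```

A reconfigurable architecture can switch between such orderings per application, choosing whichever maximizes reuse for the given layer shapes and on-chip buffer sizes.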
Quantization methods can be classified into linear and non-linear quantization. Linear quantization is simpler but loses more accuracy, while non-linear quantization maintains higher accuracy at the cost of complexity. Moreover, since the importance of weights and activations varies across layers, different quantization methods can be used for weights and activations, and for different layers, filters and channels in the network. Reconfigurable computing is therefore increasingly attractive for supporting multiple quantization methods, achieving high accuracy with fewer operations and smaller operands.

Fourth, ReLU, a popular non-linear activation function in AI applications, sets all negative values to zero. As a result, the output activations of the feature maps after ReLU are sparse; for instance, the feature maps in AlexNet have sparsity between 19% and 63%. This sparsity can be exploited for energy, cycle and area savings through compression, prediction and network pruning, particularly for expensive off-chip DRAM accesses. Compression methods can skip reading the weights and performing the MAC for zero-valued activations without accuracy loss, but require complex control logic; prediction methods sacrifice some accuracy to eliminate operations corresponding to zero-valued activations; pruning methods remove low-valued activations to make the network even sparser, but can significantly affect accuracy. Because these methods perform differently on different networks, several reconfigurable computing architectures combine them to reduce operations as much as possible with only marginal loss.

Finally, some methods propose compact network architectures to reduce the number of weights and computations in AI applications.
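The linear case discussed above can be sketched in a few lines. The snippet below shows symmetric uniform quantization of float weights to signed 8-bit integers; the function names and the choice of a symmetric int8 scheme are assumptions for illustration, not a specific processor's method.

```python
import numpy as np

def linear_quantize(w, num_bits=8):
    # Symmetric uniform (linear) quantization: one scale factor maps floats
    # to signed integers. Simple, but coarse where values cluster near zero.
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruct the data from the quantization levels.
    return q.astype(np.float32) * scale

w = np.array([-0.50, -0.01, 0.02, 0.49], dtype=np.float32)
q, s = linear_quantize(w)
w_hat = dequantize(q, s)
# Reconstruction error is bounded by half a quantization step:
print(np.max(np.abs(w - w_hat)) <= s / 2)  # True
```

Non-linear schemes replace the uniform levels with a learned or logarithmic codebook, trading this simplicity for better accuracy on skewed weight distributions.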
The main trend is to replace a large filter with a series of smaller filters, which can be applied during network architecture design. Since each compact network architecture is designed for specific AI applications, some reconfigurable computing architectures try to support all kinds of compact networks, maximally reducing the number of operations and the model size for each compact network in its specific situation with only marginal accuracy loss.
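A quick worked example of the small-filter trend: stacking two 3×3 convolutions covers the same 5×5 receptive field as one 5×5 convolution but with fewer weights. The channel count below is an assumed value chosen only for the arithmetic.

```python
# Parameter count: one 5x5 conv layer vs. two stacked 3x3 conv layers.
# Two 3x3 layers reach the same 5x5 receptive field with fewer weights.
C = 64  # assumed number of input and output channels

params_5x5 = 5 * 5 * C * C                # single 5x5 layer
params_3x3_stack = 2 * (3 * 3 * C * C)    # two stacked 3x3 layers

print(params_5x5)        # 102400
print(params_3x3_stack)  # 73728 -> ~28% fewer weights, same receptive field
```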
4. Remaining challenges and prospects
So far, research on AI microchips has focused mainly on multilayer perceptron neural networks. The latest developments in AI research require that AI microchips also accelerate newly emerging neural networks, such as graph neural networks and memory networks. Another promising direction is to use artificial intelligence itself to guide the design of reconfigurable computing systems. Traditionally, these systems have been designed and programmed using empirical methods; with the increasing complexity of microchips, designers can use AI to better build and manage complex reconfigurable systems.
Shaojun Wei, born in 1958, holds a doctorate in applied science and is a professor in the Department of Microelectronics and Nanoelectronics at Tsinghua University, a Fellow of the IEEE and a Fellow of the Chinese Institute of Electronics (CIE).
His research interests include integrated circuit design methodology, electronic design automation (EDA), embedded system design, and reconfigurable computing. He has published more than 220 papers and five monographs, and has received one second prize of the National Science and Technology Progress Award, one second prize of the National Technology Invention Award, and five provincial- and ministerial-level first prizes in science and technology.
Read Professor Shaojun Wei's article:
Reconfigurable computing: a promising microchip architecture for artificial intelligence
Shaojun Wei
J. Semicond. 2020, 41(2): 020301
doi: 10.1088/1674-4926/41/2/020301
Special Issue on Reconfigurable Chip Technology for Energy-Efficient Artificial Intelligence Computing
The Journal of Semiconductors has organized a special issue on "Reconfigurable Chip Technology for Energy-Efficient Artificial Intelligence Computing", with Professor Haigang Yang of the Institute of Electronics, Chinese Academy of Sciences; Professor Yajun Ha of the School of Information Science and Technology, ShanghaiTech University; Professor Lingli Wang of the School of Microelectronics, Fudan University; Associate Professor Wei Zhang of the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology; and Assistant Professor Yingyan Lin of the Department of Electrical and Computer Engineering, Rice University serving as guest editors. The special issue was published in issue 2, 2020 and is available online.