Class Summary of Computer Architecture

已有 3070 次阅读 2011-12-22 13:17 |系统分类:科研笔记

This class is my favorite class in this semester. Though it was hard for me at the very beginning, it turns out to be enjoyable.


Over the past 30 years, people have been utilizing the Instruction Level Parallelism (ILP) to improve the performance of single processors. At the very beginning, they designed the simple five-stage pipeline to process multiple instructions at the same time. According to the Amdahl's law, we can improve a processor's performance by 4 times faster in maximum with this design. But as there are structure hazard, data hazard and control hazard, the goal is far away from our reach. Therefore people try to mitigate these hazards in software and hardware.

From the point of software, people try to utilize the loop unrolling and instruction reordering provided by compilers to improve the performance. As loop unrolling can reduce the number of branches, it can improve the efficiency of pipeline by reducing the stall. Reordering the instructions can optimize the execution of instructions, occupying the stall with independent instructions. But these techniques are coarse-grain techniques, unaware of inner mechanism of processors such as cache miss and have limited return to gain.

The major performance improvement of processors comes from hardware. With different memory access units, data access and instruction access can operate in the same cycle. In order to reduce the stall, people rearrange the instructions through dynamic scheduling. According to the Tomasulo Algorithm, instructions are issued in order but are executed out of order. The Reservation Stations are introduced to rename registers, cache intermediate execution results, and provide operands for other instructions. Therefore, the Tomasulo Algorithm can eliminate the WAR and WAW effectively. But for RAW and branches, as instructions have to wait before their operands are ready in Reservation Stations, they have to stall. As the prevalence of branches in our program, the capability to execute instructions beyond the branches is critical to ILP. Accordingly, people came up speculation with branch predictors to execute instructions before branch results and targets are determined. Therefore pursuing high accuracy branch predictors is a worthwhile investment. Gshare indexes the entry by combining (XOR) the branch address and branch history, and matches the column according to the history. Neural branch predictors achieve high performance by utilizing the correlation between long fixed instructions history but with a linear hardware increase compared to the Gshare. Compared to the branch predictor with perceptrons, the TAGE predict according to the longest matching history rather than fixed length history, which reduces the interference between branches. Speculation also introduce Reorder Buffer to issue instructions in order, store execution results for registers and memory, then commit instruction in order. Compared to dynamic scheduling, speculation enables processors to diagnose exceptions in exact position.

Pipeline, dynamic scheduling and speculation help to reduce CPI tremendously, but the CPI is still greater than or equal to one for there is only one instruction is issued in every cycle. To diminish the CPI to less than one, people proposed superscalar processors and Very Long Instruction Word (VLIW) processors. Superscalar processors issue multiple instructions simultaneously; while VLIW processors issue one instruction with multiple operations. Compared to single issue processors, these processors require more instruction fetch bandwidth, registers, reservation stations and reorder buffers etc. Meanwhile as VLIW processors depend on compilers to pack multiple operations into one instruction, they perform not as good as expected.

The return from ILP diminishes as we hit the wall of ILP and power. People turn to another natural and common parallelism, Thread Level Parallelism (TLP). TLP improves the performance in twofold aspects, one is to improve the throughput, and the other is to reduce the execution time. Lured by the obvious promise, the IBM, AMD, Sun and Intel began to develop their own multiprocessors. Compared to uniprocessors, multiprocessors need to cooperate and communicate among processor nodes. There are two parallel models for multiprocessors. One is message passing; the other is memory sharing. Message-passing multiprocessors synchronize the nodes by blocking on message; while shared memory multiprocessors rely on atomic memory operations, e.g. test-and-set. As all the threads of shared memory multiprocessors share the same memory address space, we have to face the cache coherent problem.

According to the Flynn's Taxonomy, multiprocessors such as Clusters, SMP servers fit into the category of Multiple Instruction Multiple Data (MIMD); uniprocessors with pipelines belong to the field of Single Instruction Single Data (SISD); Vector Processors are accounted as Single Instruction Multiple Data (SIMD), which is a demonstration of Data Level Parallelism. Most of today's processors are based on the control flow architecture, which executes instructions pointed by PC; while dataflow architecture may bring more innovative idea to today's computer architecture. As the paper we discussed in class, data-triggered threads will not only help to improve the parallelism, but also help to eliminate redundant computation.


There are two localities in computer architecture, temporal locality and spatial locality. For temporal locality, the same instructions such as loops are more likely to execute in near future; for spatial locality, processors are more likely to access data or instruction close to the location accessed recently. The best example to demonstrate locality is memory hierarchy, a multiple levels of memory which may include two or three levels of cache, memory and hard disk. The most recently executed data and instructions are cached in upper level memory because of temporal locality. Due to spatial locality, we read or replace upper level memory in blocks or pages. There are three special designs in memory hierarchy embodying the idea of locality, the cache, TLB (Translate Lookaside Buffer) and virtual memory. Cache stores the most recently used data or instructions from memory; TLB serves as a faster Page Table. Both of them are located inside of processors. Virtual memory enables every process to use a larger memory space, storing recently used data or instructions on physical memory and placing less used ones on hard disk.

Cache and TLB both are proposed to improve memory access. How do they work together? Today most of processors use virtual memory, which means the address used to access data or instructions are not physical address. Before you read data or instructions, you need to translate the virtual address to physical address. But if there is a cache hit, you don't need the full physical address; part of it will guarantee the access. In order to reduce the memory access time, we overlap virtual address translation and cache access at the same time. As the page offset bits of virtual address is intact during the translation, we use it to access cache. If there is a cache hit, processors will send a single to stop the translation. Otherwise, continue the translation by referring the TLB, sequentially Page Table if necessary. The drawback of the design is that the size of cache is limited by the size of page.

Different level of cache serves different purposes. The upper level is used for its speed; while the last level is used for its capacity. The idea along with locality inspires the hybrid drive, which has a cache made of solid-state driver possessing a high access speed.

Focus on the common case

Frequent case is often simpler and easier to get done than infrequent one. When designing branch predictors, we focus on branches with good regularity rather than ones with irregularity. Meanwhile too complicated design for higher accuracy may lead to high latency, which makes it impossible to apply in real processor design.   

Papers discussed in class

We discussed several high quality papers from HPCA 2011. In current cutting-edge research of branch prediction, people are no longer focusing on improving the accuracy of branch predictors because it almost meets its limit. People turn to use its research result in other application. People are carrying out research on cache replacement policy based on confidence of branch predictor and balancing resource allocation for different paths of branches in multiple threads scenario. The other three papers, one is expanding Instruction Set of mobile processors to enable dynamic type check for JavaScript, which also mentions using branch predictor to do type inference; one borrows the idea of share last-level cache to build share last-level TLB, which improves the performance through better sharing among parallel applications and flexible dynamic resource allocation for sequential applications; the last one educates us the internal parallelism of flash memory based solid state drives(SSDs), aiming to teach programmer to access the future hybrid hard disk in parallel rather than in sequence.

下一篇:The days working as a teaching assistant
收藏 IP: 129.115.28.*| 热度|


该博文允许注册用户评论 请点击登录 评论 (2 个评论)


Archiver|手机版|科学网 ( 京ICP备07017567号-12 )

GMT+8, 2023-9-28 02:11

Powered by

Copyright © 2007- 中国科学报社