存储器层次结构笔记

About

本文是Computer System–A programmer’s perspective第六章存储器层次结构的阅读笔记。对其中的要点加以表述,以方便往后能够快速回忆自己都看了些什么鸟玩意。。。

Content

存储技术

  • RAM { DRAM SRAM }
  • DRAM的访问方法:

    • 内存控制器
    • 内部行缓冲区
    • 行列引脚共用,分两次传递行列数
  • 磁盘存储、固态硬盘(SSD)、存储技术趋势

上述都是些死记硬背的东西,所以不想浪费笔墨

局部性

局部性有两种:

  • 空间局部性
  • 时间局部性

原文对空间局部性和时间局部性进行了解释:

Locality is typically described as having two distinct forms: temporal locality and spatial locality. In a program with good temporal locality, a memory location that is referenced once is likely to be referenced again multiple times in the near future. In a program with good spatial locality, if a memory location is referenced once, then the program is likely to reference a nearby memory location in the near future.

由此我们对局部性进行总结:

  • 重复引用相同变量的程序具有良好的时间局部性
  • 对于具有步长为k的引用模式的程序,步长越小,空间局部性越好
  • 对于取指令来说,循环具有良好的空间和时间局部性。循环体越小,循环迭代次数越多,局部性越好

存储器层次结构

下面是非常经典的memory hierarchy:

Figure:The Memory Hierarchy

这里就引入了一个我们非常耳熟的概念:高速缓存(cache),书中给出了对于cache的解释:

The central idea of a memory hierarchy is that for each k, the faster and smaller storage device at level k serves as a cache for the larger and slower storage device at level k + 1. In other words, each level in the hierarchy caches data objects from the next lower level.

由于缓存肯定比所有的内存小,所以一定存在如下两种情况:

  • cache hit
  • cache miss

其中,缓存不命中存在如下几种类型:

  • 冷不命中(cold miss)
  • 冲突不命中(conflict miss)
  • 容量不命中(capacity miss)

小结

  • 利用时间局部性:由于时间局部性,同一数据对象可能会被多次使用
  • 利用空间局部性:由于空间局部性,我们期望后面对该块中其他对象的访问能够补偿不命中后复制该块的花费

高速缓存存储器

通用高速缓存存储器的组织结构是比较重要的地方,也是这次cache lab第一个作业的主要内容😭,就是让你自己写一个cache stimulator,还是用c写。
对于其结构介绍如下:

Consider a computer system where each memory address has m bits that form M = 2m unique addresses. As illustrated in Figure below, a cache for such a machine is organized as an array of S = 2s cache sets. Each set consists of E cache lines. Each line consists of a data block of B = 2b bytes, a valid bit that indicates whether or not the line contains meaningful information, and t = m − (b + s) tag bits (a subset of the bits from the current block’s memory address) that uniquely identify the block stored in the cache line. In general, a cache’s organization can be characterized by the tuple (S, E, B, m). The size (or capacity) of a cache, C, is stated in terms of the aggregate size of all the blocks. The tag bits and valid bit are not included. Thus, C = S × E × B.

Figure:General organization of cache (S, E, B, m)

Figure:Address to Find Cache line

上图说明了对于给定的一个地址,我们如果想要从cache中找到有没有它,所需要的三步和对应的数字:

  • 组选择(set index)
  • 行匹配(tags)
  • 块偏移(block offset)

这里有几种不同的缓存:

  • 直接映射高速缓存(E=1)
  • 组相联高速缓存(E!=1)
  • 全相联高速缓存(s=0)

编写高速缓存友好的代码

1) 让最常见的情况运行的快
2) 尽量减少每个循环内部缓存不命中的数量

讲道理,我觉得上面的是废话= =

比较好的一个例子是下面这个stride的例子:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
// stride=1
int sumarray(int a[M][N]){
int i,j,sum=0;
for(i=0;i<M;i++){
for(j=0;j<N;j++){
sum+=a[i][j];
}
}
}

//stride=N
int sumarray(int a[M][N]){
int i,j,sum=0;
for(j=0;j<N;j++){
for(i=0;i<M;i++){
sum+=a[i][j];
}
}
}

综合:高速缓存对程序性能的影响

储存山

Figure: The Memory Mountain

后面还提到了一个矩阵相乘的问题,需要综合考虑步长和使用重复变量的问题,这里不再陈述。

Summary

三点建议

  • Focus your attention on the inner loops, where the bulk of the computations and memory accesses occur.
  • Try to maximize the spatial locality in your programs by reading data objects sequentially, with stride 1, in the order they are stored in memory.
  • Try to maximize the temporal locality in your programs by using a data object as often as possible once it has been read from memory.

Reference

[1]Computer System- A programmer’s perspective