TMTPost T-EDGE | Google Board Chairman John Hennessy (Part 7)


First, they use a simple parallel model and work within a specific domain, which means they can get by with much simpler control hardware. For example, we switch from the multiple-instruction, multiple-data model of a multicore to a single-instruction, multiple-data model. That dramatically improves the efficiency associated with fetching instructions, because now we fetch one instruction rather than a separate instruction for every operation.
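The control-efficiency point can be sketched with a toy model (the function names and the lane width are illustrative, not from the talk): under MIMD-style control we pay one instruction fetch per element, while under SIMD-style control one fetch drives a whole vector of lanes.

```python
def mimd_add(a, b):
    """Toy MIMD model: every element executes its own fetched instruction."""
    fetches = 0
    out = []
    for x, y in zip(a, b):
        fetches += 1          # one instruction fetch per element
        out.append(x + y)
    return out, fetches

def simd_add(a, b, lanes=8):
    """Toy SIMD model: one fetched instruction drives `lanes` elements."""
    fetches = 0
    out = []
    for i in range(0, len(a), lanes):
        fetches += 1          # one fetch covers a whole vector of lanes
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
    return out, fetches
```

On a 16-element vector the MIMD model pays 16 fetches, the 8-lane SIMD model only 2, while both produce the same result.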
We move to VLIW rather than speculative out-of-order mechanisms: techniques that rely on being able to analyze the code well enough to know its dependences, and therefore to create and structure the parallelism at compile time rather than discovering it dynamically at runtime.
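As a rough illustration of the VLIW idea (a hypothetical toy scheduler, not an example from the talk), a "compiler" can pack operations into wide issue bundles by checking register dependences statically, so no out-of-order hardware is needed at runtime:

```python
def schedule_vliw(ops, width=2):
    """Greedily pack ops into VLIW-style bundles of at most `width` slots.

    Each op is (dest, src1, src2) register names. An op may join the
    current bundle only if its sources are not written in that bundle
    and it does not depend on an op that was deferred to a later bundle.
    """
    bundles = []
    pending = list(ops)
    while pending:
        bundle, written, deferred = [], set(), set()
        rest = []
        for dest, s1, s2 in pending:
            ok = (len(bundle) < width
                  and s1 not in written and s2 not in written
                  and s1 not in deferred and s2 not in deferred
                  and dest not in written and dest not in deferred)
            if ok:
                bundle.append((dest, s1, s2))
                written.add(dest)
            else:
                rest.append((dest, s1, s2))
                deferred.add(dest)   # later consumers must wait too
        bundles.append(bundle)
        pending = rest
    return bundles
```

Two independent operations fit into one bundle, while a dependent chain of three operations is forced into three bundles, all decided before "runtime".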
Second, we make more effective use of memory bandwidth. We go to user-controlled memory systems rather than caches. Caches are great, except when you have large amounts of data streaming through them; then they are extremely inefficient, because that is not what they were meant to do. Caches are meant to work when a program does repetitive things in a somewhat unpredictable fashion. Here we have repetitive operations in a very predictable fashion, but on very large amounts of data.
So we go to an alternative: we use prefetching and other techniques to move data into the memory inside the domain-specific processor, and once it is there we can make heavy use of it before moving it back to main memory.
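The fetch-once, reuse-heavily pattern can be sketched as a tiled loop (a minimal model with assumed names; the counter stands in for main-memory traffic): each tile is brought into local storage in one bulk transfer, operated on repeatedly while it is resident, and written back once.

```python
def process_with_tiles(data, tile_size, op, passes):
    """Stream `data` through a local tile buffer.

    Each tile is fetched from 'main memory' exactly once (tracked by
    main_memory_reads), then `op` is applied `passes` times entirely
    out of local storage before the tile is written back.
    """
    main_memory_reads = 0
    out = []
    for i in range(0, len(data), tile_size):
        local = data[i:i + tile_size]     # prefetch: one bulk transfer
        main_memory_reads += len(local)
        for _ in range(passes):           # heavy reuse while "on chip"
            local = [op(x) for x in local]
        out.extend(local)                 # write back once
    return out, main_memory_reads
```

However many passes we make over a tile, each element crosses the main-memory boundary only once on the way in, which is the efficiency a cache cannot deliver for large streaming data.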
We eliminate unneeded accuracy. It turns out we need considerably less accuracy than we do for general-purpose computing. For integers, 8- to 16-bit values suffice; for floating point, we need 16- to 32-bit numbers rather than 64-bit double precision. We gain efficiency both by making the data items smaller and by making the arithmetic operations cheaper.
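A hedged sketch of the reduced-precision idea (symmetric linear quantization; the helper names are mine, and real accelerators use more elaborate schemes): map 32-bit floats onto the 8-bit integer range with a scale factor, compute on the small integers, and rescale afterwards.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to the int8 range."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the 8-bit codes back to approximate float values."""
    return [x * scale for x in q]
```

The round trip loses at most half a quantization step per value, which is acceptable for the workloads these processors target, while every data item is a quarter the size of a 32-bit float.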
The key is that the domain-specific programming model matches the application to the processor. These are not general-purpose processors: you are not going to take a piece of C code, throw it on one of these processors, and be happy with the results. They are designed to match a particular class of applications, and that structure is determined by the interface in the domain-specific language and by the underlying architecture.
This example gives you an idea of how differently we use silicon in these environments than we would in a traditional processor.
What I've done here is take the first-generation TPU, Google's first tensor processing unit, but I could take the second, third, or fourth and the numbers would be very similar. The block diagram shows what the chip area is devoted to. There is a very large matrix multiply unit that can do a 256 x 256 array of 8-bit multiplies (the later generations have floating-point versions of that multiplier). There is a unified buffer used as local memory for activations, along with accumulators, a small amount of control, and interfaces to DRAM.
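The matrix unit's operation can be illustrated in miniature (dimensions scaled down from 256 x 256 for the sketch; the function is mine, not the TPU's actual datapath): 8-bit operands are multiplied, and each product is summed into a wide accumulator, matching the role of the TPU's separate accumulator storage.

```python
def matmul_int8(a, b):
    """Multiply small-integer matrices a (m x k) and b (k x n).

    Each 8-bit x 8-bit product is summed into a wide (int32-style)
    accumulator, so intermediate sums never overflow the narrow
    operand width.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0                        # wide accumulator
            for p in range(k):
                acc += a[i][p] * b[p][j]   # narrow multiply
            out[i][j] = acc
    return out
```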