Choosing a Suitable Grid / Block Size for a CUDA Kernel


This post summarizes OneFlow's article How to Choose the Grid Size and Block Size for a CUDA Kernel?. The English passages are excerpts, not the full text.

cuda_kernel is the identifier of the global function, and within the (…) are the corresponding parameters that call cuda_kernel. Both of them have the same syntax as C++. As for <<<grid_size, block_size, 0, stream>>>, it is an extension of CUDA to C++, known as Execution Configuration. You can refer to CUDA C++ Programming Guide (Hereinafter called Guide):

In a kernel launch, the <<< Dg, Db, Ns, S >>> arguments stand for grid_size, block_size, the dynamically allocated shared memory per block (default 0), and the stream (default 0).
(My note: in other words, grid_size is the number of blocks in the grid, and block_size is the number of threads per block.)
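A minimal sketch of such a launch, written by me rather than taken from the OneFlow article (the scale kernel and the launch_scale helper are placeholders of my own):

#include <cuda_runtime.h>

// Hypothetical elementwise kernel: y[i] = a * x[i].
__global__ void scale(float* y, const float* x, float a, int n) {
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) y[idx] = a * x[idx];
}

// d_y and d_x are assumed to be device pointers to n floats.
void launch_scale(float* d_y, const float* d_x, float a, int n, cudaStream_t stream) {
  const int block_size = 128;                               // Db: threads per block
  const int grid_size = (n + block_size - 1) / block_size;  // Dg: blocks per grid
  // <<<Dg, Db, Ns, S>>>: Ns = 0 bytes of dynamic shared memory, S = the stream.
  scale<<<grid_size, block_size, 0, stream>>>(d_y, d_x, a, n);
}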

The Guide K.1. Features and Technical Specifications points out that Maximum number of threads per block and Maximum x- or y-dimension of a block are both 1024. Thus, the maximum value of block_size can be 1024.

In a block, a warp is made up of 32 consecutive threads. These 32 threads execute the same instruction at the same time, a model known as SIMT (Single Instruction, Multiple Threads). A warp occupies the hardware resources of all 32 lanes even if fewer than 32 of its threads are active (as in the last warp). Therefore, block_size must be an integer multiple of 32.

A block is also known as a Cooperative Thread Array (CTA):

Threads within a CTA can communicate with each other. To coordinate the communication of the threads within the CTA, one can specify synchronization points where threads wait until all threads in the CTA have arrived.
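A minimal sketch of such a synchronization point (my own illustration, not from the article): a block-level sum in shared memory, where __syncthreads() is the point every thread of the CTA must reach before any thread continues.

// Each block reduces 256 of its inputs to one partial sum.
// Assumes it is launched with block_size == 256 (a power of two).
__global__ void block_sum(const float* x, float* partial, int n) {
  __shared__ float buf[256];
  const int tid = threadIdx.x;
  const int idx = blockIdx.x * blockDim.x + tid;
  buf[tid] = (idx < n) ? x[idx] : 0.0f;
  __syncthreads();  // every thread of the CTA has written its element
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (tid < stride) buf[tid] += buf[tid + stride];
    __syncthreads();  // wait before the next halving step reads buf
  }
  if (tid == 0) partial[blockIdx.x] = buf[0];
}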

The hardware that executes a block is the SM (Streaming Multiprocessor). The SM provides the hardware resources needed for communication and synchronization among threads of the same block. Communication across SMs is not supported, so all threads of a block execute on one SM. Also, since threads may synchronize with each other, once a block starts executing on an SM, all of its threads live on that SM at the same time (concurrency, not necessarily parallelism). That means scheduling a block onto an SM is atomic. An SM allows more than one block to execute concurrently: if its idle resources are sufficient for another block, that block can be scheduled onto it immediately. These hardware resources generally include registers, shared memory, and various scheduling-related resources.

The grid is the largest unit; each element of a grid is a block, also called a CTA. Threads within a block can communicate and synchronize. The warp is the smallest scheduling unit and contains at most 32 threads (the Hardware Implementation section also notes that the way a block is partitioned into warps is fixed). The SM is the hardware that executes blocks, providing registers, shared memory, scheduling-related resources, and so on; SMs cannot communicate with each other, so (to allow intra-block communication) all warps of a block are scheduled onto the same SM. An SM can execute multiple warps in parallel, including warps from multiple blocks.
(My note: an SM contains multiple CUDA Cores.)
Question: how does warp execution relate to CUDA Cores?
ref: https://stackoverflow.com/a/62160925
TODO

The scheduling-related resources have two specific limits: Maximum number of resident blocks per SM and Maximum number of resident threads per SM, i.e. the maximum number of blocks and threads that can execute simultaneously on the SM.

The article compares the GPU to an escalator: to carry enough people to meet a certain goal, the escalator needs to ensure that there are enough people on it at the same time. For GPUs, this means trying to keep enough instructions in the pipeline at the same time.

There are many ways to achieve this goal. One simple way is to allow as many threads as possible to execute on the SM at the same time. The ratio of the number of concurrent threads executing on the SM to the maximum number of threads supported on the SM is called "occupancy". The higher the occupancy, the higher the potential performance.

One question: the documentation does not say how large "the number of concurrent threads executing on the SM" can be. My understanding is that, since this is concurrency rather than parallelism, it means the number of threads submitted to (resident on) the SM. Occupancy is then the number of threads resident on the SM divided by the maximum number of threads the SM supports, and we want it close to 100%.
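The runtime can report this number directly. A sketch of my own that uses cudaOccupancyMaxActiveBlocksPerMultiprocessor to compute the theoretical occupancy for a given block_size, assuming the scale kernel from the earlier sketch:

#include <cstdio>
#include <cuda_runtime.h>

// Assumes the scale kernel defined in the earlier sketch.
void report_occupancy(int block_size) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, /*device=*/0);

  int active_blocks = 0;  // blocks of this kernel that can be resident per SM
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &active_blocks, scale, block_size, /*dynamicSMemSize=*/0);

  const double occupancy =
      (double)(active_blocks * block_size) / prop.maxThreadsPerMultiProcessor;
  printf("block_size=%d: %d resident blocks/SM, occupancy=%.0f%%\n",
         block_size, active_blocks, occupancy * 100.0);
}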

Obviously, the block_size of a kernel should be no less than the ratio of the maximum number of threads to the maximum number of blocks per SM; otherwise 100% occupancy cannot be achieved. This ratio differs across architectures: for V100, A100, and GTX 1080 Ti it is 2048 / 32 = 64, while for RTX 3090 it is 1536 / 16 = 96. Therefore, to fit the mainstream architectures, a fixed block_size should not be less than 96. Considering the atomicity of block scheduling, block_size should also be a divisor of the maximum number of threads per SM; otherwise 100% occupancy cannot be achieved. The greatest common divisor of the per-SM maximum thread counts across mainstream GPU architectures is 512, and its divisors above 96 are 128, 256, and 512 itself. That is to say, so far, only three candidate values of block_size remain: 128 / 256 / 512.

As concluded above, block_size must be a multiple of 32.
The strategy given above (the "one simple way") is to keep as many threads as possible scheduled on the SM.
A V100 SM holds at most 32 blocks, 64 warps, and 2048 threads, i.e. on average 64 threads (2 warps) per block. In other words, any multiple of 32 that is at least 64 and at most the 1024 upper limit works.
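That lower bound can also be derived from device properties at runtime. A sketch of my own (the maxBlocksPerMultiProcessor field requires CUDA 11 or newer):

#include <cuda_runtime.h>

// Smallest block_size that can still reach 100% theoretical occupancy:
// (max resident threads per SM) / (max resident blocks per SM),
// rounded up to a multiple of the warp size.
int min_block_size_for_full_occupancy(int device = 0) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);
  const int ratio = prop.maxThreadsPerMultiProcessor / prop.maxBlocksPerMultiProcessor;
  return (ratio + prop.warpSize - 1) / prop.warpSize * prop.warpSize;
  // V100 / A100 / GTX 1080 Ti: 2048 / 32 = 64; RTX 3090: 1536 / 16 = 96.
}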

Still, because scheduling a block onto an SM is atomic, the SM has to provide the full resources required for at least one block to execute, including shared memory and registers. Shared memory is generally controlled explicitly by the developer. If the number of threads in a block times the number of registers required per thread exceeds the maximum number of registers per block supported by the SM, the kernel will fail to launch.

Note that the number of registers required by the threads of a block must not exceed the limit the SM can provide.

For the current mainstream architectures, the maximum number of registers per block supported by the SM is 32K or 64K 32-bit registers. Each thread can use at most 255 32-bit registers, and the compiler will not allocate more registers per thread than that. So, as far as registers are concerned, each SM can support at least 128 or 256 threads. A block_size of 128 rules out launch failures caused by register usage; few kernels use that many registers anyway, though there might be performance problems if only 128 or 256 threads can execute simultaneously on the SM. Still, setting block_size to 128 is no worse than 256 or 512, so 128 is a very suitable common default for block_size.

Given that a thread uses at most 255 registers (and usually far fewer), and an SM provides at least 32K registers, a block_size of 128 or below is always safe as far as registers go.
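To see how many registers the compiler actually allocated, one can pass -Xptxas -v to nvcc, or query the kernel at runtime. A sketch of my own, again assuming the scale kernel from the earlier sketch:

#include <cstdio>
#include <cuda_runtime.h>

// Assumes the scale kernel defined in the earlier sketch.
void check_register_budget(int block_size) {
  cudaFuncAttributes attr;
  cudaFuncGetAttributes(&attr, scale);   // attr.numRegs: registers per thread

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, /*device=*/0);

  printf("regs per thread: %d, needed per block: %d, available per block: %d\n",
         attr.numRegs, attr.numRegs * block_size, prop.regsPerBlock);
  // If attr.numRegs * block_size > prop.regsPerBlock, the launch will fail.
}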

Sometimes it is feasible to create a thread for each element, because thread creation has very low overhead on the GPU. However, if every thread repeats some common computation, increasing the number of threads also increases that redundant work, for example:

__global__
void kernel(const float* x, const float* v, float* y) {
  // sqrt(*v) is the common computation: every thread recomputes the same value.
  const float sqrt_v = sqrt(*v);
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  y[idx] = x[idx] * sqrt_v;
}

Next the author discusses how to choose grid_size. Using fewer threads than elements is clearly acceptable, but making them equal is not necessarily best: in the example above, sqrt(*v) can be reused, so looping over x/y inside each thread reduces the number of sqrt computations. Still, grid_size should be larger than the number of SMs so that no SM sits idle.
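A sketch of that idea (my own variant of the kernel above) as a grid-stride loop: each thread walks over several elements, so sqrt(*v) is computed once per thread rather than once per element.

__global__ void kernel_loop(const float* x, const float* v, float* y, int n) {
  const float sqrt_v = sqrt(*v);  // computed once per thread, reused in the loop
  const int step = gridDim.x * blockDim.x;
  for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n; idx += step) {
    y[idx] = x[idx] * sqrt_v;
  }
}

With such a loop, grid_size no longer has to equal n / block_size; it only needs to be large enough to keep every SM busy.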

The GPU can schedule (the number of SMs times the maximum number of blocks per SM) blocks at one time. Because the computation amount of each block is equal, all SMs should complete the computation of these blocks almost at the same time, and then process the next batch. Each batch is called a wave. Imagine that grid_size is exactly one more block than a wave. In this condition, the next kernel on the stream cannot be executed until the kernel is fully executed. Therefore, after the first wave is completed, only one block will be executed on the GPU, and the actual utilization of the GPU will be very low. This situation is called tail effect.

We should try to avoid this situation. Setting grid_size to exactly one wave might still not avoid the tail effect, because the GPU may not be exclusive to the current stream; for example, NCCL kernels may occupy some SMs. Therefore, we usually set grid_size to a sufficiently large integer number of waves. If the number of waves is large enough, whether it is exactly an integer does not matter much.

All blocks have the same workload, so all occupied SMs finish a round of computation at roughly the same time; each such round is called a wave. The tail effect means the last wave may keep only a few SMs busy, lowering GPU utilization. When the number of waves is small, aligning grid_size to a multiple of the number of SMs (i.e. an integer number of waves) is an effective optimization.
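A sketch of my own for sizing the grid as an integer number of waves; active_blocks_per_sm would come from the occupancy query above, and num_waves is a tunable assumption:

#include <cuda_runtime.h>

// One wave = (number of SMs) x (blocks of this kernel resident per SM).
// Launching an integer number of waves softens the tail effect.
int waves_grid_size(int active_blocks_per_sm, int num_waves) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, /*device=*/0);
  return prop.multiProcessorCount * active_blocks_per_sm * num_waves;
}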

To sum up, in a normal elementwise kernel or other similar cases, a block_size of 128 and a grid_size of enough waves can lead to a satisfactory result. However, more complex cases need to be analyzed on a problem-specific basis. For example, if an SM can only execute few blocks at the same time due to the shared_memory limitation, then increasing the block_size might improve the performance. If there is synchronization between threads in the kernel, then an excessive block_size will lead to a lower actual utilization of SMs. For these conditions, we can discuss them later separately.

To sum up, a block_size of 128/256/512/1024 and a grid_size that is a multiple of the number of SMs are optimal (considering only these two variables). In more complex cases, if shared memory limits mean an SM can only run a few blocks at the same time, a larger block_size tends to be better; if there is thread synchronization inside the block, a smaller block_size tends to be better.

My summary:

block_size

  • Hard upper limit of 1024 threads per block
  • Must be a multiple of the warp size (32)
  • Any block_size at or below (total registers per SM / max registers per thread) is always safe
    • For V100 that is 64K / 255 ≈ 256
  • To maximize occupancy, block_size should be at least (max threads per SM / max CTAs per SM)
    • For V100 that is 2048 / 32 = 64

grid_size

  • Should be larger than the number of SMs (so that all SMs are used), and preferably an integer multiple of the SM count (or close to a rounded-up multiple), so that no SM sits idle while the kernel runs; see the sketch after this list
    • A V100 has 80 SMs with 64 CUDA Cores each, 5120 CUDA Cores in total
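Putting the checklist together, a sketch of my own for a launch-configuration helper for a simple elementwise kernel; block_size = 128 and the wave count are the heuristics from this post, not universal rules:

#include <algorithm>
#include <cuda_runtime.h>

// Pick block_size = 128 and a grid_size of a few full waves, but never more
// blocks than are needed to cover the n elements.
void elementwise_launch_config(int n, int* grid_size, int* block_size) {
  *block_size = 128;
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, /*device=*/0);
  // Approximate resident blocks per SM from the thread limit only; the
  // occupancy API above gives the exact number for a concrete kernel.
  const int blocks_per_wave =
      prop.multiProcessorCount * (prop.maxThreadsPerMultiProcessor / *block_size);
  const int num_waves = 32;  // "a sufficient number of waves"; tune per workload
  const int blocks_needed = (n + *block_size - 1) / *block_size;
  *grid_size = std::min(blocks_needed, blocks_per_wave * num_waves);
}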

TODO https://stackoverflow.com/questions/6048907/maximum-blocks-per-gridcuda
The grid also has maximum sizes in each of its three dimensions.

Another factor worth considering is cache locality; I will summarize an article on that later.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/#a-scalable-programming-model

each block of threads can be scheduled on any of the available multiprocessors within a GPU

Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores as illustrated by Figure 3, enabling programmers to write code that scales with the number of cores.

