0 GPU-mode

Video resource: https://www.bilibili.com/video/BV1QZ421N7pT?vd_source=85c9ce6d49ba579156fb1b41d0e606b3

GitHub reference: https://github.com/BBuf/how-to-optim-algorithm-in-cuda/tree/master/reduce

Book recommended by torch core devs: "Programming Massively Parallel Processors"

Programming Massively Parallel Processors.pdf

<aside> 💡

Progress: how-to-optim-algorithm-in-cuda / cuda-mode/lecture / 14 - Programming Model

</aside>

0.1 CUDA Related API & Function

API

cudaMallocManaged: allocates unified (managed) memory, i.e. a single pointer that is valid on both host and device, so no explicit cudaMemcpy is needed.
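A minimal sketch of the typical unified-memory pattern (the add_one kernel, array size, and launch configuration below are illustrative assumptions, not from the source):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: adds 1.0 to every element.
__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    // Unified memory: one pointer, valid on host and device,
    // no explicit cudaMemcpy required.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 0.0f;   // written on the host

    add_one<<<(n + 255) / 256, 256>>>(data, n);   // written on the device
    cudaDeviceSynchronize();                      // sync before the host reads again

    printf("data[0] = %f\n", data[0]);            // expected: 1.000000
    cudaFree(data);
    return 0;
}
```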

cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernel, 0, 0); queries the minimum grid size needed to fully occupy the device and the block size that maximizes occupancy for the given kernel.
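A sketch of the usual launch pattern around this query (my_kernel, the problem size n, and the grid-size formula are illustrative assumptions):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel whose occupancy we query.
__global__ void my_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Last two args: dynamic shared memory per block (0) and block-size limit (0 = none).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, my_kernel, 0, 0);

    const int n = 1 << 20;
    float* data = nullptr;
    cudaMalloc(&data, n * sizeof(float));

    // Derive the grid size from the problem size, using the suggested block size.
    int gridSize = (n + blockSize - 1) / blockSize;
    my_kernel<<<gridSize, blockSize>>>(data, n);
    cudaDeviceSynchronize();

    printf("suggested blockSize = %d, minGridSize = %d\n", blockSize, minGridSize);
    cudaFree(data);
    return 0;
}
```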

Function

__shfl_sync(): a built-in intrinsic for communication between threads of the same warp. It implements a "shuffle" operation: it reads a value from a specified lane within the warp and makes it available to the other threads of that warp, without going through shared memory.
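A small sketch (kernel name and values are illustrative, not from the source) showing __shfl_sync as a broadcast from lane 0, plus the related __shfl_down_sync intrinsic in the common warp-sum reduction pattern:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// One warp: broadcast lane 0's value to all lanes, then compute a warp-wide sum.
__global__ void shuffle_demo(int* out_broadcast, int* out_sum) {
    int lane = threadIdx.x % 32;
    int val = lane;  // each lane holds its own lane id

    // Broadcast: every lane receives the value held by lane 0.
    int from_lane0 = __shfl_sync(0xffffffff, val, 0);

    // Warp reduction: repeatedly fold the upper half of the warp into the lower half.
    int sum = val;
    for (int offset = 16; offset > 0; offset /= 2)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) {
        *out_broadcast = from_lane0;  // 0
        *out_sum = sum;               // 0 + 1 + ... + 31 = 496
    }
}

int main() {
    int *d_b, *d_s;
    cudaMalloc(&d_b, sizeof(int));
    cudaMalloc(&d_s, sizeof(int));
    shuffle_demo<<<1, 32>>>(d_b, d_s);

    int h_b = 0, h_s = 0;
    cudaMemcpy(&h_b, d_b, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_s, d_s, sizeof(int), cudaMemcpyDeviceToHost);
    printf("broadcast from lane 0 = %d, warp sum = %d\n", h_b, h_s);
    cudaFree(d_b); cudaFree(d_s);
    return 0;
}
```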

0.2 Compute

0.3-0.5 Memory Basics