Video resource: https://www.bilibili.com/video/BV1QZ421N7pT?vd_source=85c9ce6d49ba579156fb1b41d0e606b3
GitHub reference: https://github.com/BBuf/how-to-optim-algorithm-in-cuda/tree/master/reduce
Book recommended by a PyTorch core dev: *Programming Massively Parallel Processors*
Programming Massively Parallel Processors.pdf
<aside> 💡
Progress: how-to-optim-algorithm-in-cuda / cuda-mode/lecture / 14-编程模型 (Programming Model)
</aside>
API
cudaMallocManaged
Allocates unified (managed) memory: a single pointer valid on both host and device, with pages migrated on demand by the CUDA driver, so no explicit cudaMemcpy is needed.
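A minimal sketch of unified memory (the kernel and sizes here are illustrative, not from the source): the same pointer is written by the host, read and modified by the device, then read back by the host. The only extra requirement is synchronizing before the host touches the data again.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    // One allocation, one pointer — usable from host and device alike.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = float(i);

    add_one<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();  // required before the host reads managed memory

    printf("data[0] = %f\n", data[0]);  // 1.0
    cudaFree(data);
    return 0;
}
```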
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernel, 0, 0);
Queries the block size that maximizes occupancy for the given kernel, and the minimum grid size needed to reach that occupancy.
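A sketch of how the occupancy API is typically used (the `scale` kernel is a made-up example): the runtime suggests a block size for the kernel, and the grid size is then rounded up from the problem size, not taken from `minGridSize`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for the occupancy-optimal block size for `scale`
    // (last two args: dynamic shared memory per block = 0, no block-size limit).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0);
    printf("minGridSize=%d blockSize=%d\n", minGridSize, blockSize);

    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    // Round the grid up so every element is covered.
    int gridSize = (n + blockSize - 1) / blockSize;
    scale<<<gridSize, blockSize>>>(data, n);
    cudaDeviceSynchronize();
    printf("data[0]=%f\n", data[0]);  // 2.0
    cudaFree(data);
    return 0;
}
```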
Function
__shfl_sync()
A warp-level intrinsic for inter-thread communication. It implements a "shuffle": each lane can read a register value from any other lane in the same warp — for example, broadcasting one lane's value to the whole warp — without going through shared memory.
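A small sketch of the broadcast use of `__shfl_sync` (the kernel name and values are illustrative): every lane in the warp reads lane 3's copy of `x`, with no shared memory or explicit synchronization beyond the mask.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void broadcast_demo(int* out) {
    int lane = threadIdx.x;  // launched with one warp: blockDim.x == 32
    int x = lane * 10;       // each lane holds a different value
    // All lanes in the full mask read lane 3's value of x.
    int y = __shfl_sync(0xffffffff, x, 3);
    out[lane] = y;           // every lane stores 30
}

int main() {
    int* out = nullptr;
    cudaMallocManaged(&out, 32 * sizeof(int));
    broadcast_demo<<<1, 32>>>(out);
    cudaDeviceSynchronize();
    printf("out[0]=%d out[31]=%d\n", out[0], out[31]);  // 30 30
    cudaFree(out);
    return 0;
}
```

The same family of intrinsics (e.g. `__shfl_down_sync`) is what the reduce kernels in the linked repo use to finish a reduction inside a warp without shared memory.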