| Method | SMs | Threads |
| --- | --- | --- |
| notify dispatch | 1 + 8 (1 send, 8 receive) | 128 |
| dispatch | 20 (even send, odd receive) | 128 |
| combine | 20 (even send, odd receive) | 768 |

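
The SM assignments listed above can be modeled on the host with a few small functions. This is a minimal sketch, not DeepEP's code: the names `Role`, `data_kernel_role`, `notify_role`, `channel_of`, and `count_role` are hypothetical, and two details are assumptions rather than facts from the table: that SM 0 is the single sender in notify dispatch, and that each even/odd SM pair forms one channel.

```cpp
#include <cassert>

enum class Role { Send, Receive };

// dispatch / combine: even-numbered SMs send, odd-numbered SMs receive.
inline Role data_kernel_role(int sm_id) {
    assert(sm_id >= 0);
    return sm_id % 2 == 0 ? Role::Send : Role::Receive;
}

// notify dispatch: 1 sending SM plus 8 receiving SMs.
// Assumption: SM 0 is the sender and SMs 1..8 are the receivers.
inline Role notify_role(int sm_id) {
    assert(sm_id >= 0);
    return sm_id == 0 ? Role::Send : Role::Receive;
}

// Assumption: each (even, odd) SM pair serves one channel,
// so 20 SMs would give 10 channels.
inline int channel_of(int sm_id) {
    assert(sm_id >= 0);
    return sm_id / 2;
}

// Count how many of the first num_sms SMs get the given role
// under the supplied role function.
template <typename RoleFn>
int count_role(int num_sms, Role wanted, RoleFn role_of) {
    int count = 0;
    for (int sm = 0; sm < num_sms; ++sm) {
        if (role_of(sm) == wanted) ++count;
    }
    return count;
}
```

With 20 SMs, `count_role(20, Role::Send, data_kernel_role)` gives an equal split between senders and receivers, and `count_role(9, Role::Receive, notify_role)` matches the 1 + 8 layout of notify dispatch.
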
| Idea | Description | Location |
| --- | --- | --- |
| dual-stream communication | switch between the communication stream and the compute stream | DeepEP/csrc/deep_ep.cpp |
| out-of-doc PTX load/store | loads and stores that bypass the L1 cache | DeepEP/csrc/kernels/utils.cuh |
| warp-specialized kernels | branch on warp role so that each warp runs a single code path | DeepEP/csrc/kernels/intranode.cu |
| topology-aware routing | forward over either IB or NVLink | DeepEP/deep_ep/buffer.py |
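
To make the topology-aware routing idea concrete, here is a small host-side sketch of how a route between two ranks might be planned. The source only states that traffic is forwarded over either IB or NVLink. Everything else here is an assumption for illustration: the node-major rank numbering, the 8 GPUs per node, the rule that cross-node traffic first goes over IB to the GPU with the same local index on the destination node, and the hypothetical names `Link`, `Hop`, and `plan_route`.

```cpp
#include <cassert>
#include <vector>

enum class Link { NVLink, IB };

struct Hop {
    int to_rank;  // rank reached by this hop
    Link link;    // interconnect used for this hop
};

// Hypothetical helper: plan the hops from rank src to rank dst.
// Assumption: ranks are numbered node-major, gpus_per_node ranks per node.
inline std::vector<Hop> plan_route(int src, int dst, int gpus_per_node = 8) {
    assert(gpus_per_node > 0 && src >= 0 && dst >= 0);
    std::vector<Hop> hops;
    if (src == dst) return hops;  // local data, nothing to forward

    const int src_node = src / gpus_per_node;
    const int dst_node = dst / gpus_per_node;

    if (src_node == dst_node) {
        // Same node: a single NVLink hop.
        hops.push_back({dst, Link::NVLink});
        return hops;
    }

    // Different nodes (assumed rule): go over IB to the GPU with the same
    // local index on the destination node, then forward over NVLink.
    const int relay = dst_node * gpus_per_node + src % gpus_per_node;
    hops.push_back({relay, Link::IB});
    if (relay != dst) hops.push_back({dst, Link::NVLink});
    return hops;
}
```

For example, under these assumptions `plan_route(1, 12)` yields an IB hop to rank 9 followed by an NVLink hop to rank 12, while `plan_route(0, 3)` is a single NVLink hop.
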