Sample: globalToShmemAsyncCopy
Minimum spec: SM 7.0

This sample implements matrix multiplication that copies data asynchronously from global to shared memory when running on devices of compute capability 8.0 or higher. It also demonstrates the arrive-wait barrier for synchronization.
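The two mechanisms named above can be sketched together in a few lines. This is a minimal illustration, not the sample's actual kernel: it stages one tile per block into shared memory with cuda::memcpy_async and waits on a block-scoped cuda::barrier before reading the tile. The kernel name scaleViaShared, the constant TILE, and the doubling operation are hypothetical and stand in for the matrix multiplication; error checking is omitted for brevity.

```cuda
#include <cooperative_groups.h>
#include <cuda/barrier>
#include <cstdio>
#include <vector>

namespace cg = cooperative_groups;

constexpr int TILE = 256;  // hypothetical tile size: elements staged per block

// Hypothetical kernel: each block stages TILE floats into shared memory
// asynchronously, then writes them back multiplied by two.
__global__ void scaleViaShared(const float *in, float *out)
{
    __shared__ float tile[TILE];
    // Block-scoped arrive-wait barrier that tracks completion of the copy.
    // (nvcc may warn about dynamic initialization of this shared variable.)
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    auto block = cg::this_thread_block();
    if (block.thread_rank() == 0) {
        init(&bar, block.size());  // one expected arrival per thread
    }
    block.sync();  // make the initialized barrier visible to every thread

    // Collective call: all threads of the block submit the copy together and
    // bind its completion to the barrier.
    cuda::memcpy_async(block, tile, in + blockIdx.x * TILE,
                       sizeof(float) * TILE, bar);

    // Arrive, then block until all threads have arrived and the copy is done.
    bar.arrive_and_wait();

    out[blockIdx.x * TILE + threadIdx.x] = 2.0f * tile[threadIdx.x];
}

int main()
{
    const int blocks = 4;
    const int n = blocks * TILE;
    const size_t bytes = n * sizeof(float);

    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, host.data(), bytes, cudaMemcpyHostToDevice);

    scaleViaShared<<<blocks, TILE>>>(dIn, dOut);
    cudaMemcpy(host.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    // Verify on the host that every element was doubled.
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (host[i] != 2.0f * static_cast<float>(i)) ok = false;
    }
    printf("%s\n", ok ? "PASSED" : "FAILED");

    cudaFree(dIn);
    cudaFree(dOut);
    return ok ? 0 : 1;
}
```

The sketch assumes a build targeting SM 7.0 or higher, which cuda::barrier requires. On devices below compute capability 8.0 the same call is expected to still work, only without hardware acceleration of the copy.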

Key concepts:
CUDA Runtime API
Linear Algebra
CPP11 CUDA
