A synthetic benchmarking tool to measure peak capabilities of opencl devices. It only measures the peak metrics that can be achieved using vector operations and does not represent a real-world use case
git submodule update --init --recursive --remote mkdir build cd build cmake .. cmake --build .
Platform: NVIDIA CUDA Device: Tesla V100-SXM2-16GB Driver version : 390.77 (Linux x64) Compute units : 80 Clock frequency : 1530 MHz Global memory bandwidth (GBPS) float : 767.48 float2 : 810.81 float4 : 843.06 float8 : 726.12 float16 : 735.98 Single-precision compute (GFLOPS) float : 15680.96 float2 : 15674.50 float4 : 15645.58 float8 : 15583.27 float16 : 15466.50 No half precision support! Skipped Double-precision compute (GFLOPS) double : 7859.49 double2 : 7849.96 double4 : 7832.96 double8 : 7799.82 double16 : 7740.88 Integer compute (GIOPS) int : 15653.47 int2 : 15654.40 int4 : 15655.21 int8 : 15659.04 int16 : 15608.65 Transfer bandwidth (GBPS) enqueueWriteBuffer : 10.64 enqueueReadBuffer : 11.92 enqueueMapBuffer(for read) : 9.97 memcpy from mapped ptr : 8.62 enqueueUnmap(after write) : 11.04 memcpy to mapped ptr : 9.16 Kernel launch latency : 7.22 us