torch.utils.bottleneck
======================

.. automodule:: torch.utils.bottleneck
.. currentmodule:: torch.utils.bottleneck

`torch.utils.bottleneck` is a tool that can be used as an initial step for
debugging bottlenecks in your program. It summarizes runs of your script with
the Python profiler and PyTorch's autograd profiler.

Run it on the command line with

::

    python -m torch.utils.bottleneck /path/to/source/script.py [args]

where [args] are any number of arguments to `script.py`, or run
``python -m torch.utils.bottleneck -h`` for more usage instructions.

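For example, a minimal self-contained script and the corresponding invocation
might look like the sketch below; the file name ``toy_script.py`` and its
argument are hypothetical placeholders.

::

    # toy_script.py -- a small script to profile (illustrative only)
    import sys

    import torch


    def main():
        size = int(sys.argv[1]) if len(sys.argv) > 1 else 1024
        x = torch.randn(size, size, requires_grad=True)
        for _ in range(10):
            # Each iteration builds a fresh graph, so backward() is valid.
            (x @ x).sum().backward()


    if __name__ == "__main__":
        main()

    # Profile it with:
    #   python -m torch.utils.bottleneck toy_script.py 2048
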
.. warning::
    Because your script will be profiled, please ensure that it exits in a
    finite amount of time.

.. warning::
    Due to the asynchronous nature of CUDA kernels, when running against
    CUDA code, the cProfile output and CPU-mode autograd profilers may
    not show correct timings: the reported CPU time reflects only the time
    used to launch the kernels and does not include the time the kernels
    spent executing on the GPU unless the operation synchronizes.
    Ops that do synchronize appear to be extremely expensive under regular
    CPU-mode profilers.
    In cases where timings are incorrect, the CUDA-mode autograd profiler
    may be helpful.

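As a rough illustration of why asynchronous execution skews CPU-side timings,
the sketch below compares naive wall-clock timing of a CUDA matmul against
timing that calls :func:`torch.cuda.synchronize` before reading the clock. The
tensor size is arbitrary.

::

    import time

    import torch

    x = torch.randn(4096, 4096, device="cuda")
    (x @ x).sum()              # warm-up (kernel launch, cuBLAS init)
    torch.cuda.synchronize()

    # Naive timing: mostly measures the time to *launch* the kernel.
    start = time.perf_counter()
    y = x @ x
    naive = time.perf_counter() - start

    # Synchronized timing: waits for the kernel to finish before reading
    # the clock.
    start = time.perf_counter()
    y = x @ x
    torch.cuda.synchronize()
    synced = time.perf_counter() - start

    print(f"launch-only: {naive:.6f}s, end-to-end: {synced:.6f}s")
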
.. note::
    To decide which (CPU-only-mode or CUDA-mode) autograd profiler output to
    look at, you should first check if your script is CPU-bound
    ("CPU total time is much greater than CUDA total time").
    If it is CPU-bound, looking at the results of the CPU-mode autograd
    profiler will help. If, on the other hand, your script spends most of its
    time executing on the GPU, then it makes sense to start
    looking for the responsible CUDA operators in the output of the CUDA-mode
    autograd profiler.

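If you want to inspect the CUDA-mode autograd profiler output outside of
``bottleneck``, a minimal sketch is shown below. The model and tensor shapes
are arbitrary, and ``use_cuda=True`` refers to the legacy autograd profiler
interface; recent releases also offer :mod:`torch.profiler`.

::

    import torch

    # Arbitrary toy model; any CUDA workload would do.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024),
        torch.nn.ReLU(),
        torch.nn.Linear(1024, 10),
    ).cuda()
    x = torch.randn(64, 1024, device="cuda")

    # Record both CPU-side launches and GPU kernel times.
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for _ in range(10):
            model(x).sum().backward()

    # Sort by total CUDA time to find the most expensive GPU operators.
    print(prof.key_averages().table(sort_by="cuda_time_total"))
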
Of course, reality is more complicated: depending on the part of the model
you're evaluating, your script might not fall neatly into either of those two
extremes. If the profiler outputs don't help, you could try looking at
the result of :func:`torch.autograd.profiler.emit_nvtx()` with ``nvprof``.
However, please take into account that the NVTX overhead is very high and
often gives a heavily skewed timeline. Similarly, Intel VTune Profiler helps
to analyze performance on Intel platforms further with
:func:`torch.autograd.profiler.emit_itt()`.

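A minimal sketch of the ``emit_nvtx`` approach is shown below. The script name
in the comment is a placeholder, and on recent CUDA toolkits ``nvprof`` has
been superseded by Nsight Systems (``nsys``).

::

    import torch

    x = torch.randn(1024, 1024, device="cuda", requires_grad=True)

    # Only profile the region inside torch.cuda.profiler.profile(), and emit
    # an NVTX range for every autograd operator so the external profiler can
    # correlate operators with the kernels they launch.
    with torch.cuda.profiler.profile():
        with torch.autograd.profiler.emit_nvtx():
            (x @ x).sum().backward()

    # Then run the whole script under the external profiler, e.g.:
    #   nvprof --profile-from-start off -o trace.prof -- python my_script.py
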
.. warning::
    If you are profiling CUDA code, the first profiler that ``bottleneck`` runs
    (cProfile) will include the CUDA startup time (CUDA buffer allocation cost)
    in its time reporting. This should not matter if your bottlenecks are in
    code that is much slower than the CUDA startup time.

For more complicated uses of the profilers (like in a multi-GPU case),
please see https://docs.python.org/3/library/profile.html
or :func:`torch.autograd.profiler.profile()` for more information.

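For finer-grained control than ``bottleneck`` provides, you can also drive the
autograd profiler yourself and label logical regions of your code; a minimal
sketch is shown below, where the region names are arbitrary.

::

    import torch

    model = torch.nn.Linear(512, 512)
    x = torch.randn(128, 512)

    # Profile only this part of the script, labelling forward and backward so
    # they appear as named rows in the summary table.
    with torch.autograd.profiler.profile() as prof:
        with torch.autograd.profiler.record_function("forward"):
            y = model(x).sum()
        with torch.autograd.profiler.record_function("backward"):
            y.backward()

    print(prof.key_averages().table(sort_by="cpu_time_total"))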