# Instruction count microbenchmarks
## Quick start
### To run the benchmark:
```
# From pytorch root
cd benchmarks/instruction_counts
python main.py
```
Currently `main.py` runs the benchmarks with a very simple thread pool (so that
the total run time isn't onerous) and simply prints the results. These
components will be upgraded in subsequent PRs.
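
A rough sketch of that flow, using `concurrent.futures` directly rather than the
suite's own runner (the benchmark list and worker count below are illustrative,
and `collect_callgrind` requires Valgrind to be installed):

```
from concurrent.futures import ThreadPoolExecutor

from torch.utils.benchmark import Timer

# Illustrative stand-ins for the materialized benchmark definitions.
benchmarks = [
    ("mul", Timer(stmt="y = x * x", setup="import torch\nx = torch.ones((4, 4))")),
    ("add", Timer(stmt="y = x + x", setup="import torch\nx = torch.ones((4, 4))")),
]

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [
        (label, pool.submit(timer.collect_callgrind, number=10))
        for label, timer in benchmarks
    ]

for label, future in futures:
    # CallgrindStats.counts() reports total instructions executed.
    print(label, future.result().counts(denoise=True))
```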
### To define a new benchmark:
* `TimerArgs`: Low-level definition which maps directly to
  `torch.utils.benchmark.Timer`
* `GroupedStmts`: Benchmark a snippet (Python, C++, or both). Can automatically
  generate TorchScript and autograd variants.
* `GroupedModules`: Like `GroupedStmts`, but takes `nn.Module`s. (A rough sketch
  follows this list.)
* `GroupedVariants`: Benchmark-per-line to define many related benchmarks in a
  single code block.
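
For illustration, a Python-only `GroupedModules` definition might look roughly
like the following. The argument names here are assumptions, not the real API;
the authoritative constructor lives in `core/api.py`, with real usage in
`definitions/standard.py`:

```
# Hypothetical sketch: argument names are illustrative only.
benchmark = GroupedModules(
    py_model_setup="model = torch.nn.Linear(4, 4)",
    setup=GroupedSetup(py_setup="x = torch.ones((4, 4))"),
    signature="f(x) -> y",
    torchscript=True,
)
```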
## Architecture
### Benchmark definition.
One primary goal of this suite is to make it easy to define semantically
related clusters of benchmarks. The crux of this effort is the
`GroupedBenchmark` class, which is defined in `core/api.py`. It takes a
definition for a set of related benchmarks, and produces one or more concrete
cases. It's helpful to see an example to understand how the machinery works.
Consider the following benchmark:
```
# `GroupedStmts` is an alias of `GroupedBenchmark.init_from_stmts`
benchmark = GroupedStmts(
    py_stmt=r"y = x * w",
    cpp_stmt=r"auto y = x * w;",

    setup=GroupedSetup(
        py_setup="""
            x = torch.ones((4, 4))
            w = torch.ones((4, 4), requires_grad=True)
        """,

        cpp_setup="""
            auto x = torch::ones({4, 4});
            auto w = torch::ones({4, 4});
            w.set_requires_grad(true);
        """,
    ),

    signature="f(x, w) -> y",
    torchscript=True,
    autograd=True,
)
```
It is trivial to generate Timers for the eager forward mode case (ignoring
`num_threads` for now):
```
Timer(
    stmt=benchmark.py_fwd_stmt,
    setup=benchmark.setup.py_setup,
)

Timer(
    stmt=benchmark.cpp_fwd_stmt,
    setup=benchmark.setup.cpp_setup,
    language="cpp",
)
```
Moreover, because `signature` is provided we know that creation of `x` and `w`
is part of setup, and the overall computation uses `x` and `w` to produce `y`.
As a result, we can derive TorchScript'd and AutoGrad variants as well. We can
deduce that a TorchScript model will take the form:
```
@torch.jit.script
def f(x, w):
    # Paste `benchmark.py_fwd_stmt` into the function body.
    y = x * w
    return y  # Set by `-> y` in signature.
```
And because we will want to use this model in both Python and C++, we save it to
disk and load it as needed. At this point Timers for TorchScript become:
```
Timer(
    stmt="""
        y = jit_model(x, w)
    """,

    setup="""
        # benchmark.setup.py_setup
        # jit_model = torch.jit.load(...)
        # Warm up jit_model
    """,
)

Timer(
    stmt="""
        std::vector<torch::jit::IValue> ivalue_inputs({
            torch::jit::IValue(x),
            torch::jit::IValue(w)
        });
        auto y = jit_model.forward(ivalue_inputs);
    """,

    setup="""
        # benchmark.setup.cpp_setup
        # jit_model = torch::jit::load(...)
        # Warm up jit_model
    """,
)
```
While nothing above is particularly complex, there is non-trivial bookkeeping
(managing the model artifact, setting up IValues) which, if done manually, would
be bug-prone and hard to read.
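
For reference, the Python side of that artifact management amounts to scripting
the function once, saving it, and loading it back before timing; a minimal
sketch (the path and warm-up loop are illustrative, not what the suite actually
uses):

```
import torch

@torch.jit.script
def f(x, w):
    y = x * w
    return y

artifact_path = "/tmp/f.pt"  # illustrative location
f.save(artifact_path)

# Worker side: load the artifact and warm it up before timing.
jit_model = torch.jit.load(artifact_path)
for _ in range(3):
    jit_model(torch.ones((4, 4)), torch.ones((4, 4), requires_grad=True))
```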
The story is similar for autograd: because we know the output variable (`y`)
and we make sure to assign it when calling TorchScript models, testing AutoGrad
is as simple as appending `y.backward()` (or `y.backward();` in C++) to the
stmt of the forward-only variant. Of course, this requires that `signature` be
provided, as there is nothing special about the name `y`.
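
Concretely, the derivation is just string manipulation on the forward stmt; a
Python sketch of the idea (the variable names below are illustrative, and the
real logic in `core/api.py` also handles the C++ and TorchScript variants):

```
fwd_stmt = "y = x * w"  # benchmark.py_fwd_stmt
output_var = "y"        # parsed from signature="f(x, w) -> y"

# Append the backward call to get the autograd variant's stmt.
autograd_stmt = f"{fwd_stmt}\n{output_var}.backward()"
print(autograd_stmt)
# y = x * w
# y.backward()
```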
The logic for the manipulations above is split between `core/api.py` (for
generating `stmt` based on language, Eager/TorchScript, with or without AutoGrad)
and `core/expand.py` (for larger, more expansive generation). The benchmarks
themselves are defined in `definitions/standard.py`. The current set is chosen
to demonstrate the various model definition APIs, and will be expanded when the
benchmark runner infrastructure is better equipped to deal with a larger run.
### Benchmark execution.
Once `expand.materialize` has flattened the abstract benchmark definitions into
`TimerArgs`, they can be sent to a worker subprocess (`worker/main.py`) for
execution. This worker has no concept of the larger benchmark suite; each
`TimerArgs` maps directly and one-to-one onto the `torch.utils.benchmark.Timer`
instance that the worker instantiates.
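
A rough sketch of that worker-side mapping (the `TimerArgs` field names below
are assumptions for illustration; consult the suite for the actual definition):

```
from dataclasses import dataclass

from torch.utils.benchmark import Timer


@dataclass(frozen=True)
class TimerArgs:
    # Field names are illustrative, not the actual TimerArgs fields.
    stmt: str
    setup: str
    num_threads: int = 1
    language: str = "python"


def run(timer_args: TimerArgs, n: int = 100):
    timer = Timer(
        stmt=timer_args.stmt,
        setup=timer_args.setup,
        num_threads=timer_args.num_threads,
        language=timer_args.language,
    )
    # Instruction counts come from Callgrind rather than wall clock time.
    return timer.collect_callgrind(number=n)
```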