
 # Local A/B MicroBenchmark Automation Tool

 This tool automates the process of running local A/B performance microbenchmarks by
 comparing two Git revisions (branches, commits, tags, etc.). It is designed to help developers
 quickly measure the performance impact of their changes before code submission.

 The script performs the following actions:

 1.  Runs a specified benchmark test multiple times on a base revision (e.g., `main`).
 2.  Runs the same test multiple times on a comparison revision (e.g., a feature branch or `HEAD`).
 3.  Collects the timing results from all test runs for both revisions.
 4.  Performs a statistical analysis to determine if there is a significant performance difference between the two revisions.
 5.  Outputs a human-readable summary, a machine-readable CSV summary, a metadata file, and a histogram plot of the results.

 ## Prerequisites

 Before running the tool, please ensure the following conditions are met:

 1.  **Clean Git Working Directory**: Your repository must have no uncommitted changes, staged files, or untracked files. The script checks this and exits if the working tree is not clean. Commit or stash your changes first (`git stash -u` also stashes untracked files).
 2.  **Connected Android Device(s)**:
     *   Benchmarks can only be run on a connected Android device.
     *   If a single device is connected via ADB, the script will automatically target it.
     *   If multiple devices are connected, you **must** use the `--serial` flag to specify which one to use. The script will exit with an error if multiple devices are detected without a specified ID.
 3.  **Valid Git Revisions**: The Git revisions you provide for comparison must be valid and exist locally.
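The Git-related prerequisites can be checked by hand before starting a long run. The sketch below is not the tool's own check; it is a minimal approximation built on standard `git` commands, and the function names `tree_is_clean` and `rev_exists` are made up for illustration.

```bash
# Hypothetical pre-flight helpers; the tool performs its own checks internally.

# Succeeds when `git status --porcelain` prints nothing for the repository
# at $1 (default: current directory), i.e. there are no modified, staged,
# or untracked files. Returns 2 if git itself fails.
tree_is_clean() {
  local status_out
  status_out="$(git -C "${1:-.}" status --porcelain 2>/dev/null)" || return 2
  [ -z "$status_out" ]
}

# Succeeds when revision $1 resolves to a commit in the repository at $2
# (default: current directory).
rev_exists() {
  git -C "${2:-.}" rev-parse --verify --quiet "$1^{commit}" >/dev/null
}

# Example use, kept as a comment so that sourcing this file has no side effects:
#   tree_is_clean && rev_exists main && rev_exists my-perf-fix
```

Running both helpers first avoids waiting for a Gradle start-up only to have the script exit on a dirty working tree or a mistyped revision.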

 ## Reducing Noise

 To get more stable and reliable benchmark results, it's important to minimize
 environmental noise. Here are some recommendations:

 *   **Disable JIT (Just-In-Time) Compilation**: The JIT compiler can introduce
     variability. Use the provided script to disable it:
     ```bash
     ./benchmark/gradle-plugin/src/main/resources/scripts/disableJit.sh
     ```
 *   **Lock CPU and GPU Clocks**: Fluctuations in clock speeds can affect
     measurements. Lock them with:
     ```bash
     ./benchmark/gradle-plugin/src/main/resources/scripts/lockClocks.sh
     ```
 *   **Use an AOSP `userdebug` Build**: Flash your device with a `userdebug` build of AOSP
     for more performance control. AOSP builds do not include GMS services, which reduces background interference.
 *   **Minimize Device Activity**:
     *   Disable Wi-Fi, mobile data, and NFC.
     *   Enable Airplane Mode.
     *   Clear all applications from the "Recents" screen.

 ## Usage

 The script is executed via Gradle from the `development/ab-benchmarking` directory.

 ```bash
 ./gradlew run --args="<rev_a> <rev_b> <module> <benchmark_test> [options]"
 ```

 ### Finding the Device Serial
 To find the serial ID of all connected devices, run the following ADB command in your terminal:
 ```bash
 adb devices
 ```
 The output will list your connected devices. The string in the first column is the serial ID.
 ```
 List of devices attached
 emulator-5554   device
 123456789ABCDEF device
 ```
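When scripting around the tool, the serial can be pulled out of the `adb devices` output instead of being copied by hand. The following is a sketch under the assumption that the output has the two-column format shown above; `list_serials` is a hypothetical helper name.

```bash
# Hypothetical helper: reads `adb devices` output on stdin and prints the
# serial of every entry whose state column is exactly "device".
# The header line and entries in other states (for example "offline" or
# "unauthorized") are skipped because their second column differs.
list_serials() {
  awk '$2 == "device" { print $1 }'
}

# Example use (needs adb on PATH), kept as a comment so that sourcing this
# file has no side effects:
#   adb devices | list_serials
```

If this prints more than one line, pick one of the serials and pass it with `--serial`.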

 ### Command-Line Arguments

 The script accepts the following positional arguments in order:

 1.  `rev_a` (String): The first Git revision to test (e.g., a branch, commit hash, tag, or `HEAD`).
 2.  `rev_b` (String): The second Git revision to test.
 3.  `module` (String): The Gradle module path containing the benchmark test.
     *   *Example*: `compose:ui:ui-benchmark`
 4.  `benchmark_test` (String): The fully qualified class name of the benchmark test to run.
     *   *Example*: `androidx.compose.ui.benchmark.accessibility.AccessibilityBenchmark`
     *   **Note**: To run a single method within a test class, use the format `ClassName#methodName`.
     *   *Example*: `androidx.compose.ui.benchmark.ModifiersBenchmark#full[clickable_1x]`

 ### Options

 *   `--run_count` (Int): The number of times the entire test suite should be run on *each* revision to gather a sample set. For example, a `run_count` of 10 will result in 10 test executions on `rev_a` and 10 on `rev_b`. Defaults to `1`.
 *   `--iteration_count` (Int): The number of internal iterations the benchmark framework should perform in a single test run. This is passed directly to the `androidx.benchmark.iterations` argument. Defaults to `50`.
 *   `--serial` (String): The serial ID of the target Android device to use for benchmarking. This is **required** if more than one device is connected. Use the `adb devices` command to find the ID.
 *   `--output_path` (String): The path where temporary and final result files should be stored. This includes intermediate CSV files, the final `.metadata.json` file, and a histogram plot. Defaults to `~/androidx-main/frameworks/support/development/ab-benchmarking/app/build/benchmark-results/`.

 ## Example Commands

 ### Comparing Branches

 Here is an example that compares the `main` branch against a feature branch named `my-perf-fix`.

 ```bash
 ./gradlew run --args="main my-perf-fix compose:ui:ui-benchmark androidx.compose.ui.benchmark.accessibility.AccessibilityBenchmark --run_count 5 --iteration_count 1000 --serial emulator-5554"
 ```

 ### Comparing a Commit Against its Parent

 To measure the impact of the very last commit on the current branch, you can compare `HEAD` with its parent, `HEAD~1`.

 ```bash
 ./gradlew run --args="HEAD~1 HEAD compose:ui:ui-benchmark androidx.compose.ui.benchmark.accessibility.AccessibilityBenchmark --run_count 3 --iteration_count 1500 --serial emulator-5554"
 ```

 ### Comparing a Single Benchmark Method

 To isolate the performance of a specific method within a benchmark class, use the `#` separator.

 ```bash
 ./gradlew run --args="main my-perf-fix compose:ui:ui-benchmark androidx.compose.ui.benchmark.accessibility.AccessibilityBenchmark#mySpecificTest --run_count 5 --iteration_count 500 --serial emulator-5554"
 ```

 ## Interpreting the Output

 The tool produces four forms of output: a human-readable summary, a machine-readable CSV line, a metadata JSON file, and a histogram plot.

 ### Statistical Summary

 The summary provides descriptive statistics for the benchmark timings (in nanoseconds) from both datasets (revisions) and an analysis of their difference.

 ```
 --- Comparison for: withTrailingLambdas_compose ---
                        Dataset 1 (Branch A) | Dataset 2 (Branch B)
 ----------------------------------------------------------------
 Count                | 100                  | 100
 Min (ns)             | 160768.30            | 160433.01
 Mean (ns)            | 167300.93            | 167229.12
 Median (ns)          | 164604.77            | 164904.08
 Std. Dev. (ns)       | 8627.72              | 7774.46
 Min Difference:      | -335.28 ns (-0.21%)
 Mean Difference:     | -71.81 ns (-0.04%)
 Median Difference:   | 299.31 ns (0.18%)
 95% CI of Diff:      | [-678.86, 1390.19] ns ([-0.41%, 0.84%])

 The confidence interval contains zero, suggesting no statistically significant difference between the medians.

 --- MannWhitneyUTest Results (Branch B vs. Branch A) ---
 P-value:                 0.5675
 Result:                  No statistically significant difference.

 -------------------------------------------------------
 ```

 *   **Descriptive Statistics**: `Count`, `Min`, `Mean`, `Median`, and `Std. Dev.` provide a basic overview of the timing distributions for each revision.
 *   **Median Difference**: Shows the absolute and percentage change between the median of Rev B and Rev A. A negative value indicates a performance improvement (Rev B is faster).
 *   **95% CI of Diff**: The 95% bootstrap confidence interval for the median difference.
     *   If this interval **does not** contain zero, there is strong evidence that a real performance difference exists.
     *   If this interval **does** contain zero, the observed difference may be due to random chance.
 *   **P-value**: The result of the Mann-Whitney U Test.
     *   A p-value less than `0.05` indicates a **statistically significant** difference between the two revisions.
     *   A p-value greater than or equal to `0.05` suggests no statistically significant difference was detected.
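The two decision rules above can be applied mechanically to the printed values. This sketch illustrates those rules and is not the tool's implementation; `interpret_result` and its argument order are assumptions made for the example. Reporting a disagreement between the p-value and the confidence interval as inconclusive is a choice made here, not something the tool outputs.

```bash
# Hypothetical helper. Arguments: p-value, CI lower bound, CI upper bound
# (bounds in ns, for the median difference rev_b - rev_a).
interpret_result() {
  awk -v p="$1" -v lo="$2" -v hi="$3" 'BEGIN {
    p += 0; lo += 0; hi += 0              # force numeric comparison
    significant = (p < 0.05)              # Mann-Whitney U threshold
    excludes_zero = (lo > 0 || hi < 0)    # CI lies entirely on one side of 0
    if (significant && excludes_zero)
      print (hi < 0 ? "rev_b faster" : "rev_b slower")
    else if (!significant && !excludes_zero)
      print "no significant difference"
    else
      print "inconclusive: p-value and CI disagree"
  }'
}

# Values from the sample summary above; prints "no significant difference".
interpret_result 0.5675 -678.86 1390.19
```

A negative interval maps to "rev_b faster" because a negative median difference means revision B took less time.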

 ### Machine-Readable CSV

 A single CSV line is printed for easy parsing by other scripts or for logging results.

 ```
 --- Machine-Readable CSV ---
 benchmarkName,count,min1,min2,min_diff,min_diff_%,mean1,mean2,mean_diff_%,median1,median2,p-value,median_diff_%,median_diff,median_diff_ci_lower,median_diff_ci_upper,median_diff_ci_lower_%,median_diff_ci_upper_%
 withTrailingLambdas_compose,100,160768.30,160433.01,-335.28,-0.21%,167300.93,167229.12,-0.04%,164604.77,164904.08,0.5675,0.18%,299.31,-678.86,1390.19,-0.41%,0.84%
 ```
 *   `benchmarkName`: The name of the benchmark test method.
 *   `count`: The number of measurements taken for each revision.
 *   `min1`: The minimum timing value for revision A.
 *   `min2`: The minimum timing value for revision B.
 *   `min_diff`: The absolute difference in minimums (`min2 - min1`).
 *   `min_diff_%`: The percentage difference in minimums.
 *   `mean1`: The mean (average) timing for revision A.
 *   `mean2`: The mean (average) timing for revision B.
 *   `mean_diff_%`: The percentage difference in means.
 *   `median1`: Median of the baseline revision (A).
 *   `median2`: Median of the comparison revision (B).
 *   `p-value`: The p-value from the Mann-Whitney U test.
 *   `median_diff_%`: The percentage difference in medians (`(median2 - median1) / median1`).
 *   `median_diff`: The absolute difference in medians (`median2 - median1`).
 *   `median_diff_ci_lower`: The lower bound of the 95% confidence interval for the median difference.
 *   `median_diff_ci_upper`: The upper bound of the 95% confidence interval for the median difference.
 *   `median_diff_ci_lower_%`: The lower bound of the confidence interval as a percentage.
 *   `median_diff_ci_upper_%`: The upper bound of the confidence interval as a percentage.
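Because the header names every column, downstream scripts can look fields up by name rather than by position. The sketch below assumes the header and data line are piped in without the `--- Machine-Readable CSV ---` banner. `csv_field` and `pct_diff` are hypothetical helpers, and the plain comma split only works because no field in the sample contains a comma. The percentages in the sample row are consistent with dividing by the revision A value, which is what `pct_diff` does.

```bash
# Hypothetical helper: print the value of column $1 from a two-line CSV
# (header plus one data row) read from stdin.
csv_field() {
  awk -F',' -v name="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
    col     { print $col; exit }
  '
}

# Recompute a relative change in percent: (b - a) / a * 100.
pct_diff() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f%%\n", (b - a) / a * 100 }'
}

# median1 and median2 from the sample row; prints 0.18%
pct_diff 164604.77 164904.08
```

For example, `csv_field p-value` applied to the sample output yields `0.5675`, which a wrapper script could compare against its own threshold.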

 ### Metadata File

 A JSON file named `metadata.json` is created in the output directory. It contains information about the benchmark run, including:
 *   Timestamp of the execution
 *   Git revision information (name and commit hash)
 *   Device information
 *   Input parameters for the run

 ### Histogram Plot

 A PNG image file named `<benchmark_name>_histogram.png` is created in the output directory, where `<benchmark_name>` is the name of the benchmark test method.
 This plot visualizes the distribution of the benchmark timings for both revisions, making it easier to spot differences in performance.
 **Note**: In the message below, `<path_to_output_dir>` is the value passed to the `--output_path` parameter.
 If this parameter is not specified, it defaults to `~/androidx-main/frameworks/support/development/ab-benchmarking/app/build/benchmark-results/`.
 ```
 --- Graphical Plot ---
 Saved histogram to: file://<path_to_output_dir>/<benchmark_name>_histogram.png
 ```