This tool automates the process of running local A/B performance micro benchmarks by comparing two Git revisions (branches, commits, tags, etc.). It is designed to help developers quickly measure the performance impact of their changes before code submission.
The script performs the following actions:
1. Checks out the first revision under test (e.g., main).
2. Builds and runs the specified benchmark test on that revision, repeating the run the requested number of times.
3. Checks out the second revision (e.g., HEAD~1) and repeats the same benchmark runs.
4. Statistically compares the two sets of timings and reports the results.
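To make that flow concrete, here is a minimal Python sketch of the checkout-and-measure loop. It is illustrative only, not the tool's actual code: run_benchmark_once is a made-up placeholder for the real step, which builds the module and runs the instrumented benchmark on the connected device.

```python
# Minimal, illustrative sketch of the A/B loop; NOT the tool's real code.
import random
import subprocess

def run_benchmark_once() -> float:
    # Placeholder: the real tool builds and runs the benchmark on the
    # device and parses the reported timing. Here we fabricate a value.
    return random.gauss(165_000.0, 8_000.0)  # fabricated timing in ns

def collect_samples(rev: str, run_count: int) -> list[float]:
    """Check out `rev`, then gather one timing sample per benchmark run."""
    subprocess.run(["git", "checkout", rev], check=True)
    return [run_benchmark_once() for _ in range(run_count)]

samples_a = collect_samples("main", run_count=5)
samples_b = collect_samples("my-perf-fix", run_count=5)
# samples_a and samples_b then feed the statistical comparison shown below.
```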
Before running the tool, please ensure the following conditions are met:

- An Android device or emulator is connected and visible to adb. If more than one device is connected, pass the --serial flag to specify which one to use; the script will exit with an error if multiple devices are detected without a specified ID.

To get more stable and reliable benchmark results, it is important to minimize environmental noise. Here are some recommendations:
- Disable JIT compilation on the device: ./benchmark/gradle-plugin/src/main/resources/scripts/disableJit.sh
- Lock the device clocks: ./benchmark/gradle-plugin/src/main/resources/scripts/lockClocks.sh
- Use a userdebug build of AOSP for more control over device performance. AOSP builds do not include GMS services, which reduces background interference.

The script is executed via Gradle from the development/ab-benchmarking directory.
./gradlew run --args="<rev_a> <rev_b> <module> <benchmarkTest> [options]"
To find the serial ID of all connected devices, run the following ADB command in your terminal:
adb devices
The output will list your connected devices. The string in the first column is the serial ID.
List of devices attached
emulator-5554	device
123456789ABCDEF	device
The script accepts the following positional arguments in order:
- rev_a (String): The first Git revision to test (e.g., a branch name, commit hash, tag, or HEAD).
- rev_b (String): The second Git revision to test.
- module (String): The Gradle module path containing the benchmark test, e.g., compose:ui:ui-benchmark.
- benchmark_test (String): The fully qualified class name of the benchmark test to run, e.g., androidx.compose.ui.benchmark.accessibility.AccessibilityBenchmark. To run a single method, use the ClassName#methodName form, e.g., androidx.compose.ui.benchmark.ModifiersBenchmark#full[clickable_1x].

The following optional flags are also supported:

- --run_count (Int): The number of times the entire test suite is run on each revision to gather a sample set. For example, a run_count of 10 results in 10 test executions on rev_a and 10 on rev_b. Defaults to 1.
- --iteration_count (Int): The number of internal iterations the benchmark framework performs in a single test run. This is passed directly to the androidx.benchmark.iterations argument. Defaults to 50.
- --serial (String): The serial ID of the target Android device to use for benchmarking. Required if more than one device is connected; use the adb devices command to find the ID.
- --output_path (String): The path where temporary and final result files are stored, including intermediate CSV files, the final metadata.json file, and a histogram plot. Defaults to ~/androidx-main/frameworks/support/development/ab-benchmarking/app/build/benchmark-results/.

Here is an example that compares the main branch against a feature branch named my-perf-fix.
./gradlew run --args="main my-perf-fix compose:ui:ui-benchmark androidx.compose.ui.benchmark.accessibility.AccessibilityBenchmark --run_count 5 --iteration_count 1000 --serial emulator-5554"
To measure the impact of the very last commit on the current branch, you can compare HEAD with its parent, HEAD~1.
./gradlew run --args="HEAD~1 HEAD compose:ui:ui-benchmark androidx.compose.ui.benchmark.accessibility.AccessibilityBenchmark --run_count 3 --iteration_count 1500 --serial emulator-5554"
To isolate the performance of a specific method within a benchmark class, use the # separator.
./gradlew run --args="main my-perf-fix compose:ui:ui-benchmark androidx.compose.ui.benchmark.accessibility.AccessibilityBenchmark#mySpecificTest --run_count 5 --iteration_count 500 --serial emulator-5554"
The tool produces four forms of output: a human-readable summary, a machine-readable CSV line, a metadata JSON file, and a histogram plot.
The summary provides descriptive statistics for the benchmark timings (in nanoseconds) from both datasets (revisions) and an analysis of their difference.
--- Comparison for: withTrailingLambdas_compose ---
Dataset 1 (Branch A) | Dataset 2 (Branch B)
----------------------------------------------------------------
Count | 100 | 100
Min (ns) | 160768.30 | 160433.01
Mean (ns) | 167300.93 | 167229.12
Median (ns) | 164604.77 | 164904.08
Std. Dev. (ns) | 8627.72 | 7774.46
Min Difference: | -335.28 ns (-0.21%)
Mean Difference: | -71.81 ns (-0.04%)
Median Difference: | 299.31 ns (0.18%)
95% CI of Diff: | [-678.86, 1390.19] ns ([-0.41%, 0.84%])
The confidence interval contains zero, suggesting no statistically significant difference between the medians.
--- MannWhitneyUTest Results (Branch B vs. Branch A) ---
P-value: 0.5675
Result: No statistically significant difference.
-------------------------------------------------------
Count, Min, Mean, Median, and Std. Dev. provide a basic overview of the timing distributions for each revision. A p-value below 0.05 indicates a statistically significant difference between the two revisions; a p-value of 0.05 or above suggests no statistically significant difference was detected.
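If you want to sanity-check this decision rule outside the tool, the following Python sketch applies SciPy's Mann-Whitney U test to two fabricated timing samples. SciPy is an assumed dependency here, and the tool itself may compute the test differently.

```python
# Illustrative check of the p-value decision rule using SciPy.
from scipy.stats import mannwhitneyu

# Fabricated timing samples in nanoseconds, one list per revision.
timings_a = [164_604.8, 165_120.3, 163_988.1, 166_201.4, 164_755.9]
timings_b = [164_904.1, 165_310.7, 164_102.6, 165_998.2, 164_870.3]

stat, p_value = mannwhitneyu(timings_a, timings_b, alternative="two-sided")
if p_value < 0.05:
    print(f"p={p_value:.4f}: statistically significant difference.")
else:
    print(f"p={p_value:.4f}: no statistically significant difference detected.")
```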
A single CSV line is printed for easy parsing by other scripts or for logging results.

--- Machine-Readable CSV ---
benchmarkName,count,min1,min2,min_diff,min_diff_%,mean1,mean2,mean_diff_%,median1,median2,p-value,median_diff_%,median_diff,median_diff_ci_lower,median_diff_ci_upper,median_diff_ci_lower_%,median_diff_ci_upper_%
withTrailingLambdas_compose,100,160768.30,160433.01,-335.28,-0.21%,167300.93,167229.12,-0.04%,164604.77,164904.08,0.5675,0.18%,299.31,-678.86,1390.19,-0.41%,0.84%
- benchmarkName: The name of the benchmark test method.
- count: The number of measurements taken for each revision.
- min1: The minimum timing value for revision A.
- min2: The minimum timing value for revision B.
- min_diff: The absolute difference in minimums (min2 - min1).
- min_diff_%: The percentage difference in minimums.
- mean1: The mean (average) timing for revision A.
- mean2: The mean (average) timing for revision B.
- mean_diff_%: The percentage difference in means.
- median1: The median timing of the baseline revision (A).
- median2: The median timing of the comparison revision (B).
- p-value: The p-value from the Mann-Whitney U test.
- median_diff_%: The percentage difference in medians ((median2 - median1) / median1).
- median_diff: The absolute difference in medians (median2 - median1).
- median_diff_ci_lower: The lower bound of the 95% confidence interval for the median difference.
- median_diff_ci_upper: The upper bound of the 95% confidence interval for the median difference.
- median_diff_ci_lower_%: The lower bound of the confidence interval as a percentage.
- median_diff_ci_upper_%: The upper bound of the confidence interval as a percentage.
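Because the line is plain CSV, downstream scripts can consume it directly. Here is a small, hedged Python sketch that parses the example row above into a dictionary; the raw string is copied from the sample output.

```python
# Parse the tool's machine-readable CSV line into a dict (illustrative).
import csv
import io

# Header and data row exactly as printed in the example above.
raw = (
    "benchmarkName,count,min1,min2,min_diff,min_diff_%,mean1,mean2,mean_diff_%,"
    "median1,median2,p-value,median_diff_%,median_diff,median_diff_ci_lower,"
    "median_diff_ci_upper,median_diff_ci_lower_%,median_diff_ci_upper_%\n"
    "withTrailingLambdas_compose,100,160768.30,160433.01,-335.28,-0.21%,"
    "167300.93,167229.12,-0.04%,164604.77,164904.08,0.5675,0.18%,299.31,"
    "-678.86,1390.19,-0.41%,0.84%\n"
)

row = next(csv.DictReader(io.StringIO(raw)))
p_value = float(row["p-value"])
median_diff_pct = float(row["median_diff_%"].rstrip("%"))
print(f"{row['benchmarkName']}: p={p_value}, median diff {median_diff_pct}%")
```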
A JSON file named metadata.json is also created in the output directory, recording information about the benchmark run.

A PNG image file named <benchmark_name>_histogram.png is created in the output directory, where benchmark_name is the name of the benchmark test method. This plot visualizes the distribution of the benchmark timings for both revisions, making it easier to spot differences in performance. Note: <path_to_output_dir> is the value passed to the --output_path parameter; if this parameter is not specified, it defaults to ~/androidx-main/frameworks/support/development/ab-benchmarking/app/build/benchmark-results/.
--- Graphical Plot ---
Saved histogram to: file://<path_to_output_dir>/<benchmark_name>_histogram.png
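As an aside, an overlaid histogram like the one the tool saves can be approximated from two timing samples with matplotlib. This is an illustrative sketch with fabricated data, not the tool's actual plotting code; matplotlib is an assumed dependency.

```python
# Illustrative overlaid histogram of two timing distributions.
import random

import matplotlib.pyplot as plt

random.seed(42)
# Fabricated samples roughly matching the example summary statistics (ns).
timings_a = [random.gauss(167_300, 8_600) for _ in range(100)]
timings_b = [random.gauss(167_230, 7_800) for _ in range(100)]

plt.hist(timings_a, bins=20, alpha=0.5, label="Branch A")
plt.hist(timings_b, bins=20, alpha=0.5, label="Branch B")
plt.xlabel("Timing (ns)")
plt.ylabel("Frequency")
plt.title("withTrailingLambdas_compose")
plt.legend()
plt.savefig("withTrailingLambdas_compose_histogram.png")
```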