2024-07-02 Triage Log

We saw a large set of primary benchmarks regress, mostly due to PR #120924 (lint_reasons and #[expect]) and PR #120639 (new effects desugaring). Separate from those, there are a couple rollup PRs (#127076, #127096) with some regressions that were limited to relatively few benchmarks; pnkfelix was unable to isolate a injecting PR that can be identified as a root cause (outside assistance welcome!).

Triage done by @pnkfelix. Revision range: c3d7fb39..cf2df68d

Summary:

(instructions:u)	mean	range	count
Regressions ❌ (primary)	1.0%	[0.2%, 2.8%]	109
Regressions ❌ (secondary)	1.4%	[0.3%, 8.0%]	50
Improvements ✅ (primary)	-1.3%	[-4.3%, -0.2%]	41
Improvements ✅ (secondary)	-1.3%	[-4.4%, -0.2%]	75
All ❌✅ (primary)	0.4%	[-4.3%, 2.8%]	150

4 Regressions, 3 Improvements, 11 Mixed; 7 of them in rollups 59 artifact comparisons made in total

Regressions

Rollup of 7 pull requests #126951 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	0.5%	[0.5%, 0.6%]	3
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	0.5%	[0.5%, 0.6%]	3

regressions are all to serde incr-patched:println {check, debug, opt}.
on its own, the regression is limited to instruction counts, and seems minor enough to not warrant deeper investigation.
(the 30-day history tells a slightly more complex story, though)
marked as triaged

Let's #[expect] some lints: Stabilize lint_reasons (RFC 2383) #120924 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	0.6%	[0.2%, 1.9%]	142
Regressions ❌ (secondary)	0.6%	[0.1%, 1.5%]	79
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	0.6%	[0.2%, 1.9%]	142

wide collection of regressions.
PR discussion indicates regression may be inherent to how #[expect] is implemented; it is also hypothesized to be “likely” that the implementation can be better optimized.
not marking as triaged.
(my hope is that someone will look into the “further optimizations” that xFrednet alludes to above, and after we've done a reasonable amount of investigation there, then we can mark this as triaged.)

Update browser-ui-test version to 0.18.0 #127010 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	7.2%	[7.2%, 7.2%]	1
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-	-	0

already marked as triaged (secondary benchmark deep-vector is being noisy at the moment).

Implement new effects desugaring #120639 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	0.3%	[0.2%, 0.6%]	72
Regressions ❌ (secondary)	0.4%	[0.1%, 0.9%]	24
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	0.3%	[0.2%, 0.6%]	72

Biggest (>=0.4%) primary regressions: regex, bitmaps, typenum, stm32f4, exa. (19 variants of those five benchmarks.)
the PR author (fee1-dead) has made a couple follow-up attempts to address the regressions, but nothing has hit yet.
not marking as triaged, in order to encourage addressing the regressions. (note however: the cycles:u metric didn't regress, at least not past our noise-filtering significance threshold. Nor did task-clock:u. It is not totally clear how much effort is warranted here, apart from a desire to keep the instruction count low just because that is our most stable proxy for “computational effort”)

Improvements

Save 2 pointers in TerminatorKind (96 → 80 bytes) #126784 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-0.4%	[-0.5%, -0.2%]	9
Improvements ✅ (secondary)	-0.1%	[-0.1%, -0.1%]	4
All ❌✅ (primary)	-0.4%	[-0.5%, -0.2%]	9

improvements are to serde and diesel.
skimming 30 day history indicates that the effect is real, though may have been somewhat undone by subsequent changes.

rustdoc: use current stage if download-rustc enabled #126728 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-8.0%	[-8.0%, -8.0%]	1
All ❌✅ (primary)	-	-	0

(this was just deep-vector noise)

Rollup of 9 pull requests #127174 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	0.2%	[0.2%, 0.2%]	1
Improvements ✅ (primary)	-0.4%	[-1.1%, -0.2%]	46
Improvements ✅ (secondary)	-1.3%	[-2.9%, -0.2%]	36
All ❌✅ (primary)	-0.4%	[-1.1%, -0.2%]	46

this had broad improvements to instruction counts, but the cycle counts metric reports that there were 13 regressions (one of which, unicode-normalization, was primary) here with only one improvement (none of which was primary).
nonetheless, not worth investigating further.

Mixed

Rollup of 9 pull requests #126878 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	0.4%	[0.3%, 0.5%]	6
Improvements ✅ (primary)	-0.4%	[-0.5%, -0.3%]	4
Improvements ✅ (secondary)	-0.3%	[-0.3%, -0.3%]	1
All ❌✅ (primary)	-0.4%	[-0.5%, -0.3%]	4

regressions are all to secondary benchmark: coercions.
marking as triaged

Add SliceLike to rustc_type_ir, use it in the generic solver code (+ some other changes) #126813 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	0.5%	[0.4%, 0.8%]	7
Improvements ✅ (primary)	-0.4%	[-0.6%, -0.3%]	12
Improvements ✅ (secondary)	-0.7%	[-2.2%, -0.2%]	9
All ❌✅ (primary)	-0.4%	[-0.6%, -0.3%]	12

regressions are all to secondary benchmark: match-stress
marking as triaged

Also get add nuw from uN::checked_add #126852 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	0.6%	[0.3%, 0.9%]	4
Regressions ❌ (secondary)	0.4%	[0.3%, 0.4%]	2
Improvements ✅ (primary)	-0.3%	[-0.3%, -0.3%]	1
Improvements ✅ (secondary)	-1.3%	[-1.4%, -0.9%]	7
All ❌✅ (primary)	0.4%	[-0.3%, 0.9%]	5

PR was analyzed and thought to be a net win, despite the anticipated regression to compiler instruction-counts
but there was a bystander follow-up comment that the result here might be a pessimization, at least for Intel x86.
not marking as triaged, in hopes that follow-up comment gets addressed in some manner.

ast: Standardize visiting order for attributes and node IDs #125741 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	0.3%	[0.2%, 0.3%]	3
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-0.3%	[-0.4%, -0.2%]	12
All ❌✅ (primary)	-	-	0

solely regressions to secondary benchmark: tt-muncher.
marking as triaged

Rollup of 8 pull requests #126965 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	3.3%	[1.7%, 5.8%]	9
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-3.0%	[-5.7%, -0.3%]	2
All ❌✅ (primary)	-	-	0

solely regressions to secondary benchmark: derive
marking as triaged

Remove more PtrToPtr casts in GVN #126844 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	1.3%	[0.3%, 2.9%]	4
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-0.7%	[-1.1%, -0.4%]	2
Improvements ✅ (secondary)	-0.3%	[-0.3%, -0.3%]	1
All ❌✅ (primary)	0.6%	[-1.1%, 2.9%]	6

Main primary regressions to opt-full benchmarks ripgrep (2.89%), webrender (1.11%), html5ever (0.70%).
Some interesting discussion on the PR about the cause; e.g. are PR's like this causing individual MIR reduction that leads to more inlining and then more bloat overall?
but I do not think any of that would cause us to undo this particular change; there are higher level inlining and code-generation policies that need to be revisited.
marking as triaged.

Rollup of 6 pull requests #127014 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	8.2%	[8.2%, 8.2%]	1
Improvements ✅ (primary)	-0.2%	[-0.2%, -0.2%]	1
Improvements ✅ (secondary)	-2.2%	[-5.0%, -0.2%]	13
All ❌✅ (primary)	-0.2%	[-0.2%, -0.2%]	1

already marked as triaged (sole regressionw as to noisy deep-vector)

Rollup of 6 pull requests #127076 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	1.4%	[0.6%, 2.1%]	2
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-1.8%	[-2.7%, -0.8%]	2
Improvements ✅ (secondary)	-0.7%	[-6.2%, -0.2%]	17
All ❌✅ (primary)	-0.2%	[-2.7%, 2.1%]	4

regressions are to opt-full: image (2.11%) and cargo (0.61%).
eyeballing the self-profile results provides a hint that we might be spending more time in LLVM optimizations passes after this rollup PR landed.
fired off some follow-up rust-timer builds on a couple potential culprits, but I admit that I‘m only making semi-educated guesses. (outcome: It wasn’t PR #124741 nor PR #126970)
not marking as triaged, but if no one can identify a root cause within a week, then we should just mark it so at that point.

Rollup of 11 pull requests #127096 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	0.5%	[0.3%, 0.7%]	7
Regressions ❌ (secondary)	1.1%	[0.2%, 1.6%]	7
Improvements ✅ (primary)	-3.4%	[-6.2%, -1.2%]	12
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-1.9%	[-6.2%, 0.7%]	19

all 7 primary regressions are variants of syn; all but one are incremental.
skimming the detailed results reports for the top three regressing variants, I see the following queries at the top of the ordering by time-delta: incr_comp_persist_dep_graph, hir_crate, codegen_copy_artifacts_from_incr_cache, early_lint_checks...
what in this rollup would have impacted those incremental-compilation related queries?
PR #1270668 already had its own dedicated rustc-perf run.
(is this potentially just fallout noise from internal API changes like that in PR #127071?)
fired off a rust-timer build against that, just to scratch that itch.
not marking as triaged, but if no one can identify a root cause within a week, then we should just mark it so at that point.

Automatically taint InferCtxt when errors are emitted #126996 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	0.5%	[0.4%, 0.9%]	7
Improvements ✅ (primary)	-0.2%	[-0.2%, -0.2%]	1
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-0.2%	[-0.2%, -0.2%]	1

all regressions are to secondary match-stress benchmark, and were anticipated during a perf run during review
marking as triaged.

Avoid MIR bloat in inlining #127113 (Comparison Link)

(instructions:u)	mean	range	count
Regressions ❌ (primary)	1.1%	[0.3%, 2.8%]	6
Regressions ❌ (secondary)	1.6%	[1.5%, 1.9%]	6
Improvements ✅ (primary)	-0.8%	[-2.2%, -0.2%]	17
Improvements ✅ (secondary)	-1.6%	[-4.5%, -0.2%]	18
All ❌✅ (primary)	-0.3%	[-2.2%, 2.8%]	23

regressed opt-full html5ever, diesel, hyper, and clap. Also regressed ripgrep and regex in two isolated opt incremental scenarios.
overall gains more than it loses, as noted after the perf run done during PR development.
the big impact was to binary sizes, where the improvement is pretty clear.
marking as triaged.

2024-07-02 Triage Log

Regressions

Improvements

Mixed

Untriaged Pull Requests