I’d like to tell you about some of the recent changes we’ve made as part of our ongoing work to extend the optimization capabilities of RyuJIT, the MSIL-to-native code generator used by .NET Core and .NET Framework. I hope it will make for an interesting read, and offer some insight into the sorts of optimization opportunities we have our eyes on.
Note: The changes described here landed after the release fork for .NET Core 2.0 was created, so they are available in daily preview builds but not the released 2.0 bits. Similarly, these changes landed after the fork for .NET Framework 4.7.1 was created. The changes to struct argument passing and block layout, which are purely JIT changes, will automatically propagate to subsequent .NET Framework releases with the new JIT bits (the RyuJIT sources are shared between .NET Core and .NET Framework); the other changes depend on their runtime components to propagate to .NET Framework.
Improvements for Span
Some of our work was motivated by the introduction of Span<T>, so that it and similar types could better deliver on their performance promises.
One such change was #10910, which made the JIT recognize the Item property getters of Span<T> and ReadOnlySpan<T> as intrinsics. Rather than generate code for calls to these getters the same way it would for other calls, the JIT now transforms them directly into code sequences in its intermediate representation that are similar to the sequences used for the ldelem MSIL opcode, which fetches an element from an array. As noted in the PR’s performance assessment (n.b., if you follow that link, see also the follow-up where the initially-discovered regressions were fixed with subsequent improvements in #10956 and dotnet/roslyn#20548), this improved several benchmarks in the tests we added to track Span<T> performance. It did so by allowing the existing JIT code that optimizes away array bounds checks that are redundant with prior checks, or that are against arrays with known constant length, to kick in for Span<T> as well. This is what some of those improved benchmark methods look like, and their improvements:
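As a minimal sketch of the two patterns just described (the class and method names are hypothetical and are not taken from the Span<T> performance tests), the accesses below are the kind where a bounds check is either redundant with an earlier one or made against a span of known constant length. Whether the JIT actually removes a given check depends on the build and the surrounding code.

```csharp
using System;

static class SpanIndexerSketch
{
    // Hypothetical example: the second s[i] repeats an index that the first
    // access already checked, so its bounds check is redundant.
    public static int TwiceElement(Span<int> s, int i)
    {
        return s[i] + s[i];
    }

    // Hypothetical example: the span wraps an array whose length (4) is a
    // constant visible to the JIT, and the index (2) is a constant as well.
    public static int ThirdOfFour()
    {
        Span<int> s = new int[4] { 10, 20, 30, 40 };
        return s[2];
    }
}
```

Calling SpanIndexerSketch.TwiceElement(new[] { 1, 2, 3 }, 1) relies on the implicit conversion from T[] to Span<T>.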
Building on that, change #11521 updated the analysis machinery the JIT uses to eliminate bounds checks for other provably in-bounds array accesses, to similarly eliminate bounds checks for provably in-bounds Span<T> accesses (in particular, bounds checks in for loops bounded by span.Length). As noted in the PR (numbers here), this brought the codegen for four more microbenchmarks in the Span<T> tests up to par with the codegen for equivalent patterns with arrays; here are two of them:
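The loop shape in question can be sketched as follows. This is an illustrative example with hypothetical names, not one of the microbenchmarks from the PR: because the loop condition is i < span.Length and i starts at zero and only increases, every span[i] inside the body is provably in bounds.

```csharp
using System;

static class SpanLoopSketch
{
    // Hypothetical example: a for loop bounded by span.Length. The index is
    // always in [0, span.Length), so the per-access bounds check is provably
    // unnecessary.
    public static int Sum(ReadOnlySpan<int> span)
    {
        int sum = 0;
        for (int i = 0; i < span.Length; i++)
        {
            sum += span[i];
        }
        return sum;
    }

    // The same shape over an array, which the JIT already handled this way.
    public static int Sum(int[] array)
    {
        int sum = 0;
        for (int i = 0; i < array.Length; i++)
        {
            sum += array[i];
        }
        return sum;
    }
}
```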
One key fact that these bounds-check removal optimizations exploit is that array lengths are immutable; any two loads of a.Length, if a refers to the same array each time, will load the same length value. It’s common for the JIT to encounter different accesses to the same array, where the reference to the array is held in a local or parameter of type T[], such that it can determine that intervening code hasn’t modified the local/parameter in question, even if that intervening code has unknown side-effects. The same isn’t true for parameters of type ref T[], since intervening code with unknown side-effects might change which array object is referenced. Consider:
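A minimal sketch of the contrast, using a by-value array parameter a and a by-reference array parameter b as in the discussion (the class, the method names, and the UnknownSideEffects delegate are assumptions introduced for illustration):

```csharp
using System;

static class ArrayParameterSketch
{
    // Stand-in (hypothetical) for a call whose side effects the JIT cannot
    // see through.
    public static Action UnknownSideEffects = () => { };

    // a is passed by value: nothing outside this method can change which
    // array a refers to, so a.Length is the same before and after the call
    // and the second a[i] needs no new bounds check.
    public static int ByValue(int[] a, int i)
    {
        int first = a[i];
        UnknownSideEffects();
        return first + a[i];
    }

    // b is passed by reference: the call may have stored a different,
    // possibly shorter, array into the location b points at, so the second
    // b[i] must be checked again.
    public static int ByRef(ref int[] b, int i)
    {
        int first = b[i];
        UnknownSideEffects();
        return first + b[i];
    }
}
```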
Since Span<T> is a struct, some platforms’ ABIs specify that passing an argument of type Span<T> actually be done by creating a copy of the struct in the caller’s stack frame, and passing a pointer to that copy in to the callee via the argument registers/stack. The JIT’s internal modeling of this convention is to rewrite Span<T> parameters as ref Span<T> parameters. That internal rewrite at first caused problems for applying bounds-check removal optimizations to spans passed as parameters. The problem was that methods written with by-value Span<T> parameters, which at source look analogous to by-value array parameter a in the example above, when rewritten looked to the JIT like by-reference parameters, analogous to by-reference array parameter b above. This caused the JIT to handle references to such parameters’ Length fields with the same conservatism needed for b above. Change #10453 taught the JIT to make local copies of such parameters before doing that rewrite (in beneficial cases), so that bounds-check removal optimizations can equally apply to spans passed by value. As noted in the PR, this change allowed these optimizations to fire in 9 more of the Span<T> micro-benchmarks in our test suite; here are three of them:
This last change applies more generally to any structs passed as parameters (not just Span<T>); the JIT is now better able to analyze value propagation through their fields.
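To make the parameter case concrete, here is a minimal sketch (the names are hypothetical, and this is not one of the micro-benchmarks from the PR). At source level span is passed by value, so the call in the loop body cannot change span.Length; the difficulty described above arose only because the JIT's internal rewrite made the parameter look like a ref Span<T>.

```csharp
using System;

static class SpanParameterSketch
{
    // Stand-in (hypothetical) for a call with side effects unknown to the JIT.
    public static Action UnknownSideEffects = () => { };

    // span is a by-value struct parameter. Once the JIT works from a local
    // copy, it can see that span.Length is unchanged across the call, so the
    // loop condition still proves span[i] is in bounds.
    public static int SumWithCalls(Span<int> span)
    {
        int sum = 0;
        for (int i = 0; i < span.Length; i++)
        {
            UnknownSideEffects();
            sum += span[i];
        }
        return sum;
    }
}
```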
Enum.HasFlag Optimization
The Enum.HasFlag method offers nice readability (compare targets.HasFlag(AttributeTargets.Class | AttributeTargets.Struct) vs (targets & (AttributeTargets.Class | AttributeTargets.Struct)) == (AttributeTargets.Class | AttributeTargets.Struct)), but, since it needs to handle reflection cases where the exact enum type isn’t known until run-time, it is notoriously expensive. Change #13748 taught the JIT to recognize when the enum type is known (and known to equal the argument type) at JIT time, and generate the simple bit test rather than the expensive Enum.HasFlag call. Here’s a micro-benchmark to demonstrate, comparing .NET Core 2.0 (which doesn’t have this change) to a recent daily preview build (which does). Many thanks to @adamsitnik for making it easy to use BenchmarkDotNet with daily preview builds of .NET Core!
Output:
BenchmarkDotNet=v0.10.9.313-nightly, OS=Windows 10 Redstone 2 [1703, Creators Update] (10.0.15063)
Processor=Intel Core i7-4790 CPU 3.60GHz (Haswell), ProcessorCount=8
Frequency=3507517 Hz, Resolution=285.1020 ns, Timer=TSC
.NET Core SDK=2.1.0-preview1-007228
[Host] : .NET Core 2.1.0-preview1-25719-04 (Framework 4.6.25718.02), 64bit RyuJIT
Job-WFNGKY : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT
Job-VIXUQP : .NET Core 2.1.0-preview1-25719-04 (Framework 4.6.25718.02), 64bit RyuJIT
Method | Toolchain | Mean | Error | StdDev |
---|---|---|---|---|
HasFlag | .NET Core 2.0 | 14,917.4 ns | 80.147 ns | 71.048 ns |
HasFlag | .NET Core 2.1.0-preview1-25719-04 | 449.3 ns | 1.239 ns | 1.034 ns |
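In source terms, the transformation behind these numbers amounts to roughly the following. This is a minimal sketch with hypothetical class and method names, not the benchmark that produced the table: when the enum type of both the receiver and the argument is known at JIT time, the HasFlag call can be treated like the hand-written mask comparison.

```csharp
using System;

static class HasFlagSketch
{
    // Readable form: a call to Enum.HasFlag with a statically known enum type.
    public static bool ViaHasFlag(AttributeTargets targets)
    {
        return targets.HasFlag(AttributeTargets.Class | AttributeTargets.Struct);
    }

    // Equivalent hand-written bit test. The parentheses around the & matter:
    // in C#, == binds more tightly than &.
    public static bool ViaBitTest(AttributeTargets targets)
    {
        const AttributeTargets mask = AttributeTargets.Class | AttributeTargets.Struct;
        return (targets & mask) == mask;
    }
}
```

Both methods return true only when every bit of the mask is set in targets.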
With the cool new BenchmarkDotNet DisassemblyDiagnoser (again thanks to @adamsitnik), we can see that the optimized code really is a simple bit test:
[Disassembly listings for Bench.HasFlag, side by side: RyuJIT x64 .NET Core 2.0 and RyuJIT x64 .NET Core 2.1.0-preview1-25719-04.]
What’s more, implementing this optimization involved implementing a new scheme for recognizing intrinsics in the JIT, which is more flexible than the previous scheme, and which is being leveraged in the implementation of Intel SIMD intrinsics for .NET Core.
Block Layout for Search Loops
Outside of profile-guided optimization, the JIT has traditionally been conservative about rearranging the basic blocks of methods it compiles, leaving them in MSIL order except to segregate code it identifies as “rarely-run” (e.g. blocks that throw or catch exceptions). Of course, MSIL order isn’t always the most performant one; notably, in the case of loops with conditional exits/returns, it’s generally a good idea to keep the in-loop code together, and move everything on the exit path after the conditional branch out of the loop. For particularly hot loops, this can cause a significant enough difference that developers have been using gotos to make the MSIL order reflect the desired machine code order. Change #13314 updated the JIT’s loop detection to effect this layout automatically. As usual, the PR included a performance assessment, which noted speed-ups in 5 of the benchmarks in our performance test suite.
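The two source shapes can be sketched as follows. This is an illustrative search loop with hypothetical names, not the repro from the GitHub issue. In the first method the return sits inside the loop body, so in MSIL order the exit code lands in the middle of the loop; in the second, the goto moves the exit code after the loop by hand.

```csharp
static class LoopLayoutSketch
{
    // Natural form: the exit path (return i) is written inside the loop.
    public static int IndexOfWithReturn(int[] values, int target)
    {
        for (int i = 0; i < values.Length; i++)
        {
            if (values[i] == target)
            {
                return i;
            }
        }
        return -1;
    }

    // Hand-tuned form: the goto places the exit path after the loop, so the
    // in-loop code stays contiguous even when blocks are left in MSIL order.
    public static int IndexOfWithGoto(int[] values, int target)
    {
        int i;
        for (i = 0; i < values.Length; i++)
        {
            if (values[i] == target)
            {
                goto Found;
            }
        }
        return -1;

    Found:
        return i;
    }
}
```

The two methods compute the same result; only the order of the blocks in the source, and hence in the MSIL, differs.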
Again comparing .NET Core 2.0 (which didn’t have this change) to a recent daily preview build (which does), let’s look at the effect on the repro case from the GitHub issue describing this opportunity:
The results confirm that the new JIT brings the performance of the loop with the in-place return in line with the performance of the loop with the goto, and that doing so constituted a 15% speed-up:
BenchmarkDotNet=v0.10.9.313-nightly, OS=Windows 10 Redstone 2 [1703, Creators Update] (10.0.15063)
Processor=Intel Core i7-4790 CPU 3.60GHz (Haswell), ProcessorCount=8
Frequency=3507517 Hz, Resolution=285.1020 ns, Timer=TSC
.NET Core SDK=2.1.0-preview1-007228
[Host] : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT
Job-NHAVNC : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT
Job-CTEHPT : .NET Core 2.1.0-preview1-25719-04 (Framework 4.6.25718.02), 64bit RyuJIT
Method | Toolchain | Mean | Error | StdDev |
---|---|---|---|---|
LoopReturn | .NET Core 2.0 | 61.97 ns | 0.1254 ns | 0.1111 ns |
LoopGoto | .NET Core 2.0 | 53.63 ns | 0.5171 ns | 0.4837 ns |
LoopReturn | .NET Core 2.1.0-preview1-25719-04 | 53.75 ns | 0.5089 ns | 0.4511 ns |
LoopGoto | .NET Core 2.1.0-preview1-25719-04 | 53.52 ns | 0.0999 ns | 0.0934 ns |
Disassembly confirms that the difference is entirely block placement:
[Disassembly listings for LoopWithExit.LoopReturn, side by side: RyuJIT x64 .NET Core 2.0 and RyuJIT x64 .NET Core 2.1.0-preview1-25719-04.]
[Disassembly listings for LoopWithExit.LoopGoto, side by side: RyuJIT x64 .NET Core 2.0 and RyuJIT x64 .NET Core 2.1.0-preview1-25719-04.]
Conclusion
We’re constantly pushing to improve our codegen, whether it’s to enable new scenarios/features (like Span<T>), or to ensure good performance for natural/readable code (like calls to HasFlag and returns from loops). As always, we invite anyone interested to join the community pushing this work forward. RyuJIT documentation available online includes an overview and a recently added tutorial, and our GitHub issues are open for (and full of) active discussions!