Tuesday, March 10, 2009

JIT Optimization and Inconsistent Results

FPU Discrepancies


Everyone who has spent time at the coal-face of numerical programming knows that floating point numbers are the devil's plaything. Add up a list of floats in one order, then in another, and you will more than likely get a different answer. Even when you think you are doing a calculation one way, it may come out another. Common causes include the following (there is a small sketch of the first one after the list):

  1. Optimization. The most common difference is between debug and release builds, where the compiler has reordered operations for pipeline efficiency.

  2. Parallelism. When computing heterogeneous tasks, it is typical to keep a TODO list and hand out jobs to threads as they become free. This inevitably means that jobs complete in slightly different orders, and if the result is accumulated in any way, the answer will vary.

  3. Co-processors. Modern CPUs have exotic built-in instructions (i.e. dedicated silicon) for numeric processing, such as SSE and 3DNow!. It is possible for the compiler to use these pipelines for one formulation of an algorithm but not for another, resulting in different answers. Typically these instructions are enabled in release mode but disabled in debug mode (see 1).
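
To make the first point concrete, here is a minimal C# sketch (the constants are the textbook ones, nothing to do with the calculation later in this post):

using System;

class SummationOrder
{
    static void Main()
    {
        // The same three values, summed in two different orders.
        double left = (0.1 + 0.2) + 0.3;    // 0.6000000000000001
        double right = 0.1 + (0.2 + 0.3);   // 0.6

        Console.WriteLine(left == right);   // False
        Console.WriteLine(left - right);    // ~1.1E-16: one ulp at this magnitude
    }
}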


So which answer is right? All of them and none of them. At least no answer is any better than any other, and anyway the inputs were already approximations before you even started. The alternative to this single-point approach is interval arithmetic, and every time I have these problems I hear Bill Walster laughing at me.

We Are All Doomed!


Once you are comfortable that you understand the source and magnitude of the differences, you apply a suitable tolerance to any comparison of the answer and off you go.

You weren't trying to compare the floats directly were you? [whacks back of hand with ruler]
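
In case the ruler didn't make the point, here is the sort of thing I mean - a sketch, with a helper name and tolerance handling of my own invention:

using System;

static class FloatCompare
{
    // Tolerance-based comparison: relative for large values, absolute
    // near zero. The right tolerance depends entirely on your problem.
    public static bool NearlyEqual(double a, double b, double tol)
    {
        double scale = Math.Max(Math.Abs(a), Math.Abs(b));
        return Math.Abs(a - b) <= tol * Math.Max(scale, 1.0);
    }
}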

The Compiler Strikes Back


I recently discovered a new mode where this problem can occur: JIT optimization. Coming from a C++ background I am used to the compiler doing more-or-less what I tell it to, and the compiled code staying the same from machine to machine. JIT compilation means that is no longer the case: the same IL will end up as different machine code on different machines and on the same machine under different circumstances (more on that later).

To fight back against non-determinism, we write unit tests that check the answers, and to do that we use exact comparison (I'm sorry I hit you on the hand with the ruler - exact comparison is fine for unit tests). The problem is that if we relax to a tolerance and make it too generous, it might let in some of the other issues we discussed, principally parallel non-determinism. If a client of mine runs the same calculation, they expect the same answer. This problem can be magnified if the answers are used to inform optimization strategies that could take a design down significantly different paths. The Age of Empires team had similar issues when synchronizing simulations for multiple users.

The Problem


We have some trigonometric code that looks very straightforward:

double a4 = square(sin(dlon / 2));

The code was in C++/CLI (although this problem will equally manifest in C#). Function square() does what it says on the tin, and in all cases was inlined by the compiler.

When we fed in 0.0000189590943040496 we got two different answers, depending on whether the debugger was attached or detached:

0.0000000000898618142047701 (attached)
0.0000000000898618142047700 (detached)

Looking at the binary representation of these numbers we saw a single bit difference in the mantissa.
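
If you want to poke at this yourself, here is a repro sketch in C# (Square() is my stand-in for the original helper; the exact bits you see will depend on your runtime and CPU):

using System;

class Repro
{
    static double Square(double x) { return x * x; }

    static void Main()
    {
        double dlon = 0.0000189590943040496;
        double a4 = Square(Math.Sin(dlon / 2));

        // Print the raw bit pattern rather than a formatted string,
        // because formatting can mask a one-bit difference (see below).
        Console.WriteLine("{0:X16}", BitConverter.DoubleToInt64Bits(a4));
    }
}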

As an interesting aside, there were other pairs of results that differed in the same way, but that Double.ToString() converted to the same string (even with "F30" as the format string for maximum precision). For instance, 40BFA439A6E6D034 and 40BFA439A6E6D033 (doubles in binary form) both come out as 8100.225202966530000000. Beware!
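
You can verify the aside directly by reconstituting the doubles from their bits - a sketch (the identical "F30" strings are what I saw on the .NET Framework of the day; newer runtimes format more faithfully, so your output may differ):

using System;

class HiddenDifference
{
    static void Main()
    {
        // The two bit patterns quoted above, one bit apart.
        double a = BitConverter.Int64BitsToDouble(0x40BFA439A6E6D034);
        double b = BitConverter.Int64BitsToDouble(0x40BFA439A6E6D033);

        Console.WriteLine(a == b);            // False
        Console.WriteLine(a.ToString("F30"));
        Console.WriteLine(b.ToString("F30"));
    }
}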

What could be different about having the IDE attached? Note that this was the same build with the same IL (this isn't some n00b debug vs. release issue).

The Joker In The (JIT) Pack (see what I did there?)


It turns out that when you load a CLR DLL, the JIT goes to work optimizing the code for performance - unless a debugger is attached, in which case optimization is suppressed (no inlining, no reordering, no MMX, etc.). This is so that the user can step through the code in a useful way, which even for non-optimized IL is a pretty amazing trick.
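
If you want to check which regime you are in, the CLR will tell you - a sketch (IsJITOptimizerDisabled reflects how the assembly was built, Debugger.IsAttached the situation the JIT saw at load):

using System;
using System.Diagnostics;
using System.Reflection;

class JitMode
{
    static void Main()
    {
        // Attach state: with a debugger attached at module load,
        // the JIT backs off its optimizations.
        Console.WriteLine("Debugger attached: " + Debugger.IsAttached);

        // Build state: a debug build asks for optimization to be
        // disabled regardless of any debugger.
        var attr = (DebuggableAttribute)Attribute.GetCustomAttribute(
            Assembly.GetExecutingAssembly(), typeof(DebuggableAttribute));
        Console.WriteLine("Optimizer disabled by build: " +
            (attr != null && attr.IsJITOptimizerDisabled));
    }
}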

To get under the skin of the problem I needed to see the native code that was being executed in both scenarios. I added a breakpoint on the offending line and launched a debug session from the IDE. Ctrl + Alt + D brings up the annotated assembly code, which looked like this:

...
double a4 = square(sin(dlon/2));
00000443 fld qword ptr [ebp-20h]
00000449 fmul dword ptr ds:[08E239D0h]
0000044f fsin
00000451 fstp qword ptr [ebp+FFFFFF24h]
00000457 fld qword ptr [ebp+FFFFFF24h]
0000045d fstp qword ptr [ebp-40h]
...
00000482 fld qword ptr [ebp-40h]
00000485 fmul st,st(0)
...

[ebp-20h] contains dlon, and the answer ends up in the FPU accumulator (ST0). The sequence of events is: load dlon, multiply by 0.5, take the sin, store the result, ..., retrieve the result, square it. Critically, the intermediate result is stored to memory before being squared - this is relevant, as we shall see.

Now the optimized JIT. This was harder to get hold of and required changing a couple of IDE settings:

[Screenshot: the relevant checkboxes under Tools > Options > Debugging]

One setting stops the IDE suppressing JIT optimization and the other stops the IDE complaining when it finds optimized code (it confuses it with 3rd-party code). These are the "Suppress JIT optimization on module load" and "Enable Just My Code" options respectively - both need unticking.

This gave the following assembly code:


double a4 = square(sin(dlon/2));
00000183 fxch st(2)
...
00000188 fmul dword ptr ds:[086CB6C8h]
0000018e fsin
00000190 fxch st(1)
...
00000199 fmulp st(1),st

Critically, this version of the calculation keeps the intermediate results on the FPU stack rather than dumping them into memory. Why is this significant? Because the FPU stack registers are 80 bits wide, whereas the memory locations are 64 bits wide. The extended-precision bits survive the intermediate steps, so when the final value is eventually rounded back to 64 bits it can differ in the LSB of the mantissa - hence our bit-flip.
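
You can watch the same mechanism one precision level down, where everything is visible from C#: let float play the role of the 64-bit memory slot and double the role of the 80-bit register (the constants are contrived so the effect shows up deterministically):

using System;

class DoubleRounding
{
    static void Main()
    {
        // Exactly representable as a double, but half an ulp away from
        // the nearest float, so narrowing has to round it.
        double s = 1.0 + 1.0 / 4096 + 1.0 / 16777216;   // 1 + 2^-12 + 2^-24

        float storedFirst = (float)s;                    // spill to "memory"
        float viaMemory = (float)(storedFirst * storedFirst);
        float viaRegister = (float)(s * s);              // stay "on the stack"

        Console.WriteLine(viaMemory == viaRegister);     // False
        Console.WriteLine(viaRegister - viaMemory);      // two float ulps
    }
}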

Now What?


I am now comfortable that I understand the source of the problem, and that I know what an appropriate tolerance would be for comparison: it only has to catch LSB changes, so it can be very, very small - something like the sketch below. This will still prevent the other non-determinism problems from sneaking past, since those typically cause differences that are orders of magnitude greater.
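
Concretely, the comparison can be phrased in ulps rather than in an absolute epsilon - a sketch (the helper is mine, and the ulp budget you allow is a judgement call):

using System;

static class UlpCompare
{
    // Treat two doubles as equal when their bit patterns are within
    // maxUlps steps of each other: tight enough to admit the JIT's
    // last-bit wobble while rejecting anything larger.
    public static bool WithinUlps(double a, double b, long maxUlps)
    {
        long ia = BitConverter.DoubleToInt64Bits(a);
        long ib = BitConverter.DoubleToInt64Bits(b);

        // Different signs compare equal only if both are zeros.
        if ((ia < 0) != (ib < 0)) return a == b;

        return Math.Abs(ia - ib) <= maxUlps;
    }
}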

Credits


I couldn't have found this without the timely assistance of Greg Young, whose posts (1 and 2) on the subject steered me to a resolution.
