Thursday Night

Paul Betts’s personal website / blog / what-have-you

The case of the disappearing OnLoad exception – user-mode callback exceptions in x64

For the impatient: you can skip to the end of the article to see what you should do about disappearing exceptions in desktop applications.

The problem – why doesn’t this crash?

If you’ve got a 64-bit OS on your machine, try the following: open up Visual Studio 2010 and create a new WinForms project. Add an OnLoad event handler and paste in the following code:

using System;
using System.Windows.Forms;

namespace WindowsFormsApplication1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void OnLoad(object sender, EventArgs e)
        {
            throw new Exception("Hey, where’d I go?");
        }
    }

    static class Program
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread]
        static void Main()
        {
            Application.EnableVisualStyles();
            Application.SetCompatibleTextRenderingDefault(false);
            Application.Run(new Form1());

            MessageBox.Show("We should never get here");
        }
    }
}

Run the app – it succeeds! The exception we threw just disappeared into the ether, and the app went on its merry way. A lot of people think this is a CLR bug or a WinForms bug, but it actually happens everywhere – you can trigger it using WinForms, WPF, and also MFC / straight Win32. This is the result of a complicated design decision in ntdll and the NT kernel that I worked on for Vista and Windows 7, and it’s important to understand if you’re writing desktop applications on Windows. But before I can explain how it all works, we need to cover a bit about exceptions in NT.

What is an SEH exception?

Different programming languages implement their own notion of software exceptions, each with different semantics (some, like C#, offer keywords like ‘finally’, and others, like VB.NET, allow you to run a filter function on thrown exceptions to decide whether to catch them). However, Windows has its own notion of exceptions, called Structured Exception Handling. Almost every language-specific exception framework on Windows, including C++ and .NET exceptions, is really built on SEH exceptions.

This concept is built into the NT kernel – remember that not all exceptions are raised by software (i.e., invoked by a ‘throw’ statement); exceptions are also generated by the hardware. When you run a statement like *((int*)0x0) = 0x42;, the CPU attempts to walk the page tables looking for that memory mapping; when it fails, the CPU raises a hardware page fault exception, and the OS has a chance to take over. The OS will realize that this address is bogus, and it propagates that exception back to the application as an SEH exception with the code STATUS_ACCESS_VIOLATION.

The moral is that program crashes like a null pointer dereference or a divide-by-zero are really all SEH exceptions. You could catch that exception, and in fact that’s exactly how the CLR’s NullReferenceException works – it’s really a reworded access violation, verified to point to NULL. When an exception is raised, the exception dispatcher walks up the stack looking for a handler – when it finally runs out of stack frames to traverse, UnhandledExceptionFilter is invoked, and normally that signals WER to come clean up the mess.
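To make the "an access violation is just an SEH exception" point concrete, here is a minimal sketch in C. It assumes the Microsoft compiler (the __try / __except keywords are an MSVC extension); the function name catch_null_write is made up for illustration, and on other compilers the sketch just returns -1 so that it still builds.

```c
#if defined(_MSC_VER)
#include <windows.h>

/* Returns 1 if writing through a null pointer raised an access violation
 * that we caught as an SEH exception, 0 if nothing was raised. */
int catch_null_write(void)
{
    volatile int *p = NULL;
    __try {
        *p = 0x42;  /* hardware page fault, surfaced as an SEH exception */
    } __except (GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION
                    ? EXCEPTION_EXECUTE_HANDLER    /* handle it here */
                    : EXCEPTION_CONTINUE_SEARCH) { /* let others look */
        return 1;
    }
    return 0;
}
#else
/* Assumption: no SEH keywords outside MSVC-compatible compilers. */
int catch_null_write(void)
{
    return -1;
}
#endif
```

The filter expression in __except is the same kind of hook that VB.NET exposes as an exception filter: it runs while the dispatcher is still searching the stack for a handler.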

Here’s the reason that you probably shouldn’t implement exceptions yourself if you’re a language implementer – SEH exceptions are the only ones that understand how to traverse the user-mode / kernel-mode boundary (e.g. think about what happens when you throw an exception inside a user-mode APC – the exception has to go through kernel-mode to get back to your WinMain).

How come this doesn’t happen all the time?

Here’s why this seems to happen only on certain window messages – remember that window messages can originate from different sources; anyone(*) can queue a message to a window. However, certain window messages (the most notable one being WM_CREATE) are sent directly by win32k.sys as a synchronous result of a user-mode call.

Your app calls CreateWindow(); this results in a system call to NtUserCreateWindowEx. This syscall ends up going down to win32k.sys, which does the work of actually creating the window. None of the work is deferred; it’s all done in the context of the running user thread. As part of the work to create and show the window, win32k must call the app’s WndProc function. We’ve gone from user mode to kernel mode, and now we have to come back to user mode, all on the same thread stack. Here’s what it looks like:

00 KERNELBASE!RaiseException+0x39
01 clr!RaiseTheExceptionInternalOnly+0x363
02 clr!IL_Throw+0x146
03 WindowsFormsApplication1!WindowsFormsApplication1.Form1.OnLoad+0x70
04 System_Windows_Forms_ni!System.Windows.Forms.Form.OnLoad+0x1a9
05 System_Windows_Forms_ni!System.Windows.Forms.Control.CreateControl+0x1c4
06 System_Windows_Forms_ni!System.Windows.Forms.Control.CreateControl+0x24
07 System_Windows_Forms_ni!System.Windows.Forms.Control.WmShowWindow+0xd8
08 System_Windows_Forms_ni!System.Windows.Forms.Control.WndProc+0x3dd
09 System_Windows_Forms_ni!System.Windows.Forms.Form.WndProc+0x243
0a System_Windows_Forms_ni!System.Windows.Forms.NativeWindow.Callback+0x16c
0b System_Windows_Forms_ni!DomainBoundILStubClass.IL_STUB_ReversePInvoke+0x50
0c clr!UMThunkStubAMD64+0x77
0d USER32!UserCallWinProcCheckWow+0x1ad
0e USER32!DispatchClientMessage+0xc3
0f USER32!__fnDWORD+0x2d
10 ntdll!KiUserCallbackDispatcherContinue
11 USER32!ZwUserShowWindow+0xa
12 clr!DoNDirectCall__PatchGetThreadCall+0x7b
13 System_Windows_Forms_ni!DomainBoundILStubClass.IL_STUB_PInvoke+0x42
14 System_Windows_Forms_ni!System.Windows.Forms.Control.SetVisibleCore+0x179
15 System_Windows_Forms_ni!System.Windows.Forms.Form.SetVisibleCore+0x25d
16 System_Windows_Forms_ni!System.Windows.Forms.Application+ThreadContext.RunMessageLoopInner+0x1dc
17 System_Windows_Forms_ni!System.Windows.Forms.Application+ThreadContext.RunMessageLoop+0x81
18 WindowsFormsApplication1!WindowsFormsApplication1.Program.Main()+0x57

Frames 10 and 11 are the ones I’m talking about – we disappear into the syscall, reappear in userspace via KiUserCallbackDispatcherContinue, and eventually show up in OnLoad in frame 3. SEH will now walk the stack back up, but it will hit a brick wall at KiUserCallbackDispatcherContinue. For complicated reasons, we cannot propagate the exception back on 64-bit operating systems (amd64 and IA64). This has been the case ever since the first 64-bit release of Server 2003. On x86, this isn’t the case – the exception gets propagated through the kernel boundary and would end up walking the frames back until it reached WindowsFormsApplication1.Program.Main().

When this happens, there are only two things we can sanely do: either make the exception disappear, or kill the application by rethrowing the exception as noncontinuable, similar to the CLR’s StackOverflowException. In the second case, your catch and finally blocks still run, but any attempt to catch the exception and carry on is ignored. The kernel architects at the time decided to take the conservative AppCompat-friendly approach – hide the exception, and hope for the best.

Why would I want to crash my own application?

In a small application, you can sometimes get away with eating the exception – if you can, great! However, most of the time you’re not so lucky. Since SEH was in the middle of unwinding the stack and processing the exception when it suddenly jumped to another location, the application is usually in a very inconsistent state when the call comes back, as if you had just longjmp()’ed to another place. The end result for large apps is chaos – a simple AV turns into a difficult-to-debug case of application state corruption – and if you’re a developer, you’ll pull your hair out wondering how your structures ended up in such a bizarre state. This also makes it very difficult to use the crash reports from WER or to write your own error reporting code, since you’ll never see the true crash, only a later random one that resulted from the corrupted state.
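Here is a rough analogy in portable C for why a swallowed exception leaves state behind. This is not what the kernel does internally; it only mimics the effect with setjmp / longjmp. The names account_pair, transfer and run_transfer are invented for this sketch, and the invariant is that a + b never changes.

```c
#include <setjmp.h>

/* Invariant the rest of the program relies on: a + b stays constant. */
struct account_pair {
    int a;
    int b;
};

static jmp_buf dispatcher;  /* stands in for the callback dispatcher frame */

/* A two-step update. If 'fail' is set we bail out halfway through,
 * the way an exception thrown inside a callback would. */
static void transfer(struct account_pair *p, int amount, int fail)
{
    p->a -= amount;
    if (fail) {
        longjmp(dispatcher, 1);  /* jump away; second half never runs */
    }
    p->b += amount;
}

/* Returns 1 if the transfer completed, 0 if the "exception" was swallowed.
 * Either way the caller keeps running, just like the swallowed case. */
int run_transfer(struct account_pair *p, int amount, int fail)
{
    if (setjmp(dispatcher) != 0) {
        return 0;  /* swallowed: nobody repairs *p */
    }
    transfer(p, amount, fail);
    return 1;
}
```

After a failed run_transfer the program carries on with a pair whose total is off by the transferred amount, and whatever crashes later because of that points nowhere near the original failure.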

The situation on Server 2003 and Vista

On Server ’03, XP64 (a rebranded Server ’03) and Vista, we kept the exception-swallowing behavior for both native x64 applications and WOW64 applications (32-bit programs running on a 64-bit OS), in an attempt to keep programs working. However, this was never an ideal solution – we needed a better way to cater both to modern applications, by crashing and giving good bug reports, and to legacy applications that just happened to work correctly.

Windows 7 fixes this…kind of

The solution? In Windows 7, when a native x64 application crashes in this fashion, the Program Compatibility Assistant is notified. If the application doesn’t have a Windows 7 manifest, we show a dialog telling you that PCA has applied an Application Compatibility shim. What does this mean? It means that the next time you run your application, Windows will emulate the Server 2003 behavior and make the exception disappear. Keep in mind that PCA doesn’t exist on Server 2008 R2, so this advice doesn’t apply there.

What does this mean to you as a developer? It means that if you’re writing a new application, you always want to have a Win7-compatible manifest. Using a Win7 manifest tells Windows not to treat you like an older application, and to always give you the latest OS behavior.

Completing the story with KB976038

I called out in the Win7 section that this applies to native x64 applications – well, what about 32-bit applications (WOW64 processes)? Unfortunately, in Win7 RTM, WOW64 applications still have the Server ’03 behavior: exceptions are always swallowed. However, if you install this hotfix from Microsoft (KB976038), WOW64 will act just like x64 on Windows 7 – Win7-manifested x86 applications will crash, just like their x64 counterparts.

This fix also gets you more ways to control when we swallow or rethrow user-mode callback exceptions. First, for debugging purposes only, this fix adds a way to control the behavior via Image File Execution Options. You can enable/disable this option system-wide, or per-application (via the EXE name). I want to mention again, though, that this option is for developer machines only. If you set this key in an installer, you are Doing it Wrong and will make me sad.

Another way that you can enable/disable exception swallowing is via a new public API in Kernel32.dll – since this won’t be available in the SDK headers until Win7 SP1, you’ll have to dynamically invoke the API call via LoadLibrary and GetProcAddress. Here are the definitions of these functions:

//
// If this flag is set, the exception will be *swallowed* (i.e. the Server ’03
// behavior)
//

#define PROCESS_CALLBACK_FILTER_ENABLED     0x1

BOOL
WINAPI
SetProcessUserModeExceptionPolicy(
    __in DWORD dwFlags
    );

BOOL
WINAPI
GetProcessUserModeExceptionPolicy(
    __out LPDWORD lpFlags
    );

So, the best future-proof way to call this function is:

DWORD dwFlags;
if (GetProcessUserModeExceptionPolicy(&dwFlags)) {
    SetProcessUserModeExceptionPolicy(dwFlags & ~PROCESS_CALLBACK_FILTER_ENABLED); // clear the swallow flag (0x1), keep the other bits
}
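Since the functions aren’t in the headers yet, the call above won’t link as written. A minimal sketch of the dynamic lookup might look like the following. The helper names clear_callback_filter and disable_callback_exception_swallowing are made up for this example, and the non-Windows branch exists only so the sketch compiles elsewhere; the two exported function names and the flag value are the ones defined above.

```c
#ifdef _WIN32
#include <windows.h>
#else
typedef unsigned long DWORD;  /* stand-in so the sketch builds off Windows */
#endif

#ifndef PROCESS_CALLBACK_FILTER_ENABLED
#define PROCESS_CALLBACK_FILTER_ENABLED 0x1
#endif

/* Returns the policy flags with the "swallow exceptions" bit cleared
 * and every other bit left alone. */
DWORD clear_callback_filter(DWORD flags)
{
    return flags & ~(DWORD)PROCESS_CALLBACK_FILTER_ENABLED;
}

/* Returns 1 if the policy was updated, 0 if the API is missing
 * (no hotfix, older OS, not Windows) or one of the calls failed. */
int disable_callback_exception_swallowing(void)
{
#ifdef _WIN32
    typedef BOOL (WINAPI *GetPolicyFn)(LPDWORD);
    typedef BOOL (WINAPI *SetPolicyFn)(DWORD);
    int ok = 0;

    HMODULE kernel32 = LoadLibraryW(L"kernel32.dll");
    if (kernel32 != NULL) {
        GetPolicyFn get_policy = (GetPolicyFn)GetProcAddress(
            kernel32, "GetProcessUserModeExceptionPolicy");
        SetPolicyFn set_policy = (SetPolicyFn)GetProcAddress(
            kernel32, "SetProcessUserModeExceptionPolicy");
        DWORD flags = 0;

        /* Both exports are absent on systems without the fix. */
        if (get_policy != NULL && set_policy != NULL && get_policy(&flags)) {
            ok = set_policy(clear_callback_filter(flags)) ? 1 : 0;
        }
        FreeLibrary(kernel32);
    }
    return ok;
#else
    return 0;
#endif
}
```

You would call disable_callback_exception_swallowing() once, early in WinMain, before creating any windows. A return of 0 just means you are on a system without the API, so the app keeps whatever behavior the OS gives it.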

Hey, there’s a Vista package too

On Vista, the story’s a bit different: since PCA doesn’t have the proper support on Vista, the hotfix will add this behavior only for applications (either native or WOW64) that specifically ask for it via the public API or the IFEO key. Since this is a hotfix for an older, stable operating system, it tries to be more conservative so existing apps aren’t broken.

In Summary – ways to control user-mode exceptions

Here’s the tl;dr version of this article:

  • If you’re writing desktop applications, install this hotfix from Microsoft (KB976038) on all your development machines until Win7 SP1 comes out.
  • Mark all your new applications as Win7-compatible.
  • If the manifest doesn’t work for you for some reason, or you’re shipping for Vista, try to call the public API via GetProcAddress.

Written by Paul Betts

July 20th, 2010 at 11:42 pm