Thursday, November 26, 2009

"Exploding Error Codes"

Life delays blog posts, but here we go at last:

We're experimenting with a novel error handling strategy that I call "Exploding Error Codes." The intended usage pattern looks like this: Any function that needs to indicate an error condition takes a pointer to an EEC. Requiring the EEC to be passed as an argument prevents it from being forgotten the way that a return value error indicator can easily be overlooked.
A called function is required to always set its EEC argument. A caller is required to always check the EEC after the call. This pattern keeps control-flow linear and local, thus avoiding the control-flow obfuscation you get from exceptions.
However, just because you pass an error container by pointer doesn't mean that the callee sets it or the caller checks it. So exploding error codes feature enforcement logic:
  • An EEC has three states: unset, set-but-unchecked, and set-and-checked.
  • A new EEC starts in the unset state.
  • On EEC destruction, it is an error if the EEC is in any state but set-and-checked. In C++, we stack-allocate EECs in callers to get this check as soon as possible.
  • An EEC supports these operations: set(code), get_code(), check(), unset()
    • set(code) sets the error value of the EEC. It is an error if the EEC is in any state but the unset state.
    • get_code(): It is an error if the EEC is not in one of the set states. This call has no side-effects. It returns the error value of the EEC.
    • check(): It is an error if the EEC is not in the set-but-unchecked state. The EEC is placed in the set-and-checked state, and the call returns the error value of the EEC.
    • unset(): It is an error if the EEC is not in the set-and-checked state. The EEC is placed in the unset state.
  • Whenever one of these usage errors occurs, the EEC explodes. And by "explodes", I mean it logs a stack trace and calls abort(), thus ending program execution.
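The state rules above can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the FAILED value, the private explode() helper, and the message strings are assumptions made up for the example, and the stack-trace logging is reduced to a single stderr line because there is no portable way to capture one.

```cpp
#include <cstdio>
#include <cstdlib>

class ExplodingErrorCode {
 public:
  // OK appears in the post; FAILED is a hypothetical extra value.
  enum Code { OK = 0, FAILED = 1 };

  // A new EEC starts in the unset state.
  ExplodingErrorCode() : state_(UNSET), code_(OK) {}

  // Destruction is only legal in the set-and-checked state.
  ~ExplodingErrorCode() {
    if (state_ != SET_AND_CHECKED) explode("destroyed before being set and checked");
  }

  // Legal only when unset; moves to set-but-unchecked.
  void set(Code code) {
    if (state_ != UNSET) explode("set() on an EEC that is not unset");
    code_ = code;
    state_ = SET_BUT_UNCHECKED;
  }

  // Legal in either set state; no side effects.
  Code get_code() const {
    if (state_ == UNSET) explode("get_code() on an unset EEC");
    return code_;
  }

  // Legal only when set-but-unchecked; moves to set-and-checked.
  Code check() {
    if (state_ != SET_BUT_UNCHECKED) explode("check() on an EEC that is not set-but-unchecked");
    state_ = SET_AND_CHECKED;
    return code_;
  }

  // Legal only when set-and-checked; moves back to unset for re-use.
  void unset() {
    if (state_ != SET_AND_CHECKED) explode("unset() on an EEC that is not set-and-checked");
    state_ = UNSET;
  }

 private:
  enum State { UNSET, SET_BUT_UNCHECKED, SET_AND_CHECKED };

  // Copying would duplicate the state, so this sketch forbids it
  // (declared but never defined).
  ExplodingErrorCode(const ExplodingErrorCode&);
  ExplodingErrorCode& operator=(const ExplodingErrorCode&);

  // Hypothetical helper: a real version would log a stack trace first.
  static void explode(const char* why) {
    std::fprintf(stderr, "ExplodingErrorCode exploded: %s\n", why);
    std::abort();
  }

  State state_;
  Code code_;
};
```

Note that under these rules an EEC that is declared but never passed to anything also explodes on destruction, since it is still unset.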
The overall goal is straightforward: to cause a deterministic program abort any time that a callee forgets to set an error code, or a caller fails to check an error code --- whether or not there was an error condition. This determinism enables the developer to rapidly discover and correct places where we're not handling errors properly.
So given that that's the objective, let's look at how these things work out in several simple cases.

Correct caller, correct callee

Here's a caller that allocates, passes, and checks an EEC properly:
void do_stuff() {
  ExplodingErrorCode eec;
  set_up();
  first_call(&eec);
  if (eec.check() != ExplodingErrorCode::OK) {
    tear_down();
    return;
  }
}
And here's a callee that sets the EEC properly, too:
void first_call(ExplodingErrorCode *eec) {
  my_guts();
  eec->set(ExplodingErrorCode::OK);
}
The EEC is allocated in the unset state. After the call to first_call, it is in the set-but-unchecked state. After do_stuff calls check(), the EEC is in the set-and-checked state. Finally, at the end of do_stuff(), the EEC's destructor runs; since the EEC is set-and-checked, destruction proceeds without incident.

Caller forgets to check before destruction

Consider a do_stuff that looks like this:
void do_stuff() {
  ExplodingErrorCode eec;
  set_up();
  first_call(&eec);
  more_stuff();
  // EEC destructor will deterministically abort when we get here
}
The very first time we run this program, it will deterministically abort when the EEC destructor runs and discovers that the EEC has not been checked.

Caller forgets to check before re-use

Here's another common mistake:
void do_stuff() {
  ExplodingErrorCode eec;
  set_up();
  first_call(&eec);
  // forgot to check, went on to re-use instead
  second_call(&eec);  // EEC will abort when second_call tries to set
  if (eec.check() != ExplodingErrorCode::OK) {
    cleanup_first_call();
    tear_down();
    return;
  }
}
Assume, for the moment, that second_call does try to set the EEC properly. In that case, the call to set will abort when it discovers that the EEC is already in the set-but-unchecked state.

Callee forgets to set

Let's go back to a correct do_stuff, but now let's say the callee forgets to set the error code:

void first_call(ExplodingErrorCode *eec) {
  my_guts();
  // forgot to call eec->set(...) here
}

void do_stuff() {
  ExplodingErrorCode eec;
  set_up();
  first_call(&eec);
  if (eec.check() != ExplodingErrorCode::OK) {  // abort happens here
    tear_down();
    return;
  }
}
Now the call to check() will abort, because the EEC will still be in the unset state.

Holes in the safety net

As illustrated above, the ExplodingErrorCode strategy turns a lot of common error-handling programming errors into deterministic aborts. In particular, as long as one of the caller and callee does the right thing, we will reliably detect the failure of its calling partner to behave. However, compound errors can still slip through the cracks. For instance, consider the case in which the caller re-uses the EEC without checking it, and the second callee forgets to set the EEC. In this case, when the second call returns, the EEC will still be set-but-unchecked from the first call. Assuming the caller then checks the error code, it will read the first call's code as if it were the second call's, and the double error will go undetected.
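Here is a small sketch of that double error, to show that nothing explodes. The class is a compact stand-in with only the operations this example needs, and the FAILED value and the bodies of first_call and second_call are assumptions invented for the illustration.

```cpp
#include <cstdio>
#include <cstdlib>

// Compact stand-in for the EEC state machine (only set/check/destructor).
class ExplodingErrorCode {
 public:
  enum Code { OK = 0, FAILED = 1 };  // FAILED is hypothetical
  ExplodingErrorCode() : state_(UNSET), code_(OK) {}
  ~ExplodingErrorCode() {
    if (state_ != SET_AND_CHECKED) explode("destroyed unchecked");
  }
  void set(Code code) {
    if (state_ != UNSET) explode("set() on a non-unset EEC");
    code_ = code;
    state_ = SET_BUT_UNCHECKED;
  }
  Code check() {
    if (state_ != SET_BUT_UNCHECKED) explode("check() on a non-pending EEC");
    state_ = SET_AND_CHECKED;
    return code_;
  }

 private:
  enum State { UNSET, SET_BUT_UNCHECKED, SET_AND_CHECKED };
  static void explode(const char* why) {
    std::fprintf(stderr, "ExplodingErrorCode exploded: %s\n", why);
    std::abort();
  }
  State state_;
  Code code_;
};

// Hypothetical callee that behaves correctly and reports a failure.
void first_call(ExplodingErrorCode* eec) { eec->set(ExplodingErrorCode::FAILED); }

// Hypothetical callee with a bug: it never sets the EEC.
void second_call(ExplodingErrorCode*) {}

// Caller with a bug: it re-uses the EEC without checking it first.
ExplodingErrorCode::Code do_stuff() {
  ExplodingErrorCode eec;
  first_call(&eec);
  // forgot to check here
  second_call(&eec);   // no set(), so no explosion
  return eec.check();  // legal: still set-but-unchecked from first_call
}
```

In this sketch do_stuff() returns first_call's FAILED code and the destructor runs cleanly, so the two mistakes cancel each other out as far as the EEC can tell.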

Compounding verbosity

As I've presented it so far, it can be kind of verbose to use an error code from your parent to call a number of children. Here's a version of the last example from my previous post that illustrates a successful, but verbose and repetitive, daisy-chained use of an EEC:

// Parent wants to know about errors.
void do_stuff(ExplodingErrorCode *eec) {
  set_up();
  // eec is already a pointer, so pass it through as-is.
  // On failure it stays set-but-unchecked for the parent to check.
  first_call(eec);
  if (eec->get_code() != ExplodingErrorCode::OK) { goto first_failed; }
  eec->check();
  eec->unset();

  second_call(eec);
  if (eec->get_code() != ExplodingErrorCode::OK) { goto second_failed; }
  eec->check();
  eec->unset();

  third_call(eec);
  if (eec->get_code() != ExplodingErrorCode::OK) { goto third_failed; }
  eec->check();
  eec->unset();

  fourth_call(eec);
  if (eec->get_code() != ExplodingErrorCode::OK) { goto fourth_failed; }
  // don't unset the EEC here, it's the return value!

  tear_down();
  return;
fourth_failed:
  cleanup_third_call();
third_failed:
  cleanup_second_call();
second_failed:
  cleanup_first_call();
first_failed:
  tear_down();
}
We're still experimenting with idioms, helper methods, and shorthands to try to simplify these daisy-chained cases...
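As one possible direction, and purely as a sketch rather than anything we have settled on, the three-line get_code/check/unset sequence could be folded into a helper. The name failed() is hypothetical, as are the FAILED value and the trivial callees; the class is a compact stand-in for the state machine described earlier, using the unset() operation from the list above.

```cpp
#include <cstdio>
#include <cstdlib>

// Compact stand-in for the EEC state machine.
class ExplodingErrorCode {
 public:
  enum Code { OK = 0, FAILED = 1 };  // FAILED is hypothetical
  ExplodingErrorCode() : state_(UNSET), code_(OK) {}
  ~ExplodingErrorCode() { require(state_ == SET_AND_CHECKED, "destroyed unchecked"); }
  void set(Code code) {
    require(state_ == UNSET, "set() on a non-unset EEC");
    code_ = code;
    state_ = SET_BUT_UNCHECKED;
  }
  Code get_code() const {
    require(state_ != UNSET, "get_code() on an unset EEC");
    return code_;
  }
  Code check() {
    require(state_ == SET_BUT_UNCHECKED, "check() on a non-pending EEC");
    state_ = SET_AND_CHECKED;
    return code_;
  }
  void unset() {
    require(state_ == SET_AND_CHECKED, "unset() on an unchecked EEC");
    state_ = UNSET;
  }

 private:
  enum State { UNSET, SET_BUT_UNCHECKED, SET_AND_CHECKED };
  static void require(bool ok, const char* why) {
    if (ok) return;
    std::fprintf(stderr, "ExplodingErrorCode exploded: %s\n", why);
    std::abort();
  }
  State state_;
  Code code_;
};

// Hypothetical helper. On failure it returns true and leaves the EEC
// set-but-unchecked so the parent can check it. On success it checks and
// unsets the EEC so the next child call can re-use it.
inline bool failed(ExplodingErrorCode* eec) {
  if (eec->get_code() != ExplodingErrorCode::OK) return true;
  eec->check();
  eec->unset();
  return false;
}

// Hypothetical children that always succeed.
void first_call(ExplodingErrorCode* eec) { eec->set(ExplodingErrorCode::OK); }
void second_call(ExplodingErrorCode* eec) { eec->set(ExplodingErrorCode::OK); }

void do_stuff(ExplodingErrorCode* eec) {
  first_call(eec);
  if (failed(eec)) return;  // the parent sees first_call's code
  second_call(eec);         // last call: its code is the return value
}
```

The last call in the chain still cannot use the helper, because its code has to stay set-but-unchecked for the parent, so this only shortens the intermediate steps.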

Edited 2010-11-10: Fixed the error pointed out by msanchez.

14 comments:

  1. Basically, this is a runtime version of checked exceptions, as in Java. But Java programmers who don't want to do error checking just stick in boilerplate that catches exceptions, but doesn't do anything about them. I don't think this approach would work any better.

  2. It is very similar to checked exceptions. The key distinction is that the author is forced to handle each EEC inline, whereas try/catch can be placed at arbitrary points. As you point out, the handling can still be minimal, but the control flow remains predictable, and the handling is textually proximate to the associated call/method.

  3. In the last example, you have 4 distinct labels, but all goto statements jump to the fourth_failed label.

  4. You're entirely correct. I've fixed it --- thanks!

  5. With all due respect, do you think that using gotos in the example is a good idea? I've always found them to be 'evil' and never a good pattern for writing clean code. Doesn't look like it's helping your case IMHO.

  6. Have you considered returning these EEC objects instead of passing them by pointer? It would spare you the dereferencing and expose additional useful functionality:
    1. Functions can simply return enumerated values that can be default constructed into your EEC type.
    2. The 'checked' condition may be reset after every invocation/assignment automatically.
    3. Existing functions that don't conform to your framework may be wrapped with helper functions that understand the function's unique return code.
    4. Functions whose returned EEC object is not captured in a variable would immediately trigger your "unhandled" logic.

  7. I played with returning objects, but the internal logic got a bit too "clever" -- since an un-handled code could get handled locally or returned, you had to essentially do a reference-counting implementation to notice when the last copy was demolished without the code ever having been checked.

    We more or less use the Google C++ style guidelines, so we're not afraid of pointers and dereferencing anyhow -- the simplification was definitely worth it.

  8. Concerning returning an EEC you could avoid having to implement reference counting if the object being returned to is initialized by a move constructor. That way you could get a clean destruction of the internal EEC and simply move all its info out into the returned object for processing.

  9. My "too clever" detector starts firing pretty fast on this, but it's a valid experiment. Let me know how it works out for you ;-)

  10. So every function call is six lines of code, right?

  11. So is the exception handling code:

    try {
      first_call();
    } catch e {
      tear_down();
      return;
    }

  12. Jeremy: nobody using exceptions writes code like that, if they know what they're doing. If something needs tearing down, they'll use a deterministic disposal idiom. For example, in C#:

    using (Lock(foo))
    using (GetBar(ref bar))
    using (GetBaz(ref baz))
    {
      foo.blah(bar);
      baz();
    }

  13. The point of good exception handling is to never write the word 'catch' if you can help it. Any teardown you need belongs in a 'finally', ideally one single 'finally' at the end of the whole routine, and ideally using some convention, like a stack of closures (function pointers), that builds up the teardown as incremental progress is made, if the teardown needs to be that complicated. What you definitely do not want is writing try/catch/finally around every single call that can fail. That way madness lies.
