Thursday, November 26, 2009

"Exploding Error Codes"

Life delays blog posts, but here we go at last:

We're experimenting with a novel error handling strategy that I call "Exploding Error Codes." The intended usage pattern looks like this: Any function that needs to indicate an error condition takes a pointer to an EEC. Requiring the EEC to be passed as an argument prevents it from being forgotten the way that a return value error indicator can easily be overlooked.
A called function is required to always set its EEC argument. A caller is required to always check the EEC after the call. This pattern keeps control-flow linear and local, thus avoiding the control-flow obfuscation you get from exceptions.
However, just because you pass an error container by pointer doesn't mean that the callee sets it, or that the caller checks it. So exploding error codes feature enforcement logic:
  • An EEC has three states: unset, set-but-unchecked, and set-and-checked.
  • A new EEC starts in the unset state.
  • On EEC destruction, it is an error if the EEC is in any state but set-and-checked. In C++, we stack-allocate EECs in callers to get this check as soon as possible.
  • An EEC supports these operations: set(code), get_code(), check(), unset()
    • set(code) sets the error value of the EEC. It is an error if the EEC is in any state but the unset state.
    • get_code(): It is an error if the EEC is not in one of the set states. This call has no side-effects. It returns the error value of the EEC.
    • check(): It is an error if the EEC is not in the set-but-unchecked state. The EEC is placed in the set-and-checked state, and the call returns the error value of the EEC.
    • unset(): It is an error if the EEC is not in the set-and-checked state. The EEC is placed in the unset state.
  • Whenever one of these errors occurs, the EEC explodes. And by "explodes", I mean it logs a stack trace and calls abort(), thus ending program execution.
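To make the rules above concrete, here is a minimal sketch of what such a class could look like. It is not the actual implementation: the Code values (including FAILED), the State names, the explode() helper, and the deleted copy operations are assumptions made for illustration, and the stack-trace logging is reduced to a single message on stderr.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Illustrative sketch only; everything beyond set/get_code/check/unset
// and the three states is an assumption, not the real class.
class ExplodingErrorCode {
 public:
  enum Code { OK = 0, FAILED = 1 };  // hypothetical error values

  ExplodingErrorCode() = default;
  // Assumption: copying is disabled so one EEC has exactly one state.
  ExplodingErrorCode(const ExplodingErrorCode&) = delete;
  ExplodingErrorCode& operator=(const ExplodingErrorCode&) = delete;

  // Destruction is only legal in the set-and-checked state.
  ~ExplodingErrorCode() {
    if (state_ != State::kSetAndChecked) explode("destroyed before being set and checked");
  }

  // unset -> set-but-unchecked
  void set(Code code) {
    if (state_ != State::kUnset) explode("set() on an EEC that is already set");
    code_ = code;
    state_ = State::kSetButUnchecked;
  }

  // Legal in either set state; does not change the state.
  Code get_code() const {
    if (state_ == State::kUnset) explode("get_code() on an unset EEC");
    return code_;
  }

  // set-but-unchecked -> set-and-checked
  Code check() {
    if (state_ != State::kSetButUnchecked) explode("check() on an EEC that is not set-but-unchecked");
    state_ = State::kSetAndChecked;
    return code_;
  }

  // set-and-checked -> unset, so the EEC can be passed to another callee
  void unset() {
    if (state_ != State::kSetAndChecked) explode("unset() on an EEC that is not set-and-checked");
    state_ = State::kUnset;
  }

 private:
  enum class State { kUnset, kSetButUnchecked, kSetAndChecked };

  [[noreturn]] static void explode(const char* why) {
    // A fuller version would log a stack trace here before aborting.
    std::fprintf(stderr, "ExplodingErrorCode: %s\n", why);
    std::abort();
  }

  State state_ = State::kUnset;
  Code code_ = OK;
};
```

Because every misuse funnels into explode(), the class never returns to a caller that broke the protocol, which is what makes the abort deterministic.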
The overall goal is straightforward: to cause a deterministic program abort any time that a callee forgets to set an error code, or a caller fails to check an error code --- whether or not there was an error condition. This determinism enables the developer to rapidly discover and correct places where we're not handling errors properly.
So given that that's the objective, let's look at how these things work out in several simple cases.

Correct caller, correct callee

Here's a caller that allocates, passes, and checks an EEC properly:
void do_stuff() {
  ExplodingErrorCode eec;
  set_up();
  first_call(&eec);
  if (eec.check() != (ExplodingErrorCode::OK)) {
     tear_down();
     return;
  }
}
And here's a callee that sets the EEC properly, too:
void first_call(ExplodingErrorCode *eec) {
  my_guts();
  eec->set(ExplodingErrorCode::OK);
}
The EEC is allocated in the unset state. After the call to first_call, it is in the set-but-unchecked state. After do_stuff calls check(), the EEC is in the set-and-checked state. Finally, at the end of do_stuff(), the EEC's destructor runs; since the EEC is set-and-checked, destruction proceeds without incident.

Caller forgets to check before destruction

Consider a do_stuff that looks like this:
void do_stuff() {
  ExplodingErrorCode eec;
  set_up();
  first_call(&eec);
  more_stuff();
  // EEC destructor will deterministically abort when we get here
}
The very first time we run this program, it will deterministically abort when the EEC destructor runs and discovers that the EEC has not been checked.

Caller forgets to check before re-use

Here's another common mistake:
void do_stuff() {
  ExplodingErrorCode eec;
  set_up();
  first_call(&eec);
  // forgot to check, went on to re-use instead
  second_call(&eec);  // EEC will abort when second_call tries to set it
  if (eec.check() != (ExplodingErrorCode::OK)) {
     cleanup_first_call();
     tear_down();
     return;
  }
}
Assume, for the moment, that second_call does try to set the EEC properly. In that case, the call to set will abort when it discovers that the EEC is already in the set-but-unchecked state.

Callee forgets to set

Let's go back to a correct do_stuff, but now let's say the callee forgets to set the error code:

void first_call(ExplodingErrorCode *eec) {
  my_guts();
  // forgot to call eec->set(...)
}

void do_stuff() {
  ExplodingErrorCode eec;
  set_up();
  first_call(&eec);
  if (eec.check() != (ExplodingErrorCode::OK)) {  // abort happens here
     tear_down();
     return;
  }
}
Now the call to check() will abort, because the EEC will still be in the unset state.

Holes in the safety net

As illustrated above, the ExplodingErrorCode strategy turns a lot of common error-handling programming errors into deterministic aborts. In particular, as long as one of caller and callee does the right thing, we will reliably detect the failure of their calling partner to behave. However, compound errors can still slip through the cracks. For instance, consider the case in which the caller re-uses the EEC without checking it, and in which the second callee forgets to set the EEC. In this case, when the second call returns, the EEC will still be set-but-unchecked from the first call. Assuming the caller then checks the error code, the double error will go undetected.
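One way to see why the double error escapes is to model the EEC as a bare state machine and replay the sequence of operations it receives. The sketch below does that. The Op and State enums and the first_explosion() and demo() functions are hypothetical names invented for this illustration; they are not part of the real class.

```cpp
#include <cassert>
#include <vector>

// Operations that a caller/callee pair can apply to a single EEC.
enum class Op { Set, GetCode, Check, Unset, Destroy };
enum class State { Unset, SetButUnchecked, SetAndChecked };

// Replays ops against a fresh EEC. Returns the index of the first
// operation that would explode, or -1 if nothing explodes.
int first_explosion(const std::vector<Op>& ops) {
  State s = State::Unset;
  for (int i = 0; i < static_cast<int>(ops.size()); ++i) {
    switch (ops[i]) {
      case Op::Set:
        if (s != State::Unset) return i;
        s = State::SetButUnchecked;
        break;
      case Op::GetCode:
        if (s == State::Unset) return i;
        break;
      case Op::Check:
        if (s != State::SetButUnchecked) return i;
        s = State::SetAndChecked;
        break;
      case Op::Unset:
        if (s != State::SetAndChecked) return i;
        s = State::Unset;
        break;
      case Op::Destroy:
        if (s != State::SetAndChecked) return i;
        break;
    }
  }
  return -1;
}

void demo() {
  // Caller forgets to check: the destructor explodes.
  assert(first_explosion({Op::Set, Op::Destroy}) == 1);
  // Caller re-uses without checking: the second set explodes.
  assert(first_explosion({Op::Set, Op::Set}) == 1);
  // Callee forgets to set: the check explodes.
  assert(first_explosion({Op::Check}) == 0);
  // Compound error: first callee sets, caller skips the check,
  // second callee forgets to set, caller checks, EEC is destroyed.
  assert(first_explosion({Op::Set, Op::Check, Op::Destroy}) == -1);
}
```

The last sequence is exactly the sequence a correct caller and callee produce, so the EEC has no way to tell the two apart: the skipped check and the skipped set cancel each other out.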

Compounding verbosity

As I've presented it so far, it can be kind of verbose to use an error code from your parent to call a number of children. Here's a version of the last example from my previous post that illustrates a successful, but verbose and repetitive, daisy-chained use of an EEC:

// Parent wants to know about errors.
void do_stuff(ExplodingErrorCode *eec) {
  set_up();
  first_call(eec);  // eec is already a pointer, pass it straight through
  if (eec->get_code() != ExplodingErrorCode::OK) { goto first_failed; }
  eec->check();
  eec->unset();

  second_call(eec);
  if (eec->get_code() != ExplodingErrorCode::OK) { goto second_failed; }
  eec->check();
  eec->unset();

  third_call(eec);
  if (eec->get_code() != ExplodingErrorCode::OK) { goto third_failed; }
  eec->check();
  eec->unset();

  fourth_call(eec);
  if (eec->get_code() != ExplodingErrorCode::OK) { goto fourth_failed; }
  // don't check or unset the EEC here, it's the return value!

  tear_down();
  return;
fourth_failed:
  cleanup_third_call();
third_failed:
  cleanup_second_call();
second_failed:
  cleanup_first_call();
first_failed:
  tear_down();
}
We're still experimenting with idioms, helper methods, and shorthands to try to simplify these daisy-chained cases...

Edited 2010-11-10: Fixed the error pointed out by msanchez.