Monday, December 27, 2010

Emperors and airports

"Proper and effective security requires multiple layers of systems, procedures and policies that are interlaced and constantly monitored," the airport said. "The vast majority of the widespread layers of this security program are behind the scenes and transparent to casual observers."  -- SFO flunky quoted at the end of this CNN article.

Turns out, the emperor wasn't naked.  His clothes were just transparent to casual observers.

Monday, December 20, 2010

Why let users choose their passwords?

As the release notes for the Gawker debacle illustrate, users make up sucky passwords.  Maybe the real lesson is that users should not be allowed to produce their own passwords.  Instead, websites should auto-generate passwords for users --- passwords that have real complexity, rather than "password" or "123456".   

Sure, this would force most users to write their passwords down --- or keep them in something like KeePass --- but that pushes the point of failure to the user.  The overall risk is way lower than the risk that all 50 websites for which said user used an identical, identically-lame password will remain indefinitely uncompromised.

Thoughts?

Thursday, February 18, 2010

Notes on using boost

  1. Using the boostpro Windows installer is substantially slower than downloading and building boost from scratch.
  2. I like to set the compile-time flag BOOST_ALL_NO_LIB so that I can manually manage what's getting glued into my program.  The same manual link-list then serves as a list of what has to go into the installer package.
    // BOOST_ALL_NO_LIB: Tells the config system not to automatically select 
    // which libraries to link against.  
    // Normally if a compiler supports #pragma lib, then the correct library 
    // build variant will be automatically selected and linked against, 
    // simply by the act of including one of that library's headers.  
    // This macro turns that feature off.
    // #define BOOST_ALL_NO_LIB
  3. "ALL" doesn't mean what you think it means:
    // BOOST_ALL_DYN_LINK: Forces all libraries that have separate source, 
    // to be linked as dll's rather than static libraries on Microsoft Windows 
    // (this macro is used to turn on __declspec(dllimport) modifiers, so that 
    // the compiler knows which symbols to look for in a dll rather than in a 
    // static library).  Note that there may be some libraries that can only 
    // be statically linked (Boost.Test for example) and others which may only 
    // be dynamically linked (Boost.Threads for example), in these cases this 
    // macro has no effect.
    // #define BOOST_ALL_DYN_LINK

Thursday, November 26, 2009

"Exploding Error Codes"

Life delays blog posts, but here we go at last:

We're experimenting with a novel error handling strategy that I call "Exploding Error Codes." The intended usage pattern looks like this: Any function that needs to indicate an error condition takes a pointer to an EEC. Requiring the EEC to be passed as an argument prevents it from being forgotten the way that a return value error indicator can easily be overlooked.
A called function is required to always set its EEC argument. A caller is required to always check the EEC after the call. This pattern keeps control-flow linear and local, thus avoiding the control-flow obfuscation you get from exceptions.
However, just because you pass an error container by value doesn't mean that the call-er sets it, or the call-ee checks it. So exploding error codes feature enforcement logic:
  • An EEC has three states: unset, set-but-unchecked, and set-and-checked.
  • A new EEC starts in the unset state.
  • On EEC destruction, it is an error if the EEC is in any state but set-and-checked. In C++, we stack-allocate EECs in callers to get this check as soon as possible.
  • An EEC supports these operations: set(code), get_code(), check(), unset()
    • set(code) sets the error value of the EEC. It is an error if the EEC is in any state but the unset state.
    • get_code(): It is an error if the EEC is not in one of the set states. This call has no side-effects. It returns the error value of the EEC.
    • check(): It is an error if the EEC is not in the set-but-unchecked state. The EEC is placed in the set-and-checked state, and the call returns the error value of the EEC.
    • unset(): It is an error if the EEC is not in the set-and-checked state. The EEC is placed in the unset state.
  • Whenever an error ocurrs, the EEC explodes.  And by "explodes", I mean it logs a stack trace and calls abort(), thus ending program execution.
The overall goal is straightforward: to cause a deterministic program abort any time that a callee forgets to set an error code, or a caller fails to check an error code --- whether or not there was an error condition. This determinism enables the developer to rapidly discover and correct places where we're not handling errors properly.
So given that that's the objective, let's look at how these things work out in several simple cases.

Correct caller, correct callee

Here's a caller that allocates, passes, and checks an EEC properly:
void do_stuff() {
  ExplodingErrorCode eec;
  set_up();
  first_call(&eec);
  if (eec.check() != (ExplodingErrorCode::OK)) {
     tear_down();
      return;
  }
}
And here's a callee that sets the EEC properly, too:
void first_call(ExplodingErrorCode *eec) {
  my_guts();
  eec->set(ExplodingErrorCode::OK);
}
The EEC is allocated in the unset state. After the call to first_call, it is in the set-but-unchecked state. After do_stuff calls check(), the EEC is in the set-and-checked state. Finally, at the end of do_stuff(), the EEC's destructor runs; since the EEC is set-and-checked, destruction proceeds without incident.

Caller forgets to check before destruction

Consider a do_stuff that looks like this:
void do_stuff() {
  ExplodingErrorCode eec;
  set_up();
  first_call(&eec);
  more_stuf();
  // EEC destructor will deterministically abort when we get here
}
The very first time we run this program, it will deterministically abort when the EEC destructor runs and discovers that the EEC has not been checked.

Caller forgets to check before re-use

Here's another common mistake:
void do_stuff() {
  ExplodingErrorCode eec;
  set_up();
  first_call(&eec);
  // forgot to check, went on to re-use instead
  second_call(&eec);  // EEC will abort when second_call tries to set 
  if (eec.check() != (ExplodingErrorCode::OK)) {
     cleanup_first_call();
     tear_down();
     return;
  }
}
Assume, for the moment, that second_call does try to set the EEC properly. In that case, the call to set will abort when it discovers that the EEC is already in the set-but-unchecked state.

Callee forgets to set

Let's go back to a correct do_stuff, but now let's say the callee forgets to set the error code:

void first_call(ExplodingErrorCode *eec) {
  my_guts();
  eec->set(ExplodingErrorCode::OK);
}

void do_stuff() {
  ExplodingErrorCode eec;
  set_up();
  first_call(&eec);
  if (eec.check() != (ExplodingErrorCode::OK)) {  // abort happens here
     tear_down();
     return;
  }
}
Now the call to check() will abort, because the EEC will still be in the unset state.

Holes in the safety net

As illustrated above, the ExplodingErrorCode strategy turns a lot of common error-handling programming errors into determinstic aborts. In particular, as long as one of caller and callee does the right thing, we will reliably detect the failure of their calling partner to behave. However, compound errors can still slip through the cracks. For instance, consider the case in which the caller re-uses the EEC without checking it; and in which the second callee forgets to set the EEC. In this case, when the second call returns, the EEC will still be set-and-unchecked from the first call. Assuming the caller then checks the error code, the double-error will go undetected.

Compounding verbosity

As I've presented it so far, it can be kind of verbose to use an error code from your parent to call a number of children. Here's a version of the last example from my previous post that illustrates a successful, but verbose and repetitive, daisy-chained use of an EEC:

// Parent wants to know about errors.
void do_stuff(ExplodingErrorCode *eec) {
  int flag;
  set_up();
  first_call(&eec);
  if (eec->get_code() != ExplodingErrorCode::OK) { goto first_failed;}
  eec->check();
  eec->reset();

  second_call(&eec);
  if (eec->get_code() != ExplodingErrorCode::OK) { goto second_failed;}
  eec->check();
  eec->reset();

  third_call(&eec);
  if (eec->get_code() != ExplodingErrorCode::OK) { goto third_failed;}
  eec->check();
  eec->reset();

  fourth_call(&eec);
  if (eec->get_code() != ExplodingErrorCode::OK) { goto fourth_failed;}
  // don't reset EEC here, it's the return version!

  tear_down();
  return 0;
fourth_failed:
  cleanup_third_call();
third_failed:
  cleanup_second_call();
second_failed:
  cleanup_first_call();
first_failed:
  tear_down();
}
We're still experimenting with idioms, helper methods, and shorthands to try to simplify these daisy-chained cases...

Edited 2010-11-10: Fixed the error pointed out by msanchez.

Thursday, October 15, 2009

Error handling is hard. Let's use goto!

Error conditions in subroutines are hard to handle properly. Both
of the common strategies, checking return values and exceptions, are
prone to leaving cases uncovered that aren't discovered until runtime
And leaving a function in the middle of its operation is always hard
--- you have to be careful to leave the program's world in a
consistent state.

Checking return values is subject to, well, failure to check the
return value:

Check return value; local cleanup

// returns non-zero value on error
int do_stuff() {
  int flag;
  set_up();
  flag = first_call();
  if (flag != 0) {
     tear_down();
     return 1;
  }
  flag = second_call();
  if (flag != 0) {
     cleanup_first_call();
     tear_down();
     return 1;
  }
  flag = third_call();
  // oops, didn't check the flag, or cleanup first and second calls
  flag = fourth_call();
  if (flag != 0) {
     cleanup_third_call();
     // oops, didn't clean up after second_call.  Maybe it was added later
     // and the developer missed this cleanup clause.
     cleanup_first_call();
     tear_down();
     return 1;
  }
  tear_down();
  return 0;
}

Exceptions seem like a nice fix. They make sure that you're not
going to get through to fourth_call if third_call blows up. Here's
the first naive attempt:

Pass-through exceptions; no cleanup

// raises an exception on error, but has problems
void do_stuff() {
  set_up();
  first_call();
  second_call();
  third_call();
  fourth_call();
  tear_down();
}

That's pretty, isn't it? We got rid of all that horrible error
handling code that was confusing the core flow, etc. and so forth.
But actually, we're now deep in the thickets.

Here's the first problem: if any of first_call, etc. raise an
exception, tear_down() isn't going to get called. We can use
try/catch to guard against this problem:

Exceptions rethrown by tear-down guard

// make sure to call tear_down on exception
void do_stuff() {
  set_up();
  try {
    first_call();
    second_call();
    third_call();
    fourth_call();
  } catch (...) {
    tear_down();
    throw;
  }
  tear_down();
}

Since tear_down is supposed to get run in all exit paths , in C++
we can use a stack-allocated helper class, implementing the RAII
(Resource Acquisition Is Initialization) pattern, to make this a
little less error prone:

Pass-through exceptions; RAII guard

class SetUpHelper {
public:
  SetUpHelper() {set_up();}
  ~SetUpHelper() {tear_down();}
}
void do_stuff() {
  // automatically call tear-down when the destructor runs
  SetUpHelper raii_setup;
  first_call();
  second_call();
  third_call();
  fourth_call();
}

That's better, but there's still something wrong here: nothing is
ever calling our cleanup functions, cleanup_first_call(), etc. If
there's an exceptional exit, we're presently going to leave the
program's internals in some inconsistent state. Let's fix that up
too, try/catch style:

Re-thrown exceptions with local cleanup

void do_stuff() {
  SetUpHelper raii_setup;
  first_call();
  try {
    second_call();
  } catch (...) {
    cleanup_first_call();
    throw;
  }
  try {
    third_call();
  } catch (...) {
    cleanup_second_call();
    cleanup_first_call();
    throw;
  }
  try {
    fourth_call();
  } catch (...) {
    cleanup_third_call();
    // oops, forgot to clean up second_call. Maybe it was added later
    // and the developer missed this handler
    cleanup_first_call();
    throw;
  }
}

That's not substantially better than our original flag-based
handling --- the error logic is all inline with the main path, there
are just as many opportunities to forget to try/catch or call the
right cleanup sequence, there's code duplication, etc.

We can clean up a little of the code duplication, and reduce the
opportunity for error, by nesting our handler logic:

Re-thrown exceptions with nested cleanup

void do_stuff() {
  SetUpHelper raii_setup;
  first_call();
  try {
    second_call();
    try {
      third_call();
      try {
        fourth_call();
      } catch (...) {
        cleanup_third_call();
        throw:
      }
    } catch (...) {
      cleanup_second_call();
      throw;
    }
  } catch (...) {    
    cleanup_first_call();
    throw;
  }
}

Is this really an improvement? Our main-line logic isn't
intrinsically nested, but here we've imposed a nesting structure just
to handle errors. It adds considerable visual and structural
complexity. Furthermore, the nesting interleaves main-line logic and
clean-up logic nearly as badly as our original flag logic did.
Ugly.

With all due respect to the "...Considered Harmful" philosophy,
C and C++ give us a better option than we've seen so far: goto.

Check return value; goto-based cleanup

// returns non-zero value on error
int do_stuff() {
  int flag;
  set_up();
  flag = first_call();
  if (flag != 0) { goto first_failed;}
  flag = second_call();
  if (flag != 0) { goto second_failed;}
  flag = third_call();
  if (flag != 0) { goto third_failed;}
  flag = fourth_call();
  if (flag != 0) { goto fourth_failed;}
  tear_down();
  return 0;
fourth_failed:
  cleanup_third_call();
third_failed:
  cleanup_second_call();
second_failed:
  cleanup_first_call();
first_failed:
  tear_down();
  return 1;
}

Loook at that! Our main-line logic is linear. It's not cluttered
with clean-up logic, but the correlation between a main-line exit and
a clean-up entry is clearly named. There's a regularity to the logic
which is easy to inspect for missing flag checks. There's no code
duplication calling the cleanup functions. This code
is readable.

We do have one bit of duplicate code, the two calls to tear_down().
The duplication here is a lot less verbose, and a lot more obvious to
someone reading the code, than having an extra RAII class.

To be sure, there is still room for error. You could forget to
check a flag. If you were to swap the order of second_call and
third_call, you'd need to do so in two places to make cleanup behave
properly sane; of course, this is actually just as true in the nested
exceptions example.

Finally, there's one dangling bit of ugliness we haven't addressed
anywhere yet. What if, say, second_cleanup can fail too? If it's
returning a flag, we're ignoring it in every example above. Oops!

If it's throwing an exception... well, in our exception-based
examples, it's doing so from the middle of another exception's catch
clause. That's pretty ugly too!

In my next post, I'll describe a novel error handling strategy that
we're trying out to address some of the problems inherent to the
schemes above. (By the way, it's not dependent on goto, so those of
you holding air sickness bags can put them down.)

Saturday, September 26, 2009

The Unit Test shoe does not fit all feet

In the last year, my company has written software of four very different types.

One type is intended to be used internally, rarely, by gurus. It's fine for it to require human intervention, to break, to require editing of Python source files to change its behavior. This stuff is developed on our own time (and dime) to solve short-term problems and to prototype concepts.

The second type is demoware. It has to work along exactly one path while we take screenshots or record a video. This is developed on a small budget (internal or external) and a tight schedule. But the target audience isn't expecting anything more than a demo reel and a powerpoint presentation.

A third kind is a customer-deliverable application. This has to work pretty reliably; the customer will use it frequently, time is valuable, all of that. This development is fully funded.

And then there's the fourth kind. The fourth kind is life-safety critical software. We develop software for mobile robots, some of which are big and scary. This development is fully funded too.

Anyone who says that we shouldn't use unit testing for the fourth type of software is an idiot.

Anyone who says that we should use unit testing for the first type is an idiot too.

Anyone who says that we should use unit testing for the second type of software is oblivious to the constraints of time and budget --- the constraints that make sure that, in a business, everyone gets paid. This testhead is not quite an idiot, but is limited by perfectionist blinders that will hurt project, team, and company.

The third type of software -- 'pretty reliable' end-user software -- is the only place where the unit-test-or-not discussion is worth having.