November 22, 2010

After the Build Breaks

Originally published 28 Aug 2006

The second part of the build/SCM blog series (see part 1) deals with what happens when a build breaks. Builds break, hopefully infrequently, but they do break. Getting the build back on track should be one of the team's highest priorities: your build box teammate is down and you need to get him back on his feet.

Ideally, what should happen is the following:
  1. Notification of a build failure is distributed by e-mail to the team. This is part of the publishing functionality of CI systems like CruiseControl.
  2. Everyone looks at the results, in particular whoever has checked in since the last successful build.
  3. The person who broke the build mans up and replies to all, “This is me. I’m on it.”
  4. Nobody checks in until the build is fixed, unless the check-in is for the purpose of fixing the build.
  5. Somebody fixes the build, obtaining any help necessary.
  6. (Optional) Once the build is believed to be fixed, the person from step 3 sends an e-mail saying the build should be fixed.

Let’s talk more about step 3. I know it can be embarrassing to break the build. The step is not meant to call anyone out or lay blame. The point is to tell the team, which may be large and distributed, that somebody is taking responsibility for fixing the build. Otherwise several people may end up analyzing and trying to fix the same problem, and this message cuts down on that wasted effort. I have been on teams where the build was broken for hours and nobody knew whether anybody was doing anything about it.
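
To make steps 1 through 3 concrete, here is a minimal Python sketch of how a failure notice might list everyone who checked in since the last successful build and say whether anyone has claimed the fix. This is not CruiseControl's API. The `CheckIn` type, the `suspects` and `failure_notice` functions, and the revision numbers are all hypothetical, and a real CI system would supply this data through its own publisher.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckIn:
    """One check-in as the SCM might report it (hypothetical shape)."""
    revision: int
    author: str


def suspects(check_ins, last_good_revision):
    """Authors who checked in after the last successful build.

    Returned in order of their first check-in, without duplicates.
    """
    names = []
    for check_in in sorted(check_ins, key=lambda c: c.revision):
        if check_in.revision > last_good_revision and check_in.author not in names:
            names.append(check_in.author)
    return names


def failure_notice(check_ins, last_good_revision, owner=None):
    """Build the text of the e-mail sent to the team (illustrative wording)."""
    names = ", ".join(suspects(check_ins, last_good_revision))
    status = f"{owner} is on it." if owner else "Nobody has claimed the fix yet."
    return f"Build broken. Check-ins since r{last_good_revision}: {names}. {status}"
```

The `owner` argument stands in for the reply-to-all in step 3. Until somebody fills it in, the notice itself tells the team that the build is unowned.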

Sometimes nobody is doing anything about it. If this is a recurring problem, I recommend the use of a Build Nazi (in honor of the Seinfeld Soup Nazi). The job of the Build Nazi is to make sure somebody is responsible for fixing a broken build. The job is not to fix the build (unless help is needed). The Build Nazi job is not a fun one and therefore should be rotated. It is also a controversial role in an agile environment where the practice of collective code ownership is being followed. Ideally, the team is following the steps above and a Build Nazi is totally unnecessary. I’ve had to resort to the role for short periods in times of chaos, when the build was breaking more often than it was succeeding. Once things get back on track, the Build Nazi role typically becomes dormant.

The last step I’d like to elaborate on is step 4. One of the worst things to do while the build is broken is check in a bunch of changes that don’t have anything to do with fixing it. Each check-in causes the CI system to start another build, which will result in another failure. Piling on check-ins during a broken build prolongs the broken state and therefore lengthens the feedback cycle that is all-important in agile development. Another risk is that the fix for the original problem must then also be applied to the check-ins that occurred after the first failure, further prolonging the broken state.
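
Step 4 is a team convention, but it can also be expressed as a simple rule. The sketch below assumes the team marks build-fixing check-ins with a tag in the commit message. The `[build-fix]` tag and the `allow_check_in` function are made up for illustration and are not part of any particular SCM or CI tool.

```python
def allow_check_in(build_is_broken, message, fix_tag="[build-fix]"):
    """Decide whether a check-in should go through (step 4).

    While the build is green, everything is allowed. While it is broken,
    only check-ins whose message carries the fix tag are allowed.
    The tag is a hypothetical team convention, matched case-insensitively.
    """
    if not build_is_broken:
        return True
    return fix_tag.lower() in message.lower()
```

A check like this could be run by hand, or wired into a pre-commit hook if your SCM provides one. Either way it only enforces the convention. Somebody still has to own the fix.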

I plan on finishing this series next time. Stay tuned.
