Effective Debugging

There are different kinds of bugs.

There are bugs for which the cause is hard to find, but once you find it, the problem is easy to solve. The cause of this kind of bug is most often a simple mistake in the code. The fix often consists of just changing a few lines of code.

There are bugs for which it’s easy to find out why they are occurring, but that are hard to solve. These are most often bugs that are caused by an error in the design or architecture of the software. Fixing them means that the idea behind the code needs to be changed, which means that (part of) the code needs to be rewritten.

A few months ago I solved a nasty bug of the first kind for a customer. It was a bug that had to do with SSL connections to a system of a business partner being dropped randomly. This had been playing for months already, and the relationship between my customer and their business partner was suffering from it.

Steps for effective debugging

Many developers don’t like solving bugs. I think this is often because they hate being held responsible for problem. They want to get rid of the problem as quickly as possible so they make a guess about what the cause is, make a fix based on the guess and then hope that it makes the problem go away.

Guessing is, however, not an effective way to solve bugs. You might get lucky and guess correctly, but often your guess is incorrect and the bug reappears sooner or later. And if your fix is based on a guess you never know for sure that you really fixed the bug.

It is much more effective to take a systematic approach, being willing to go deep and taking more time if necessary, until you find proof that your hypothesis about the cause of the bug is correct. It’s also much more fun to solve bugs this way, because it’s much more satisfying to know (rather than hope) that you’ve really fixed a difficult problem.

The steps to solving a bug are:

  1. Investigate the bug. Make sure you understand how the program is supposed to behave, how it behaves in reality and what the difference is exactly. Find out how you can reproduce the problem.
  2. Form a hypothesis about what the cause for the bug might be.
  3. Find proof for your hypothesis. Do experiments, maybe run the code in a debugger, add extra logging, read and understand the source code that’s involved, including source code of libraries or the JDK itself if necessary. Don’t stop until you understand exactly what causes the bug. Go back to step 2 if you find that your hypothesis is incorrect.
  4. Fix the problem based on the knowledge you found in step 3.

Step 3 is is the most important step, but this is also where developers often take shortcuts, or which they even skip completely.

An example

Back to the SSL bug that I solved. The business partner was using an old third-party software, and my customer had custom software that communicated with the business partner’s system over SSL.

There had been attempts to solve the bug that were done based on no more than blind guesses. For example, trying to put all the data in a buffer, and then issuing a single write() call on the OutputStream of the SSLSocket, in the hope that it would all be sent in a single SSL packet. For a while this seemed to work, but the bug intermittently reappeared soon after this “fix”.

After some insistence from my customer, the business partner called in the vendor of their software, who finally discovered exactly why their software was sometimes closing the connection. The messages that we were sending were prefixed with a 4-byte length field. They noticed that our messages were broken up into two SSL records, the first record containing only the very first byte of the message and the second record containing the rest. Their software couldn’t handle that, it expected the first SSL record to contain at least the 4-byte length field. (This is actually a bug in their software, but they couldn’t fix it very easily).

The mystery now was why our software was sending the first byte in a separate SSL record. To find out why this happened, I ran the code in a debugger and stepped into the JDK source code. After some digging I came across class sun.security.ssl.SSLSocketImpl.

The getOutputStream() method of this class returns a sun.security.ssl.AppOutputStream.

Class AppOutputStream implements the write() method. After carefully looking at this I saw something that looked suspicious. There are some lines to decide how many bytes of data to put in an SSL record.

synchronized public void write(byte b[], int off, int len) throws IOException {
    // ...

    int howmuch;
    if (isFirstRecordOfThePayload && c.needToSplitPayload()) {
        howmuch = (len == 0) ? 0 : Math.min(0x01, r.availableDataBytes());
        // ...
    }

    // ...
}

Under some circumstances (if isFirstRecordOfThePayload and c.needToSplitPayload() are true), it decides to put at most 1 byte in the SSL record. This is exactly what the business partner was seeing and what was causing the problem.

It looks like this is done deliberately, and not a bug in the JDK. But why – what is in the method needToSplitPayload? This is a method in class SSLSocketImpl:

/*
 * Need to split the payload except the following cases:
 *
 * 1. protocol version is TLS 1.1 or later;
 * 2. bulk cipher does not use CBC mode, including null bulk cipher suites.
 * 3. the payload is the first application record of a freshly
 *    negotiated TLS session.
 * 4. the CBC protection is disabled;
 *
 * More details, please refer to AppOutputStream.write(byte[], int, int).
 */
boolean needToSplitPayload() {
    writeLock.lock();
    try {
        return (protocolVersion.v <= ProtocolVersion.TLS10.v) &&
                writeCipher.isCBCMode() && !isFirstAppOutputRecord && 
                Record.enableCBCProtection;
    } finally {
        writeLock.unlock();
    }
}

But why is it implemented this way? It turns out that this is a workaround for a security problem with TLSv1.0 and older. Unfortunately, the business partner’s source code can’t deal with this workaround.

Fortunately, the people at Oracle expected that the workaround might cause compatibility problems, so they provided a way to disable it, by setting the system variable jsse.enableCBCProtection to false. (Record.enableCBCProtection contains the value of this system property).

We had a choice between disabling the workaround by setting the system property jsse.enableCBCProtection to false, which would make the software vulnerable to the security flaw, or making sure we use TLSv1.1 or newer, which is what we finally did.

Conclusion

When you have to fix a bug, don’t guess and take shortcuts. Keep on digging until you know exactly what the cause of your bug is and fix the problem based on knowledge instead of on a guess, because that is a much more effective and satisfying way to solve bugs.

Have fun fixing your next bug!