Introduction

This article briefly summarizes a couple of strategies and mindsets that are useful when establishing a modus operandi to deal with software defects.

We generally think in terms of three distinct steps when triaging and solving bugs:

1. Reproduce

Before you take any further steps you have to be able to reliably and deterministically reproduce the problem in a controlled environment. This typically requires you to identify and re-create the conditions needed to reproduce the problem in your development environment.

If you can't accurately reproduce the problem, it will be nearly impossible to diagnose properly. Non-deterministic behavior will be very hard to spot and remove, and verifying that you've actually solved the problem becomes a more or less hopeless endeavor.

  • When feasible, try to create a minimal version of the software in which the problem manifests.
  • Discover the shortest path to reproducing the problem by leaving out certain steps and/or inputs.
  • Binary search over the commit history (e.g. git bisect) is a systematic way to determine which change introduced the problem into the code base.
  • Document all the conditions and steps necessary to reproduce the bug.
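To make the binary search idea concrete, here is a minimal sketch in Python of the kind of search a tool like git bisect performs. The revision names and the is_bad check are made up for illustration. In practice the check would build the software and run your reproduction steps. The sketch assumes the oldest revision is known to be good, the newest is known to be bad, and the bug stays present once introduced.

```python
from typing import Callable, Sequence


def first_bad(revisions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest revision for which is_bad() is true.

    Assumes revisions are ordered oldest to newest, revisions[0] is good,
    revisions[-1] is bad, and the bug persists once it has appeared.
    """
    good, bad = 0, len(revisions) - 1  # indices of a known-good and a known-bad revision
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(revisions[mid]):
            bad = mid   # the bug was introduced at mid or earlier
        else:
            good = mid  # the bug was introduced after mid
    return revisions[bad]


# Hypothetical history in which the bug first appears in revision "d4".
history = ["a1", "b2", "c3", "d4", "e5", "f6"]
culprit = first_bad(history, lambda rev: history.index(rev) >= history.index("d4"))
print(culprit)
```

Each step halves the remaining range, so even a long history needs only a handful of reproduction runs. This is also why a fast, documented reproduction recipe pays off.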

2. Diagnose

The next step is to formulate hypotheses about the possible cause and run repeatable experiments to prove or disprove them.

  • Before you start you have to fully understand the expected outcome.
  • Keep making small incremental changes to your experiments, make sure you understand what they actually mean, and keep a record of what you've tried.
  • Focus on the root cause of the defect. We want to solve the underlying issue, not just the symptoms.
  • Occam's razor is a useful mental model when testing hypotheses: start with the simplest explanation that fits the observations.
  • Make sure you read and understand any related code and configuration.
  • Work in topic branches so that you can quickly navigate between experiments and reset the code base.
  • Using the test suite to prove or disprove hypotheses is usually a good idea: it is easy to make incremental changes, every run starts from a fresh state, and test frameworks provide specialized utilities for verifying outputs.
  • Use any tools available to you to continuously inspect the state of the running software.
  • Talk to other developers on the team who might have touched the relevant parts of the project.
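As an illustration of using the test suite as an experiment, the sketch below encodes a single hypothesis as a small test case using Python's unittest module. The average function and the hypothesis itself are hypothetical stand-ins for whatever code you are investigating.

```python
import unittest


def average(values):
    """Hypothetical function under investigation."""
    return sum(values) / len(values)


class EmptyInputHypothesis(unittest.TestCase):
    """Hypothesis: the reported crash is triggered by an empty input list."""

    def test_empty_input_raises(self):
        # If the hypothesis holds, an empty list reproduces the crash.
        with self.assertRaises(ZeroDivisionError):
            average([])

    def test_non_empty_input_works(self):
        # Control experiment: the same code path with valid input behaves as expected.
        self.assertEqual(average([2, 4]), 3)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(EmptyInputHypothesis)
result = unittest.TextTestRunner().run(suite)
```

If both tests pass, the hypothesis is supported. If the first one fails, the cause lies elsewhere and you can move on to the next hypothesis. Either way, the test doubles as a record of what you've tried.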

3. Implement

When you've properly diagnosed the problem and identified the root cause it is time to implement the solution.

  • Make sure you clearly understand the problem and solution before you start writing any implementation code.
  • It is important that you do not introduce any regressions (i.e. you must be certain that you're not breaking some other part of the system with your fix). Ensure all existing tests pass before and after implementing the solution.
  • Maintain or improve (but see the next point) the overall quality of the code base.
  • Avoid fixing other issues and refactoring -- unless strictly necessary -- when you're implementing the fix.
  • Make sure new and related functionality is sufficiently covered by automated tests.
  • Provide a "post mortem" to relevant stakeholders. A couple of extra lines in a commit message, a short explanation of what happened in the relevant Slack channel, or a briefing at the next team meeting is usually sufficient. Inform the users of the software as necessary.
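One way to cover a fix with automated tests is to add a regression test that replays the original failure next to a test that pins the existing behavior. The sketch below assumes a hypothetical average function whose fix is to return 0.0 for empty input. Both the function and the chosen fix are illustrative only.

```python
import unittest


def average(values):
    """Hypothetical fixed function: empty input yields 0.0 instead of crashing."""
    if not values:
        return 0.0
    return sum(values) / len(values)


class AverageRegressionTest(unittest.TestCase):
    def test_empty_input_returns_zero(self):
        # Replays the original defect: before the fix this raised ZeroDivisionError.
        self.assertEqual(average([]), 0.0)

    def test_existing_behaviour_unchanged(self):
        # Guards against the fix changing results for valid input.
        self.assertEqual(average([2, 4]), 3)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(AverageRegressionTest)
outcome = unittest.TextTestRunner().run(suite)
```

The first test should fail on the code base without the fix and pass with it, which is a quick check that the test really exercises the defect.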

Happy hunting!
