The Hardest Bug I’ve Faced

What is the hardest bug you have found?

Bugs are the only code changes that can’t be estimated, because I only discover the real extent of the problem while investigating. The cause may lie in the source code, the configuration or an external service.

Some bugs are so unpredictable that there is no way to write unit tests which cover them.

I’m a fervent follower of TDD (Test-Driven Development), but it only covers the rules described in the features, not the ecosystem around them.

In this article, I will describe some of the worst bugs that can happen in an application.

Let’s start by splitting them into two main categories: those which I can reproduce on my laptop, and those I can’t reproduce on my laptop.

Misconfigurations

Those are easily reproducible, and most of the time the application itself doesn’t run.

But sometimes, the configuration applied on my laptop differs from the one in the production environment. The hard part is finding out what injects the configuration: environment variables, distributed configuration, configuration files…

Once I have found the applied configuration, figuring out the missing parameter can still be hard. It can take several hours of reading documentation or blogs.
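As a minimal sketch of that search, the snippet below reports which source supplies each parameter and which parameters are missing. The precedence (environment, then file, then default) and every key and value are assumptions made for illustration, not the behaviour of any particular framework.

```python
def resolve_config(keys, env, file_values, defaults):
    """Return {key: (value, source)}, assuming env > file > default."""
    resolved = {}
    for key in keys:
        if key in env:
            resolved[key] = (env[key], "environment")
        elif key in file_values:
            resolved[key] = (file_values[key], "file")
        elif key in defaults:
            resolved[key] = (defaults[key], "default")
        else:
            # No source defines the key: this is the missing parameter.
            resolved[key] = (None, "missing")
    return resolved


# Hypothetical keys and values, for illustration only.
report = resolve_config(
    keys=["DB_URL", "CACHE_TTL", "API_TOKEN"],
    env={"DB_URL": "postgres://prod-db/app"},
    file_values={"DB_URL": "postgres://localhost/app", "CACHE_TTL": "60"},
    defaults={},
)
for key, (value, source) in report.items():
    print(f"{key:10} {source:12} {value}")
```

In a real application, `env` would typically be `os.environ`, and the other two dictionaries would be loaded from wherever the platform stores its settings. Printing the source next to each value makes it easier to see why the laptop and production disagree.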

Incomplete Feature

“It works on my laptop”. The worst excuse from a developer.

Sometimes, I develop a complete feature, with all the detailed unit tests, but when the Product Owner tests it, it doesn’t work.

Why?

Because he’s using different use cases. Use cases which haven’t been implemented.

This can be due to incomplete specifications, incomplete unit tests or time pressure.

The solution is simply to ask for a complete list of use cases and keep implementing the feature.

Not Working Feature

How many times have I written unit tests that pass, only to see the feature fail the Product Owner’s first test?

Most of the time, this is due to a misunderstood behavior. I thought it should work one way, and the Product Owner thought it should work another way.

Before starting a feature, I make sure to understand the feature by talking with the Product Owner. I sit next to him, and I ensure the feature is well described and the use cases are all listed.

This initial meeting is the base of a working feature.

Cross-System Bugs

Now come more interesting bugs.

Let’s say my application calls an external API. This external API may have some bugs as well. And those bugs can make my application unstable.

These are the bugs I can’t reliably reproduce on my laptop.

When connecting to an external API for the first time, I usually define a communication interface that both sides must respect. Still, both my application and the API will evolve, and bugs can appear on both sides.

I should write unit tests to make sure my application always respects the communication interface.

But if a bug comes from the API, my application won’t be prepared for that. It can be a new field, a renamed field, a null value, a different value type…
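One way to notice such changes early is to compare every response with the agreed interface before using it. The sketch below assumes the response is already decoded into a dictionary; the `EXPECTED_ORDER` interface and the sample response are hypothetical.

```python
def check_payload(payload, expected):
    """Compare a decoded response with the agreed fields and types.

    Returns a list of problems; an empty list means the payload matches.
    """
    problems = []
    for field, expected_type in expected.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif payload[field] is None:
            problems.append(f"null value: {field}")
        elif not isinstance(payload[field], expected_type):
            actual = type(payload[field]).__name__
            problems.append(f"wrong type for {field}: got {actual}")
    for field in payload:
        if field not in expected:
            problems.append(f"unexpected field: {field}")
    return problems


# Hypothetical interface and response, for illustration only.
EXPECTED_ORDER = {"id": int, "status": str, "total": float}
response = {"id": "42", "state": "PAID", "total": None}

problems = check_payload(response, EXPECTED_ORDER)
for problem in problems:
    print(problem)
```

A renamed field shows up as one missing field plus one unexpected field, which is usually enough of a hint to start the conversation with the API team. Logging these problems together with the raw response also keeps a copy of the faulty output when the call cannot be repeated.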

Once I see the error in my application, I can figure out that it comes from the external API. But calling the API again can be risky: it may not produce the same output, or I may not be able to call it again at all because the call changes the state of the production environment.

It may be very hard to debug.

The best option is to contact the development team of the API and try to solve the problem together.

Concurrent Bugs

When I develop a feature, I do it on my laptop, alone. When the feature is tested, it’s tested on a controlled platform, with few users.

But when running in production, I may see different behaviours due to the number of users.

I may notice bugs because two users try to update the same entity. Or one user performs a batch operation while another tries to change a small entity. Or two batch operations overlap in time.

I’ve faced many of those scenarios.
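The first scenario, two users updating the same entity, can at least be made visible with a version check, often called optimistic locking. This is a common technique rather than something described in this article, and the `EntityStore` class below is a hypothetical in-memory stand-in for a table with a version column. A real database would have to perform the check and the write atomically, which this single-threaded sketch does not attempt.

```python
class StaleUpdateError(Exception):
    """Raised when an entity changed after the caller read it."""


class EntityStore:
    """In-memory stand-in for a table with a version column (hypothetical)."""

    def __init__(self):
        self._rows = {}  # entity id -> (version, data)

    def create(self, entity_id, data):
        self._rows[entity_id] = (1, dict(data))

    def read(self, entity_id):
        version, data = self._rows[entity_id]
        return version, dict(data)

    def update(self, entity_id, expected_version, data):
        current_version, _ = self._rows[entity_id]
        if current_version != expected_version:
            raise StaleUpdateError(
                f"{entity_id}: expected version {expected_version}, "
                f"found {current_version}"
            )
        self._rows[entity_id] = (current_version + 1, dict(data))
        return current_version + 1


store = EntityStore()
store.create("invoice-1", {"amount": 100})

# Both users read the same version before either of them saves.
alice_version, alice_data = store.read("invoice-1")
bob_version, bob_data = store.read("invoice-1")

alice_data["amount"] = 120
store.update("invoice-1", alice_version, alice_data)  # accepted

bob_data["amount"] = 90
try:
    store.update("invoice-1", bob_version, bob_data)
    conflict = None
except StaleUpdateError as error:
    # The second write is rejected instead of silently overwriting the first.
    conflict = str(error)
print(conflict)
```

The rejected write turns a silent data loss into an explicit error, which is much easier to find in the logs than a value that changed for no apparent reason.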

Sometimes, these scenarios lead to database deadlocks.

Those are the hardest to reproduce. Even after seeing the error in the logs, or hearing a user complain about it, I can’t recreate the same context.

If I can’t recreate the same context, I can’t debug the application, and I can’t tell whether I need to optimize the concurrent access or use a thread-safe library.

Sometimes, a simple prioritization system is the best option.
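One possible reading of such a prioritization system, sketched under that assumption, is a queue that runs operations one at a time, most urgent first, so that a batch and a small update never overlap. The `PrioritizedRunner` class and the priorities are hypothetical.

```python
import heapq
import itertools


class PrioritizedRunner:
    """Run queued operations one at a time, lowest priority number first."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal priorities keep submission order.
        self._counter = itertools.count()

    def submit(self, priority, name, operation):
        heapq.heappush(
            self._heap, (priority, next(self._counter), name, operation)
        )

    def run_all(self):
        executed = []
        while self._heap:
            _, _, name, operation = heapq.heappop(self._heap)
            operation()  # operations never run concurrently
            executed.append(name)
        return executed


# Hypothetical operations on a shared entity, for illustration only.
inventory = {"stock": 10}
runner = PrioritizedRunner()
runner.submit(5, "nightly batch",
              lambda: inventory.update(stock=inventory["stock"] * 2))
runner.submit(1, "user edit",
              lambda: inventory.update(stock=inventory["stock"] - 1))

order = runner.run_all()
print(order, inventory)
```

Serializing the work trades throughput for predictability: the small edit no longer waits behind a lock held by the batch, and the two can no longer deadlock each other.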

Platform Dependent Bugs

Another bug that perfectly fits the “It works on my laptop” category.

Sometimes, I’ve validated a feature in the development environment, with all the use cases. Once delivered to the next environment, it stops working.

Now, it’s time to figure out the differences between the two environments.

It can be the source code. Not all the features are fully merged, and some may create conflicts.
The data in the database may not be the same.
The platform may be configured differently. It may have fewer resources, or have some services deactivated.
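For the platform configuration, a plain comparison of the settings of both environments is often a good starting point. The sketch below assumes the settings of each environment have already been exported as dictionaries; the keys and values are hypothetical.

```python
ABSENT = "<absent>"  # marker for a key that one environment does not define


def diff_environments(left, right):
    """Return {key: (left_value, right_value)} for every key that differs."""
    differences = {}
    for key in sorted(set(left) | set(right)):
        left_value = left.get(key, ABSENT)
        right_value = right.get(key, ABSENT)
        if left_value != right_value:
            differences[key] = (left_value, right_value)
    return differences


# Hypothetical settings, for illustration only.
development = {"MAX_MEMORY_MB": "4096", "SEARCH_ENABLED": "true", "REGION": "eu"}
staging = {"MAX_MEMORY_MB": "1024", "REGION": "eu"}

differences = diff_environments(development, staging)
for key, (dev_value, staging_value) in differences.items():
    print(f"{key}: development={dev_value} staging={staging_value}")
```

Here the comparison would point at the reduced memory and the search service that is not enabled in the second environment, the two kinds of platform difference mentioned above.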

For those bugs, I always configure my laptop so that my local application can run connected to the target environment. This procedure must be followed with caution, as it may introduce further inaccuracies in the data.

Conclusion

For me, the worst bugs I’ve had to fix are the concurrent bugs.

Once, it was batch operations that overlapped and produced database deadlocks. Another time, it was the actions of two users that occurred at the same time.

The best way to keep making progress on a bug is to try different things. Talk to somebody, lay out all the problems, and the solution may emerge from a simple question.