On Debugging

Reflection is not one of my strong suits and I'm trying to get better at it. I've found that saying ideas aloud or writing them down solidifies them more concretely, and I thought I might as well publish the results. This is a log of some lessons learned while debugging various issues and what actually ended up working and was reproducible for later issues. I wouldn't call this an exhaustive list on how to solve problems or hunt down bugs, but they're a starting point (and maybe I can reference this later in anger).

I describe some of these journeys linearly, but they were not. I was often up against a wall on all of them and had to learn something new each time!

Thinking in Reverse

The inverse function turns out to be a useful thought tool: if y = f(x), then x = f⁻¹(y). Thinking backwards, or asking the reverse question, is a powerful mental model. It is well described by Charles Munger, among other useful mental tricks, and it is particularly well-suited to grasping what the next path forward could be for the problem at hand.

It can be applied in the following way: there is a "bug". You observe the current behavior: x is slow, x is doing the wrong thing, x is returning the wrong result; generally there is something wrong. Thinking in reverse allows one to instead ask: "What should x be doing?" This method is much more effective at understanding the consequences of a long chain of events (especially in a larger system) than the first iteration of 'Why is x doing this?'. If we refer to The 7 Habits of Highly Effective People, this is Habit 2, 'Begin with the End in Mind'. If the right-hand side of the equation is the desired state and the left-hand side is the current state, and they are not equal, then there is an issue with the left-hand side.

Here is a more concrete example. I woke up to a bug report that a program which converted data from one format to another was quite slow when there was a large amount of data, but rather snappy when there was less than 100MB to convert. Skipping ahead to the internals, the system was storing the finished data into a process that determined whether the finished data should be cached in memory, which is quite fast, or written to disk once a max memory threshold was breached. That threshold happened to be set to 100MB. Through timing parts of the conversion process from ingestion to conversion to storage to returning results, it became obvious that once the max memory size was breached the storage step became quite expensive, while the cost of the other components was fixed regardless of data size. We have a base intuition from Latency Numbers Everyone Should Know that writing to disk (even an SSD with a large bus) is not fast.
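To make the shape of the problem concrete, here is a minimal sketch in Python of what such a spill-to-disk store looks like. The names and the 100MB constant are illustrative (the real system was not Python), but the mechanic is the same: a fast in-memory path until a threshold, then a slow disk path.

    import os
    import tempfile

    MAX_MEMORY_BYTES = 100 * 1024 * 1024  # the 100MB threshold from the report

    class SpillingStore:
        """Caches chunks in memory until a threshold, then spills to disk."""

        def __init__(self, max_memory=MAX_MEMORY_BYTES):
            self.max_memory = max_memory
            self.in_memory = {}        # fast path
            self.bytes_used = 0
            self.spill_dir = tempfile.mkdtemp(prefix="converted-")

        def store(self, key, chunk):
            if self.bytes_used + len(chunk) <= self.max_memory:
                self.in_memory[key] = chunk
                self.bytes_used += len(chunk)
            else:
                # slow path: every call past the threshold pays for a disk write
                with open(os.path.join(self.spill_dir, key), "wb") as f:
                    f.write(chunk)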

It is quite easy to ask "Why is storing data in this storage process slow?". It is much more useful to think in reverse and ask: "What should be happening when data is sent to the storage process?" Digging in, the original intent was to have an asynchronous interface where the caller was free to store data and continue conversion. This is a great intention, as we know there is a significant chance that this operation involves slow writes to disk. We now have at least one target in mind, and we can invert the inverted question: "Are we writing results asynchronously to the storage process?" It very quickly became apparent that the answer was no: the process that performed conversion was writing synchronously to the storage process! Through timing and asking questions backwards from the end goal it became apparent that waiting on the disk was the bottleneck, and making adjustments to properly send converted data asynchronously increased throughput by over an order of magnitude.
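The difference between the broken and fixed versions was roughly the following. This is a hedged sketch: convert() and the thread pool are stand-ins for whatever the real system used, and a real shared store would also need to be thread-safe.

    from concurrent.futures import ThreadPoolExecutor

    def convert(record):
        return record.upper()  # placeholder for the real conversion work

    def convert_synchronously(records, store):
        for i, record in enumerate(records):
            chunk = convert(record)
            store.store(f"chunk-{i}", chunk)   # blocks on a disk write past the threshold

    def convert_asynchronously(records, store):
        with ThreadPoolExecutor(max_workers=4) as pool:
            for i, record in enumerate(records):
                chunk = convert(record)
                # hand the (possibly slow) write off and keep converting
                pool.submit(store.store, f"chunk-{i}", chunk)
            # exiting the block drains the pool, so all writes finish before returning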

Now when I write this out it seems so very obvious and simple. At the time it was much trickier, and I came to this realization after much wallowing and frustration. It took days of repeated blind experimentation and re-measurement which yielded absolutely nothing before I finally had to shift my mentality and work backwards. I had to hit the proverbial brick wall before the most obvious answer became clear. However, after this realization, and later the self-reflection on how I figured it out, came the formalization of the method. This is now one of the first tools I reach for whenever something is wrong. One thing I want to make clear is this: until you both apply and realize that you applied this thought process, it is worthless. Only "knowing it" has no value, as I have learned. I read about it 3-4 times in different forms and contexts thinking I truly had this tool in my arsenal while in reality it gathered dust.

Correlation When There is Noise

A fun bug report came in: when the company's mobile app hit a 500 internal error it would display a modal with cute text similar to "Couldn't complete the request :(", even for requests that were happening in the background. The CEO hit this modal 5x in one day. How unlucky. The "easy" fix would be to have the mobile app simply not display that modal and silently retry the failed operation, but it's out in the world, so now we must at least investigate the cause of the 500s. The best part was that no internal alarms were going off when these 500s were reported by the Big Man, so this was within our 'tolerance' for errors; we'd considered these transient problems until now. Quite fun indeed. I don't remember if everyone else was busy and this fell on me to fix, but I loathe fixing transient problems.

Further, I was still a junior engineer, and this was my first time debugging a production problem in this system, which happened to be the new microservices.

I got to work and fired up the only tool at my disposal: New Relic. Charting out the request errors, there were indeed spikes of tens to low hundreds of 500 errors every 5 to 10 minutes. The spikes were staggered, with a smaller spike happening on every 5th minute and a larger spike on every 10th.

What I explored next was a cloud-based log aggregation tool where all logs ended up, but sadly there were only a handful of readily analyzed errors, none of which matched the timings in New Relic. Generally there was an extreme volume of garbage in the logs: many random error logs, warnings, various stacktraces, etc. Pretty much nothing could be gleaned at a glance, and we didn't have the request ids of the failures, so we couldn't rely on tracing through the system to drill into a particular failed request. And due to the extremely low frequency it was damn near impossible to capture the failure on a debug build of the mobile app, so picking through the logs was the clearest path forward.

This phase was a bit more painful. I gathered every unique instance of each error type and built a single chart for each of them. Finally I overlaid all the charts to look for patterns. Lo and behold, there were 2 errors across the system that happened in exactly 5-minute spikes! The first was 500s on internal calls to one service in particular. These were caused by the circuit breaker on each service, as the available threadpool was exhausted.

The next error was timeouts on requests to Redis, which was a single cache shared between all of the services (they were yet to be completely network isolated). Every service which accessed Redis had these errors on the same interval, and the effects compounded up the stack. Lower-level services had a handful of Redis timeouts, but these led the higher-level "aggregate" services to experience a high number of timeouts, which in turn led to the larger spikes seen by the clients. A beautiful cascade of failures.
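For the curious, the mechanics behind that overlay were roughly the following. This is a sketch with a hypothetical log format, and matplotlib standing in for the charting tool we actually had: bucket each error type per minute, then plot every type on one set of axes so the spikes line up visually.

    from collections import Counter, defaultdict
    from datetime import datetime
    import matplotlib.pyplot as plt

    def parse(line):
        # hypothetical log format: "2019-06-01T12:05:03 ERROR RedisTimeout ..."
        ts, _level, error_type, *_ = line.split()
        return datetime.fromisoformat(ts), error_type

    def bucket_errors(log_lines):
        """Count each error type per minute."""
        buckets = defaultdict(Counter)  # error_type -> {minute: count}
        for line in log_lines:
            ts, error_type = parse(line)
            buckets[error_type][ts.replace(second=0, microsecond=0)] += 1
        return buckets

    def overlay(buckets):
        for error_type, counts in buckets.items():
            minutes = sorted(counts)
            plt.plot(minutes, [counts[m] for m in minutes], label=error_type)
        plt.legend()
        plt.ylabel("errors per minute")
        plt.show()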

The fix became apparent very quickly after this: something was running every 5 minutes and blocking Redis, which is a single-threaded datastore. The scheduled jobs were identified; they used the blocking KEYS command, which has a big label in the docs saying "DO NOT USE ME". With this fixed the errors went away, and the average request time across the entire system decreased by several milliseconds.
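For what it's worth, the shape of that fix in redis-py looks something like this (the key pattern is invented; the point is that SCAN walks the keyspace incrementally instead of in one blocking call):

    import redis

    r = redis.Redis()

    # Before: KEYS walks the entire keyspace in a single blocking call, and
    # since Redis serves commands on one thread, every other client waits.
    stale = r.keys("session:*")

    # After: SCAN iterates in small batches, so other commands are served
    # between cursor steps.
    stale = list(r.scan_iter(match="session:*", count=1000))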

Beyond describing the whole scenario and the fix, there are several extremely useful takeaways:

  1. When there is a high level of noise it is hard to find the signal. This triggered an initiative to remove some of the more frivolous logs.
  2. When there is a high level of noise it is difficult to tell which errors are actionable. Even after this journey there were still a handful of random 500s happening. We were able to dial in alerting on "spikes" of 500s, but before this fix they seemed just as "random" as the remaining problems.
  3. Having the visual overlay of the error charts was much more helpful than I initially thought, and building the infra to chart individual errors and compare them against all errors as a whole helped debug other issues in the future. It further reduced the time to debug problems that happened with releases, as we could easily pin down exactly what was going wrong with a glance at the overlay.

The fix was the easy part. Finding it took well over a week and included rounds of changing threadpool sizes and attempts to better protect parts of the system, until I realized that we as an organization needed a better mechanism to debug these problems. While I say 'we', I mean that the senior engineer on the team guided me to the answer.

True Name

God called the light Day, and the darkness he called Night. - Genesis 1:5

"To name something is to have power over it", as it goes. Having a name helps immensely. Your IDE can match and search on it. All occurences can be found easily. You know all member variables, what their types are, what the default values should be. This is a extremely common practice. I've written about the pain that comes when names are not present. In Java you have DTOs. In Rust you have structs. In Erlang you have records. In Elixir you have Structs. There are tools.

At times you'll be forced to debug "unnamed datastructures". These are lists, maps, arrays, tuples, what have you. The elements are not defined anywhere in a "formal" definition, but they have meaning. Usages are scattered throughout the code. They're built inconsistently and haphazardly. Or another case: a function head with 20 arguments, some with default values. This is complex logic, but what it is really mimicking is an object with default values.

Questions will come up:

  • "Did they mean to exclude this field when they created this list here?"
  • "Why is this an integer in this case but a string in this other?"
  • "Did this start small and organically grow so large?"

You don't know if this is an accident or on purpose.

It'll be painful, but I find that the most reasonable approach is to give this loose accumulation of values a name, and then you'll truly have power over it. Create a class and go through the usages in the code; as you rewrite you'll often find the exact place where a value was missed, because you'll have a constructor and you've traced through all of the 'invisible' logic of building this object. The types will be there, which will catch the mistake, etc. Under duress I've often abandoned the rewrite, fixed the issue, and left a TODO with a link to the branch.
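In Python, for example, the move might look something like this (the fields are invented for illustration):

    from dataclasses import dataclass, field

    # Before: a "user" was a bare tuple, built slightly differently at every
    # call site: ("ada", 37, None) here, ("grace", "37") somewhere else.

    @dataclass
    class User:
        name: str
        age: int                              # one place to ask "is this an int?"
        roles: list = field(default_factory=list)

    def greeting(user):
        return f"Hello {user.name}, you have {len(user.roles)} roles"

    # Every call site now has to say what it means: a dropped field raises a
    # TypeError at construction, and a type checker flags the str-vs-int mix-ups.
    print(greeting(User(name="ada", age=37)))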

If you're in a language that doesn't use types at compile time this will be a journey. Everywhere that the value is used will break individually. This is the tax, and you must pay it. But it is worth it!

Write it Down

Write everything down always. Keeping detailed notes helps immensely when working on an issue that spans more than 3 days. Write down timings, what worked, what didn't, steps for an algorithm, make small diagrams, keep links to Stackoverflow.

Not only will notes help with a single issue, they'll also help in the future. In 3 years you'll see an issue and think "I've been here. Why does this seem so familiar?"

My note structure usually looks like this: DATE-TICKET_ID-DESCRIPTION. It is a good system: it is sortable by date, and when you search through your ticket history and do find the previous problem, you can easily find it in your notes, where you can hopefully jog your memory :)

As for note-taking apps, they vary, but what I've recently started using is Fossil. It is an SCM with a built-in wiki. The editor is simple to use with markdown, and you can easily throw your entire wiki into Google Drive or S3 daily.

Take a Nap

My last debugging tool: take a nap. Take a break or a walk or do something mindless for 10 minutes. Working from home made this a bit easier, and not having a day packed with meetings also helps. A 10-minute nap is like thought wasabi. It's like espresso but better. You must have diligence: you must not oversleep and you must be consistent. I saved the best for last, but when you're staring at an issue for 5-6 hours and your eyes are burning and you are at the proverbial end of the metaphorical rope, then you need to rest. That is just a fact.



2024-02-19