Wednesday, September 13, 2017

A plumber's guide to software troubleshooting

The other day, my downstairs neighbor knocked on my door and delivered the alarming news that water was dripping from the ceiling of his laundry closet, that is, from my floor.

Bear with me. This story really does relate to software engineering...

I apologized to my neighbor and started investigating. In addition to the laundry closet occupied by a stacked washer and dryer, the bathroom includes a tub, a shower stall with a glass door, and a vanity with two sinks.

I found no visible sign of water in the laundry closet. I did notice some dampness and water damage inside the vanity cabinet under one sink. "Aha," I thought, "now it will be easy to find where the water is coming from!" Expecting to find a leaky supply or drain pipe, I ran the water, but there were no drips. Then it occurred to me to splash some water on the counter top, as might happen during shaving or face washing. Sure enough, a little water leaked into the cabinet. It appeared to be entering where the faucet passes through the counter. Looking more closely, I discovered the faucet's metal fitting inside the cabinet was corroded. "Aha!" I thought again.

I considered trying to repair it myself, but decided it was beyond my skill and called a plumber. When he arrived, I explained the situation with the neighbor, and pointed out the water damage and corrosion I had noticed. He ran the sink tap and replicated the problem to his satisfaction. He said the corroded part was beyond repair and I would need a whole new faucet set. I'd have to order it myself at a cost of several hundred dollars, plus more than a hundred dollars for a second visit by the plumber to install it.

Discouraged by the price, I decided to investigate further. It occurred to me to splash water onto the backsplash behind the counter. Sure enough, it leaked into the interior of the cabinet. "Aha!" I thought for the third time. This must be the real problem, and I could fix it myself with a simple tube of caulk. I caulked the gap between backsplash and wall, tested it by splashing lots of water up there, and found that the leak was gone. "Victory!" I assumed.

It did seem odd that a small amount of water in a cabinet several feet from the laundry closet could be the cause of my neighbor's complaint. But I've certainly heard of cases where water follows beams, pipes or wires and pops out in surprising places. And when I bumped into the neighbor a few days later, he said there had been no further leaks.

However...

A few days later, after taking a shower, my wife complained about a puddle on the bathroom floor, something we had never noticed in the past. Over the next few days, some -- but not all -- showers, resulted in a large puddle. These puddles started near the shower door and extended all the way to the laundry closet. Two things seemed obvious. First, water must be escaping around the edges of the shower door. This didn't surprise me, because the door was edged with a plastic gasket that didn't fit perfectly against the shower enclosure, and didn't extend all the way to the bottom of the door. Second, water from these puddles, extending into the laundry closet, might be the real cause of my neighbor's ceiling leak.

Again, I set out to replicate and isolate the problem, which was difficult because it didn't happen with every shower. First, I ran the shower unoccupied for about 10 minutes, to see if any water made its way out past the door. None did. I concluded someone needed to be inside the shower, with water splashing onto the door and door sill. By splashing water there, I was able to get a significant puddle to form.

But my test puddle was right next to the shower, whereas the puddles that had formed spontaneously were inexplicably a few inches away. I was forced to the conclusion that a leaky shower door was not the culprit. Perhaps a supply or drain pipe was at fault. I realized that in my previous tests, I had run a cold shower, so as not to waste hot water. Maybe there was a problem with the hot-water supply pipe. Sure enough, when I ran a hot shower for several minutes, a puddle appeared near the closet and spread toward the shower!

I figured there must be a leak in a hot-water supply pipe somewhere in the laundry closet. I was unable to move the 300-pound washer and dryer myself to inspect, and was again forced to call a plumber, a different one this time. When I described the problem on the phone, he diagnosed it without even seeing it! He agreed to come over, take a look, and hopefully fix it on the spot. His conclusion: Water was backing up from the shower drain and flowing out of a drain in the laundry closet floor. Snaking out the shower drain's U-trap would eliminate the backup and solve the problem. I was skeptical, but he arrived, reproduced the problem, snaked the shower drain, and verified that the problem was fixed. There have been no puddles ever since. "Aha" for real, at last!

So what does all this have to do with software engineering? It's a textbook example of many pitfalls that can mislead an engineer when trying to find the cause of an problem. Let's look at the situations and mental errors that confounded my plumbing investigation...

- It's easy -- but wrong -- to rationalize away things that don't quite fit the evidence. I was too ready to believe that a little water under a sink could travel several feet to the closet and soak through to the floor below. And I was also too ready to believe that a puddle near, but not adjacent to, the shower door was formed by water leaking around the door.

- Sometimes experts don't know best. The first plumber didn't correctly pinpoint the cause of the water leaking into my neighbor's apartment, or even the water dripping into the vanity. His proposed solution, at a cost of several hundred dollars, wouldn't have helped.

- Don't mislead the experts. Perhaps if I had simply told the first plumber water was leaking into the laundry closet downstairs -- and nothing else -- he would have investigated and found the clogged drain. However, it certainly didn't seem wise at the time to withhold the information I had discovered about water damage and a corroded plumbing fixture in the vanity.

- Sometimes experts DO know best. The second plumber knew that laundry closets have floor drains. He also knew that sometimes these drains are connected to the same waste pipe as another drain, in this case, the shower drain. That's just not something I, as a non-plumber, would ever have known. And I couldn't see under the 300-pound washer/dryer stack to find out for myself.

- The seemingly obvious cause of a problem might not be the real cause. The corroded faucet fitting seemed like an obvious source of leaks. So did the incomplete and poorly-fitting shower door gasket. But neither one was the real cause of the trouble,

- There can be more than one problem. There really was a small leak into the vanity cabinet, and it needed to be fixed, and was fixed. But that didn't solve the main problem, which was the result of a backed-up drain.

- It's easy to see patterns where none exist. I was convinced, after methodical testing, that the puddle formed only when running hot water in the shower. That wasn't true at all. Actually, the problem was sporadic: sometimes the shower drain backed up enough for water to flow out of the laundry closet drain, and sometimes it didn't.

I hope these tips will help you the next you need to find the cause of a software error. And I hope you don't experience any plumbing problems!

Friday, September 1, 2017

Using SQL date_trunc to group log records for troubleshooting

Did you ever notice a recurring error message in your log files, only to wonder if the problem has recently become more severe, or has actually been happening -- unnoticed -- for quite a while? If your log data is in SQL format, or any format you can query using SQL, a good technique is to group the records by time period to see if the error has become more prevalent over time.

Here's a real-world example. In a SQL table named log, which includes columns named timestamp (the time the record was written) and message (text giving details of what was logged), I stumbled across many records where message  had the value 'ConnectionError in application_cache'. I wanted to learn whether this error had suddenly become more frequent. This query gave me the answer:

select date_trunc('day', timestamp) as d, count(*) from log where message = 'ConnectionError in application_cache' group by d order by d

The date_trunc function takes a timestamp and "rounds it down" to the nearest specified unit -- the nearest day in my example. Instead of 'day', you can specify any unit from 'microseconds' all the way up to 'millennium'.

The results of my query, a count of the number of matching messages per day, looked like this:

That told me what I needed to know: This message happens many times per day, and the frequency hasn't increased recently.

My database is PostgreSQL 9.x, and the PostgreSQL date_trunc function is documented here. Whichever SQL database your using, a similar function is likely available, and may come in very handy.