Wednesday, September 13, 2017

A plumber's guide to software troubleshooting

The other day, my downstairs neighbor knocked on my door and delivered the alarming news that water was dripping from the ceiling of his laundry closet, that is, from my floor.

Bear with me. This story really does relate to software engineering...

I apologized to my neighbor and started investigating. In addition to the laundry closet occupied by a stacked washer and dryer, the bathroom includes a tub, a shower stall with a glass door, and a vanity with two sinks.

I found no visible sign of water in the laundry closet. I did notice some dampness and water damage inside the vanity cabinet under one sink. "Aha," I thought, "now it will be easy to find where the water is coming from!" Expecting to find a leaky supply or drain pipe, I ran the water, but there were no drips. Then it occurred to me to splash some water on the counter top, as might happen during shaving or face washing. Sure enough, a little water leaked into the cabinet. It appeared to be entering where the faucet passes through the counter. Looking more closely, I discovered the faucet's metal fitting inside the cabinet was corroded. "Aha!" I thought again.

I considered trying to repair it myself, but decided it was beyond my skill and called a plumber. When he arrived, I explained the situation with the neighbor, and pointed out the water damage and corrosion I had noticed. He ran the sink tap and replicated the problem to his satisfaction. He said the corroded part was beyond repair and I would need a whole new faucet set. I'd have to order it myself at a cost of several hundred dollars, plus more than a hundred dollars for a second visit by the plumber to install it.

Discouraged by the price, I decided to investigate further. It occurred to me to splash water onto the backsplash behind the counter. Sure enough, it leaked into the interior of the cabinet. "Aha!" I thought for the third time. This must be the real problem, and I could fix it myself with a simple tube of caulk. I caulked the gap between backsplash and wall, tested it by splashing lots of water up there, and found that the leak was gone. "Victory!" I assumed.

It did seem odd that a small amount of water in a cabinet several feet from the laundry closet could be the cause of my neighbor's complaint. But I've certainly heard of cases where water follows beams, pipes or wires and pops out in surprising places. And when I bumped into the neighbor a few days later, he said there had been no further leaks.

However...

A few days later, after taking a shower, my wife complained about a puddle on the bathroom floor, something we had never noticed in the past. Over the next few days, some -- but not all -- showers resulted in a large puddle. These puddles started near the shower door and extended all the way to the laundry closet. Two things seemed obvious. First, water must be escaping around the edges of the shower door. This didn't surprise me, because the door was edged with a plastic gasket that didn't fit perfectly against the shower enclosure, and didn't extend all the way to the bottom of the door. Second, water from these puddles, extending into the laundry closet, might be the real cause of my neighbor's ceiling leak.

Again, I set out to replicate and isolate the problem, which was difficult because it didn't happen with every shower. First, I ran the shower unoccupied for about 10 minutes, to see if any water made its way out past the door. None did. I concluded someone needed to be inside the shower, with water splashing onto the door and door sill. By splashing water there, I was able to get a significant puddle to form.

But my test puddle was right next to the shower, whereas the puddles that had formed spontaneously were inexplicably a few inches away. I was forced to the conclusion that a leaky shower door was not the culprit. Perhaps a supply or drain pipe was at fault. I realized that in my previous tests, I had run a cold shower, so as not to waste hot water. Maybe there was a problem with the hot-water supply pipe. Sure enough, when I ran a hot shower for several minutes, a puddle appeared near the closet and spread toward the shower!

I figured there must be a leak in a hot-water supply pipe somewhere in the laundry closet. I was unable to move the 300-pound washer and dryer myself to inspect, and was again forced to call a plumber, a different one this time. When I described the problem on the phone, he diagnosed it without even seeing it! He agreed to come over, take a look, and hopefully fix it on the spot. His conclusion: Water was backing up from the shower drain and flowing out of a drain in the laundry closet floor. Snaking out the shower drain's U-trap would eliminate the backup and solve the problem. I was skeptical, but he arrived, reproduced the problem, snaked the shower drain, and verified that the problem was fixed. There have been no puddles ever since. "Aha" for real, at last!

So what does all this have to do with software engineering? It's a textbook example of many pitfalls that can mislead an engineer when trying to find the cause of a problem. Let's look at the situations and mental errors that confounded my plumbing investigation...

- It's easy -- but wrong -- to rationalize away things that don't quite fit the evidence. I was too ready to believe that a little water under a sink could travel several feet to the closet and soak through to the floor below. And I was also too ready to believe that a puddle near, but not adjacent to, the shower door was formed by water leaking around the door.

- Sometimes experts don't know best. The first plumber didn't correctly pinpoint the cause of the water leaking into my neighbor's apartment, or even the water dripping into the vanity. His proposed solution, at a cost of several hundred dollars, wouldn't have helped.

- Don't mislead the experts. Perhaps if I had simply told the first plumber water was leaking into the laundry closet downstairs -- and nothing else -- he would have investigated and found the clogged drain. However, it certainly didn't seem wise at the time to withhold the information I had discovered about water damage and a corroded plumbing fixture in the vanity.

- Sometimes experts DO know best. The second plumber knew that laundry closets have floor drains. He also knew that sometimes these drains are connected to the same waste pipe as another drain, in this case, the shower drain. That's just not something I, as a non-plumber, would ever have known. And I couldn't see under the 300-pound washer/dryer stack to find out for myself.

- The seemingly obvious cause of a problem might not be the real cause. The corroded faucet fitting seemed like an obvious source of leaks. So did the incomplete and poorly-fitting shower door gasket. But neither one was the real cause of the trouble.

- There can be more than one problem. There really was a small leak into the vanity cabinet, and it needed to be fixed, and was fixed. But that didn't solve the main problem, which was the result of a backed-up drain.

- It's easy to see patterns where none exist. I was convinced, after methodical testing, that the puddle formed only when running hot water in the shower. That wasn't true at all. Actually, the problem was sporadic: sometimes the shower drain backed up enough for water to flow out of the laundry closet drain, and sometimes it didn't.

I hope these tips will help you the next time you need to find the cause of a software error. And I hope you don't experience any plumbing problems!

Friday, September 1, 2017

Using SQL date_trunc to group log records for troubleshooting

Did you ever notice a recurring error message in your log files, only to wonder if the problem has recently become more severe, or has actually been happening -- unnoticed -- for quite a while? If your log data is in SQL format, or any format you can query using SQL, a good technique is to group the records by time period to see if the error has become more prevalent over time.

Here's a real-world example. In a SQL table named log, which includes columns named timestamp (the time the record was written) and message (text giving details of what was logged), I stumbled across many records where message had the value 'ConnectionError in application_cache'. I wanted to learn whether this error had suddenly become more frequent. This query gave me the answer:

select date_trunc('day', timestamp) as d, count(*)
from log
where message = 'ConnectionError in application_cache'
group by d
order by d

The date_trunc function takes a timestamp and "rounds it down" to the nearest specified unit -- the nearest day in my example. Instead of 'day', you can specify any unit from 'microseconds' all the way up to 'millennium'.

The results of my query, a count of the number of matching messages per day, looked like this:

That told me what I needed to know: This message happens many times per day, and the frequency hasn't increased recently.
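
If you'd rather run this kind of check from a script than from a SQL console, here's a minimal Python 3 sketch using psycopg2. It assumes the log table and columns described above; the connection string is a placeholder.

# Minimal sketch: run the daily-count query from Python.
# The connection string is a placeholder -- substitute your own.
import psycopg2

QUERY = """
    select date_trunc('day', timestamp) as d, count(*)
    from log
    where message = %s
    group by d
    order by d
"""

def daily_counts(message_text):
    conn = psycopg2.connect("dbname=mydb user=me")   # placeholder DSN
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY, (message_text,))
            for day, count in cur.fetchall():
                print(day.date(), count)
    finally:
        conn.close()

daily_counts('ConnectionError in application_cache')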

My database is PostgreSQL 9.x, and the PostgreSQL date_trunc function is documented here. Whichever SQL database you're using, a similar function is likely available, and may come in very handy.

Tuesday, August 15, 2017

Using old commits in GitHub to troubleshoot -- and fixing a common SQL error

By this morning, evidence was mounting that something was wrong -- very wrong. Support reps were reporting that customers' files weren't getting processed promptly. They had already mentioned a few days ago that some video clips uploaded by customers weren't appearing when expected. And this morning, our real-time dashboard showed that no video files had been uploaded in the last 24 hours, when this number is typically in the hundreds.

I realized that all the affected features had something in common: they depend on a scheduled job that processes CSV files for every customer several times per day. A check of audit logs revealed that this job was writing more than 10,000 log messages per day through Aug 11, but fewer than 200 per day starting on Aug 12. Clearly, something was not right with the scheduling!

Our software engineers note all deployments in a Slack channel, which showed we did a deployment on Aug 11. But how to figure out exactly how and where the problem was introduced? That's where GitHub's ability to compare commits came in handy. Here's how I used it:

1. Log in to GitHub, navigate to the repo, and click commits.



2. Page through the commits until you find one dated shortly before the suspect deployment. I chose one from 5 days ago.



3. Copy the commit hash. Then use GitHub's compare route to compare that commit to master. This route requires two commit hashes or branch names, separated by three dots, e.g. https://github.com/<my account name>/<my repo name>/compare/23948e9f51d0b6c248193e450df0a7e2c3077037...master.


4. Click Files changed.


Now it's just a matter of scrolling through the diff to see which changes might be relevant to the problem. In this case, it was easy to spot: a SQLAlchemy query in the module that processes CSV files had been modified. With the problem isolated to a few lines of code, it was easy for me to figure out what I had done wrong when changing the query.

It turns out that, when I added a new condition to a SQL query, I made a common mistake. A comparison of any value to null doesn't evaluate to true or false -- it evaluates to null, which a where clause treats as not matching. That's the case even if the comparison uses the not equals operator! So this query:
select count(*) from config
where ftp_format_ex != 'add_photo_urls_to_inbound_feed'

was finding only 370 rows -- the ones where ftp_format_ex not only was not equal to 'add_photo_urls_to_inbound_feed', but was also not null.

This mistake is so easy to make that I also made it back in 2014, and blogged about it then!

The coalesce function -- in this case, used to convert nulls to the empty string -- gives the result I really wanted:
select count(*) from config
where coalesce(ftp_format_ex, '') != 'add_photo_urls_to_inbound_feed'

The above query finds 3,528 rows. In the module identified in GitHub, adding coalesce to the query caused all the expected CSV files to be processed and solved the problem.
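
The CSV-processing module itself isn't shown here, but the null behavior is easy to reproduce in isolation. Here's a minimal Python sketch using the standard library's sqlite3 module and a throwaway in-memory config table with made-up rows; the same semantics apply in PostgreSQL and other standard SQL databases.

# Reproduce the null-comparison pitfall with a throwaway in-memory table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table config (id integer primary key, ftp_format_ex text)")
conn.executemany(
    "insert into config (ftp_format_ex) values (?)",
    [
        ("add_photo_urls_to_inbound_feed",),  # should be excluded
        ("some_other_format",),               # should be counted
        (None,),                              # should be counted, but...
    ],
)

# Naive version: the null row silently drops out of the count.
naive = conn.execute(
    "select count(*) from config "
    "where ftp_format_ex != 'add_photo_urls_to_inbound_feed'"
).fetchone()[0]

# Fixed version: coalesce turns null into '' so the comparison can succeed.
fixed = conn.execute(
    "select count(*) from config "
    "where coalesce(ftp_format_ex, '') != 'add_photo_urls_to_inbound_feed'"
).fetchone()[0]

print(naive)  # 1
print(fixed)  # 2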


The moral of the story? Keep good logs, isolate problems by comparing older commits to the current commit in GitHub, and be careful of nulls in SQL.

Monday, July 3, 2017

Using CloudFront access logs to investigate unexpected traffic

Amazon's content delivery network (CDN), CloudFront, has many benefits when serving web content to a large global audience. A small, but important, one of those benefits is detailed logging, which I recently took advantage of to investigate some unusual web traffic.

It all started with a CloudWatch alarm. I had configured CloudWatch to email me if my CloudFront distribution received more than a specified number of requests per minute. I received such an email, clicked the link to view the alarm details in the CloudWatch console, and accessed a graph plotting requests per minute against time. Zooming out to a two-week scale, I immediately noticed sharp peaks occurring around 20:00 UTC most days.



How to find out where the excess traffic was coming from? CloudWatch and CloudFront offer a variety of graphs with resolutions as detailed as one minute. And CloudFront offers a report of top referrers, that is, websites from which the most requests to the CloudFront distribution originated. However, the shortest time unit the top referrers report covers is one day, not of much use for identifying the source of a traffic spike lasting only a few minutes.

The answer was to consult the CloudFront access logs, which, when enabled, are stored as text files on Amazon S3. Fortunately, logging was already turned on for my CloudFront distribution. If it's not, you can enable it in the AWS Console by selecting your distribution and clicking Edit on the General tab.






The log files can then be found in the specified bucket and folder in the S3 console. Focusing on the peak that occurred around 20:00 UTC on June 24, I typed in a prefix and sorted by Last modified to zero in on the relevant logs. (Tip: timestamps are in UTC on the CloudWatch graph and in the names of the log files on S3, but the last-modified time in the S3 console is in your local time zone.)



Each log is a gzip-compressed (.gz) file. Decompressing it yields a tab-delimited text file, which you can open in a text editor, view in your favorite spreadsheet program, or analyze by writing a script. Here are the first few lines of a typical log file:




Following two header rows, each row represents one request to a URL served from the CloudFront distribution. The relevant fields are time and cs(Referer). The referrer is the URL of the page from which the request originated. I wrote a Python script to read the log files and output a CSV file with one row per request, where each row consists of the time (truncated to the minute) and the domain name.
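
That script isn't reproduced verbatim here, but here's a Python 3 sketch of how such a script might look. It reads the downloaded .gz files, finds the column positions from each file's #Fields header line rather than hardcoding them, and writes one CSV row per request with the minute and the referring domain. The paths are placeholders.

# Sketch of a log-crunching script along the lines described above.
import csv
import glob
import gzip
from urllib.parse import urlparse

LOG_GLOB = "cloudfront-logs/*.gz"          # placeholder: downloaded log files
OUT_PATH = "requests_by_minute.csv"        # placeholder: output file

with open(OUT_PATH, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["minute", "domain"])
    for path in sorted(glob.glob(LOG_GLOB)):
        with gzip.open(path, "rt") as f:
            fields = []
            for line in f:
                if line.startswith("#Fields:"):
                    # e.g. "#Fields: date time x-edge-location ... cs(Referer) ..."
                    fields = line.split()[1:]
                    continue
                if line.startswith("#") or not fields:
                    continue
                record = dict(zip(fields, line.rstrip("\n").split("\t")))
                minute = record["time"][:5]                 # "HH:MM", UTC
                referrer = record.get("cs(Referer)", "-")
                domain = urlparse(referrer).netloc or "-"
                writer.writerow([minute, domain])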

It was then simple to sort the CSV file by minute and domain name. In this way, I found the domain that was responsible for the excessive traffic in the minutes shortly after 20:00 UTC on June 24. Armed with this knowledge, I was able to contact the owner of that domain and ask why their website was receiving such heavy use around 20:00 each day. (As of this writing, I'm waiting for their reply.)

If you prefer not to write a script, you might instead take advantage of Amazon Athena. Once you define an appropriate schema, Athena lets you query data on S3 without downloading it or writing code.

Many thanks to the engineers at the AWS Loft in New York for pointing me in the direction of CloudFront access logs. I hope you found this article informative and helpful.

Friday, June 9, 2017

Turning unstructured text into a SQL query

It sounds trivial, but querying a SQL database for records that match data provided in an unstructured format such as a text file can be a challenge. The goal is to construct a SQL query with a possibly lengthy in clause, without a lot of manual copying and pasting. Consider this example: a colleague supplies a list of customer names in the body of an email message. (The principle is the same if the list is in a text file, a PDF document, or any similar format.) Something like this:

Hey Steve,
Just got this list of customers from Marketing. Could you look up the date of most recent login for each of them? Of course it's an emergency, has to be done before tomorrow's big conference.
Thanks,
Lenny Lastminute
-------------------------------
Acme Anvils, Inc.
Big Bertha's Balloons, LLC
Crazy Cars & Co
Dynamite Dog Food Enterprises
Excellent Eggs, Inc.

Since the list has only five customers, it wouldn't be hard to copy and paste to come up with a query like this:
select name, last_login_date from customer where name in (
    'Acme Anvils, Inc.',
    'Big Bertha''s Balloons, LLC',
    'Crazy Cars & Co',
    'Dynamite Dog Food Enterprises',
    'Excellent Eggs, Inc.'
) order by name

But now let's imagine that the list has hundreds of entries, and the customer table has millions of rows, so we don't just want to retrieve all of them and manually identify the relevant ones. Fortunately, there are some tricks we can do in LibreOffice Calc (or Excel or your favorite spreadsheet program if you're not a Linux geek).

Start by pasting the list into the first column of a blank spreadsheet:

Now place a formula like this in cell B1:
="select name, last_login_date from customer where name in ('" & A1 & "'"

Yep, those are single-quotes (which SQL requires) inside double quotes (which LibreOffice Calc requires).

And place a formula like this in cell B2:
=B1 & ", '" & A2 & "'"

Copy the formula from cell B2 to all the remaining cells in column B. You'll end up with something like this:

Then we just need to close the parens, and perhaps throw in an order by clause and a semicolon, which we can do with a formula like this below the last cell in column B:
=B5 & ") order by name;"

That cell now contains our SQL query:
select name, last_login_date from customer where name in ('Acme Anvils, Inc.', 'Big Bertha's Balloons, LLC', 'Crazy Cars & Co', 'Dynamite Dog Food Enterprises', 'Excellent Eggs, Inc.') order by name;

But we're not quite done. If you have sharp eyes, you may have noticed the SQL syntax error. One of the customer names, Big Bertha's Balloons, LLC, contains an apostrophe, which is the same as a single quote, which is a reserved character in SQL. We can escape it by doubling it. To automate replacing ' with '', change the formula from =B1 & ", '" & A2 & "'" to:
=B1 & ", '" & SUBSTITUTE(A2, "'", "''") & "'"

This results in a valid SQL query, like so:
select name, last_login_date from customer where name in ('Acme Anvils, Inc.', 'Big Bertha''s Balloons, LLC', 'Crazy Cars & Co', 'Dynamite Dog Food Enterprises', 'Excellent Eggs, Inc.') order by name;
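
If you'd rather script this than build it up in a spreadsheet, the same trick is only a few lines of Python. This is just a sketch: it assumes the names have been saved to a plain text file, one per line (the file name is a placeholder), and it does the same quote-doubling as the SUBSTITUTE formula. If the query were being built and executed by application code, parameter binding would be the better tool; here the goal is simply text you can paste into a SQL console.

# Sketch: build the same in-clause query from a text file of names, one per line.
def build_query(names):
    quoted = ["'" + name.replace("'", "''") + "'" for name in names]
    return (
        "select name, last_login_date from customer where name in ("
        + ", ".join(quoted)
        + ") order by name;"
    )

with open("customers.txt") as f:                     # placeholder path
    names = [line.strip() for line in f if line.strip()]

print(build_query(names))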

I hope this saves you some time someday!

Wednesday, January 18, 2017

Installing Numpy on Ubuntu

I needed the numpy module in order to run one of my Python programs. I'm running Python 2.7.6 on Ubuntu 14.04.

I first tried pip install numpy. No luck. After a while, this error was displayed: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 72: ordinal not in range(128). The installation did not complete successfully.

Then I tried sudo apt-get install python-numpy. That did the trick. A tip of the hat to my colleague Collin Stocks for this advice.

A uWSGI Bug: A (Partial) Success Story About Logging and Testing

Users of SpinCar's mobile app recently started mentioning that they were seeing an error message. Not every user was affected, but it happened to enough users that it was clearly not an isolated problem. The error message was far from specific, indicating that something unexpected went wrong without giving any details.

The first place we looked to isolate the problem was MixPanel. We use this service to log events that occur in the app, including nonfatal errors. (We also use Fabric.io's Crashlytics, but that tends to be more useful for fatal errors that actually crash the app.) Fortunately, MixPanel did reveal details about at least one occurrence of the error. We could see that the app had made a particular call to our back-end API, and that the API responded with an HTTP status of 502 (bad gateway). Furthermore, the full details of the offending HTTP request had been logged.

Our good luck continued when we tried to replicate the problem. Simply by making the same HTTP request in a web browser, we provoked the same 502 response, using the production copy of the API. The next question was: Could we do so in the development environment, where we'd be able to troubleshoot and test potential fixes? The answer was yes! Running a local copy of the API and sending it the same request once again led to the 502 response.

In the development environment, we were able to see a detailed error message mentioning "invalid request block size." A Google search revealed that uwsgi, which we use to serve our API, limits the size of requests, and the default maximum is 4,096 bytes. We corrected the problem by launching uwsgi with the buffer-size parameter to set a larger maximum. The bug was resolved within a few hours of the initial report, with no need to submit a new version of the iOS app for Apple's approval.
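
For reference, the change itself is tiny. In an ini-style uWSGI configuration it looks something like this (8192 is just an illustrative value; pick a limit that comfortably exceeds your largest legitimate request):

[uwsgi]
# raise the maximum accepted request size from the 4,096-byte default
buffer-size = 8192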

That's a success story, and the success was driven largely by two factors. First, adequate logging enabled us to get details about the error, even though the message displayed to users was uninformative, and the users themselves were non-technical customer personnel at distant locations. Second, the development environment closely mimics the production environment. In this case, it was a critical factor that uwsgi was used -- and used with the same settings -- in dev as well as on prod.

But, of course, the success is only partial. Real success would be avoiding the bug in the first place. How did this error slip through the cracks despite the fact that our app and API undergo extensive code review and automated testing? The answer is that the bug only occurs with very large requests, which only occur when users have used the app to create a great many videos -- far more than are created in our automated tests. This bug could have been revealed by the type of testing known as load testing or stress testing, in which actions are performed repeatedly and large data sets are used. There are a couple of reasons we haven't implemented this type of testing yet. One is that the open-source automated testing tools we rely on tend to crash during long test runs. But the main reason is that we haven't yet found the time to develop load tests. It's definitely on the to-do list!