Tuesday, August 15, 2017

Using old commits in GitHub to troubleshoot -- and fixing a common SQL error

By this morning, evidence was mounting that something was wrong -- very wrong. Support reps were reporting that customers' files weren't getting processed promptly. A few days earlier, they had already mentioned that some video clips uploaded by customers weren't appearing when expected. And this morning, our real-time dashboard showed that no video files had been uploaded in the last 24 hours, a count that typically runs in the hundreds.

I realized that all the affected features had something in common: they depend on a scheduled job that processes CSV files for every customer several times per day. A check of audit logs revealed that this job was writing more than 10,000 log messages per day through Aug 11, but fewer than 200 per day starting on Aug 12. Clearly, something was not right with that scheduled job!
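
For the record, the check itself is nothing fancy -- just a count of audit-log messages per day. Here's a rough sketch of one way to run it with SQLAlchemy (which the codebase already uses); the audit_log table, its columns, the job name, and the connection URL are all stand-ins rather than our real schema:

# Rough sketch of the daily-count check. Table, column, and job names are
# stand-ins, not the real schema.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost/mydb")  # placeholder URL

with engine.connect() as conn:
    rows = conn.execute(text("""
        select date(created_at) as day, count(*) as messages
        from audit_log
        where job_name = 'process_customer_csv'
        group by 1
        order by 1 desc
        limit 14
    """))
    for day, messages in rows:
        print(day, messages)  # the drop from >10,000/day to <200/day stands out immediately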

Our software engineers note every deployment in a Slack channel, and that channel showed a deployment on Aug 11. But how to figure out exactly how and where the problem was introduced? That's where GitHub's ability to compare commits came in handy. Here's how I used it:

1. Log in to GitHub, navigate to the repo, and click Commits.



2. Page through the commits until you find one dated shortly before the suspect deployment. I chose one from 5 days ago.



3. Copy the commit hash. Then use GitHub's compare route to compare that commit to master. This route requires two commit hashes or branch names, separated by three dots, e.g. https://github.com/<my account name>/<my repo name>/compare/23948e9f51d0b6c248193e450df0a7e2c3077037...master. (If you'd rather script this comparison, see the sketch just after these steps.)


4. Click Files changed.


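By the way, if you'd rather script step 3 than click through the web UI, GitHub exposes the same comparison through its REST API. A rough sketch in Python -- the account and repo names are placeholders, and a private repo needs an auth token:

# Rough sketch of the same comparison via GitHub's REST API.
import requests

owner = "my-account-name"   # placeholder
repo = "my-repo-name"       # placeholder
base = "23948e9f51d0b6c248193e450df0a7e2c3077037"

url = "https://api.github.com/repos/{}/{}/compare/{}...master".format(owner, repo, base)
response = requests.get(url)  # add an Authorization header if the repo is private
response.raise_for_status()

for changed in response.json()["files"]:
    print(changed["status"], changed["filename"])

Either way, you end up with the same list of changed files you'd see on the Files changed tab.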
Now it's just a matter of scrolling through the diff to see which changes might be relevant to the problem. In this case, the culprit was easy to spot: a SQLAlchemy query in the module that processes CSV files had been modified. With the problem isolated to a few lines of code, I could quickly see what I had done wrong when changing the query.

It turns out that, when I added a new condition to a SQL query, I made a common mistake. Comparing any value to null doesn't yield true or false -- it yields null (unknown), and a WHERE clause only keeps rows whose condition is true. That holds even when the comparison uses the not-equals operator! So this query:
select count(*) from config
where ftp_format_ex != 'add_photo_urls_to_inbound_feed'

was finding only 370 rows -- the ones where ftp_format_ex was not null and also not equal to 'add_photo_urls_to_inbound_feed'.

This mistake is so easy to make that I also made it back in 2014, and blogged about it then!

The coalesce function -- in this case, converting nulls to the empty string -- gives the result I really wanted:
select count(*) from config
where coalesce(ftp_format_ex, '') != 'add_photo_urls_to_inbound_feed'

The above query finds 3,528 rows. In the module identified through the GitHub diff, adding coalesce to the query got all the expected CSV files processing again and solved the problem.
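
For the curious, here's roughly what the fix looks like at the SQLAlchemy level. The Config model, its columns, and the connection URL below are simplified stand-ins for the real module -- the filter expressions are the part that matters:

# Simplified sketch of the buggy and fixed SQLAlchemy filters. The Config
# model and connection URL are stand-ins for the real module.
from sqlalchemy import Column, Integer, String, create_engine, func
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Config(Base):
    __tablename__ = "config"
    id = Column(Integer, primary_key=True)
    ftp_format_ex = Column(String, nullable=True)  # null for most customers

engine = create_engine("postgresql://user:password@localhost/mydb")  # placeholder URL
session = sessionmaker(bind=engine)()

# Buggy filter: when ftp_format_ex is null, the comparison evaluates to null
# (unknown), so those rows silently drop out of the result.
buggy = session.query(Config).filter(
    Config.ftp_format_ex != 'add_photo_urls_to_inbound_feed'
)

# Fixed filter: coalesce null to '' first, so those rows are kept as intended.
fixed = session.query(Config).filter(
    func.coalesce(Config.ftp_format_ex, '') != 'add_photo_urls_to_inbound_feed'
)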


The moral of the story? Keep good logs, isolate problems by comparing older commits to the current commit in GitHub, and be careful of nulls in SQL.