Tuesday, January 22, 2019

A subtle error when initializing multiple variables on one line in Python

From a code-style perspective, I don't often like to put multiple variable initializations on a single line, but once in a while it makes sense to do so:

foo = goo = None

Today, I ran into a subtle bug that can result from initializing multiple variables on one line, when the variables are of mutable types such as lists.

I was writing code to populate two lists, make some changes to the contents of each list, and then compare the two lists. Something like this:

list1 = list2 = []
list1.append(1)
list2.append(2)
print 'Are they equal?', list1 == list2  # True !!!

The output, to my surprise, showed that the two lists are equal. Upon changing the code to initialize each list on its own line:

list1 = []
list2 = []
list1.append(1)
list2.append(2)
print 'Are they equal?', list1 == list2  # False

... the lists are now shown to be unequal, as I expected. Here's why...

In a chained assignment like list1 = list2 = [], Python evaluates the right-hand expression [] exactly once, creating a single list object, and then binds both names to that same object. No copy is made. (Thanks to https://medium.com/broken-window/many-names-one-memory-address-122f78734cb6 for pointing me toward this explanation.)

Because lists are mutable, appending through either name changes the one shared list. In effect, list1 is an "alias" for list2, which leads to the unexpected result.
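
You can see the aliasing directly with the is operator -- and if you still want a one-line initialization, tuple unpacking creates two distinct lists:

list1 = list2 = []
print(list1 is list2)   # True: both names refer to the same list object

list1, list2 = [], []   # tuple unpacking evaluates two separate [] expressions
print(list1 is list2)   # False: two independent lists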

If you'd rather never have to think about this, then it's reasonable to just avoid initializing multiple variables on the same line.

Monday, September 24, 2018

A more granular view of DynamoDB throughput using CloudWatch

DynamoDB, Amazon Web Services' NoSQL database in the cloud, provides guaranteed levels of read and write throughput, measured in read capacity units (RCUs) and write capacity units (WCUs). The only catch is that you must specify the read and write capacity in advance. If the actual consumed capacity exceeds the provisioned capacity, throttling occurs, and read or write requests may be rejected.

You can set a CloudWatch alarm to alert you if consumed capacity comes close to exceeding provisioned capacity.

And it's easy to view the provisioned and consumed capacity, along with any throttled requests that may have occurred, on a table's Metrics tab in the AWS console. Below is a three-day graph for one of my tables, showing that some throttled reads occurred at 18:00 UTC on Sep 22.




By zooming in on the time range in question and setting the period to 1 Minute, I was able to get a more precise view of the read throughput at the time of the throttling.




But the console doesn't tell the whole story. In fact, it appears as if consumed capacity never exceeded provisioned capacity around 18:00 UTC. So why was there throttling?

To troubleshoot the cause of spikes in consumed capacity, I needed to view more granular metrics that are only available in CloudWatch. CloudWatch is accessible via the AWS API, command-line interface, or console. Here, we'll focus on using the console to see detailed CloudWatch metrics for a DynamoDB table.

Since I use multiple AWS services extensively, there are thousands of CloudWatch metrics. To find the relevant one, I started by filtering on my DynamoDB table's name in the search box.





Next, I clicked to expand DynamoDB -> Table Metrics.



From the list of table metrics, I selected the relevant one, ConsumedReadCapacityUnits.






Next, I set the time range to the period from a few minutes before to a few minutes after the throttling occurred. I clicked custom, selected Absolute, and entered the start and end dates and times.




The resulting graph still didn't reveal the information I needed. It made it look as if there was only about one read per minute. My table's provisioned throughput was more than 600 reads per second, so why the throttling?



My next step was to click the Graphed Metrics tab.



That allowed me to use dropdowns to set the Statistic to Sum (rather than the default, Average) and the Period to 1 Minute (rather than the default, 5 Minutes).


With these settings, I could finally see a short-lived, but huge, spike in consumed read capacity. This spike of more than 32,000 reads in one minute -- invisible in the 5-minute average -- was now clearly displayed.
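
If you'd rather pull the same numbers programmatically instead of through the console, here's a minimal sketch using boto3's CloudWatch client; the region, table name, and time range below are placeholders to adapt:

# Minimal sketch: fetch per-minute sums of consumed RCUs for one DynamoDB table.
# The region, table name, and time range are placeholders -- substitute your own.
from datetime import datetime

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

response = cloudwatch.get_metric_statistics(
    Namespace='AWS/DynamoDB',
    MetricName='ConsumedReadCapacityUnits',
    Dimensions=[{'Name': 'TableName', 'Value': 'my-table'}],
    StartTime=datetime(2018, 9, 22, 17, 45),
    EndTime=datetime(2018, 9, 22, 18, 15),
    Period=60,               # one-minute buckets
    Statistics=['Sum'],      # Sum, not the default Average
)

for point in sorted(response['Datapoints'], key=lambda p: p['Timestamp']):
    print('{} {}'.format(point['Timestamp'], point['Sum']))

Each datapoint's Sum is the total read capacity consumed during that minute -- the same view that finally exposed the spike.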

I hope this helps you if you're ever faced with isolating the nature and timing of an unexpected spike in read or write activity on one of your DynamoDB tables or indexes.

Friday, February 16, 2018

Copying an EC2 AMI between regions with boto 2

There are some good articles about copying an Amazon Machine Image (AMI) from one region to another, such as this one. It rightly states that copying can be accomplished using the console, command line tools, API or SDKs. I chose to use an SDK, specifically boto 2, but was unable to find clear instructions. I'm pleased to present a short Python script that shows how to do it.

The script is pretty basic -- there's no explicit AWS authentication (boto will need to find an access key ID and secret access key with appropriate permissions in your environment variables or boto config file), and no error handling. But it works.

Here's the whole script. The comments explain some of the more interesting points.

#! /usr/bin/python2

from boto import ec2
from datetime import datetime
import time

COPY_FROM_REGION = 'us-east-1'  # Region to copy from. Change this if you like.
COPY_TO_REGION = 'eu-west-1'  # Region to copy to. Change this if you like.
AMI_ID = 'ami-XXXXXXXX'  # Change this to the ID of the AMI you wish to copy. This AMI must already exist.

ec2_conn = ec2.connect_to_region(COPY_TO_REGION)  # Boto 2 EC2 connection in the destination region; CopyImage is issued there.

# Give the target AMI a unique name. Not truly necessary, but convenient. I chose seconds since epoch as a source
# of uniqueness. The name can really be anything you want.
timestamp = int((datetime.utcnow() - datetime(1970, 1, 1)).total_seconds())
ami_name = 'eu-copy-of-{}-{}'.format(AMI_ID, timestamp)
print 'copying AMI to {}'.format(ami_name)
ec2_conn.copy_image(COPY_FROM_REGION, AMI_ID, name=ami_name)  # Initiate copying.

# Now we wait...
while True:
    # The image won't even show up in the target region for a while. Wait until it exists.
    images = ec2_conn.get_all_images(filters={
        'is-public': 'false',
        'name': ami_name
    })
    if images:
        image = images[0]  # Now the image exists in the target region.
        print image.id, image.state
        # Now wait for the image to be in the "available" state. This could take a few minutes, especially if it's big.
        while True:
            images = ec2_conn.get_all_images(filters={
                'is-public': 'false',
                'name': ami_name,
                'state': 'available'
            })
            if images:
                print 'available'
                break
            print 'not available'
            time.sleep(10)
        break
    print 'not found'
    time.sleep(10)


One final tip: I wasn't able to find documentation of the properties of the image objects returned by get_all_images. But Python offers an easy solution: just print out image.__dict__.
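
For example, a couple of lines like these, dropped inside the loop right after image is assigned, will list every attribute boto sets on the object:

for key, value in sorted(image.__dict__.items()):
    print('{}: {}'.format(key, value))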

Happy copying!

Wednesday, September 13, 2017

A plumber's guide to software troubleshooting

The other day, my downstairs neighbor knocked on my door and delivered the alarming news that water was dripping from the ceiling of his laundry closet, that is, from my floor.

Bear with me. This story really does relate to software engineering...

I apologized to my neighbor and started investigating. In addition to the laundry closet occupied by a stacked washer and dryer, the bathroom includes a tub, a shower stall with a glass door, and a vanity with two sinks.

I found no visible sign of water in the laundry closet. I did notice some dampness and water damage inside the vanity cabinet under one sink. "Aha," I thought, "now it will be easy to find where the water is coming from!" Expecting to find a leaky supply or drain pipe, I ran the water, but there were no drips. Then it occurred to me to splash some water on the counter top, as might happen during shaving or face washing. Sure enough, a little water leaked into the cabinet. It appeared to be entering where the faucet passes through the counter. Looking more closely, I discovered the faucet's metal fitting inside the cabinet was corroded. "Aha!" I thought again.

I considered trying to repair it myself, but decided it was beyond my skill and called a plumber. When he arrived, I explained the situation with the neighbor, and pointed out the water damage and corrosion I had noticed. He ran the sink tap and replicated the problem to his satisfaction. He said the corroded part was beyond repair and I would need a whole new faucet set. I'd have to order it myself at a cost of several hundred dollars, plus more than a hundred dollars for a second visit by the plumber to install it.

Discouraged by the price, I decided to investigate further. It occurred to me to splash water onto the backsplash behind the counter. Sure enough, it leaked into the interior of the cabinet. "Aha!" I thought for the third time. This must be the real problem, and I could fix it myself with a simple tube of caulk. I caulked the gap between backsplash and wall, tested it by splashing lots of water up there, and found that the leak was gone. "Victory!" I assumed.

It did seem odd that a small amount of water in a cabinet several feet from the laundry closet could be the cause of my neighbor's complaint. But I've certainly heard of cases where water follows beams, pipes or wires and pops out in surprising places. And when I bumped into the neighbor a few days later, he said there had been no further leaks.

However...

A few days later, after taking a shower, my wife complained about a puddle on the bathroom floor, something we had never noticed in the past. Over the next few days, some -- but not all -- showers resulted in a large puddle. These puddles started near the shower door and extended all the way to the laundry closet. Two things seemed obvious. First, water must be escaping around the edges of the shower door. This didn't surprise me, because the door was edged with a plastic gasket that didn't fit perfectly against the shower enclosure, and didn't extend all the way to the bottom of the door. Second, water from these puddles, extending into the laundry closet, might be the real cause of my neighbor's ceiling leak.

Again, I set out to replicate and isolate the problem, which was difficult because it didn't happen with every shower. First, I ran the shower unoccupied for about 10 minutes, to see if any water made its way out past the door. None did. I concluded someone needed to be inside the shower, with water splashing onto the door and door sill. By splashing water there, I was able to get a significant puddle to form.

But my test puddle was right next to the shower, whereas the puddles that had formed spontaneously were inexplicably a few inches away. I was forced to the conclusion that a leaky shower door was not the culprit. Perhaps a supply or drain pipe was at fault. I realized that in my previous tests, I had run a cold shower, so as not to waste hot water. Maybe there was a problem with the hot-water supply pipe. Sure enough, when I ran a hot shower for several minutes, a puddle appeared near the closet and spread toward the shower!

I figured there must be a leak in a hot-water supply pipe somewhere in the laundry closet. I was unable to move the 300-pound washer and dryer myself to inspect, and was again forced to call a plumber, a different one this time. When I described the problem on the phone, he diagnosed it without even seeing it! He agreed to come over, take a look, and hopefully fix it on the spot. His conclusion: Water was backing up from the shower drain and flowing out of a drain in the laundry closet floor. Snaking out the shower drain's U-trap would eliminate the backup and solve the problem. I was skeptical, but he arrived, reproduced the problem, snaked the shower drain, and verified that the problem was fixed. There have been no puddles ever since. "Aha" for real, at last!

So what does all this have to do with software engineering? It's a textbook example of many pitfalls that can mislead an engineer trying to find the cause of a problem. Let's look at the situations and mental errors that confounded my plumbing investigation...

- It's easy -- but wrong -- to rationalize away things that don't quite fit the evidence. I was too ready to believe that a little water under a sink could travel several feet to the closet and soak through to the floor below. And I was also too ready to believe that a puddle near, but not adjacent to, the shower door was formed by water leaking around the door.

- Sometimes experts don't know best. The first plumber didn't correctly pinpoint the cause of the water leaking into my neighbor's apartment, or even the water dripping into the vanity. His proposed solution, at a cost of several hundred dollars, wouldn't have helped.

- Don't mislead the experts. Perhaps if I had simply told the first plumber water was leaking into the laundry closet downstairs -- and nothing else -- he would have investigated and found the clogged drain. However, it certainly didn't seem wise at the time to withhold the information I had discovered about water damage and a corroded plumbing fixture in the vanity.

- Sometimes experts DO know best. The second plumber knew that laundry closets have floor drains. He also knew that sometimes these drains are connected to the same waste pipe as another drain, in this case, the shower drain. That's just not something I, as a non-plumber, would ever have known. And I couldn't see under the 300-pound washer/dryer stack to find out for myself.

- The seemingly obvious cause of a problem might not be the real cause. The corroded faucet fitting seemed like an obvious source of leaks. So did the incomplete and poorly-fitting shower door gasket. But neither one was the real cause of the trouble.

- There can be more than one problem. There really was a small leak into the vanity cabinet, and it needed to be fixed, and was fixed. But that didn't solve the main problem, which was the result of a backed-up drain.

- It's easy to see patterns where none exist. I was convinced, after methodical testing, that the puddle formed only when running hot water in the shower. That wasn't true at all. Actually, the problem was sporadic: sometimes the shower drain backed up enough for water to flow out of the laundry closet drain, and sometimes it didn't.

I hope these tips will help you the next time you need to find the cause of a software error. And I hope you don't experience any plumbing problems!

Friday, September 1, 2017

Using SQL date_trunc to group log records for troubleshooting

Did you ever notice a recurring error message in your log files, only to wonder if the problem has recently become more severe, or has actually been happening -- unnoticed -- for quite a while? If your log data lives in a SQL database, or in any format you can query using SQL, a good technique is to group the records by time period to see if the error has become more prevalent over time.

Here's a real-world example. In a SQL table named log, which includes columns named timestamp (the time the record was written) and message (text giving details of what was logged), I stumbled across many records where message had the value 'ConnectionError in application_cache'. I wanted to learn whether this error had suddenly become more frequent. This query gave me the answer:

select date_trunc('day', timestamp) as d, count(*) from log where message = 'ConnectionError in application_cache' group by d order by d

The date_trunc function takes a timestamp and "rounds it down" to the nearest specified unit -- the nearest day in my example. Instead of 'day', you can specify any unit from 'microseconds' all the way up to 'millennium'.

The results of my query, a count of the number of matching messages per day, looked like this:

That told me what I needed to know: This message happens many times per day, and the frequency hasn't increased recently.

My database is PostgreSQL 9.x, and the PostgreSQL date_trunc function is documented here. Whichever SQL database you're using, a similar function is likely available, and may come in very handy.
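
If you'd rather run this kind of check from a script instead of a SQL prompt, here's a minimal sketch using psycopg2 against PostgreSQL; the connection string is a placeholder, and the table and column names match the example above:

# Minimal sketch: run the grouped query from Python. Connection details are placeholders.
import psycopg2

conn = psycopg2.connect('dbname=mydb user=me host=localhost')
cur = conn.cursor()
cur.execute("""
    select date_trunc('day', timestamp) as d, count(*)
    from log
    where message = %s
    group by d
    order by d
""", ('ConnectionError in application_cache',))
for day, count in cur.fetchall():
    print('{} {}'.format(day, count))
cur.close()
conn.close()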

Tuesday, August 15, 2017

Using old commits in GitHub to troubleshoot -- and fixing a common SQL error

By this morning, evidence was mounting that something was wrong -- very wrong. Support reps were reporting that customers' files weren't getting processed promptly. They had already mentioned a few days ago that some video clips uploaded by customers weren't appearing when expected. And this morning, our real-time dashboard showed that no video files had been uploaded in the last 24 hours, when this number is typically in the hundreds.

I realized that all the affected features had something in common: they depend on a scheduled job that processes CSV files for every customer several times per day. A check of audit logs revealed that this job was writing more than 10,000 log messages per day through Aug 11, but fewer than 200 per day starting on Aug 12. Clearly, something was not right with the scheduling!

Our software engineers note all deployments in a Slack channel, which showed we did a deployment on Aug 11. But how to figure out exactly how and where the problem was introduced? That's where GitHub's ability to compare commits came in handy. Here's how I used it:

1. Log in to GitHub, navigate to the repo, and click commits.



2. Page through the commits until you find one dated shortly before the suspect deployment. I chose one from 5 days ago.



3. Copy the commit hash. Then use GitHub's compare route to compare that commit to master. This route requires two commit hashes or branch names, separated by three dots, e.g. https://github.com/<my account name>/<my repo name>/compare/23948e9f51d0b6c248193e450df0a7e2c3077037...master.


4. Click Files changed.


Now it's just a matter of scrolling through the diff to see which changes might be relevant to the problem. In this case, it was easy to spot: a SQLAlchemy query in the module that processes CSV files had been modified. With the problem isolated to a few lines of code, it was easy for me to figure out what I had done wrong when changing the query.

It turns out that, when I added a new condition to a SQL query, I made a common mistake. In SQL, comparing any value to null never evaluates to true -- the result is null (unknown), so the row is filtered out. That's the case even when the comparison uses the not-equals operator! So this query:
select count(*) from config
where ftp_format_ex != 'add_photo_urls_to_inbound_feed'

was finding only 370 rows -- the ones where ftp_format_ex not only was not equal to 'add_photo_urls_to_inbound_feed', but was also not null.

This mistake is so easy to make that I also made it back in 2014, and blogged about it then!

The coalesce function -- used here to convert nulls to the empty string before comparing -- gives the result I really wanted:
select count(*) from config
where coalesce(ftp_format_ex, '') != 'add_photo_urls_to_inbound_feed'

The above query finds 3,528 rows. In the module identified in GitHub, adding coalesce to the query caused all the expected CSV files to be processed and solved the problem.
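
Since the real fix was in a SQLAlchemy query, here's a self-contained sketch -- SQLite in memory, a made-up config model, and three fake rows (not our real schema or data) -- that reproduces the bug and the coalesce fix. It assumes SQLAlchemy 1.4 or later:

from sqlalchemy import Column, Integer, String, create_engine, func
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Config(Base):
    __tablename__ = 'config'
    id = Column(Integer, primary_key=True)
    ftp_format_ex = Column(String, nullable=True)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add_all([
    Config(ftp_format_ex='add_photo_urls_to_inbound_feed'),
    Config(ftp_format_ex='some_other_format'),
    Config(ftp_format_ex=None),   # the row the buggy condition silently drops
])
session.commit()

# Buggy condition: null != 'x' evaluates to null (unknown), so the null row is excluded.
buggy = session.query(Config).filter(
    Config.ftp_format_ex != 'add_photo_urls_to_inbound_feed').count()

# Fixed condition: coalesce nulls to '' before comparing.
fixed = session.query(Config).filter(
    func.coalesce(Config.ftp_format_ex, '') != 'add_photo_urls_to_inbound_feed').count()

print('without coalesce: {}, with coalesce: {}'.format(buggy, fixed))   # 1 vs. 2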


The moral of the story? Keep good logs, isolate problems by comparing older commits to the current commit in GitHub, and be careful of nulls in SQL.

Monday, July 3, 2017

Using CloudFront access logs to investigate unexpected traffic

Amazon's content delivery network (CDN), CloudFront, has many benefits when serving web content to a large global audience. A small, but important, one of those benefits is detailed logging, which I recently took advantage of to investigate some unusual web traffic.

It all started with a CloudWatch alarm. I had configured CloudWatch to email me if my CloudFront distribution received more than a specified number of requests per minute. I received such an email, clicked the link to view the alarm details in the CloudWatch console, and accessed a graph plotting requests per minute against time. Zooming out to a two-week scale, I immediately noticed sharp peaks occurring around 20:00 UTC most days.



How to find out where the excess traffic was coming from? CloudWatch and CloudFront offer a variety of graphs with resolutions as detailed as one minute. And CloudFront offers a report of top referrers, that is, websites from which the most requests to the CloudFront distribution originated. However, the shortest time unit the top referrers report covers is one day, not of much use for identifying the source of a traffic spike lasting only a few minutes.

The answer was to consult the CloudFront access logs, which, when enabled, are stored as text files on Amazon S3. Fortunately, logging was already turned on for my CloudFront distribution. If it's not, you can enable it in the AWS Console by selecting your distribution and clicking Edit on the General tab.






The log files can then be found in the specified bucket and folder in the S3 console. Focusing on the peak that occurred around 20:00 UTC on June 24, I typed in a prefix and sorted by Last modified to zero in on the relevant logs. (Tip: timestamps are in UTC on the CloudWatch graph and in the names of the log files on S3, but the last-modified time in the S3 console is in your local time zone.)



Each log is a zipped (.gz) file. Unzipping it yields a tab-delimited text file, which you can open in a text editor, view in your favorite spreadsheet program, or analyze by writing a script. Here are the first few lines of a typical log file:




Following two header rows, each row represents one request to a URL served from the CloudFront distribution. The relevant fields are time and cs(Referer). The referrer is the URL of the page from which the request originated. I wrote a Python script to read the log files and output a CSV file with one row per request, where each row consists of the time (truncated to the minute) and the referring domain name.
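
If you want to roll your own, here's a minimal sketch of that kind of script (Python 3), assuming the .gz log files have already been downloaded to a local logs/ directory; the output filename is arbitrary:

import csv
import glob
import gzip
from urllib.parse import urlparse

rows = []
for path in glob.glob('logs/*.gz'):
    with gzip.open(path, 'rt') as f:
        fields = []
        for line in f:
            line = line.rstrip('\n')
            if line.startswith('#Fields:'):
                # This header names the tab-separated columns in the data rows that follow.
                fields = line[len('#Fields:'):].split()
                continue
            if line.startswith('#') or not fields:
                continue
            record = dict(zip(fields, line.split('\t')))
            minute = '{} {}'.format(record['date'], record['time'][:5])   # e.g. '2017-06-24 20:03'
            domain = urlparse(record.get('cs(Referer)', '-')).netloc or '-'
            rows.append((minute, domain))

with open('referrers_by_minute.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow(['minute', 'referring domain'])
    writer.writerows(rows)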

It was then simple to sort the CSV file by minute and domain name. In this way, I found the domain that was responsible for the excessive traffic in the minutes shortly after 20:00 UTC on June 24. Armed with this knowledge, I was able to contact the owner of that domain and ask why their website was receiving such heavy use around 20:00 each day. (As of this writing, I'm waiting for their reply.)

If you prefer not to write a script, you might instead take advantage of Amazon Athena. Once you define an appropriate schema, Athena lets you query data on S3 without downloading it or writing code.

Many thanks to the engineers at the AWS Loft in New York for pointing me in the direction of CloudFront access logs. I hope you found this article informative and helpful.