Monday, September 24, 2018

A more granular view of DynamoDB throughput using CloudWatch

DynamoDB, Amazon Web Services' NoSQL database in the cloud, provides guaranteed levels of read and write throughput, measured in read capacity units (RCUs) and write capacity units (WCUs). The only catch is that you must specify the read and write capacity in advance. If consumed capacity exceeds provisioned capacity, throttling occurs, and read or write requests may be rejected.
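
For reference, a table's provisioned capacity can also be read programmatically. Here's a minimal sketch using boto3 (the rest of this post sticks to the console); the region and table name are placeholders:

import boto3

dynamodb = boto3.client('dynamodb', region_name='us-east-1')  # placeholder region

# 'my-table' is a placeholder; substitute your own table name.
table = dynamodb.describe_table(TableName='my-table')['Table']
throughput = table['ProvisionedThroughput']
print('RCUs: {}, WCUs: {}'.format(throughput['ReadCapacityUnits'],
                                  throughput['WriteCapacityUnits']))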

You can set a CloudWatch alarm to alert you if consumed capacity comes close to exceeding provisioned capacity.
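
For example, here's a rough sketch of such an alarm using boto3. The alarm name, table name, and SNS topic ARN are placeholders; the threshold assumes 600 provisioned RCUs and uses the Sum statistic, so it works out to roughly 80% of 600 RCUs sustained over a 5-minute period:

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')  # placeholder region

cloudwatch.put_metric_alarm(
    AlarmName='my-table-high-read-consumption',                    # placeholder name
    Namespace='AWS/DynamoDB',
    MetricName='ConsumedReadCapacityUnits',
    Dimensions=[{'Name': 'TableName', 'Value': 'my-table'}],       # placeholder table
    Statistic='Sum',
    Period=300,                                                    # 5-minute periods
    EvaluationPeriods=1,
    Threshold=0.8 * 600 * 300,                                     # 80% of 600 RCUs over 5 minutes
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:my-alerts']  # placeholder SNS topic
)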

And it's easy to view the provisioned and consumed capacity, along with any throttled requests that may have occurred, on a table's Metrics tab in the AWS console. Below is a three-day graph for one of my tables, showing that some throttled reads occurred at 18:00 UTC on Sep 22.

By zooming in on the time range in question and setting the period to 1 Minute, I was able to get a more precise view of the read throughput at the time of the throttling.

But the console doesn't tell the whole story. In fact, it appears as if consumed capacity never exceeded provisioned capacity around 18:00 UTC. So why was there throttling?

To troubleshoot the cause of spikes in consumed capacity, I needed to view more granular metrics that are only available in CloudWatch. CloudWatch is accessible via the AWS API, command-line interface, or console. Here, we'll focus on using the console to see detailed CloudWatch metrics for a DynamoDB table.

Since I use multiple AWS services extensively, there are thousands of CloudWatch metrics in my account. To find the relevant one, I started by filtering on my DynamoDB table's name in the search box.
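
As an aside, the same filtering can be done programmatically. Here's a minimal sketch using boto3, with placeholder region and table name:

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')  # placeholder region

# List the DynamoDB metrics that carry a TableName dimension of 'my-table' (a placeholder).
paginator = cloudwatch.get_paginator('list_metrics')
for page in paginator.paginate(Namespace='AWS/DynamoDB',
                               Dimensions=[{'Name': 'TableName', 'Value': 'my-table'}]):
    for metric in page['Metrics']:
        print(metric['MetricName'])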

Next, I clicked to expand DynamoDB -> Table Metrics.

From the list of table metrics, I selected the relevant one, ConsumedReadCapacityUnits.

Next, I set the time range to the period from a few minutes before to a few minutes after the throttling occurred. I clicked custom, selected Absolute, and entered the start and end dates and times.

The resulting graph still didn't reveal the information I needed: it appeared to show only about one read per minute. My table's provisioned throughput was more than 600 reads per second, so why the throttling?

My next step was to click the Graphed Metrics tab.

That allowed me to use dropdowns to set the Statistic to Sum (rather than the default, Average) and the Period to 1 Minute (rather than the default, 5 Minutes).

With these settings, I could finally see the culprit: a short-lived but huge spike in consumed read capacity -- more than 32,000 reads in one minute -- that was invisible in the 5-minute average.
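
The same per-minute sums can also be pulled from CloudWatch programmatically. Here's a rough sketch using boto3; the region and table name are placeholders, and the time window brackets the spike described above (times in UTC):

import boto3
from datetime import datetime

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')  # placeholder region

response = cloudwatch.get_metric_statistics(
    Namespace='AWS/DynamoDB',
    MetricName='ConsumedReadCapacityUnits',
    Dimensions=[{'Name': 'TableName', 'Value': 'my-table'}],  # placeholder table
    StartTime=datetime(2018, 9, 22, 17, 50),                  # a few minutes before the spike
    EndTime=datetime(2018, 9, 22, 18, 10),                    # a few minutes after the spike
    Period=60,                                                # 1-minute periods
    Statistics=['Sum']                                        # Sum, not the default Average
)
for point in sorted(response['Datapoints'], key=lambda p: p['Timestamp']):
    print('{} {}'.format(point['Timestamp'], point['Sum']))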

I hope this helps you if you're ever faced with isolating the nature and timing of an unexpected spike in read or write activity on one of your DynamoDB tables or indexes.

Friday, February 16, 2018

Copying an EC2 AMI between regions with boto 2

There are some good articles about copying an Amazon Machine Image (AMI) from one region to another, such as this one. It rightly states that copying can be accomplished using the console, command-line tools, the API, or SDKs. I chose to use an SDK, specifically boto 2, but was unable to find clear instructions. I'm pleased to present a short Python script that shows how to do it.

The script is pretty basic -- no error handling, and no explicit AWS credential handling (so you'll need to supply an AWS access key ID and secret access key for an account with appropriate permissions). But it works.
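
If you don't already have credentials configured (for example, in the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables or a ~/.boto config file), one option is to pass them to connect_to_region directly. A minimal sketch, with placeholder key values:

from boto import ec2

# Placeholder credentials -- substitute keys for an account that is allowed to copy AMIs.
ec2_conn = ec2.connect_to_region(
    'eu-west-1',
    aws_access_key_id='AKIAXXXXXXXXXXXXXXXX',
    aws_secret_access_key='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
)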

Here's the whole script. The comments explain some of the more interesting points.

#! /usr/bin/python2

from boto import ec2
from datetime import datetime
import time

COPY_FROM_REGION = 'us-east-1'  # Region to copy from. Change this if you like.
COPY_TO_REGION = 'eu-west-1'  # Region to copy to. Change this if you like.
AMI_ID = 'ami-XXXXXXXX'  # Change this to the ID of the AMI you wish to copy. This AMI must already exist.

ec2_conn = ec2.connect_to_region(COPY_TO_REGION)  # Boto 2 interface to EC2.

# Give the target AMI a unique name. Not truly necessary, but convenient. I chose seconds since epoch as a source
# of uniqueness. The name can really be anything you want.
timestamp = int((datetime.utcnow() - datetime(1970, 1, 1)).total_seconds())
ami_name = 'eu-copy-of-{}-{}'.format(AMI_ID, timestamp)
print 'copying AMI to {}'.format(ami_name)
ec2_conn.copy_image(COPY_FROM_REGION, AMI_ID, name=ami_name)  # Initiate copying.

# Now we wait...
while True:
    # The image won't even show up in the target region for a while. Wait until it exists.
    images = ec2_conn.get_all_images(filters={
        'is-public': 'false',
        'name': ami_name
    })
    if images:
        image = images[0]  # Now the image exists in the target region.
        print image.id, image.state
        # Now wait for the image to be in the "available" state. This could take a few minutes, especially if it's big.
        while True:
            images = ec2_conn.get_all_images(filters={
                'is-public': 'false',
                'name': ami_name,
                'state': 'available'
            })
            if images:
                print 'available'
                break
            print 'not available'
            time.sleep(10)
        break
    print 'not found'
    time.sleep(10)


One final tip: I wasn't able to find documentation of the properties of the image objects returned by get_all_images. But Python offers an easy solution: just print out image.__dict__.
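
For example, continuing from the script above (ec2_conn and ami_name are assumed to still be in scope):

images = ec2_conn.get_all_images(filters={'is-public': 'false', 'name': ami_name})
if images:
    # Dump every attribute boto 2 populated on the Image object.
    print images[0].__dict__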

Happy copying!