Monday, September 24, 2018

A more granular view of DynamoDB throughput using CloudWatch

DynamoDB, Amazon Web Services' NoSQL database in the cloud, provides guaranteed levels of read and write throughput, measured in read capacity units (RCUs) and write capacity units (WCUs). The only catch is that you must specify the read and write capacity in advance. If the actual consumed capacity exceeds the provisioned capacity, throttling occurs, and read or write requests may be rejected.
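For context on how those units translate to traffic, here's a quick sketch of the standard RCU arithmetic: one RCU covers one strongly consistent read per second of an item up to 4 KB, and an eventually consistent read costs half. The item size and request rate below are made up for illustration.

```python
import math

def read_capacity_units(item_size_bytes, reads_per_second, consistent=True):
    """Estimate the RCUs a steady read workload needs.

    One RCU = one strongly consistent read per second of up to 4 KB;
    larger items round up to the next 4 KB; eventually consistent
    reads cost half as much.
    """
    units_per_read = math.ceil(item_size_bytes / 4096)  # round up to 4 KB chunks
    rcus = units_per_read * reads_per_second
    return rcus if consistent else math.ceil(rcus / 2)

# A 6 KB item read 100 times per second, strongly consistent:
# ceil(6144 / 4096) = 2 RCUs per read, so 200 RCUs in total.
print(read_capacity_units(6144, 100))         # 200
print(read_capacity_units(6144, 100, False))  # 100 (eventually consistent)
```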

You can set a CloudWatch alarm to alert you if consumed capacity comes close to exceeding provisioned capacity.
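As a sketch of what such an alarm looks like outside the console, the helper below builds the keyword arguments for CloudWatch's put_metric_alarm API. The table name, the 80% warning fraction, and the threshold arithmetic are my assumptions for illustration, not values from this post.

```python
def throttle_alarm_params(table_name, provisioned_rcus,
                          period_seconds=60, warn_fraction=0.8):
    """Build put_metric_alarm keyword arguments for a read-capacity alarm.

    ConsumedReadCapacityUnits is summed over the alarm period, so the
    threshold is provisioned RCUs * period length * warning fraction.
    The warning fraction of 0.8 is an illustrative choice.
    """
    return {
        "AlarmName": f"{table_name}-consumed-read-capacity",
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ConsumedReadCapacityUnits",
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": provisioned_rcus * period_seconds * warn_fraction,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }

# With boto3, these would be passed straight through, e.g.:
#   boto3.client("cloudwatch").put_metric_alarm(**throttle_alarm_params("my-table", 600))
params = throttle_alarm_params("my-table", 600)
print(params["Threshold"])  # 600 RCUs * 60 s * 0.8 = 28800.0
```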

And it's easy to view the provisioned and consumed capacity, along with any throttled requests that may have occurred, on a table's Metrics tab in the AWS console. Below is a three-day graph for one of my tables, showing that some throttled reads occurred at 18:00 UTC on Sep 22.

By zooming in on the time range in question and setting the period to 1 Minute, I was able to get a more precise view of the read throughput at the time of the throttling.

But the console doesn't tell the whole story. In fact, it appeared as if consumed capacity never exceeded provisioned capacity around 18:00 UTC. So why was there throttling?

To troubleshoot the cause of spikes in consumed capacity, I needed to view more granular metrics that are only available in CloudWatch. CloudWatch is accessible via the AWS API, command-line interface, or console. Here, we'll focus on using the console to see detailed CloudWatch metrics for a DynamoDB table.
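For readers who prefer the API or CLI, here is a minimal sketch of the equivalent query as keyword arguments for boto3's get_metric_statistics call: the Sum statistic with a 1-minute period, scoped to one table. The table name and time window are placeholders.

```python
from datetime import datetime, timedelta, timezone

def consumed_rcu_query(table_name, start, end, period=60):
    """Request parameters for CloudWatch's GetMetricStatistics API.

    Suitable for boto3.client("cloudwatch").get_metric_statistics(**params).
    """
    return {
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ConsumedReadCapacityUnits",
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "StartTime": start,
        "EndTime": end,
        "Period": period,       # in seconds; 60 = the console's "1 Minute"
        "Statistics": ["Sum"],  # total units consumed per period
    }

# Placeholder window around the time of the throttling:
end = datetime(2018, 9, 22, 18, 10, tzinfo=timezone.utc)
params = consumed_rcu_query("my-table", end - timedelta(minutes=20), end)
print(params["MetricName"])  # ConsumedReadCapacityUnits
```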

Since I use multiple AWS services extensively, there are thousands of CloudWatch metrics. To find the relevant one, I started by filtering on my DynamoDB table's name in the search box.

Next, I clicked to expand DynamoDB -> Table Metrics.

From the list of table metrics, I selected the relevant one, ConsumedReadCapacityUnits.

Next, I set the time range to span from a few minutes before to a few minutes after the throttling occurred. I clicked Custom, selected Absolute, and entered the start and end dates and times.

The resulting graph still didn't reveal the information I needed: it appeared to show only about one read per minute. My table's provisioned throughput was more than 600 reads per second, so why the throttling?

My next step was to click the Graphed Metrics tab.

That allowed me to use dropdowns to set the Statistic to Sum (rather than the default, Average) and the Period to 1 Minute (rather than the default, 5 Minutes).


With these settings, I could finally see a short-lived, but huge, spike in consumed read capacity. This spike of more than 32,000 reads in one minute -- invisible in the 5-minute average -- was now clearly displayed.
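A toy calculation shows why the Sum/1-minute view exposes what the Average/5-minute view hides. Each read request reports roughly one consumed-capacity unit as a metric sample, so the Average statistic hovers near 1 no matter how many requests arrive; the numbers below are illustrative, not my actual data.

```python
# Pretend each read request reports about 1.0 consumed read capacity
# units as a CloudWatch metric sample (illustrative simplification).
samples_in_spike_minute = [1.0] * 32000  # a burst of 32,000 reads
samples_in_quiet_minute = [1.0] * 60     # a calm minute, 60 reads

# The Average statistic is ~1 for both minutes, so the console's
# default view cannot distinguish a burst from a trickle.
avg_spike = sum(samples_in_spike_minute) / len(samples_in_spike_minute)
avg_quiet = sum(samples_in_quiet_minute) / len(samples_in_quiet_minute)
print(avg_spike, avg_quiet)  # 1.0 1.0

# The Sum statistic over a 1-minute period reveals the burst.
print(sum(samples_in_spike_minute))  # 32000.0
print(sum(samples_in_quiet_minute))  # 60.0
```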

I hope this helps you if you're ever faced with isolating the nature and timing of an unexpected spike in read or write activity on one of your DynamoDB tables or indexes.