Monday, July 3, 2017

Using CloudFront access logs to investigate unexpected traffic

Amazon's content delivery network (CDN), CloudFront, has many benefits when serving web content to a large global audience. A small, but important, one of those benefits is detailed logging, which I recently took advantage of to investigate some unusual web traffic.

It all started with a CloudWatch alarm. I had configured CloudWatch to email me if my CloudFront distribution received more than a specified number of requests per minute. I received such an email, clicked the link to view the alarm details in the CloudWatch console, and accessed a graph plotting requests per minute against time. Zooming out to a two-week scale, I immediately noticed sharp peaks occurring around 20:00 UTC most days.



How could I find out where the excess traffic was coming from? CloudWatch and CloudFront offer a variety of graphs with resolutions as fine as one minute. CloudFront also offers a report of top referrers, that is, the websites from which the most requests to the CloudFront distribution originated. However, the shortest period the top referrers report covers is one day, which is of little use for identifying the source of a traffic spike lasting only a few minutes.

The answer was to consult the CloudFront access logs, which, when enabled, are stored as text files on Amazon S3. Fortunately, logging was already turned on for my CloudFront distribution. If it's not, you can enable it in the AWS Console by selecting your distribution and clicking Edit on the General tab.






The log files can then be found in the specified bucket and folder in the S3 console. Focusing on the peak that occurred around 20:00 UTC on June 24, I typed in a prefix and sorted by Last modified to zero in on the relevant logs. (Tip: timestamps are in UTC on the CloudWatch graph and in the names of the log files on S3, but the last-modified time in the S3 console is in your local time zone.)
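If you would rather compute the prefix than type it by hand, here is a minimal sketch. It assumes the standard CloudFront log file naming, in which each object key is the optional folder followed by the distribution ID, the UTC date and hour, a unique ID, and .gz. The function name log_key_prefix and the folder and distribution ID in the usage note are placeholders of my own, not values from my setup.

```python
from datetime import datetime, timezone


def log_key_prefix(folder, distribution_id, hour_utc):
    """Build the S3 key prefix shared by all log files for one UTC hour.

    Assumes keys look like <folder>/<distribution-id>.YYYY-MM-DD-HH.<unique-id>.gz
    """
    folder = folder.strip("/")
    name = f"{distribution_id}.{hour_utc:%Y-%m-%d-%H}"
    return f"{folder}/{name}" if folder else name


def to_local(moment_utc):
    """Convert an aware UTC datetime to the machine's local time zone,
    which is what the S3 console's Last modified column displays."""
    return moment_utc.astimezone()
```

For example, log_key_prefix("cf-logs", "EXAMPLEID", datetime(2017, 6, 24, 20, tzinfo=timezone.utc)) gives "cf-logs/EXAMPLEID.2017-06-24-20". Because log delivery can lag, it may be worth checking the files for the following hour as well.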



Each log is a gzip-compressed (.gz) file. Decompressing it yields a tab-delimited text file, which you can open in a text editor, view in your favorite spreadsheet program, or analyze by writing a script. Here are the first few lines of a typical log file:




Following two header rows (a version line and a line listing the field names), each row represents one request to a URL served from the CloudFront distribution. The relevant fields are date, time, and cs(Referer), which is spelled with a single "r" in the logs, matching the HTTP header name. The referrer is the URL of the page from which the request originated. I wrote a Python script to read the log files and output a CSV file with one row per request, where each row consists of the time (truncated to the minute) and the referring domain name.
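The following is a minimal sketch of such a script, not the exact one I ran. It assumes the log has a #Fields: header line naming the columns, including date, time, and cs(Referer) as the field is spelled in the logs, and that a missing referrer is recorded as a hyphen. The function names parse_log_lines and logs_to_csv are my own choices for illustration.

```python
import csv
import gzip
from urllib.parse import urlsplit


def parse_log_lines(lines):
    """Yield (minute, referring_domain) for each request row in a log."""
    fields = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # The header names the columns, separated by spaces.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or fields is None:
            continue  # skip the version line and anything before the header
        row = dict(zip(fields, line.split("\t")))
        # Truncate HH:MM:SS to HH:MM so requests group by minute.
        minute = f"{row['date']} {row['time'][:5]}"
        # hostname is None when the referrer is "-" or not a full URL.
        domain = urlsplit(row.get("cs(Referer)", "-")).hostname or "-"
        yield minute, domain


def logs_to_csv(log_paths, out_file):
    """Write one CSV row per request from the given .gz log files."""
    writer = csv.writer(out_file)
    writer.writerow(["minute", "referer_domain"])
    for path in log_paths:
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            writer.writerows(parse_log_lines(handle))
```

Reading the column positions from the header, rather than hard-coding them, keeps the sketch working if the set of logged fields differs from what I assumed.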

It was then simple to sort the CSV file by minute and domain name. In this way, I found the domain that was responsible for the excessive traffic in the minutes shortly after 20:00 UTC on June 24. Armed with this knowledge, I was able to contact the owner of that domain and ask why their website was receiving such heavy use around 20:00 each day. (As of this writing, I'm waiting for their reply.)
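Instead of sorting by hand, the tallying can also be scripted. This is a rough sketch, assuming each row is a (minute, domain) pair with the minute formatted as "YYYY-MM-DD HH:MM", so that plain string comparison orders the minutes correctly. The function name busiest_referers and the sample values in the usage note are hypothetical.

```python
from collections import Counter


def busiest_referers(rows, start, end, limit=5):
    """Count requests per referring domain within [start, end], inclusive.

    rows:  iterable of (minute, domain) pairs
    start: first minute to include, e.g. "2017-06-24 20:00"
    end:   last minute to include, e.g. "2017-06-24 20:10"
    """
    counts = Counter(
        domain for minute, domain in rows if start <= minute <= end
    )
    # most_common returns (domain, count) pairs, largest count first.
    return counts.most_common(limit)
```

Calling busiest_referers(rows, "2017-06-24 20:00", "2017-06-24 20:10") returns the domains that sent the most requests in those minutes, which is usually enough to make the culprit stand out.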

If you prefer not to write a script, you might instead take advantage of Amazon Athena. Once you define an appropriate schema, Athena lets you query data on S3 without downloading it or writing code.

Many thanks to the engineers at the AWS Loft in New York for pointing me in the direction of CloudFront access logs. I hope you found this article informative and helpful.