ProgBlog

Monday, July 3, 2017

Using CloudFront access logs to investigate unexpected traffic

Amazon's content delivery network (CDN), CloudFront, has many benefits when serving web content to a large global audience. A small, but important, one of those benefits is detailed logging, which I recently took advantage of to investigate some unusual web traffic.

It all started with a CloudWatch alarm. I had configured CloudWatch to email me if my CloudFront distribution received more than a specified number of requests per minutes. I received such an email, clicked the link to view the alarm details in the CloudWatch console, and accessed a graph plotting requests per minute against time. Zooming out to a two-week scale, I immediately noticed sharp peaks occurring around 20:00 UTC most days.

How to find out where the excess traffic was coming from? CloudWatch and CloudFront offer a variety of graphs with resolutions as detailed as one minute. And CloudFront offers a report of top referrers, that is, websites from which the most requests to the CloudFront distribution originated. However, the shortest time unit the top referrers report covers is one day, not of much use for identifying the source of a traffic spike lasting only a few minutes.

The answer was to consult the CloudFront access logs, which, when enabled, are stored as text files on Amazon S3. Fortunately, logging was already turned on for my CloudFront distribution. If it's not, you can enable it in the AWS Console by selecting your distribution and clicking Edit on the General tab.

The log files can then be found in the specified bucket and folder in the S3 console. Focusing on the peak that occurred around 20:00 UTC on June 24, I typed in a prefix and sorted by Last modified to zero in on the relevant logs. (Tip: timestamps are in UTC on the CloudWatch graph and in the names of the log files on S3, but the last-modified time in the S3 console is in your local time zone.)

Each log is a zipped (.gz) file. Unzipping it yields a tab-delimited text file, which you can open in a text editor, view in your favorite spreadsheet program, or analyze by writing a script. Here are the first few lines of a typical log file:

Following two header rows, each row represents one request to a URL served from the CloudFront distribution. The relevant fields are time and cs(Referrer). The referrer is the URL of the page from which the event originated. I wrote a Python script to read the log files and output a CSV file with one row per request, where each row consists of the time (truncated to the minute) and the domain name.

It was then simple to sort the CSV file by minute and domain name. In this way, I found the domain that was responsible for the excessive traffic in the minutes shortly after 20:00 UTC on June 24. Armed with this knowledge, I was able to contact the owner of that domain and ask why their website was receiving such heavy use around 20:00 each day. (As of this writing, I'm waiting for their reply.)

If you prefer not to write a script, you might instead take advantage of Amazon Athena. Once you define an appropriate schema, Athena lets you query data on S3 without downloading it or writing code.

Many thanks to the engineers at the AWS Loft in New York for pointing me in the direction of CloudFront access logs. I hope you found this article informative and helpful.

Friday, June 9, 2017

Turning unstructured text into a SQL query

It sounds trivial, but querying a SQL database for records that match data provided in an unstructured format such as a text file can be a challenge. The goal is to construct a SQL query with a possible lengthy in clause, without a lot of manual copying and pasting. Consider this example: a colleague supplies a list of customer names in the body of an email message. (The principle is the same if the list is in a text file, a PDF document, or any similar format.) Something like this:

Hey Steve,
Just got this list of customers from Marketing. Could you look up the date of most recent login for each of them? Of course it's an emergency, has to be done before tomorrow's big conference.
Thanks,
Lenny Lastminute
-------------------------------
Acme Anvils, Inc.
Big Bertha's Balloons, LLC
Crazy Cars & Co
Dynamite Dog Food Enterprises
Excellent Eggs, Inc.

Since the list has only five customers, it wouldn't be hard to copy and paste to come up with a query like this:
select name, last_login_date from customer where name in (
'Acme Anvils, Inc.',
'Big Bertha''s Balloons, LLC',
'Crazy Cars & Co',
'Dynamite Dog Food Enterprises',
'Excellent Eggs, Inc.'
) order by name

But now let's imagine that the list has hundreds of entries, and the customer table has millions of rows, so we don't just want to retrieve all of them and manually identify the relevant ones. Fortunately, there are some tricks we can do in LibreOffice Calc (or Excel or your favorite spreadsheet program if you're not a Linux geek).

Start by pasting the list into the first column of a blank spreadsheet:

Now place a formula like this in cell B1:
="select name, last_login_date from customer where name in ('" & A1 & "'"

Yep, those are single-quotes (which SQL requires) inside double quotes (which LibreOffice Calc requires).

And place a formula like this in cell B2:
=B1 & ", '" & A2 & "'"

Copy the formula from cell B2 to all the remaining cells in column B. You'll end up with something like this:

Then we just need to close the parens, and perhaps throw in an order by clause and a semicolon, which we can do with a formula like this below the last cell in column B:
=B5 & ") order by name;"

That cell now contains our SQL query:
select name, last_login_date from customer where name in ('Acme Anvils, Inc.', 'Big Bertha's Balloons, LLC', 'Crazy Cars & Co', 'Dynamite Dog Food Enterprises', 'Excellent Eggs, Inc.') order by name;

But we're not quite done. If you have sharp eyes, you may have noticed the SQL syntax error. One of the customer names, Big Bertha's Balloons, LLC, contains an apostrophe, which is the same as a single quote, which is a reserved character in SQL. We can escape it by doubling it. To automate replacing ' with '', change the formula from =B1 & ", '" & A2 & "'" to:

=B1 & ", '" & SUBSTITUTE(A2, "'", "''") & "'"

This results in a valid SQL query, like so:

select name, last_login_date from customer where name in ('Acme Anvils, Inc.', 'Big Bertha''s Balloons, LLC', 'Crazy Cars & Co', 'Dynamite Dog Food Enterprises', 'Excellent Eggs, Inc.') order by name;

I hope this saves you some time someday!

Wednesday, January 18, 2017

Installing Numpy on Ubuntu

I needed the numpy module in order to run one of my Python programs. I'm running Python 2.7.6 on Ubuntu 14.04.

I first tried pip install numpy. No luck. After a while, this error was displayed: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 72: ordinal not in range(128). The installation did not complete successfully.

Then I tried sudo apt-get install python-numpy. That did the trick. A tip of the hat to my colleague Collin Stocks for this advice.

A uWSGI Bug: A (Partial) Success Story About Logging and Testing

Users of SpinCar's mobile app recently started mentioning that they were seeing an error message. Not every user was affected, but it happened to enough users that it was clearly not an isolated problem. The error message was far from specific, indicating that something unexpected went wrong without giving any details.

The first place we looked to isolate the problem was MixPanel. We use this service to log events that occur in the app, including nonfatal errors. (We also use Fabric.io's Crashlytics, but that tends to be more useful for fatal errors that actually crash the app.) Fortunately, MixPanel did reveal details about at least one occurrence of the error. We could see that the app had made a particular call to the our back-end API, and that the API responded with an HTTP status of 502 (bad gateway). Furthermore, the full details of the offending HTTP request had been logged.

Our good luck in continued when we tried to replicate the problem. Simply by making the same HTTP request in a web browser, we provoked the same 502 response, using the production copy of the API. The next question was: Could we do so in the development environment, where we'd be able to troubleshoot and test potential fixes? The answer was yes! Running a local copy of the API and sending it the same request once again led to the 502 response.

In the development environment, we were able to see a detailed error message mentioning "invalid request block size." A Google search revealed that uwsgi, which use to serve our API, limits the size of requests, and the default maximum is 4,096 bytes. We corrected the problem by launching uwsgi with the buffer-size parameter to set a larger maximum. The bug was resolved within a few hours of the initial report, with no need to submit a new version of the iOS app for Apple's approval.

That's a success story, and the success was driven largely by two factors. First, adequate logging enabled us to get details about the error, even though the message displayed to users was uninformative, and the users themselves were non-technical customer personnel at distant locations. Second, the development environment closely mimics the production environment. In this case, it was a critical factor that uwsgi was used -- and used with the same settings -- in dev as well as on prod.

But, of course, the success is only partial. Real success would be avoiding the bug in the first place. How did this error slip through the cracks despite the fact that our app and API undergo extensive code review and automated testing? The answer is that the bug only occurs with very large requests, which only occur when users have used the app to create a great many videos -- far more than are created in our automated tests. This bug could have been revealed by the type of testing known as load testing or stress testing, in which actions are performed repeatedly and large data sets are used. There are a couple of reasons we haven't implemented this type of testing yet. One is that the open-source automated testing tools we rely on tend to crash during long test runs. But the main reason is that we haven't yet found the time to develop load tests. It's definitely on the to-do list!

Saturday, December 24, 2016

Amazon LightSail

Amazon Web Services recently announced LightSail, a low-cost service for quickly launching a virtual server with limited options. Fewer steps and less technical knowledge are required to get a simple website up and running than with traditional AWS services like EC2. LightSail seems intended to compete with low-cost virtual hosting services such as GoDaddy's.

Upon selecting LightSail from the AWS management console, it opens in a new tab with its own subdomain, lightsail.aws.amazon.com. This isn't true of most other AWS services, and may imply that Amazon views LightSail in a separate category.

At work, where scalability and reliability are key, I have no plans to use LightSail. However, I did have a chance to put it to the test for a personal project, a new website for my band, The Showoffs. Users with no technology background might not find it easy to get started with LightSail. But if you're familiar with the basics of DNS and virtual hosting, LightSail offers a quick and affordable setup. In about an hour, I went from nothing -- no domain registered, no server, and no content -- to a functioning WordPress site.

Amazon provides documentation here. For my project, the articles on getting started, static IP addresses, and using WordPress with LightSail were particularly relevant. (On the other hand, this DNS article led me astray. It turned out not to be pertinent; all I had to do was create an A record in Route 53.)

The main steps to create theshowoffsband.com were as follows:

In Route53, registered the domain name. This required replying to a verification email.
In LightSail:

Created an instance. I chose the WordPress 4.6.1 instance image and the $5/month instance plan.
Created a static IP address.
Created a DNS zone, but I think this step was unnecessary.
Connected to the server using SSH. A button in the LightSail opens a browser-based SSH connection with a singe click (screenshot below).
Obtained the WordPress credentials by running an Amazon-provided script.

Returned to Route53 and created an A record mapping theshowoffsband.com to the static IP address. I then waited several minutes for the DNS record to propagate.
Visited my new website's WordPress admin page, logged in, and started adding content.

The cost? $12 per year for the domain registration, and $5 per month for the virtual server. If you show up at the Showoffs' next gig, it will have been worth it!

SSH connection to the LightSail virtual server with a single click

Sunday, April 3, 2016

Automated testing of localhost URLs with Selenium, Pytest and Sauce Labs

Selenium WebDriver is a great way to create automated tests of web applications. It can simulate a user's interactions with the a web page and examine the results.

Although it's not required to use a test framework to organize and execute your Selenium WebDriver tests, a good test framework does make this task more convenient. As a Python programmer, I'm a fan of Pytest.

Selenium WebDriver supports a wide variety of web browsers, but is limited to those browsers installed on the computer where WebDriver runs. That's where a service like Sauce Labs comes in. Sauce Labs runs Selenium WebDriver on virtual machines in the cloud, offering many combinations of device, operating system and browser.

So the combination of Selenium WebDriver, Pytest and and Sauce Labs enables testing of a web application across many platforms. There's just one catch. While our web application is under development, we likely wish to run it on our local machine, with a localhost URL. A Sauce Labs virtual machine can't access the local machine to run such tests -- unless we install Sauce Connect. Sauce Connect is free software from Sauce Labs that creates a secure tunnel between Sauce Labs' servers and our machine.

What follows is a step-by-step guide to creating and running tests using Selenium WebDriver, Pytest, Sauce Labs and Sauce Connect. This guide refers to Python 2.7 running on Ubuntu 15.04, but the same principles apply to any programming language and operating system.

Let's start by creating a web app to test. There are many ways to do this. Our example uses a "hello world" Python Flask app, based on the sample from the Flask Website. Here's the code, which saved as hello.py:

# Source of hello.py

from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run(port=8080)

Once we run

python hello.py

we can visit http://localhost:8080 in our browser and see a web page that displays "Hello World!" Now we can write a test to verify that the web application is working as expected. Selenium WebDriver supports almost any programming language and web browser with Selenium. This example uses Python and Firefox. Let's make a directory named tests and create a file in it named test_hello.py:

# Source of test_hello.py

from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://localhost:8080/')
body_text = driver.find_element_by_css_selector('body').text
assert body_text == 'Hello World!'
driver.quit()

As long as Firefox is installed and hello.py is still running, we can execute

cd tests
python test_hello.py

Selenium WebDriver will open Firefox, navigate to http://localhost:8080, and verify that the text "Hello World!" is displayed.

Let's introduce Pytest to make the test a little more convenient to run. Pytest's discovery feature automatically finds tests based on naming conventions, such as files and functions whose names start with test_. We'll make a small change to test_hello.py, placing everything inside a function named test_1:

# Source of test_hello.py

def test_1():
    from selenium import webdriver
    driver = webdriver.Firefox()
    driver.get('http://localhost:8080/')
    body_text = driver.find_element_by_css_selector('body').text
    assert body_text == 'Hello World!'
    driver.quit()

Now, as long as Pytest is installed, we can simply run

py.test

Our test will run, and we'll get a handy summary of the results:

This test ran on Firefox. Suppose we'd like to test our web application on an operating system and browser that aren't installed locally, such as Windows 8.1 with Internet Explorer 11. For such remote testing with Selenium WebDriver running on a virtual machine, we'll need a Sauce Labs account. We'll need to supply our Sauce Labs username and an access key in order to connect to a Sauce Labs virtual server. We start by changing test_hello.py, replacing driver = webdriver.Firefox() with several lines of code that connect to a remote WebDriver.

# Source of test_hello.py
# Using a remote WebDriver

import os


def test_1():
    from selenium import webdriver
    desired_capabilities = {
        'platform': "Windows 8.1",
        'browserName': "internet explorer",
        'version': "11.0",
        'screenResolution': '1280x1024'
    }
    sauce_url = 'http://%s:%s@ondemand.saucelabs.com:80/wd/hub' %\
                (os.environ['SAUCE_USERNAME'], os.environ['SAUCE_ACCESS_KEY'])
    driver = webdriver.Remote(
        desired_capabilities=desired_capabilities,
        command_executor=sauce_url
    )
    driver.get('http://localhost:8080/')
    body_text = driver.find_element_by_css_selector('body').text
    assert body_text == 'Hello World!'
    driver.quit()

The code above assumes the Sauce Labs username and access key are stored in environment variables named SAUCE_USERNAME and SAUCE_ACCESS_KEY, respectively.

Once again, run the test with

py.test

The test runs, but doesn't pass:

Why? Because the Selenium WebDriver instance running on a Sauce Labs server can't access http://localhost:8080, which is running on our local machine. The error message provides a hint how to fix this: install Sauce Connect, which we can download here. This example uses Sauce Connect v4.3.14 for Linux. Versions for other operating systems are also available.

Installation on Linux is as simple as unzipping the downloaded file and ensuring that the resulting bin/sc file is on the path and has execute permission. Don't forget that, as above, our Sauce Labs credentials must be stored in the environment variables SAUCE_USERNAME and SAUCE_ACCESS_KEY.

Prior to running the test again, we run sc, which creates a secure tunnel between Sauce Labs' servers and our local machine. The tunnel remains available until we terminate sc (e.g by pressing Ctrl+c). Leaving sc running (either in the background or in a separate terminal window), we run py.test once more. This time, our test should pass.

At this point, we have everything we need to run remote tests of localhost URLs. However, there are some best practices we can implement to make such tests more convenient and reliable. Sauce Labs recommends that a tunnel be established and terminated for each test run, rather than leaving the tunnel open indefinitely. We'll accomplish this in two steps. First, we'll pass command-line arguments and add a few more lines of code to run sc as as daemon (that is, in the background). Second, we'll use a Pytest fixture to ensure the tunnel is ready before any test runs.

The full list of Sauce Connect command-line arguments is available here. The ones we want are:

-t: Controls which domains are accessed via the tunnel. For improved performance, we'll tell sc to use the tunnel for localhost URLs only.
--daemonize: Runs sc as a daemon.
--pidfile: The name of a file to which sc will write its process ID (so we know which process to kill when we're done).
--readyfile: The name of a file that sc will touch to indicate when the tunnel is ready.
--tunnel-identifier: Gives our tunnel a name so we can refer to it.

We could place the code to launch sc at the top of function test_1() in test_hello.py, and the code to terminate the tunnel at the bottom of the same function. However, this has a drawback. Presumably, we plan to write many test functions, not just one. We need to tunnel to be created before the first test runs, and terminated after the last test finishes -- regardless of the order in which the tests might be executed. Pytest provides a powerful feature known as fixtures that can accomplish this.

We could create a module-scoped fixture that applies to every test function in our test_hello.py file. But we can do even better. Let's create a session-scoped fixture. By defining our fixture in a specially-named file, conftest,py, and using a decorator to give the fixture session scope, the fixture will automatically apply to every test function in every test file we choose to create. The comments in the source code below provide more details.

# Source of conftest.py

import os
import pytest
import signal
import subprocess
import time


@pytest.fixture(scope="session")  # Session scope makes the fixture apply to any test function in any test file
def tunnel(request):
    sc_pid_file_name = '/tmp/sc_pid.txt'  # File where sc will store its PID
    sc_ready_file_name = '/tmp/sc_ready.txt'  # File sc will touch when the tunnel is ready
    sc_pid = None

    def fin():  # Function that executes when the last test using the fixture goes out of scope
        if sc_pid:
            os.kill(int(sc_pid), signal.SIGTERM)  # Kill sc's process, terminating the tunnel

    try:
        os.remove(sc_ready_file_name)
    except OSError:
        pass

    # Sauce Connect reads credentials from environment variables SAUCE_USERNAME and SAUCE_ACCESS_KEY
    subprocess.call([
        'sc',
        '-t', 'localhost',                            # Use tunnel for localhost URLs only
        '--readyfile', sc_ready_file_name,            # Name of the "ready" file
        '--tunnel-identifier', 'my_tunnel',           # Name of the tunnel
        '--daemonize', '--pidfile', sc_pid_file_name  # Run as daemon; store PID in specified file
    ])

    with open(sc_pid_file_name) as sc_pid_file:
        sc_pid = sc_pid_file.read()  # Read the PID
    request.addfinalizer(fin)  # Register the finalizer function

    # Wait for the tunnel to be ready
    start_time = time.time()
    while True:
        if os.path.exists(sc_ready_file_name):
            break
        if time.time() - start_time > 30:
            raise Exception('Timed out waiting for Sauce Connect')

Now we just need to make two small changes to test_hello.py, passing the name of the fixture function as an argument to the test function, and including the tunnel identifier in the desired_capabilities dictionary that controls Selenium WebDriver's behavior.

# Source of test_hello.py
# Using a remote WebDriver

# Using a Pytest fixture

import os


def test_1(tunnel):
    from selenium import webdriver
    desired_capabilities = {
        'platform': "Windows 8.1",
        'browserName': "internet explorer",
        'version': "11.0",
        'screenResolution': '1280x1024',
        'tunnelIdentifier': 'my_tunnel'
    }
    sauce_url = 'http://%s:%s@ondemand.saucelabs.com:80/wd/hub' %\
                (os.environ['SAUCE_USERNAME'], os.environ['SAUCE_ACCESS_KEY'])
    driver = webdriver.Remote(
        desired_capabilities=desired_capabilities,
        command_executor=sauce_url
    )
    driver.get('http://localhost:8080/')
    body_text = driver.find_element_by_css_selector('body').text
    assert body_text == 'Hello World!'
    driver.quit()

That's it! We now have a framework for running any number of test functions, with a tunnel automatically created before the first test, and terminated after the last test.

Sunday, July 19, 2015

Python dictionary comprehensions

Python programmers may be familiar with list comprehensions, a compact syntax for defining a list. Here's a typical example that creates a list containing the squares of the first five positive integers:

print [n*n for n in range(1,6)]
[1, 4, 9, 16, 25]

But suppose we want to create a dictionary, rather than just a list, mapping each integer to its square. We can use a dictionary comprehension. It differs from a list comprehension in using curly braces instead of square brackets, and in specifying two expressions separated by a colon to the left of the

for.

print {n: n*n for n in range(1,6)}
{1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

This is equivalent to:

d = {}
for n in range(1,6):
    d[n] = n*n
print d

Dictionary comprehensions are a great of example of how we can do a lot in Python with a single, very readable, line of code.