Wednesday, January 18, 2017

A uWSGI Bug: A (Partial) Success Story About Logging and Testing

Users of SpinCar's mobile app recently started mentioning that they were seeing an error message. Not every user was affected, but it happened to enough users that it was clearly not an isolated problem. The error message was far from specific, indicating that something unexpected went wrong without giving any details.

The first place we looked to isolate the problem was MixPanel. We use this service to log events that occur in the app, including nonfatal errors. (We also use Fabric.io's Crashlytics, but that tends to be more useful for fatal errors that actually crash the app.) Fortunately, MixPanel did reveal details about at least one occurrence of the error. We could see that the app had made a particular call to the our back-end API, and that the API responded with an HTTP status of 502 (bad gateway). Furthermore, the full details of the offending HTTP request had been logged.

Our good luck in continued when we tried to replicate the problem. Simply by making the same HTTP request in a web browser, we provoked the same 502 response, using the production copy of the API. The next question was: Could we do so in the development environment, where we'd be able to troubleshoot and test potential fixes? The answer was yes! Running a local copy of the API and sending it the same request once again led to the 502 response.

In the development environment, we were able to see a detailed error message mentioning "invalid request block size." A Google search revealed that uwsgi, which use to serve our API, limits the size of requests, and the default maximum is 4,096 bytes. We corrected the problem by launching uwsgi with the buffer-size parameter to set a larger maximum. The bug was resolved within a few hours of the initial report, with no need to submit a new version of the iOS app for Apple's approval.

That's a success story, and the success was driven largely by two factors. First, adequate logging enabled us to get details about the error, even though the message displayed to users was uninformative, and the users themselves were non-technical customer personnel at distant locations. Second, the development environment closely mimics the production environment. In this case, it was a critical factor that uwsgi was used -- and used with the same settings -- in dev as well as on prod.

But, of course, the success is only partial. Real success would be avoiding the bug in the first place. How did this error slip through the cracks despite the fact that our app and API undergo extensive code review and automated testing? The answer is that the bug only occurs with very large requests, which only occur when users have used the app to create a great many videos -- far more than are created in our automated tests. This bug could have been revealed by the type of testing known as load testing or stress testing, in which actions are performed repeatedly and large data sets are used. There are a couple of reasons we haven't implemented this type of testing yet. One is that the open-source automated testing tools we rely on tend to crash during long test runs. But the main reason is that we haven't yet found the time to develop load tests. It's definitely on the to-do list!

No comments:

Post a Comment