Friday, June 9, 2017

Turning unstructured text into a SQL query

It sounds trivial, but querying a SQL database for records that match data provided in an unstructured format such as a text file can be a challenge. The goal is to construct a SQL query with a possible lengthy in clause, without a lot of manual copying and pasting. Consider this example: a colleague supplies a list of customer names in the body of an email message. (The principle is the same if the list is in a text file, a PDF document, or any similar format.) Something like this:

Hey Steve,
Just got this list of customers from Marketing. Could you look up the date of most recent login for each of them? Of course it's an emergency, has to be done before tomorrow's big conference.
Thanks,
Lenny Lastminute
-------------------------------
Acme Anvils, Inc.
Big Bertha's Balloons, LLC
Crazy Cars & Co
Dynamite Dog Food Enterprises
Excellent Eggs, Inc.

Since the list has only five customers, it wouldn't be hard to copy and paste to come up with a query like this:
select name, last_login_date from customer where name in (
    'Acme Anvils, Inc.',
    'Big Bertha''s Balloons, LLC',
    'Crazy Cars & Co',
    'Dynamite Dog Food Enterprises',
    'Excellent Eggs, Inc.'
) order by name

But now let's imagine that the list has hundreds of entries, and the customer table has millions of rows, so we don't just want to retrieve all of them and manually identify the relevant ones. Fortunately, there are some tricks we can do in LibreOffice Calc (or Excel or your favorite spreadsheet program if you're not a Linux geek).

Start by pasting the list into the first column of a blank spreadsheet:

Now place a formula like this in cell B1:
="select name, last_login_date from customer where name in ('" & A1 & "'"

Yep, those are single-quotes (which SQL requires) inside double quotes (which LibreOffice Calc requires).

And place a formula like this in cell B2:
=B1 & ", '" & A2 & "'"

Copy the formula from cell B2 to all the remaining cells in column B. You'll end up with something like this:

Then we just need to close the parens, and perhaps throw in an order by clause and a semicolon, which we can do with a formula like this below the last cell in column B:
=B5 & ") order by name;"

That cell now contains our SQL query:
select name, last_login_date from customer where name in ('Acme Anvils, Inc.', 'Big Bertha's Balloons, LLC', 'Crazy Cars & Co', 'Dynamite Dog Food Enterprises', 'Excellent Eggs, Inc.') order by name;

But we're not quite done. If you have sharp eyes, you may have noticed the SQL syntax error. One of the customer names, Big Bertha's Balloons, LLC, contains an apostrophe, which is the same as a single quote, which is a reserved character in SQL. We can escape it by doubling it. To automate replacing ' with '', change the formula from =B1 & ", '" & A2 & "'" to:
=B1 & ", '" & SUBSTITUTE(A2, "'", "''") & "'"

This results in a valid SQL query, like so:
select name, last_login_date from customer where name in ('Acme Anvils, Inc.', 'Big Bertha''s Balloons, LLC', 'Crazy Cars & Co', 'Dynamite Dog Food Enterprises', 'Excellent Eggs, Inc.') order by name;

I hope this saves you some time someday!