Testing SQL the hard way

Neal Lathia
2 min readApr 5, 2020

Looking at the minutiae of discovering when analytics SQL is completely broken (the hard way).

⚠️ Note: You can jump to the complete blog post on my github pages blog; the first section is below.

📊 Context

It’s not uncommon for Data Science workflows to have large chunks of SQL. Maybe you have a sequence of queries that you run every day to produce dashboards, or maybe you have a bunch of queries that spit out features that you feed into machine learning algorithms. If you’re a Data Scientist, you’re bound to have written a SELECT statement at some point of your day!

There’s an online wiki for programming languages called progopedia which lists SQL as “not a programming language:” this perfectly characterises everything that is strange about these words that we write, execute, and rely on to get our insights right every single day. SQL doesn’t have all the sensible things that real programming languages have to help us feel confident that they are doing what we think they’re doing. Primarily, it lacks a straightforward abstraction for “unit” testing.

About a year and half ago, a colleague Jack and I were wrangling with a query that had to be right–we are, after all, working in a regulated industry. We were dealing with a bunch of queries that were orchestrated together with Airflow. The main problem we had was that we could not rely on the input to our query being right and needed to prevent this query from quietly succeeding if anything in its results was wrong.

We went the entire journey from manually validating entries in our table to automating the detection and reporting of inconsistencies so that we could flag or fix errors that were upstream of our query: this post is a heavily simplified example of what we did. In case you’re wondering, I’ve called the post “testing SQL the hard way” because we have now abandoned this approach (and replaced it with dbt). But it had some fun lessons, so here goes!

👯‍♂️ A table of users

Jump over to my github blog to read the full post.

--

--