Thursday, January 1, 2015

Avoiding inaccurate results when using SQL joins with aggregate functions

Using JOIN in a SQL query is a powerful way to combine data from two or more tables. However, if a join is used incorrectly in combination with an aggregate function such as COUNT or SUM, it may not give the output you intended. With a little care, you can get the right result.

As an example, suppose our database represents a group of students. Each student has zero or more email addresses. So we'll have a table named STUDENT, as well as a table named EMAIL that includes a STUDENT_ID column. An email address can be marked inactive, meaning the student doesn't currently use it.

Our goal is to count the number of email addresses for each student.

Here's the SQL to create the two tables:

create table student
(
  id integer,
  name character varying
);

create table email
(
  student_id integer,
  address character varying,
  is_active boolean
);


Let's populate our STUDENT and EMAIL tables with some data:

insert into student values(1, 'Steve');
insert into student values(2, 'Dana');
insert into student values(3, 'Jason');
insert into student values(4, 'Mark');

insert into email values(1, 'steve@gmail.com', true);
insert into email values(1, 'steve@yahoo.com', true);
insert into email values(2, 'dana@gmail.com', true);
insert into email values(2, 'dana@gmail.com', false);
insert into email values(4, 'mark@gmail.com', false);


Here are the resulting contents of the two tables:


















Notice that Jason has no email address, and Mark's only email address is inactive.

As a first attempt at counting email addresses, we might try this:

select s.name, count(e.address)
from student s
join email e on e.student_id = s.id
group by (s.name)


 
The counts for Dana, Mark and Steve are correct, but Jason is missing. We forgot to use a LEFT OUTER JOIN, which will include every record from the STUDENT table even if there is no corresponding record in the EMAIL table. So let's try:

select s.name, count(e.address)
from student s
left outer join email e on e.student_id = s.id
group by (s.name)










As we wanted, Jason shows up with a count of 0. So far, so good. Next, suppose we want to count only active email addresses. We expect the count for Dana to be 1, and for Mark to be 0. Let's try this query:

select s.name, count(e.address)
from student s
left outer join email e on e.student_id = s.id
where e.is_active
group by (s.name)








Jason and Mark have gone missing. To understand why, consider the output of this simple JOIN query:

select * from student s join email e on e.student_id = s.id:











Now condition the query on e.is_active, and only three rows will remain:

select * from student s join email e on e.student_id = s.id where e.is_active








It's these three rows that are considered by the aggregate function COUNT: two for Steve, one for Dana, and none for Jason or Mark.

How can we avoid this problem and make sure results are displayed for all students? The answer is to apply the filter e.is_active as part of the JOIN condition, not in the query's WHERE clause. Now we get the desired result:

select s.name, count(e.address)
from student s
left outer join email e on e.student_id = s.id and e.is_active
group by (s.name)










This type of mistake can be easy to overlook. When dealing with a large database, if you use the wrong query, the results may still look reasonable: you might not notice that a few students are missing. The moral of the story is, when using JOIN together with aggregate functions, always consider whether you want any filter conditions to be part of the JOIN condition, or part of a WHERE clause.