Grouping by varying conditions is something that is hard to accomplish using straight SQL, but is something that comes up from time to time with analysis. Perhaps consider it an extended version of the more famous “group-wise maximum” problem. Since the “real life” problem I recently addressed involves our internal systems that I can’t talk about here, I’ll give another example that deals with the same issues.
The scenario: You have a list of students and classes (and a mapping of student/class), and the students all have grades and some have scholarships.
students
+------------+--------------+-----------------+
| student_id | student_name | has_scholarship |
+------------+--------------+-----------------+
| 1234 | John | yes |
| 1235 | Jane | no |
| 1236 | Joe | no |
| 1237 | Jennifer | yes |
| 1238 | Jacob | no |
+------------+--------------+-----------------+
classes
+----------+------------+
| class_id | class_name |
+----------+------------+
| 1 | Science |
| 2 | History |
| 3 | Maths |
| 4 | Literature |
+----------+------------+
enrollment
+------------+----------+-------------+
| student_id | class_id | grade_point |
+------------+----------+-------------+
| 1235 | 1 | 4.0 |
| 1236 | 1 | 3.0 |
| 1234 | 2 | 3.5 |
| 1238 | 2 | 2.0 |
| 1237 | 3 | 4.0 |
| 1235 | 4 | 2.0 |
| 1238 | 4 | 2.5 |
+------------+----------+-------------+
-- load this straight into mysql
create table students (student_id int primary key, student_name varchar(100), has_scholarship enum('yes', 'no'));
create table classes (class_id int primary key, class_name varchar(100));
create table enrollment (student_id int, class_id int, grade_point decimal(2,1), primary key (student_id, class_id));
insert into students values (1234,"John", "yes"),(1235, "Jane", "no"),(1236, "Joe", "no"),(1237, "Jennifer", "yes"),(1238, "Jacob", "no");
insert into classes values (1, "Science"), (2, "History"), (3, "Maths"), (4, "Literature");
insert into enrollment values (1235,1, 4.0),(1236, 1, 3.0),(1234, 2, 3.5),(1238, 2, 2.0),(1237, 3, 4.0), (1235,4,2.0), (1238,4,2.5);
For a school list, they need to identify a top student from every class so they can publish this in the school newsletter. The criteria for the report is that they want is that for any class that has a student with a scholarship, use that student. But if the class has no students with a scholarship, include a student with a 4.0 grade. If the class has no one with a scholarship and no 4.0 students, we just “forget” to mention the class at all in the report.
This seems simple based on the requirements, but when you start to dive into the implementation the complexity starts to emerge. You know you want to use your trusty friend GROUP BY class_id there, but how to select a student?
One approach is to write two queries, and then UNION them together. This approach would involve looking for classes with no scholarship students and selecting a 4.0 student; and then joining that with the list of classes with scholarship students. Pretty easy, yes, but it does not scale well as the data set increases beyond our silly little example here.
Another approach may be to use a series of nested subqueries using some fancy MySQL @variable tricks. I started going down this path before realizing that it too would scale poorly for the “real-life” dataset I was considering. And I wasn’t going to even _consider_ correlated subqueries.
What I finally settled on was to use a combination of GROUP BY, HAVING, and COALESCE. In general I try to avoid the use of HAVING as it causes the server to process rows before discarding them — however here for my real data set it is a perfect compromise as the data set is very large, however we only need to filter out a few records per grouped output row.
Final query:
SELECT
COALESCE(MAX(IF(students.has_scholarship='yes',students.student_id, NULL)), MAX(IF (enrollment.grade_point=4.0,enrollment.student_id, NULL)), NULL) preferred_student_id,
students.student_id,
students.student_name,
classes.class_id,
classes.class_name
FROM students
JOIN enrollment on enrollment.student_id=students.student_id
JOIN classes on classes.class_id=enrollment.class_id
GROUP BY
classes.class_id
HAVING students.student_id=preferred_student_id
AND preferred_student_id IS NOT NULL;
This was the first time I had ever found a use for the COALESCE statement.
I will also say that my first approach was going to be processing this data inside the application which consumed the query, but since this was legacy codebase that no one wanted to modify the pure SQL approach seemed superior.
So am I nuts? How would you do it?
UPDATE Roland B points out that my approach doesn’t actually work. His solution however, works great.
select c.class_id
, c.class_name
, substring_index(group_concat(
s.student_name
order by
if(s.has_scholarship=’yes’,0,1)
, e.grade_point desc
), ‘,’, 1)
from enrollment e
inner join students s
on e.student_id = s.student_id
inner join classes c
on e.class_id = c.class_id
where s.has_scholarship = ‘yes’ or e.grade_point = 4
group by class_id