Conditional Results and Grouping

Grouping by varying conditions is something that is hard to accomplish using straight SQL, but is something that comes up from time to time with analysis. Perhaps consider it an extended version of the more famous “group-wise maximum” problem. Since the “real life” problem I recently addressed involves our internal systems that I can’t talk about here, I’ll give another example that deals with the same issues.

The scenario: You have a list of students and classes (and a mapping of student/class), and the students all have grades and some have scholarships.

students
+------------+--------------+-----------------+
| student_id | student_name | has_scholarship |
+------------+--------------+-----------------+
|       1234 | John         | yes             |
|       1235 | Jane         | no              |
|       1236 | Joe          | no              |
|       1237 | Jennifer     | yes             |
|       1238 | Jacob        | no              |
+------------+--------------+-----------------+

classes
+----------+------------+
| class_id | class_name |
+----------+------------+
|        1 | Science    |
|        2 | History    |
|        3 | Maths      |
|        4 | Literature |
+----------+------------+

enrollment
+------------+----------+-------------+
| student_id | class_id | grade_point |
+------------+----------+-------------+
|       1235 |        1 |         4.0 |
|       1236 |        1 |         3.0 |
|       1234 |        2 |         3.5 |
|       1238 |        2 |         2.0 |
|       1237 |        3 |         4.0 |
|       1235 |        4 |         2.0 |
|       1238 |        4 |         2.5 |
+------------+----------+-------------+

-- load this straight into mysql
create table students (student_id int primary key, student_name varchar(100), has_scholarship enum('yes', 'no'));
create table classes (class_id int primary key, class_name varchar(100));
create table enrollment (student_id int, class_id int, grade_point decimal(2,1), primary key (student_id, class_id));
insert into students values (1234,"John", "yes"),(1235, "Jane", "no"),(1236, "Joe", "no"),(1237, "Jennifer", "yes"),(1238, "Jacob", "no");
insert into classes values (1, "Science"), (2, "History"), (3, "Maths"), (4, "Literature");
insert into enrollment values (1235,1, 4.0),(1236, 1, 3.0),(1234, 2, 3.5),(1238, 2, 2.0),(1237, 3, 4.0), (1235,4,2.0), (1238,4,2.5);

For a school list, they need to identify a top student from every class so they can publish this in the school newsletter. The criteria for the report is that they want is that for any class that has a student with a scholarship, use that student. But if the class has no students with a scholarship, include a student with a 4.0 grade. If the class has no one with a scholarship and no 4.0 students, we just “forget” to mention the class at all in the report.

This seems simple based on the requirements, but when you start to dive into the implementation the complexity starts to emerge. You know you want to use your trusty friend GROUP BY class_id there, but how to select a student?

One approach is to write two queries, and then UNION them together. This approach would involve looking for classes with no scholarship students and selecting a 4.0 student; and then joining that with the list of classes with scholarship students. Pretty easy, yes, but it does not scale well as the data set increases beyond our silly little example here.

Another approach may be to use a series of nested subqueries using some fancy MySQL @variable tricks. I started going down this path before realizing that it too would scale poorly for the “real-life” dataset I was considering. And I wasn’t going to even _consider_ correlated subqueries.


What I finally settled on was to use a combination of GROUP BY, HAVING, and COALESCE. In general I try to avoid the use of HAVING as it causes the server to process rows before discarding them — however here for my real data set it is a perfect compromise as the data set is very large, however we only need to filter out a few records per grouped output row.

Final query:

SELECT
        COALESCE(MAX(IF(students.has_scholarship='yes',students.student_id, NULL)), MAX(IF   (enrollment.grade_point=4.0,enrollment.student_id, NULL)), NULL) preferred_student_id,
        students.student_id,
        students.student_name,
        classes.class_id,
        classes.class_name
FROM    students
JOIN    enrollment on enrollment.student_id=students.student_id
JOIN    classes on classes.class_id=enrollment.class_id
GROUP BY
        classes.class_id
HAVING  students.student_id=preferred_student_id
AND     preferred_student_id IS NOT NULL;


This was the first time I had ever found a use for the COALESCE statement. :-) I will also say that my first approach was going to be processing this data inside the application which consumed the query, but since this was legacy codebase that no one wanted to modify the pure SQL approach seemed superior.

So am I nuts? How would you do it?

UPDATE Roland B points out that my approach doesn’t actually work. His solution however, works great.

select c.class_id
, c.class_name
, substring_index(group_concat(
s.student_name
order by
if(s.has_scholarship=’yes’,0,1)
, e.grade_point desc
), ‘,’, 1)
from enrollment e
inner join students s
on e.student_id = s.student_id
inner join classes c
on e.class_id = c.class_id
where s.has_scholarship = ‘yes’ or e.grade_point = 4
group by class_id

5 Responses to “Conditional Results and Grouping”

  1. Scott Noyes Says:

    Why not add the scholarship/grade restrictions to a WHERE clause? Eliminates the COALESCE and the HAVING. Then you just have to resolve selecting the scholar over the non-scholar, which is just groupwise-max on the `has_scholarship` field.

    WHERE students.has_scholarship = ‘yes’ OR enrollment.grade_point = 4.0

    What do you do in the case of ties (two scholars, or no scholars and two 4.0 students)?

    What do you do if the scholar has a 2.0 grade but there is a non-scholar with a 4.0?

  2. ryan Says:

    Scott, I knew it was a varient of the groupwise max, but the multiple conditions threw me. The problem seems simple but when you try to implement it the underlying complexity comes out. I’d be interested to see how you’d approach it from that direction though.

    > What do you do if the scholar has a 2.0 grade but
    > there is a non-scholar with a 4.0?

    We always return the scholarship student.

    > What do you do in the case of ties (two scholars,
    > or no scholars and two 4.0 students)?

    Let’s assume that only one scholarship student could exist per class — here is where the analogy breaks down a bit. But really, we just want one row per class: if it has a scholarship student, use any of those; and if not then choose any 4.0 student.

    If it helps… in the “real” dataset there are millions of “classes”, and only 1-5 “students” per class. This is why I feel it lends itself to the HAVING solution as it avoids a self-join.

  3. Roland BOuman Says:

    Hi!

    I don’t think your code is perfect – with your original set I get this result:

    1 Science Jane
    2 History John
    3 Maths Jennifer

    Now jennifer has a scolarship and grade 4.0 at maths. Now I add Jane to her maths enrollment. Jane does not have a scholarship and also has a grade 4.0

    insert into enrollment values (1235, 3, 4.0)

    According to your spec:

    “for any class that has a student with a scholarship, use that student. ”

    SO the result should not change. However, they do – suddenly the Maths class is gone from the list. Or maybe that was intended?

    Anyway – you can get around it in a simple manner: GROUP_CONCAT:

    select c.class_id
    , c.class_name
    , substring_index(group_concat(
    s.student_name
    order by
    if(s.has_scholarship=’yes’,0,1)
    , e.grade_point desc
    ), ‘,’, 1)
    from enrollment e
    inner join students s
    on e.student_id = s.student_id
    inner join classes c
    on e.class_id = c.class_id
    where s.has_scholarship = ‘yes’ or e.grade_point = 4
    group by class_id

  4. ryan Says:

    Roland, you are absolutely correct. The GROUP BY I was using prevents the student_id from matching except in my test case.

    Your approach is much better (meaning that it works ;-) ), and unlike other approaches does not require a self join. It is also a perfect fit for the dataset I am considering, which is many “classes” and few “students” per class. Clever, thanks!

    I am going going to update the entry since you have invalidated my approach, and suggest your solution instead.

  5. Roland BOuman Says:

    Ryan, thanks for the kind words – nice to do a challenge like this once in a while.

    Now if only group_concat would have a LIMIT clause, that would be great for this type of thing…

Leave a Reply