Archive for April, 2009

Why 5.4?

Wednesday, April 22nd, 2009

The second most frequent question people have asked me since MySQLConf started is “Why 5.4? What happened to 5.2 and 5.3?  Why not 5.5?”  I got an answer to this at lunch today from someone who was involved in the decision making process on this.  So here is the story as I understand it.

Why not 5.2?  If you recall your ancient history, 5.2 was the original plan for post-5.1 features, before it was renamed to 6.0 for marketing reasons.  Because of some internal issues at Sun/MySQL, and also to reduce confusion, they did not want to reuse that same version number for the next product.  This makes a lot of sense to me: reusing the same version number for two different products could definitely cause some confusion.

Why not 6.0?  The short answer is that because of the reduced featureset, the new version didn’t warrant a whole new major version number.  I think this is correct, because if they released 6.0 without Falcon, Maria, Online Backup, etc., it would feel pretty strange.  Instead we get the impression that this is a stepping stone on the way to the features that were originally promised in 6.0.

Why not 5.5?  Well 5.5 can imply that you are “halfway” to something else, which again implies a major new featureset, and the 5.4 features are really incremental (though awesome* and much needed**) features.  They can now save up the more impressive 5.5 number for a release that adds features more dramatically.

So why 5.4 instead of 5.3?  The answer I heard on this was that they wanted to create a gap between the aborted 5.2 version and what comes next.  Maybe someone else can clarify this a bit better, but this was the answer I heard.  I guess this makes sense too… if they skipped just one version it would seem like they are missing something; if they skip two versions it’s clearer that it is just an arbitrary number to indicate the incremental improvements.

So when you add it all up, the version number 5.4 actually starts to make a bit more sense.

* awesome: performance on multicore so we can better scale vertically
** much needed: signal/resignal in stored procs

stunned

Monday, April 20th, 2009

I am stunned at the news that Oracle is buying Sun.  This is not because I fear change or uncertainty; this time last year I was cautiously optimistic about Sun’s purchase of MySQL.  But not this year: this time it’s fear and disappointment over what this means for MySQL.

When I read this as a rumour a few weeks ago I thought it was a joke of an idea.  Why would a high margin software company want to buy a declining hardware business, even if that hardware is great?  As for their software, I cannot imagine that Oracle is interested in Java, MySQL, etc. as revenue generating products; they would just be a tiny blip for them.

It will be incredibly interesting to see what comes next, and I’m sure we’ll see a lot of that at the UC.  I’ll be honest though… I need some convincing, and I imagine I’m not the only one.

Conditional Results and Grouping

Wednesday, April 8th, 2009

Grouping by varying conditions is something that is hard to accomplish using straight SQL, but is something that comes up from time to time with analysis. Perhaps consider it an extended version of the more famous “group-wise maximum” problem. Since the “real life” problem I recently addressed involves our internal systems that I can’t talk about here, I’ll give another example that deals with the same issues.

The scenario: You have a list of students and classes (and a mapping of student/class), and the students all have grades and some have scholarships.

students
+------------+--------------+-----------------+
| student_id | student_name | has_scholarship |
+------------+--------------+-----------------+
|       1234 | John         | yes             |
|       1235 | Jane         | no              |
|       1236 | Joe          | no              |
|       1237 | Jennifer     | yes             |
|       1238 | Jacob        | no              |
+------------+--------------+-----------------+

classes
+----------+------------+
| class_id | class_name |
+----------+------------+
|        1 | Science    |
|        2 | History    |
|        3 | Maths      |
|        4 | Literature |
+----------+------------+

enrollment
+------------+----------+-------------+
| student_id | class_id | grade_point |
+------------+----------+-------------+
|       1235 |        1 |         4.0 |
|       1236 |        1 |         3.0 |
|       1234 |        2 |         3.5 |
|       1238 |        2 |         2.0 |
|       1237 |        3 |         4.0 |
|       1235 |        4 |         2.0 |
|       1238 |        4 |         2.5 |
+------------+----------+-------------+

-- load this straight into mysql
create table students (student_id int primary key, student_name varchar(100), has_scholarship enum('yes', 'no'));
create table classes (class_id int primary key, class_name varchar(100));
create table enrollment (student_id int, class_id int, grade_point decimal(2,1), primary key (student_id, class_id));
insert into students values (1234,"John", "yes"),(1235, "Jane", "no"),(1236, "Joe", "no"),(1237, "Jennifer", "yes"),(1238, "Jacob", "no");
insert into classes values (1, "Science"), (2, "History"), (3, "Maths"), (4, "Literature");
insert into enrollment values (1235,1, 4.0),(1236, 1, 3.0),(1234, 2, 3.5),(1238, 2, 2.0),(1237, 3, 4.0), (1235,4,2.0), (1238,4,2.5);

For a school list, they need to identify a top student from every class so they can publish this in the school newsletter. The criteria for the report are as follows: for any class that has a student with a scholarship, use that student. If the class has no students with a scholarship, include a student with a 4.0 grade. If the class has no one with a scholarship and no 4.0 students, we just “forget” to mention the class at all in the report.

This seems simple based on the requirements, but when you start to dive into the implementation the complexity starts to emerge. You know you want to use your trusty friend GROUP BY class_id there, but how to select a student?

One approach is to write two queries, and then UNION them together. This approach would involve looking for classes with no scholarship students and selecting a 4.0 student; and then joining that with the list of classes with scholarship students. Pretty easy, yes, but it does not scale well as the data set increases beyond our silly little example here.
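Here is a minimal sketch of what that UNION approach could look like on the sample data. It is an illustration, not the query from the real system: SQLite (via Python's `sqlite3`) stands in for MySQL so the example runs anywhere, the column types are simplified to `text` and `real`, and the `sch` derived table is a name made up for this sketch.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL so this sketch is runnable as-is.
# The UNION query itself uses only syntax that both engines accept.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table students (student_id int primary key, student_name text, has_scholarship text);
create table classes (class_id int primary key, class_name text);
create table enrollment (student_id int, class_id int, grade_point real,
                         primary key (student_id, class_id));
insert into students values (1234,'John','yes'),(1235,'Jane','no'),(1236,'Joe','no'),
                            (1237,'Jennifer','yes'),(1238,'Jacob','no');
insert into classes values (1,'Science'),(2,'History'),(3,'Maths'),(4,'Literature');
insert into enrollment values (1235,1,4.0),(1236,1,3.0),(1234,2,3.5),(1238,2,2.0),
                              (1237,3,4.0),(1235,4,2.0),(1238,4,2.5);
""")

UNION_QUERY = """
-- Branch 1: scholarship students in each class
SELECT c.class_id, c.class_name, s.student_name
FROM enrollment e
JOIN students s ON s.student_id = e.student_id
JOIN classes c ON c.class_id = e.class_id
WHERE s.has_scholarship = 'yes'
UNION
-- Branch 2: 4.0 students, but only in classes without a scholarship student
SELECT c.class_id, c.class_name, s.student_name
FROM enrollment e
JOIN students s ON s.student_id = e.student_id
JOIN classes c ON c.class_id = e.class_id
LEFT JOIN (
    SELECT DISTINCT e2.class_id
    FROM enrollment e2
    JOIN students s2 ON s2.student_id = e2.student_id
    WHERE s2.has_scholarship = 'yes'
) sch ON sch.class_id = e.class_id
WHERE e.grade_point = 4.0
  AND sch.class_id IS NULL
ORDER BY 1
"""

top_students = conn.execute(UNION_QUERY).fetchall()
for row in top_students:
    print(row)
```

Note that this sketch returns every qualifying student, so a class with two scholarship students (or two 4.0 students) would show up twice; the sample data happens to have no such ties.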

Another approach may be to use a series of nested subqueries using some fancy MySQL @variable tricks. I started going down this path before realizing that it too would scale poorly for the “real-life” dataset I was considering. And I wasn’t going to even _consider_ correlated subqueries.


What I finally settled on was to use a combination of GROUP BY, HAVING, and COALESCE. In general I try to avoid HAVING, as it causes the server to process rows before discarding them. For my real data set, though, it is a perfect compromise: the data set is very large, but we only need to filter out a few records per grouped output row.

Final query:

SELECT
        COALESCE(
                MAX(IF(students.has_scholarship='yes', students.student_id, NULL)),
                MAX(IF(enrollment.grade_point=4.0, enrollment.student_id, NULL))
        ) preferred_student_id,
        students.student_id,
        students.student_name,
        classes.class_id,
        classes.class_name
FROM    students
JOIN    enrollment on enrollment.student_id=students.student_id
JOIN    classes on classes.class_id=enrollment.class_id
GROUP BY
        classes.class_id
HAVING  students.student_id=preferred_student_id
AND     preferred_student_id IS NOT NULL;


This was the first time I had ever found a use for the COALESCE function. :-) I will also say that my first approach was going to be processing this data inside the application which consumed the query, but since this was a legacy codebase that no one wanted to modify, the pure SQL approach seemed superior.
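For comparison, the application-side processing mentioned above might look roughly like the sketch below. It is only an illustration in Python, not the legacy application's code: `JOINED_ROWS`, `rank` and `pick_top_students` are hypothetical names, the rows are what a plain three-table join over the sample data would return, and breaking ties by highest grade is an assumption, since the requirements do not say which student to pick when several qualify.

```python
# Each row: (class_id, class_name, student_name, has_scholarship, grade_point),
# i.e. the output of a plain join of enrollment, students and classes.
JOINED_ROWS = [
    (1, "Science", "Jane", "no", 4.0),
    (1, "Science", "Joe", "no", 3.0),
    (2, "History", "John", "yes", 3.5),
    (2, "History", "Jacob", "no", 2.0),
    (3, "Maths", "Jennifer", "yes", 4.0),
    (4, "Literature", "Jane", "no", 2.0),
    (4, "Literature", "Jacob", "no", 2.5),
]


def rank(row):
    """Sort key: scholarship students first, then higher grades (assumed tie-break)."""
    _, _, _, has_scholarship, grade_point = row
    return (0 if has_scholarship == "yes" else 1, -grade_point)


def pick_top_students(rows):
    """Return {class_id: (class_name, student_name)} for classes with a qualifying student."""
    best = {}
    for row in rows:
        class_id, _, _, has_scholarship, grade_point = row
        # Only scholarship holders and 4.0 students are eligible at all.
        if has_scholarship != "yes" and grade_point != 4.0:
            continue
        current = best.get(class_id)
        if current is None or rank(row) < rank(current):
            best[class_id] = row
    return {cid: (r[1], r[2]) for cid, r in sorted(best.items())}


report = pick_top_students(JOINED_ROWS)
for class_id, (class_name, student_name) in report.items():
    print(class_id, class_name, student_name)
```

The trade-off is that every joined row has to travel to the application before most of them are thrown away, which is exactly what a pure SQL solution avoids.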

So am I nuts? How would you do it?

UPDATE: Roland B points out that my approach doesn’t actually work. His solution, however, works great:

select c.class_id
     , c.class_name
     , substring_index(group_concat(
           s.student_name
           order by
               if(s.has_scholarship = 'yes', 0, 1)
             , e.grade_point desc
       ), ',', 1)
from enrollment e
inner join students s
    on e.student_id = s.student_id
inner join classes c
    on e.class_id = c.class_id
where s.has_scholarship = 'yes' or e.grade_point = 4
group by c.class_id;

Adding new partitions beyond MAXVALUE

Thursday, April 2nd, 2009

I have found that MySQL RANGE partitions on the primary key are a great way to achieve scale for insert-heavy InnoDB tables.  I have used this to maintain an excellent and predictable insert rate, and to avoid some of the well documented problems with insert performance as table sizes grow (especially with large or numerous secondary indexes).  In addition, purging old data is fast and non-blocking because you can just DROP PARTITION as a single very fast operation.

However, MySQL partitioning brings up an interesting little issue when you exceed the space you initially allocate for your partitions.  Here in our sample table we define 4 partitions to handle 100M records, and everything after that will fall into the “pmx” bucket.


CREATE TABLE partition_test (
`id` int unsigned primary key auto_increment,
`payload` varchar(35) not null default '',
`stamp` timestamp default current_timestamp on update current_timestamp
) ENGINE=InnoDB
PARTITION BY RANGE (`id`) (
PARTITION p00 VALUES LESS THAN (025000000),
PARTITION p01 VALUES LESS THAN (050000000),
PARTITION p02 VALUES LESS THAN (075000000),
PARTITION p03 VALUES LESS THAN (100000000),
PARTITION pmx VALUES LESS THAN MAXVALUE
);

Let’s say that you get to 75M records, and now you want to extend your partition set beyond your current allocation.  The simple way is to just ADD PARTITION, like so:


ALTER TABLE partition_test ADD PARTITION (
PARTITION p04 VALUES LESS THAN (125000000)
);
# yields:
ERROR 1481 (HY000): MAXVALUE can only be used in last partition definition

This is uncool.  One option is to DROP the MAXVALUE partition, and then add it… but I find that a bit scary.  Thankfully there is a better way to add the new partition, by using REORGANIZE:


ALTER TABLE partition_test REORGANIZE PARTITION pmx INTO (
PARTITION p04 VALUES LESS THAN (125000000),
PARTITION pmx VALUES LESS THAN MAXVALUE
);

By using REORGANIZE, you can split pmx into the new p04 and a fresh pmx in a single step.  Any rows that are currently in pmx have to be copied into the new partitions as part of the REORGANIZE, but if you are thinking ahead and have no records in that partition yet, REORGANIZE is a very fast and safe operation.
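If you extend the table regularly, the statement is easy to generate. The sketch below is one hypothetical way to do that in Python; `reorganize_statement` and its parameters are made-up names, and it assumes the naming scheme of the sample table (`pNN` partitions of 25M ids each, with `pmx` as the catch-all). It only builds the SQL text and does not talk to a server.

```python
def reorganize_statement(table, catch_all, next_index, last_bound, step, count):
    """Build an ALTER TABLE ... REORGANIZE PARTITION statement that splits the
    MAXVALUE partition into `count` new range partitions plus a new catch-all.

    Identifiers are interpolated as-is, so pass trusted names only.
    """
    parts = []
    for i in range(count):
        bound = last_bound + step * (i + 1)
        parts.append(
            f"PARTITION p{next_index + i:02d} VALUES LESS THAN ({bound})"
        )
    # The catch-all partition must come last, as the error message above says.
    parts.append(f"PARTITION {catch_all} VALUES LESS THAN MAXVALUE")
    body = ",\n".join(parts)
    return f"ALTER TABLE {table} REORGANIZE PARTITION {catch_all} INTO (\n{body}\n);"


# Add p04 and p05 after the existing p03 (which ends at 100M).
stmt = reorganize_statement(
    table="partition_test",
    catch_all="pmx",
    next_index=4,
    last_bound=100_000_000,
    step=25_000_000,
    count=2,
)
print(stmt)
```

Adding more than one partition at a time keeps you further ahead of the insert rate, so pmx is more likely to still be empty the next time you need to split it.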