usefulness of the query cache

September 30th, 2009

Lot of talk lately about the effectiveness or lack thereof of the MySQL query cache.  I’m kind of surprised we are still having this discussion in late 2009, I really thought these issues have been long settled.

It’s well known that the query cache is terrible for environments with a constant stream of writes.  Each write invalidates the cached results for every query touching that table.  This means that your database needs to be constantly inserting and purging the query cache in an environment with mixed reads and writes (aka the most common type of “Web 2.0″ workload you’ll pardon the outdated cliché).

But now, Mark’s post goes one step further and reveals that even in a completely read-only environment, the query cache is still not effective.  This was news to me, as I haven’t even bothered testing this feature in a few years.  I’m sure if you really tried you could devise a scenario where it would outperform the InnoDB default, and it seems that consultants have encountered some installations where it has helped.  However I’m quite certain that in a well-designed modern workload, the query cache is not going to be a good approach.

The obvious alternative solution is to use an external caching system, preferably a L1/L2 cache using APC/memcached or similar.  Combine that with the most excellent MySQL/memcached UDF to intelligently prune your cache and you’ve got a system that easily can achieve performance at high concurrency.

I know what I’m saying here is not exactly breaking news (memcached was created 6 years ago), this really is the standard implementation now… my only point is that any discussion about the query cache that doesn’t suggest an external L1/L2 cache is missing the most important aspect.

Drizzle made the right call to remove the query cache feature entirely, with a plugin interface to optionally support that as needed.  That approach may be too drastic for MySQL itself, which still has some low-end use cases that may be able to get a performance benefit from it (and where the caches may be too much overhead).  In my opinion though, it should be disabled or set to DEMAND by default in a future version (and maybe ship the memcached UDF plugin too!).  Is it too late for 5.4 GA?

Disk I/O Bottlenecks

August 25th, 2009

My first approach to diagnosing a performance problem is to start by trying to find the system’s bottleneck — the limiting factor in a series of events that causes a slowdown for the whole process.

As a DBA and MySQL specialist, one of the first things I look at is to see if we are bottlenecked on disk I/O.  However, I don’t really like CPU iowait as a metric for measuring performance problems with disk I/O.  Here’s a not-too-uncommon example of how the traditional iowait approach is very misleading on a multicore server.

Machine tested: Sun Fire X4140 w/8 disks RAID1+0 and 8 CPU cores.

1) Generate a large file (20G) onto the RAID partition
# dd if=/dev/urandom of=/data/sample.dat bs=1024 count=20000000
2) Generate single-threaded I/O activity by copying the file back and forth
# while [ 1 -ne 0 ]; do cp /data/sample.dat /data/sample2.dat; done

While step #2 is going on, monitor the activity in top.  As you watch, it will tell you that load is around 2, the CPU is 70-85% idle, and iowait is only between 5-20%.  If the CPU is mostly idle and the iowait is that low, what is the earth is the bottleneck — the limiting factor preventing this process from finishing faster?

From this data it’s easy to conclude that some software inefficiency is causing a performance issue.  I’ve watched a senior sysadmin take this information and (wrongly) conclude that the performance issue was because MySQL sucks as it doesn’t scale to enough CPUs — when in fact it was actually making good utilization of the disks but only 4/8 CPUs were being fully utilized in a 4-threaded bulk load procedure.  This threw off iowait in precisely the same way that our test showed above.

So instead of using top, while the #2 process is still running look at the same server using iostat.  Specifically, run iostat -x so that you can see the %util — this represents disk utilization for your system devices.  Now you’ll see something much more informative than the output from top: you should see your RAID device utilization at a fairly consistent 80-100%.

Unlike the top output, this starts to give you a much better picture of what the bottleneck is in the process.  The iostat manpage says: “%util : Percentage  of  CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.”  There is our precious bottleneck that we’ve been searching for.

Knowing the bottleneck gives you a critical piece of information: in our simple test, the limiting factor was the disk.  Buying a better CPU or upgrading the network link really isn’t going to help this problem, instead we should focus on increasing disk I/O throughput (more/faster disks, SAN, etc) — and/or reducing the amount of stress we are placing on the disk via better query tuning, adding RAM, sharding, or any of the well known scaling solutions.

Toolset for quick and dirty bottleneck approach:

CPU bottleneck:
mpstat -P all  (or press “1″ inside top)

  • question: are you effectively using all your CPUs or is a subset doing most of the work?
  • followup: can you parallelize the processes to utilize more CPU?

Disk I/O bottleneck:
iostat -x -m 2

  • question: are any of your block devices saturated? what is the read/write I/O profile?
  • followup: tune database queries, increase # of spindles and/or drive speed, add RAM to push more data into the Innodb Buffer Pool, etc

Network bottleneck:
pktstat -t

  • question: what is your % utilization per interface?
  • followup: find someone who knows more about the network than me

Aside: how does iostat determine it’s magic utilization number?

Digging into the iostat source code and kernel documentation a bit, it all becomes a bit clearer. Utilization is determined by this simple formula:

#define S_VALUE(m,n,p) (((double) ((n) - (m))) / (p) * HZ)

xds->util = S_VALUE(sdp->tot_ticks, sdc->tot_ticks, itv);

itv is just an interval.  sdp and spc are the device’s past and current states.  But what does tot_ticks come from?

/* Try to read given stat file */
if ((fp = fopen(filename, "r")) == NULL)
return 0;

i = fscanf(fp, "%lu %lu %llu %lu %lu %lu %llu %lu %lu %lu %lu",
&rd_ios, &rd_merges_or_rd_sec, &rd_sec_or_wr_ios, &rd_ticks_or_wr_sec,
&wr_ios, &wr_merges, &wr_sec, &wr_ticks, &ios_pgr, &tot_ticks, &rq_ticks);

OK, now we’re getting somewhere. filename represents a file for your device: /sys/block/$device/stat — and tot_ticks is simply the 10th of 11 fields in that file.

What is that mysterious 10th field though?  The kernel docs have this to say: “This value counts the number of milliseconds during which the device has had I/O requests queued”.  And now, it’s crystal clear: the incrementing counter tells you how long the disk been queuing.  Take two samples over a known period of time and you can measure the percentage of time the device has been queuing stuff.  It’s very simple, but also incredibly effective!

So now in theory you should be able to match the result of iostat -x by running a bash script to sample that value and measure the change over time.  And yes, you can!  That bash script matches exactly what iostat -x tells you, and outputs to a status and log file telling you the minute average and peak.  Then you can take that output and feed it into your favourite monitoring software so you can see pretty graphs, send warnings, and correlate problem reports to spikes in real I/O activity — much more accurate than with iowait.

Why 5.4?

April 22nd, 2009

The second most frequent question people have asked me since MySQLConf started is “Why 5.4? What happened to 5.2 and 5.3?  Why not 5.5?”  I got an answer to this at lunch today from someone who was involved in the decision making process on this.  So here is the story as I understand it.

Why not 5.2?  IF you recall your ancient history, 5.2 was the original plan for post-5.1 features, before it was renamed to 6.0 for marketing reasons.  Because of some internal issues at Sun/MySQL and also to reduce confusion they did not want to reuse that same version number for the next product.  This makes a lot of sense to me, reusing the same version number for two different products could definitely cause some confusion.

Why not 6.0?  Short answer is that because of the reduced featureset, the new version didn’t warrant a whole new major version number.  I think this is correct, because if they released 6.0 without Falcon, Maria, Online Backup, etc it would feel pretty strange.  Instead we get the impression that this is a stepping stone on the way to the features that were originally promised in 6.0.

Why not 5.5?  Well 5.5 can imply that you are “halfway” to something else, which again implies a major new featureset, and the 5.4 features are really incremental (though awesome* and much needed**) features.  They can now save up the more impressive 5.5 number for a release that adds features more dramatically.

So why 5.4 instead of 5.3?  The answer I heard on this was that they wanted to create a gap between the aborted 5.2 version and what comes next.  Maybe someone else can clarify a bit better on this, but this was the answer I heard.  I guess this makes sense too… if they skipped just one it would seem like they are missing something, if they skip two versions it’s more clear that it is just an arbitrary number to indicate the incremental improvements.

So when you add it all up, the version number 5.4 actually starts to make a bit more sense.

* awesome: performance on multicore so we can better scale vertically
** much needed: signal/resignal in stored procs

stunned

April 20th, 2009

I am stunned at the news that Oracle is buying Sun.  This is not because I fear change or uncertainty, last time this year I was cautiously optimistic about Sun’s purchase of MySQL.  But not this year, it’s fear and disappointment over what this means for MySQL.

When I read this as a rumour a few weeks ago I thought it was a joke of an idea.  Why would a high margin software company want to buy a declining hardware business, even if that hardware is great?  As for their software, I cannot imagine that Oracle is interested in Java, MySQL, etc as revenue generating products, it would just be a tiny blip for them.

It will be incredibly interesting to see what comes next, and I’m sure we’ll see a lot of that at the UC.  I’ll be honest though… I need some convincing, and I imagine I’m not the only one.

Conditional Results and Grouping

April 8th, 2009

Grouping by varying conditions is something that is hard to accomplish using straight SQL, but is something that comes up from time to time with analysis. Perhaps consider it an extended version of the more famous “group-wise maximum” problem. Since the “real life” problem I recently addressed involves our internal systems that I can’t talk about here, I’ll give another example that deals with the same issues.

The scenario: You have a list of students and classes (and a mapping of student/class), and the students all have grades and some have scholarships.

students
+------------+--------------+-----------------+
| student_id | student_name | has_scholarship |
+------------+--------------+-----------------+
|       1234 | John         | yes             |
|       1235 | Jane         | no              |
|       1236 | Joe          | no              |
|       1237 | Jennifer     | yes             |
|       1238 | Jacob        | no              |
+------------+--------------+-----------------+

classes
+----------+------------+
| class_id | class_name |
+----------+------------+
|        1 | Science    |
|        2 | History    |
|        3 | Maths      |
|        4 | Literature |
+----------+------------+

enrollment
+------------+----------+-------------+
| student_id | class_id | grade_point |
+------------+----------+-------------+
|       1235 |        1 |         4.0 |
|       1236 |        1 |         3.0 |
|       1234 |        2 |         3.5 |
|       1238 |        2 |         2.0 |
|       1237 |        3 |         4.0 |
|       1235 |        4 |         2.0 |
|       1238 |        4 |         2.5 |
+------------+----------+-------------+

-- load this straight into mysql
create table students (student_id int primary key, student_name varchar(100), has_scholarship enum('yes', 'no'));
create table classes (class_id int primary key, class_name varchar(100));
create table enrollment (student_id int, class_id int, grade_point decimal(2,1), primary key (student_id, class_id));
insert into students values (1234,"John", "yes"),(1235, "Jane", "no"),(1236, "Joe", "no"),(1237, "Jennifer", "yes"),(1238, "Jacob", "no");
insert into classes values (1, "Science"), (2, "History"), (3, "Maths"), (4, "Literature");
insert into enrollment values (1235,1, 4.0),(1236, 1, 3.0),(1234, 2, 3.5),(1238, 2, 2.0),(1237, 3, 4.0), (1235,4,2.0), (1238,4,2.5);

For a school list, they need to identify a top student from every class so they can publish this in the school newsletter. The criteria for the report is that they want is that for any class that has a student with a scholarship, use that student. But if the class has no students with a scholarship, include a student with a 4.0 grade. If the class has no one with a scholarship and no 4.0 students, we just “forget” to mention the class at all in the report.

This seems simple based on the requirements, but when you start to dive into the implementation the complexity starts to emerge. You know you want to use your trusty friend GROUP BY class_id there, but how to select a student?

One approach is to write two queries, and then UNION them together. This approach would involve looking for classes with no scholarship students and selecting a 4.0 student; and then joining that with the list of classes with scholarship students. Pretty easy, yes, but it does not scale well as the data set increases beyond our silly little example here.

Another approach may be to use a series of nested subqueries using some fancy MySQL @variable tricks. I started going down this path before realizing that it too would scale poorly for the “real-life” dataset I was considering. And I wasn’t going to even _consider_ correlated subqueries.


What I finally settled on was to use a combination of GROUP BY, HAVING, and COALESCE. In general I try to avoid the use of HAVING as it causes the server to process rows before discarding them — however here for my real data set it is a perfect compromise as the data set is very large, however we only need to filter out a few records per grouped output row.

Final query:

SELECT
        COALESCE(MAX(IF(students.has_scholarship='yes',students.student_id, NULL)), MAX(IF   (enrollment.grade_point=4.0,enrollment.student_id, NULL)), NULL) preferred_student_id,
        students.student_id,
        students.student_name,
        classes.class_id,
        classes.class_name
FROM    students
JOIN    enrollment on enrollment.student_id=students.student_id
JOIN    classes on classes.class_id=enrollment.class_id
GROUP BY
        classes.class_id
HAVING  students.student_id=preferred_student_id
AND     preferred_student_id IS NOT NULL;


This was the first time I had ever found a use for the COALESCE statement. :-) I will also say that my first approach was going to be processing this data inside the application which consumed the query, but since this was legacy codebase that no one wanted to modify the pure SQL approach seemed superior.

So am I nuts? How would you do it?

UPDATE Roland B points out that my approach doesn’t actually work. His solution however, works great.

select c.class_id
, c.class_name
, substring_index(group_concat(
s.student_name
order by
if(s.has_scholarship=’yes’,0,1)
, e.grade_point desc
), ‘,’, 1)
from enrollment e
inner join students s
on e.student_id = s.student_id
inner join classes c
on e.class_id = c.class_id
where s.has_scholarship = ‘yes’ or e.grade_point = 4
group by class_id

Adding new partitions beyond MAXVALUE

April 2nd, 2009

I have found that MySQL RANGE partitions on the primary key are a great way to achieve scale for insert-heavy InnoDB tables.  I have used this to maintain an excellent and predictable insert rate, to avoid some of the well documented problems with insert performance as table sizes grow (especially with large/many secondary indexes).  In addition purging old data is fast and non-blocking because you can just DROP PARTITION as a single very fast operation.

However, MySQL partitioning brings up an interesting little issue when you exceed the space you initially allocate for your partitions.  Here in our sample table we define 4 partitions to handle 100M records, and everything after that will fall into the “pmx” bucket.


CREATE TABLE partition_test (
`id` int unsigned primary key auto_increment,
`payload` varchar(35) not null default '',
`stamp` timestamp default current_timestamp on update current_timestamp
) ENGINE=InnoDB
PARTITION BY RANGE (`id`) (
PARTITION p00 VALUES LESS THAN (025000000),
PARTITION p01 VALUES LESS THAN (050000000),
PARTITION p02 VALUES LESS THAN (075000000),
PARTITION p03 VALUES LESS THAN (100000000),
PARTITION pmx VALUES LESS THAN MAXVALUE
);

Let’s say that you get to 75M records, and now you want to extend your partition set beyond your current allocation.  The simple way is to just ADD PARTITION, like so:


ALTER TABLE partition_test ADD PARTITION (
PARTITION p04 VALUES LESS THAN (125000000)
);
# yields:
ERROR 1481 (HY000): MAXVALUE can only be used in last partition definition

This is uncool.  One option is to DROP the MAXVALUE partition, and then add it… but I find that a bit scary.  Thankfully there is a better way to add the new partition, by using REORGANIZE:


ALTER TABLE partition_test REORGANIZE PARTITION pmx INTO (
PARTITION p04 VALUES LESS THAN (125000000),
PARTITION pmx VALUES LESS THAN MAXVALUE
);

By using REORGANIZE, you can rebuild pmx and recreate it in the same step.  Any values that are currently in pmx will have to be rebuilt with the REORGANIZE statement, but if you are thinking ahead and have no records in that partition, REORGANIZE is a very fast and safe operation.

speaking at mysqlconf

March 6th, 2009

This year I was happy to be accepted as a speaker at MySQLConf 2009, presenting a session on MySQL Enterprise Monitor and Query Analyzer alongside Phil Hildebrand.

What’s it about?
We’ll talk about the Enterprise Monitor and Query Analyzer tools from a user perspective.  You may have read a Sun/MySQL press release or a whitepaper on those (maybe even mine), but it still may be hard to get a sense of what these products are and why they are useful.

Because these are part of MySQL’s Enterprise offering, they are not free open source software.  Thus I suspect only a small percentage of MySQL users are aware of what they do and probably even fewer have actually tried them out.  Well, Phil and I have used them extensively for quite some time, and we’ll walk you through the basics… without any marketing spin.  We’re both hands-on DBAs and not employed by Sun, so we’ll give you the unfiltered real deal.  :-)

Why should you attend?

Well first of all, I hope you’re going to MySQLConf if you can.  It’s a great conference full of sessions on a whole bunch of interesting topics.  I’ve attended for the past two years and met some great folks and learned quite a bit even though I’m a seasoned MySQL developer and DBA.  Given all the recent and ongoing developments with Falcon, PBXT, Drizzle, XtraDB, MySQL 6.0, and more… there is so much information to take in at the conference and related activities (Percona Conference, MySQL Camp).

The real question is why go to our session?  It’s a tough one because there are several other interesting sessions at the same time (Tuesday morning’s first session at 10:50am).  I would say come to this one if you have been wondering about or considering the MySQL Enterprise subscription for your organization and want to see what is involved and what other users think.

We’ll try to make it interesting for technical folks and not a marketing presentation — with practical examples, gotchas, and even a demonstration so you can get a sense of how it works and how it might be useful.

That’s the end of my session sales pitch.  See you at MySQLConf 2009!

bbcp for fast network copies

December 5th, 2008

This is not strictly database or MySQL related, however DBAs often deal with large data sets so others may find it useful.

I recently discovered bbcp, a utility to copy files efficiently over a network. It has the ability to use parallel network streams and specify window sizes, which can dramatically increase the transfer rates. It copies via the ssh protocol, so it fits nicely into common security environments.

Read the rest of this entry »

Quality of 5.1 GA release

November 29th, 2008

With all due respect to Monty (and I mean that — much respect is due), I have some serious issues with his portrayal of the 5.1 release.  I hate to make my first entry on Planet MySQL about a controversy, but he encouraged people to blog about their experience with 5.1, so that’s what I’ll do here.

Read the rest of this entry »

observation with regard to the “swappiness” problem

November 20th, 2008

Many of the prominent folks on Planet MySQL have encountered issues with mysql and linux going into swap (example 1,2,3).  None of the easy solutions (sysctl -w swappiness=0, innodb_flush_method       =O_DIRECT) have solved this problem for our configuration and I just deal by using a smaller innodb buffer pool than I’d like, and then occasionally using swapoff/swapon to temporarily bring things back to balance.

I did make one observation though, which may be a red herring but still interesting to me.  In one particular case I am using nested replication in the following configuration:

master1->master2->slave1
                ->slave2
                ->slave3
                ->slave4
                ->slave5

The master1 server handles all writes and some reads, and some of the slaves handle reads in a typical MySQL master/slave scenario.  All master2 does is batch out replication update to its slaves, and sits prepared to become active master if master1 dies.

The interesting part is that even though master1 and master2 are configured identically, master2 is the server that goes into swap while master1 is fine.  This is true even though master1 ends up being much busier with connections due to dealing with the applications directly.

I ended up reducing the innodb buffer pool on master2 so that it was smaller than master1.  And it went back into swap again, even though all it does is read one slave thread from master1 and batch out update to its 5 slaves. 

Very odd indeed. Is there something about having slaves that would increase the likelihood of mysql running into swap? If so perhaps there are some settings that could mitigate the issue there, such as reducing binlog size. I’m going to do some more investigation into this matter, but for now it’s just an interesting puzzle piece to consider.