Disk I/O Bottlenecks

My first step in diagnosing a performance problem is to find the system’s bottleneck: the limiting factor in a series of events that slows down the whole process.

As a DBA and MySQL specialist, one of the first things I check is whether we are bottlenecked on disk I/O.  However, I don’t really like CPU iowait as a metric for diagnosing disk I/O problems.  Here’s a not-too-uncommon example of how the traditional iowait approach can be very misleading on a multicore server.

Machine tested: Sun Fire X4140 w/8 disks RAID1+0 and 8 CPU cores.

1) Generate a large file (20G) onto the RAID partition
# dd if=/dev/urandom of=/data/sample.dat bs=1024 count=20000000
2) Generate single-threaded I/O activity by copying the file over and over
# while true; do cp /data/sample.dat /data/sample2.dat; done

While step #2 is running, monitor the activity in top.  It will report a load of around 2, a CPU that is 70-85% idle, and iowait of only 5-20%.  If the CPU is mostly idle and iowait is that low, what on earth is the bottleneck, the limiting factor preventing this process from finishing faster?

From this data it’s easy to conclude that some software inefficiency is causing a performance issue.  I’ve watched a senior sysadmin take this information and (wrongly) conclude that the performance issue was because MySQL sucks as it doesn’t scale to enough CPUs — when in fact it was actually making good utilization of the disks but only 4/8 CPUs were being fully utilized in a 4-threaded bulk load procedure.  This threw off iowait in precisely the same way that our test showed above.
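A rough way to see why iowait stays so small on this machine: it is reported as a share of total CPU time across all cores, so a handful of threads blocked on disk can only account for a handful of cores' worth of it. The sketch below is a deliberately simplified model, not something top computes; the function name and the "one blocked thread occupies at most one idle core" assumption are mine.

```python
def iowait_ceiling_pct(blocked_threads: int, cpu_cores: int) -> float:
    """Upper bound on system-wide iowait (percent) in a simplified model.

    Assumption: each thread blocked on I/O can cause at most one otherwise
    idle core to be accounted as iowait; all other idle cores count as idle.
    """
    if cpu_cores <= 0:
        raise ValueError("cpu_cores must be positive")
    waiting_cores = min(max(blocked_threads, 0), cpu_cores)
    return 100.0 * waiting_cores / cpu_cores


# With a load of about 2 on the 8-core test machine, the ceiling is 25%,
# no matter how saturated the disks are.
print(iowait_ceiling_pct(2, 8))
```

Under that model the 5-20% seen in top is close to the most iowait this test could ever show, which is why a low number says very little about the disks.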

So instead of using top, while the #2 process is still running look at the same server using iostat.  Specifically, run iostat -x so that you can see the %util — this represents disk utilization for your system devices.  Now you’ll see something much more informative than the output from top: you should see your RAID device utilization at a fairly consistent 80-100%.

Unlike the top output, this starts to give you a much better picture of what the bottleneck is in the process.  The iostat manpage says: “%util: Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.”  There is our precious bottleneck that we’ve been searching for.

Knowing the bottleneck gives you a critical piece of information: in our simple test, the limiting factor was the disk.  Buying a better CPU or upgrading the network link really isn’t going to help.  Instead we should focus on increasing disk I/O throughput (more/faster disks, SAN, etc.) and/or reducing the stress we place on the disk via better query tuning, adding RAM, sharding, or any of the well-known scaling solutions.

Toolset for quick and dirty bottleneck approach:

CPU bottleneck:
mpstat -P ALL  (or press “1” inside top)

  • question: are you effectively using all your CPUs or is a subset doing most of the work?
  • followup: can you parallelize the processes to utilize more CPU?
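To make the "is a subset doing most of the work?" question concrete, here is a minimal sketch that computes per-core busy percentages from two snapshots of /proc/stat text. The function name is illustrative, and the field order (user, nice, system, idle, iowait, irq, softirq, steal) is the Linux /proc/stat layout as I understand it; treat it as an approximation of what mpstat reports rather than a replacement for it.

```python
def per_cpu_busy_pct(before: str, after: str) -> dict:
    """Busy percentage per core between two snapshots of /proc/stat text."""

    def parse(text):
        cpus = {}
        for line in text.splitlines():
            parts = line.split()
            # Per-core lines start with "cpu0", "cpu1", ...; the aggregate
            # "cpu" line and unrelated lines (ctxt, btime, ...) are skipped.
            if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
                # Only the first 8 counters: guest time is already included
                # in user/nice, so adding it again would double count.
                values = [int(v) for v in parts[1:9]]
                idle = values[3] + (values[4] if len(values) > 4 else 0)
                cpus[parts[0]] = (sum(values), idle)
        return cpus

    first, second = parse(before), parse(after)
    busy = {}
    for name, (total_now, idle_now) in second.items():
        total_then, idle_then = first.get(name, (0, 0))
        elapsed = total_now - total_then
        idle_delta = idle_now - idle_then
        busy[name] = 100.0 * (elapsed - idle_delta) / elapsed if elapsed else 0.0
    return busy
```

Reading /proc/stat twice a few seconds apart and passing both strings in gives one number per core; a few cores near 100% next to several near 0% is the pattern to look for.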

Disk I/O bottleneck:
iostat -x -m 2

  • question: are any of your block devices saturated? what is the read/write I/O profile?
  • followup: tune database queries, increase # of spindles and/or drive speed, add RAM to push more data into the InnoDB buffer pool, etc.

Network bottleneck:
pktstat -t

  • question: what is your % utilization per interface?
  • followup: find someone who knows more about the network than me

Aside: how does iostat determine its magic utilization number?

Digging into the iostat source code and kernel documentation a bit, it all becomes a bit clearer. Utilization is determined by this simple formula:

#define S_VALUE(m,n,p) (((double) ((n) - (m))) / (p) * HZ)

xds->util = S_VALUE(sdp->tot_ticks, sdc->tot_ticks, itv);

itv is just an interval.  sdp and sdc are the device’s past and current states.  But where does tot_ticks come from?

/* Try to read given stat file */
if ((fp = fopen(filename, "r")) == NULL)
    return 0;

i = fscanf(fp, "%lu %lu %llu %lu %lu %lu %llu %lu %lu %lu %lu",
           &rd_ios, &rd_merges_or_rd_sec, &rd_sec_or_wr_ios, &rd_ticks_or_wr_sec,
           &wr_ios, &wr_merges, &wr_sec, &wr_ticks, &ios_pgr, &tot_ticks, &rq_ticks);

OK, now we’re getting somewhere. filename represents a file for your device: /sys/block/$device/stat — and tot_ticks is simply the 10th of 11 fields in that file.
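For readers who would rather poke at that file from a script, the same read can be sketched in Python. The field names below are borrowed from the fscanf call above; the function names are my own, and I am assuming that any extra fields a newer kernel appends after the first 11 can simply be ignored.

```python
STAT_FIELDS = (
    "rd_ios", "rd_merges", "rd_sec", "rd_ticks",
    "wr_ios", "wr_merges", "wr_sec", "wr_ticks",
    "ios_pgr", "tot_ticks", "rq_ticks",
)


def parse_block_stat(text: str) -> dict:
    """Map the first 11 whitespace-separated counters to their field names."""
    values = text.split()
    if len(values) < len(STAT_FIELDS):
        raise ValueError("expected at least 11 fields, got %d" % len(values))
    # zip() stops at the shorter sequence, so trailing extra fields are dropped.
    return {name: int(value) for name, value in zip(STAT_FIELDS, values)}


def read_block_stat(device: str) -> dict:
    """Read and parse /sys/block/<device>/stat (device is e.g. 'sda')."""
    with open(f"/sys/block/{device}/stat") as fp:
        return parse_block_stat(fp.read())
```

Calling read_block_stat("sda")["tot_ticks"] would return the 10th field for a device named sda; the device name is only a placeholder.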

What is that mysterious 10th field though?  The kernel docs have this to say: “This value counts the number of milliseconds during which the device has had I/O requests queued”.  And now it’s crystal clear: the incrementing counter tells you how long the disk has been queuing.  Take two samples over a known period of time and you can measure the percentage of time the device has been queuing stuff.  It’s very simple, but also incredibly effective!
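The two-sample arithmetic is short enough to write out. This is a minimal sketch, assuming the counter is in milliseconds as the kernel docs quoted above state; the function name is illustrative and counter wraparound is not handled.

```python
def utilization_pct(prev_ticks_ms: int, cur_ticks_ms: int, interval_s: float) -> float:
    """Percentage of the interval during which the device had I/O queued."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    busy_ms = cur_ticks_ms - prev_ticks_ms
    return 100.0 * busy_ms / (interval_s * 1000.0)


# The counter advanced by 500 ms during a 1 second interval: half busy.
print(utilization_pct(1000, 1500, 1.0))
```

A result at or near 100 means the counter advanced for essentially the whole interval, which is the saturation case the manpage describes.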

So now in theory you should be able to match the result of iostat -x by running a bash script to sample that value and measure the change over time.  And yes, you can!  That bash script matches exactly what iostat -x tells you, and outputs to a status and log file telling you the minute average and peak.  Then you can take that output and feed it into your favourite monitoring software so you can see pretty graphs, send warnings, and correlate problem reports to spikes in real I/O activity — much more accurate than with iowait.
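For illustration only, here is roughly what such a sampler could look like in Python. This is not the bash script mentioned above, just a sketch of the same idea: sample the 10th field at a fixed interval, then report the average and peak utilization for the window. The function names, the default of 60 one-second samples, and the injectable sleep/clock parameters (there so the logic can be exercised without a real disk) are all my assumptions.

```python
import time


def read_tot_ticks(device: str) -> int:
    """Return the 10th field of /sys/block/<device>/stat."""
    with open(f"/sys/block/{device}/stat") as fp:
        return int(fp.read().split()[9])


def sample_window(read_ticks, samples=60, interval_s=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Return (average, peak) utilization percent over a window of samples.

    read_ticks is any zero-argument callable returning the current counter
    value in milliseconds.
    """
    if samples < 1:
        raise ValueError("samples must be at least 1")
    readings = []
    prev_ticks, prev_time = read_ticks(), clock()
    for _ in range(samples):
        sleep(interval_s)
        cur_ticks, cur_time = read_ticks(), clock()
        # Use the measured elapsed time, since sleep() can overshoot.
        elapsed_ms = (cur_time - prev_time) * 1000.0
        readings.append(100.0 * (cur_ticks - prev_ticks) / elapsed_ms)
        prev_ticks, prev_time = cur_ticks, cur_time
    return sum(readings) / len(readings), max(readings)
```

Something like sample_window(lambda: read_tot_ticks("sda")) would produce one minute's average and peak for a device named sda (a placeholder); writing the pair to a log file is then enough for most monitoring tools to pick up.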

3 Responses to “Disk I/O Bottlenecks”

  1. Gavin Towey Says:

    Thanks for digging into the code and reporting the way that % util is generated.

    I wonder then if newer disks that do queuing to increase throughput and RAID controllers that do queuing would cause the output to look like the disk is more heavily used than it actually is.

  2. Istvan Podor Says:

    This article is just great. I experienced the same! I’ve met this kind of “expert sysadmin” a few times too. They always said (as I did before :( ) that MySQL doesn’t scale right. And that was just the past, I think; with XtraDB I got some pretty good scaling between cores. I remember when I optimized a db server, the sysadmin guy came to me and said: “Hey, all the graphs show that you made it wrong, everything is higher (munin).” And I said sure it’s higher, it’s using more resources, that’s a good thing! And he had no reason why the higher values were bad :)

    Anyhow, it’s just great, and if you have some experience with how to tune a MySQL server for, let’s say, a RAID controller with BBU and a huge amount of on-controller memory, that would be great and useful too!

    Thanks for your post!

  3. Bryan Says:

    I also like using “dstat -cdmns 1”. Provides a nice overview of everything. Not as good for determining bottlenecks, but still great for keeping an eye on the big picture.
