As a performance engineer at MemSQL, one of my primary
responsibilities is to ensure that customer Proof of Concepts (POCs) run
smoothly. I was recently asked to assist with a big POC, where I was
surprised to encounter an uncommon Linux performance issue. I was
running a synthetic workload of 16 threads (one for each CPU core). Each one simultaneously executed a very simple query (select count(*) from t where i > 5) against a columnstore table. In theory, this ought to be a CPU-bound operation, since it would be reading from a file that was already in the disk buffer cache. In practice, our cores were spending about 50% of their time idle.
In this post, I’ll walk through some of the debugging techniques and show exactly how we resolved the issue.
What were our threads doing?
After confirming that our workload was indeed using 16 threads, I looked at the state of each thread. In every refresh of my htop window, I saw that a handful of threads were in the D state, corresponding to “Uninterruptible sleep”.
Why were we going off CPU?
At this point, I generated an off-CPU flamegraph using Linux perf_events to see why we entered this state. Off-CPU analysis means that instead of looking at what is keeping the CPU busy, you look at what is preventing it from being busy (e.g. waiting for IO or a lock). The normal way to generate these visualizations is with perf inject -s, but the machine I tested on did not have a new enough version of perf. Instead, I used an awk script I had previously written.
Note: recording scheduler events via perf record can have a very large overhead and should be used cautiously in production environments. This is why I wrap the perf record around a sleep 1 to limit the duration.
In an off-CPU flamegraph, the width of a bar is proportional to the total time spent off CPU. Here, we saw a lot of time being spent in the mmap and munmap syscalls. From the repeated calls to rwsem_down_write_failed, we could see that the culprit was mmap contending in the kernel on mmap_sem, the reader-writer semaphore that serializes all changes to a process’s address space. This was causing every mmap syscall to take 10-20ms (almost half the latency of the query itself). MemSQL was so fast that we had inadvertently written a benchmark for Linux mmap.
The fix was simple: we switched from using mmap to using the traditional file read interface. After this change, we nearly doubled our throughput and became CPU bound, as we expected.
For more information and discussion around Linux performance, check out the original post on my personal blog.
Download MemSQL Community Edition to run your own performance tests for free today: memsql.com/download