Be careful with sysbench --test=cpu … it’s not SO obvious!

The issue

Some days ago, someone working in my department told me about an oddity concerning two GNU/Linux boxes. One of the two boxes was powered by an Intel i7 3770K processor, whereas the other one had an Intel i7 950. According to their specs, and this benchmark website: https://www.cpubenchmark.net/high_end_cpus.html, the former performs better than the latter: the site scores the 950 at 5,643 and the 3770K at 9,593. So, it is crystal clear that 9,593 > 5,643. And yet, that did not seem to be the case!

One way to benchmark the processor under GNU/Linux, according to most sysadmins, is by running the sysbench utility. So, my workmate executed the tool on both computers (same sysbench version, same operating system, same amount of memory: 16GB of DDR3 1666MHz RAM) this way:

sysbench --test=cpu --cpu-max-prime=20000 --num-threads=1 run

What he got was far from coherent: the i7 950 processor was about 1 second quicker than the 3770K! He repeated the test several times, of course, and every single time he got more or less the same execution time, with the 950 always a bit ahead!

What sysbench computes behind the scenes

Having a look at its source code, the CPU test runs a basic primality test on every integer c from 3 up to the configured maximum (--cpu-max-prime), trying each candidate divisor from 2 up to the square root of c. It performs these calculations using POSIX threads (pthreads) on GNU/Linux, although you can of course run just one thread if you are interested in benchmarking the processor in a “single-core” environment:

  for(c=3; c < max_prime; c++)
  {
    t = sqrt(c);
    for(l = 2; l <= t; l++)
      if (c % l == 0)
        break; 
    if (l > t )
      n++;
  }

In the previous code snippet, every integer variable is of type unsigned long long, that is, according to sysbench documentation, an INTEGER 64.
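To experiment with this loop outside sysbench, it can be lifted into a small standalone function. What follows is only a sketch under a few assumptions: the names count_primes and time_count_primes are hypothetical, and the bound l <= sqrt(c) is written in its integer form l <= c / l so that the file builds without linking the math library (sysbench itself calls sqrt(), as quoted above).

```c
#include <time.h>

/* Count the primes in [3, max_prime) by trial division, following the
 * structure of the sysbench loop quoted above. Every integer is an
 * unsigned long long, as in the original. */
unsigned long long count_primes(unsigned long long max_prime)
{
    unsigned long long c, l, n = 0;

    for (c = 3; c < max_prime; c++) {
        /* l <= c / l is the integer equivalent of l <= sqrt(c). */
        for (l = 2; l <= c / l; l++)
            if (c % l == 0)
                break;          /* found a divisor: c is composite */
        if (l > c / l)          /* ran past the bound: c is prime */
            n++;
    }
    return n;
}

/* Hypothetical helper: CPU seconds used by one count_primes() call.
 * The count is handed back through *count so the work has a visible use. */
double time_count_primes(unsigned long long max_prime,
                         unsigned long long *count)
{
    clock_t start = clock();

    *count = count_primes(max_prime);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_count_primes(20000, &n) from a main() and printing both values gives a rough stand-in for the sysbench run above, which makes it easy to rebuild the same source with different compilers and flags and compare.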

On both systems the sysbench utility was installed using the apt-get command, so we are dealing here with pre-compiled X86_64 ELF binaries. Some websites mention in passing that you should always run the same version of this utility, compiled with the same compiler and flags. Well, it makes sense. But most sites showing you how to benchmark the processor only tell you to run the sysbench command and to note the time it takes to compute the primality test before comparing it to others.

And that, I am afraid, is not good enough.

Where it fails

Let’s start by running the sysbench utility on either of the two systems and profiling it with the GNU/Linux perf utility. Well, nothing really surprising shows up:

82,42 : 40bde9: mov %rdx,%rax

According to the perf utility, 82,42% of the execution-time of the sysbench cpu test is spent inside this assembly code snippet: moving data from the RDX register to the RAX one. Cool, eh? Okay, we can have a look at the low-level assembly in order to understand what’s going on:

40bc45: 48 f7 75 f0 divq -0x10(%rbp)
40bc49: 48 89 d0 mov %rdx,%rax
40bc4c: 48 85 c0 test %rax,%rax
40bc4f: 74 11 je 40bc62 <cpu_execute_request+0x130>
40bc51: 48 83 45 f0 01 addq $0x1,-0x10(%rbp)

The previous code snippet is where the program computes this line of C code: if (c % l == 0). So we can infer from the assembly that after dividing the value of c by l, it takes the remainder of that division and copies it into the RAX register (the mov %rdx,%rax line). According to the perf utility, 82,42% of the execution time is attributed to this instruction. After that, test %rax,%rax performs a bitwise AND of the RAX register with itself, so if ZF=1, the remainder was zero, and the execution flow jumps to the cpu_execute_request+0x130 offset.

Of course, this assembly has been generated by the gcc compiler, and don’t forget we are dealing with pre-compiled X86_64 binaries here.

And yet, one thing comes to mind: this code-snippet does not seem good enough. If we have a look at the DIVQ Intel instruction for X86_64 processors, we know that:

R[%rdx] ← R[%rdx]:R[%rax] mod S;
R[%rax] ← R[%rdx]:R[%rax] ÷ S

That is, the remainder of the division is always stored in the RDX register. So, what’s the point in moving the remainder from RDX into the RAX register and then testing whether it is zero? Would it not be a lot easier to just test the RDX register directly?
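The same relationship can be spelled out in C. This is a minimal sketch with hypothetical names (struct divmod, divmod_u64, divides), not code taken from sysbench; it only mirrors what the two pseudo-assignments above say, for the common case of a 64-bit dividend where the upper half (RDX) starts out as zero.

```c
/* Both results of one unsigned 64-bit division. */
struct divmod {
    unsigned long long quot;   /* what DIVQ leaves in RAX */
    unsigned long long rem;    /* what DIVQ leaves in RDX */
};

/* Divide c by l (l must be non-zero) and return quotient and remainder.
 * On x86-64, compilers typically obtain both from a single DIV. */
struct divmod divmod_u64(unsigned long long c, unsigned long long l)
{
    struct divmod r;

    r.quot = c / l;
    r.rem  = c % l;
    return r;
}

/* The check from the sysbench loop: is l a divisor of c?
 * Only the remainder is needed, so only RDX has to be inspected. */
int divides(unsigned long long c, unsigned long long l)
{
    return divmod_u64(c, l).rem == 0;
}
```

For example, divmod_u64(17, 5) yields a quotient of 3 and a remainder of 2, so divides(17, 5) is false.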

Patching the binary

Okay; so now we are going to alter the program’s opcodes slightly. Our code will be a bit more efficient, because after this small alteration there is no need to copy a quad-word from the RDX register to the RAX register. Therefore, instead of having this:

divq -0x10(%rbp)
mov %rdx,%rax
test %rax,%rax
je 40bc62 <cpu_execute_request+0x130>
addq $0x1,-0x10(%rbp)

We’ll end up having this (bear in mind that we have to keep the code the same length by padding it with NOPs):

divq -0x10(%rbp)
test %rdx,%rdx
je 40be23 <cpu_execute_request+0x12d>
nop
nop
nop
addq $0x1,-0x10(%rbp)

We perform the changes by using the hexeditor utility, as shown in the next screenshot:

Patching the op-codes to get an equivalent yet optimal execution-path
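The same edit can be scripted rather than typed into a hex editor. The sketch below is an assumption-laden illustration, not the procedure used here: patch_mod_test and its disp_delta parameter are hypothetical, and the code is assumed to be already loaded into a memory buffer. The original bytes come from the disassembly shown earlier; 48 85 d2 is the encoding of test %rdx,%rdx and 90 is a one-byte nop. Because the je moves three bytes earlier, its 8-bit displacement may need adjusting: disp_delta = 0 leaves the byte untouched (the target then moves three bytes earlier too, which is what the +0x130 and +0x12d offsets in the listings suggest), while disp_delta = 3 keeps the original absolute target. Which one is right has to be checked against the real binary.

```c
#include <stddef.h>
#include <string.h>

/* Rewrite every "mov %rdx,%rax ; test %rax,%rax ; je rel8" site in buf as
 * "test %rdx,%rdx ; je rel8 ; nop ; nop ; nop". Returns the number of
 * sites rewritten. A blind byte search can also match unrelated bytes,
 * so a real tool should restrict it to the function being patched. */
size_t patch_mod_test(unsigned char *buf, size_t len, int disp_delta)
{
    /* 48 89 d0 = mov %rdx,%rax; 48 85 c0 = test %rax,%rax; 74 = je rel8 */
    static const unsigned char old_seq[] = {
        0x48, 0x89, 0xd0, 0x48, 0x85, 0xc0, 0x74
    };
    const size_t site_len = sizeof old_seq + 1;   /* plus the rel8 byte */
    size_t patched = 0;
    size_t i = 0;

    while (i + site_len <= len) {
        unsigned char disp;

        if (memcmp(buf + i, old_seq, sizeof old_seq) != 0) {
            i++;
            continue;
        }
        disp = buf[i + 7];                 /* original je displacement */

        buf[i + 0] = 0x48;                 /* test %rdx,%rdx */
        buf[i + 1] = 0x85;
        buf[i + 2] = 0xd2;
        buf[i + 3] = 0x74;                 /* je rel8 */
        buf[i + 4] = (unsigned char)(disp + disp_delta);
        buf[i + 5] = 0x90;                 /* three nops keep the length */
        buf[i + 6] = 0x90;
        buf[i + 7] = 0x90;

        patched++;
        i += site_len;
    }
    return patched;
}
```

Applied to the 17 bytes of the original listing with disp_delta = 0, the divq and addq bytes stay untouched and only the eight bytes in between change.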

Now, on just one of the two GNU/Linux boxes, we ran both the original sysbench binary and the one we have just patched several times. Here we show just three consecutive executions of each, but we got more or less the same results over and over again:

Original Sysbench: 23.9315s, 23.9401s, 23.9369s

Patched Sysbench: 22.8170s, 22.8559s, 22.8450s
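To put a number on the difference, the runs can be averaged and compared. A minimal sketch, with hypothetical helper names (mean_seconds, relative_saving):

```c
#include <stddef.h>

/* Arithmetic mean of n timings, in seconds. */
double mean_seconds(const double *t, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += t[i];
    return sum / (double)n;
}

/* Fraction of the original mean run time saved by the patched binary:
 * (mean(original) - mean(patched)) / mean(original). */
double relative_saving(const double *original, const double *patched,
                       size_t n)
{
    double mo = mean_seconds(original, n);
    double mp = mean_seconds(patched, n);

    return (mo - mp) / mo;
}
```

Fed with the three original and three patched timings listed above, the means come out at roughly 23.94 s and 22.84 s, a difference of about 1.1 s, or around 4.6% of the original run time.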

Of course, there is a slight improvement: a bit more than a second! This demonstrates that we cannot take the sysbench execution times for granted: on the very same computer, running the same utility with only slight differences in its opcodes, we can cut its execution time by one second! So if WE can achieve something like that by hand … what could a real compiler do? Let’s find out!

Let the compiler work its magic!

Instead of running the default pre-compiled X86_64 sysbench binary installed with apt-get, I got its sources and compiled them again with the Intel C compiler (icc), which was already installed on the computer, this way:

CC=icc CFLAGS=-xHost ./configure && make

After that, I ran the sysbench test once again:

Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 0.0025/0.00

Bloody hell! Now we are talking!

Compiler flags are important

To conclude, sysbench can of course be a good tool for benchmarking, but one must re-compile it with a compiler capable of optimizing the code for the processor that powers the computer. Otherwise, its results and measured execution times could make no sense at all.

Cheers!