The Art of Debugging: dealing with ATI drivers and heap corruption

 

Preamble

The ncurses-based setup program for installing ATI graphics card drivers on Linux boxes has a heap corruption bug. When executed, the program seems to do its tasks well until it is about to finish. At that precise instant it crashes, and an awful memory corruption message appears on the screen. Concretely, glibc reports a "double free or corruption (fasttop)" error on the heap address 0x00000000013cebf0, followed by a backtrace.

So, it is time to start using our debugging skills to understand what is going on and to fix this erroneous behaviour.

The heap and MALLOC_CHECK_ aid

The heap segment is where malloc(), calloc(), realloc() and, in general, all functions of that sort reserve memory for our dynamic data (alloca(), despite its similar name, allocates on the stack instead). Current glibc implementations perform sanity checks when memory is allocated and released, and whenever they detect that the heap has been corrupted they can print an error message like the one we just saw. How glibc reacts is governed by the MALLOC_CHECK_ environment variable. The idea is to detect heap corruption and report it before something much worse happens. Dynamic memory allocation problems are hard to debug: the corruption normally causes unexpected, even funny, behaviour somewhere later in the execution, which makes these bugs almost impossible to track down. That is where this glibc aid comes in.

With MALLOC_CHECK_ set to 0, nothing visible happens. The program still corrupts its heap, but it runs to completion without any error message. So far so good: this way the ATI setup program really does complete all its tasks, and we can use our new graphics driver without further issues. But silencing the check is not the right fix.

Our system reported the error because this environment variable is set as follows:

export MALLOC_CHECK_=3

With this value, as soon as glibc detects the heap corruption it prints a detailed diagnostic and calls abort(), which ends the program immediately. That is precisely what is going on here.
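To compare both behaviours side by side, the installer can be launched with different MALLOC_CHECK_ values. The following Python sketch is only an illustration: run_with_malloc_check is a hypothetical helper name, and it relies on subprocess reporting a child killed by a signal as a negative return code. Keep in mind that, to my knowledge, glibc 2.34 and later honour MALLOC_CHECK_ only when libc_malloc_debug.so is preloaded.

```python
import os
import signal
import subprocess


def run_with_malloc_check(cmd, level):
    """Run cmd with MALLOC_CHECK_=level and report whether it died of SIGABRT.

    Hypothetical helper for illustration; returns (returncode, aborted).
    """
    env = dict(os.environ)
    env["MALLOC_CHECK_"] = str(level)
    proc = subprocess.run(cmd, env=env)
    # On POSIX, subprocess encodes "killed by signal N" as returncode == -N.
    aborted = proc.returncode == -signal.SIGABRT
    return proc.returncode, aborted
```

For the unpatched installer, run_with_malloc_check(["setup.data/bin/x86_64/setup"], 3) should report aborted=True, while level 0 should let it finish.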

Still, we want to fix the problem properly; after all, we are talking about debugging skills here. In this particular case we don’t have the sources, so we have to work with assembly instructions. Don’t panic! Thanks to the error message, we already know a lot of useful information to start tracking this issue down.

Using the glibc heap corruption message to locate the buggy assembly instruction

According to the error message, the program crashes inside a call to cfree() (in glibc, a synonym for free()) that releases a previously allocated block of heap memory. There you have the error: you cannot free memory that has already been freed. The culprit, obviously, is our setup binary. Let’s have a look at the relevant lines of the message:

lib/libc.so.6(cfree+0x76)[…]

setup.data/bin/x86_64/setup[40a6b0]
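When the backtrace is long, picking out the frames that belong to our own binary can be automated. Here is a minimal Python sketch, assuming every frame has the object(symbol+offset)[address] shape shown above. parse_frame is a hypothetical helper name, and an elided address such as "…" is returned as None rather than guessed.

```python
import re

# "object(symbol+offset)[address]", where the "(symbol+offset)" part is optional.
FRAME_RE = re.compile(r"(?P<obj>[^(\[]+)(?:\((?P<sym>[^)]*)\))?\[(?P<addr>[^\]]*)\]")


def parse_frame(line):
    """Split one glibc backtrace line into (object, symbol, address).

    symbol is None when absent; address is None when it is not a hex number.
    """
    m = FRAME_RE.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"unrecognised backtrace line: {line!r}")
    try:
        addr = int(m["addr"], 16)  # accepts both "40a6b0" and "0x40a6b0"
    except ValueError:
        addr = None
    return m["obj"], m["sym"], addr
```

Applied to the two lines above, it yields the symbol cfree+0x76 for libc and the address 0x40a6b0 for setup.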

These frames tell us that the instruction calling cfree() is located at address 0x40a6b0 in our setup binary. We can disassemble setup to confirm it:

objdump -d setup.data/bin/x86_64/setup > setup.S

Once the code is disassembled, we can open setup.S in any text editor and search for that address, 40a6b0:

40a6b0: e8 53 90 ff ff callq 403708 <free@plt>

So it is true: there is our buggy instruction, calling free() through its PLT stub (free@plt) and trying to release previously allocated heap memory a second time. This instruction should not be here.
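We can also double-check by hand that e8 53 90 ff ff really targets free@plt. Opcode e8 is a near call whose 4-byte operand is a signed, little-endian displacement relative to the next instruction, so the target is 0x40a6b0 + 5 + displacement. A small Python sketch (call_target is a hypothetical name):

```python
import struct


def call_target(addr, insn):
    """Return the absolute target of an x86-64 near call (opcode E8 rel32)."""
    if len(insn) != 5 or insn[0] != 0xE8:
        raise ValueError("not a 5-byte E8 rel32 call")
    (rel32,) = struct.unpack("<i", insn[1:5])  # signed 32-bit, little-endian
    # The displacement is relative to the address of the next instruction.
    return addr + len(insn) + rel32


print(hex(call_target(0x40a6b0, bytes.fromhex("e8 53 90 ff ff"))))  # 0x403708
```

Here the displacement is 0xffff9053, that is -0x6fad, and 0x40a6b5 - 0x6fad = 0x403708, which matches objdump’s callq 403708 <free@plt>.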

Tracing that buggy instruction down with gdb

Before fixing the buggy code, we are going to use gdb to witness this awful issue from a debugger’s point of view. First of all, this is a full-screen terminal program built on the ncurses library, so its screen output would clash with gdb’s own. We have to open a second terminal (running the tty command there prints its device name) and then instruct gdb, with its tty command, to connect the traced program’s input and output to that terminal:

(gdb) tty /dev/pts/12
(gdb) file setup.data/bin/x86_64/setup
Reading symbols from /home/tonicas/dev/seg_faults_analysis/ati/installer_ok/setup.data/bin/x86_64/setup…(no debugging symbols found)…done.

Now, we stop the program’s flow right at the suspicious call:

(gdb) b *0x40a6b0
Breakpoint 6 at 0x40a6b0

Recall that this is the address of the buggy call to cfree(). Now it is time to run the program, and when it stops there, we step into the call with stepi (rather than stepping over it with nexti):

(gdb) stepi
0x0000000000403708 in free@plt ()

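Now let’s have a look at the stack, pointed to by the $rsp register. Strictly speaking, on x86-64 the System V calling convention passes the first argument to free() in the %rdi register rather than on the stack, so p/x $rdi is the most direct way to see the pointer. Still, the stack dump turns out to be revealing: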

(gdb) x/12w $rsp
0x7fff1787bb00: 0x00000000 0x00000000 0x7c0ba9e0 0x00007f3b
0x7fff1787bb10: 0x013cebf0 0x00000000 0x01318140 0x00000000
0x7fff1787bb20: 0x00490393 0x00000000 0x00000000 0x00000000

It turns out that the 64-bit value at $rsp + 0x10 (the little-endian word pair 0x013cebf0 0x00000000) is, precisely, the previously freed heap block: 0x013cebf0. Have another look at the glibc error message:

*** glibc detected *** setup.data/bin/x86_64/setup: double free or corruption (fasttop): 0x00000000013cebf0
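Because x/12w prints 32-bit words, each 64-bit stack slot shows up as two words, low half first (x86-64 is little-endian). The following Python sketch, with hypothetical helper names, pairs them back up. Asking gdb for x/6gx $rsp instead would print the 8-byte "giant" words directly.

```python
def parse_gdb_words(dump):
    """Collect the 32-bit words from gdb x/Nw output ("ADDR: w0 w1 ...")."""
    words = []
    for line in dump.strip().splitlines():
        _, _, values = line.partition(":")
        words.extend(int(v, 16) for v in values.split())
    return words


def words_to_qwords(words):
    """Join little-endian 32-bit word pairs (low word first) into 64-bit values."""
    if len(words) % 2:
        raise ValueError("need an even number of 32-bit words")
    return [lo | (hi << 32) for lo, hi in zip(words[0::2], words[1::2])]


DUMP = """\
0x7fff1787bb00: 0x00000000 0x00000000 0x7c0ba9e0 0x00007f3b
0x7fff1787bb10: 0x013cebf0 0x00000000 0x01318140 0x00000000
0x7fff1787bb20: 0x00490393 0x00000000 0x00000000 0x00000000
"""

qwords = words_to_qwords(parse_gdb_words(DUMP))
# Slot index 2 lives at $rsp + 2 * 8 = $rsp + 0x10.
print(hex(qwords[2]))  # 0x13cebf0
```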

If we then let the program continue, this is what ends up happening:

Program received signal SIGABRT, Aborted.
0x00007f3779721ed5 in raise () from /lib/libc.so.6

That is the abort() call issued automatically by the glibc heap corruption detection discussed earlier. abort() delivers SIGABRT through raise(), which is why gdb stops there.

Fixing the problem

We don’t have the sources, but that is not going to stop us from fixing the buggy instruction. We know that the machine code for the call to cfree() is:

e8 53 90 ff ff
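Note that 0x40a6b0 is a virtual address, not a position in the file, and hex editors usually work with file offsets. For a classic non-PIE x86-64 executable loaded at 0x400000 the two are often related by a simple subtraction, but the reliable way is to consult the PT_LOAD program headers (readelf -l prints the same table by hand). A minimal Python sketch, assuming a little-endian ELF64 file; vaddr_to_offset is a hypothetical name:

```python
import struct

PT_LOAD = 1  # program header type of a loadable segment


def vaddr_to_offset(elf, vaddr):
    """Translate a virtual address into a file offset for a little-endian ELF64 image.

    Hypothetical helper: elf is the whole file content as bytes.
    """
    # e_ident: magic, EI_CLASS == 2 (ELFCLASS64), EI_DATA == 1 (little-endian).
    if elf[:4] != b"\x7fELF" or elf[4] != 2 or elf[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    (phoff,) = struct.unpack_from("<Q", elf, 0x20)            # e_phoff
    phentsize, phnum = struct.unpack_from("<HH", elf, 0x36)  # e_phentsize, e_phnum
    for i in range(phnum):
        (p_type, _flags, p_offset, p_vaddr,
         _paddr, p_filesz, _memsz, _align) = struct.unpack_from(
            "<IIQQQQQQ", elf, phoff + i * phentsize)
        # Only bytes actually stored in the file have a file offset.
        if p_type == PT_LOAD and p_vaddr <= vaddr < p_vaddr + p_filesz:
            return p_offset + (vaddr - p_vaddr)
    raise ValueError(f"{vaddr:#x} is not backed by file data")
```

Reading the installer with open(path, "rb").read() and passing the result together with 0x40a6b0 should give the position to jump to in the hex editor.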

So we can edit the setup.data/bin/x86_64/setup binary directly with any hex editor and search for these bytes.

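We have to replace this instruction with one that does no harm. The nop instruction (opcode 0x90) does, precisely, nothing. But nop is only 1 byte long and the call is 5 bytes long, so there have to be 5 nops in a row, which also keeps every following instruction at its original address: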

90 90 90 90 90

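The same edit can be scripted. This also makes it easy to check that the byte sequence occurs exactly once in the file before touching it. A Python sketch under those assumptions: nop_out and patch_file are hypothetical helpers, and the original file is left untouched.

```python
import shutil

CALL_FREE = bytes.fromhex("e8 53 90 ff ff")  # callq 403708 <free@plt>
NOP = b"\x90"


def nop_out(data, pattern=CALL_FREE):
    """Replace the single occurrence of pattern with NOPs of the same length."""
    count = data.count(pattern)
    if count != 1:
        # Refuse to guess: zero or several matches means wrong file or wrong bytes.
        raise ValueError(f"expected exactly one match, found {count}")
    return data.replace(pattern, NOP * len(pattern))


def patch_file(src, dst, pattern=CALL_FREE):
    """Write a patched copy of src to dst, leaving the original untouched."""
    with open(src, "rb") as f:
        patched = nop_out(f.read(), pattern)
    with open(dst, "wb") as f:
        f.write(patched)
    shutil.copymode(src, dst)  # keep the executable permission bits
```

For example, patch_file("setup.data/bin/x86_64/setup", "setup.patched") writes a patched copy under an arbitrary name. Running objdump on both files is then an easy way to compare them.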

Now, if we run the program again, there is no heap error message at all! Since the block had already been released by an earlier free(), dropping the redundant second call loses nothing.