The Art of Debugging: fine-tuning the previous Fortran analysis


I’ve been studying that buggy Fortran code for a while, and I’ve come up with a new approach. My previous analysis of the segmentation fault was right. In fact, gdb was perfectly clear: the error was due to wrong matrix addressing. The i1 variable got the value 32767, which was clearly out of bounds for the agetab matrix definition. Hence the segmentation fault. However, my explanation of why adding a write() call inside the wdcoolco_ subroutine completely changed the program’s behaviour was not accurate. Okay, I made a mistake. It had nothing to do with the segment sections of our buggy Fortran code. But it perfectly could have!

So, I’ve been studying the assembly code for that subroutine, and I’ve discovered something that actually makes sense, after all. Below, I discuss it in detail.

Assembly-to-Fortran mapping

First of all, let’s discuss the relationship between Fortran and assembly code. We do know we had a segmentation fault right here:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f3088bed6e0 (LWP 2260)]
0x000000000044f4da in wdcoolco_ ()
Current language:  auto; currently asm
(gdb) info register $rip
rip            0x44f4da    0x44f4da <wdcoolco_+442>

Thanks to the $rip register (the instruction pointer), we can be sure our segmentation fault occurs right at the first of these assembly instructions:

0x000000000044f4da <wdcoolco_+442>:    movsd  0x7be848(%rsi,%r8,1),%xmm8
0x000000000044f4e4 <wdcoolco_+452>:    movsd  0x7be898(%rsi,%r8,1),%xmm9
0x000000000044f4ee <wdcoolco_+462>:    movsd  0x7be8a0(%rax,%r8,1),%xmm7
0x000000000044f4f8 <wdcoolco_+472>:    movsd  0x7be850(%rax,%r8,1),%xmm5

So, according to our analysis using debugging symbols, this instruction corresponds precisely to this high-level Fortran statement:

y1=agetab(j1,i1  )

Following the same reasoning, we can also map the next three high-level Fortran lines to the assembly instructions that follow it:

Assembly                                  Fortran
movsd 0x7be848(%rsi,%r8,1),%xmm8          y1=agetab(j1,i1  )
movsd 0x7be898(%rsi,%r8,1),%xmm9          y2=agetab(j1,i1+1)
movsd 0x7be8a0(%rax,%r8,1),%xmm7          y3=agetab(j2,i2+1)
movsd 0x7be850(%rax,%r8,1),%xmm5          y4=agetab(j2,i2  )
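
The displacements in these four instructions also hint at how agetab is laid out in memory. Fortran stores arrays in column-major order, so stepping the second index by one should move the address by one whole column. The following sketch (Python, used here only as a calculator) measures that step. It assumes 8-byte elements, since movsd loads a 64-bit double. The resulting leading dimension of 10 is an inference from the listing, not something taken from the actual agetab declaration.

```python
# Displacements copied from the four movsd instructions above.
DISP_Y1 = 0x7be848  # y1 = agetab(j1, i1)
DISP_Y2 = 0x7be898  # y2 = agetab(j1, i1+1)
DISP_Y3 = 0x7be8a0  # y3 = agetab(j2, i2+1)
DISP_Y4 = 0x7be850  # y4 = agetab(j2, i2)

ELEMENT_SIZE = 8  # assumption: agetab holds 8-byte reals


def column_stride_bytes(disp_i, disp_i_plus_1):
    """Byte distance between (j, i) and (j, i+1) in a column-major array."""
    return disp_i_plus_1 - disp_i


stride_1 = column_stride_bytes(DISP_Y1, DISP_Y2)
stride_2 = column_stride_bytes(DISP_Y4, DISP_Y3)
inferred_leading_dimension = stride_1 // ELEMENT_SIZE

print(hex(stride_1), hex(stride_2), inferred_leading_dimension)  # 0x50 0x50 10
```

Both pairs give the same 0x50-byte step, which is consistent with the mapping: each "+1" on the second index costs the same fixed distance.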

Thus, the segmentation fault takes place at the very same high-level Fortran line whether we use debugging symbols or not. That is, we either see the segfault at bergy.f:3147, or we see it at this assembly instruction offset: *wdcoolco_+442. It is the same instruction, the one the $rip register points to.

Understanding the previous assembly instruction

We are dealing with matrices of double-precision values here, and that is exactly what the movsd instruction handles: it moves a scalar double-precision (64-bit) floating-point value from one place (operand) to another. The first operand is a memory location; the second one (the destination) is an XMM register, whose low 64 bits receive the value. Thus, in this case:

movsd 0x7be848(%rsi,%r8,1),%xmm8

we are trying to read from

{(rsi + r8 * 1) + 0x7be848} = (rsi + r8 + 0x7be848)

and store its contents in the xmm8 register. No wonder xmm8 is where the variable y1 is going to be stored.

This addressing method is formulated this way:

{(base + index * scale) + offset}

In our case:

base = rsi

index = r8

scale = 1

offset = 0x7be848
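
The formula above is easy to reproduce outside the debugger. The following Python sketch mimics how the address of a base + index * scale + offset operand is formed, including the wrap-around to 64 bits. The function name and the register values in the example are made up for illustration; only the offset 0x7be848 comes from the instruction under discussion.

```python
MASK_64 = (1 << 64) - 1  # registers and addresses are 64 bits wide


def effective_address(base, index, scale, offset):
    """Return (base + index * scale + offset), wrapped to 64 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 only allows a scale of 1, 2, 4 or 8")
    return (base + index * scale + offset) & MASK_64


# Hypothetical register contents, chosen only to show the arithmetic.
example_rsi = 0x100
example_r8 = 0x20
addr = effective_address(example_rsi, example_r8, 1, 0x7be848)
print(hex(addr))  # 0x7be968
```

With sane register contents the result lands close to the offset itself. A garbage base or index pushes it somewhere arbitrary.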

It seems pretty obvious that we are getting a segmentation fault here because the first operand, that is, the memory location we are trying to read from, is invalid. We know our i1 variable is wrong (thanks to debugging the program with debugging symbols). We can track this i1 = 32767 value down in the register dump:

(gdb) info registers


foseg          0x7fff    32767

So, it seems the variable takes the same value every time, whether or not we add that write() call to our wdcoolco subroutine. I thought the value kept changing, but I was wrong. i1 is always 32767.

Differences affecting the base and index registers

If we compile and run the code without that write() call, the previously discussed assembly instruction uses rsi as the base register and r8 as the index register. If we have a look at the value stored in rsi, we find an enormous base memory location, so when the instruction is executed, the computed address is completely invalid and the program crashes:

rsi            0x27ffcdfc3d1700    11258784256956160
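
One way to see why this base is hopeless: on x86-64 with 48-bit virtual addresses (an assumption here, though it was the norm for machines of that era), a valid, so-called canonical address must have bits 47 through 63 all equal. The Python sketch below checks the rsi value from the dump against that rule. The helper name is made up for illustration.

```python
def is_canonical_48(addr):
    """True if bits 47..63 of a 64-bit address are all 0 or all 1."""
    top = addr >> 47                  # keep the upper 17 bits
    return top == 0 or top == (1 << 17) - 1


rsi = 0x27ffcdfc3d1700                # value from the register dump above
decimal_matches = (rsi == 11258784256956160)
print(decimal_matches)                # True: hex and decimal columns agree
print(is_canonical_48(rsi))           # False: the base alone is out of range
print(is_canonical_48(0x7be848))      # True: the bare offset is a normal address
```

The value of r8 is not shown in the dump, so this only covers the base. Unless r8 happened to cancel most of rsi, the sum stays non-canonical, and on Linux the resulting fault is reported as SIGSEGV.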

However, that does not happen when we have a write() call. If we compile the program again and look for this previous assembly instruction, we discover that, this time, the compiler has decided to use two different registers, rax as the base and rdx as the index register. This seems to happen all the time, as long as we have the write() call inside the wdcoolco subroutine:

0x000000000044f534 <wdcoolco_+532>:    movsd  0x7be848(%rax,%rdx,1),%xmm8
0x000000000044f53e <wdcoolco_+542>:    movsd  0x7be898(%rax,%rdx,1),%xmm9
0x000000000044f548 <wdcoolco_+552>:    movsd  0x7be8a0(%rcx,%rdx,1),%xmm7
0x000000000044f551 <wdcoolco_+561>:    movsd  0x7be850(%rcx,%rdx,1),%xmm5

It turns out that our value for rax is always 0. Thus, when the program computes the memory location for this operand, the address is valid:

(gdb) b *wdcoolco_+532
Breakpoint 2 at 0x44f534
(gdb) info register $rax
rax            0x0    0
(gdb) x/w (($rax + $rdx)+0x7be848)
0x7be858 <fredone_+144024>:    0x00000000
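
The address gdb printed also tells us what rdx held at the breakpoint, since rax is 0 and the offset is known. Here is a quick Python check of that arithmetic (the variable names are mine, and the element count assumes 8-byte reals):

```python
OFFSET = 0x7be848        # displacement encoded in the movsd instruction
rax = 0x0                # base, as reported by gdb
printed_addr = 0x7be858  # address shown by the x/w command

# address = rax + rdx * 1 + OFFSET, so solve for rdx.
rdx = printed_addr - OFFSET - rax
elements_past_offset = rdx // 8  # assumption: 8-byte elements

print(hex(rdx), elements_past_offset)  # 0x10 2
```

So in this run the read lands only 16 bytes past the offset, well inside mapped memory.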

So that’s the main difference. The choice of rax and rdx this time is entirely up to the ifort compiler, and it is clearly related to the fact that we are adding some code, which affects the compiler’s internal choice of the registers where the variables will be stored.

We can add any sort of syscall right there!

Well, if we get rid of that write() call and put in, let’s say, this other one:

mypid = getpid()

our program’s behaviour is exactly the same as, you guessed it, when we had that write() call. That’s because the choice of base and index registers is just the same: rax and rdx. And rax, as before, holds 0. We can therefore conclude that when system calls like write() or getpid() are used, the ifort compiler’s choice of registers differs from the one it makes without them, tending towards rax and rdx when a syscall is performed inside the wdcoolco subroutine. Why rax = 0 is beyond the scope of this analysis, because we do know where the actual error lurks: that i1 variable at bergy.f:3147. But now, finally, we can understand why our program does not crash when we use the write() system call, despite the fact that we still have that erroneous value of 32767 stored in the i1 variable.