The Art of Debugging: two case-studies (2/2)

Once upon a time … a matrix!

Sometimes a program performs an illegal memory access and yet does not crash. When that happens, the program may behave oddly, but it still seems to work. Let A and B be two 3×3 square matrices. Fortran stores arrays in column-major order, that is, column by column; in this example both matrices are symmetric (A = A^T, B = B^T), so a memory dump reads the same by rows or by columns. They are declared like this:

      integer B(3,3)
      integer A(3,3)

We can use gdb to verify how they are laid out. First of all, we need the addresses of matrix A and matrix B:

(gdb) info address A
Symbol "a" is static storage at address 0x6a5f20.
(gdb) info address B
Symbol "b" is static storage at address 0x6a5ef0.

Now, we can explore these matrices directly at their memory addresses. Let’s explore matrix B:

(gdb) x/9 0x6a5ef0
0x6a5ef0 <test_array_$B.0.1>:    0x0000000a    0x0000000a    0x0000000a    0x0000000a
0x6a5f00 <test_array_$B.0.1+16>:    0x0000000a    0x0000000a    0x0000000a    0x0000000a
0x6a5f10 <test_array_$B.0.1+32>:    0x0000000a

Fortran stores each array contiguously, and here the two arrays also sit next to each other. Dumping 21 words starting at B shows it:

(gdb) x/21 0x6a5ef0
0x6a5ef0 <test_array_$B.0.1>:    0x0000000a    0x0000000a    0x0000000a    0x0000000a
0x6a5f00 <test_array_$B.0.1+16>:    0x0000000a    0x0000000a    0x0000000a    0x0000000a
0x6a5f10 <test_array_$B.0.1+32>:    0x0000000a    0x00000000    0x00000000    0x00000000
0x6a5f20 <test_array_$A.0.1>:    0x00000001    0x00000002    0x00000003    0x00000002
0x6a5f30 <test_array_$A.0.1+16>:    0x00000004    0x00000006    0x00000003    0x00000006
0x6a5f40 <test_array_$A.0.1+32>:    0x00000009

So matrix A follows matrix B in memory, separated only by three zero words of padding. It is therefore quite feasible to overwrite matrix A through an erroneous reference to matrix B (or the other way around). Let i and j be the integer indices addressing the rows and columns of A and B. Now suppose that, because of an actual code defect, we reference matrix B like this:

 B(1,5) = 255

We are writing to a memory location that does not belong to matrix B. Let’s see which address it is:

(gdb) p &(B(1,5))
$2 = (PTR TO -> ( INTEGER(4) )) 0x6a5f20

The address is 0x6a5f20 which, according to our earlier memory dump, is where A(1,1) is stored. So if our program executes that line of code, this is the awful result:

(gdb) p A(1,1)
$4 = 1
(gdb) set variable B(1,5)=255
(gdb) p A(1,1)
$3 = 255

Pretty nasty, huh? This is the case where our program does not (apparently) crash; it even finishes its execution in an orderly way, and we think there is nothing wrong with it.
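The address arithmetic behind this overwrite can be sketched in a few lines. The snippet is Python rather than Fortran, used purely as a calculator. It assumes 4-byte integers and column-major storage, takes the base addresses from the gdb session above, and the helper name element_address is made up for illustration:

```python
# Base addresses reported by gdb's "info address" in the session above.
BASE_B = 0x6A5EF0
BASE_A = 0x6A5F20
ELEM_SIZE = 4   # assumption: 4-byte integers, as gdb's INTEGER(4) suggests
NROWS = 3       # both matrices are declared (3,3)


def element_address(base, i, j, nrows=NROWS, elem_size=ELEM_SIZE):
    """Address of element (i, j), 1-based, in a column-major array.

    No bounds check is done here, which mirrors compiled code that runs
    without run-time bounds checking.
    """
    return base + ((i - 1) + (j - 1) * nrows) * elem_size


if __name__ == "__main__":
    # Last legitimate element of B.
    print(hex(element_address(BASE_B, 3, 3)))   # 0x6a5f10
    # One column past the end: lands in the padding between B and A.
    print(hex(element_address(BASE_B, 1, 4)))   # 0x6a5f14
    # Two columns past the end: exactly where A(1,1) lives.
    print(hex(element_address(BASE_B, 1, 5)))   # 0x6a5f20
```

Under these assumptions, the first column index of B that actually reaches A is 5, which matches the address gdb printed for B(1,5).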

A Fortran case study: a segfault “only if there isn’t a write() statement in the code, but not always and certainly not on every computer 😉”.

Running this Fortran code, we get a segmentation fault only as long as there is no write() statement at line 3007. Let’s play with it:


6  0.551900000000000        1.52890000000000
[New Thread 0x7f66b6b686e0 (LWP 18122)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f66b6b686e0 (LWP 18122)]
0x000000000044fb5e in wdcoolco_ ()
Current language:  auto; currently asm

Now we uncomment line 3007 and run the program again:


DA CO wd+ MS CASE C=           0
CO+MS CASE C=           0
CO+MS TPAGB CASE=           0
ONe+MS TPAGB CASE=           0

Program exited normally.

Wow! Odd, isn’t it? Well, it would be odd if we did not know anything about gdb, memory segments, and the like. Whenever this kind of weird behaviour shows up in our code, let’s say: “Oh, if I add a trivial statement my code works fine; if I get rid of that very same statement, it does not”, there is, indeed, an illegal memory access somewhere in the code. That is a bloody fact.

The best way to find out where is, obviously, to compile the code with debugging symbols. This code was compiled with Intel Fortran, so here we go:

ifort bergy.f -o bergy.e -debug all

Now it is time to use gdb:

entrada wdool
[New Thread 0x7feca1b0c6e0 (LWP 18162)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7feca1b0c6e0 (LWP 18162)]
0x0000000000416f3e in wdcoolco (m=0.51273999999999997, tc=2.4539, teff=5418.2025206093458, grav=6.9330764944871506) at bergy.f:3147
3147    8     y1=agetab(j1,i1  )
Current language:  auto; currently fortran

Oops! I told ya! There it is, that bastard! We do have an illegal memory access. According to the gdb output, it is in bergy.f at line 3147, where we find this telling statement:

y1=agetab(j1,i1  )

It is a piece of cake to show that this access to the matrix agetab is plain wrong:

(gdb) print i1
$1 = 32767
(gdb) print j1
$2 = 2

According to the Fortran source code, agetab is declared with nrow=10 and ncol=900. Clearly, the value stored in the integer variable i1, 32767, is far outside the bounds of agetab! We can even find out where we are trying to read from:

(gdb) p &(agetab(j1,i1))
$3 = (PTR TO -> ( REAL(8) )) 0xa5c828
(gdb) info symbol 0xa5c828
No symbol matches 0xa5c828.
(gdb) x/w 0xa5c828
0xa5c828:    Cannot access memory at address 0xa5c828
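A quick sanity check of those numbers, as a small Python sketch. It assumes the declaration is agetab(10,900) with 8-byte reals, which is how we read nrow=10 and ncol=900 above; in_bounds and byte_offset are illustrative helper names, not part of bergy.f:

```python
NROW, NCOL = 10, 900   # assumption: agetab is declared as agetab(10,900)
ELEM_SIZE = 8          # gdb reports the element type as REAL(8)


def in_bounds(j, i):
    """True if agetab(j, i) is a legitimate 1-based reference."""
    return 1 <= j <= NROW and 1 <= i <= NCOL


def byte_offset(j, i):
    """Column-major byte offset of agetab(j, i) from agetab(1, 1)."""
    return ((j - 1) + (i - 1) * NROW) * ELEM_SIZE


if __name__ == "__main__":
    j1, i1 = 2, 32767                    # values printed by gdb
    print(in_bounds(j1, i1))             # False
    print(byte_offset(j1, i1))           # 2621288 bytes past agetab(1,1)
    print(NROW * NCOL * ELEM_SIZE)       # 72000 bytes: the whole array
```

Under these assumptions the read lands roughly 2.6 MB past the start of an array that is only 72000 bytes long.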

So we get a segmentation fault because the address 0xa5c828 is not valid. Indeed, if we look at the memory map of this process by reading /proc/our_pid/maps, we get:

00400000-0057f000 r-xp 00000000 08:04 4129049                            /home/tonicas/docs/incidents/antares/seg/first_bergy/bergy.e
0077f000-00787000 rw-p 0017f000 08:04 4129049                            /home/tonicas/docs/incidents/antares/seg/first_bergy/bergy.e
00787000-0086c000 rw-p 00000000 00:00 0
01a11000-01a32000 rw-p 00000000 00:00 0                                  [heap]


Our text segment (where our executable statements reside, like a call to write() ;-)) is the first one; the second and third ones belong to our data segment. The fourth one is the heap, used by malloc() and the like, that is, for dynamic memory allocation in our code. And yes, the address above does not belong to any of these segments! Hence the segmentation fault.
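The same lookup can be done mechanically. Here is a minimal Python sketch that only knows the four maps lines quoted above (with the long path shortened to bergy.e); a real maps file typically lists more regions, such as the stack. The names parse_maps and find_region are hypothetical helpers written for this post:

```python
# The four lines of /proc/<pid>/maps quoted above, path shortened.
MAPS = """\
00400000-0057f000 r-xp 00000000 08:04 4129049 bergy.e
0077f000-00787000 rw-p 0017f000 08:04 4129049 bergy.e
00787000-0086c000 rw-p 00000000 00:00 0
01a11000-01a32000 rw-p 00000000 00:00 0 [heap]
"""


def parse_maps(text):
    """Return a list of (start, end, perms) tuples, one per mapped region."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        regions.append((start, end, fields[1]))
    return regions


def find_region(regions, address):
    """Return the region containing address, or None if it is unmapped."""
    for start, end, perms in regions:
        if start <= address < end:   # the end address is exclusive
            return (start, end, perms)
    return None


if __name__ == "__main__":
    regions = parse_maps(MAPS)
    # The address gdb could not read: not inside any listed region.
    print(find_region(regions, 0xA5C828))   # None
    # The faulting instruction itself sits in the text segment (r-xp).
    print(find_region(regions, 0x416F3E))
```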

Okay, but … why doesn’t our program crash if we add the write() statement at line 3007? That’s pretty simple. Recall the first section of this post, where we showed how to overwrite another memory location without getting a segmentation fault. This is the same principle.

Have a look at the next post to find out.