The Art of Debugging: two case-studies (1/2)

Preamble

Debugging is more than a technique. It is, in fact, an art. But this art is hard to master, and so a great many developers think of it as a burden. Done well, it can be the quickest way to get rid of an awful bug. To get there, though, one has to know the basic concepts and ideas behind the almost divine art of locating defects in software and correcting them. I’ve been working as a Systems Manager for more than fourteen years, and for five of those years I’ve been debugging software, so I’ve got some tricks I would like to share. I think the best way to understand the art of debugging is through some real examples. I don’t know much about Fortran development but, as you are about to witness, that does not matter for our debugging purposes.

It is said, and I could not agree more, that the most common defects in software are those related to memory management: arrays, pointers, and so on. Below, we are going to discuss two real cases involving segmentation faults. Oh! It is not that bad, trust me on this one. When a segmentation fault occurs, one can take a deep breath and relax: if there is a segfault, we know there is a defect somewhere in the code. Sometimes, though, there is NO segfault at all but, as N. Matloff and P. Salzman put it, we cannot conclude from the absence of a segfault that a memory operation is correct.
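That last point is easy to underestimate, so here is a minimal sketch of it in C. Everything in it (pool, A_LEN, B_LEN, fill_a_buggy, read_b) is made up for illustration and has nothing to do with the programs discussed in this post. Two logical arrays live back to back in one real allocation, so an off-by-one write to the first one lands in the second one without any complaint from the operating system.

```c
#include <assert.h>
#include <stddef.h>

#define A_LEN 4   /* hypothetical size of logical array a */
#define B_LEN 4   /* hypothetical size of logical array b */

/* One real allocation holding two logical arrays back to back:
   a is pool[0..3], b is pool[4..7]. Static storage starts at zero. */
static double pool[A_LEN + B_LEN];

/* Deliberately buggy: the loop bound is <= instead of <, so the last
   iteration writes a[A_LEN], which is really b[0]. The write stays
   inside pool, so no segmentation fault is raised. */
void fill_a_buggy(double value)
{
    double *a = pool;
    for (size_t k = 0; k <= A_LEN; k++)
        a[k] = value;
}

/* Returns b[idx] so that a caller can observe the silent corruption. */
double read_b(size_t idx)
{
    assert(idx < B_LEN);
    return pool[A_LEN + idx];
}
```

After a call such as fill_a_buggy(7.0), read_b(0) returns 7.0 even though nobody ever meant to touch b. The program keeps running happily, and that is exactly the kind of defect that is harder to find than a crash.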

We will begin with the first case, a molecular dynamics simulation code written in C. To conclude, we will analyse a numerical physics simulation code written in Fortran.

Case study 1: A molecular dynamics C code (MD)

This program is written entirely in C and uses the math library (-lm). So, the first thing to do is compile it, run it, and see what comes out:

gcc MD.c -o MD.e -lm

./MD.e

(….)

ssssssssssssssssssssssssssssssssssssssssssssssssssssssssss

ssssssssssssssssssssssssssssssssssssssssssssssssssssssssss

ssssssssssssssssssss

Segmentation fault

Okay, now we know it crashes. But that is good, very good indeed. If we have a segmentation fault, we know for sure there is a defect in the code. I’m not going to deal with the maths here. I don’t have to. What I do have to do is compile the code again, this time adding debugging symbols to the binary, and then track the crash down using gdb:

gcc -g MD.c -o MD.e -lm

gdb ./MD.e

Inside gdb, I run the program. As expected, after some iterations I get the segmentation fault message, but this time with some additional, and truly important, information:

sssssssssssssssssssssssssssssssssssssssssssssssss(…)

Program received signal SIGSEGV, Segmentation fault.
0x0000000000402019 in Calc_vacf () at MD.c:633
633            vacf[j-i]=0;

So, I know there is a problem at line 633 of the MD.c source file. Wow! It almost seems like magic, doesn’t it? There are several theories about how to debug a piece of code. Here, it seems pretty obvious we are dealing with an array called vacf[], so we can apply the induction technique and think this way: “Oh, this is a vector. Accessing a vector out of bounds is a very common mistake. This could well be the same case.” Clearly, the line

vacf[j-i] = 0;

tries to set the element at index j-i of the vector to zero. We need to know whether this access is correct before going further. To do that, we can determine the real size of the vacf[] vector:

(gdb) print vacf
$1 = {0 <repeats 1000 times>}

Well, our vacf[] vector holds 1000 elements, all of them zero (strictly speaking, we cannot say the program initialized it, because it crashed while doing exactly that ;-)). Don’t forget what this means: any valid index k into vacf must satisfy 0 <= k <= 999. We can even have a quick look at the C source code without leaving the gdb debugging session to be sure:

(gdb) list vacf
64    double virial;                                    // Virial contribution
65
66    int tau;                                    // Position for the vels_x and vels_y arrays
67    double vels_x[(int) N][tau_max];                        // Array storing the x velocities at one time instant
68    double vels_y[(int) N][tau_max];                        // Array storing the y velocities at one time instant
69    double vacf[tau_max];                                // Stores the velocity correlation

We also need the value of tau_max, which is indeed 1000:

30 # define tau_max    1000                            // Number of iterations over which the correlation is evaluated
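To make that valid range explicit, here is a minimal sketch of a bounds check in C. The helpers vacf_index_ok and vacf_store are my own names and do not exist in MD.c; only tau_max and its value come from the source shown above.

```c
#include <assert.h>
#include <stdbool.h>

#define tau_max 1000   /* same value as in MD.c, line 30 */

/* Hypothetical helper, not part of MD.c: true when idx is a valid
   index into an array declared as double vacf[tau_max]. */
bool vacf_index_ok(long idx)
{
    return idx >= 0 && idx < tau_max;
}

/* Hypothetical checked store: aborts with a file and line message
   instead of writing outside the array. */
void vacf_store(double vacf[], long idx, double value)
{
    assert(vacf_index_ok(idx));
    vacf[idx] = value;
}
```

With a check like this, an index of 0 or 999 passes, while 1000 or any negative value is rejected before any memory is touched.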

Now, we need to know whether j or i is wrong. Our working hypothesis is that j, i, or both hold a bad value, because we are theorizing that our access to the vector is not correct. Let’s have a look at their current values:

(gdb) print i
$2 = 0
(gdb) print j
$3 = -1583524721

Well, as far as I can tell, the value of i could be correct. But the value of j surely is not! With i equal to 0, the index j-i is -1583524721, which is nowhere near the valid range of 0 to 999. It looks as if j has not been initialized at all. So we have to focus on this j variable and work backwards to determine why it has been infected: either it was never correctly initialized, or something in the code infects it later. First, we look for the j variable to determine whether it is local to the function we are in (Calc_vacf()) or has been passed as an argument:

(gdb) list i
624
625    void Calc_vacf()
626    {
627
628            printf("estic en calcul de vacf         ");
629        int i,j,k;
630        double sum,norma;
631
632        for(i=0;i<tau_max;i++)
633            vacf[j-i]=0;

We can do that because, according to our gdb debugging session, we are still in the Calc_vacf() function:

(gdb) frame
#0  0x0000000000402019 in Calc_vacf () at MD.c:633
633            vacf[j-i]=0;

If we have a quick look at this function, we see that i has been correctly initialized (for(i=0; ...)), but this is not the case for the j variable. It is declared on line 629 and never assigned before line 633 uses it, which is how we ended up with the odd value of -1583524721. It has to be said that this value may well change from one execution to the next, even on the very same computer, since it is simply whatever happened to be in that memory location.

Well, it was not that hard, was it? We have discovered that the j variable inside the Calc_vacf() function has not been initialized, and that this is the root cause of our segmentation fault. How to fix it is up to its owner and main developer. Don’t forget I know shit about molecular dynamics!!!!!! 😉
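Purely as an illustration, and assuming the loop on lines 632 and 633 was only meant to reset the whole array (an assumption on my part, since I cannot speak for the author), one possible repair is to index with the loop counter alone. The function name reset_vacf and its parameters are hypothetical; in MD.c the array is a global and the loop lives inside Calc_vacf().

```c
#include <assert.h>
#include <stddef.h>

#define tau_max 1000   /* same value as in MD.c, line 30 */

/* Assumed intent: set every element of the correlation array to zero.
   The array is passed in only to keep this sketch self-contained. */
void reset_vacf(double vacf[], size_t len)
{
    assert(vacf != NULL);
    for (size_t i = 0; i < len; i++)
        vacf[i] = 0.0;   /* the index depends only on the initialized counter */
}
```

Called as reset_vacf(vacf, tau_max), this never reads j, so there is no uninitialized value left to send the index out of range. If the original j-i was intentional, the real fix is instead to give j a sensible value before the loop, and only the author knows what that value should be.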