After compiling and installing the ARPACK libraries on a 64-bit GNU/Linux cluster, we compiled a Fortran code, linked it against these libraries and ran it. During execution, Not a Number (NaN) values showed up in the output, as shown below:
1 NaN —
2 NaN —
3 NaN —
4 NaN —
Apparently, this very same code had been used for a long time on other GNU/Linux platforms with no evident issues. Therefore, the problem seemed to be related to the ARPACK libraries.
Adding the debug symbols
Our first approach was to recompile the program with the “-g” flag so that the debug symbols would be added to the binary image. This way, tracing the program through gdb would be easier. We ran the code once again for a while, this time inside a gdb session, and, surprisingly enough, the output file contained no NaNs but actual numbers instead.
1 0.82E+00 —
2 0.79E+00 —
3 0.69E+00 —
4 0.68E+00 —
It was clear enough that this had to be a Dynamic Memory (DM) problem: before adding the debug symbols the program printed NaNs, and after the ELF executable had grown to hold the debug symbols it printed actual numbers. Output that changes when only the binary image changes is a typical sign that the program is reading memory it does not own. So although the program appeared to be working, it was far from working fine. It did have memory access errors, and we still had to determine whether they were due to the recently compiled and installed ARPACK libraries or not.
Whenever a DM problem appears, or is likely to appear, we have to check for invalid array indexes. With the gfortran compiler, the simplest way to do that is to add the “-fbounds-check” flag (a deprecated alias for “-fcheck=bounds” in current gfortran releases).
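To make the idea concrete, here is a minimal sketch of the kind of bug this flag is meant to catch. It is not the program discussed in this post: the program name, the helper sum_first and the array size are made up for illustration, and only the array name xjac is borrowed from the error message shown further down.

```fortran
! Illustrative only: a deliberate off-by-one read past the end of an array.
! Build without checks:  gfortran bounds_demo.f90
! Build with checks:     gfortran -g -fcheck=bounds bounds_demo.f90
program bounds_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 5          ! made-up size
  real(dp) :: xjac(n)
  real(dp) :: total

  xjac = 1.0_dp
  ! Bug: asks the helper to read n + 1 elements from an array of n.
  total = sum_first(xjac, n + 1)
  print *, 'total =', total

contains

  ! Hypothetical helper: sums a(1) .. a(m) and trusts the caller about m.
  function sum_first(a, m) result(s)
    real(dp), intent(in) :: a(:)
    integer, intent(in) :: m
    real(dp) :: s
    integer :: i
    s = 0.0_dp
    do i = 1, m
      s = s + a(i)                     ! a(6) is outside the array when m = 6
    end do
  end function sum_first

end program bounds_demo
```

Built without the check, the program reads whatever happens to sit in memory after the array, so the printed value is unpredictable and may differ from one build to another. Built with the check, the gfortran runtime should stop at the first invalid access and report the offending array and dimension.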
Checking for valid array indexes
We altered the Makefile so that we could check for invalid array accesses:
FFLAGS90 = -fbounds-check -O3 -fopenmp -frecord-marker=4 -I/$(PWD)
Then we compiled and ran the code and this came up:
../twSANB < C2k1.25L05N05M10cmSA_NB.in
Solution File “C2k1.25L05N05M10cm_NB.xa” read.
L= 5 N= 5 nsym= 2 M= 10 k= 1.25000000000000 Re= 1683.30497714247 cz=
0.728102428184088 ct= 0.00000000000000
Fortran runtime error: Array reference out of bounds for array ‘xjac’,
upper bound of dimension 1 exceeded (in file
The runtime error at the end of this output clearly shows that there was indeed an invalid array access. It was a trivial coding error in a Fortran source file that does not belong to the ARPACK libraries. Therefore, the problem was located in the program’s source, not in ARPACK. The invalid array index was eventually fixed by altering the program’s source code slightly, and so far the program is working fine. Case closed ;-).
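The actual fix is not reproduced here. As a purely illustrative sketch of one common way to avoid this class of error, the loop limit can be taken from the array itself with the size intrinsic rather than from a separately maintained count. The program name, the helper safe_sum and the array size below are hypothetical; only the name xjac comes from the error message above.

```fortran
! Illustrative only: let the array supply its own loop limit.
program safe_bounds_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: xjac(5)                  ! made-up size

  xjac = 1.0_dp
  print *, 'sum =', safe_sum(xjac)

contains

  ! Hypothetical helper: the upper limit is size(a), so it cannot
  ! drift out of sync with the bounds of the array that is passed in.
  function safe_sum(a) result(s)
    real(dp), intent(in) :: a(:)
    real(dp) :: s
    integer :: i
    s = 0.0_dp
    do i = 1, size(a)
      s = s + a(i)
    end do
  end function safe_sum

end program safe_bounds_demo
```

Keeping the bounds check enabled in a test build, even after a fix like this, is a cheap way to catch the next invalid index before it turns into NaNs.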