GNU/Linux LKMs, The Art of Adapting Old Code 1/3: Understanding Kernel OOPS Messages

Why adapting old code is so important

This post is the first in a series of three dealing with old LKM driver code and the techniques, tools and approaches we can use to adapt it so that it can be used on modern GNU/Linux kernels. Far from being a universal recipe, these posts are meant as a guide to fixing this kind of problem, allowing us to re-use old LKM drivers on modern kernels. Adapting old LKM code matters because sometimes there is no good alternative to a driver we have been using for a long time, right up until we upgrade our GNU/Linux distro or kernel version. Upgrading is mandatory if we want to stay clear of known security issues and incidents; on the other hand, it brings about a kind of hardware obsolescence. From time to time, some drivers no longer load because of major changes in the GNU/Linux Kernel APIs, which is really annoying when it happens. To avoid this, many system managers simply don’t upgrade their systems and wait until the driver in question has been updated to load on the new kernel.

The next posts will mostly deal with adapting the old ATI Catalyst 9.3 LKM sources so that the driver can be loaded on a GNU/Linux Kernel 3.2.X. This first one, however, focuses on one particular issue that showed up after the whole process of altering the source code had been completed. The main point is to show what sort of tools and techniques we can use while adapting old kernel code to a newer kernel, a task that often involves debugging and working through Kernel OOPS or even Kernel PANIC messages.

Case study: KERNEL OOPS whilst loading the fglrx.ko LKM

Let’s suppose we’ve got the sources of the ATI fglrx.ko LKM, version 9.3 (“Catalyst”), and we want to load this driver on a GNU/Linux Kernel 3.2.0-4-amd64. Let’s also assume that we have finished adapting the entire source code and that it now compiles without errors on that kernel. As soon as we run the insmod fglrx.ko command, a KERNEL OOPS shows up on the screen:


We have to focus on the last function that was called, that is, __init_waitqueue_head(). According to the back trace, a call to this function triggered the “unable to handle paging request at 0000000062513ba0” error. The function __init_waitqueue_head belongs to the GNU/Linux Kernel API. Rather than looking for an error inside the GNU/Linux kernel sources, it is far more likely that one of the parameters passed to the function is wrong. So we have to check these parameters right before the call is made, and that means checking the KCL_WAIT_CreateObject function, which is part of the ATI Catalyst LKM sources.

Therefore, inside our fglrx sources directory, we first ran the cscope utility to build the symbol database:

cd build_mod/ ; cscope -bv
Building cross-reference…

Then, we ran the cscope utility again to find the KCL_WAIT_CreateObject function so that we could analyse it:

111 KCL_WAIT_ObjectHandle ATI_API_CALL KCL_WAIT_CreateObject(void)
112 {
113     wait_queue_head_t* wait_object = __kmalloc(sizeof(wait_queue_head_t), GFP_ATOMIC);
115     if (wait_object)
116     {
117         init_waitqueue_head(wait_object);
118     }
120     return (KCL_WAIT_ObjectHandle)wait_object;
121 }

Line 117, according to the back trace, is the one triggering the Kernel OOPS: init_waitqueue_head is a macro that expands to a call to __init_waitqueue_head, the function at the top of the trace. So wait_object must be the wrong parameter. As the previous snippet shows, wait_object is a pointer to a wait_queue_head_t, and the memory it points to comes from the call to __kmalloc. init_waitqueue_head does not allocate anything itself; it only writes to the structure it is given, setting up its lock and its list head. Looking back at the back trace, the faulting address is 0x0000000062513ba0, which is the value held in wait_object or a few bytes past it. On amd64, kernel memory lives in the upper half of the address space, so addresses returned by the allocator start with 0xffff; a value whose upper 32 bits are all zero cannot be a valid kernel pointer. Whatever __kmalloc handed back, by the time it reaches init_waitqueue_head it no longer points to kernel memory, and the first write through it produces the “unable to handle paging request” error.
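One way a kernel pointer can end up looking like 0x0000000062513ba0 is by passing through a 32-bit type on the way, for instance when a function returning a pointer is called without a correct prototype in scope and its result is treated as an int. This is an assumption about the cause, not something the back trace proves. The sketch below is plain userspace C; the 64-bit value 0xffff880062513ba0 and the helper names are hypothetical, chosen only to show how the upper half of a pointer is lost.

```c
#include <stdint.h>

/* Hypothetical helper: models a 64-bit pointer value squeezed through a
 * 32-bit type and widened again. Bit 31 of the example value is clear, so
 * zero extension and sign extension give the same result here. */
uint64_t truncate_to_32(uint64_t ptr_value)
{
    uint32_t low = (uint32_t)ptr_value;   /* upper 32 bits are dropped */
    return (uint64_t)low;
}

/* Hypothetical helper: returns 1 if addr lies in the upper (kernel) half of
 * the x86-64 address space, assuming 48-bit virtual addresses, i.e. bits
 * 63..47 are all set. */
int in_kernel_half(uint64_t addr)
{
    return (addr >> 47) == 0x1ffffULL;
}
```

Passing the hypothetical value 0xffff880062513ba0 to truncate_to_32 gives 0x62513ba0, the same number that appears in the OOPS message, and in_kernel_half rejects it.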

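To get past this OOPS, we can rewrite the previous code snippet so that the wait queue head no longer comes from __kmalloc. Bear in mind that wait_object then becomes a local variable, so the handle returned to the caller is only valid until the function returns; this is enough to let the module load, but it is not a safe long-term fix: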

111 KCL_WAIT_ObjectHandle ATI_API_CALL KCL_WAIT_CreateObject(void)
112 {
114     wait_queue_head_t wait_object;
118     init_waitqueue_head(&wait_object);
121     return (KCL_WAIT_ObjectHandle)&wait_object;
122 }
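A handle returned by a creator function such as KCL_WAIT_CreateObject is kept by the caller, so the object behind it has to stay alive after the function returns. As a minimal userspace sketch of that pattern, with malloc standing in for the kernel allocator and every name below (wait_head, wait_create, wait_destroy) hypothetical rather than part of the fglrx sources:

```c
#include <stdlib.h>

/* Hypothetical userspace stand-in for the kernel's wait_queue_head_t:
 * a lock word plus a circular list head. */
struct wait_head {
    int lock;
    struct wait_head *next;
    struct wait_head *prev;
};

typedef void *wait_handle;

/* Initializes an already allocated object; allocates nothing itself. */
void wait_head_init(struct wait_head *w)
{
    w->lock = 0;
    w->next = w;   /* an empty circular list points back to itself */
    w->prev = w;
}

/* Allocates the object on the heap so the handle stays valid after return.
 * Returns NULL if the allocation fails. */
wait_handle wait_create(void)
{
    struct wait_head *w = malloc(sizeof(*w));

    if (w)
        wait_head_init(w);
    return w;
}

/* Releases an object obtained from wait_create(). */
void wait_destroy(wait_handle h)
{
    free(h);
}
```

In kernel code the same shape would use kmalloc and kfree, both declared in <linux/slab.h>, with init_waitqueue_head doing the initialization.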

Loading the module

# uname -r

# insmod fglrx.ko

# dmesg|grep fglrx

[ 11.434392] [fglrx] Maximum main memory to use for locked dma buffers: 1885 MBytes.
[ 11.435076] [fglrx] vendor: 1002 device: 71c5 count: 1
[ 11.436082] [fglrx] ioport: bar 1, base 0x4000, size: 0x100
[ 11.436564] [fglrx] Kernel PAT support detected, disabling driver built-in PAT support
[ 11.436592] [fglrx] module loaded – fglrx 8.59.2 [Mar 13 2009] with 1 minors
[13380.033213] fglrx_pci 0000:01:00.0: setting latency timer to 64