MODEST: catching the sys_writev() system call

What MODEST couldn’t achieve

Until now, the MODEST LKM could not catch the sys_writev() system call. The limitation had only a small impact on everyday use of the module.

The theory

Reading the kernel sources, the sys_writev() implementation turns out to be not much different from that of sys_write(). The main difference is that it takes a pointer to an array of iovec structures instead of a single char * data buffer. That hardly matters: from a developer's point of view, calling sys_write() or sys_writev() inside the Linux kernel has much the same effect.

So I wrote a small C program to test the writev() syscall and work out what happens behind the scenes. It was actually a piece of cake. Here is the program, named sys-writev.c and included in the CVS project kmodest-src:

int main (int argc, char **argv){

    /* Just a vector of iovec entries, each pointing to one int */
    struct iovec myvec[DIMENSION];
    int i, fd, *pVec = NULL;

    /* Fill it */
    for(i=0; i<DIMENSION; i++){
        myvec[i].iov_len = sizeof(int);
        myvec[i].iov_base = malloc(sizeof(int));
        /* Store a value */
        memcpy(myvec[i].iov_base, &i, sizeof(i));
        /* Point to it */
        pVec = myvec[i].iov_base;
        printf("Entry: %d, size: %zu, @%p, value: %d\n", i, myvec[i].iov_len, myvec[i].iov_base, *pVec);
    }

    /* Create the file (open() returns -1 on error) */
    fd = open("./sys-writev.out", O_CREAT|O_TRUNC|O_WRONLY, S_IRWXU);
    if(fd != -1){
        /* Write the vector to disk...: */
        while(1){
            ssize_t vr = writev(fd, myvec, DIMENSION);
            if(vr==-1) perror("writev"); /* Error */
            else printf("Bytes written: %zd\n", vr);
            sleep(1);
            /* Now, do a sys_write() call : */
            write(fd, "WRITE", 5);
        }
        close(fd);
    }

    /* Release */
    for(i=0; i<DIMENSION; i++) free(myvec[i].iov_base);
    return 0;
}

Compiling with -static and stepping through the program in GDB, I realized that sys_writev() uses the same CPU registers as sys_write(), with practically the same meaning; all I needed was a trivial type cast somewhere in my code:

(gdb) info b
Num Type Disp Enb Address What
1 breakpoint keep y 0x080539a6 <writev+6>
breakpoint already hit 1 time
2 breakpoint keep y 0x08053920 <do_writev>
breakpoint already hit 1 time
3 breakpoint keep y 0x08053941 <do_writev+33>
(gdb) c
Continuing.

Breakpoint 3, 0x08053941 in do_writev ()
(gdb) info registers
eax 0x92 146
ecx 0xbfcc8748 -1077115064
edx 0xa 10
ebx 0x6 6

In the gdb output above, %eax holds the __NR_writev id (146 decimal) as usual, %ebx the file descriptor number (fd), %ecx the address of the const struct iovec * (here, the virtual memory address 0xbfcc8748), and %edx the number of iovec blocks to write: 0xA (10 decimal).

How to catch it

Catching this system call took little effort. First, I deleted the cmpl assembly instruction inside my own system call handler routine (syscall_entry.S). Then I added my own version of the sys_writev() kernel implementation to MODEST's code (modest.c):

    movl %ebx , userfd
    movl %ecx , memory
    movl %edx , bytes_to_read

    # NOTE:
    #   sys_write() stores just one memory buffer location in %ecx and its length in %edx.
    #   In the case of sys_writev(), %ecx stores the iovec address and %edx the number of iovec blocks of data.

    # Save all registers:
    SAVE_ALL

    # Call the new syscall handler, whose body is in modest.c.
    # The first argument to the handler (eax) contains the syscall_id.

    # First of all, I need the three arguments to these syscalls, stored in
    # CPU registers. According to the kernel API doc, eax contains the syscall_id,
    # and the registers ebx, ecx and edx hold the first, second and third
    # argument to the syscall.

    # Call my own handler (see modest.c):
    call    modest_syshandler

    # Restore registers to avoid crashes when I return from here
    # and pass control to original_syscall_handler:
    RESTORE_ALL

    # Push the address of the original handler onto the stack,
    # so when I exit with ret, EIP will point to this old syshandler:
    pushl original_syscall_handler

    # "ret" transfers control to the original handler.
    # The original handler will execute "iret" to end the interrupt.
    ret

ssize_t my_sys_writev (unsigned long fd, const struct iovec __user *vec, unsigned long vlen){

    struct file *flp = NULL;
    ssize_t ret = -EBADF;
    loff_t pos = 0;

    /* Get file pointer : */
    flp = fcheck_files (current->files , fd);
    if(!flp) return -EBADF;

    /* Lock the file by incrementing its ref count : */
    if (!atomic_inc_not_zero(&flp->f_count)) return -EINVAL;

    /* Okay, now get current offset : */
    pos = flp->f_pos;

    /* Call vfs_writev to write the data from this position : */
    ret = vfs_writev(flp,vec,vlen,&pos);
    flp->f_pos = pos;   /* Update file pos offset ! */

    /* Release the file ref count: */
    fput(flp);

    /* I still need to study these lines */
    if (ret > 0)
        current->wchar += ret;
    current->syscw++;

    /* Return the amount of data written to the file : */
    return ret;

}

To conclude, my high-level C handler needs to choose the action to take depending on which syscall id is received (that is, which syscall is being executed) by the current process p:

/* Depending on the syscall_id (sys_write or sys_writev), call the appropriate own syscall : */
switch(syscall_id){
    case __NR_write: my_sys_write(krn_fd, memory, bytes_to_read);
    break;
    /*
     * NOTE:
     * memory in writev is the location of the iovec pointer structure, and bytes_to_read
     * the number of iovec block vectors to be written, so I have to typecast "memory"
     */
    case __NR_writev: my_sys_writev(krn_fd, (const struct iovec __user*)memory, bytes_to_read);
    break;
}

Just a trivial typecast (converting from char __user * to const struct iovec __user *), and that's all. There is a new video in the project's CVS called “sys-writev-catched.ogg” illustrating it.

Logic approach

To picture what happens inside my handler with these small changes, I wrote a trivial first-order logic formula. Here it is:

∃x{P(x) ∧ ∃y[S(y) ∧ (C(x,a) ∨ C(x,b))]} → ∃u(H(u) ∧ E(u))

where:

P(x): “x is a process”; C(x,y): “y is executed or called by x”; S(x): “x is a syscall”; H(x): “x is a handler”; E(x): “x is executed”. The constants a and b stand for the two syscalls handled here, sys_write() and sys_writev().