Automatic job resubmit using Sun Grid Engine

Preamble

For small computational requirements, a queue called “comuna.q” has been set up on a GNU/Linux cluster running Sun Grid Engine (SGE) as its job scheduler and queue manager. This queue allows a job to run for at most 24 hours before it is terminated (killed). Sometimes this 24-hour time limit is simply too short for the job. In these cases, the job’s owner has to adjust some input files by hand and then re-submit the job by running the qsub command once again. Allowing a larger time limit on this queue is out of the question, and so is using the checkpointing and migration facilities. Thus, another approach is needed: automatic job resubmission.

Terminating a job

By default, SGE sends a SIGKILL signal to the job when the time limit is up. The same happens when the job’s owner issues a qdel command to remove a running job from the queue. SIGKILL cannot be caught, so a job that receives it simply ends abruptly. Luckily, SGE can be configured so that a given queue (comuna.q in our case) sends a different signal to the jobs running in it when the time limit is reached. Instead of SIGKILL, we can tell SGE to send SIGTERM, which can be intercepted or even ignored.

We tell SGE to send this new signal by configuring the queue with the qmon graphical utility, as shown below.

Setting up the comuna.q queue properly to send the SIGTERM signal to terminate jobs.
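The same setting can probably be inspected and changed from the command line with qconf rather than qmon. What follows is only a minimal sketch. It assumes that the queue attribute behind the qmon dialog is terminate_method, as described in the queue_conf(5) manual page, and that you hold SGE manager privileges. The command -v guard is an addition that keeps the snippet harmless on machines without SGE.

```shell
#!/bin/bash
# Sketch: inspect and set the termination signal of a queue with qconf.
# Assumption: the relevant queue attribute is terminate_method.
QUEUE="comuna.q"

if command -v qconf >/dev/null 2>&1; then
    # Show the current setting of the queue:
    qconf -sq "$QUEUE" | grep terminate_method
    # Ask SGE to deliver SIGTERM instead of the default SIGKILL
    # (requires manager privileges):
    qconf -mattr queue terminate_method SIGTERM "$QUEUE"
else
    echo "qconf not found: run this on an SGE admin or submit host"
fi
```

Running the qconf -sq line again afterwards is a quick way to check that the change has been stored.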

Trapping the SIGTERM signal in the job script

Now that the job receives a signal we can trap, we have to alter our job script accordingly. We want the job to be re-submitted automatically whenever it has been terminated before finishing. If our job script executes the trap function, we know SGE has decided to kill it, probably because the time limit was reached. However, the job also receives this signal when we terminate it ourselves by running the qdel command. Therefore, we need a way to decide whether we want the job to be resubmitted or not. An auxiliary file will do: if it exists, the job is not resubmitted.

The trap function could look like:

function resubmit_myself () {
        # Don't resubmit if we don't want to:
        test -f "$NO_RESUBMIT" && exit 0
        # Otherwise, resubmit myself:
        $QSUB ./job.sh
}

We then tell our job script to trap the SIGTERM signal, and only that signal, and to run the resubmit_myself function we have just written when it arrives:

trap resubmit_myself SIGTERM
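To see the trap mechanism on its own, without SGE, the following sketch installs a handler and then sends SIGTERM to the very shell that runs it. The names on_sigterm and HANDLER_RAN are made up for this illustration, and TERM is used as the portable spelling of SIGTERM.

```shell
#!/bin/bash
# Sketch: show that a SIGTERM handler runs instead of the shell dying.
HANDLER_RAN="no"

on_sigterm () {
    HANDLER_RAN="yes"
    echo "SIGTERM caught; this is where the job would resubmit itself"
}

trap on_sigterm TERM

# Send SIGTERM to this very shell, as SGE would when the time limit is reached:
kill -TERM $$

echo "Handler ran: $HANDLER_RAN"
```

Note that bash only runs a trap handler once the foreground command it is waiting for has finished. In a job script, the handler therefore fires after the program started by the script has itself ended.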

A more or less complete job script could look like:

#!/bin/bash
NO_RESUBMIT="./resubmit.no"
QSUB="/opt/sge/bin/lx26-amd64/qsub"
#------------------------------------------------------------------
#       resubmit_myself()
#               This function is triggered when a SIGTERM signal
#               is received. This signal arrives whenever a "qdel"
#               command is issued, or when the time limit has
#               expired
#------------------------------------------------------------------
function resubmit_myself () {
        # Don't resubmit if we don't want to:
        test -f "$NO_RESUBMIT" && exit 0
        # Otherwise, resubmit myself:
        $QSUB ./job.sh
}
 
trap resubmit_myself SIGTERM
 
#$ -S /bin/bash      
#$ -N AutoResubmitSample
 
#$ -o $JOB_NAME.$JOB_ID.out
#$ -e $JOB_NAME.$JOB_ID.err
 
#$ -q comuna.q
 
#$ -cwd
 
echo 'Job started at '
date
 
cd "$SGE_O_WORKDIR"
# Run the actual job:
./myjob
 
# Job finished
echo 'Job ended at'
date

Running our job and letting it resubmit itself

As long as the resubmit.no file is not in our work directory and the job does not finish on its own (that is, by calling the _exit() function rather than being ended by the SIGTERM signal), our trap function will be executed and the job re-submitted as soon as SGE kills it for exceeding the time limit.

To test it, we can submit the job as usual:

qsub ./job.sh

Then we just wait for the 24-hour time limit to be reached. If our job is still running at that point, it will be terminated and then re-submitted to the same queue, again and again until it completes.

If at some point we do not want this job to be re-submitted, we can prevent it by creating the auxiliary file mentioned earlier:

touch resubmit.no

We can also remove the job ourselves at any time by running the qdel command, but we have to create the auxiliary file first. Thus:

touch resubmit.no

qdel job-id
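The effect of the auxiliary file can be checked without touching the queue at all. The sketch below is a dry run in which echo stands in for qsub and a scratch directory stands in for the work directory. The decide function and the two result variables are hypothetical names. The function mirrors the test made in resubmit_myself, but prints a message instead of exiting so that both cases can be shown in one run.

```shell
#!/bin/bash
# Dry run of the resubmit decision, with echo standing in for qsub.
WORKDIR=$(mktemp -d)                 # scratch directory for the experiment
NO_RESUBMIT="$WORKDIR/resubmit.no"
QSUB="echo qsub"                     # stand-in: prints the command instead of running it

decide () {
    # Same test as in resubmit_myself, but without exiting the shell:
    if test -f "$NO_RESUBMIT"; then
        echo "marker present: not resubmitting"
    else
        $QSUB ./job.sh
    fi
}

WITHOUT_MARKER=$(decide)             # resubmit.no does not exist yet
touch "$NO_RESUBMIT"
WITH_MARKER=$(decide)                # resubmit.no now exists

echo "$WITHOUT_MARKER"
echo "$WITH_MARKER"
rm -rf "$WORKDIR"
```

The first line printed is the qsub command that would have been issued, and the second one reports that the marker file blocked the resubmission.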

Clearly, a real trap function will have to do more work than the one in our sample script. Normally some adjustments are required, such as renaming input data files, before a job can be submitted to an SGE queue again. This is easily done by extending the trap function.
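As an illustration of such an extension, here is a minimal sketch of a trap function that renames files before resubmitting. The file names input.dat and restart.dat and the helper prepare_next_run are hypothetical. They assume a program that reads input.dat and periodically writes its state to restart.dat, which may not match your application at all.

```shell
#!/bin/bash
# Sketch of a trap function that tidies up before resubmitting.
# input.dat and restart.dat are hypothetical file names.
NO_RESUBMIT="./resubmit.no"
QSUB="${QSUB:-/opt/sge/bin/lx26-amd64/qsub}"

prepare_next_run () {
    # If the program left a restart file, keep the old input under a
    # time-stamped name and promote the restart file to be the new input:
    if test -f restart.dat; then
        test -f input.dat && mv input.dat "input.dat.$(date +%Y%m%d%H%M%S)"
        mv restart.dat input.dat
    fi
}

resubmit_myself () {
    # Don't resubmit if we don't want to:
    test -f "$NO_RESUBMIT" && exit 0
    # Otherwise, adjust the input files and resubmit myself:
    prepare_next_run
    $QSUB ./job.sh
    exit 0
}

trap resubmit_myself TERM
```

Keeping the file handling in a function of its own makes it possible to try it out by hand in a scratch directory before relying on it inside a job.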