Changing the STACKSIZE (ulimit -s) limit whenever running mpirun

Preamble

Today, some basic SysOp stuff. 😉

Sometimes, when the stack size limit is not large enough to run parallel jobs with OpenMPI, and there is no scheduler such as SGE or Torque around, it seems nearly impossible to apply the "ulimit -s" command so that a job runs without the awful "segmentation fault" error.

This is so because, whenever an MPI job is executed by means of the mpirun command, a non-interactive shell is spawned in turn on every single host where the job is going to run. Therefore, if we have a line like this in our .bashrc file:

ulimit -s unlimited

it will be ignored.
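
A quick way to see this in action: run ulimit through a non-interactive shell on a node, for instance via ssh. With a stock Debian/Ubuntu .bashrc, where the interactivity guard sits above our ulimit line, the default limit shows up no matter what we added (node3 is just an example host here, and 8192 KiB is the usual default):

ssh node3 'ulimit -s'
8192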

One solution (per-job ulimit)

Bearing this in mind, that is, knowing that a non-interactive shell is run on every single compute node together with orted and our actual job, we can use a simple trick (a bit nasty, admittedly) to make sure the ulimit command is executed right before our job:

mpirun --hostfile <hosts_file> -np <number_of_processes> bash -c "ulimit -s unlimited && our_binary_to_be_executed"

This runs the commands one after another: we explicitly spawn a new shell (bash in this case) on every single compute node involved, and that shell raises the limit and then executes our job. This lets us set the stack size on a per-job basis, so to speak. But it is a bit nasty, because an extra shell is launched on every single node and a bit more work takes place before our job actually runs. Not an elegant solution. Besides, we have to tell our users to modify the way they invoke the mpirun command and any paired scripts, if any.
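
For instance, with the placeholders filled in, a run could look like this (hosts_file and the process count are merely illustrative):

mpirun --hostfile hosts_file -np 8 bash -c "ulimit -s unlimited && ./our_binary_to_be_executed"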

Altering the stacksize limit for every single node once and for all (default per-node stack size)

There's another approach we can take: change the default stack size per node once and for all. This is more elegant and clean, because users don't need to tinker with the way they run the mpirun command, and no extra shell is spawned. Obviously, the problem now is that every single non-interactive shell for every single user is going to get this new limit (or no limit at all). In order to accomplish this, all we have to do is change the /etc/bash.bashrc file so that, instead of:

[ -z "$PS1" ] && return

we have this:

if [ -z "$PS1" ]; then
       ulimit -s unlimited
       return
fi

and the next non-interactive shell will execute the ulimit command before whatever job comes next. Note that keeping the return preserves the original behaviour, so the rest of the file is still skipped for non-interactive shells.
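
A quick check from another machine should now show the new limit for non-interactive shells (node3 is, again, just an example host; this assumes bash on the node reads /etc/bash.bashrc for shells spawned by sshd, as it does on our nodes):

ssh node3 'ulimit -s'
unlimited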

An example

Below, a trivial MPI Hello World application running without the per-job trick on a standard compute node whose default stack size has not been changed:
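
The source of mpi_hello_world is not shown here, but a minimal sketch consistent with the output would look something like this: it prints the soft stack limit in KiB, the way ulimit -s does, with -1 standing for unlimited:

#include <stdio.h>
#include <sys/resource.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    /* Report the soft stack limit in KiB, like ulimit -s does;
       -1 means RLIM_INFINITY, i.e. unlimited. */
    struct rlimit rl;
    getrlimit(RLIMIT_STACK, &rl);
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("Limit: -1\n");
    else
        printf("Limit: %ld\n", (long)(rl.rlim_cur / 1024));

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}

Build it with mpicc mpi_hello_world.c -o mpi_hello_world and it is ready to go.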

mpirun --hostfile hosts_file ./mpi_hello_world
Limit: 8192
Hello world from processor node3, rank 0 out of 8 processors
Limit: 8192
Hello world from processor node3, rank 1 out of 8 processors
Limit: 8192
Hello world from processor node3, rank 2 out of 8 processors
Limit: 8192 …..>

To conclude our discussion, here is the very same program, now running on a compute node where the small trick of changing the default stack size for non-interactive shells has been applied (the default per-node stack size trick):

mpirun --hostfile hosts_file ./mpi_hello_world
Limit: -1
Hello world from processor node3, rank 0 out of 8 processors
Limit: -1
Hello world from processor node3, rank 1 out of 8 processors
Limit: -1
Hello world from processor node3, rank 2 out of 8 processors
Limit: -1
Hello world from processor node3, rank 3 out of 8 processors
Limit: -1 ….>

Next time, folks, do yourselves a favour and use SGE or Torque! 😉