HPCC user guide - MPI jobs

MPI (Message Passing Interface) lets a job use multiple processors across multiple nodes. This matters less now that we have nodes with 64 cores and 512GB RAM: most jobs can run on one physical node and therefore do not need to pass messages between nodes.

The procedure on CentOS 5 differs from CentOS 6. The instructions here are for CentOS 6, as the remaining CentOS 5 nodes are gradually being upgraded.

There is an important environment variable, $PBS_NODEFILE, which holds the path to a file listing the nodes your job has been assigned. You use this to pass the list of nodes to the MPI programme.
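As a minimal sketch of how a job script might use this variable, the snippet below counts the processor slots the job holds. The node names are made up for illustration; inside a real job, Torque sets PBS_NODEFILE for you and you would not create the file yourself.

```shell
# Simulate the node file Torque would provide (illustration only).
PBS_NODEFILE=$(mktemp)
printf 'node01\nnode01\nnode02\nnode02\n' > "$PBS_NODEFILE"

# The file has one line per allocated processor slot, so counting
# lines gives the total number of slots the job holds.
NPROCS=$(wc -l < "$PBS_NODEFILE")
echo "Allocated $NPROCS processor slots"

rm -f "$PBS_NODEFILE"
```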

The use of pbsdsh and mpirun

pbsdsh is supplied as part of Torque, so it already knows which nodes your job has been assigned. You can use it to launch multiple copies of your programme. For example, pbsdsh script.sh launches a copy of script.sh on each processor you have been assigned, gathering the output and error files.

Example 1

qsub  parallel-simple.sh

parallel-simple.sh contains

#!/bin/sh
#PBS -l cput=55:30:00,walltime=100:00:00,nodes=2:ppn=2:centos6
#PBS -m abe
#PBS -M mjm4y@udcf.gla.ac.uk
/usr/local/bin/pbsdsh script.sh

This will launch 4 copies of script.sh: 2 copies on each of the two nodes you have been assigned.

If you need the processes to communicate with each other, it is up to you to write that code.
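One common pattern that avoids message passing altogether is to have each copy of the script pick its own share of the work. The sketch below assumes Torque's pbsdsh behaviour of setting PBS_VNODENUM to a distinct task number (0, 1, 2, ...) for each copy it launches; the chunk-file naming scheme is hypothetical.

```shell
# Sketch of a script.sh that each pbsdsh copy might run.
# PBS_VNODENUM is set by pbsdsh for each task; default to 0 so the
# script also runs standalone.
TASK=${PBS_VNODENUM:-0}
echo "Task $TASK running on host $(hostname)"

# Each task selects its own input by task number, e.g. files named
# chunk0, chunk1, ... (hypothetical naming scheme).
INPUT="chunk${TASK}"
echo "Task $TASK would process $INPUT"
```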

A more complicated example using mpirun is shown next:

Example 2

qsub parallel-complex.sh

parallel-complex.sh contains

#!/bin/sh
#PBS -l cput=55:30:00,walltime=100:00:00,nodes=2:ppn=16:centos6
#PBS -m abe
#PBS -M mjm4y@udcf.gla.ac.uk
/usr/lib64/openmpi/bin/mpirun -machinefile $PBS_NODEFILE -np 32 /export/home/mjm4y/thread.sh

We use mpirun to manage the processes across nodes. The environment variable $PBS_NODEFILE holds the path to a file listing the nodes assigned to your job.

This job will then run on 16 processors on each of 2 nodes; the mpirun command will handle message passing between the nodes.

If you are using an MPI-aware programme, you should find that it can communicate between the processes on different nodes.

If you want to handle your own message passing, you need to be aware of the $PBS_NODEFILE variable. The file it points to lists the nodes you have been allocated; a particular node appears twice if you have been allocated two slots on it. To see this, try qsub -I -l nodes=2:ppn=2. In the interactive shell you are given, type cat $PBS_NODEFILE. The output is 4 lines; each line is the hostname of a node, and each node is repeated twice (because you asked for 2 processors on each node).
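If you are writing your own launcher, you will usually want the list of distinct nodes rather than the raw file. A small sketch, again with simulated file contents (in a real job you would read the actual $PBS_NODEFILE):

```shell
# Simulated node file: two slots on each of two nodes.
PBS_NODEFILE=$(mktemp)
printf 'nodeA\nnodeA\nnodeB\nnodeB\n' > "$PBS_NODEFILE"

# Distinct node names, e.g. for an ssh loop of your own:
UNIQUE_NODES=$(sort -u "$PBS_NODEFILE")
echo "$UNIQUE_NODES"

# Number of slots allocated on each node:
sort "$PBS_NODEFILE" | uniq -c

rm -f "$PBS_NODEFILE"
```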

A temporary folder is created at /tmp/$PBS_JOBID on each node your job uses; these folders are deleted when your job ends.
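Because that folder disappears when the job ends, any results written there must be copied home before the job script exits. A minimal sketch, with a fake job ID so it also runs outside a job (inside a real job, Torque sets PBS_JOBID and the folder already exists):

```shell
# Fake job ID for a standalone run; Torque sets the real one.
PBS_JOBID=${PBS_JOBID:-123456.fake}
SCRATCH="/tmp/$PBS_JOBID"
# -p is harmless if the folder already exists, and creates it for
# this standalone run.
mkdir -p "$SCRATCH"

# Work in fast local scratch, then copy results back before the
# job ends, since /tmp/$PBS_JOBID is deleted afterwards.
echo "result data" > "$SCRATCH/out.txt"
RESULT=$(cat "$SCRATCH/out.txt")
echo "$RESULT"

rm -rf "$SCRATCH"   # cleanup for the standalone run
```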