NOTE: BEWARE OF SEMAPHORE EXHAUSTION!
If your MPI jobs crash, they may leave abandoned semaphores on the compute nodes. Since each machine has a limit of 32 semaphores, this may block other users from running MPI codes. If you get an error from an MPI job that looks something like this:
p4_error semget failed for setnum: 0
then most likely the quota of semaphores on a machine is exhausted. Since you have no control over which machine(s) the scheduler places your MPI job on, you will get unpredictable results until this is resolved.
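If you want to check how close a node is to its limit, the following is a minimal sketch. It assumes a Linux node whose `ipcs -s` prints one row per semaphore array with a numeric id in the second column; the function names `count_sem_arrays` and `report_sem_usage` are made up for this illustration and are not part of any cluster tool.

```shell
# Count semaphore arrays in `ipcs -s` style output read from stdin.
# Assumption: data rows carry a numeric semid in column 2; header rows do not.
count_sem_arrays() {
    awk '$2 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}

# Print the number of arrays in use on this node, then the configured limits.
# Assumption: this ipcs accepts -l together with -s (as the Linux version does).
report_sem_usage() {
    echo "semaphore arrays in use: $(ipcs -s | count_sem_arrays)"
    ipcs -s -l
}
```

Running `report_sem_usage` on a node (or through `cluster-fork`, provided the function is available there) gives a rough picture of usage against the limit.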
What should I do when I see a semaphore-related error?
Unfortunately, anyone who leaves abandoned semaphores on the cluster can block the parallel calculations of everyone else. Furthermore, if there are unreleased semaphores on the system, only the user who created them or the superuser can clean them up. Therefore, if you get an error message like the one you see above, you should first determine if you have left any open semaphores on the system (see below). If not, you should e-mail nninhelp@seas.harvard.edu immediately and report the problem. This will get top priority since it is a real show-stopping problem.
How do I see what semaphores I have created?
To see semaphores created by you across the cluster, execute:
> cluster-fork ipcs -s
To clean up semaphores that you have created, run:
> cluster-fork /opt/mpich/intel9/sbin/cleanipcs
This path may vary depending on the MPICH version used. See the write-up on modules for information on other MPI installations. Currently (11/15/07) only MPICH1 is in use for condor, so there should be no ambiguity for now.
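If the cleanup script is not at the path you expect, you can approximate the same cleanup by hand with the standard `ipcs` and `ipcrm` tools. The sketch below is not the MPICH script itself. It assumes the Linux `ipcs -s` column layout (key, semid, owner, perms, nsems) and an `ipcrm` that accepts `-s <semid>`; the function names `my_semids` and `clean_my_semaphores` are hypothetical.

```shell
# Print the ids of semaphore arrays owned by the given user.
# Reads `ipcs -s` style output on stdin.
# Assumption: columns are key, semid, owner, perms, nsems.
my_semids() {
    awk -v user="$1" '$3 == user && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Remove every semaphore array owned by the given user (default: $USER)
# on the node where this runs.
# Assumption: this ipcrm supports the `-s <semid>` form.
clean_my_semaphores() {
    ipcs -s | my_semids "${1:-$USER}" | while read -r id; do
        ipcrm -s "$id"
    done
}
```

To do a dry run first, `ipcs -s | my_semids "$USER"` only lists the ids that `clean_my_semaphores` would remove. Make sure none of your MPI jobs are still running on the node before you remove anything.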