Slurm, formerly known as SLURM (Simple Linux Utility for Resource Management), is a powerful computational workload scheduler used on many of the world's largest supercomputers. Its main function is to allocate computing resources to workloads submitted to a compute cluster, like BioHPC's Nucleus. When workloads outnumber the available resources, Slurm manages a fair-share resource allocation system, placing workloads in queues until they can be executed.
For more detailed information about Slurm and its latest documentation, please refer to the official Slurm website.
Job - A unit of work submitted to Slurm, consisting of one or more tasks.
Task - A single executable program within a job.
Partition - A logical division of the cluster with its own resource limits and scheduling policies. Also known as a job queue.
Reservation - A block of computing resources pre-allocated for specific jobs or users at a specified time.
Node - A single computer within a cluster, containing a motherboard, CPU, RAM, and possibly a GPU.
Compute Node - Nodes within the Slurm cluster that run the jobs. Each compute node contains two sockets.
Socket - A physical slot on a node where a physical CPU is installed. Each physical CPU contains cores with direct access to the node's RAM.
Core - A single physical processor unit within the CPU capable of performing computations. Each physical core comprises two logical cores, allowing it to process two threads concurrently.
Thread - A sequence of computer instructions that can be processed independently by a single logical core.
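These terms map directly onto the resource options used when submitting work to Slurm. The sketch below is illustrative only (the job name, partition, and program are placeholders): it requests one node, four tasks, and two logical cores per task.
#!/bin/bash
#SBATCH --job-name=glossary_demo   # job: this unit of work (placeholder name)
#SBATCH --partition=super          # partition: the queue the job waits in
#SBATCH --nodes=1                  # node: one physical computer
#SBATCH --ntasks=4                 # tasks: four independent programs
#SBATCH --cpus-per-task=2          # threads: two logical cores per task
#SBATCH --time=0-00:10:00          # wallclock limit, D-H:M:S

srun ./my_program                  # launch the four tasks on the allocated node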
Job Handling:
Resource Management:
Typical Slurm Workflow:
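As a rough illustration of that workflow (the script name and job ID below are placeholders), a session usually alternates between submitting, monitoring, and reviewing jobs:
sbatch myjob.sh          # 1. submit a batch script; Slurm replies with a job ID
squeue -u $USER          # 2. monitor the job while it is pending or running
scancel <jobid>          # 3. (optional) cancel the job if something goes wrong
sacct -j <jobid>         # 4. review the accounting record after the job finishes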
Slurm partitions are separate collections of nodes. BioHPC has a total of 500 compute nodes as of 2018, of which 494 are addressable by Slurm. These nodes are grouped into partitions based on their hardware and memory capacity, and some partitions share nodes with others. The partitions are as follows:
| Partition | Nodes | Node List | CPU | Physical (Logical) Cores | Memory Capacity (GB) | GPU |
|-----------|-------|-----------|-----|--------------------------|----------------------|-----|
| 32GB | 280 | NucleusA[002-241],NucleusB[002-041] | Intel E5-2680 | 16 (32) | 32 | N/A |
| 128GB | 24 | Nucleus[010-033] | Intel E5-2670 | 16 (32) | 128 | N/A |
| 256GB | 78 | Nucleus[034-041,050-081,084-121] | Intel E5-2680v3 | 24 (48) | 256 | N/A |
| 256GBv1 | 48 | Nucleus[126-157,174-189] | Intel E5-2680v4 | 28 (56) | 256 | N/A |
| 384GB | 2 | Nucleus[082-083] | Intel E5-2670 | 16 (32) | 384 | N/A |
| GPU | 40 | Nucleus[042-049],NucleusC[002-033] | various | various | various | Tesla K20/K40/P4/P40 |
| GPUp4 | 16 | NucleusC[002-017] | Intel Gold 6140 | 36 (72) | 384 | Tesla P4 |
| GPUp40 | 16 | NucleusC[018-033] | Intel Gold 6140 | 36 (72) | 384 | Tesla P40 |
| GPUp100 | 12 | Nucleus[162-173] | Intel E5-2680v4 | 28 (56) | 256 | Tesla P100 (2x) |
| GPUv100 | 2 | NucleusC[034-035] | Intel Gold 6140 | 36 (72) | 384 | Tesla V100 16GB (2x) |
| GPUv100s | 10 | NucleusC[036-045] | Intel Gold 6140 | 36 (72) | 384 | Tesla V100 32GB (1x) |
| GPU4v100 | 12 | NucleusC[070-081] | Intel Gold 6240 | 36 (72) | 376 | Tesla V100 32GB (4x) |
| GPUA100 | 16 | NucleusC[086-101] | Intel Gold 6240 | 36 (72) | 1423 | Tesla A100 40GB (1x) |
| GPU4A100 | 10 | NucleusC[102-111] | Intel Gold 6354 | 36 (72) | 977 | Tesla A100 80GB (4x) |
| PHG | 8 | Nucleus[122-125,158-161] | Intel E5-2680v3 | 24 (48) | 256 | N/A |
| super | 432 | All non-GPU and non-PHG nodes | various | various | various | N/A |
If a partition is not explicitly specified upon job submission, Slurm will allocate your job to the 128GB partition by default. The PHG partition is only available for the PHG group.
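To target a different partition, pass it explicitly at submission time. A minimal sketch (the script name is a placeholder):
sbatch -p 256GB myjob.sh           # request the 256GB partition on the command line

# or set it inside the batch script itself:
#SBATCH --partition=256GB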
Figure 1 shows a high-level view of the computational resources available on a node similar to one of Nucleus' 128GB partition nodes.
We see that the node contains:
- 2 sockets, each holding a physical CPU with 8 cores (16 physical cores per node)
- 2 logical cores (threads) per physical core
- 128 GB of RAM shared by both CPUs
So the maximum number of threads the node can process concurrently at any one time is:
2 sockets x 8 cores per socket x 2 threads per core = 32 threads
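If you want to verify this layout for yourself, sinfo can report each node's socket:core:thread geometry and memory. A minimal sketch (output will vary by node):
# %N = node name, %z = sockets:cores:threads, %c = CPUs, %m = memory in MB
sinfo -N -p 128GB -o "%N %z %c %m"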
When a job is submitted to Slurm, it is assigned a job state that describes the current status of the job, providing insight into the job's progress, resource usage, and potential issues that may require intervention.
Figure 2 outlines several common job states that a user will encounter, connected by typical transitions between the states.
Below are the descriptions for each of these job states, along with its shorthand code in parentheses - STATE (ST):
PENDING (PD): The job is waiting in the queue to be scheduled and run. It is waiting for resources to become available or for other conditions to be met.
CONFIGURING (CF): The job has been allocated nodes and is in the process of setting up before execution begins.
RUNNING (R): The job is currently executing on the allocated nodes.
RESIZING (RS): The job is about to change size by increasing or decreasing the number of nodes, CPUs, or amount of memory it is using. If the job requests additional resources that are unavailable, it may be sent back to the PENDING state.
SUSPENDED (S): The job has been temporarily stopped and is not currently executing, but it retains its allocation of nodes and can be resumed later.
COMPLETING (CG): The job is in the process of completing. Some processes may still be running, but Slurm is in the process of cleaning up and finalizing the job.
COMPLETED (CD): The job has completed successfully, and all processes have exited with an exit code of zero.
CANCELLED (CA): The job was cancelled by the user or system administrator before it could complete.
FAILED (F): The job terminated with a non-zero exit code, indicating that it did not complete successfully.
TIMEOUT (TO): The job was terminated because it exceeded the time limit specified for the job or partition.
For more details on these listed job states and others, visit Slurm's documentation.
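As a quick illustration (the job ID is a placeholder), the state code can be read directly from squeue while a job is queued or running, and from sacct once it has finished:
squeue -j 4298643 -o "%i %t %T"                    # compact (%t) and full (%T) state of an active job
sacct -j 4298643 --format=JobID,State,ExitCode     # final state of a completed job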
For more information on each command shown below, type man <command> when logged into Nucleus.
[s219741@Nucleus005 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
super     up    infinite    4  drain* Nucleus[013-014,051],NucleusA045
super     up    infinite   15  drng   NucleusA[004-008,011-012,017,019-025]
super     up    infinite   25  drain  NucleusA[003,009-010,013-016,018,026-041,046]
super     up    infinite  271  alloc  Nucleus[010-012,016-017,024-041,050,052-053,055-075,077-082,099-112,114-115,119-148,158-161,174-177,184-194,196],NucleusA[043-044,047-092,132-197,209,236],NucleusB[006-041]
super     up    infinite  101  idle   Nucleus[015,018-023,054,076,083-098,113,116-118,149-157,178-183,195],NucleusA[042,093-131],NucleusB[042-057]
256GB     up    infinite    1  drain* Nucleus051
256GB     up    infinite   57  alloc  Nucleus[034-041,050,052-053,055-071,073-075,077,080-081,099-110,114-115,120,122-125,158-161]
256GB     up    infinite   21  idle   Nucleus[054,076,084-098,113,116-118]
384GB     up    infinite    1  alloc  Nucleus082
384GB     up    infinite   17  idle   Nucleus083,NucleusB[042-057]
128GB*    up    infinite    2  drain* Nucleus[013-014]
128GB*    up    infinite   15  alloc  Nucleus[010-012,016-017,024-033]
128GB*    up    infinite    7  idle   Nucleus[015,018-023]
256GBv1   up    infinite   39  alloc  Nucleus[126-148,174-177,184-194,196]
256GBv1   up    infinite   16  idle   Nucleus[149-157,178-183,195]
GPU       up    infinite   50  alloc  Nucleus[042-043,045],NucleusC[002-003,005-008,016-017,019,022-026,028-031,036-046,048-057,059-066]
GPU       up    infinite   21  idle   Nucleus[044,046-049],NucleusC[009-015,018,020-021,027,032-033,067-069]
32GB      up    infinite    1  drain* NucleusA045
32GB      up    infinite   15  drng   NucleusA[004-008,011-012,017,019-025]
32GB      up    infinite   25  drain  NucleusA[003,009-010,013-016,018,026-041,046]
32GB      up    infinite  157  alloc  Nucleus[072,078-079,111-112,119,121],NucleusA[043-044,047-092,132-197],NucleusB[006-041]
32GB      up    infinite   40  idle   NucleusA[042,093-131]
GPUp4     up    infinite    8  alloc  NucleusC[002-003,005-008,016-017]
GPUp4     up    infinite    7  idle   NucleusC[009-015]
GPUp40    up    infinite   10  alloc  NucleusC[019,022-026,028-031]
GPUp40    up    infinite    6  idle   NucleusC[018,020-021,027,032-033]
GPUp100   up    infinite    1  drain* Nucleus166
GPUp100   up    infinite    2  drain  Nucleus[171-172]
GPUp100   up    infinite    5  alloc  Nucleus[162,167-170]
GPUp100   up    infinite    3  idle   Nucleus[163-165]
GPUv100s  up    infinite   30  alloc  NucleusC[036-057,059-066]
GPUv100s  up    infinite    3  idle   NucleusC[067-069]
GPU4v100  up    infinite   10  alloc  NucleusC[070-075,077-080]
GPU4v100  up    infinite    1  idle   NucleusC076
GPUA100   up    infinite    1  drain  NucleusC096
GPUA100   up    infinite   15  alloc  NucleusC[086-095,097-101]
GPU4A100  up    infinite   10  alloc  NucleusC[102-111]
512GB     up    infinite    1  down*  NucleusB079
512GB     up    infinite    1  drain  NucleusB107
512GB     up    infinite   80  alloc  NucleusA[198-241],NucleusB[002-005,058-064,075-078,080-083,086-087,092-106]
512GB     up    infinite   16  idle   NucleusB[065-074,084-085,088-091]
GPU4H100  up    infinite    1  down*  NucleusC122
GPU4H100  up    infinite    3  drain  NucleusC[118-120]
GPU4H100  up    infinite    1  alloc  NucleusC121
[s219741@Nucleus005 ~]$ sbatch testjob.sh
Submitted batch job 4298643
[s219741@Nucleus005 ~]$ srun -n 1 -p 256GB hostname
Nucleus075
[s219741@Nucleus005 ~]$ srun -n 1 -p 256GB python hello.py
Hello World, from Nucleus075
Start an interactive job on allocated resources to manually run programs on BioHPC compute nodes.
Helpful for interactive testing and debugging before submitting batch jobs.
--pty
- This key flag requests a pseudo-terminal (PTY) allocation. This is essential for interactive sessions, as it allows proper interaction with programs that expect a terminal-like environment.
/bin/bash
- Specifies that you want to launch a Bash shell, providing you with a command-line interface on the allocated compute node.
[s219741@Nucleus005 ~]$ srun -n 1 -p 128GB --pty /bin/bash
[modulestats] Wrapper already loaded
[s219741@Nucleus030 ~]$ module load python
[s219741@Nucleus030 ~]$ python hello.py
Hello, world!
Below is an example of the table returned when squeue is run. Each row lists the job ID, partition, job name, user, state (ST), elapsed run time, number of nodes, and the node list (or the reason the job is still pending):
[s219741@Nucleus005 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4174562 32GB TEM.1 s177440 R 100-05:01:24 4 Nucleus[079,111,119],NucleusB041
4175483 32GB TEM.2 s177440 R 99-03:36:42 4 Nucleus112,NucleusB[038-040]
4178109 32GB Alpha_99 s224996 R 97-02:03:47 7 NucleusA[192-195],NucleusB[027,029-030]
4178121 256GBv1 Alph_Ful s224996 R 97-01:54:13 2 Nucleus[184-185]
4185089 GPUv100s bash s185484 R 91-21:36:50 1 NucleusC051
4209196 32GB AlPha_99 s224996 R 72-00:17:48 7 NucleusA[132-138]
4215457 128GB PDAC s202223 R 69-03:18:36 1 Nucleus028
4235967 32GB TEM.3 s177440 R 52-03:09:46 4 NucleusB[034-037]
4236138 256GBv1 AlPh_Str s224996 R 51-21:45:05 2 Nucleus[177,194]
4242976 32GB R164S.1 s177440 R 49-02:45:18 4 NucleusB[010-011,025-026]
4243043 32GB R164S.2 s177440 R 49-01:08:47 4 NucleusB[006-009]
4243044 32GB R164S.3 s177440 R 49-01:07:55 4 NucleusA[173-176]
4243045 32GB R164N.1 s177440 R 49-01:07:00 4 NucleusA[163-164,171-172]
4243047 32GB R164N.2 s177440 R 49-01:06:06 4 NucleusA[159-162]
4249997 32GB R164N.3 s177440 R 41-01:33:00 4 NucleusA[154-156,158]
4251687 512GB BATCH_DH ansir_fm R 36-23:55:53 1 NucleusB063
4252072 512GB BATCH_TR ansir_fm R 36-06:09:03 1 NucleusA205
-u <username>
- Request jobs or job steps from a comma-separated list of one or more users.
Helpful for users to check the status of their submitted jobs on Nucleus.
[s219741@Nucleus005 ~]$ squeue -u s219741
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4298707 32GB TestJob s219741 R 0:10 1 NucleusA049
4298708 32GB TestJob s219741 R 0:07 1 NucleusA140
4298709 32GB TestJob s219741 R 0:05 1 NucleusA141
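If the default columns truncate long job names, squeue's output can be customized with -o/--format. A hedged sketch (the field widths are arbitrary):
# %i job ID, %P partition, %j job name, %t state, %M elapsed time, %S (expected) start time, %R reason/node list
squeue -u s219741 -o "%.10i %.10P %.20j %.3t %.12M %.20S %R"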
[s219741@Nucleus005 ~]$ scontrol show job 4298714
JobId=4298714 JobName=TestJob
   UserId=s219741(219741) GroupId=biohpc_admin(1001) MCS_label=N/A
   Priority=4294791033 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:13 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2024-02-29T15:02:48 EligibleTime=2024-02-29T15:02:48
   StartTime=2024-02-29T15:02:49 EndTime=2024-02-29T17:02:49 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=32GB AllocNode:Sid=Nucleus005:46165
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=NucleusA186
   BatchHost=NucleusA186
   NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,mem=28G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=28G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/home2/s219741/testjob.sh
   WorkDir=/home2/s219741
   StdErr=/home2/s219741/job_4298714.err
   StdIn=/dev/null
   StdOut=/home2/s219741/job_4298714.out
   Power=
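scontrol can also modify certain attributes of your own jobs, typically while they are still pending and subject to site policy; the sketch below is illustrative (the job ID and values are placeholders), and only administrators can raise a time limit:
scontrol update JobId=4298714 TimeLimit=01:00:00   # lower the job's time limit
scontrol update JobId=4298714 Partition=256GB      # move a pending job to another partition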
Useful for cancelling jobs that are unresponsive or need to be stopped manually.
scancel Example
[s219741@Nucleus004 ~]$ squeue -u 219741
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4327688 32GB webGUI s219741 R 12:53:22 1 NucleusA147
4327759 32GB webGUI s219741 R 5:32:55 1 NucleusA082
4327931 128GB nf-callP s219741 R 2:58:57 1 Nucleus013
4327937 128GB nf-callP s219741 R 2:56:59 1 Nucleus017
4327954 32GB webGUI s219741 R 2:32:11 1 NucleusB021
4328288 32GB TestJob s219741 R 0:08 1 NucleusA167
[s219741@Nucleus004 ~]$ scancel 4328288
[s219741@Nucleus004 ~]$ squeue -u s219741
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4327688 32GB webGUI s219741 R 12:54:10 1 NucleusA147
4327759 32GB webGUI s219741 R 5:33:43 1 NucleusA082
4327931 128GB nf-callP s219741 R 2:59:45 1 Nucleus013
4327937 128GB nf-callP s219741 R 2:57:47 1 Nucleus017
4327954 32GB webGUI s219741 R 2:32:59 1 NucleusB021
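scancel can also select jobs by user, name, or state rather than by job ID; a brief sketch (the job name is a placeholder):
scancel -u s219741                    # cancel all of your own jobs
scancel --name=TestJob                # cancel jobs with a matching name
scancel --state=PENDING -u s219741    # cancel only your jobs that are still pending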
[ydu@biohpcws009 ~]$ sacct -j 28066
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
28066        ReadCount+      super                    32  COMPLETED      0:0
28066.batch       batch                                1  COMPLETED      0:0
28066.0      count_rea+                                2  COMPLETED      0:0
--start=<YYYY-MM-DD>
- Specify a start date to begin reporting the job and job step accounting information on or after the given date.
--end=<YYYY-MM-DD>
- Further refine the results by adding an --end parameter to specify an end date.
[s219741@Nucleus030 ~]$ sacct --start=2024-01-01 --end=2024-02-01
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
4223127          webGUI       32GB                    32    TIMEOUT      1:0
4223127.bat+      batch                               32  COMPLETED      0:0
4224383      nf-trackS+      super                    56  COMPLETED      0:0
4224383.bat+      batch                               56  COMPLETED      0:0
4224384      nf-trimRe+      super                    32  COMPLETED      0:0
4224384.bat+      batch                               32  COMPLETED      0:0
#!/bin/bash
#SBATCH --job-name=serialJob                     # job name
#SBATCH --partition=super                        # select partition from 128GB, 256GB, 384GB, GPU and super
#SBATCH --nodes=1                                # number of nodes requested by user
#SBATCH --time=0-00:00:30                        # run time, format: D-H:M:S (max wallclock time)
#SBATCH --output=serialJob.%j.out                # standard output file name
#SBATCH --error=serialJob.%j.time                # standard error output file name
#SBATCH --mail-user=username@utsouthwestern.edu  # specify an email address
#SBATCH --mail-type=ALL                          # send email when the job status changes (begin, end, fail, etc.)

module add matlab/2014b                          # load software package

matlab -nodisplay -nodesktop -nosplash < script.m  # execute program
#!/bin/bash
#SBATCH --job-name=multiTaskJob                  # job name
#SBATCH --partition=super                        # select partition from 128GB, 256GB, 384GB, GPU and super
#SBATCH --nodes=2                                # number of nodes requested by user
#SBATCH --ntasks=64                              # number of total tasks
#SBATCH --time=0-00:00:30                        # run time, format: D-H:M:S (max wallclock time)
#SBATCH --output=multiTaskJob.%j.out             # standard output file name
#SBATCH --error=multiTaskJob.%j.time             # standard error output file name
#SBATCH --mail-user=username@utsouthwestern.edu  # specify an email address
#SBATCH --mail-type=ALL                          # send email when the job status changes (begin, end, fail, etc.)

module add matlab/2014b                          # load software package

let "ID=$SLURM_NODEID*$SLURM_NTASKS/$SLURM_NNODES+SLURM_LOCALID+1"  # distribute tasks to 2 nodes based on their ID
srun matlab -nodisplay -nodesktop -nosplash < script.m ID           # execute program
#!/bin/bash
#SBATCH --job-name=multiThreading                # job name
#SBATCH --partition=super                        # select partition from 128GB, 256GB, 384GB, GPU and super
#SBATCH --nodes=1                                # number of nodes requested by user
#SBATCH --ntasks=30                              # number of total tasks
#SBATCH --time=0-10:00:00                        # run time, format: D-H:M:S (max wallclock time)
#SBATCH --output=multiThreading.%j.out           # redirect both standard output and standard error to the same file
#SBATCH --mail-user=username@utsouthwestern.edu  # specify an email address
#SBATCH --mail-type=ALL                          # send email when the job status changes (begin, end, fail, etc.)

module add phenix/1.9                            # load software package

phenix.den_refine model.pdb data.mtz nproc=30    # execute program with 30 CPUs
#!/bin/bash
#SBATCH --job-name=MPI                           # job name
#SBATCH --partition=super                        # select partition from 128GB, 256GB, 384GB, GPU and super
#SBATCH --nodes=2                                # number of nodes requested by user
#SBATCH --ntasks=64                              # number of total tasks
#SBATCH --time=0-00:00:10                        # run time, format: D-H:M:S (max wallclock time)
#SBATCH --output=MPI.%j.out                      # redirect both standard output and standard error to the same file
#SBATCH --mail-user=username@utsouthwestern.edu  # specify an email address
#SBATCH --mail-type=ALL                          # send email when the job status changes (begin, end, fail, etc.)

module add mvapich2/gcc/1.9                      # load MPI library

mpirun ./MPI_only                                # execute 64 MPI tasks across 2 nodes
#!/bin/bash
#SBATCH --job-name=MPI_pthread                   # job name
#SBATCH --partition=super                        # select partition from 128GB, 256GB, 384GB, GPU and super
#SBATCH --nodes=4                                # number of nodes requested by user
#SBATCH --ntasks=8                               # number of total MPI tasks
#SBATCH --time=0-00:00:10                        # run time, format: D-H:M:S (max wallclock time)
#SBATCH --output=MPI_pthread.%j.out              # redirect both standard output and standard error to the same file
#SBATCH --mail-user=username@utsouthwestern.edu  # specify an email address
#SBATCH --mail-type=ALL                          # send email when the job status changes (begin, end, fail, etc.)

module add mvapich2/gcc/1.9                      # load MPI library

let "NUM_THREADS=$SLURM_CPUS_ON_NODE/($SLURM_NTASKS/$SLURM_NNODES)"  # calculate number of threads per MPI task
mpirun ./MPI_pthread $NUM_THREADS                # 8 MPI tasks across 4 nodes, each MPI task runs with 16 threads
#!/bin/bash
#SBATCH --job-name=cuda-test                     # job name
#SBATCH --partition=GPU                          # select the GPU partition
#SBATCH --nodes=1                                # number of nodes requested by user
#SBATCH --gres=gpu:1                             # use generic resource GPU, format: --gres=gpu:[n], where n is the number of GPU cards
#SBATCH --time=0-00:00:10                        # run time, format: D-H:M:S (max wallclock time)
#SBATCH --output=cuda.%j.out                     # redirect both standard output and standard error to the same file
#SBATCH --mail-user=username@utsouthwestern.edu  # specify an email address
#SBATCH --mail-type=ALL                          # send email when the job status changes (begin, end, fail, etc.)

module add cuda65                                # load CUDA library

./matrixMul                                      # execute GPU program
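The same --gres syntax can request more than one GPU on partitions whose nodes carry multiple cards (see the partition table above). A hedged sketch for the GPUp100 partition, whose nodes each hold two P100 cards:
#SBATCH --partition=GPUp100
#SBATCH --gres=gpu:2                 # request both GPUs on the allocated node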
Several full Slurm job examples can be downloaded here.
#!/bin/bash
#SBATCH --job-name=my_conda_job                  # Descriptive job name
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-00:30
#SBATCH --output=my_conda_job.%j.out
#SBATCH --error=my_conda_job.%j.err
#SBATCH --mail-user=username@utsouthwestern.edu
#SBATCH --mail-type=ALL

# Load Conda module (if necessary)
module load python/3.12.x-anaconda-mamba

# Activate Conda environment
conda activate my_environment

# Execute Python script
python my_python_script.py
#!/bin/bash
#SBATCH --job-name=my_namd_job                   # Descriptive job name
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-00:30
#SBATCH --output=my_namd_job.%j.out
#SBATCH --error=my_namd_job.%j.err
#SBATCH --mail-user=username@utsouthwestern.edu
#SBATCH --mail-type=ALL
module load namd/gpu/2.14
charmrun $(which namd2) ++local ++auto-provision +ignoresharing equilibration.conf > equil.log
charmrun $(which namd2) ++local ++auto-provision +ignoresharing md.conf >md.log
#!/bin/bash
#SBATCH --job-name=my_gromacs_job                # Descriptive job name
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-00:30
#SBATCH --output=my_gromacs_job.%j.out
#SBATCH --error=my_gromacs_job.%j.err
#SBATCH --mail-user=username@utsouthwestern.edu
#SBATCH --mail-type=ALL

# Load Gromacs module
module load gromacs/5.1.4

# Execute Gromacs commands
gmx grompp -f input.mdp -c conf.gro -p topol.top -o md_out.tpr
gmx mdrun -deffnm md_out
A single user can allocate the following maximum amount of computing resources across all submitted jobs:
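To see how much of this allowance your running jobs currently occupy, you can total the CPUs and nodes reported by squeue. A minimal sketch:
# %C = CPUs and %D = nodes allocated to each running job; sum them with awk
squeue -u $USER -t RUNNING -h -o "%C %D" | awk '{cpus+=$1; nodes+=$2} END {print cpus " CPUs on " nodes " nodes"}'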
To request a job extension, please email the BioHPC team at biohpc-help@utsouthwestern.edu
My Slurm job has failed. What can I do to fix it?
1. Gather Information
Use squeue -u your_username to check whether your job is running, pending, failed, etc.
Check the job's standard output and error files (the files specified with #SBATCH -o and #SBATCH -e).
2. Common Issue Areas
3. Debugging Steps
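As a starting point for these steps (the job ID and file name below are placeholders), the accounting record and the tail of the error file usually show whether a job ran out of time, ran out of memory, or hit an application error:
sacct -j 4298643 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS   # how the job ended and its peak memory use
tail -n 50 serialJob.4298643.err                                        # last lines of the file given to #SBATCH -e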
Additional Tips
Slurm Documentation
Slurm Cheatsheet: https://slurm.schedmd.com/pdfs/summary.pdf
Official Slurm documentation: https://slurm.schedmd.com/documentation.html
Slurm Quick Start User Guide: https://slurm.schedmd.com/quickstart.html
Tutorials and Guides
Slurm Tutorial from SchedMD (Slurm creators): https://slurm.schedmd.com/tutorials.html
Slurm Tutorial from Livermore Computing (LLNL): https://hpc.llnl.gov/banks-jobs/running-jobs/slurm
Slurm User Guide from Rochester Institute of Technology: https://research-computing.git-pages.rit.edu/docs/slurm_quick_start_tutorial.html
Online Courses and Videos
Slurm Job Management - Center for Advanced Research Computing, University of Southern California: https://www.youtube.com/watch?v=GD5Ov75lQoM
Slurm Video Tutorials from SchedMD: https://slurm.schedmd.com/tutorials.html
Community Resources
Submit a Ticket for Slurm on BioHPC to: biohpc-help@utsouthwestern.edu
Slurm Knowledge Base: https://slurm.schedmd.com/kb.html
Slurm User Community: https://slurm.schedmd.com/community.html