Batch Scheduling

Introduction

All jobs on GenomeDK must be executed as batch jobs through the queueing system. GenomeDK uses Slurm.
 
A node can be shared by multiple users, so you should always take care to request the correct amount of resources (nodes, cores and memory). There is no reason to occupy an entire node if you are only using a single core and a few gigabytes of memory. Always make sure to utilize the resources on the requested nodes efficiently.
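As a rough illustration (the numbers here are made up for the example), a single-threaded program that needs about 8 GB of memory should be submitted with a matching request rather than a whole node, using the sbatch options described further down this page:

[user@fe1 ~]$ sbatch -p normal -c 1 --mem=8g jobscript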
 
The frontend must not be used for any jobs, except for test jobs running for a couple of minutes.
 
Available queues/types of nodes:

Partition/queue: normal
Nodes/cores: 95 / 1520
Hardware: Two Intel "Sandy Bridge" E5-2670 CPUs @ 2.67 GHz, 8 cores/CPU; 64 GB memory @ 1600 MHz; 2 TB SATA disks, RAID 0: ~280 MB/s; 10 GigE and 1 GigE NICs
Remarks: Default walltime = 4 hours

Partition/queue: short
Nodes/cores: 56 / 896
Hardware: Two Intel "Sandy Bridge" E5-2670 CPUs @ 2.67 GHz, 8 cores/CPU; 128 GB memory @ 1600 MHz; 2 TB SATA disks, RAID 0: ~280 MB/s; InfiniBand 4X QDR and 1 GigE NICs
Remarks: Max accepted walltime = 12 hours; default walltime = 4 hours

Partition/queue: normal
Nodes/cores: 38 / 912
Hardware: Two Intel "Haswell" E5-2680v3 CPUs @ 2.5 GHz, 12 cores/CPU; 256 GB memory @ 2133 MHz; 2 TB SATA disks, RAID 0: ~350 MB/s; InfiniBand FDR and 1 GigE NICs
Remarks: Default walltime = 4 hours

Partition/queue: fat1
Nodes/cores: 1 / 32
Hardware: Four AMD Opteron 6212 CPUs @ 2.67 GHz, 8 cores/CPU; 512 GB memory @ 800 MHz; 2 TB SAS disk: ~200 MB/s; 10 GigE and 1 GigE NICs
Remarks: CPU/memory performance is about 25% of the nodes in normal

Partition/queue: fat2
Nodes/cores: 3 / 72
Hardware: Four Intel "Westmere" E7-4807 CPUs @ 1.87 GHz, 6 cores/CPU; 1024 GB memory @ 800 MHz; 2 TB SAS disk: ~200 MB/s; 10 GigE and 1 GigE NICs
Remarks: Performance is about 50% of the nodes in normal
When a job starts, a unique directory is created in the local /scratch filesystem on each node allocated to the job. You can refer to this directory as /scratch/$SLURM_JOBID.
When the job terminates, the scratch directories and their contents are automatically erased.
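A minimal jobscript sketch using the scratch directory (the program name, paths and resource numbers are placeholders for illustration):

#!/bin/bash
#SBATCH -p normal
#SBATCH -c 1
#SBATCH --mem=4g

# Work in the job's private scratch directory on the allocated node.
cd /scratch/$SLURM_JOBID

# Stage the input in, run the computation, and copy the result back before
# the job ends; the scratch directory is erased when the job terminates.
cp /home/user/project/input.dat .
myprogram input.dat > result.out
cp result.out /home/user/project/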
 
To benefit from the backfilling mechanism, all jobs should specify a realistic wallclock time. Backfilling helps jobs start earlier; the drawback is that if the specified wallclock time is too short, the job will be aborted when it runs over the limit. The wallclock time can be changed during job execution with scontrol update jobid=<jobid> TimeLimit=48:00:00. However, to avoid fooling the backfill mechanism, only the sysadmin can raise a job's time limit.
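For example (the times and job ID are placeholders), a realistic limit is set in the jobscript and can later be lowered on a running job:

In the jobscript:
#SBATCH --time=12:00:00

On the frontend, to lower the limit of job 33201:
[user@fe1 ~]$ scontrol update jobid=33201 TimeLimit=08:00:00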
 
To get started with batch scheduling on GenomeDK, we recommend that you use the qx utility.
 
When the qx utility does not cut it, use it to jumpstart a new jobscript: qx -v compute > job.sh. Also see the jobscript examples for more information on how to execute computations on the cluster.

Frequently used commands

priority

Show queued jobs. Use '-a' for all users.

gnodes

Displays a graphical overview of the activity on the cluster.

| s01n61   0G  OOOOOOOOOOOOOOOO | s02n53   0G  .......!!!!!!!!! |
| s01n62       U--------------- | s02n54   0G  ______________OO |
| s01n63  32G  ..............OO | s02n55  64G  ................ |
 
In this example gnodes output, underscores represent allocated cores with no activity, O represents an allocated core with activity, and ! represents a heavier load than requested. This means CPU-bound jobs should result in all Os, as on s01n61, while memory-bound jobs will only show activity on a few cores, as on s01n63. On s02n54 the user has asked for all the cores but is only using a few, and on s02n53 the user hasn't asked for enough cores.
 
The amount of non-allocated memory is shown for every node. If a node has available memory and cores, it is free to run additional jobs within that space.
 
sbatch

Submit batch jobs to the queue.

Submit a job to the normal queue, requesting 2 tasks with 16 cores each (i.e. two full 16-core nodes):
sbatch -p normal -n 2 -c 16 jobscript
 
Submit a job to the fat1 queue, requesting 1 core on 1 node and all of its memory:
sbatch -p fat1 --mem=512g jobscript
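For anything beyond a one-liner, the resource requests usually go into the jobscript itself as SBATCH directives. A minimal sketch (program name, input file and resource numbers are placeholders):

#!/bin/bash
#SBATCH -p normal
#SBATCH -c 16
#SBATCH --mem=32g
#SBATCH --time=12:00:00

# Run the analysis on the 16 allocated cores.
myprogram --threads 16 input.fa > output.txt

Submit it with:

[user@fe1 ~]$ sbatch jobscript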
 
scancel <jobid>

Cancels a queued or running job.

jobinfo <jobid>

Gives some information about a job.

Name                : worlds_greatest_job
User                : user
Nodes               : s02n64
Cores               : 1
State               : COMPLETED
Submit              : 2014-07-09T13:49:54
Start               : 2014-07-09T13:49:55
End                 : 2014-07-09T14:00:34
Reserved walltime   : 2-00:00:00
Used walltime       :   00:10:39
Used CPU time       :   00:10:11
% User (Computation): 99.45%
% System (I/O)      :  0.55%
Mem reserved        : 4000M/node
Max Mem used        : 2.13G (s02n64)
Max Disk Write      : 577.00M (s02n64)
Max Disk Read       : 606.00M (s02n64)

Batch scheduling using qx

qx autogenerates and submits a jobscript based on the parameters provided. It can also be used as a dry run, where a jobscript is generated without being submitted, which makes it useful for jumpstarting new jobscripts.
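As a sketch of that workflow (building only on the jumpstart command shown earlier on this page; the exact dry-run options are not listed here), the generated script can be written to a file, edited and then submitted manually:

[user@fe1 ~]$ qx -v compute > job.sh
[user@fe1 ~]$ nano job.sh          # adjust resources and commands as needed
[user@fe1 ~]$ sbatch job.sh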

The quick method

Any command that can be executed from the front-end node, such as:

[user@fe1]$ md5sum large-file > large-file.md5

can be submitted to the job queue with the command:

[user@fe1]$ qx --no-scratch "md5sum large-file > large-file.md5"
33201

Type 'mj' to see the job:

                                                            Elap   Alloc
Job ID           Username Queue    Jobname    SessID NDS  S Time   nodes
---------------- -------- -------- ---------- ------ ---  - -----  -----
33201            user     normal   md5sum         --   1  R 00:00  s02n64/1

The job output is written to the file 'slurm-33201.out' located in the current directory.

The quick method, with active waiting

If you want to wait for the job to finish (and block your session), use '-w'. This also makes qx print stdout/stderr directly instead of writing it to files.

[user@fe1 ~]$ qx --no-scratch -w md5sum large-file
9c21291051e62b9f153513ebf1bc566d  large-file

 

Using dispatch to run embarrassingly parallel jobs

An embarrassingly parallel job (EP-job) consists of several sub-jobs with no dependencies or communication between them. EP-jobs can achieve an excellent degree of utilization of the hardware (CPU and memory) because the individual tasks do not have to wait for each other or for communication channels to become ready. The problem with EP-jobs, however, is that they can be cumbersome to start: you either have to create separate scripts for each sub-job or map a one-dimensional index onto your full set of parameters.

The adispatch tool is provided to make starting this kind of job easier.

To use adispatch, you create a file with each of the commands you want to run on separate lines. If you want to set options for the jobs, such as the time limit or the number of cores needed per command, you can add SBATCH directives to the top of your command file. Instead of using SBATCH directives, you can also pass options to the adispatch command that are forwarded to sbatch, like -p express to use the express partition in the example below. The command file can also contain comments (starting with #) and empty lines, both of which are ignored.

[user@fe1 ~]$ cat command_file
#SBATCH -c 4
#SBATCH --mem 2g
script param1 param2 ABC
script param2 param3 XYZ
...

[user@fe1 ~]$ adispatch -p express command_file
Submitted batch job 13288671

In this example the command file specifies that each command needs 4 cores and 2 GB of memory. Additionally, we know that the commands do not take long, so we ask for the express partition when submitting.

For users who have used the old dispatch command, a version that only supports the dispatch -r format is still available. The newer adispatch works better, since the resource usage is more dynamic: if a few commands end up taking significantly longer than the rest, the old dispatch would still hold on to enough resources for the worst case rather than just what the commands are actually using. With the new version you can also get more cores if the cluster is empty, your jobs can start earlier without having to wait for a larger allocation, and jobs from other users can be interleaved, giving fairer usage overall.

The old dispatch also has some limitations. You need to create a wrapper script that does nothing other than call dispatch and ask for resources. It cannot read SBATCH directives from the command file and needs them to be specified on the command line or in the wrapper script. It needs the time limit to cover the entire job and not just one of the commands. Finally, it has no way of indicating that each command needs multiple cores.