The first thing we will do is to make sure that we can connect to the cluster.
- ipsych.genome.au.dk users, must follow the instructions at http://ipsych.genome.au.dk
- Windows users will want to get putty (and winscp for transferring files).
Linux and Mac OS X already has a command called
To connect you run the following command from a terminal or by putting the same address in to putty.
[me@local ~]$ ssh email@example.com [me@genomedk ~]$
Creating a script
Now we will create a simple text file on our local machine, with the following content:
#!/bin/bash source /com/extra/bwa/0.7.5a/load.sh reference_prefix="/data/refseq/iGenome/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa" bwa mem $reference_prefix test-reads.fq > test-reads.sam
The script loads a pre-installed piece of software - in this case bwa - from a local software repository. It is important that you load the software in your script -- loading it in your session and starting a script will only work if you remember to load the software every time you submit the job.
Too see what else is available you can use the command
Copying a file to the cluster
* ipsych.genome.au.dk users must follow the guide in Ipsych Usage.
To get this script from your local machine to the cluster you can use the
scp command on linux/mac and the WinSCP program on windows. This uses the same protocol as ssh so you connect to the same address with the same credentials.
[me@local ~]$ scp scriptfile firstname.lastname@example.org:faststorage/scriptfile email@example.com password: scriptfile 100% 242 0.2KB/s 00:00 [me@local ~]$
Downloading data from the internet
Now you might have noticed that the script mentions a
test-reads.fq which we haven't made. This is a file that is publicly available via http. You can use the
curl commands to download it.
The cluster is behind a firewall. To download through the firewall you have to configure a proxy, by setting the environment variables
This will enable the proxy for this session, if you want it to be enabled every session you can add the same export commands to the
.bash_profile file in your home folder.
[me@genomedk ~]$ export http_proxy="http://in:3128" && export ftp_proxy="http://in:3128" [me@genomedk ~]$ wget http://cs.au.dk/~aeh/genomedk_tut/test-reads.fq --2015-01-14 10:37:14-- http://cs.au.dk/~aeh/genomedk_tut/test-reads.fq Resolving in... 10.20.0.50 Connecting to in|10.20.0.50|:3128... connected. Proxy request sent, awaiting response... 200 OK Length: 380 [text/plain] Saving to: “test-reads.fq” 100%[==============================================>] 380 --.-K/s in 0s 2015-01-14 10:37:14 (44.4 MB/s) - “test-reads.fq” saved [380/380] [me@genomedk ~]$ mv test-reads.fq faststorage/ [me@genomedk ~]$ cd faststorage/ [me@genomedk faststorage]$
We move the script and the datafile into the faststorage folder, as this storage is more suited for datafiles.
The cluster has two storage systems. One is a standard shared NFS system, where your homedir is located. This is good for smaller files and less often used files. The other storage, is a much faster system, with much more available space. The disadvantage is that the performance of small files is not as good as on NFS and there is no client side caching so certain usage patterns are not very fast. To access the faster storage you use the faststorage folders. You have a personal one in your homedir and every project dir also has one.
As a rule of thumb all non-tiny data files should be on faststorage.
If you need backups you can create a folder called BACKUP and everything under that folder will have backups. Try to avoid putting larger derived files under backup - you can always derive them again, if you make sure to backup your pipeline, which was used to generate the files.
Submitting a job
Now to actually run our script we need to submit it to the queue of jobs. This is done by executing
srun scriptfile which will wait until the job is done, showing you the output from the job as it executes.
[me@genomedk faststorage]$ srun --mem-per-cpu=4G --partition=express scriptfile srun: job 2396710 queued and waiting for resources srun: job 2396710 has been allocated resources [M::main_mem] read 2 sequences (102 bp)... [main] Version: 0.7.5a-r405 [main] CMD: bwa mem /data/refseq/iGenome/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa test-reads.fq [main] Real time: 6.297 sec; CPU: 2.907 sec [me@genomedk faststorage]$
The script should finish quickly with no errors. If you look at the srun command you can see that we asked for the express partition. This is a couple of machines used for test jobs that will always have a time limit of at most 1 hour. That also means there should rarely, if ever, be any wait time. We also ask for 4GB of memory to work with.
Let's try with a slightly larger input file. Modify the script to look like this:
Now we specify the extra parameters in the file itself, which will only work with the
sbatch command that we will see in a minute. Instead of the express partition we now ask for the normal partition where the main bulk of the jobs are run and the default time limit is 48 hours. We also ask for more memory and 4 cores.
As an example, we simply run the same bwa command four times in parallel (indicated by the
wait). If we ran multiple commands without asking for extra cores they would share a single cpu, getting only 25% of the time each.
If there are a lot of jobs in the queue already or if the job takes longer than a few minutes to run you probably don't want to block your terminal while you wait. Instead you can submit it with
sbatch scriptfile which will return immediately and give you a job number that can be used to check on the job later.
An alternative approach (and often preferred method), would have been to create four different scripts, each using just one core, and submitted them all at the same time with sbatch.
The larger test files are still pretty small so it doesn't take more than a few minutes to finish once the job gets through the queue. Once the job is finished we can get some information about it with the
jobinfo command. It will look something like this:
This shows when the job was started/finished, what was requested and so on.
The most important information is the maximum memory usage and used walltime. In this case we can see that we actually used 5.2GB per core (20.71GB for four cores), and not the 8GB that we asked for. If we were running a similar script on many different input files asking for 8GB would be an okay safety margin, while asking for 32GB is a waste of resources. The default 48 hour time limit is too large, as our jobs only take a few minutes to run. The more accurately everyone specifies their jobs the smoother the whole queue system is going to run.