
Pegasus is the code name for the Genomics England High Performance Computing (HPC) cluster that runs all production-worthy workflows. Pegasus uses IBM's Load Sharing Facility (Spectrum LSF) as its workload management tool (job scheduler).

Accessing the HPC

The HPC is accessed via ssh from the terminal. The following is an example of a GeCIP user, John Doe, connecting via ssh. The address will change depending on which group you belong to; see the table below for more information.
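For example, assuming the hypothetical username jdoe, a GeCIP user would connect with:

ssh jdoe@hpc-prod-grid-login-gecip-01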

You will then be prompted for your password, and once entered, will be connected to the HPC.

If you do not want to enter a password each time you connect, you can create an SSH key and SSH config file that will make logging in easier.

Create an SSH key in your .ssh folder, which is located in /home/<username>/.ssh

cd ~/.ssh
ssh-keygen

Follow the prompts to name your SSH key (cluster is a good name) and leave the passphrase blank.
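Alternatively, as a sketch, the same key can be generated non-interactively (the file name matches the config below, and -N "" gives a blank passphrase):

ssh-keygen -f ~/.ssh/cluster -N ""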

Create an SSH config file in the .ssh folder with the following information and format:

Host cluster
	Hostname hpc-prod-grid-login-gecip-01
	User <your username>
	IdentityFile ~/.ssh/cluster

Copy your new SSH public key to the HPC:

ssh-copy-id -i cluster.pub cluster

This will ask for your password, then copy the ssh key to the HPC.

Now, instead of typing the full ssh command with your username and the login node address, you can connect by typing:

ssh cluster

Login node access addresses

Name                                   | Who                          | LDAP group
hpc-prod-grid-login-gecip-01           | GeCIPs & Researchers         | gecip_lsf_access, research_lsf_access
hpc-prod-grid-login-discoveryforum-01  | Commercial (Discovery Forum) | discovery_lsf_access

Using software on the HPC


Software on the Genomics England HPC is managed through the module system framework. A full list of the software available in the modules can be found here: Software Available on the HPC.
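To browse everything available through the module system directly from the command line, you can run the standard module listing command:

module avail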

Module system commands

Loading software:

module load R/3.4.0

Always specify the version of the software that you want to load, to avoid errors and unexpected results. For example, running module load R without a version will load the default (3.5.1) instead of the desired 3.4.0.

Unloading software:

module unload R/3.4.0

Switching versions (requires the software to be loaded first):

module switch R/3.3.0
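To see which modules are currently loaded, or to unload them all at once, the standard module commands are:

module list
module purge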

How to submit jobs to LSF

Before you can submit jobs to the cluster, you need to load the cluster module.

module load cluster/prod

To load the module automatically on ssh connection to the HPC, add the command to the end of your .bashrc file.
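For example, assuming the default ~/.bashrc location, you can append it with:

echo 'module load cluster/prod' >> ~/.bashrc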

If the above does not work for some reason, run the following line:

source /lsf/prod/conf/profile.lsf


To submit an LSF job, use the bsub command:

bsub -q <queue_name> -P <project_code> -o <output.stdout> -e <output.stderr> <myjob>

For a list of all LSF queues and project codes see LSF Project Codes.

You will only be able to submit to queues that you have LDAP access to.
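As a sketch, a GeCIP user submitting a simple batch job to the gecip queue might run the following (the project code, output file names and script are placeholders):

bsub -q gecip -P <project_code> -o myjob.stdout -e myjob.stderr ./myjob.sh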


Please use the login node only as a portal to the HPC for submitting jobs. Unauthorised tools are not permitted to run on the login nodes and, if found running, will be terminated without warning.

Interactive Vs Batch Jobs

Interactive jobs are jobs that you interact with:

  • via the command line or a GUI
  • the job stays connected to the submission shell

Interactive jobs have a dedicated queue (inter) with dedicated resources during core hours for faster dispatch; an example submission is shown below.

Batch jobs are jobs that you do not interact with; the job is disconnected from the submission shell.

Jobs are batch by default.
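A minimal sketch of an interactive job, assuming you have a valid project code (-Is requests an interactive job with a pseudo-terminal):

bsub -q inter -P <project_code> -Is /bin/bash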

Some Basic LSF Commands 

command  | description
bsub     | submits a job to the cluster
bqueues  | shows info on the cluster queues
bjobs    | shows info on the cluster jobs
bhosts   | shows info on the cluster hosts
bhist    | shows info on finished cluster jobs
bacct    | shows statistics and info on finished cluster jobs
bkill    | removes a job from the cluster
lshosts  | shows static resource info
lsload   | shows dynamic resource info

bjobs (display information about LSF jobs)

bjobs is a very handy command for viewing information about both pending and running jobs. With the long option (-l), it shows a detailed view: why a job is pending (for jobs waiting in the queue), and where it is running, its turnaround time, and its resource usage (for running jobs).

Usage:

bjobs -l <JOBID>

eg.

bjobs -l 513
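Other commonly useful variants (standard bjobs options):

bjobs -p
bjobs -r
bjobs -u all

The -p option shows only pending jobs along with the reason they are pending, -r shows only running jobs, and -u all shows jobs from all users.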


Genomics England LSF setup


Each node has a fixed number of 'job slots'. A job consumes one slot (more for parallel jobs). The standard policy is one job slot per CPU (for single-core CPUs) or one job slot per core (for multi-core CPUs).

Consequently, the cluster has a maximum concurrent job limit to allow fast dispatch.
In our estate, this means each compute node can accommodate 22 concurrent batch jobs.

For interactive jobs, we allow more job slots per node on the assumption that interactive workloads are not resource intensive (they are mostly a way for users to submit to the cluster, rather than a place to run day-to-day activities directly on the submission node).

Pegasus

This is our main production grid. All workloads are expected to be submitted to this grid, targeting the right queue. (The total number of job slots will increase over time.)

Cluster name | Total number of CPU cores | Total number of job slots | Available queues
pegasus      | 2112                      | 2136                      | inter, high, bio, cip, gecip, research, discovery, low

Split of execution nodes based on resource bucket (cores, memory and local disk /scratch)

Number of LSF execution nodes | Cores per node | Memory per node | /scratch per node | Queues available from
128                           | 12             | 370 GB          | 1.4 TB (1420 GB)  | ALL (bio, high, gecip, research, discovery, cip)
24                            | 24             | 740 GB          | 2.9 TB (2910 GB)  | bio, high


View cluster information

To view cluster information (LSF version, cluster name, master host) and check that your environment is correctly set up, run the command lsid:

lsid

IBM Spectrum LSF Standard 10.1.0.0, Jul 08 2016
Copyright International Business Machines Corp. 1992, 2016.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

My cluster name is pegasus
My master name is hpc-prod-grid-lsfmaster-01.gel.zone

The output shows which cluster you are connected to; "My cluster name is pegasus" refers to the production cluster.

Queues Available

Separate queues are set up in the grid for each group and for the type of jobs they intend to run. For interactive jobs (viz, xterm, GUI tools) you should submit to the inter queue. For batch jobs, target the queue that belongs to your group: for the Bioinformatics team this is the bio queue, and for GeCIPs it is the gecip queue. To see all available queues in the grid, run bqueues:

QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
high             50  Open:Active       -    -    -    -     0     0     0     0
inter            50  Open:Active       -    -    -    -     0     0     0     0
bio              40  Open:Active       -    -    -    -     0     0     0     0
cip              40  Open:Active       -    -    -    -     0     0     0     0
gecip            40  Open:Active       -    -    -    -     0     0     0     0
research         40  Open:Active       -    -    -    -     0     0     0     0
discovery        40  Open:Active       -    -    -    -     0     0     0     0
low              30  Open:Active       -    -    -    -     0     0     0     0

 

Queue name | Who                                | LDAP group                | Description
inter      | ALL                                | N/A                       | Lightweight interactive or GUI tools. The queue has a per-user concurrent job limit of 5.
high       | Limited access with prior approval | pipeline                  | Priority batch jobs with approval to use. Currently only approved for fast-track samples.
bio        | Bioinformatics group (internal)    | bio, pipeline             | Internal Genomics England staff.
cip        | Illumina                           | bio-cip-illumina-share-rw | Illumina queue.
gecip      | GeCIPs                             | gecip_lsf_access          | Queue for GeCIPs.
research   | Researchers                        | research_lsf_access       | Queue for researchers.
discovery  | Discovery Forum                    | discovery_lsf_access      | Queue for the Discovery Forum group.
low        | ALL                                | N/A                       | Low priority jobs that can risk pre-emption.

Resources in LSF

LSF tracks resource availability and usage, and LSF jobs can use defined resources to request specific resources.

All hosts have static numeric resources, e.g.

  • maxmem - total physical memory
  • ncpus - number of CPUs
  • maxtmp - maximum available space in /tmp
  • cpuf - CPU factor (relative performance)

All hosts also have dynamic numeric resources, e.g.

  • mem - available memory
  • tmp - available space in /tmp
  • ut - CPU utilisation

Additionally, hosts can have boolean resources for OS and architecture, which allows easy targeting of the correct platform. Example generic and specific resources:

  • ub1604 - host is running Ubuntu 16.04
  • dsk - host has a local /scratch disk with 2 TB of space
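As a sketch, these resources can be used to filter host listings (assuming the boolean resources above are defined on this cluster; mem is reported in MB):

lshosts -R "ub1604"
lsload -R "select[mem>16000]"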

 
Ways to specify resource requirement strings (the -R option)

  • Select: a logical expression built from a set of resource names
  • Order: the order string is used for host sorting and selection
  • Usage: used to specify resource reservations for jobs
  • Span: a span string specifies the locality of a parallel job
  • Same: the same string specifies that all processes of a parallel job must run on hosts with the same resource
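As a sketch of how these sections combine in a single submission (the queue, project code, memory values and command are placeholders for illustration):

bsub -q gecip -P <project_code> -n 4 -R "select[mem>16000] order[ut] rusage[mem=16000] span[hosts=1]" -o job.stdout -e job.stderr <myjob>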

