
Tutorial for OKA cluster users

by portal administrator — last modified July 5, 2006 at 18:37

environment setup, usage policy

Sun Grid Engine - a facility for executing UNIX jobs on remote machines

According to the SGE site:

"The Grid Engine project is an open source community effort to facilitate the adoption of distributed computing solutions. Sponsored by Sun Microsystems and hosted by CollabNet, the Grid Engine project provides enabling distributed resource management software for wide ranging requirements from compute farms to grid computing."

In the local installation, SGE is a batch system developed by Sun and released as open source under the SISSL license. It provides the standard batch system facilities and user utilities.

Environment Setup

For a bash/zsh login shell, add to $HOME/.profile:

> export SGE_ROOT=/okas/sgeadmin/sge/root
> . $SGE_ROOT/OKA/common/settings.sh

For a csh/tcsh login shell, add to $HOME/.login:

> setenv SGE_ROOT /okas/sgeadmin/sge/root
> source $SGE_ROOT/OKA/common/settings.csh
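
After logging in again, a quick check along the following lines may confirm that the settings took effect. This is only a sketch for Bourne-type shells: check_sge_env is a hypothetical helper, not an SGE command, and it assumes that the settings file puts qsub, qstat and qdel on your PATH.

```shell
#!/bin/sh
# Report whether the SGE environment looks usable in the current shell.
# check_sge_env is a hypothetical helper, not part of SGE.
check_sge_env() {
    if [ -z "${SGE_ROOT:-}" ]; then
        echo "SGE_ROOT is not set"
        return 1
    fi
    # Assumption: the settings file adds the SGE binaries to PATH.
    for cmd in qsub qstat qdel; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd not found in PATH"
            return 1
        fi
    done
    echo "SGE environment OK: SGE_ROOT=$SGE_ROOT"
}

check_sge_env || echo "Re-check the lines added to your login file."
```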

Example

If you aren't familiar with batch submission systems, it is highly recommended that you first read the sge_intro man page:

> man sge_intro

Basic commands provided by SGE are:

qsub	submit a batch job to Grid Engine
qstat	show the status of Grid Engine jobs and queues
qdel	delete Grid Engine jobs from queues
qmon	GUI front-end to user's and administrator's utilities

According to the qsub manual:

"qsub submits batch jobs to the Grid Engine queuing system. Grid Engine
supports single- and multiple-node jobs. Command can be a path to a
binary or a script (see -b option) which contains the commands to be run
by the job using a shell (for example, sh(1) or csh(1)). Arguments to
the command are given as command_args to qsub . If command is handled
as a script then it is possible to embed flags in the script. If the
first two characters of a script line either match `#$' or are equal
the prefix string defined with the -C option described below, the line
is parsed for embedded command flags." (See man qsub for more information.) For example:
> cat > test_script.csh
#!/bin/csh

# Which account to be charged cpu time
#$ -A santa_claus

# date-time to run, format [[CC]yy]MMDDhhmm[.SS]
#$ -a 12241200

# Set memory and job CPU time limits to 128MB and 5 hours respectively;
# see man queue_conf ("RESOURCE LIMITS" section) for all available
# parameters. Memory values without a unit suffix are taken as bytes.

#$ -l h_vmem=128M,h_cpu=5:0:0

# If I run on dec_x put stderr in /tmp/foo, if I
# run on sun_y, put stderr in /usr/me/foo (-o for stdout, by default
# stderr and stdout are put into home dir of user on execution host)
#$ -e dec_x:/tmp/foo,sun_y:/usr/me/foo

# Send mail to these users
#$ -M santa@heaven,claus@heaven

# Mail at beginning/end/on suspension
#$ -m bes

# Export these environmental variables
#$ -v PVM_ROOT,FOOBAR=BAR
# to export all environmental variables use `-V' option

a.out
^D
> qsub test_script.csh

Place and run executables only from your file server directory, /okas/<your directory>. User home directories are not visible on batch nodes in the current cluster configuration:

> cd /okas/filin/test
> qsub test_script.csh
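
For users with a Bourne-type shell, the same embedded-flag technique might look like the sketch below. The job name and CPU limit are illustrative assumptions, and job_report is a hypothetical function. The options -S (shell), -N (job name), -l (resource limits) and -j y (merge stderr into stdout) are standard qsub options, and JOB_ID is set by SGE inside a running job.

```shell
#!/bin/sh
# Sketch of a Bourne-shell job script; name and limit are illustrative.
#$ -S /bin/sh
#$ -N test_sh
#$ -l h_cpu=0:10:0
#$ -j y

# Print where the job runs. Outside SGE, JOB_ID is unset and "none" is shown.
job_report() {
    echo "job ${JOB_ID:-none} on $(uname -n) in $(pwd)"
}

job_report
```

Because the `#$' lines are ordinary comments to the shell, the script can be run by hand for a dry run before it is submitted with qsub.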

Q-commands

At users' request, some extra commands were created to monitor the system and the processes on batch nodes. The command names start with the character Q. Here is the list of commands:

Qcat, Qchattr, Qchgrp, Qchmod, Qcp, Qfree, Qgrep, Qkill, Qls,
Qmv, Qps, Qpstree, Qrm, Qstat, Qtop (in batch mode), Qvmstat

These commands are equivalents of commonly used UNIX utilities and reproduce their behaviour and options. Use the -h option to get help and the -l option to list the available batch nodes. They are safe to use because every command runs under the uid and gid of the user who started it, so a user can damage only his/her own jobs and their environment. Examples of usage:

> Qps afx @okaf003

> Qtop -p 26971 @okaf001

If you have any comments or proposals to extend the list of commands, contact the cluster administrator.

Usage Policy

There are two common queues on the cluster:

short	high priority, CPU time limit 3 hours (a job in this queue suspends jobs in the `long' and personal queues on the same node), nice 2, 1 job slot per host. The queue is intended for debugging and short tests. Users who use the queue for other purposes will be removed from the list of cluster users
long	low priority, no CPU time limit, nice 15, 1 job slot per host

Some users have personal queues, which they use instead of the `long' queue. In every case, a user may run one long job on each node, through either the `long' queue or a personal queue. The personal queues are:

ioucht	nice 12	1 job slot per host
roma	nice 15	1 job slot per host
kolosov	nice 15	1 job slot per host
kdatsko	nice 15	1 job slot per host
tchikil	nice 15	1 job slot per host
polarush	nice 15	1 job slot per host
slava	nice 15	1 job slot per host

Personal queues provide some extra flexibility in the job submission policy. Consider the following case:

each node can run up to three jobs simultaneously,
the long queue has one job slot on every node;

If one user submits many more jobs than the overall number of slots (the number of nodes multiplied by the number of slots on each node), the jobs of all users who submit after that user have to wait until the first user's jobs finish. Personal queues provide a way to overcome this limitation. Ideally, we would like to have the following situation:

each user has a personal queue,
each personal queue has one slot on each node,
the number of slots on each node is equal to the number of users;

In that case:

every user can use all nodes simultaneously, i.e. can load the whole cluster, so no node sits idle when only one user wants to use the cluster,
every user can run, at any time, at least as many jobs as there are nodes;

Due to hardware limitations (the memory available on each node), the number of jobs running simultaneously is limited to three because:

( 512 MB RAM + 1.5 GB swap ) / 3 ≈ 667 MB

About 667 MB is the maximum memory available for one job.
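
The arithmetic can be reproduced with a short sketch. It assumes that the figure above treats 512 MB as 0.5 GB and 1 GB as 1000 MB; that reading, and the helper name per_job_mb, are assumptions rather than something the cluster documentation states.

```shell
#!/bin/sh
# Memory per job in MB for a node with the given RAM and swap (both in MB)
# shared by a given number of jobs. per_job_mb is a hypothetical helper.
per_job_mb() {
    awk -v ram="$1" -v swap="$2" -v jobs="$3" \
        'BEGIN { printf "%.0f\n", (ram + swap) / jobs }'
}

per_job_mb 500 1500 3    # decimal reading: 0.5 GB + 1.5 GB -> 667
per_job_mb 512 1536 3    # binary reading: 512 MB + 1536 MB -> 683
```

Under either reading, the result stays above the 600 MB per-job virtual memory limit.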

So all queues have hard limits:

600 MB virtual memory limit,
150 MB RSS (Resident Set Size: the maximum amount of process memory kept in RAM at once; the rest is swapped out);
a load threshold of 2.2, which prevents more than three jobs from running simultaneously on each node, to guarantee the hard memory limits.

The job scheduler dispatches a job to the least loaded queue. To keep a job from being killed for exceeding a queue's CPU time limit, either set a CPU time limit at submission or name a suitable queue explicitly with the `-q' option:

> cd /okas/filin/test
> qsub -q long loop.sh

or with the batch node named explicitly:

> cd /okas/filin/test
> qsub -q long@okaf002 loop.sh
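
The other option mentioned above, setting a CPU time limit at submission, could be sketched as follows. The helper qsub_with_cpu_limit is hypothetical and only prints the command line instead of running it; h_cpu is the resource used in the example script earlier in this tutorial, and the 2-hour value is an arbitrary illustration.

```shell
#!/bin/sh
# Print (do not run) a qsub command line with an explicit CPU time limit.
# qsub_with_cpu_limit is a hypothetical helper, not an SGE command.
qsub_with_cpu_limit() {
    hours=$1     # CPU time limit in whole hours
    script=$2    # job script to submit
    echo "qsub -l h_cpu=${hours}:0:0 ${script}"
}

qsub_with_cpu_limit 2 loop.sh
```

Run the printed command from your /okas directory once the limit looks right.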

The schedule interval is set to 5 seconds, so jobs are dispatched one by one every 5 seconds. The decay time is set to 3 minutes.

Each job is given a local temporary directory with a unique name, passed in the TMP environment variable (equal to TMPDIR). The total size of the files a job places in this directory cannot exceed 9.5GB.

By default, the job's working directory is the directory the job was started from, so STDOUT and STDERR are saved to files in that directory.
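
A job body that uses the per-job temporary directory for scratch files and keeps only the final result in the working directory might look like the sketch below. The file names are made up for illustration, and the fallback to /tmp exists only so that the script can be tried outside SGE, where TMPDIR may be unset.

```shell
#!/bin/sh
# Sketch: do scratch work in the per-job temporary directory, then copy the
# result back to the working directory. File names are illustrative.
scratch="${TMPDIR:-/tmp}/scratch.$$"
workdir=$(pwd)
mkdir -p "$scratch"

# Placeholder for the real computation: write an intermediate file.
echo "intermediate data" > "$scratch/part.dat"

# Keep only the final result; remove the scratch files afterwards.
cp "$scratch/part.dat" "$workdir/result.dat"
rm -rf "$scratch"
echo "result saved to $workdir/result.dat"
```

Keeping intermediate files out of the working directory also reduces load on the file server.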

Use qstat to list information about jobs and queues:

> qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
61 0.56000 loop.sh    filin        r     12/21/2004 23:10:50 long@okaf002.ihep.su               1        

> qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
long@okaf001.ihep.su           BIP   0/3       0.00     lx24-x86      
----------------------------------------------------------------------------
long@okaf002.ihep.su           BIP   0/3       0.00     lx24-x86      
----------------------------------------------------------------------------
long@okaf003.ihep.su           BIP   0/3       0.00     lx24-x86      
----------------------------------------------------------------------------
short@okaf001.ihep.su          BIP   0/1       0.00     lx24-x86      
----------------------------------------------------------------------------
short@okaf002.ihep.su          BIP   0/1       0.00     lx24-x86      
----------------------------------------------------------------------------
short@okaf003.ihep.su          BIP   0/1       0.00     lx24-x86      
...
> qstat -g c
CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE  
-------------------------------------------------------------------------------
long                              0.00      0      9      9      0      0 
short                             0.00      0      3      3      0      0
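
The plain qstat listing is easy to post-process. The sketch below counts jobs in a given state by matching the fifth column; count_state is a hypothetical helper, and the sample text is the one-job listing shown above, embedded so that the example also runs away from the cluster.

```shell
#!/bin/sh
# Count jobs in a given state from plain `qstat` output read on stdin.
# count_state is a hypothetical helper, not an SGE command.
count_state() {
    # $1 = state to match against the "state" column (field 5);
    # the first two lines (header and separator) are skipped.
    awk -v st="$1" 'NR > 2 && $5 == st { n++ } END { print n + 0 }'
}

sample='job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
61 0.56000 loop.sh    filin        r     12/21/2004 23:10:50 long@okaf002.ihep.su               1'

printf '%s\n' "$sample" | count_state r    # running jobs: prints 1
printf '%s\n' "$sample" | count_state s    # suspended jobs: prints 0
```

On the cluster, the same function can be fed real output: qstat | count_state r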

Documentation

In addition to the SGE manual:

> man sge_intro

there is a User's Guide in PDF:

> gv $SGE_ROOT/../../docs/UsersGuide.pdf

It states that the job status is one of:

d(eletion),
t(ransfering),
r(unning),
R(estarted),
s(uspended),
S(uspended),
T(hreshold),
w(aiting),
h(old);

"The state d(eletion) indicates that a qdel(1) has been used to
initiate job deletion. The states t(ransfering) and r(unning) indicate
that a job is about to be executed or is already executing, whereas
the states s(uspended), S(uspended) and T(hreshold) show that an
already running jobs has been suspended. The s(uspended) state is
caused by suspending the job via the qmod(1) command, the
S(uspended) state indicates that the queue containing the job is
suspended and therefore the job is also suspended and the
T(hreshold) state shows that at least one suspend threshold of the
corresponding queue was exceeded (see queue_conf(5)) and that the job
has been suspended as a consequence. The state R(estarted) indicates
that the job was restarted. This can be caused by a job migration or
because of one of the reasons described in the -r section of the
qsub(1) command.

The states w(aiting) and h(old) only appear for pending jobs. The
h(old) state indicates that a job currently is not eligible for
execution due to a hold state assigned to it via qhold(1), qalter(1) or
the qsub(1) -h option or that the job is waiting for completion of
the jobs to which job dependencies have been assigned to the job via
the -hold_jid option of qsub(1) or qalter(1)."
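
When reading qstat output, the single-letter job states listed above can be expanded with a small lookup such as the sketch below. decode_job_state is a hypothetical helper and handles one letter at a time; it does not attempt to split a state column that contains several letters.

```shell
#!/bin/sh
# Expand a single-letter job state into the name used in the list above.
# decode_job_state is a hypothetical helper, not an SGE command.
decode_job_state() {
    case "$1" in
        d) echo "deletion" ;;
        t) echo "transfering" ;;
        r) echo "running" ;;
        R) echo "Restarted" ;;
        s) echo "suspended (via qmod)" ;;
        S) echo "Suspended (queue suspended)" ;;
        T) echo "Threshold (suspend threshold exceeded)" ;;
        w) echo "waiting" ;;
        h) echo "hold" ;;
        *) echo "unknown state: $1" ;;
    esac
}

decode_job_state r
decode_job_state T
```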

The queue state is one of the following, or a combination thereof:

u(nknown) if the corresponding sge_execd(8) cannot be contacted,
a(larm),
A(larm),
C(alendar suspended),
s(uspended),
S(ubordinate),
d(isabled),
D(isabled),
E(rror);

The status of the parallel task is one of:

r(unning),
R(estarted),
s(uspended),
S(uspended),
T(hreshold),
w(aiting),
h(old),
x(exited);
