
Tutorial for OKA cluster users

by portal administrator last modified July 5, 2006 at 18:37

environment setup, usage policy

Sun Grid Engine - a facility for executing UNIX jobs on remote machines

According to the SGE site:
"The Grid Engine project is an open source community effort to facilitate the adoption of distributed computing solutions. Sponsored by Sun Microsystems and hosted by CollabNet, the Grid Engine project provides enabling distributed resource management software for wide ranging requirements from compute farms to grid computing."
In the local installation, SGE is a batch system developed by Sun and released as open source under the SISSL license. It provides standard batch-monitor facilities and user utilities.

Environment Setup

For bash/zsh login shell add to $HOME/.profile:
> export SGE_ROOT=/okas/sgeadmin/sge/root
> . $SGE_ROOT/OKA/common/settings.sh
For csh/tcsh login shell add to $HOME/.login:
> setenv SGE_ROOT /okas/sgeadmin/sge/root
> source $SGE_ROOT/OKA/common/settings.csh
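To check that the environment was set up correctly, log in again and verify that the SGE variables and utilities are visible (a quick sanity check; no job is submitted):
> echo $SGE_ROOT
> which qsub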


If you aren't familiar with batch submission systems, it is highly recommended to read the sge_intro man page first:
> man sge_intro
Basic commands provided by SGE are:
 qsub submit a batch job to Grid Engine
 qstat show the status of Grid Engine jobs and queues
 qdel delete Grid Engine jobs from queues
 qmon GUI front-end to user's and administrator's utilities
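A minimal session with these commands might look as follows (the sleep binary stands in for a real job, and the job ID passed to qdel is illustrative):
> qsub -b y /bin/sleep 60
> qstat
> qdel 61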
According to the qsub manual:
"qsub  submits batch jobs to the Grid Engine queuing system. Grid Engine
supports single- and multiple-node jobs. Command can be  a  path  to  a
binary or a script (see -b option) which contains the commands to be run
by the job using a shell (for example, sh(1) or csh(1)).  Arguments  to
the  command are given as command_args to qsub .  If command is handled
as a script then it is possible to embed flags in the script.   If  the
first two characters of a script line either match `#$' or are equal
the prefix string defined with the -C option described below, the  line
is parsed for embedded command flags (man qsub for more info):
> cat > test_script.csh

# Which account to be charged cpu time
#$ -A santa_claus

# date-time to run, format [[CC]yy]MMDDhhmm[.SS]
#$ -a 12241200

# set memory and job CPU time limits to 128MB and 5 hours respectively,
# man queue_conf ("RESOURCE LIMITS" section) to list all available
# parameters

#$ -l h_vmem=128M,h_cpu=5:0:0

# If I run on dec_x put stderr in /tmp/foo, if I
# run on sun_y, put stderr in /usr/me/foo (-o for stdout, by default
# stderr and stdout are put into home dir of user on execution host)
#$ -e dec_x:/tmp/foo,sun_y:/usr/me/foo

# Send mail to these users
#$ -M santa@heaven,claus@heaven

# Mail at beginning/end/on suspension
#$ -m bes

# Export these environment variables
# (to export all environment variables use the `-V' option instead)
#$ -v PVM_ROOT,FOOPATH=/usr/local/foo

> qsub test_script.csh
Place and run executables only from your file server directory /okas/<your directory>; user home directories are not visible on the batch nodes in the current cluster configuration:
> cd /okas/filin/test
> qsub test_script.csh
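Flags embedded in the script can also be overridden on the qsub command line at submission time; for example, to tighten the CPU time limit for one run (the value here is illustrative):
> qsub -l h_cpu=0:30:0 test_script.csh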


At users' request some extra commands were created to monitor the system and processes on the batch nodes. The names of the commands start with the character Q. Here is the list of commands:
Qcat, Qchattr, Qchgrp, Qchmod, Qcp, Qfree, Qgrep, Qkill, Qls,
Qmv, Qps, Qpstree, Qrm, Qstat, Qtop (in batch mode), Qvmstat
The commands are equivalents of commonly used UNIX utilities and reproduce their behaviour and options. Use the -h option to get help and the -l option to get the list of available batch nodes. They are safe to use because all of them run under the uid and gid of the user who started them, so a user can damage only his or her own jobs and their environment. Examples of usage:
> Qps afx @okaf003
> Qtop -p 26971 @okaf001
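As with the standard utilities, `-h' prints usage information and `-l' lists the available batch nodes, for example:
> Qps -h
> Qfree -l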
If you have any comments, or proposals to extend the list of commands, contact the cluster administrator.

Usage Policy

There are two common queues on the cluster:
 short: high priority, CPU time limit 3 hours (a job in this queue suspends jobs in the `long' and personal queues on the same node), nice 2, 1 job slot per host. The queue is to be used for debugging and short-time tests; users using the queue for other purposes will be removed from the list of cluster users.
 long: low priority, no CPU time limit, nice 15, 1 job slot per host.
Some users have personal queues, which they use instead of the `long' queue. In any case, every user has permission to run one long job on each node, either via the `long' queue or via a personal queue. The personal queues are:
 roma: nice 15, 1 job slot per host
 kolosov: nice 15, 1 job slot per host
 (the remaining personal queues have the same parameters: nice 12-15, 1 job slot per host)
Personal queues also provide some extra flexibility in the job submission policy. Consider the following case:
  1. each node can run up to three jobs simultaneously,
  2. long queue has one job slot on every node;
So if a user submits many more jobs than the overall number of slots (the number of nodes multiplied by the number of slots on each node), then the jobs of all other users who submitted after the first user have to wait until the first user's jobs finish. Personal queues provide a way to overcome this limitation. Ideally we would like to have the following situation:
  1. each user has a personal queue,
  2. each personal queue has one slot on each node,
  3. the number of slots on each node is equal to the number of users;
In that case:
  1. every user can use all nodes simultaneously, i.e. can load the whole cluster, so no node sits idle when only one user wants to use the cluster,
  2. every user can run at any time at least as many jobs as there are nodes;
Due to hardware limitations (the memory available on each node), the number of jobs running simultaneously is limited to three, because:
( 512Mb RAM + 1.5Gb swap ) / 3 ≈ 2000Mb / 3 ≈ 667Mb
so roughly 667Mb is the maximum memory available to one job.

So all queues have hard limits:

  1. 600Mb virtual memory limit,
  2. 150Mb RSS limit (Resident Set Size: the maximum amount of process memory kept in RAM at once; the rest is swapped out),
  3. a load threshold of 2.2, which prohibits running more than three jobs on each node simultaneously, to guarantee the hard memory limits.
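For example, to request resources that fit within these hard limits at submission time (the values here are illustrative):
> qsub -l h_vmem=500M,h_cpu=2:0:0 test_script.csh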
The job scheduler submits a job to the least loaded queue, so you must either set a CPU time limit during job submission (to avoid the job being killed for exceeding a queue's CPU time limit) or point to a suitable queue explicitly with the `-q' option:
> cd /okas/filin/test
> qsub -q long test_script.csh
or with batch node name pointed explicitly:
> cd /okas/filin/test
> qsub -q long@okaf002 test_script.csh
The scheduling interval is set to 5 seconds, so submission is performed job by job every 5 seconds. The load decay time is set to 3 minutes.

Each job is provided with a local temporary directory with a unique name, which is passed in the TMP environment variable (equal to TMPDIR). The overall size of the files placed in the directory by a job cannot exceed 9.5GB.
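A sketch of a job script that works in this directory and copies the results back to the file server afterwards (my_program and the output file are hypothetical):
> cat > tmp_test.csh

# run inside the per-job temporary directory (TMPDIR is set by SGE),
# then copy the results back to the file server
cd $TMPDIR
/okas/filin/test/my_program > out.dat
cp out.dat /okas/filin/test/

> qsub tmp_test.csh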

By default the job's working directory is set to the directory the job was submitted from, so STDOUT and STDERR are saved to files in that directory.
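To place them elsewhere, use the `-o' and `-e' options of qsub (the paths here are illustrative):
> qsub -o /okas/filin/test/job.out -e /okas/filin/test/job.err test_script.csh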

Use qstat to list information about jobs and queues:
> qstat
job-ID  prior    name  user   state  submit/start at      queue  slots  ja-task-ID
    61  0.56000        filin  r      12/21/2004 23:10:50         1

> qstat -f
queuename            qtype used/tot. load_avg arch     states
----------------------------------------------------------------------------
<queue@node>         BIP   0/3       0.00     lx24-x86
----------------------------------------------------------------------------
<queue@node>         BIP   0/3       0.00     lx24-x86
----------------------------------------------------------------------------
<queue@node>         BIP   0/3       0.00     lx24-x86
----------------------------------------------------------------------------
<queue@node>         BIP   0/1       0.00     lx24-x86
----------------------------------------------------------------------------
<queue@node>         BIP   0/1       0.00     lx24-x86
----------------------------------------------------------------------------
<queue@node>         BIP   0/1       0.00     lx24-x86
> qstat -g c
CLUSTER QUEUE   CQLOAD  USED  AVAIL  TOTAL  aoACDS  cdsuE
long            0.00    0     9      9      0       0
short           0.00    0     3      3      0       0
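To see the details of a particular job (resource requests, submit arguments and so on), pass its job ID to `-j' (the ID here is taken from the example above):
> qstat -j 61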


In addition to the SGE manual pages:
> man sge_intro

there is a User's Guide in PDF:

> gv $SGE_ROOT/../../docs/UsersGuide.pdf
There it is stated that the job status is one of:
  • d(eletion),
  • t(ransfering),
  • r(unning),
  • R(estarted),
  • s(uspended),
  • S(uspended),
  • T(hreshold),
  • w(aiting),
  • h(old);
"The  state d(eletion) indicates that a qdel(1) has been used to ini-
tiate job deletion.  The states t(ransfering) and r(unning) indicate
that  a job is about to be executed or is already executing, whereas
the states s(uspended), S(uspended) and  T(hreshold)  show  that  an
already  running  jobs  has been suspended. The s(uspended) state is
caused  by  suspending  the  job  via  the  qmod(1)   command,   the
S(uspended)  state  indicates  that  the queue containing the job is
suspended and therefore the job is also suspended and  the  T(hresh-
old)  state  shows that at least one suspend threshold of the corre-
sponding queue was exceeded (see queue_conf(5)) and that the job has
been  suspended  as  a  consequence. The state R(estarted) indicates
that the job was restarted. This can be caused by a job migration or
because  of  one  of  the reasons described in the -r section of the
qsub(1) command.

The states w(aiting) and h(old) only appear for  pending  jobs.  The
h(old) state indicates that a job currently is not eligible for exe-
cution due to a hold state assigned to it via qhold(1), qalter(1) or
the  qsub(1)  -h option or that the job is waiting for completion of
the jobs to which job dependencies have been assigned to the job via
the -hold_jid option of qsub(1) or qalter(1)."
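For example, a pending job can be put into the h(old) state with the qhold command mentioned above, and released again with its counterpart qrls (the job ID here is illustrative):
> qhold 61
> qrls 61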
The queue state is one of the following, or a combination thereof:
  • u(nknown) if the corresponding sge_execd(8) cannot be contacted,
  • a(larm),
  • A(larm),
  • C(alendar suspended),
  • s(uspended),
  • S(ubordinate),
  • d(isabled),
  • D(isabled),
  • E(rror);
The status of a parallel task is one of:
  • r(unning),
  • R(estarted),
  • s(uspended),
  • S(uspended),
  • T(hreshold),
  • w(aiting),
  • h(old),
  • x(exited);
