Manual for the TORQUE Resource Manager with a Plan-Based Scheduler


In this manual we describe a major extension of the open-source TORQUE Resource Manager. We have replaced the naive scheduler provided in the TORQUE distribution with a complex scheduling system that plans job execution ahead and predicts the behavior of the system. It is based on the application of a job schedule, which represents the jobs' execution plan. Such functionality is very useful, as users can consult the plan to see when and where their jobs will be executed. Moreover, created plans can easily be evaluated in order to identify possible inefficiencies. Repair actions can then be taken immediately and the inefficiencies fixed, producing better schedules with respect to the considered criteria.

This manual expects that you are familiar with the original TORQUE Admin Manual available at http://www.clusterresources.com/torquedocs21/. First we describe the principle of the plan-based approach. Next, this manual describes the differences from the TORQUE Resource Manager used in the Czech National Grid - MetaCentrum. Another part deals with the installation and setup of this software. We also describe how to submit a job to the plan-based scheduler. Finally, for those who want to know more about plan-based scheduling, this manual describes how the scheduler works.

About Plan-Based Scheduling

We have developed a working implementation of a job schedule that can be used to plan job execution on one or more computing sites such as computer clusters. The schedule is created subject to dynamically arriving jobs and is then used to schedule jobs on the available computing resources. The schedule is continuously maintained in order to remain consistent with the changing situation in the system; all important events such as job arrivals, (early) job completions, or machine failures and restarts are reflected in order to keep the schedule up to date. Moreover, users can now query the scheduler to get information from the job schedule: they can ask the scheduler when and where their jobs will be executed and the scheduler provides them such a prediction according to the current job schedule. Also, several evaluation criteria have been implemented, allowing the use of schedule optimization techniques.

The usual schedulers are so-called queueing systems: they typically follow the queue-based scheduling approach, using one or more incoming queues where jobs are stored until they are scheduled for execution. A scheme of a basic (local) queueing system is shown in the following figure (left).

Plan-based systems represent a significantly different approach from queue-based solutions. Plan-based systems use a job schedule, a more complex data structure that maps jobs onto the available machines in time. This schedule de facto represents a plan of future job execution. In the same fashion as, e.g., backfilling algorithms require information about expected job runtime, plan-based solutions also need such information to construct the schedule. A scheme of a general plan-based system is shown in the figure (right).

There are no incoming queues that would store the jobs (actually there is one default queue, but the scheduler ignores it). Instead, a two-dimensional data structure is built that assigns each job its own space and time slot on given machine(s). The x-axis represents the system time and the y-axis represents a particular machine on a particular cluster. There are other aspects that differentiate the design of plan-based systems from queue-based systems. The major difference is that a nontrivial scheduling decision must be taken every time a new job arrives. Such an immediate decision is necessary to find a suitable place for the job in the schedule. Queue-based methods usually make such a decision when the job is selected for execution, i.e., at the "last possible moment" (when machine(s) become available). An exception are those queue-based methods that use reservations. When machine(s) become available, scheduling decisions are trivial for plan-based methods: at that point, the scheduler simply sends to such machine(s) those jobs that are stored in the schedule at the corresponding coordinates.


Planbased-queuebased systems.png

Optimization

One of the advantages of the plan-based approach is that it is possible to evaluate a plan and to decide which of two plans is better. In order to use this advantage we regularly launch an optimization algorithm. This algorithm optimizes the plan with respect to several criteria together. The criteria are the following:

  • Response time - The avg. response time represents the average time a job spends in the system, i.e., the time from its submission to its termination.
  • Wait time - The avg. wait time is the mean time that the jobs spend waiting before their execution starts.
  • Bounded slowdown - The avg. bounded slowdown is the mean value of all jobs' bounded slowdowns. Slowdown is the ratio of the actual response time of the job to the response time if executed without any waiting. If a job has a very small runtime it often means that the job ended prematurely due to some error. As a result, its slowdown can be huge, which may seriously skew the final mean value. Therefore, bounded slowdown is applied, where the minimal job runtime is guaranteed to be greater than some predefined threshold, e.g., 10 seconds.
  • User-to-user fairness - This criterion is optimized by minimizing the mean and standard deviation of Normalized User Wait Time (NUWT). For a given user, NUWT is the total user wait time divided by the amount of previously consumed system resources by that user. Multiple consumed resources are reflected when calculating NUWT.

All the criteria are to be minimized.
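
For reference, the criteria can be written down formally. The following is a sketch in the notation commonly used in the scheduling literature (it is not copied from the scheduler's source code); a_j, s_j, c_j and p_j denote the arrival time, start time, completion time and runtime of job j, n is the number of jobs, and the 10-second threshold matches the example above:

% Sketch of the optimized criteria; J_u denotes the jobs of user u and
% R_u the amount of resources previously consumed by user u.
\begin{align*}
  \text{avg. response time}    &= \frac{1}{n} \sum_{j=1}^{n} (c_j - a_j) \\
  \text{avg. wait time}        &= \frac{1}{n} \sum_{j=1}^{n} (s_j - a_j) \\
  \text{avg. bounded slowdown} &= \frac{1}{n} \sum_{j=1}^{n}
      \max\!\left(1,\ \frac{c_j - a_j}{\max(p_j,\ 10)}\right) \\
  \mathit{NUWT}_u              &= \frac{\sum_{j \in J_u} (s_j - a_j)}{R_u}
\end{align*}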

Local search is used for the optimization. Local search is a heuristic algorithm which improves a solution step by step. A run of the local search algorithm starts from an initial solution and improves it iteratively: in each iteration the algorithm generates a candidate change and decides whether this change is accepted or declined. In our solution, an improvement is always accepted.
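
To illustrate the principle, the following self-contained C toy performs exactly this accept-if-better loop. The "schedule" is just a permutation of jobs executed one after another on a single machine and the criterion is their average wait time; this is an illustration only, not the scheduler's actual code:

/* Minimal hill-climbing sketch of the described local search.
 * Everything here is a toy stand-in for the real scheduler. */
#include <stdio.h>
#include <stdlib.h>

#define NUM_JOBS 8

/* Estimated runtimes of eight toy jobs run one after another. */
static const int runtime[NUM_JOBS] = {5, 3, 8, 2, 7, 1, 4, 6};

/* Toy criterion to be minimized: average wait time when the jobs
 * run sequentially in the order given by the permutation. */
static double evaluate(const int order[NUM_JOBS])
{
    double wait = 0.0, now = 0.0;
    for (int i = 0; i < NUM_JOBS; i++) {
        wait += now;              /* job waits until its predecessors end */
        now += runtime[order[i]];
    }
    return wait / NUM_JOBS;
}

int main(void)
{
    int order[NUM_JOBS];
    for (int i = 0; i < NUM_JOBS; i++)
        order[i] = i;                        /* initial solution: FCFS order */

    double best = evaluate(order);
    for (int iter = 0; iter < 1000; iter++) {       /* cf. OPTIM_CYCLES_LIMIT */
        int a = rand() % NUM_JOBS, b = rand() % NUM_JOBS;
        int tmp = order[a]; order[a] = order[b]; order[b] = tmp; /* random move */
        double candidate = evaluate(order);
        if (candidate < best) {
            best = candidate;                        /* improvement: accept   */
        } else {
            tmp = order[a]; order[a] = order[b]; order[b] = tmp; /* decline: undo */
        }
    }
    printf("optimized avg. wait time: %.2f\n", best);
    return 0;
}

Because a swap is kept only when it lowers the average wait time, this toy converges towards the shortest-job-first order; the real scheduler optimizes all the criteria listed above at once.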

The following figure describes the applied metaheuristic:

Optimization.png

Differences from the TORQUE scheduler in MetaCentrum

The whole scheduler is different and some features are not supported yet.

Unsupported features

  • The plan-based scheduler is not able to handle admin jobs.
  • There is no P2P support, as opposed to the scheduler used in MetaCentrum.
  • The plan-based scheduler doesn't check whether a node has the "no_multinode_jobs" property set to "true". This property can forbid multinode jobs on a given node.
  • Cloud jobs are also not supported.
  • Suspend and resume of jobs (preemption of jobs) is not supported yet.
  • The plan-based scheduler doesn't check whether the user's account is present on the target node.


Unsupported qsub options

  • If you use the -a option in the qsub command, the job will not be added to the schedule until it becomes eligible for execution.
  • Job priority is not supported. This means that it is useless to use the -p option in the qsub command.
  • It is possible to ask for licenses, but the scheduler cannot guarantee that the licenses will be available at the planned start time (there is no reservation service for allocating licenses). The job will wait and run once all the licenses are available (possibly after the planned start time).
  • It is not possible to use complex submit arguments. The scheduler is not capable of handling the following submit arguments:
... -l nodes=1:ppn=1+2:ppn=4
... -l nodes=1:ppn=4:manwe3.ics.muni.cz
  • It is not possible to ask for a specific node or exclude a specific node.
  • Planning scratch space is not supported; the scheduler cannot guarantee that scratch will be available at the planned start time.
  • Some differences occur when you use the qsub option "-W additional_attributes". It is not possible to use some of the options for dependent jobs:
synccount:count 
syncwith:jobid
on:count
  • On the other hand, the following options work, but the job is not added into the schedule until the dependency is met.
after:jobid[:jobid...]
afterok:jobid[:jobid...]
afternotok:jobid[:jobid...]
afterany:jobid[:jobid...]
before:jobid[:jobid...]
beforeok:jobid[:jobid...]
beforenotok:jobid[:jobid...]
beforeany:jobid[:jobid...]

Compilation, installation and setup

On Debian and Ubuntu you need to install the following packages first:

apt-get install krb525 libtool

Once you have unpacked the source code you can customize the scheduler. Edit:

/src/torque/src/scheduler.meta/sched_config.h
  • The option OPTIM_CYCLES_LIMIT is an expression which determines how many attempts to find an improvement will run in one optimization cycle. num_queued is the number of jobs in the schedule which are waiting to run.
#define OPTIM_CYCLES_LIMIT (num_queued * 40)
  • The option OPTIM_TIMEOUT is a timeout in seconds which determines how often the optimization will run.
#define OPTIM_TIMEOUT 30
  • The option OPTIM_DURATION_LIMIT is a limit in seconds; once it is exceeded, the optimization is stopped regardless of OPTIM_CYCLES_LIMIT.
#define OPTIM_DURATION_LIMIT 10
  • The option OPTIM_MINIMAL_QUEUED specifies how many jobs have to be waiting to run in order for the optimization to take place.
#define OPTIM_MINIMAL_QUEUED 3

Once you are finished you can compile TORQUE with the following commands:

cd torque
autoreconf -ivf
mkdir _presunuto
cd magrathea/pbs_cache
make
cd ../..
mv magrathea/pbs_cache/api.h _presunuto/api.h
mv magrathea/pbs_cache/libpbscache.a _presunuto/libpbscache.a

If you are going to use TORQUE without Kerberos you should use the following CFLAGS:

CFLAGS="-D_GNU_SOURCE -ggdb3 -std=gnu99 -D_XOPEN_SOURCE=700"  CXXFLAGS="-D_GNU_SOURCE -ggdb3" ./configure --with-debug --with-rcp=scp --with-pbs-cache=`pwd`/_presunuto --enable-maxdefault --enable-meta-fork --prefix=/usr --disable-syslog --disable-gcc-warnings

or, if you want to use Kerberos:

CFLAGS="-D_GNU_SOURCE -ggdb3 -std=gnu99 -D_XOPEN_SOURCE=700"  CXXFLAGS="-D_GNU_SOURCE -ggdb3" ./configure --with-debug --with-rcp=scp --with-pbs-cache=`pwd`/_presunuto --enable-maxdefault --enable-meta-fork --prefix=/usr --enable-syslog --disable-gcc-warnings --with-gssapi_bin=/usr/bin/ --with-gssapi_lib=/usr/lib/heimdal --with-gssapi_inc=/usr/include/heimdal

and then run the compilation:

make

after that you can install TORQUE:

make install

or prepare packages and install them:

make packages

The next step is to set up the TORQUE server and MOMs. Please use only one queue. This procedure is described in the standard TORQUE Admin Manual, but there is one more thing to do: you need to set up the scheduler's server. Please edit the following file and write the PBS server's FQDN on the appropriate line:

/var/spool/torque/sched_priv/sched_config

How to submit a job

Once you have finished the setup and the pbs_server, pbs_scheduler and pbs_moms are running, you can try to run a job.

  • ATTENTION! Never specify a queue when you submit a job. There is only one queue, the default one. The scheduler ignores the queues (the exception being priority queues dedicated to user groups according to explicit agreements).
  • ATTENTION! Always specify walltime and mem when you submit a job. This is mandatory for the plan-based scheduler.

Only simple node specifications like the following are possible:

qsub -l nodes=1:ppn=1:mem=409600kb,walltime=1000
qsub -l nodes=2:ppn=1:mem=40gb,walltime=10d
qsub -l nodes=1:ppn=2:mem=800mb,walltime=36h

It is also possible to use an interactive job:

qsub -l nodes=1:ppn=2:mem=4gb,walltime=2h -I

The nodes are organized into clusters according to their properties; all nodes within a cluster have the same properties. You can request specific properties, for example linux and x86:

qsub -l nodes=1:ppn=1:linux:x86:mem=409600kb,walltime=1000


In this version of TORQUE the qstat command has been modified. Once the scheduler knows when and where a job will run, you can list this information using the qstat command. There are two new lines: "planned_start", which says when the job will run, and "planned_nodes", which says where the job will run. An example of the command output is shown in the following figure:

Qstat.png

We also use a graphical interface for displaying the schedule overview (http://metavo.metacentrum.cz/schedule-overview/), but it is a more complex tool and its description is beyond the scope of this manual. An example is shown in the following figure:

Sc-schedule-overview.png

How the scheduler works

In order to describe how the scheduler works, we first need to describe the data structure used for the plan. In reality, the Grid typically consists of one or more sites such as computer clusters. For each cluster, a separate instance of the schedule is created. For simplicity, let us assume that the Grid consists of a single cluster, so there is only one schedule instance. Intuitively, the schedule defines when and where jobs will be executed, using a two-dimensional rectangular representation in which each job is assigned a set of CPUs and a time interval. Here the x-axis represents the time and the y-axis represents the CPUs of the system. An example of such a schedule is shown in the figure (left).

Schedule plan-based-scheduler.png

The figure (middle and right) depicts the representation of the intuitive schedule. The schedule is represented as a linear list of jobs (job_list) and a list of so-called gaps (gap_list) that represent unused computational time. Gaps and jobs in these lists are ordered according to their expected start times. Each job in the list stores its start time, completion time, used CPUs, etc.; simply put, the complete information is stored in a single cell of the list. All these parameters are computed when the job is added into the schedule. Of course, as the schedule can change dynamically in time, the parameters can later be updated according to the new situation.

Gaps represent unused periods of CPU time and the amount of free RAM across the nodes within a cluster. A gap appears every time the number of available CPUs in the existing schedule is greater than the number of CPUs requested by the job(s) in the given time period. Similarly, a gap can appear when the amount of requested RAM is higher than what is currently available in the given time period, so that a job cannot start even though CPUs are free. Similar to jobs, gaps hold information about their start time, duration, usage and free RAM. Here, the usage expresses the number of available (idle) CPUs in the gap; an array of these available CPUs is stored, each with a pointer to the appropriate node and that node's free RAM. Moreover, each gap has a pointer to the nearest following job. If there is no following job (a gap at the end of the plan), this pointer is set to null. Two successive gaps in the gap_list either have different usage, contain different nodes, contain a different amount of free RAM for a given node, or the completion time of the earlier gap is smaller than the start time of the later gap. Otherwise, such gaps are merged into a single, longer gap.
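
The following sketch shows how such a representation might be laid out in C. All type and field names here are illustrative only; the real scheduler's source code differs in detail:

/* Illustrative sketch of the schedule representation described above;
 * not a transcript of the scheduler's actual structures. */
#include <time.h>

struct node;                     /* a physical node within the cluster */

struct cpu_slot {
    struct node *node;           /* node providing this idle CPU       */
    long         free_ram;       /* node's free RAM during the gap     */
};

struct job {
    time_t       start;          /* planned start time                 */
    time_t       completion;     /* planned completion time            */
    int          num_cpus;       /* number of CPUs assigned to the job */
    struct job  *next;           /* job_list, ordered by start time    */
};

struct gap {                     /* an unused period of CPU time       */
    time_t           start;
    time_t           duration;
    int              usage;      /* number of available (idle) CPUs    */
    struct cpu_slot *cpus;       /* array of 'usage' available CPUs    */
    struct job      *next_job;   /* nearest following job, or NULL     */
    struct gap      *next;       /* gap_list, ordered by start time    */
};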

New job arrival

A backfill-like procedure is used to build the schedule, i.e., to add a newly arriving job into the currently known schedule. It finds the earliest suitable gap for the new job. Here the applied data representation is a major benefit, as all gaps in the current schedule are stored in a separate list (gap_list), which speeds up the whole procedure. Two or more adjacent gaps can be used for the new job if a single gap is not large enough. In such a case the following constraints must be met: all gaps must have at least the same usage and available RAM as the job requires, the completion time of each preceding gap must be equal to the start time of the next gap, and the duration of the job must be less than or equal to the total duration of the considered gaps. This "gap-filling" approach is very useful as it significantly increases system utilization while respecting the start times of previously added jobs.
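
A simplified version of this gap-filling search is sketched below. The gap structure is flattened here (a single usage and free RAM value per gap) and real code must also match node properties, but the adjacency and duration checks follow the constraints listed above:

/* Simplified sketch of the backfill-like gap search described above;
 * illustrative only, not the scheduler's actual code. */
#include <stddef.h>
#include <time.h>

struct gap {
    time_t      start, duration;
    int         usage;           /* idle CPUs available in this gap    */
    long        free_ram;        /* free RAM available in this gap     */
    struct gap *next;            /* gap_list, ordered by start time    */
};

/* Return the first gap of the earliest run of adjacent gaps that can
 * hold a job needing 'cpus' CPUs and 'ram' RAM for 'walltime' seconds,
 * or NULL when no such run exists. */
struct gap *find_earliest_fit(struct gap *gap_list,
                              int cpus, long ram, time_t walltime)
{
    for (struct gap *first = gap_list; first != NULL; first = first->next) {
        time_t covered = 0;
        for (struct gap *g = first; g != NULL; g = g->next) {
            if (g->usage < cpus || g->free_ram < ram)
                break;                   /* gap too small for the job  */
            covered += g->duration;
            if (covered >= walltime)
                return first;            /* job can start at first->start */
            /* adjacent gaps only: no hole between g and its successor */
            if (g->next == NULL || g->next->start != g->start + g->duration)
                break;
        }
    }
    return NULL;
}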

Update schedule

Often the schedule must be updated as events such as (early) job completions, machine failures or job arrivals appear. In such a situation the scheduler launches a time-efficient schedule update procedure that updates the internal jobs' parameters in the job_list while recomputing the gap_list as well. Commonly, a job finishes earlier than expected, as the schedule is built using processing time estimates which are typically overestimated. In such a case, the schedule is immediately compressed. The compression shifts jobs into the earlier time slots that may have appeared as a result of the early job completion. During the compression, the relative ordering of job start times is kept.
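
On a single CPU the effect of the compression is easy to demonstrate. The following toy (an illustration, not the scheduler's code) rebuilds the planned start times after an early completion; the jobs shift into earlier slots while their relative order is preserved:

/* Toy demonstration of schedule compression on a single CPU. */
#include <stdio.h>

#define N 4

static void rebuild_starts(int start[N], const int runtime[N])
{
    start[0] = 0;                       /* jobs run back to back on 1 CPU */
    for (int i = 1; i < N; i++)
        start[i] = start[i - 1] + runtime[i - 1];
}

int main(void)
{
    int runtime[N] = {100, 50, 30, 40}; /* user-estimated runtimes (s)    */
    int start[N];

    rebuild_starts(start, runtime);     /* initial plan: 0, 100, 150, 180 */

    runtime[0] = 60;                    /* job 0 completes early (60 s)   */
    rebuild_starts(start, runtime);     /* compression: 0, 60, 110, 140   */

    for (int i = 0; i < N; i++)
        printf("job %d: planned start at %d s\n", i, start[i]);
    return 0;
}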

Publications related to this plan-based scheduler

  • CHLUMSKÝ, Václav, Dalibor KLUSÁČEK and Miroslav RUDA. Planning, Predictability and Optimization within the TORQUE Scheduler. In Antonín Kučera, Thomas Henzinger, Jaroslav Nešetřil, Tomáš Vojnar, David Antoš (eds.). MEMICS 2012. Brno: Novpress s.r.o., 2012, pp. 96-97. ISBN 978-80-87342-15-2.
  • KLUSÁČEK, Dalibor, Václav CHLUMSKÝ and Miroslav RUDA. The Extension of TORQUE Scheduler Allowing the Use of Planning and Optimization in Grids. Computer Science Journal, Krakow, Poland: AGH University of Science and Technology, 2012, vol. 13, no. 2, pp. 5-19. ISSN 1508-2806. doi:10.7494/csci.2012.13.2.5.
  • CHLUMSKÝ, Václav, Dalibor KLUSÁČEK and Miroslav RUDA. The Extension of TORQUE Scheduler Allowing the Use of Planning and Optimization Algorithms in Grids. In Cracow Grid Workshop. 2011.
  • KLUSÁČEK, Dalibor, Václav CHLUMSKÝ and Hana RUDOVÁ. Optimizing User Oriented Job Scheduling within TORQUE. Poster at SuperComputing 2013, Colorado.

Contacts

  • Václav Chlumský, email: vchlumsky@cesnet.cz
  • Dalibor Klusáček, email: xklusac@fi.muni.cz