The following sections all detail different approaches to using some form of prediction to improve the quality of load distribution. Workload classification is a common research topic [CF86, BB94, Raa93], however there has been less research on how it can be applied to load distribution, two examples are discussed in sections 4.2.2 and 4.2.3.
Prediction need not be complex, and in some situations, quite simple schemes can prove useful. If all processes in a system are homogeneous, or at least similar, then a static constant can be used to estimate the resource requirements. The simplest dynamic solution is persistence, where the resources required by a process during the most recent execution are predicted for the next execution. A running average is another simple approach for providing predictions, and can be combined with persistence to favour the values from more recent executions.
The use of averages is a common technique, examples include Bond [Bon93] who used averages of CPU, IO and memory in a prototype of the Simple Task Allocation using Resource Selection (STARS) system, and Osser [Oss92] who used average lifetime to exclude short jobs from remote execution.
Goswami et al. [GDI93] used the state transition prediction mechanism developed by Devarakonda and Iyer [DI89] to provide estimates of process resource use. This estimate is then used to compute the host loads for placement.
The prediction system uses states and state transition probabilities to estimate the future resource requirements of a process.
A k-means
clustering algorithm was used to identify seven high density
clusters in a 3 dimensional space (CPU, memory and IO). The centroid
of each cluster was defined as a state representing process resource
usage. The states in general agree with the categories of process
found in Leland and Ott [LO86], with CPU bound, IO bound and
neutral categories. In particular, an approximate total of 20%
of all process instances were found to fall into the CPU bound
and IO bound categories (roughly 10% each) and an additional
category of heavy memory use was identified that consisted of
around
% of all instances. The processes falling
into the remaining four categories were all light resource consumers.
With the observation that a process may change the state it occupies between separate executions, transition probabilities are defined between states for each program, and the last N program executions are used to estimate the state transition probabilities tPr:
Given the state-transition model for a program, its future resource usage is predicted by:
The program's resource requirements r, are computed by multiplying
each of the transition probabilities
, from the state
occupied during the process's last execution l, with the centroid of each
subcluster
:
where N is the number of clusters (seven).
Goswami et al. argue that conventional load distribution systems require too great an overhead if the load information is communicated frequently, and note that performance of the load distribution falls as the frequency of load updates is reduced. They propose that proper initial placement of processes based on predicted resource requirements will produce better response times with lower communication overheads.
The essential idea is that the load on a host need not be collected, but may be computed from the resource requirements of scheduled processes. This unconventional approach means that the only updates required by the load distribution mechanism are notification of process termination.
The scheduler now takes on the responsibility of providing the load on each host, without measuring the state of the host. This is achieved by treating predictions as `perfect' and recording the estimated resource usage from each placement made to a host. These values are then used to compute the `load' on each host before another placement is made:
where
Goswami et al. developed four load distribution policies, two centralised and two distributed. The centralised policies are sufficient to explain the approach, and as discussion of the distributed policies will not contribute to the explanation, they are omitted.
The first placement policy was a simple scheme that used the computed
CPU load to select the least loaded host. The second policy
estimates the expected response time
that a
will
receive at each
, and then selects the most favourable:
where
is the number of processes on
.
This equation sums the CPU requirements of all processes on
that are smaller than the new process, and for each process larger, the
new process's CPU requirement is summed. The estimates the response
time experienced with a round robin scheduling algorithm.
This is primarily to avoid problems with round robin scheduling and too long a quantum.
Goswami et al. use a prediction system for providing a form of a priori information about processes before execution. The information is used to compute the load on the hosts in the system, rather than obtaining this information from the hosts themselves. No attempt is made to perform any kind of candidate selection, other than a simple age filter in one of the distributed algorithms.
Due to the CPU load computation from equation 4.1, the load estimates will be exaggerated for interactive processes or processes waiting on timed events. This could reduce the load distribution to random, as no checking is performed to ensure that the load estimates are representative of the actual system load. The load computation method also requires that the load distribution is pervasive, as load that is not under direct scheduler control will invalidate the estimates.
Bond [Bon93] also employs classification in STARS to
predict the requirements of a job prior to execution, but exploits this
information to select the best site for executing the
process in a system of heterogeneous
hosts.
The prediction scheme developed for STARS, classifies tasks incrementally and seeks to overcome shortcomings evident in the prediction mechanism from Devarakonda and Iyer [DI89]. In particular, Bond claims that additions to the state transition model require regeneration of the entire model, and this alone makes it unsuitable for incremental updating. In addition, Bond suggests that prediction can be improved by considering additional predictors, such as group and time rather than simply relying on program name.
STARS begins by forming a set of compatible hosts for executing a new job. This set of eligible hosts is then ranked by capability, using a set of weights that reflect the relative importance of resources to the particular job. A typical weight vector for a CPU bound program may consist of:
which gives
weighting to an idle CPU,
weighting
to a powerful processor and
weighting to a minimum load.
These weights are then applied to the resource availability of each
eligible host, and the highest value indicates the most suitable host
for execution of the job.
STARS and the prediction system were not completely integrated, and the system tested in Bond's thesis used moving averages to tailor the resource weights for each job.
In a novel approach, Ju et al. [JXY95] trace a process through the first second of its life, to identify processes suitable for remote execution. If the process is suitable, being non interactive and non IO bound, then the process is terminated and restarted on an appropriate host.
Processes are only traced for one second from the start of execution. Therefore Ju et al. require the assumption that a process will exhibit the same behaviour over its entire lifetime as it did in that first second.
The data collected during the trace period is intended to determine if the process is suitable for remote execution. Ju et al. collect the total memory requirement, the length of all opened files, the total CPU time and the amount of data read and written to memory, and to files. These values are used to estimate the CPU and IO utilisation of the process, and the speed of memory and file access.
The expected lifetime
, of a process, is estimated from the
time required to access all memory pages allocated to the
process and to read all opened files:
If the program has been executed before, then the values obtained during previous executions are available to adjust the lifetime estimate from equation 4.2.
An eligible candidate is preferably CPU bound, has an
expected lifetime
of at least ten seconds and is non interactive.
The process is considered interactive if any of the open
files are stdin
or stdout. The recorded
CPU and IO utilisation is used to classify the process as CPU
(
utilisation) bound, IO (
utilisation) bound
or neutral.
If the candidate is eligible, it is terminated and restarted on the least loaded host.
Cena et al. [CCG95] take a priori information as given, and concentrate on the actual distribution of load to hosts. Although a neural network is used, load distribution is explicit, as the a priori information is provided to rather than by the neural network.
This location selection policy is designed for a heterogeneous
distributed system, and as
a consequence, focuses on the capability of a particular host
to supply resources required by a job.
The prototype system divides available system resources into available memory and available CPU (normalised). Processes are characterised by the required amount of memory and the desired response time. A conventional best fit algorithm, as described in section 8.3.2, is used to allocate processes to the available resources in the following order:
If the memory request can not be satisfied, then the process can not be executed. If however there is insufficient CPU, the request is satisfied with the best available (as the response time is desired, not mandatory).
There is no requirement for the fitting to be implemented by a neural network, indeed, the training set was computed conventionally. However a stated goal of LAHNOS (Local Area Heterogeneous Operating System) for which the scheduler was designed, is to increase the `automation' of the system and therefore a brief description of the implementation is included for completeness.
The distribution algorithm was implemented by training a feedforward neural network using the Back Propagation (BP) learning rule [HN90].
The input vector encodes the requirements of the job and the current availability of the system (restricted to three machines in the prototype system). The output is a single value representing the selected host.