Each component of a load distribution scheme represented in the taxonomy may be dependent on the system for which the load distribution is intended, or may simply be a choice made for performance reasons or personal preference. All, however, can exhibit a wide variety of interesting approaches and unique solutions due to the large design space.
This section is intended both as an overview of the general field of load balancing, and as an exercise of the taxonomy.
The participation policy decides whether any particular host should be involved in a load distribution activity. The reasons for participation can be varied and some approaches are detailed in the following subsections.
A two-level threshold policy has a lower threshold that defines the point at which a host may begin to initiate load distribution to offload any excess load. An upper threshold determines the point at which an overloaded host may start to refuse incoming foreign load.
This gives two levels of participation: a restriction on the minimum amount of local load before any is offloaded, and a symmetric restriction on the minimum amount of local load before incoming load can be refused. Outside these limits a host may only participate as either a receiver or a source of load. One example of this approach is Holzer [Hol96].
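The two-level policy can be sketched as a pair of predicates; the threshold values here are purely illustrative, as the taxonomy does not prescribe any:

```python
# Sketch of a two-level threshold participation policy.
# LOWER and UPPER are hypothetical values, not from any cited system.

LOWER = 2   # minimum local load before a host may offload work
UPPER = 8   # minimum local load before a host may refuse foreign work

def may_send(local_load):
    """A host may initiate offloading only above the lower threshold."""
    return local_load > LOWER

def may_refuse(local_load):
    """A host may refuse incoming foreign load only above the upper threshold."""
    return local_load > UPPER
```

Between the two thresholds a host may both offload its own excess and accept foreign load; outside them it participates only as a receiver (below LOWER) or only as a source (above UPPER).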
Sprite [DO91] has an interesting participation policy, based on the concept of ownership. A fundamental view of the workstation model is that each workstation is owned by the user currently logged in from the console, whom I shall call the primary user. This concept of ownership is considered absolute, and Sprite immediately evicts all foreign processes when activity from the primary user is detected. The intention of this eviction policy is to encourage primary users to share their personal workstation, with little detriment to their own work.
Thus the participation policy is: only hosts that are considered idle (as described in section 7.1), may be selected as a destination for load distribution. This policy goes even further, in that any foreign processes must be migrated from the current host on detection of an active primary user.
The V system described in Theimer et al. [TLC85] also preserves the concept of ownership, but requires the user to manually initiate the eviction of foreign processes.
Another system supporting ownership rights is Butler [Nic87]. Butler does not support migration, and as a consequence, foreign processes are terminated on detection of an active primary user. This rather ruthless policy was seen as the only option in a system with single-user workstations and without migration.
Security can also be a factor in any participation policy. Powell and Miller [PM83] suggest hosts could refuse to accept a migration request if the source host is not trusted, such as one from a different administrative domain.
The location policy is responsible for selecting the source and/or destination hosts between which load is to be transferred. There are four main approaches to determining the location of source and destination hosts.
In a centralised location policy, a single agent is responsible for deciding the source and destination hosts. This system uses load values reported to the agent to determine, say, the least and most loaded hosts and to initiate a load transfer between them. A prime example of the centralised approach is Condor [BLL92], which is based on the workstation model and therefore has ownership constraints.
In Condor, there is a single central agent and a local agent on each host. The central agent is responsible for sharing the resources of the system, and when it discovers an idle host B, it decides which host A may execute processes on it. The central agent then contacts the local agent on A and gives it permission to execute processes on B. After this point the local agents on A and B negotiate the transfer of processes.
The participation of host B is not delegated to the central agent, as host B may still refuse a transfer from host A if the user has returned.
The dictatorial approach is used by Harchol-Balter and Downey [HBD95] in their simulated load distribution scheme. Here a centralised agent periodically identifies the most and least loaded host and orders a migration between them. There are no ownership constraints, and the only way a migration is avoided is if there are no suitable candidate processes on the source host.
In dictatorial policies such as this, it is likely that the participation policy is also delegated to the central agent, which would have to resolve security issues and participation thresholds.
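A dictatorial central agent of the kind just described can be sketched as follows; the candidate test is abstracted into a predicate, and the data structures are illustrative rather than taken from [HBD95]:

```python
# Minimal sketch of a dictatorial location policy: a central agent
# periodically pairs the most and least loaded hosts and orders a
# migration between them, unless the source has no suitable candidate.

def order_migration(loads, has_candidate):
    """loads: dict mapping host -> load value.
    has_candidate: dict mapping host -> bool (any migratable process?)."""
    source = max(loads, key=loads.get)    # most loaded host
    dest = min(loads, key=loads.get)      # least loaded host
    if source == dest or not has_candidate[source]:
        return None                       # nothing suitable to move
    return (source, dest)                 # ordered migration pair
```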
Sender initiated is the most common form of load distribution [Ber85, Rub87, DO91], and occurs when a source host decides to participate in load distribution, perhaps due to its exceeding a local load threshold. In this case, the source host is fixed and the location policy is only required to identify a destination host which will participate. The actual policy may be to locate the least loaded host, or a random host from a pool of idle hosts, but how the information on which the decision is based is obtained is not significant here.
All of the initial placement algorithms implemented in the experimental part of this thesis are sender initiated, and are described in section 8.1. It is therefore redundant to include an example of a sender initiated policy in this section.
Receiver initiated load distribution [LM82, LO86] occurs when a host decides to participate in load distribution, perhaps due to finding itself below a minimum load threshold. In this case, the destination host is fixed as the initiating host, and the location policy is used to find a source host from which load may be transferred. This is a rare form of load distribution, of which early examples are the broadcast idle (BID) and poll when idle (PID) algorithms from Livny and Melman [LM82].
For the BID algorithm, when a host becomes idle, it broadcasts a status message. All hosts with more than one process wait a time inversely proportional to their number of processes before broadcasting a reply. The first host to broadcast a reply is thus the host with the most processes, and it obtains the idle host.
The PID algorithm avoids the broadcast facility by polling a randomly selected subset of hosts. The hosts are polled in order of their selection, and the polling host will either receive a reply consisting of a job or a negative response. If no busy hosts were in the subset, the idle host gives up.
In these two examples, the location selection is completely dependent on the communication mechanism, but the actual location selection policy of the BID algorithm is to select the most loaded host, and that of the PID algorithm is to select a random busy host.
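The selection behaviour of the two algorithms can be sketched without the underlying communication mechanism; the reply delay of BID is simulated by ranking, and PID's polling is abstracted into a predicate (the probe limit is an illustrative parameter):

```python
import random

def bid_select(process_counts):
    """BID sketch: hosts with more than one process reply after a delay
    inversely proportional to their process count, so the most loaded
    host replies first and obtains the idle host."""
    busy = {h: n for h, n in process_counts.items() if n > 1}
    if not busy:
        return None
    # delay ~ 1/n, so the host with the largest n has the smallest delay
    return min(busy, key=lambda h: 1.0 / busy[h])

def pid_select(hosts, is_busy, probe_limit=3, rng=random):
    """PID sketch: poll a random subset of hosts in selection order and
    take the first busy one; give up if none of the polled hosts is busy."""
    subset = rng.sample(hosts, min(probe_limit, len(hosts)))
    for h in subset:
        if is_busy(h):
            return h
    return None   # the idle host gives up
```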
Eager et al. [ELZ86a] compared sender and receiver initiated location policies, and concluded that sender initiated policies are superior during light to medium system loads, and receiver initiated policies are more effective at high system loads. Thus the sender and receiver initiated policies can be combined to take advantage of the different characteristics exhibited at high and low loads.
The Periodic Symmetrically-Initiated (PSI) Algorithm from Benmohammed-Mahieddine and Dew [BMD94] operates in both receiver and sender modes. If the host is above threshold T, sender mode is initiated and a single request is sent to a random host. If an acceptance message is returned, the sender transfers a job to the replying host. If the host is below threshold T, receiver mode is initiated and a single request is sent to a random host. If the contacted host is overloaded, it will return a job to the receiver. If there was no reply to the original request (equivalent to a negative response) for either sender or receiver modes, the algorithm simply gives up and waits until the next period.
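The per-period decision of PSI can be sketched as below; message passing is abstracted into a function call, and the threshold value is illustrative:

```python
# Sketch of one period of the PSI algorithm. A host above threshold T
# acts as a sender, below T as a receiver; in either mode it contacts a
# single random peer and gives up until the next period on failure.

T = 4  # illustrative single load threshold

def psi_step(local_load, peer, send_request):
    """send_request(peer, mode) returns True if the peer accepts a job
    (sender mode) or offers one (receiver mode), False otherwise."""
    if local_load > T:                        # sender mode: try to shed a job
        if send_request(peer, "sender"):
            return "sent"
    elif local_load < T:                      # receiver mode: try to obtain a job
        if send_request(peer, "receiver"):
            return "received"
    return "wait"                             # no reply: wait for the next period
```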
A similar threshold approach is that of Krueger and Shivaratri [KS94], except that rather than random polling, potential senders and receivers consult lists of hosts that have identified themselves as busy or idle. The sender or receiver lists are used to poll candidates for status confirmation, until either a partner is found or a probe limit is reached.
There are other considerations that the location policy may have to contend with. A good load metric is required to enable a suitable host to be selected from a set of eligible destination hosts, see section 2.3.5. The system may also be heterogeneous in performance or configuration, in which case a multiplicity of concerns can intrude on the otherwise simple selection of a destination. Hosts of different power need to be ranked (preferably by a load metric that accounts for this) and if migration is possible between `binary incompatible' hosts as in Theimer and Hayes [TH91] and Bishop et al. [BVW95], then this cost must be weighed against the loads on the hosts that are compatible.
There are three main approaches to selecting candidates for load distribution:
Some systems require a process or object itself to initiate load distribution and in this case, there is no need for a global candidate selection policy. However, if a host condition triggers migration, then a candidate needs to be selected.
The following sections assume that all candidates are eligible for remote execution. In reality there are jobs that are not able to execute remotely, such as those that must access frame buffers or local filesystems. Consider these jobs excluded from the following discussion.
In this policy there is no attempt to filter jobs, and all jobs are considered eligible for load distribution. Examples are the random and least loaded initial placement policies, detailed in sections 8.1.1 and 8.1.2 respectively.
It is clear that the previous method of selection may choose jobs or processes that are characteristically unsuitable for distribution [JXY95]. The most obvious subset of unsuitable jobs and processes are those which do not execute for a sufficient length of time to repay the cost of moving them.
Two ways in which these jobs and processes can be rejected as distribution candidates are:
The previous approach avoided distributing jobs and processes that are poor candidates; however, the criteria can be tightened further so that good candidates are actively selected. Initial placement and process migration, detailed in section 2.3.4, impose different restrictions on the selection of a candidate. Choosing a good candidate may require a great deal of knowledge about the job or process as well as the potential destination hosts. Indeed, it is possible that the location policy is secondary to the candidate policy, as in the Emerald system where Lehrmann [Leh93] moves heavily communicating objects so they reside on the same host.
Migration systems can obtain information about current process behaviour by monitoring each process as it executes, and good candidates can be selected by referring to this behaviour.
Initial placement is restricted to placing the job before execution, and as a consequence is not privy to the future activities of the job. In this case, good candidates can be selected using prediction or classification. This approach will be examined in section 3.1.
Barak and Shiloh [BS85] implemented a set of criteria to select good migration candidates in MOS. Each host periodically runs a daemon that asks processes, in a round robin fashion, if they wish to migrate. The process itself then evaluates whether:
If these factors are favourable, the process agrees to be migrated.
There are systems where a process or object itself may initiate load distribution. The MOS selection scheme described in the previous section is one such system, as it is the process that decides if it wishes to migrate. The only direction from the host is to dictate when the process may consider itself as a candidate, to prevent it from spending too much time considering migration.
Another form of self selection is that used in the original Sprite system. In this system there was no automatic load distribution, and all load distribution was initiated by user level applications. One example is pmake, which uses as many idle workstations as it requires for a parallel version of make.
Both of these approaches fix the candidate (or forked child) and consequently fix the source host. The Chorus Object Oriented Layer (COOL) is a system that emulates a distributed shared address space over a set of loosely coupled hosts, see Lea et al. [LJP93] and Amaral et al. [AJJ+92]. Objects are the logical unit of distribution, and may move between local address spaces. If an object references a remote object, the system has two options:
If the first option is not possible, as when referencing an object representing a system resource, then the second option of migrating the referencing object is taken. In both of these cases, the source and destination hosts are fixed, and the candidate depends on the option.
In addition to the candidate selection criteria outlined above, there is also a `fairness' principle incorporated into some candidate selection policies. Essentially, there are two ways of considering any benefit offered by load distribution:
Greedy implies altruistic, but altruistic need not imply greedy. That is, the migrated process may increase its response time due to the overhead involved in moving it, and never reclaim the loss, but the processes on the host from which it was moved will benefit, reducing their response times. The best situation is for both the greedy and altruistic benefits to occur, but failing that, a load distribution scheme should at the very least provide the altruistic benefit.
Greedy is often embodied as a `fairness' principle as in [HBD95], where a candidate is only eligible if it can expect better service on the new host. However, load distribution may still be worthwhile on a global scale, even if the moved processes must be purely altruistic.
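A fairness test of this kind can be sketched as a comparison of expected response times; how the estimates are obtained is system-specific and assumed given here:

```python
def fair_to_move(expected_local, expected_remote, transfer_cost):
    """'Greedy' fairness test in the spirit of [HBD95]: a candidate is
    eligible only if it can expect better service on the new host, with
    the cost of moving it taken into account. The response-time estimates
    and transfer cost are assumed supplied by the surrounding system."""
    return expected_remote + transfer_cost < expected_local
```

A purely altruistic scheme would omit this test and move the candidate whenever the source host benefits.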
The transfer mechanism physically moves the processes from source to destination host. There has been a great deal of research in this area, including Artsy and Finkel [AF89], Powell and Miller [PM83], Douglis [Dou90] and Theimer [TLC85], concentrating on issues of transparency, residual dependencies, performance and complexity. These issues are rooted in the design and implementation of the mechanism and are outside the scope of this thesis, as are the arguments supporting the different approaches. The significant aspect of the transfer mechanism in terms of this thesis is the stage at which load distribution may occur.
Distribution of workload can occur in either of two phases: tasks can be allocated to a processor before they begin execution, or they can be moved after they have begun execution. These two phases are known as Initial Placement and Process Migration respectively and are in principle independent.
Each of the two phases offers different opportunities to the policy components of load distribution. Policies that utilise migration can monitor the runtime behaviour of processes and make decisions that may involve moving existing workload as the situation demands. Migration is often associated with the high cost incurred in transferring the address space of an active process; however, fewer migrations may be required to balance the load. Policies that are restricted to initial placement have the advantage of a lower transfer cost, but once a process is assigned to a host it must remain there until it completes or is terminated. In depth analyses of the issues of initial placement and process migration are available in [ELZ86b, DHB95, KL88].
The load metric is important in a load distribution scheme, as it represents the workload being distributed. There are a wide variety of possible load metrics, a few examples are:
Participation, location and even candidate selections are based on the load, and it is therefore critical that the load metric is relevant. For example, Martin [Mar94] found that the number of processes on a host is a completely irrelevant load metric as it does not provide any information about the contention for resources on that host.
The V migration system discussed in Theimer et al. [TLC85] used a delightfully simple ranking metric based on the group communication facility of V. When a transfer is initiated, potential hosts are polled via a multicast request. All hosts which meet an availability criterion respond, and the first response is selected. The reasonable assumption is that the first host to respond is generally the least loaded. This `ranking' mechanism also neatly solves the problem of load communication.
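The effect of this implicit ranking can be sketched by simulating the race between responders, on the assumption that response latency grows with load:

```python
def first_responder(loads, available):
    """Sketch of V's implicit ranking: all available hosts respond to a
    multicast poll, and since response latency is assumed to grow with
    load, the first reply received is generally from the least loaded
    available host. 'available' is the availability criterion."""
    candidates = [h for h in loads if available(h)]
    if not candidates:
        return None
    # fastest reply ~ lowest load
    return min(candidates, key=lambda h: loads[h])
```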
If there is a reasonable proportion of idle hosts at any time in a system, then these hosts represent an allocable resource. This approach is central to many load distribution schemes that are based on the workstation model, as the `low power' of individual workstations, combined with a high degree of idleness makes anything other than a boolean idle-busy metric unnecessary.
The idleness criterion from Sprite [Dou90] is fairly typical. An idle host must have had less than one runnable process, averaged over the last minute, and no keyboard or mouse input for the past 30 seconds.
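This criterion reduces to a simple boolean test over two measurements, which can be sketched as:

```python
def is_idle(avg_runnable_last_minute, secs_since_input):
    """Sprite-style idleness test: fewer than one runnable process,
    averaged over the last minute, and no keyboard or mouse input for
    the past 30 seconds. The measurements are assumed supplied by the
    host's monitoring facility."""
    return avg_runnable_last_minute < 1.0 and secs_since_input >= 30
```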
Communication may also be a measure of load, and minimisation of this requires communicating entities to reside together on the same host. A migration policy developed by Lehrmann [Leh93] in the object based Emerald system, combined two load balancing strategies. The first transfers objects from overloaded to underloaded hosts, the second moves heavily communicating objects to the same host. The second policy clusters communicating objects, and thus through communication, certain objects express an affinity for each other.
A common approach is to consider the resources available on a host, and express them in a way that enables suitable hosts to be ranked.
Most systems simply consider the CPU resource, and neglect all others, while others use a more complex combination of resources, see Bond [Bon93]. Ferrari and Zhou [FZ87] explore variations on this form of load metric, with load metrics ranging from the instantaneous CPU queue length, through to a linear combination of CPU, memory and IO queue lengths. They found that, for their system and set of assumptions, the linear combination of exponentially smoothed CPU, IO and memory queue lengths produced the best performance. In a contrary finding, Kunz [Kun91a] found that single resource queue lengths were as good as a combined metric in his study with a stochastic learning automaton. These two results confirm that the load metric is important, but also indicate that the suitability of any load metric depends on the system, workload and load distribution scheme for which it is used.
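A combined metric of the kind Ferrari and Zhou found best can be sketched as follows; the weights and smoothing factor are illustrative, not values from [FZ87]:

```python
# Sketch of a linear combination of exponentially smoothed queue lengths
# (CPU, IO, memory), in the style of Ferrari and Zhou. The weights and
# the smoothing factor alpha are hypothetical.

def smooth(prev, sample, alpha=0.5):
    """Exponential smoothing of one queue-length series."""
    return alpha * sample + (1 - alpha) * prev

def combined_load(smoothed, weights=(1.0, 0.5, 0.5)):
    """smoothed: (cpu, io, mem) exponentially smoothed queue lengths."""
    return sum(w * q for w, q in zip(weights, smoothed))
```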
Once the load on a host has been measured, it must be communicated to whatever agents make load distribution decisions. This poses a difficult problem in most distributed systems, as there is a cost involved in both the collection and distribution of the load data. There are also problems of reliability, update delays and even locating the state, any of which can result in out of date information being used to make current distribution decisions.
A number of clever solutions have been proposed for load communication, most of which are variations on one of the following five methods.
Polling is a message directed at only one host to return its current load. If polling is performed on demand it results in the most up to date information, but at a potentially high cost. The PID algorithm from Livny and Melman [LM82], the PSI algorithm from Benmohammed-Mahieddine and Dew [BMD94], and the approach from Krueger and Shivaratri [KS94], all of which are discussed in section 2.3.2, use polling to obtain current host conditions.
A form of undirected communication, where all hosts exchange information by broadcasting over the network. There are a number of problems with this approach:
The primary advantage is that all hosts on the network have access to all the broadcast information, and this can be used not only to update recorded host loads on all hosts, but also to synchronise a distributed load distribution policy. A good example of this is the BID algorithm implemented by Livny and Melman [LM82], already covered in section 2.3.2.
This is a form of broadcast constrained to the members of a group. This reduces some problems associated with the broadcast method, as only members of the migration group receive the load information, but there still may be a lot of traffic. The primary advantage of the group communication system is that identification of participating hosts is simplified, making the location of suitable sources or destinations a much simpler problem. The group communication facility of the V system is used to considerable advantage in the process migration system of Theimer et al. [TLC85].
Four load communication alternatives were described by Zhou [Zho87] of which three used a central collection agent:
Barak and Shiloh [BS85] use a worm in a rather creative solution to the distribution of load information between hosts in the MOS system. The idea is that providing correct load information about all host loads is an unacceptable expense. In particular, there is a conflict between the availability of load information and the overhead required to obtain it. Thus Barak and Shiloh introduce the worm, which periodically sends half of its information vector L, of arbitrary length n, to a random host i.
Send : at each period,
Receive : on host i and merge into own load vector by:
where Li is the local load vector on host i and Lr is the received load vector. This merge ensures only the most recent information is held, by interleaving the local and received vectors. The order of a load in the load vector implies the age of the information, and dropping half the load vector when sharing the load vector is a form of aging. There is no mechanism to remove repeated hosts from the load vector during the merge, and this may reduce the pool of possible distribution partners. Repeated elements also provide conflicting information, and there is no way of determining absolutely which load is the more recent.
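The send and merge steps just described can be sketched as below; load vectors are represented as lists of (host, load) entries ordered from most to least recent, and the even vector length is an assumption of the sketch:

```python
# Sketch of the MOS worm exchange: each period a host sends the first
# (most recent) half of its load vector to a random host, and the
# receiver interleaves the received entries with its own, so that the
# oldest entries fall off the end of the fixed-length vector.

def send_half(local):
    """Send the most recent half of the local load vector."""
    return local[:len(local) // 2]

def merge(local, received):
    """Interleave own and received entries, most recent first, keeping
    the vector length fixed. Note: repeated hosts are not removed, as
    in the original scheme."""
    merged = []
    for own, recv in zip(local, received):
        merged.extend([own, recv])
    return merged[:len(local)]
```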