The three fundamental resources modelled in the compute server are CPU, memory and file IO. In addition to the provision of these resources, each compute server is also responsible for the sharing of them between competing processes.
The compute server model is designed to allow a considerable degree of resource heterogeneity in both availability and rate of service. A configuration file is used to specify the service rate of the CPU, the length of the quantum, the local and remote file IO service times, and the amount of physical memory. The cost of state collection can also be adjusted.
Processes flow individually from the
ready queue into the CPU, which then provides CPU time until the process
requires some other system resource, or has its quantum expire. Processes
moving to other system resources are returned to the end of the ready
queue after servicing, as seen in figure
Figure 6.9: The compute server model. Each process decides which resource
to visit, and the compute server shares the resources between competing
processes. The IO subsystem illustrated is simplified, showing only the
local system.
Each compute server maintains an independent microsecond clock, used to sequence events within each compute server and over the distributed system. The server clocks are synchronised by the arrival of new processes and the migration of processes.
The CPU is the core of the simulator, and all processes must pass though it before entering any other service. Thus the CPU controls the advancement of time within each compute server, and is in one of two states, idle or running. In the running state the CPU advances time to the next event, be it a service request, page fault or quantum expiry, and allocates the elapsed time towards the current process's CPU requirement. In the idle state there are no processes available to run, and in this situation the clock is advanced immediately to the next event, when either a new process enters the compute server, or a process returns to the ready queue from a resource.
The rate at which the CPU can service processes is specified relative to the service rate of the CPU on the machine from which the trace was recorded. Thus if a CPU service rate is specified as 2.0, the CPU will service processes twice as fast as the original machine. All experiments in this thesis were conducted with a CPU service rate of 1.0.
The CPU time of a process is divided into n+1 chunks, where n is the total number of visits that a process needs to make to the other resources (IO, paging and user). Each visit to the CPU consumes one CPU chunk, after which the process decides if it wishes to access another system resource or continue with the CPU. If the process reenters the CPU, the chunk size is recomputed to ensure that there is always a CPU visit before any other service. If the quantum expires, the process returns directly to the ready queue. For more detail, see section 6.2.4.
To provide a realistic allocation of processes to the CPU, a scheduling system is needed. There are a number of simple methods of single CPU scheduling that can be considered as candidates for the scheduler, including first come first served (FCFS), shortest job first (SJF), round robin (RR), and multi-level feedback queue scheduling. Round robin and the multi-level feedback queue are both preemptive systems and therefore preferable candidates for simulating modern interactive computer systems. The multi-level feedback queue is likely to add unnecessary complexity to the simulation, as the provision of fast response times for interactive users is not a priority. Thus processes are scheduled in a RR fashion for simplicity, and preempted on the expiry of an adjustable quantum (arbitrarily chosen as 50 milliseconds for all experimental work).
The memory system is often neglected in load distribution simulations with the assumption of infinite memory being common, as in Eager et al. [ELZ86b] and Harchol-Balter and Downey [HBD95]. This is acceptable, if the load distribution algorithms concentrate on the CPU as the basis for load distribution. However, as the resources of the system are vital to full test the process mix hypothesis, careful modelling of this primary system resource is required.
The memory information available from the trace is an estimate of the
average virtual memory used by a process over its lifetime. This is
rather limited as it gives no insight into the rate at which memory
was acquired by the process, or the extent to which it could be
released in times of memory drought. Figure
Figure 6.10: The amount of allocated memory claimed by the accounting
log for Cantina on the first of May 1995. The actual period is from 2 am
to 10 pm, with 600 minutes corresponding to 12 noon.
The system memory model must interact with the processes in an
intelligent way, yet remain simple. As there is no dynamic
memory data available, I will adopt a
static memory allocation model, where the amount of
memory recorded for a process is allocated to that process for
its entire lifetime. This immediately leads to the observation
from figure
An infinite swap space is a reasonable assumption, as none of the traces recorded a compute server being so overloaded that it could not start a process.
Physical memory is a finite resource just like CPU time, and therefore must be shared between competing processes. Ready processes and those waiting on the CPU or IO queues must be active, while processes waiting on user or external events, as described in section 6.2.8, may be inactive. Active processes require a presence in physical memory, while inactive processes may be swapped out.
When there is a physical memory shortfall, processes that are waiting on user or external events may be removed from memory and placed in the swap space. Also, if there is no memory available to start a new process then it must still be initiated, but placed in the swap space until there is room to begin execution.
The decision of when demand is sufficient to result in exile to the swap space is an arbitrary two tiered scheme and swapping is initiated when either:
The process will stay in the swap space until it has completed its delay period, and it then becomes eligible for returning to physical memory.
Swapping in is accomplished by placing eligible jobs onto a priority queue, ordered by the time of eligibility. When memory becomes available (another process finishes, or is swapped out), then processes that are small enough to fit are removed from the front of the queue. With this heuristic it is possible that large processes will accumulate at the front of the queue, and will suffer starvation in times of low memory availability. In practice this does not appear to be the case, and the response times of even very large processes are close to those experienced on the original system.
The memory swapper produces very realistic physical memory results,
as shown in figure
Figure 6.11: The physical memory allocated in the simulator over the
same period as figure 6.11.
The memory swapper is critical to the simulator if the assumption of infinite memory is to be avoided, and a realistic memory statistic is to be reported for the load distribution algorithms.
The Disk IO subsystem services both the logical file and paging IO requests from processes. Two different sets of assumptions are required for file and paging IO:
There are two types of file IO present in processes running on in a distributed system: the accesses to the compute servers local disk (local file IO), and those that access disks on remote compute servers (remote file IO).
There is no distinction between local and remote file IO in the accounting logs, and therefore some assumptions are required if there are to be realistic costs for local and remote execution:
This assumption mirrors any real system which has a network file system.
As there is no distinction made in the accounting logs between local and remote file IO, any decision must be arbitrary. The assumption that the host on which the job was intended to execute (prior to load distribution) holds all of the required files locally is both simple and reasonable. There are two main implications resulting from this assumption. Firstly, file access will always be less expensive when a process is run locally, and secondly, all accesses made by a single process are of the same type (i.e., all local or all remote).
Caching is a thorny problem when processes are redistributed across the system. For example, a migrated process will have to populate the cache on the new compute server, and therefore experiences a low hit rate. A fixed hit rate is misleading, and modelling anything else would add additional complication to the model with little expected benefit.
This last assumption is supported by the measurements made in section 6.3.4.
All paging access are to the local disk for processes executing locally and remotely, under the following assumption:
The only exception to this ,is when a process has been migrated, in which case all pages are fetched from the remote disk, see section 7 for more detail.
Local and remote disk IO are fundamentally independent when considering disk IO on a single machine, one is serviced by the local disk, and the other by the network. The immediate observation is that these are parallel events, with no interference between them. The situation is more complicated, when the distributed system as a whole is considered, as every remote disk IO access will result in a local disk IO operation on the compute server that physically holds the file or page.
The simulator models disk IO on each machine with two queues as shown
in figure
Figure 6.12: The Disk IO subsystem.
Both of the disk IO queues operate on a simple first in first out (FIFO) basis, with processes leaving the queue at a maximum rate determined by the service time of the device.
Table
Processes performing file and paging IO are treated in exactly the same way by the disk IO subsystem, except that, a file IO transfer is a 2 kilobyte byte block and a page is an 4 kilobyte block. The two different block sizes both have the same disk service time due to the seeking assumption.
To ensure that the disk IO service times are realistic and more
importantly that local and remote access times are in proportion, the
disk IO performance of a Sun IPX workstation was measured with the
Unix command iostat
, and the
results are summarised in table
Table 6.4: Maximum disk IO performance on a SUN IPX.
Table
The maximum of 82 accesses per second on local reads and writes from table
The test runs were conducted on a freshly booted system with no other users present. The measurements for writing are included only for completeness, as reads tend to dominate disk IO and hence it is the value for the read operation that is used a basis for the simulator service times.
There is little difference in the performance of the local read and write operations and these results support the file access assumption, made in section 6.3.4 for local accesses.
The results indicate that remote writing is more costly than remote reading, but there is the possibility of outside interference, as it was not possible to isolate the remote compute server from other users.
The disk IO test program exhibits high access locality, as the single test file is read sequentially. The SunOS 4.X file system attempts to arrange files in rotationally contiguous strips, thus minimising head latency and seeking. However, modern disks perform remapping of logical tracks and sectors to physical tracks and sectors, and such a scheme may no longer perform its desired function. Thus even though the file is logically contiguous it may not be physically contiguous, and a sequential file access may involve seeking and head latency. Thus this is a reasonable approximation to the situation where real processes have more than one file open, and are quite likely to make use of seeking inside them. Even if this were not the case, the important feature is the proportional cost of local and remote disk IO devices, since both local and remote systems experienced the same locality, the proportions are preserved.
Each compute server also needs to collect load information for both load distribution activities and for measuring the performance of the system.
The system load is collected on each compute server by periodically running a system load process that also updates the globally accessible state. The information collected by the system load process is:
The utilisations are taken over a period equal to half the
update period
. This ensures that the utilisation statistics are at
most,
old. All other values are purely instantaneous.
The system load process is treated in exactly the same way as normal processes, and incurs all normal queueing delays. The resource requirements of the system load process are specified in a compute server configuration file, but for these experiments no cost was attributed to either the collection or dissemination of the load information. This lack of cost does not invalidate any comparisons made between different load distribution techniques, as all of the algorithms share the same load metric and communication architecture.
One major difference between the system load process, and all other processes is that it may not be migrated. If it were, the source host would no longer make state updates, and the destination host would make multiple updates.