A process is a program in execution and as such consists of code, data and machine state such as the stack register. This model is more complex than is needed, and can be simplified by considering a process to be characterised by the resources it uses or visits during execution. The significant resources are:
Section 5.3 presented the three main options for providing a workload to drive the simulator, with the conclusion that a system trace offered the most suitable data. The most accessible form of a system workload trace is the accounting log. The accounting log also has the advantage that it is collected automatically by the system, and therefore does not cause any additional interference to the system being measured.
The accounting log provides the following summary of a process's activities in the system:
The simulation also requires the following process information that is not provided by the accounting system:
The only other caveat is that the reported values are statistically gathered by the accounting software, and are therefore not exact. This is of minimal concern, as the level of simulation is quite coarse. However, in addition to this inaccuracy, section 6.3.2 finds that the reported memory values do not reflect the savings made from code sharing; I will defer discussion of this point to section 6.3.2.
As the objective is not a complete reconstruction of previous events, but rather the representation of a realistic load, these small inaccuracies are not expected to be critical to the validity of the simulation. The lack of paging, file IO and user information is more serious, but these can be reconstructed by modelling the missing events from the information that is present in the accounting logs.
To provide a representative sample of system activity for the system trace, the accounting logs from five Sun IPX compute servers used by second and third year computer science undergraduates were recorded over a one month period. These five compute servers and their associated fifty X-terminals follow the X-terminal-server model, and it is therefore this model that is used in the simulation. All five compute servers are peers, with the students' home file systems divided evenly amongst them.
In addition to the accounting log information, the size of the program file was included in the trace.
The start and finish times from the accounting log have a resolution of one second. This results in multiple jobs with the same apparent starting time, and up to thirty `simultaneous' jobs occurring during some peak periods. This is an unrealistic situation, as in reality the jobs would be distributed within the interval. To correct this, the arrival times were modified by adding a two-digit random number representing hundredths of a second. This increases the resolution of the starting times, distributing the job bursts over the entire one second interval.
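A minimal sketch of this adjustment (the function and variable names are illustrative, not those of the trace-processing tools):

import random

def spread_arrivals(arrival_seconds):
    """Add a random two-digit number of hundredths of a second to each
    one-second arrival time, spreading apparent bursts of simultaneous
    jobs across the whole interval."""
    return [t + random.randint(0, 99) / 100.0 for t in arrival_seconds]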
To restate, processes are consumers of resources such as CPU time, and each compute server is a provider of a finite set of resources. The compute server is responsible for controlling access to its resources; however, the process determines which resources it wants to consume, and in what order. This means that the process may be in any one of the following states:
The resource states represent the process accessing or waiting to access a specific resource during its execution. The control states represent the states where the compute server dictates the actions of the process. The following sections are dedicated to explaining the entry into each of the resource states. The swapped control state will be discussed in section 6.3.3, the migrating control state is explained in section 7.3.2, and the finished control state is described in section 7.4.
The processes within a compute server move between the states defined in the last section. The actual flow of a process is determined by its probability Pr of visiting each state, which is calculated from the remaining resource requirements of the process.
Figure 6.1: A process's view of its compute server is as a set of resources which are visited until all of the process's resource requirements are satisfied.
The process must pass through the CPU state before any other state. In particular, a process may only initiate a page transfer, file IO or user event after first obtaining the CPU. A ready process enters the CPU and, at the end of each CPU chunk, the probability Pr of the process returning to the CPU or utilising some other system resource is computed in the decision node from the remaining resources required by the process p.
(see sections 6.2.6 and 6.2.8)
As a process consumes its resources, the relative probabilities will change to reflect the proportion of remaining resources. These probabilities must therefore be recomputed every time the decision node is entered, to ensure that the correct number of visits is paid to each of the resources, each visit preceded by a CPU visit.
Given the probabilistic flow of processes within the compute server, the rest of this section will detail the computation of the resource totals, from which the initial probabilities are computed.
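Before detailing the individual totals, the following sketch shows how the decision node might turn the remaining visit counts into probabilities. The names are illustrative, and the weighting of the CPU, one more visit than all remaining resource visits combined, is one plausible choice taken from the CPU-visit rule described below:

def visit_probabilities(remaining):
    """remaining maps each non-CPU resource ('page', 'file_io', 'user')
    to the number of visits the process still owes it.  Each probability
    is the resource's share of all remaining visits, recomputed on every
    entry to the decision node."""
    weights = dict(remaining)
    weights['cpu'] = sum(remaining.values()) + 1   # one extra CPU visit
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

Recomputing these values at every decision keeps the expected number of visits to each resource in step with what the process still has to consume.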
The total time a process spends on the CPU is the sum of the user and system times from the accounting log. A process terminates when its CPU time is fully satisfied, regardless of any other unconsumed resources. To minimise the possibility of resources remaining at the end of a process's execution, the time that a process spends on the CPU during each visit is adjusted to allow one more visit to the CPU than there are remaining resource visits. The length of the current CPU visit is computed every time the process obtains the CPU with:
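A plausible form of this expression, following the rule just stated and using illustrative symbols, is

\[ t_{cpu}(p) = \frac{T_{cpu}(p)}{V(p) + 1} \]

where \(T_{cpu}(p)\) is the CPU time still owed to process \(p\) and \(V(p)\) is the number of resource visits it has yet to make.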
The only file IO data available from the accounting log is the total number of bytes transferred. There is no data as to when the file IO accesses were performed, how many bytes were transferred during each access, or even if the access was to a local or remote fileserver. Modelling these is difficult as they all depend on either the user, the process or the compute server.
It is reasonable to assume that there is a minimum physical block that is the actual unit of transfer from the physical device to a memory buffer. Each compute server specifies a standard transfer size, currently an arbitrary two kilobytes. Therefore the total number of physical file IO transfers by a process p is:
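Assuming each transfer moves one standard two-kilobyte block and any remainder counts as a full transfer, a plausible form of this expression is

\[ n_{io}(p) = \left\lceil \frac{B_{io}(p)}{2048} \right\rceil \]

where \(B_{io}(p)\) is the total number of bytes transferred by \(p\), as recorded in the accounting log.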
In Unix, the paging system is used to load the code required for the execution of a program, and is also used, on occasion, for mapping files into the address space of a process. This is termed background paging, as it is required for the execution of the process rather than for the management of the compute server's memory, which is termed forced paging.
In extreme cases forced paging will degrade the performance of the system as a whole, but in practice, these situations are rare. However, forced paging still needs to be modelled in the simulator, as it represents memory resource contention, a potentially important factor in the evaluation of the process mix hypothesis.
Thus the paging activity in which a process is involved consists of two distinct components:
The pages requested by the process are logical pages, while the pages transferred from a device to memory by the system are physical pages. The number of logical, or indeed physical, pages transferred is not recorded by the accounting logs. The physical paging performed by a single process can be measured with the Unix command time; however, time only reports the physical pages transferred. The difference between the number of logical and physical pages transferred is due to:
These features of a real system obscure the logical paging that was carried out by individual processes, and modelling these features is difficult. To simplify matters, the simulation models logical paging, where all logical requests result in a physical page transfer.
Once a paging event has been triggered, it is serviced by the local file IO system as a normal file operation, see section 6.3.4.
To model background paging, it needs to be related to some known quantity. The most obvious choice is as a proportion of the program's binary file size, as the background paging reflects the loading of the program. However, as the binary file also contains infrequently executed code such as exception handlers, only a proportion of the binary file should be paged in.
To determine a reasonable value for this proportion, the paging from a small set of diverse test programs was measured on a single Sun IPX machine. All users were excluded from the machine for the duration of the measurements, to avoid the prospect of code sharing and page reclamations. Page reclamations between presentations of the test set were avoided by rebooting the server between the test runs. Due to the rather long cycle between experiments, only a small number of measurements were taken; they are summarised in table 6.1.
Table 6.1: Paging as a proportion of file size.
There is only a small variance between the runs of each of the programs, and the final average is larger than two standard deviations. These results indicate that a background paging rate of half the binary file size is a reasonable estimate, and this will be used in the simulation.
The page size on the Sun IPX is 4 kilobytes, and the same size is used in the simulator. Thus the number of pages brought in during background paging for a process p is:
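Combining the measured proportion of one half with the 4-kilobyte page size, a plausible form of this expression is

\[ n_{bg}(p) = \left\lceil \frac{0.5 \, S_{bin}(p)}{4096} \right\rceil \]

where \(S_{bin}(p)\) is the size in bytes of the program's binary file, which is recorded with the trace.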
The stochastic distribution of page references detailed in section 6.2.3 is somewhat contrary to the traditional single process paging model presented in Deitel [Dei84], which is shown in figure 6.2.
Figure 6.2: The percentage of a typical process's pages referenced from the time the process begins execution.
However, the stochastic model is useful as it eases the modelling of forced paging activity, and although it is inconsistent for a single process, the paging across all processes on a compute server averages out to produce a similar overall degree of paging activity.
A process may be required to do more than just background paging when there is a shortfall in available physical memory. Modern memory management systems are good, and there is usually little forced paging; however, there are extreme situations where forced paging does occur. To determine the magnitude of this degradation, the virtual memory system on a standard Sun IPX running SunOS 4.1.3 was subjected to intense overload, and its response was monitored with the Unix command vmstat.
The experiments were again run on a Sun IPX with no other users. Code segment sharing was prevented by giving each process its own copy of the binary.
The program selected to exercise the paging system was a modified version of lastcomm, extended to gather statistics. The program featured a high degree of file access from reading the logs, a large hash table used to store the statistics (giving the program a very large working set size), and a small binary file size of 35K.
The performance of the paging system was monitored with vmstat set to the fastest update rate of 5 seconds. Typical output of vmstat is shown in figure 6.3.
Figure 6.3: Sample output from vmstat, the interesting fields are fre (free list in kilobytes), pi and po (kilobytes paged in and out) and fr (kilobytes freed by the paging daemon).
As the amount of free memory is not under direct experimental control, the best that can be achieved is a relationship between the free memory and the amount of memory paged. In order to sample a reasonable cross section of paging system performance, experimental runs were taken with five, ten, fifteen, twenty and twenty five concurrent copies of the test program.
The paging in and out totals from each run were summed, normalised and averaged against the other runs. These measurements were repeated over three days and the results combined to give figure 6.4.
Figure 6.4: The paging rate as measured from an overloaded system with 5 and 10 concurrent test processes. The area under the bars sums to 1, with around 40% of low memory paging falling between 200 and 225 kB. The point of interest in this graph is the increase over the background rate as the amount of available memory decreases.
The SunOS 4.X paging system described by Cockcroft [Coc93] uses three values to control the paging rate in the system. Firstly, lotsfree is the threshold (256 kilobytes) below which the paging daemon is invoked; desfree is the desperation level, a point above which the paging daemon tries to maintain free memory (100 kilobytes); and lastly minfree is the minimum free memory (32 kilobytes) considered tolerable before involuntary swapping begins. The scanning rate of the paging daemon increases linearly below lotsfree, as shown in figure 6.5.
Figure 6.5: The parameters that control the page scanning rate in a SunOS 4.X system.
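A rough sketch of that behaviour is given below. Only the three thresholds and the linear rise below lotsfree come from the description above; the maximum rate and the assumption that the rate peaks at minfree are illustrative placeholders:

LOTSFREE = 256  # kilobytes: paging daemon invoked below this
DESFREE  = 100  # kilobytes: level the daemon tries to maintain
MINFREE  = 32   # kilobytes: involuntary swapping begins below this

def scan_rate(free_kb, max_rate=1.0):
    """Scanning rate of the paging daemon, rising linearly as free
    memory falls below lotsfree (max_rate is a placeholder)."""
    if free_kb >= LOTSFREE:
        return 0.0
    fraction = (LOTSFREE - free_kb) / float(LOTSFREE - MINFREE)
    return max_rate * min(fraction, 1.0)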
There are considerable limitations to the data provided by vmstat: firstly, fre is an average over five seconds, and secondly, the paging figures give only the kilobytes transferred in the last second, losing all but the most recent value. The averaging of the free list prevents direct demonstration of the SunOS paging mechanism, as any memory shortage will usually be resolved before it can have any noticeable effect on the average. The results obtained via vmstat are still useful in building a model, however: the proportion of paging is much the same even with missing paging activity, and as the simulator uses average memory figures, averaged free memory is sufficiently accurate.
The frequencies of forced paging in figure 6.4 agree with the action of the paging daemon described in the previous subsection. Essentially, even though the page scanning rate increases linearly, freeing more pages, the pages freed are more likely to be needed by the process from which they were reclaimed. This would indeed result in the non-linear growth observed.
The forced paging in a system can be modelled on a per process basis: the effect of a physical memory shortage is to increase the rate at which processes must page by altering the background rate. The background rate bg is therefore scaled by a penalty over certain ranges of free memory, as in table 6.2.
Table 6.2: Paging penalties as memory approaches the desperation threshold.
Free memory values below 200 kilobytes are not permitted in the simulator, representing the minfree threshold where swapping occurs.
Increasing the rate of paging will simply cause a process to do all of its paging at the start, as pages brought in are subtracted from the total number of pages required by the process. In this case, a notion of paging success is required, where a successful page transfer is one that does not result in the paging process losing another page before it next claims the CPU. The arbitrary values used are given in table 6.3.
Table 6.3: The probability of a successful page transfer decreases with memory availability.
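A sketch of how these two tables might be applied together follows. The numeric bands and values are placeholders standing in for tables 6.2 and 6.3, and the structure is an assumption rather than the simulator's actual code:

import random

# Placeholder bands keyed on free memory in kilobytes (the real values
# are in tables 6.2 and 6.3): a penalty multiplier on the background
# paging rate, and the probability that a page transfer is successful.
PAGING_PENALTY = [(256, 1.0), (225, 2.0), (212, 4.0), (200, 8.0)]
SUCCESS_PROB   = [(256, 1.0), (225, 0.9), (212, 0.7), (200, 0.5)]

def band_value(table, free_kb):
    """Return the value of the first band the free memory reaches;
    the simulator never lets free memory fall below 200 kilobytes."""
    for threshold, value in table:
        if free_kb >= threshold:
            return value
    return table[-1][1]

def paging_rate(background_rate, free_kb):
    """Background paging rate scaled by the forced-paging penalty."""
    return background_rate * band_value(PAGING_PENALTY, free_kb)

def attempt_page_transfer(pages_remaining, free_kb):
    """Only a successful transfer (one whose page is not reclaimed
    before the process next claims the CPU) reduces the remaining
    page count, preventing all paging from happening up front."""
    if random.random() < band_value(SUCCESS_PROB, free_kb):
        return pages_remaining - 1
    return pages_remaining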
When a process starts on a compute server, it consumes the average memory value recorded by the accounting log. This amount of memory in use does not change dynamically in relation to the paging activities of the process, as there is no direct link between the memory and paging models. It is unclear that integrating the two models would be beneficial, especially in light of the increase in complexity.
There is a class of processes delayed by factors other than the speed with which they can be provided with system resources. This class typically includes interactive processes, which must wait on user responses, and periodic processes, which are activated by timed events. I will call these processes externally delayed, as the source of the delay is independent of the state or power of the machine.
Externally delayed processes have been ignored in previous simulation studies, but they greatly affect the speed with which processes progress through the system. If the external delay is omitted from the simulation, the time spent by processes within the system is too short, decreasing memory residency and therefore contention for other resources as well. Counterintuitively, this does not produce the same effect as the infinite memory assumption, as longer-staying, externally delayed processes have a greater chance of causing contention with more processes than if they had completed their execution and departed.
Given that external delays are significant and require modelling, a model is required that copes with both periodic and user-based delays. However, there is no record in the system log of how much time was spent waiting on user responses or timers. From this point, for brevity, I will refer to the time spent waiting on all external delays as user time.
The obvious approach of subtracting an estimated time for performing the process's CPU, IO and paging from the elapsed time is unworkable due to extremely variable elapsed times. Code sharing, page reclamation, caching and the system load are all factors contributing to this variability, the net result of which is that many known interactive processes appear to have negative user times, a quite unrealistic situation.
A Better User Model
Broadly speaking, processes can be classified into one of two categories: the externally delayed processes identified earlier, and batch processes, which proceed through the system as quickly as resources allow. Thus if batch processes are party to the same degree of caching, queueing time, code sharing and page reclamation as externally delayed processes, then the only difference between the two types is the user time.
Thus, if a model can be constructed for batch processes relating known parameters from the trace such as CPU and IO to the elapsed time, then this model can be used to compute the estimated batch time for an externally delayed process. The difference between this estimated time and the actual elapsed time is the user time.
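In symbols (the notation is illustrative):

\[ t_{user}(p) = t_{elapsed}(p) - \hat{t}_{batch}(p) \]

where \(\hat{t}_{batch}(p)\) is the elapsed time the batch model predicts for the CPU and IO requirements of process \(p\).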
Figure 6.6: The distributions of df; the leftmost is a measure of how CPU bound the process is, the right, how IO bound. Note the bimodality, reflecting two different behaviours depending on the command line option.
The relationship f between IO-boundness and CPU-boundness is shown in figure 6.7.
Figure 6.7: Relationship f between IO-boundness and CPU-boundness.
Thus, the estimated batch time of a process is computed from the ratio of its CPU and IO times:
giving the mean of the CPU-boundness. To preserve some variation, the final value is picked randomly from a normal distribution with the mean set to this value and the standard deviation set empirically to one third of the mean.
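A sketch of this sampling step, assuming the batch model has already produced the mean (random.gauss is the standard-library normal generator; clamping at zero is an added assumption):

import random

def sample_estimate(mean):
    """Pick the final value from a normal distribution centred on the
    model's estimate, with the standard deviation set to one third of
    the mean; negative draws are clamped to zero (an assumption)."""
    return max(0.0, random.gauss(mean, mean / 3.0))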
Once the user time has been estimated, another model determines how the time is used over the lifetime of the job. The basic assumption is:
The simulator recognises two types of user delays: keystroke delays, and those due to think times. The keystroke delays were measured from two students typing quickly, giving a distribution with a mean . To model the longer think times and coffee breaks, the number of user events before the next event of any other type is computed from the probability (see section 6.2.3), generating a new mean value:
If the new mean value is the same as or less than the measured keystroke mean, a keystroke delay is randomly selected from the measured distribution; otherwise the new mean forms the median of a new distribution from which the longer user delay is taken. The user delays generated by the model are shown in figure 6.8.
Figure 6.8: The distribution of user delays generated by the user model. It is bimodal, the first mode at 110 milliseconds represents the pause between keystrokes in normal typing and the second mode at 650 milliseconds represents the longer thought pauses. The tail on this distribution is very long to include the possibility of coffee breaks, conversations etc.
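A rough sketch of how such a generator might be arranged is given below. The 110 millisecond keystroke mode comes from figure 6.8; the distribution shapes and the helper names are illustrative assumptions rather than the simulator's actual implementation:

import math
import random

KEYSTROKE_MEAN = 0.110   # seconds, from the first mode in figure 6.8

def user_delay(new_mean):
    """new_mean is the value generated from the number of consecutive
    user events (section 6.2.3).  Short means produce keystroke delays;
    longer means produce think-time delays whose median is new_mean."""
    if new_mean <= KEYSTROKE_MEAN:
        # stand-in for sampling the measured keystroke distribution
        return max(0.0, random.gauss(KEYSTROKE_MEAN, KEYSTROKE_MEAN / 3))
    # a long-tailed distribution whose median is new_mean; the choice of
    # a log-normal here is purely illustrative
    return random.lognormvariate(math.log(new_mean), 1.0)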