Multics Technical Bulletin MTB-635
Disk Volumes
To: Distribution
From: Benson I. Margulies
Date: 10/19/83
Subject: Management of Large Physical Disk Drives -- an Overview
1 ABSTRACT
This MTB describes a new design for the
management of physical and logical volumes,
intended to support larger devices than can
be supported today. This effort is necessary
to offer a better administrative interface on
3380 class devices, and to support the next
generation of disks after that at all.
Readers of this MTB should be familiar with
the existing disk DIM, physical volume
management, and logical volume management
subsystems, at least in broad outline.
Comments should be sent to the author:
via Multics Mail:
Margulies.Multics on either MIT Multics or System M.
via telephone:
(HVN) 261-9333, or
(617) 492-9333
or via the >udd>m>meetings>Disk_Support.forum (disks) forum
meeting on System M.
_________________________________________________________________
Multics project internal working documentation. Not to be
reproduced or distributed outside the Multics project without the
consent of the author or the author's management.
2 INTRODUCTION
There are two problem areas in storage system disk volume
management. The first is administrative. When storage system
volume management was designed for NSS (the new storage system),
disk drives were small. The goal was to relieve administrators
of the need to divide up their storage systems into pools of
quota which would fit on a DSU190. The solution was to construct
logical volumes out of multiple physical volumes.
Today, the situation is reversed. Typical disk drives are so big
that they cannot be assigned to a single functional pool of
quota. The administrative problem is further complicated by
FAMIS, which plans to offer a disk access method that does not
use storage system segments. To dedicate an entire 3380 device
to a database, or even a series of databases, is an unreasonable
restriction.
Further, all of the drives available when volume management was
designed had removable packs. The changes made to support the
5xx devices were the minimum needed to handle
non-removable packs. Now, when more and more sites have only
non-removable packs, the assumption of removability makes for an
unreasonably clumsy administrative and operational interface.
The second problem is one of implementation. Within the
supervisor, storage system records are addressed by record number
within the physical volume. There are currently 17 bits of
record number available in several crucial data structures. This
is not enough to represent all the records on the next generation
of disk drive after the 3380. A design is needed that not only
finds enough bits for this new generation of disk drives, but
does not require us to repeat the exercise every few years.
This MTB offers a description of the issues involved in
redesigning disk support, and an initial overview of a suitable
design. Once these general features of the design are agreed
upon, future MTB's can address the details.
Tom Oke (of the U. of Calgary) has made an extensive study of
the design of the disk DIM, proposing (and implementing) an
alternative strategy for queueing and seek optimization. The
redesign of disk support proposed here is the appropriate
framework for an implementation of Oke's design. His paper on
the disk DIM is attached to this MTB as an appendix.
3 ISSUES TO BE ADDRESSED
The preceding section listed the problems with disk support that
compel a new design. This section shows those issues in a little
more detail, and also describes some other, less urgent, problems
that can be addressed at the same time.
3.1 The Disk DIM
The disk DIM is the lowest layer of disk support. Its contract
is to take a Physical Volume index (PVTX) and a Multics record
number or sector number and do the necessary I/O to read or write
it. It is the only program with knowledge of the translation of
a PVTX to a device number and of a record number to a sector
number. It is responsible for seek optimization and error
recovery. The following subsections describe changes that should
be made in this area:
3.1.1 MORE SECTORS
The disk DIM's queue entries currently are completely packed, and
have 22 bits available to record a sector number. They should be
restructured to have enough sector bits to cover the next few
generations (doubling in capacity) of disk drives. Note that the
disk queue is only used by the disk DIM, so that an occasional
change in format to get more sector bits is not unreasonable.
3.1.2 BIGGER SECTORS
All Multics storage system disk I/O is currently done in 64 word
sectors. This has a number of problems. First, it inflates the
number of bits needed to describe the desired sector. Second, on
writes it requires the controller hardware to do a
read-alter-rewrite sequence. It is not clear that we can depend
on this feature being available in the indefinite future.
Note, though, that to use bigger sectors we will have to
reorganize the VTOC. Thus while support for 512 word sectors
should be added to the disk DIM, support for 64 word sectors may
not be removable just yet.
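The effect of sector size on address width can be checked with a little arithmetic. The sketch below is an illustration only: it assumes a 1024 word record and reuses the 17 bit record number mentioned in the introduction, and the function names are hypothetical rather than taken from the disk DIM.

```python
WORDS_PER_RECORD = 1024  # assumption: one storage system record is 1024 words


def sectors_per_record(words_per_sector: int) -> int:
    """Number of sectors that make up one record."""
    return WORDS_PER_RECORD // words_per_sector


def sector_bits(record_bits: int, words_per_sector: int) -> int:
    """Bits needed to address every sector of a volume whose records
    are addressed with record_bits bits."""
    total_sectors = (1 << record_bits) * sectors_per_record(words_per_sector)
    return (total_sectors - 1).bit_length()


if __name__ == "__main__":
    for size in (64, 512):
        print(size, "word sectors:", sector_bits(17, size), "bits")
```

Under these assumptions, addressing the same volume in 64 word sectors costs three more bits than addressing it in 512 word sectors.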
3.1.3 BETTER QUEUEING AND LOCKING STRATEGY
Tom Oke's paper describes this issue in complete and glorious
detail and is attached to this MTB.
3.1.4 DIFFERENT HARDWARE CONNECTABILITY
IBM 3380 subsystems will not support the current Multics concept
of a subsystem, since the paths to disk drives are differently
organized. To get full performance out of these disks we have to
make our configuration and seek optimization more complex.
3.1.5 ERROR RECOVERY
The existing error recovery strategy has three problems. First,
it does not have knowledge of controllers, adaptors and logical
channels. All it can do in a "bad path" error is delete the
channel, which may not solve the problem. Second, it can print
tremendous volumes of messages on the system console, which
brings the system to a standstill. Third, its handling of
offline disks is primitive. If a process detects a disk offline
while it has a crucial paged system lock locked, it will hold the
lock until the disk comes back online.
3.1.6 FAMIS SUPPORT
The disk DIM currently multiplexes disks amongst two access
methods, VTOC I/O and Paging I/O. A third is being added for
Bootload Multics, though it is effectively just Paging I/O with a
different posting mechanism. The first design decision to be
made is how to support further access methods. One possibility
is a general scheme resembling io_manager, in which each access
method would register itself, presenting an interrupt procedure
and receiving a handle. The other possibility is to add each
access method to the disk DIM individually. While the second
approach is less modular, it allows seek optimization and the
like to take the particular access method into consideration.
With either design, it is important that access methods be able
to make effective use of prior knowledge of their access
patterns, by reading ahead or pre-seeking.
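The first possibility, a registration scheme, might look roughly like the sketch below. It is not the io_manager interface; the class, the method names, and the shape of the interrupt procedure are all hypothetical and serve only to show a handle being issued and completions being routed back through it.

```python
from typing import Callable, Dict

# Hypothetical signature: an interrupt procedure receives a request
# identifier and a completion status.
InterruptProc = Callable[[int, str], None]


class AccessMethodRegistry:
    """Hands out handles to access methods and routes completions to them."""

    def __init__(self) -> None:
        self._procs: Dict[int, InterruptProc] = {}
        self._next_handle = 1

    def register(self, interrupt_proc: InterruptProc) -> int:
        """Record the access method's interrupt procedure, return its handle."""
        handle = self._next_handle
        self._next_handle += 1
        self._procs[handle] = interrupt_proc
        return handle

    def post_completion(self, handle: int, request_id: int, status: str) -> None:
        """Call the interrupt procedure of the access method owning the handle."""
        self._procs[handle](request_id, status)
```

In such a scheme the disk DIM knows nothing about an access method beyond its handle, which is the modularity gain weighed above against access-method-specific seek optimization.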
3.2 Page and Segment Control
In page and segment control, there are two issues that will be
addressed. First, any change in the VTOC format to use 512 word
sectors will be reflected here. Second, the problem with offline
disks described above must be addressed here as well as in the
disk DIM.
3.3 Disk Administration
3.3.1 DIVIDING UP STORAGE INTO SMALLER CHUNKS
A 3380, or even a 451, is too large a chunk of disk to be
assigned to an autonomous administrator in many circumstances.
It is necessary to allow sites to dedicate portions of a disk to
different uses.
3.3.2 THE ADMINISTRATIVE INTERFACE
The current "three ring circus" commands (xxx_volume_registration)
are clumsy, confusing, and only work in the initializer process.
We will design something far easier to use.
3.3.3 "SWEET SPOT" ALLOCATION
A study has shown that 10% of the disk storage at a typical site
accounts for 90% of the disk accesses. This suggests that it
would be worthwhile to allocate the per-process information (the
10%) in the middle of the disk by default, and the permanent
information on the outside. A more sophisticated strategy would
be to try to automatically migrate segments back and forth
between "high use" regions and "low use" regions.
Note, also, that the VTOC is a popular region, and would benefit
from this treatment.
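The intuition for placing busy data in the middle of the disk can be illustrated with a crude calculation. The sketch below assumes the arm is equally likely to start on any cylinder, which real access patterns will not satisfy, and uses an arbitrary 1000 cylinder device; the function name is hypothetical.

```python
def mean_seek_distance(n_cylinders: int, target: int) -> float:
    """Average arm travel, in cylinders, to reach `target` when the arm
    is equally likely to start on any cylinder."""
    return sum(abs(c - target) for c in range(n_cylinders)) / n_cylinders


if __name__ == "__main__":
    print("middle:", mean_seek_distance(1000, 500))
    print("edge:  ", mean_seek_distance(1000, 0))
```

Under this simple model a region in the middle of the disk is reached with about half the average arm travel of a region at the edge.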
3.3.4 ALTERNATE TRACK MANAGEMENT
This area affects the disk DIM and page and segment control as
well. Current support of alternate tracks is minimal at best.
It is excruciatingly difficult to clear all of the data off of a
track so that an alternate can be assigned. We make no effort to
automate the process of detecting a failing track, inhibiting
allocation of new pages on it, and moving existing pages
elsewhere. As disk drives get bigger and fewer, requiring sites
to do extensive tape saving in order to do routine maintenance
becomes less and less reasonable.
4 THE PROPOSED DESIGN
This section is a sketch of a proposed design, followed by a
discussion of resource requirements and phasing possibilities.
4.1 Disk I/O
The disk DIM will be reimplemented to use Oke's queueing strategy
and address the other issues described above.
4.2 Volume Organization and Management
There will be a new layer of organization in volume management.
Physical volumes will be divided into one or more "logical
regions", and logical volumes will be constructed from logical
regions, rather than physical volumes. Record addresses that are
currently interpreted as offsets within a physical volume will
become offsets within a logical region. A logical region may be
administratively assigned to some storage system logical volume,
or to some other disk access method, such as FAMIS. The change
will be largely transparent to page and segment control. They
will continue to work with (pvtx, vtocx, record number)
addresses, and the disk DIM will map the PVTX to the correct
physical device.
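The address translation implied here can be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: it assumes each PVTX names exactly one logical region, that a region is a contiguous run of records on one device, and all names (LogicalRegion, to_physical, the device name) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalRegion:
    device: str        # physical device holding the region (hypothetical name)
    first_record: int  # physical record number where the region starts
    n_records: int     # size of the region in records


def to_physical(pvt: dict, pvtx: int, record: int) -> tuple:
    """Translate (pvtx, record offset within region) to
    (device, physical record number)."""
    region = pvt[pvtx]
    if not 0 <= record < region.n_records:
        raise ValueError("record number outside logical region")
    return region.device, region.first_record + record


if __name__ == "__main__":
    pvt = {
        1: LogicalRegion("dska_01", 0, 1000),
        2: LogicalRegion("dska_01", 1000, 500),
    }
    print(to_physical(pvt, 2, 10))
```

Page and segment control would see only the PVTX and the offset; only the translation step knows that two regions share a device.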
Bad track information will be maintained, so that records for bad
tracks can be assigned to place-holder vtoces until an alternate
can be assigned. Track formatting algorithms will be coded in
Multics (rather than just in T&D) to allow automatic alternate
assignment.
4.3 Administrative Interface
Disk administration will be a subsystem accessible in any
administrator's process. The design goals of this subsystem are
to make it easy to specify the usual layout of a disk pack, while
offering the option of more complicated, specialized cases. In
particular, any limitation in the maximum size of a logical
region will be hidden by the administrative software. If an
administrator requests a logical region too large for the current
software, multiple logical regions will be defined transparently.
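The transparent splitting could be as simple as the sketch below. The function name is hypothetical, and the limit of 2**17 records used in the example is only an illustrative value taken from the 17 bit record number discussed in the introduction.

```python
def split_request(n_records: int, max_region_records: int) -> list:
    """Split a requested size into region sizes no larger than the limit."""
    if n_records <= 0 or max_region_records <= 0:
        raise ValueError("sizes must be positive")
    full, rest = divmod(n_records, max_region_records)
    sizes = [max_region_records] * full
    if rest:
        sizes.append(rest)
    return sizes


if __name__ == "__main__":
    # Illustrative limit only: 2**17 records per logical region.
    print(split_request(300000, 2 ** 17))
```

The administrator asks for one region of 300000 records; the software would define three regions behind the scenes.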
A video system application will be used to make the
administrative subsystem easy to use. A graphical representation
of the disk pack will be shown in one window while requests to
change the layout will be accepted in the other.
4.4 Volume Dumper
The volume dumper and reloader will be made knowledgeable of the
logical region strategy. Further, the volume dumper will start
dumping information in the pack other than VTOCE's and records,
like bad track lists, partitions, and the like.
4.5 Disk Format
This design requires changes to the disk layout. The obvious
change is to replace the partition map with a map of defined
logical regions. However, this more complex organization of a
pack will make it more vulnerable to damage to the label.
Therefore, the basic label will be recorded in several places on
the pack. This will reduce the chances of data loss. To avoid
another flag day like that for record stocks, the current disk
format will be supported for several releases after the new
format is released.
5 IMPLEMENTATION CONSIDERATIONS
This design is clearly more than we can implement in a single
release cycle. Even if the resources were available, the debug
and qualification effort involved in reimplementing all these
things at once would be tremendous.
The design divides up into a number of disjoint phases.
5.1 Disk DIM
Reimplementation of the disk DIM is a self-contained project.
Any changes to its interfaces can be trivially tracked in its
callers. However, the new features should be thoroughly tested
with test stubs rather than waiting for the following phases to
exercise them.
When this phase is done, the system will:
* have far better disk performance (see Oke's paper),
* be able to support the FAMIS access method (for an entire
disk),
* have better error recovery, and
* be able to support 3390 drives in an interim
compatibility mode in which Multics software splits each
3390 actuator into two "devices."
This project is a person-year, assuming that four to six months of
Tom Oke are available for the disk DIM proper.
5.2 New Volume Format I
In this phase, the new volume layout is used to transform the
current partition strategy into a set of logical regions. All of
hardcore logical volume management is converted to the logical
region design. The "three ring circus" is gutted. However, the
current administrative interface is preserved. Multiple logical
regions on a device are not yet supported. This is a one
person-year project.
5.3 New Volume Format II
In this final phase, the administrative interface is implemented.
Automatic creation of logical regions is implemented. Multiple
logical region support is announced. This is a 3 month project.
Note that these phases do not necessarily correspond to release
boundaries. In particular, the latter two phases may well be in
a single release. The time estimates here are generous, allowing
for a good deal of test exposure, qualification, and
documentation.
Disk System Modification MTB
Tom Oke
January 12, 1983
ABSTRACT
This MTB describes a three phase modification to the
existing Multics disk management system. The result of these
modifications will be a net decrease in the processor and system
overheads necessary to manage the disk system, and a net increase
in the throughput and responsiveness of the computer system as a
whole.
The total project is broken into three sub-phases. Each
sub-phase is necessary to supply groundwork and background upon
which to base the next phase, but each phase has its own goals
and benefits. It is possible to halt the project at completion
of any phase and have a functional system to that level of
support. Phase One removes fixed limitations from the disk
sub-systems, provides better utility with the same resources, and
permits full utilization of the channel capabilities of the
hardware. Phase Two reduces locking and queueing overheads,
permits on-cylinder optimizations, and better metering and
statistics. Phase Three introduces an efficient dynamic system
optimizer aimed at optimizing total system resources as they
apply to the storage system to achieve better system throughput
and responsiveness and to dynamically manage these resources
according to site defined optimization desires (effectively
stated as simple desire rules).
This MTB outlines the conceptual basis for these design
modifications, and the expected benefits of each phase of the
project. Each phase is outlined in terms of reason, cost,
benefit, and expended manpower. Much of the work necessary to
implement PHASES ONE and TWO has already been done on an MR7
level of the disk system software, but this would have to be
forward-fitted to the MR10.2 level in order to be useful. All
work necessary to implement the changes could be done on the
Calgary system and then moved to Phoenix for final integration
checkout.
PREMISE
The basic premise of these modifications is that two general
forms of storage access operations exist: blocked and un-blocked
IO. Blocking IO occurs when either a user process, or the
operating system must wait for the completion of physical
activity to be able to continue execution. Un-blocked IO occurs
with situations in which process actions are not dependent upon
the completion of IO operations, a normal occurrence for things
like page writes and VTOCE writes, which are buffered and not
directly requested by a process.
For example, a process becomes blocked when it makes
reference to a page which is not in main memory, and causes a
demand page read. In this case the process must cease execution
until the page becomes available. The operating system typically
will encounter blocking situations only when its paging system
processing resources are saturated and it must wait for the
completion of some physical activity before resources become
freed and the system can continue.
ALLOCATION LOCKS are a typical example of paging system
saturation. In this situation the system cannot do ANY paging
activity until the allocation lock situation is cleared and queue
resources become available, so at the point where there are no
processes left to execute which do not need pages, the system
becomes idle.
Un-blocked IO typically occurs if a queue of IO is output to
the storage system which is not a dependency of any process and
can continue without causing any process to block.
The danger lies in the transformation of un-blocked to
blocked IO when the queue resource becomes saturated, as occurs
with ALLOCATION LOCKS, INTERRUPT LOCKS, RUN LOCKS or the
attainment of the WRITE-LIMIT threshold. This is particularly
important in that it is possible to shut down operations of the
entire system, rather than individual processes.
In alleviating these situations the basic intent is that
un-blocked IO can be ignored, since it does not halt a process or
processor, until sufficient system resources become consumed that
the un-blocked IO may potentially be turned into blocking IO due
to saturation.
It is further seen that there are levels of blocking, for
example a VTOC demand read is a primary block, since it must
occur before a page demand read can possibly complete. A VTOC
write is of slightly lesser importance from the viewpoint of
blocking processes, but is important to the consistency of the
storage system in case of failure. A page demand read is more
important than a page demand write, since it causes a process to
become ineligible for execution until the page is in main memory.
A page demand read may well be more important for system response
than a VTOCE write, but this must be balanced with file system
consistency and the loading of the VTOC buffers.
Such optimizations have an effect, not only upon the storage
system, but upon the efficiency of the operating system itself.
If there is a high degree of blocking occurring, this must be
countered by a high level of multiprogramming in order to attain
sufficient executable processes to statistically fill the
available processor cycles. A high level of multiprogramming
directly translates into a higher level of system overhead in
process management, queue searching and scheduling, and process
switching.
To this end a system works most efficiently if it has the
minimal degree of blocking possible, and further the users see a
more responsive system when blocking is minimized.
PHASE ONE - Alleviation of Allocation Locks, Elimination of
Channel Limits
FEATURES:
Considerable reduction in ALLOCATION LOCKS over existing
system functionality, makes queue resources site tunable,
provides better utilization of the same amount of queue
resources, permits site declaration of a variable number of disk
channels per sub-system, permits declaration of a sub-system to
be only a declaration of physical connectability, and not skewed
to attempt to optimize allocation of queue or channel resources.
PROBLEM
Allocation locks are a problem which has consistently
plagued a loaded Multics system; this is aggravated by unequal
disk sub-system loadings. Many sites have sub-split disk
sub-systems in an attempt to alleviate this problem by
introducing more queuing resource, but this has led to channel
allocation and connection problems. In addition there is a fixed
limit of 8 logical channels to each disk sub-system. In current
large disk configurations this limits the level of seek overlap
which can be maintained and limits degraded configuration
capabilities by failing to permit full exploitation of one of the
nicer features of the HIS hardware.
CAUSE
The Multics disk system is split into one or more
sub-systems. Each sub-system is given a queuing resource which
is sufficient to hold 64 IO requests for the collection of drives
which may be attached to that sub-system, and is limited to 8
channels through which to access all the drives.
Due to the burst effect of page writes, as a large number of
writes are emitted in cleaning up modified pages, it is quite
common for a large number of requests to be queued to a single
drive, which may be sufficient to saturate the queue resource for
the entire sub-system. When this occurs, the IO for an entire
sub-system waits for the completion of IO for a single drive, a
high degree of bottlenecking. In addition there will probably be
a number of writes as yet un-emitted at the point of blockage,
which will simply extend the period of blockage, since these will
be emitted as soon as space in the queue is available for them.
In addition to being a queue resource, the sub-system also
describes the physical makeup and connection of the paths
(channels and drives) with which to utilize the storage system
resources. By being bastardized to allocate necessary queue and
channel resources according to the drive loadings the connection
problem is made more complex and connection and use is no longer
straight-forward, and in some connection cases it is made more
error prone.
CORRECTION
The first phase of the disk project is to reduce or
alleviate these problems, and further to make the queuing and
channel resources site tunable parameters to account for the
varying characteristics of individual site workloads.
The primary cause of allocation lock problems is the
allocation of free queue resources as a fixed function of a disk
sub-system. Since drive loadings are typically a function of
logical volume content and transitory system loadings, the
corresponding queue loadings are also typically transitory. The
allocation of a fixed resource to a dynamic load, and the
constraint of the size of that resource, takes its toll in system
operation. In most cases where ALLOCATION LOCKS occur only one
sub-system, and sometimes only one drive, is significantly
loaded; this can be seen as a poor utilization of resources.
The first phase of the disk project removes the 'free_q' in
each sub-system structure, and makes it a system wide resource,
with the number of queue elements being a CONFIG DECK tunable
parameter. Thus if a site requires a larger queue resource it
can be created at boot time with a CONFIG DECK parameter change.
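The benefit of pooling can be seen in a very small model. The sketch below is a static snapshot, not a simulation of the disk DIM: it takes the number of requests each sub-system wants queued at one instant and counts how many find no free queue element, first with a fixed pool per sub-system and then with one system-wide pool of the same total size. The function names and the demand figures are hypothetical; the pool size of 64 is the per-sub-system figure given above.

```python
def blocked_requests(demand: list, pool_sizes: list) -> int:
    """Requests that find no free queue element when each sub-system
    draws only from its own fixed pool."""
    return sum(max(0, d - p) for d, p in zip(demand, pool_sizes))


def blocked_requests_shared(demand: list, total_pool: int) -> int:
    """The same burst when all sub-systems draw from one system-wide pool."""
    return max(0, sum(demand) - total_pool)


if __name__ == "__main__":
    demand = [100, 10, 5]  # one sub-system takes a burst of page writes
    print(blocked_requests(demand, [64, 64, 64]))
    print(blocked_requests_shared(demand, 3 * 64))
```

With the same 192 queue elements in total, the burst on one sub-system overflows its private pool but fits comfortably in the shared one.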
By the removal of the queue resource limit as a function of
sub-systems one is now able to express the true connectability of
a sub-system in its definition without clouding the issue through
an attempt to alleviate a completely distinct problem.
Another problem exists in the fixed limit of eight disk
channels per disk sub-system. Since in the HIS hardware there
can be as many as eight disk channels per Physical Link Adaptor
(LA) this does not permit full exploitation of either full disk
seek overlap or the hardware limits without mis-declaring the
disk sub-systems as sub-sets of the true connectability.
The first phase of the disk project also makes this channel
table a configuration dependent table, which can grow to the size
necessary to handle all the channels configured logically to the
MPC's which connect a set of disk drives. Thus if more logical
channels can be configured according to the physical makeup of
the IOM/MPC combinations, then they can also be configured in the
software to the sub-system which defines the drives they connect.
This will be able to typically increase the level of seek
overlap, and hence the sub-system IO throughput rates, as well as
permitting over-commitment of channels to a sub-system (more
channels than drives) to handle degraded operation situations
without a corresponding degradation of service capabilities.
AFFECTED ROUTINES
The following list of routines may be incomplete; it is from
MR7 information and has not been updated as yet to the necessary
level of an MR10.2 baseline.
Routine                  Function
-------                  --------
dskdcl.incl.(alm pl1)    Defines the data structures involved.
get_io_segs.pl1          Initializes size of disk_data for
                         queue so segment can be wired.
disk_init.pl1            Generates and initializes queue
                         entries according to CONFIG DECK
                         parameters.
dctl.alm                 ALM disk driver, must manage queue
                         allocation and free.
disk_control.pl1         PL/I disk driver, must manage queue
                         allocation and free, and situations of
                         ESD.
EXPECTED EFFECT OF MODIFICATION
It is expected that the effect of the modification will be a
great reduction in ALLOCATION LOCKS even in the busiest system.
It will further permit tuning of the 'free_q' resource according
to the requirements of a site, and full declaration of the
hardware channel capabilities of the hardware.
Since the 'free_q' resource is no longer tied to a single
sub-system there will be much better utility obtained per queue
element allocated, and one will not have to sub-split disk
sub-systems to attempt to allocate more queue resource. Since
one is no longer limited in the declaration of channels to a
sub-system, there will no longer be a need to sub-split
sub-systems to permit sufficient channels per drive. Thus a
sub-system will be a true declaration of the connectability of
disk drives.
EXPECTED COST OF MODIFICATION
It is expected that this modification will take roughly a
man-week from its present state to be ready for testing. A
further week should be allocated for testing and production of
statistics either confirming or denying the above expectations,
and quantifying the actual results.
REQUIRED TESTING
To be valid, testing should include ALL expected, and
emergency, disk situations. A system should be load tested both
in the current state, and after modification to determine
bottleneck points.
Further testing should include shutdown situations, of
normal shutdown, ESD and shutdown situations in which an MPC or
drive failure is induced to test robustness of the
modifications. Testing should also include salvaging situations
on system startup and should include salvage of both the root and
public drives.
PHASE TWO - Modification of queuing and locking
FEATURES:
Reduction of queue scanning overheads, reduction of lock
contention and delays, better sub-system metering and statistics,
on-cylinder seek optimizations under all situations.
PROBLEM
The current method of locking and the grouping of all
requests into two common queues artificially constrains access
and increases overheads in management of disk sub-systems.
CAUSE
The current method of queuing and locking of sub-systems
uses a pair of queues per disk sub-system, with a common lock
controlling access to the entire sub-system.
This has the effect of creating an artificial bottleneck
with the constrained access through the lock to all functions to
be performed on the sub-system. This lock is necessitated by the
use of two queues, common to all drives of the sub-system, a high
priority queue and a low priority queue.
CORRECTION
There are two situations to correct, but one is dependent
upon the other. The correction in this case is to make a
separate queue for each drive, rather than a set of queues for
the entire sub-system. This reduces the number of requests which
get scanned to determine a nearest-seek candidate for IO on a
single drive. Further, by eliminating the complete separation
between the high priority IO queue and the low priority IO queue
within the sub-system, and combining them into a single queue per
drive, we will be able to optimize situations of on-cylinder
mixes of high and low priority requests.
One method of retaining the logical separation between high
priority requests (which should be done first if they require a
head seek) and low priority requests (which are non-blocked) is
through the use of a multiplier to make low priority requests
look much longer than high priority seek requests. A normal
nearest-seek physical seek-length is calculated, and transformed
into a logical seek length by multiplying it by the separation
factor for that type of IO.
Another method is through the use of a seek offset, to
increase the logical length of a seek to a low priority IO
request.
Using a multiplier of the same value as the number of
cylinders on the spindle will give complete separation between
high and low priority seeks, while retaining on-cylinder
optimization. Using an offset of the same value as the number of
cylinders on the spindle will give exactly the same effect as
having completely separate queues for high and low priority IO
and will lose on-cylinder optimization.
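The difference between the two methods can be made concrete with a short sketch. It is an illustration of the arithmetic described above, not code from the disk DIM: requests are modelled as (cylinder, priority) pairs, and the function names, the mode strings, and the 800 cylinder figure in the example are all hypothetical.

```python
HIGH, LOW = "high", "low"


def logical_seek(physical: int, priority: str, n_cylinders: int, mode: str) -> int:
    """Logical seek length for one request.

    mode "multiplier": low priority lengths are multiplied by n_cylinders.
    mode "offset":     low priority lengths have n_cylinders added.
    High priority requests keep their physical seek length in both modes.
    """
    if priority == HIGH:
        return physical
    if mode == "multiplier":
        return physical * n_cylinders
    if mode == "offset":
        return physical + n_cylinders
    raise ValueError("unknown mode")


def pick_next(arm: int, queue: list, n_cylinders: int, mode: str) -> tuple:
    """Return the (cylinder, priority) request with the shortest logical seek."""
    return min(
        queue,
        key=lambda req: logical_seek(abs(req[0] - arm), req[1], n_cylinders, mode),
    )


if __name__ == "__main__":
    queue = [(100, LOW), (300, HIGH)]
    print(pick_next(100, queue, 800, "multiplier"))  # on-cylinder low wins
    print(pick_next(100, queue, 800, "offset"))      # high priority wins
```

With the arm on cylinder 100, the multiplier leaves the on-cylinder low priority request at logical length zero, so it is serviced before the arm moves. The offset pushes every low priority request beyond the longest possible high priority seek, so the on-cylinder request waits.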
Once the common queue has been split into queues per drive,
then the locking strategy becomes simpler. Each sub-system will
then consist of three resources:
1. The drives of the sub-system. These contain the actual
requests which need to be done and are a description of the
requests for the drive complete in themselves. They are not
dependent upon acquiring any other sub-system resource.
2. The channels of the sub-system. These contain access to the
physical path necessary to actually effect the IO. These are
a simple blocking mechanism. Once a free_q element has been
acquired, and a drive has been acquired, then a channel is
requested. If it is not available the IO is simply queued
(as it is if the drive is doing IO). The operating system
does not block due to unavailability of a channel.
3. The metering base per sub-system. This consists of two
parts:
a. The immutably updateable meters for the sub-system. This
will be meters updated with simple immutable instructions
(i.e. 'aos') which implicitly lock through the
processor/memory access.
b. Meters requiring locked instruction sequences, such
things as timers and error counts which cannot be updated
with immutable instructions, and situations which require
a sub-system wide lock. Typically this lock access will
be straight-line and will not control physical resources
which could introduce realtime delays.
In situations requiring the sub-system-wide resource, such
as channels and drives, the locking procedure would be to lock
the sub-system lock, then the individual drive locks. In this
manner one would not require any individual wait on the normal
sub-system lock.
AFFECTED ROUTINES
The following routines will be affected by this level, in
addition to all the routines already affected by PHASE ONE, as
listed above.
Routine                         Function
-------                         --------
disk_queue.pl1                  A metering routine to check
                                disk queue loading and channel use.
disk_meters.pl1                 A metering routine to provide
                                sub-system use statistics. It will
                                need updating for the metering
                                structure changes.
device_meters.pl1               As for disk_meters.
spg_fs_info_.pl1                As for disk_meters.
ioi_assign_disk_channels.pl1    Changes for channel locking etc.
EXPECTED EFFECT OF MODIFICATIONS
These split-outs will decrease the throughput demands upon
the individual locks and will reduce situations which require
realtime based lock delays. The net effect of the combined
modifications will be faster locking, reduction of locking
overheads and queue management overheads, and the introduction of
on-cylinder IO optimization. Further, moving some of the meters
which are really drive specific, rather than sub-system specific,
into the drive structure will lessen sub-system lock
requirements, and introduce new statistical collection situations
which will provide better metering and meterability of disk
sub-systems and drive bottleneck situations.
EXPECTED COST OF MODIFICATIONS
To do a thorough job of this portion of the modifications
there should be some consultation with the current CISL developers
and a definition of possible interactions of the modifications
with future system development and planning.
This would take in the range of one to two man-weeks to
complete the modifications of this stage and to verify none of
the error recovery/reporting functionality has been lost.
Testing should cover the same basic range as for PHASE ONE,
attempting to validate the modifications in all possible and
impossible situations. In addition there should be again a
thorough collection of statistics, this time basing against a
normal system, the PHASE ONE modified system, and the PHASE TWO
system. This will provide a metering base to track the success of
the design and design criteria. It will also provide information
to be finally placed into site tuning documentation.
PHASE THREE - Adaptive optimization of IO
PROBLEM
The optimization of disk IO keeps a complete separation of
high and low priority requests. This produces bottlenecking of
IO optimization, which has in the past produced ALLOCATION LOCK
problems. Even with the PHASE ONE and TWO modifications, major burst
IO will remain unoptimized in situations requiring throughput,
because the optimization always favors best demand IO
response.
While favoring best IO demand response is a desirable
characteristic in disk system management, the blindness with which
it is followed sometimes produces suboptimal system response to
storage system demands, and requires higher multiprogramming to
attain full system efficiency. Further, it does not necessarily
produce maximum user/system responsiveness, since it ties up
large amounts of memory to hold buffers for the poorly optimized
IO.
CAUSE
The current method of disk optimization is invariant to the
changing demands of the system. As a result it is an attempt at
a fixed general solution to a rapidly changing dynamic situation.
VTOCE IO is highly optimized since it is important, but
there is no differentiation of its importance from the viewpoint
of the computer system, rather than the disk system. Thus the
demand read characteristic of VTOCE read is no better optimized
than the lower priority VTOCE write. (In this case the
priorities are on a blocking basis, rather than the requirement
for a consistent IO system.)
The current complete separation between page read and write
will be somewhat optimized by PHASE ONE and TWO changes, but
there will still be no method to improve the IO throughput
optimization of disk page writes as the IO system starts to
saturate with them. Thus ALLOCATION LOCKS are still possible.
If they are avoided by simply increasing the size of the queue,
and WRITE_LIMIT, the system will degrade by the removal of a high
degree of available pages. In other words the inability to
respond to the changing requirements of a storage system as the
inherent priority situations change will increase system
overheads and delays beyond what is necessary.
CORRECTION
The final stage of the disk system modifications is termed
ADAPTIVE DISK optimization. This optimization depends upon a
site setting tuning parameters which define the site's view of
the importance of two situations on an IO type by IO type basis.
The two situations are:
1. Maximum response. This is the degree of optimization to
give to IO requests of this type in situations where
there is no IO of this type queued up. It
essentially defines the importance of doing this IO
with respect to any other IO without regard to queue
loadings.
2. Maximum throughput. This is the number of IO requests of
this type which can be allowed to queue up to which
the system should respond with the maximum possible
throughput optimization. It essentially defines a
limit of resource allocation at which the system must
protect itself by attempting to speed the throughput
of IO requests of this type to clear the queue.
Though these two values are simple points, they are taken as
the definition of a straight line which determines the
optimization to be afforded IO requests of each IO type at any
point within the x-y space of optimization and queue loading.
NOTE that these optimizations are per IO type and IO types do not
necessarily have a relationship to each other.
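One possible reading of this straight line is sketched below in
Python. The MTB does not give the formula, so the details are
assumptions: the line is taken to run from the point (loading of
1, initial multiplier) to the point (maximum throughput loading,
multiplier of 1), and the value is held constant outside those
two points. The name seek_multiplier and its parameter names are
hypothetical. A smaller multiplier means a greater degree of
optimization.

```python
def seek_multiplier(initial, full_load, load):
    """Seek-length multiplier for one IO type at a given queue loading.

    initial   -- multiplier applied when only one request is queued
    full_load -- queue loading at which the multiplier reaches 1
    load      -- number of requests of this type currently queued
    """
    if load >= full_load:
        return 1.0              # at or past the throughput point
    if load <= 1:
        return float(initial)   # at the response point
    # Straight line between (1, initial) and (full_load, 1).
    fraction = (load - 1) / (full_load - 1)
    return initial + (1.0 - initial) * fraction
```

For instance, with an initial multiplier of 800 and a maximum
throughput loading of 3, the sketch yields 800 for the first
queued request, 400.5 for two queued requests, and 1 for three
or more.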
The optimization value assigned is a multiplier and is the
inverse of the degree of optimization. Where PHASE TWO separated
the page read and page write IO's by weighting the physical seek
length by a multiplier to make it a logical seek length, this
modification simply makes this weighting factor a function of the
queue loading and the desired initial optimization. The use of a
logical seek length permits the existing nearest-seek-first
algorithm to produce a true optimization according to the desired
criteria.
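The use of a logical seek length in a nearest-seek-first choice
might be sketched as follows. This is an illustration in Python,
not the code of the disk DIM. The function name and the
representation of a request as a (cylinder, multiplier) pair are
hypothetical, and the multiplier of each request is assumed to
have been determined already from its IO type and queue loading.

```python
def nearest_logical_seek(current_cylinder, requests):
    """Return the queued request with the smallest logical seek length.

    requests is a non-empty list of (cylinder, multiplier) pairs.
    The logical seek length is the physical seek length in
    cylinders, weighted by the multiplier of the request.
    """
    def logical_length(request):
        cylinder, multiplier = request
        return abs(cylinder - current_cylinder) * multiplier

    return min(requests, key=logical_length)


# The arm is at cylinder 100. The request at cylinder 110 is
# physically nearer, but its multiplier of 8 gives it a logical
# length of 80, so the request at cylinder 150 (logical length
# 50) is chosen.
queue = [(110, 8.0), (150, 1.0)]
chosen = nearest_logical_seek(100, queue)
```

Note that an on-cylinder request has a logical seek length of
zero whatever its multiplier, so on-cylinder IO is always
favored under this sketch.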
As queue loadings increase, the desirability to the system
as a whole of increasing the throughput of a certain IO type
increases. (This desirability is evaluated by the site and set
as a tuning parameter.) The specific situation of loading and
priority is represented as a point along the defined line in the
loading/optimization plane. The relationship of this IO type to
other IO types is defined by the produced optimization value for
this type in relation to the produced optimization values for the
other IO types. So the situation is widely dynamic and operates
beyond the bounds of the two dimensions input as tuning
parameters.
By indicating different optimizations and loadings for each
IO type it is possible to have their optimizations cross each
other at different points during normal system operation. For
example:
A site sets tuning parameters such that VTOC read is always
maximally optimized by indicating optimization = 1, loading =
1.
VTOC write is seen as a clearing operation which will not
block the system, but which should not be left too long due to
the constrained resource of VTOC buffers and storage system
consistency. So parameters are set to have an optimization of
the number of cylinders of a drive for the first VTOC write,
but to fully optimize if 3 VTOC writes are outstanding. This
gives complete separation for non-on-cylinder VTOC operations.
A page read is seen as a high priority operation, since it
blocks, but as less demanding than a VTOC read, which is
necessary to unlock access to a number of potential page
reads. So the site sets initial optimization at 1/4 of a
drive's cylinders (200) but requires full optimization if more
than 1/2 of 'maxe' processes are waiting for pages to optimize
multi-programming.
A page write is seen as the lowest priority of all, but it
will cause blocking if too many are queued up. So the site
sets initial optimization as the number of cylinders of a
drive, but requires full optimization at 1/2 'free-q'
allocation.
As can be seen a number of factors have been considered and
are in effect. The instantaneous optimization of the system
will take into account all the above situations dynamically. For
example, VTOC reads will fly through, but if we get up to 3 VTOC
writes per drive they will get fully optimized too. Page reads
will get nearly maximal throughput, and will fully optimize if
too many processes get bottlenecked on any particular drive.
But if we get up to 3 VTOC writes outstanding they will surpass
page reads in optimization till the demand slacks off. Finally
page write will be allowed to queue up to a high degree, but not
high enough to start to block system operation.
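The crossing of optimizations in the example above can be
illustrated with a short Python sketch. Several values are
assumptions and not part of the proposal: the drive is taken to
have 800 cylinders (implied by 1/4 of the cylinders being 200),
'maxe' is taken as 20, and the 'free_q' is taken as 64 elements.
The straight line is assumed to run from a loading of 1 to the
full-optimization loading. All names are hypothetical.

```python
CYLINDERS = 800   # implied by "1/4 of a drive's cylinders (200)"
MAXE = 20         # assumed value of 'maxe'
FREE_Q = 64       # assumed size of the 'free_q'

# IO type -> (initial multiplier, loading at which multiplier is 1)
TUNING = {
    "vtoc_read": (1, 1),
    "vtoc_write": (CYLINDERS, 3),
    "page_read": (CYLINDERS // 4, MAXE // 2),
    "page_write": (CYLINDERS, FREE_Q // 2),
}


def multiplier(io_type, load):
    """Seek-length multiplier for io_type at the given queue loading."""
    initial, full_load = TUNING[io_type]
    if load >= full_load:
        return 1.0
    if load <= 1:
        return float(initial)
    return initial + (1.0 - initial) * (load - 1) / (full_load - 1)


def ranking(loads):
    """IO types ordered from most to least favored (smallest multiplier first)."""
    return sorted(loads, key=lambda io_type: multiplier(io_type, loads[io_type]))


# One VTOC write outstanding: page reads are favored over it.
light = ranking({"vtoc_write": 1, "page_read": 2, "page_write": 4})

# Three VTOC writes outstanding: they now surpass page reads.
burst = ranking({"vtoc_write": 3, "page_read": 2, "page_write": 4})
```

Under these assumptions a single outstanding VTOC write ranks
behind page reads, while three outstanding VTOC writes rank
ahead of them, as the example describes.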
What is perhaps not totally obvious in addition is the
effect of grouping which will occur through this optimization
technique. For example the optimization of any IO type not only
depends upon the optimization factor applied, but also the
nearness of the true physical position of an IO seek of its type,
in relation to the nearness of the true physical position of an
IO seek of another type. Thus we may hold off doing writes for a
while until they build up, but when we start to do them the
statistics are fairly good that we will be able to do a high
degree of local seek length optimizations through the buildup of
candidates within that area. When the span between areas, in
relation to the current queue loadings, reaches a dynamic
separation point, we will return to doing optimization of the
higher priority IO and will probably be able to do group
optimization of them too.
So the optimizations afforded by the above method go well
beyond the simple possibilities of a non-dynamic method, and in
fact out-reach the imaginations of those entering the parameters.
It is a means to put extra intelligence into the managing of a
computer system as a whole, and not just the storage system, but
an intelligence which follows exactly the dictates given to it,
though the final effect may well surpass the generality that was
presumed for it. In other words, it will do what you want, even
in situations you might not have accounted for, and which you do
not have to account for.
HISTORY
Some history of these proposals is appropriate. About three
years ago they were first conceived, though in a rougher form.
Over the succeeding three years they have been put into effect to
a slightly limited extent on a UNIX system owned by the
Department of Computer Science, running on a VAX 11/780. On this
system, which had a different queuing method without the locking
and 'free_q' problems of MULTICS, only the adaptive optimization
technique, and a correctly functioning 'nearest-seek-first'
algorithm needed to be created, and this was done according to a
design document similar to this which was supplied to the systems
programmers of the UNIX system.
To this point the adaptive optimization has performed
without flaw, and appears to be quite robust, with a high degree
of tolerance to a wide range of tuning parameters. The UNIX
system has also benefited from the extra statistics and meters
which the modifications made possible.
To date there is no one thing which can be pointed to with
flag waving; there are no spectacular situations in which the
optimization really becomes apparent. However, they have noted
that it is much more difficult, while running 'emacs' to
determine that the system is loaded, and for the first several
months of existence of the optimization the ability of the
systems programmers to sense the loading of the system by their
old performance measures always produced much lower loading
levels than were actually the case when meters were consulted.
Through rough testing with thrashing programs it is easily
possible to bring the disk drives to individual busy levels of
80-92% without significant queue buildup, and in most cases
system responsiveness is maintained much better than without the
optimization.
A significant queue buildup of writes is very infrequently
noticed, but some situations have occurred where a
queue buildup of 150 elements was maintained for a prolonged
period, with a reportedly good system response.
As a result it is quite desirable to be able to produce
better measures of success and tuning than have been available.
Certainly we should progress beyond the seat-of-the-pants feeling
and get quantitative measures. Indications to this point are
that the optimizations should produce a better system for total
system throughput than can be achieved by previous methods,
including disk combing, but no hard numbers stand to attest to this.
Though the above sections appear to enter into the world of
science fiction/fantasy and intelligent machines, this is not
really the case. It is merely a situation where the statement of
the rules provided by the system is interpreted to be able to
provide a simulacrum of thought in the optimization of the
system. The driver does not originate anything, it simply
follows the rules provided. The fact that the rules are in some
sense a valid mixture of different criteria (apples and oranges?)
provides much of the groundwork to enable the system to work. In
essence the tuner is not stating 'do this at this time', but
instead is laying down conditions which must be fulfilled by the
driver, and is able to state these conditions in terms of disk
seek priority and queue loadings.