Multics Technical Bulletin MTB-563
DM: Ordering of Disk I/Os
To: Distribution
From: John J. Bongiovanni
Date: 12/03/81
Subject: Data Management: Ordering of Disk I/Os
1 INTRODUCTION
The Multics Data Management Architecture will provide for
integrity of data across system crashes, even in the case where
Emergency Shutdown (ESD) does not succeed. The mechanism for
this protection is a set of journals which record before and
after images of modified data (reference MTB-xxx for a full
discussion of this capability). In order for this mechanism to
function correctly, the sequence of I/Os to the various files
involved must be controlled, as the system could crash during an
update of a data base. In the initial implementation of the
Multics Data Management Architecture, all files will be
implemented as Multics segments, and hence the I/Os to these
files will be done by Page Control. Page Control at present
provides no general mechanism for sequencing I/Os; at best,
update protocols using the force_write capability could be used
to this effect (at some expense in efficiency). This paper
describes a design approach for explicit sequencing of I/Os by
Page Control. A brief description of the functionality desired
is presented first. Next, the design is outlined in sufficient
detail to allow an estimate of its efficiency. Some alternative
design approaches are discussed, followed by implementation
considerations.
Send comments on this MTB by one of the following means:
By Multics Mail, on MIT or System M:
Bongiovanni.Multics
By Telephone:
HVN 261-9314 or (617)-492-9314
_________________________________________________________________
Multics project internal working documentation. Not to be
reproduced or distributed outside the Multics project without the
consent of the author or the author's management.
CONTENTS
Page
1 Introduction . . . . . . . . . . . . i
2 Functionality . . . . . . . . . . . 1
3 Design . . . . . . . . . . . . . . . 2
3.1 Overview . . . . . . . . . . . 2
3.2 Call-Side Services . . . . . . 2
3.3 Claim . . . . . . . . . . . . . 3
3.4 Done . . . . . . . . . . . . . 3
3.5 Find_Core . . . . . . . . . . . 4
4 Alternative Design Approaches . . . 4
5 Implementation Considerations . . . 5
Multics Technical Bulletin MTB-563
DM: Ordering of Disk I/Os
2 FUNCTIONALITY
Without loss of generality, there are two files of interest
(each may be considered a Multics segment): a data base and a
Before Image Journal. Prior to modifying any data in the data
base, the previous values of the data must be recorded in the
Before Image Journal. To guarantee recovery across any type of
system or process failure, it is necessary (although not
sufficient) that no modification to the data base be written to
the data base file on disk until the pre-image of the
modification has been written successfully to the before image
journal on disk.
Since the files of interest are Multics segments, data in
the files is written as pages. The functionality to be provided
by Page Control to user-ring programs is to link any two pages to
which the user has effective write access, such that Page Control
will not write the second page to disk until the first page has
been written to disk successfully. With this capability, the
following sequence will guarantee that a modification to a data
base is not written to disk before the corresponding before
journal image:
o Build pre-image into in Before Image Journal
o Call Page Control to link data base pages to Before Image
Journal page
o Modify data base pages
Note that the writing of pages to disk remains non-deterministic,
in the sense that pages are still written to disk by page control
based on the demand for real memory pages in the system.
Although the ordering of some of these writes is controlled, the
writes occur at random times.
It is possible that before images for a given data base page
may reside in several Before Image Journal pages, and this case
must be handled correctly. The sequence of events is as follows:
o Build pre-image of data base page A into Before Image
Journal page 1
o Call Page Control to link data base page A to Before Image
Journal page 1
o Modify data base page A
o Build pre-image of data base page A into Before Image
Journal page 2
MTB-563 Multics Technical Bulletin
DM: Ordering of Disk I/Os
o Call Page Control to link data base page A to Before Image
Journal page 2
o Modify data base page A
The difficulty with this case arises from the
non-determinism of writing pages to disk. At the time of the
second call to Page Control above, Before Image Journal page 1
may not have been written to disk (in fact, the write I/O may not
have been queued). In this circumstance, Page Control must
ensure that data base page A is not written to disk until both
Before Image Journal pages (1 and 2) have been written. No
additional information is required by Page Control, since the
case is easily detected (at the second call to Page Control, data
base page A would be linked to a different page).
3 DESIGN
3.1 Overview
There is a Core Map Entry (CME) for each physical memory
page on a system. Linked lists of CME's may be formed such that
no page in a list (other than the head) will be written to disk
until the head of the list has been written to disk successfully
(i.e., without errors). Intuitively, the head of a linked list
represents a Before Image Journal page, and the remaining
elements of the linked list represent data base pages whose
pre-images are in the Before Image Journal page at the head of
the list. CME's are linked into list at the request of
user-ring. When a write I/O for a page at the head of a list
completes successfully, the associated list is unlinked. In
general, pages are written to disk by the normal Page Control
mechanisms, except that the writing of certain pages is skipped.
3.2 Call-Side Services
There is one call-side service provided--viz., the
capability to link page A to page B so that page A will not be
written to disk until page B has been written to disk. This
capability is implemented as follows:
o Validate that the process has write access to both page A
and page B at the current validation level. If this
validation fails, return an error code.
o Touch page A.
Multics Technical Bulletin MTB-563
DM: Ordering of Disk I/Os
o Lock the Page Table Lock. If page A is not in memory,
unlock the Page Table Lock, and repeat the previous step.
This is non-deterministic, but eventually the Page Table
Lock will be held with page A in memory. In practice, page
A will already be in memory when touched in ring-0 (the
previous step), due to the protocol described above
("Functionality").
o If page A is at the head of a linked list of CME's, unlock
the Page Table Lock, and return an error code.
o If page B is not in memory and modified, unlock the Page
Table Lock and return.
o If there is a write I/O in progress to page B, unlock the
Page Table Lock, wait for the I/O to complete, and repeat
the previous steps.
o If there is a write I/O in progress to page A, unlock the
Page Table Lock, wait for the I/O to complete, and repeat
the previous steps.
o If page A is a member of a linked list of CME's, initiate a
write I/O to the page at the head of that linked list (if an
I/O is not in progress already), unlock the Page Table Lock,
wait for the I/O to complete, and repeat the previous steps.
An I/O should not be initiated if the page at the head of
the linked list is page B (that is, there have been
successive calls to link a data base page to the same Before
Image Journal page).
o Link page A to page B as the second element of the linked
list whose head is page B.
o Unlock the Page Table Lock and return.
3.3 Claim
Claim is the service of Page Control which writes modified
pages to disk (it is called during page fault processing). Claim
must be changed to skip pages which belong to a linked list of
CME's and are not at the head of the list.
3.4 Done
Done is the service of Page Control which is called to post
an interrupt for a page I/O. Done must be changed to recognize
MTB-563 Multics Technical Bulletin
DM: Ordering of Disk I/Os
the completion of a write I/O to a page which is the head of a
linked list of CME's. When this happens, the list is unlinked.
3.5 Find_Core
Find_Core is the service of Page Control which selects a
page to evict from memory when a page fault occurs and a page
must be read from disk to satisfy the page fault. Find_Core must
be changed not to evict a page which is a member of a linked list
of CME's.
4 ALTERNATIVE DESIGN APPROACHES
If it is frequently the case that pre-images of a data base
page are recorded in several Before Image Journal pages, the
approach outlined above could cause unacceptable delays.
Consider the following example:
o A pre-image for data base page A is recorded in Before Image
Journal page 1.
o Page Control is called to link page A to page 1.
o A pre-image for data base page A is recorded in Before Image
Journal page 2.
o Page Control is called to link page A to page 2.
If, at this point, page 1 has not been written to disk, the write
I/O will be requested, and the process will wait for the
completion of the I/O. The delay will be approximately the queue
time plus the device time for the I/O, which is dependent on the
system load. This delay can be reduced by the following means,
some of which are modifications to the design presented above.
o When Page Control links page A to page B, it could initiate
the write I/O for page B at that point. Since there are
(presumably) many data base pages associated with a given
Before Image Journal page, this approach would generate
extraneous I/Os.
o When the Before Image Journal overflows a page, the journal
manager could force-write the page. This would reduce the
delay without generating extraneous I/Os.
o A linked list could represent a strict ordering of I/Os.
That is, if a linked list of CME's contained pages A, B, and
C; page B would not be written until page A had been
Multics Technical Bulletin MTB-563
DM: Ordering of Disk I/Os
written, and page C would not be written until page B had
been written. In the example above, the linked list would
consist first of (page 1, page A), and then of (page 1, page
2, page A). This approach would reduce I/O simultaneity,
probably reducing efficiency.
o A more complex linked list structure could be developed,
under which a given page could be the head of one list and a
member of another list. In the example above, there would
be two linked lists, one consisting of (page 1, page 2), and
the other consisting of (page 2, page A). The problem with
this approach is that it would require the CME to be
expanded to contain the additional link information.
5 IMPLEMENTATION CONSIDERATIONS
The necessity to wire the ring-0 stack before locking the
Page Table Lock could introduce substantial delays. An
alternative implementation is to switch to the PRDS before
locking the Page Table Lock. This would require a more complex
implementation, since none of the waits involved could be done
using the PRDS as a stack. The additional complexity may not be
significant, as none of the waits could be done with the Page
Table Lock held.
The sofware which manages the Before Image Journal must take
care to modify pages in the Before Image Journal in such a way
that a crash at any time leaves the page image on disk in a
consistent state. Since I/O is asynchronous to the processing of
this software, modifications to Before Image Journal pages could
occur while a previously requested write I/O is in progress.
Careful programming can eliminate adverse effects of this
asynchrony.