Multics Technical Bulletin MTB-560
DM: Before Journal Design
To: Distribution
From: Andre Bensoussan
Date: 06/20/83
Subject: Data Management: Before Journal Manager Design
ABSTRACT
The services provided by the Before Journal Manager have been
specified in MTB 559, in terms of a set of primitives callable by
the rest of the system, with a detailed description of what they
do and how to call them.
This document describes in some detail how these primitives do
their job and how the data structures they use are organized.
The reader is supposed to be familiar with the following MTB's:
- MTB-509: Recovery overview
- MTB-511: File manager overview
- MTB-559: Before journal manager specifications.
Comments should be sent to the author:
via Multics Mail:
Bensoussan.Multics on System M.
via US Mail:
André Bensoussan
Honeywell Information Systems, inc.
575 Tech Square
Cambridge, Massachusetts 02139
via telephone:
(HVN) 261-9334, or
(617) 492-9334
_________________________________________________________________
Multics project internal working documentation. Not to be
reproduced or distributed outside the Multics project.
CONTENTS
Page
Abstract . . . . . . . . . . . . . . . i
1 Introduction . . . . . . . . . . . . 1
2 Before Journal Logical Records . . . 1
2.1 Logical Record Types . . . . . 1
2.2 Logical Record Contents and
Formats . . . . . . . . . . . . . 3
2.3 Summary of the Logical Records
Formats . . . . . . . . . . . . . 9
3 Before Journal File Characteristics 11
4 Before Journal File Format . . . . . 12
4.1 The before journal header in CI
zero . . . . . . . . . . . . . . . 12
4.2 The Control Interval Structure 13
4.2.1 The file manager ci_header
structure. . . . . . . . . . . 13
4.2.2 The before journal bj_ci
structure. . . . . . . . . . . 14
4.3 Elements . . . . . . . . . . . 16
4.4 Logical Record Identifiers . . 17
5 Logical Records Storage and
Retrieval Operations . . . . . . . . 18
6 Logical vs Physical Journalization . 19
6.1 Logical Journalization . . . . 19
6.2 Physical Journalization . . . . 20
6.3 File manager services . . . . . 21
6.4 Page Control Services . . . . . 22
7 Tables used by the Before Journal
Manager . . . . . . . . . . . . . . . 23
7.1 Per Process Information . . . . 23
7.1.1 The bj_ppt structure. . . 23
7.1.2 The bj_ppte structure. . . 25
7.2 Per System Information . . . . 26
7.2.1 The bj_pst structure. . . 27
7.2.2 The bj_pste structure. . . 29
7.2.3 The bj_pn_table structure 35
7.2.4 The bj_check_in_table
Structure . . . . . . . . . . . 36
7.3 Per Transaction Information . . 38
7.3.1 The bj_txt structure. . . 38
7.3.2 The bj_txte structure. . . 39
8 Table used by the Hardcore . . . . . 43
Multics Technical Bulletin MTB-560
DM: Before Journal Design
1 INTRODUCTION
The before journal manager is organized into a set of primitives,
callable as a set of before_journal_manager_$ entry points.
These primitives, in order to perform their job, have to write
some information into the before journal or read some information
from it. When writing, they prepare the data to be written in
what is referred to as a logical record; then they call upon a
storage utility module of the before journal manager to actually
store the data in the journal. Each time the storage module
stores a logical record into the journal, it returns a record
identifier which can be used later to retrieve that logical
record. When reading, these primitives call again upon the
storage module, with a record identifier; they get back the
logical record as it was presented to the storage module at the
time the record was stored.
2 BEFORE JOURNAL LOGICAL RECORDS
Logical records stored in a before journal are of various types
depending on their function in the journal. Each type of record
contains specific information stored according to a specific
format.
2.1 Logical Record Types
The following logical record types have been defined:
o The "begin" type record, stored in the journal at the beginning
of a transaction.
o The "committed" type record, stored in the journal at the end
of a transaction to indicate that the transaction has been
committed. The transaction is in fact committed at the very
instant the committed record physically appears on the journal.
o The "aborted" type record, stored in the journal at the end of
a transaction to indicate that the transaction has been
aborted. The transaction is in fact aborted at the very
instant the abort record physically appears on the journal.
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
o The "checkpoint" type record, stored in the journal to indicate
that the transaction has performed a checkpoint and to keep (or
point to) the information describing the state of the
transaction at the time of the checkpoint.
o The "rolled_back" type record, stored in the journal after a
transaction has been rolled back, partially or completely. It
indicates up to which checkpoint the rollback has been
performed; it also indicates that all I/O's generated by the
rollback have been physically completed.
o The "before_image" type record, stored in the journal each time
a transaction is about to modify a protected file, to indicate
what file is about to be modified and how to undo the
modification if the transaction is to be rolled back.
o The "rollback_handler" type record, stored in the journal each
time a transaction is about to perform an action that could not
be rolled back using the standard before image mechanism. In
theses situations, the transaction stores in the journal the
name of a procedure which would have to be invoked at rollback
time to undo the action; the record containing the name of this
procedure is called a rollback handler record; it also contains
any input information needed by the procedure to do its job.
o The "fm_rollback_handler" type record, used by the file manager
as a specially tailored rollback handler. The file manager
could have used the general form of rollback handler record
described above. For ease of implementation, and for
robustness reasons, it uses a special form of record.
o The "fm_post_commit_handler" type record, used by the file
manager to postpone an action until the transaction commits. A
typical example of its usage is when a protected file is
deleted in the middle of a transaction. Suppose that the file
manager really deletes the file. If the transaction needs to
be aborted a little later, it would be difficult to recreate
the file as it was at the beginning of the transaction, unless
the entire file was saved, just in case of an abort. Instead,
the file manager does a logical deletion of the file, and makes
arrangement with the commit mechanism to get control when the
transaction is committed, in order to turn the logical deletion
into a real deletion, as a post_commit action.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
There may be several post_commit actions awaiting for a
transaction to commit. For each of them, the description of
what is to be done is kept in a fm_post_commit handler record.
This list of records associated with a transaction must be such
that it can be available even after a system crash. Since
records in the before journal have this property, the journal
has been chosen as the place where to keep that list, even
though the list is not needed for rollback purposes.
o The "begin_commit" type record is used to indicate the
beginning of a commit which has post_commit actions associated
with it. The protocol is to write a begin_commit mark in the
journal, then do the post_commit actions and then write the
committed mark in the journal. As soon as the begin_commit
mark is physically in the journal, the transaction cannot be
aborted; it has no alternative other than executing the
post_commit actions. If the system crashes while executing the
post_commit actions, they will be re-executed by the recovery
system and the transaction will be committed. Committing a
transaction that has no post_commit actions pending is done
without a begin_commit record.
Additional record types may be defined later, if and when the
need for new types arises.
2.2 Logical Record Contents and Formats
All before journal records start with a header that has a
standard format regardless of the record type; some record types
consist only of the header; some other record types consist of
the header followed by information with a structure which is
specific to the record type.
o The logical record header
The header has the following structure:
dcl 1 bj_rec_hdr aligned based,
2 type char (4),
2 tid bit (36),
2 process_id bit (36),
2 prev_rec_id bit (36),
2 prev_rec_byte_size fixed bin (24),
2 tx_rec_no fixed bin (35),
2 n_txn fixed bin;
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
where:
"type" is the type of the record, i.e., one of the types
described above.
"tid" is the transaction identifier of the transaction
which produced the record.
"pid" is the process identifier of the process which
produced the record.
"prev_rec_id" is the record identifier of the previous
record produced by the same transaction and still
useful to the transaction. This record identifier is
used by the rollback procedure to get before journal
records in reverse chronological order for a given
transaction. It consists of a [control interval
number, slot number] pair.
"prev_rec_byte_size" is the number of bytes of the
previous record produced by the same transaction.
"tx_rec_no" is the record number within the transaction.
This item is redundant and is used to perform some
consistency checking on the before journal.
"n_txn" is the number of transactions still in progress
and that have stored at least one record in this
journal, at the time this particular record is stored
in the journal. This item is used by recovery after
crash in the following manner: To find all
unfinished transactions, recovery after crash starts
with the last record written in the journal, and
examines all records in the reverse order of their
production. With the n_txn item in every record, the
last record indicates to the program how many
unfinished transactions there are; as soon as this
number of unfinished transactions has been found, it
is no longer necessary to examine the rest of the
journal. Without the availability of this
information, the journal would have to be read until
the origin is encountered.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
o The "begin" record
The begin record consists only of the header. It has no type
specific information.
o The "committed" record
The committed record consists only of the header. It has no
type specific information.
o The "aborted" record
The aborted type record consists of the header only. It has
no type specific information.
o The "checkpoint" record
The checkpoint record consists of the header followed by some
information about the checkpoint. This information will be
defined later, since checkpoints will not be available in the
first release.
o The "rolled_back" record
The rolled_back type record has the following structure:
dcl 1 bj_rolled_back_rec aligned based,
2 header like bj_rec_hdr,
2 checkpoint_no fixed bin (35),
2 last_written_rec_id bit (36);
where:
"header" is the logical record header.
"checkpoint_no" is the number of the checkpoint to which
the transaction was rolled back.
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
"last_written_rec_id" is the record id of the last record
produced by this transaction. It is kept for
debugging purposes.
o The "before_image" record
The before image record has the following structure:
dcl 1 bj_before_image aligned based,
2 header like bj_rec_hdr,
2 fm_uid bit (36),
2 fm_oid bit (36),
2 ci_no fixed bin (35),
2 n_parts fixed bin (17),
2 image_len fixed bin (24),
2 part dim (refer
(bj_before_image.n_parts)),
3 byte_offset fixed bin (24),
3 byte_length fixed bin (24),
2 image char (refer
(bj_before_image.image_len));
where:
"header" is the logical record header.
"fm_uid" is the file unique identifier of the file about
to be modified. This identifier will be used by the
rollback procedure in order to identify the file on
which the modification has to be undone when using
this before image record. Since the Multics file
system does not provide a search facility to find a
file by its unique identifier, the recovery mechanism
(and more specifically the file manager) maintains a
conversion table from file unique identifiers to file
pathnames, for all files that may be needed for
rollback.
"fm_oid" is the file opening identifier of the file about
to be modified. It is used by the rollback procedure
when executing in the same process as the transaction
to rollback.
"ci_no" is the control interval number of the control
interval which is about to be modified.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
"n_parts" is the number of parts about to be modified in
the control interval. In this context, a part is a
bit string defined by its byte offset within the
control interval and its length in number of bytes.
"image_len" is the length of the record in number of
bytes.
"part" is an array, with one entry for each part about to
be modified.
"byte_offset" is the byte offset, in the control interval,
of the beginning of the part.
"byte_length" is the number of bytes of the part.
"image" is the concatenation of the bit string
representation of all parts to be modified, in the
order in which they are described in the part array.
o The "rollback_handler" record
This record has the following structure:
dcl 1 bj_rollback_handler_rec aligned based,
2 header like bj_rec_hdr,
2 name_len fixed bin (24),
2 info_len fixed bin (24),
2 proc_name char (refer
(bj_rollback_handler_rec.name_len)),
2 info_bits bit (refer
(bj_rollback_handler_rec.info_len));
where:
"header" is the standard header of any before journal
logical record.
"name_len" is the number of characters of the proc_name
item found below.
"info_len" is the number of bits of the info_bits item
found below.
"proc_name" is name of the procedure to be invoked.
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
"info_bits" is the bit string representation of the input
information that this procedure needs to do its job.
o The "fm_rollback_handler" record.
This record has the following structure:
dcl 1 bj_fm_handler_rec aligned based,
2 header like bj_rec_hdr,
2 fm_uid bit (36),
2 fm_oid bit (36),
2 prev_fm_handler_rec_id bit (36),
2 info_len fixed bin,
2 info_bytes char (refer
(bj_fm_handler_rec.info_len));
where:
"header" is the standard record header.
"fm_uid" is the uid of the file for which this record is
relevant.
"fm_oid" is the file opening id of that file.
"prev_fm_handler_rec_id" is the record id of the previous
fm_handler_record in this transaction.
"info_len" is the length, in number of bytes, of the
info_bytes item below.
"info_bytes" is the information needed by the file manager
procedure to do its job.
o The "fm_post_commit_handler" record.
This record has the same structure and contents as the
fm_rollback_handler record described above.
o The "begin_commit" record.
This record consists of the header only. It has no type
specific information.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
2.3 Summary of the Logical Records Formats
The formats of the various before journal record types is given
in terms of PL/1 declarations.
dcl 1 bj_rec_hdr aligned based,
2 type char (4),
2 tid bit (36),
2 process_id bit (36),
2 prev_rec_id bit (36),
2 prev_rec_byte_size fixed bin (24),
2 tx_rec_no fixed bin (35),
2 n_txn fixed bin;
dcl 1 bj_committed_rec aligned like bj_rec_hdr based;
dcl 1 bj_begin_commit_rec aligned like bj_rec_hdr based;
dcl 1 bj_aborted_rec aligned like bj_rec_hdr based;
dcl 1 bj_rolled_back_rec aligned based,
2 header like bj_rec_hdr,
2 checkpoint_no fixed bin (35),
2 last_rolled_back_rec_id bit (36);
dcl 1 bj_rollback_handler_rec aligned based,
2 header like bj_rec_hdr,
2 name_len fixed bin (24),
2 info_len fixed bin (24),
2 proc_name char (refer
(bj_rollback_handler_rec.name_len)),
2 info_bits bit (refer
(bj_rollback_handler_rec.info_len));
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
dcl 1 bj_before_image aligned based,
2 header like bj_rec_hdr,
2 fm_uid bit (36),
2 fm_oid bit (36),
2 ci_no fixed bin (35),
2 n_parts fixed bin (17),
2 image_len fixed bin (24),
2 part dim (refer
(bj_before_image.n_parts)),
3 byte_offset fixed bin (24),
3 byte_length fixed bin (24),
2 image char (refer
(bj_before_image.image_len));
dcl 1 bj_fm_handler_rec aligned based,
2 header like bj_rec_hdr,
2 fm_uid bit (36),
2 fm_oid bit (36),
2 prev_fm_handler_rec_id bit (36),
2 info_len fixed bin,
2 info_bytes char (refer
(bj_fm_handler_rec.info_len));
Multics Technical Bulletin MTB-560
DM: Before Journal Design
3 BEFORE JOURNAL FILE CHARACTERISTICS
Logical records of the before journal are stored in a data
management file, accessed through the file manager. The contents
of the file is to be manipulated by the before journal manager
only. The type of file used for the before journal is referred
to as a "circular sequential" file.
Logical records are always appended to the end of the
journal. They are never inserted in the middle, modified, moved,
grown, shrunk or deleted. The basic retrieval modes are (a)
sequential in reverse order, used by recovery after crash to find
all unfinished transactions, and (b) random with a record
identifier, used by rollback.
When a transaction terminates, all before journal records
that it produced become useless since the transaction should no
longer be rolled back. As a result, after a certain time, the
first n records of the journal become useless.
When the file is full, the first n consecutive useless
records are "erased" from the beginning of the file and their
space is made available to continue to append new records. The
beginning of the file is redefined to be at the first useful
record. This is why the file used for before journal is said to
be a circular file.
Because of the special characteristics of Before Journals, a
new file format and new access methods, which are tailored for
circular sequential files have been defined, rather than using
those, more general, defined for the record and index managers
(MTB 552).
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
4 BEFORE JOURNAL FILE FORMAT
A before journal is stored in a data management file, which
is made of Control Intervals (CI's). It consists of a header for
the journal, stored in CI zero, followed by a sequence of CI's
containing before journal logical records. These CI's start with
a CI header, which is very similar to the header used by the
record and index managers. In each CI a variable number of
variable length elements can be allocated, as for CI's used by
the record and index managers; but the structure of these
elements has been simplified for the before journal due to the
sequential nature of the journal.
4.1 The before journal header in CI zero
The before journal header is located in CI zero, after the
file manager standard ci_header. Its structure is identical to
the "bj_pste" stucture used by the before journal manager and
described in section 7.2.2 of this memo. Bj_pste stands for
"before journal per system table entry". The before journal
manager uses a per system table with one entry for each before
journal currently used in the system. The entry for a journal is
referred to as the bj_pste for that journal.
Some items found in the header are of permanent nature, such
as the journal identifier, the CI size, the maximum size of the
journal; they are used by the before journal manager to
initialize the "bj_pste" structure when the before journal is
made operational for journalization.
Some other items are very dynamic in nature and change
frequently while journalization is being performed. The values
of these items are always up to date in the "bj_pste" structure
and are updated in the header only periodically. They are used
after a system crash, when the "bj_pste" is no longer available,
as a starting point to find out where the end of the journal was
at the time of the crash.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
4.2 The Control Interval Structure
Like all control intervals, control intervals of a before
journal start with a standard file manager 4-word header,
referred to as the "ci_header". It is followed by a structure
specific to before journal control intervals.
4.2.1 THE FILE MANAGER CI_HEADER STRUCTURE.
The ci_header structure is maintained by the file manager.
It is described in detail in MTB-554; it is given here for the
reader's convenience.
dcl 1 ci_header aligned based,
2 stamp,
3 version bit (9) unal,
3 bj_idx fixed bin (9) uns unal,
3 time_modified fixed bin (53) unal,
2 id,
3 uid bit (36),
3 size_code unal,
4 exponent fixed bin (6) uns,
4 addon fixed bin (3) uns,
3 num fixed bin (27) uns unal;
where:
"ci_header.stamp.version" is a version number for this
structure.
"ci_header.stamp.bj_idx" is always zero for unprotected
files; therefore it is zero for a before journal.
For protected files, it indicates the index of the
before journal used to modify the control interval.
This index is the index of the journal in the
per-system table (See section 7.2); it is also the
index of the entry, in the dm_journal hardcore table,
which contains the time stamp used by page control to
determine if the control interval should be held in
main memory or if it may be written to disk (See
section 8).
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
"ci_header.time_modified" is clock value at the time the
file manager did the last modification to the control
interval. Since records are only appended to the
journal, control intervals in the journal have a
monotonically increasing time value. This property
is used by recovery after crash to determine where
the end of the journal was at the time of the crash.
"ci_header.id.uid" is the file unique identifier. The
file manager automatically stores it in all control
intervals of the file.
"ci_header.id.size_code" specifies the size of the control
interval in number of bytes, using "exponent" and
"addon", according to the following expression: Size
= (64 + 8 * addon) * 2 ** exponent.
"ci_header.id.num" is the control interval number of the
control interval. The ci_header contains the full
identifier of the control interval: file uid and ci
number.
4.2.2 THE BEFORE JOURNAL BJ_CI STRUCTURE.
dcl 1 bj_ci based aligned,
2 header1 like ci_header,
2 header2,
3 layout_type bit (36),
3 first_rec_id bit (36),
3 n_slots fixed bin (17) unal,
3 first_is_contn bit (1) unal,
3 last_is_contd bit (1) unal,
3 pad bit (16) unal,
3 n_bi fixed bin (35),
3 reserved bit (36) dim (4),
2 slot dim (1:1000),
3 offset fixed bin (18) uns unal,
3 length fixed bin (18) uns unal;
where:
Multics Technical Bulletin MTB-560
DM: Before Journal Design
"header1" is the file manager ci_header described above.
"header2" is the before journal manager header. Its items
are described below.
"layout_type" is a type indicating what kind of control
interval it is. The various types used so far are
before journal type and collection manager type. By
convention, this type is always found after the
ci_header.
"first_rec_id" is the record id of the logical record of
which the first element of this control interval is
the continuation. If the first element of the CI is
not the continuation of any record, this item is
irrelevant and its value is undefined.
"n_slots" is the number of elements stored in the control
interval. It is incremented by 1 only after the
element has been stored.
"first_is_contn" is a switch that, when ON, indicates that
the first element is the continuation of a record
that starts in a previous control interval. This
switch is allocated in the same word as the n_slots
item and is carefuly set in an atomic store operation
which brings into existence the first element, the
switch indicating it is the continuation of a record,
and the record id of the record in which this first
element belongs.
"last_is_contd" is a switch that, when ON, indicates that
the last element of the control interval is part of a
record which is continued in the next control
interval. Like the first_is_contn switch, it is
allocated in the same word as the n_slots item and it
is set in an atomic operation at the same time the
n_slots value is incremeneted.
"n_bi" is the number of before images that have been
stored in this control interval. It is used to keep
a count of how many pages may be held in main memory
until the control interval is known to be on disk.
"slot" is the element postion table, which describes the
position and length of all elements stored in the CI.
The protocol to add an element into a control
interval in an atomic manner is as follows: First,
write the contents of the element in the CI, then
store its offset and length in the next available
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
slot entry; at this point the element still does not
exist. If the element is the first and is the
continuation of the previous element, set the value
of first_rec_id. Finally, prepare a word with the
incremented value of n_slots and the appropiate
values for the first_is_contn and last_is_contd
switches and store the word in its location in
header2.
"offset" is the byte offset of the beginning of the
element. An element may start at any byte, it does
not have to start at a word boundary. This offset is
relative to the "addressable portion" of the control
interval, that is, relative to the beginning of
header2.
"length" is the length of the element, in number of bytes.
4.3 Elements
The element_position_table describes all elements that are
allocated in the CI. An element in the file is identified by its
CI number (ci) and its index (ix) in the position table:
element_id = (ci, ix)
The first element in a control interval is the element with
ix equal to 1. The last element in a control interval is the
element with the highest ix.
The first element of the journal is the first element of the
first CI of the journal. The last element of the journal is last
element in the last modified CI of the journal.
Two elements are consecutive if (a) they are in the same CI
and their indices differ by one or (b) one is the last in a CI
and the other is the first in the next CI.
When a new element is allocated, it is always allocated in
such a way as to become the next element relative to the element
that was the last element.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
An element is allocated in a CI in order to contain a
logical record or part of it. When a logical record needs to be
appended to the journal, a new element is allocated, in such a
way that it becomes the last element of the list and the next
element to the one which was last before the creation of the new
element. Since an element is, by definition, entirely in one CI,
it is possible that the size of the new allocated element is not
large enough to contain the entire logical record. In this case
the new element is used to store the first part of the logical
record, and one or several new elements are allocated to hold the
rest of the logical record. By convention, if more than one
element is needed to store a logical record, all these elements
are consecutive elements.
Like in control intervals used by the record and index
managers, whose storage is managed by the Collection Manager (MTB
552), elements come in fours flavors, depending on whether or not
they contain a full logical record and on what part of the
logical record they contain. An element may contain:
- The entire logical record or
- Only the first part of a logical record or
- Only the last part of a logical record or
- Only (one of) the middle part(s) of a logical record.
The simplicity of the operations performed on the journal
have led to the definition of an element structure simpler than
the standard structure used by the Record Manager. An element
contains no header, trailer or pointer to another element; it
only contains (all or some) bytes of a logical record.
4.4 Logical Record Identifiers
A logical record identifier is the element identifier of the
element in which it is stored. If the logical record is stored
in several elements, its identifier is the element in which its
first part is stored. The other parts are stored in consecutive
elements.
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
5 LOGICAL RECORDS STORAGE AND RETRIEVAL OPERATIONS
The Before Journal Manager has a utility module which
provides a set of operations to store and retrieve logical
records in and from a Before Journal. This module does not
understand the contents of a logical record and views it merely
as a bit string.
This storage module and its primitives are described in
detail in MTB-567: "DM: Before journal storage operations". It
provides the following services:
1. Append a logical record at the end of a Before Journal and
return the record id assigned to the appended record.
2. Flush the journal to disk up to, and including, a given
logical record, and wait for I/O completion.
3. Get the logical record specified by its record identifier from
a Before Journal.
4. Get the logical record that precedes the logical record
specified by its record id from a Before Journal, with its
record identifier.
5. Get the last logical record that was appended to a Before
Journal, with its record identifier.
6. Recycle that portion of the journal which is no longer useful.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
6 LOGICAL VS PHYSICAL JOURNALIZATION
When the Before Journal Manager is called to write some
information in the journal, say it is called to journalize a
protected file modification about to take place, it appends the
appropriate logical record to the journal and returns to its
caller with no guarantee that the appended record is effectively
on disk. Logical journalization is completed after the record is
appended to the journal and control is returned to the caller.
Physical journalization is completed when the logical record
physically appears on disk.
The Transaction Manager must be aware of this distinction in
order to correctly implement the commitment of a transaction.
The Before Journal Manager, in turn, must provide adequate
support for the Transaction Manager to deal with physical
journalization.
6.1 Logical Journalization
When the Before Journal Manager is called to store some
information in the Journal, it manufactures the logical record
and calls the storage utility primitive bj_storage_append, to
append it to the end of the journal. The append procedure
appends the record by storing it in a temporary buffer that looks
exactly like the control interval in which the record is to be
stored. When a new record needs to be appended, possibly by
another transaction, it is put in the temporary buffer, and so on
while the temporary buffer has enough room to accommodate the new
record. For all these records, logical journalization is
considered completed, even though they have not been stored in
the journal file yet.
When the temporary buffer is full, the append procedure
copies it in the journal file, in the appropriate CI, by calling
the file manager at its "put" entry point.
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
6.2 Physical Journalization
The physical journalization of a logical record start when
the write to disk request has been queued; it is completed when
the disk I/O is completed. The Before Journal Manager provides
the "flush" primitive which allows the transaction manager to
request that physical journalization be performed for all logical
records appended to the journal by a given transaction:
call before_journal_manager_$flush_transaction (txn_id,...)
When issuing this call, the Transaction Manager requests that,
for all records appended to the journal by the current
transaction, physical journalization be started if necessary, and
completed. What happens when the flush procedure is executed is
described below.
The before_journal_manager_$flush_transaction procedure
determines the record id of the last logical record appended by
the transaction to the journal. It finds it in the "bj_txte" for
this transaction, which contains before journal information
related to that transaction (See section 7.3). It translates the
request into a call to its storage utility module, to cause
physical journalization of all records of the journal, up to and
including the last record appended by the transaction:
call bj_storage_flush (bj_oid, last_record_id,...)
The bj_storage_flush procedure keeps track of the last record for
which physical journalization is completed, the last record for
which physical journalization has been started, the last record
which was copied from the temporary buffer to the journal file
and the last record still in the temporary buffer. The
sequential nature of the journal makes this housekeeping easy.
By looking at the last_record_id passed by the caller, it can
determine which ones of the following steps have to be taken, if
any, in order to satisfy the caller's request:
- Copy the temporary buffer into the journal file, if necessary.
- Flush the necessary CI's.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
In order to flush the journal, the bj_storage_flush procedure
calls upon a file manager primitive provided for that purpose,
which causes physical journalization of a specified number of
consecutive control intervals starting from a specified control
interval of a given file.
call file_manager$flush_consecutive_ci (pf_oid, first_ci, n_ci)
The file manager, in the first implementation on top of
multi-segment files, will call upon page control to really issue
the necessary I/O's and wait for them. When "non-segmented
files" are implemented, the file manager will probably be the
"non-segmented file" manager and will be able to issue I/O
requests and wait for them without going through page control.
6.3 File manager services
The before journal manager expects that the file manager
will provide the following services:
1. file_manager$flush_consecutive_ci (pf_oid, first_ci, n_ci)
This procedure flushed a number of consecutive CI's
specified by "n_ci", starting at the CI specified by "first_ci"
in the page file specified by its opening id "pf_oid", and waits
until all CI's specified by the caller are physically written to
disk.
2. file_manager$open_by_uid (file_uid, file_opening_id)
This function is needed by the rollback facility, when
executed in a process other that the process which started the
transaction. When a before image is recorded in the journal, the
file is identified by its unique id; storing the entire pathname
of the file in the before image would be too inefficient. So,
the file manager is responsible for keeping a uid-pathname table
with one entry for each file susceptible of being rolled back.
The open_by_uid procedure starts by searching the
uid_pathname table for the pathname associated with the file_uid
passed by the caller; then it uses the standard open mechanism.
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
6.4 Page Control Services
In order to be able to rollback after a system crash with
ESD failure, it is necessary to enforce a protocol referred to as
the "Write Ahead Log" (WAL) protocol, which requires always
writing a before image to disk before the associated data base
modification is itself written to disk. In the current
implementation, all control intervals are pages written out by
page control. It is therefore necessary to have page control
cooperate with the data management system in order to enforce the
WAL protocol. This is described in detail in MTB-564: "Phasing
Page Control and Before Journal". A short description is also
given in section 8.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
7 TABLES USED BY THE BEFORE JOURNAL MANAGER
During journalization, the before journal manager uses
several tables in addition to the journal itself. These tables
contain before journal information of 3 categories:
- Per process information
- Per system information
- Per transaction information
7.1 Per Process Information
The before journal manager uses a per process table, the
pj_ppt, which consists of a header followed by an array of
entries, one for each journal open in this process. This table
is created and initialized at dm_per_process initialization.
The bj_ppt table is read and modified by one process only;
no lock is necessary. Modifications to the table are made by
carefully written programs in such a way as to appear atomic,
even if a process fails in the middle of an update operation.
The PL/1 declarations for the bj_ppt and the bj_ppte are
given below, with a short description for each item in the
structures.
7.1.1 THE BJ_PPT STRUCTURE.
dcl 1 bj_ppt based aligned,
2 version fixed bin,
2 max_n_entries fixed bin,
2 n_entries_used fixed bin,
2 highest_ix_used fixed bin,
2 default_bj,
3 user_set_oid bit (36),
3 last_opened_oid bit (36),
2 process_id bit (36),
2 process_ix fixed bin,
2 mod_list_area (100) fixed bin (35),
2 e dim (dm_system_data_$
bj_max_n_journals refer
(bj_ppt.max_n_entries))
like bj_ppte;
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
where:
"version" is the version of the bj_ppt structure.
"max_n_entries" is the maximum number of "e" entries in this
table. It is initialized with the value of
dm_system_data$bj_max_n_journals, at dm_per_process
initialization, when the bj_ppt table is created.
"n_entries_used" is the number of entries used, that is, the
number of journals opened by this process.
"highest_ix_used" is the highest index of all used entries.
"default_bj.user_set_oid" is the bj opening id of the journal
established as the default journal in this process by an
explicit call to the entry point set_default_bj in the
before journal manager.
"default_bj.last_opened_oid" is the opening id of the last
journal opened in this process, and still open.
"process_id" is the process id of the current process. It is
set at dm_per_process initialization, and is used in any
fast paths where it is desirable to save an external
call to get_processid_ ( ).
"bj_process_ix" is the index assigned to this process in a
system table of the before journal manager, the
bj_check_in_table. This table has 2 parts: The first
part consists of an array of process_ids, with one used
entry for every process registered in the before journal
manager; bj_process_ix is the index assigned to the
process in this array. The second part of the
bj_check_in_table consists of a 2 dimensional matrix
where each element (p, j) is a one bit element, equal to
1 if the journal with index j is opened by process with
index p (See section 7.2.4).
Multics Technical Bulletin MTB-560
DM: Before Journal Design
7.1.2 THE BJ_PPTE STRUCTURE.
dcl 1 bj_ppte based aligned,
2 version fixed bin,
2 bj_uid bit (36),
2 pf_oid bit (36),
2 n_opening fixed bin,
2 bj_pste_ptr ptr,
2 open_time fixed bin (71);
where:
"version" is the version number of this structure.
"bj_uid" is the unique identifier of the journal. By
convention, it is the same as the file uid.
"file_oid" is the opening id assigned by the file manager to
the data management file used to implement the journal.
This file_oid is how the file must be referred to by the
before journal manager when calling the file manager to
perform a get, put or flush operation on the file.
"n_opening" is the number of times this process requested
this before journal to be opened, by calling the "open"
entry point of the before journal manager. It is
incremented by 1 each time before_journal_manager_$open
is called, and decremented by 1 each time
before_journal_manager_$close is called. When it
reaches 0, the process is "checked out" of the per
system bj_check_in_table, that is, its process_id is
removed from the table of process_ids in the
bj_check_in_table, and the number of processes using
this journal is decremented in the bj_pste entry (See
section 7.2.2).
"bj_pste_ptr" is a pointer to the per system entry describing
this journal, in the per system table bj_pst.
"open_time" is the time this journal was open in this
process. It is used to determine what the default
before journal should be, when no default is in effect.
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
7.2 Per System Information
The before journal manager uses a per system table which
consists of a header and an array of entries, one for each
journal opened by at least one process in the system. This table
is created and initialized at dm_per_system initialization. The
table is referred to as the bj_pst, and an entry in the table as
a bj_pste. Every bj_pste in use contains system shared
information describing a particular before journal currenly in
use by at least one process. If a given journal is open in
several processes, each process has its own bj_ppte, but all
these bj_ppte point to a unique bj_pste for this journal. Two
bj_pste's never describe the same journal.
The bj_pst and bj_pste's are shared by all processes
executing in the before journal manager. Concurrent access to
the same information is prevented by the use of exclusive locks.
These locks must be fast to acquire and to release and they must
not be held until the end of the transaction. The before journal
manager must be able to set them and release them as required by
its logic. Fast locks will be available for this purpose,
modeled after the locks provided by "set_lock" but implemented
with actual notification rather than with the calender clock.
Two classes of operations need synchronization: Operations
on the header of the table, to allocate or free an entry; and
operations on a specific entry, to perform journalization on a
given journal. A lock will be associated with the header, and a
lock will be associated with each entry.
Modifications to the header or to individual entries are
made by carefully written programs in such a way as to appear
atomic, even if a process fails in the middle of an update
operation.
The PL/1 declarations for the bj_pst and bj_pste structures
are given below, with a short description for each item in the
structure.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
7.2.1 THE BJ_PST STRUCTURE.
dcl 1 bj_pst based aligned,
2 version fixed bin,
2 pad1 bit (36),
2 lock,
3 pid bit (36),
3 event bit (36),
2 time_of_bootload fixed bin (71),
2 max_n_entries fixed bin,
2 n_entries_used fixed bin,
2 highest_ix_used fixed bin,
2 pn_table_offset fixed bin (18) uns,
2 check_in_table_offset fixed bin (18) uns,
2 buffer_table_offset fixed bin (18) uns,
2 max_n_buffers fixed bin,
2 pad2 bit (36),
2 meters,
3 n_calls_begin_txn fixed bin (71),
3 n_calls_before_image fixed bin (71),
3 n_calls_abort fixed bin (71),
3 n_calls_commit fixed bin (71),
3 n_calls_rb_mark fixed bin (71),
3 n_calls_fm_pc_mark fixed bin (71),
3 n_calls_fm_rbh fixed bin (71),
3 n_calls_rollback fixed bin (71),
3 meter dim (9:50) fixed bin (71),
2 mod_list_area (100) fixed bin (35),
2 e dim (dm_system_data_$
bj_max_n_journals refer
(bj_pst.max_n_entries))
like bj_pste;
where:
"version" is the version number of the structure.
"lock" is a 2-word lock in the format assumed by the fast
lock facility of the lock manager. The first word
contains the process id of the process that has the
lock. The before journal manager uses this lock to
perform open and close operations. Any process doing an
open or a close of the before journal must acquire the
lock, and keeps it for the entire operation.
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
"time_of_bootload" is the time the Multics system was last
initialized. It is set at dm_per_system initialization
using the value of system_control_$time_of_bootload.
"max_n_entries" is the maximum number of before journals that
can be "active" at the same time. This item is set at
dm_per_system_init time to the value found in
dm_system_data_$bj_max_n_journals.
"n_entries_used" is the number of entries in this table
actually used.
"highest_ix_used" is the highest index in this table of all
used entries.
"pn_table_offset" is the offset of the uid_pathname table
maintained by the before journal manager for all
journals that are active, i.e., that have an entry in
the bj_pst segment. This uid_pathname table is
allocated in the same segment as the bj_pst. It is used
by recovery after crash to determine what before
journals must be recovered.
"check_in_table_offset" is the offset of the
bj_check_in_table. This table is allocated in the same
segment as the bj_pst table; it is used to record what
process has what journal open.
"buffer_table_offset" is the offset to the bj_buffer_table.
This table is allocated in the same segment at the
bj_pst table. Each entry in this table is one page long
and starts at a page boundary. Entry i is associated
with the journal described in bj_pst.e(i), and is used
as the buffer for that journal.
"max_n_buffers" is no longer used.
"meters" are various meters maintained by before journal
manager procedures.
"mod_list_area" is not used currently.
"e" is an array of entries like bj_pste's. Each entry, when
used, describes an active before journal. The number of
entries of this array is specified by
dm_system_data_$bj_max_n_journals.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
7.2.2 THE BJ_PSTE STRUCTURE.
dcl 1 bj_pste based aligned,
2 version fixed bin,
2 bj_ix fixed bin,
2 lock aligned,
3 pid bit (36),
3 event bit (36),
2 bj_uid bit (36),
2 ci_size fixed bin,
2 max_size fixed bin,
2 active bit (1) aligned,
2 time_header_updated fixed bin (71),
2 earliest_meaningful_time fixed bin (71),
2 update_frequency fixed bin,
2 last_rec_id bit (36),
2 n_processes fixed bin,
2 n_txn fixed bin,
2 last_ci_info aligned,
3 last_ci_buffered fixed bin (24) uns,
3 last_ci_put fixed bin (24) uns,
3 last_ci_flushed fixed bin (24) uns,
3 last_ci_on_disk fixed bin (24) uns,
3 stamp_for_last_ci_put fixed bin (71),
3 stamp_for_last_ci_on_disk fixed bin (71),
2 n_bi_still_unsafe fixed bin,
2 n_bi_being_saved fixed bin,
2 buffer_offset fixed bin (18) uns,
2 pad1 fixed bin,
2 cl aligned,
3 origin_ci fixed bin (24) uns,
3 lowest_ci fixed bin (24) uns,
3 highest_ci fixed bin (24) uns,
3 number_ci fixed bin (24) uns,
2 append_state aligned,
3 current_operation char (4),
3 pending_n_txn fixed bin,
3 pending_last_rec_id bit (36),
3 pending_last_element_id bit (36),
3 txte_rec_id_relp bit (18),
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
2 pad_to_even_word1 bit (36) aligned,
2 meters aligned,
3 n_bi_written fixed bin (71),
3 n_bi_bytes_written fixed bin (71),
3 n_journal_full fixed bin (71),
3 n_successful_recycles fixed bin (71),
3 n_ci_recycled fixed bin (71),
3 n_txn_started fixed bin (71),
3 n_non_null_txn fixed bin (71),
3 meter (8:10) fixed bin (71),
2 pad_to_64_words (6) bit (36);
where:
"version" is the version number of the bj_pste structure. By
convention, if version = 0, the entry is not used.
While an entry is being initialized to describe a
journal being activated, the value of version is 0. It
is set to the non null value only as the last step of
the journal activation.
"bj_ix" is the index of this entry in the array of bj_pste's.
It is used for consistency checking and also to get the
index of the entry when one has a pointer to the
bj_pste.
"lock" is a 2-word lock in the format assumed by the fast
lock facility of the lock manager. It is used always as
an exclusive lock. All storage operations, i.e.
append, get and flush must acquire this lock. A process
can be working on only 1 bj_pste at a time, so there is
no possibility of deadlock due to locks on 2 bj_pste's.
The bj_pst.lock is set by a process that does an open or
close operation; this does not prevent another process
from acquiring the lock on a particular bj_pste for the
purpose of doing an append, get or flush operation on a
journal already active and not involved in an open or
close operation.
"bj_uid" is the unique id of the before journal described by
this entry. It is copied from the before journal header
kept in CI, when the journal is activated.
"ci_size" is the control interval size for this journal,
expressed in number of bytes.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
"max_size" is the maximum size of the journal file, expressed
in the number of control intervals.
"active" is a switch showing that the journal is active, i.e.
that it is being used by at least 1 process. This
switch is always ON in the bj_pste. The before journal
header, bj_header in CI zero, has the same structure as
the bj_pste. When this switch is ON in bj_header, it
indicates that the journal is currently active, or that
it was active in a previous invocation of the data
management system, while the system crashed, and may
contain information about unfinished transactions.
"time_header_updated" is the calendar clock time at which the
before journal header was last updated. The header is
updated periodically to help finding the end of the
journal after a system crash.
"earliest_meaningful_time" is a calendar clock time also kept
in the header to help finding the end of the journal
after a system crash. Any CI whose time modified is
smaller than this "earliest_meaningful_time" contains no
useful information. This time is set to the current
time when the journal is activated.
"update_frequency" is a number indicating how often the
header is to be updated. Updating the header consists
of writing the bj_pste information into the bj_header
structure in CI zero. An update frequency of N means
that the header should be updated after N control
intervals have been written.
"last_rec_id" is the record identifer of the last record
successfully appended to the journal.
"n_processes" is the number of processes that have opened
this journal. The journal cannot be "deactivated" while
n_processes is greater than zero. The list of processes
using the journal is recorded in the bj_check_in_table.
Each process that has the journal opened has its
process_id registered in the check_in_table. Dead
processes have their process_id removed from the
check_in_table by a "garbage collector" which
automatically adjusts bj_pste.n_processes to the number
of alive processes.
"n_txn" is the number of unfinished transactions that have
stored at least 1 record in the journal. The journal
cannot be deactivated while this item is greater than
zero, even if n_processes = 0. The reason is that all
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
processes that were using the journal may have died, but
some unfinished transactions are still to be rolled back
by the Data Management Daemon, using this journal.
"last_ci_buffered" is the CI number currently in the buffer.
When a CI is in the buffer, its info is not in the file.
When the buffer is full, it is written into the file, by
a "put" request to the file manager. When the put is
completed, the buffer is initialized as the next CI of
the journal, showing no element in it.
"last_ci_put" is the CI number of the last CI copied from the
buffer to the file. It is updated only after the buffer
is completely "put" in the file. For a short time, the
last_ci_put is equal to the last_ci_buffered. This is a
legitimate situation, indicating that the buffer must be
initialized with the next CI.
"last_ci_flushed" is the CI number of the last CI for which a
flush was requested. The I/O may still be in progress.
"last_ci_on_disk" is the CI number of the last CI on disk and
such that all previous CI's are also on disk.
"stamp_for_last_ci_put" is the time stamp associated with the
last CI put. It is needed by the flush function.
"stamp_for_last_ci_on_disk" is the time stamp associated with
the last CI on disk. This stamp is the stamp that is
stored in the ring zero dm_journal_seg, in the entry
associated with this journal (See section 8). It is
used by the flush function.
"n_bi_still_unsafe" represents the number of before images
that have not yet been secured to disk. This number
indicates how many data base CI's may be pinned in main
memory. (It actually is an upper bound.) It is used by
the "append" procedure to decide if the journal should
be flushed because too many data base CI's are held in
main memory, in order to allow them to go to disk.
"n_bi_being_saved" indicates how many before images are
actually being written to disk. It is used by the flush
function.
"buffer_offset" is the offset of the beginning of the buffer
associated with this journal. The size of the buffer is
the same as the size of the CI for this journal. In the
current design, all buffers are one page long and start
Multics Technical Bulletin MTB-560
DM: Before Journal Design
at a page boundary. They are all allocated in the same
segment as the bj_pst.
"cl.origin_ci" is the CI number of the CI which is currently
the origin of the cirular list. Each time the end of
the journal is reached, the origin is "advanced" as much
as possible, over all CI's that no longer contain useful
information.
"cl.lowest_ci" is the lowest CI number in the circular list.
In the current design it is always 1, and does not
change when the origin is advanced.
"cl.highest_ci" is the highest CI number in the circular
list. In the current design, it is aways equal to the
last CI number of the file, and does not change when the
origin is advanced.
"cl.number_ci" is the number of CI's in the circular list.
In the current design, it is always equal to the maximum
size of the file minus 1 (control interval zero is not
part of the circular list), and does not change when the
origin is advanced.
"append_state" contains information about the append
operation, while a record is being appended to the
journal. The information is used to make the append
operation "atomic". Regardless of where a process may
stop (or die) while in the before journal manager, if a
record has been completely stored in the journal, all
before journal manager control structures (bj_pste and
bj_txte) are automatically updated to reflect the
existence of this record.
"current_operation" is a character string which is null while
no record is being appended. It is equal to "appe"
while a record is being appended. It is restored to a
null character string after the record is completely
stored in the journal, and the various structures
(bj_pste and bj_txte) have been updated accordingly.
"pending_n_txn" is the value that bj_pste.n_txn should have
after the record is successfully appended. This allows
for potentially redoing any number of times the
adjustment of an interrupted append operation, even if
the process crashes during the adjustment.
"pending_last_rec_id" is the value that bj_pste.last_rec_id
should have after the record is successfully appended.
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
"pending_last_element_id" is the element id of the element
about to be stored in the journal. Storing an element
is atomic: storing the last element of a record makes
the entire record appear in the journal in an atomic
manner. The protocol is as follows: All storage
operations are done under the bj_pste lock. If an
append operation is interrupted in the middle, the
process executes a cleanup handler which finds the
pj_pste lock set to the process; if it finds that an
append operation was in progress, it finds out if the
last element of the record was stored. If it was
stored, the relevant items in the bj_pste are updated
and the record id of the new record is stored in the
bj_txte, causing a more complete update of the bj_txte
items later. Then the bj_pste lock is relased. If the
cleanup handler is not given a chance to be executed,
the process must execute a crawl out procedure to exit
the data management inner ring. This crawl out
procedure executes what the cleanup handler would have
executed. In the worst case where the crawl out
procedure is not given a chance to run, the process must
soon be terminated. At a later time, another process
will try to lock the bj_pste and will find that it is
locked by a dead process; then it executes what the
cleanup handler would have executed.
"txte_rec_id_relp" determines the location, in the bj_txte,
which must be filled with the record id of the new
appended record, as explained above. It is a relative
pointer in the segment containing the bj_txte.
"meters" are various meters.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
7.2.3 THE BJ_PN_TABLE STRUCTURE
The bj_pn_table is a table maintained by the before journal
manager to associate the pathname of a journal with its unique
id. For each journal which has a bj_pste, that is, which is open
in at least one process, there is an entry in the bj_pn_table,
with the pathname and the uid of the journal. For a given
journal, the index in the array of bj_pste's and the index in the
bj_pn_table array are the same.
The bj_pn_table is necessary to perform the recovery after
crash. It is the starting point, showing what before journals
were open at the time of the crash. Therefore, this table must
be modified using careful coding, so that it is never
inconsistent; in addition, the table must be flushed to disk
after every modification. It is the only table, with the
uid-pathname table maintained by the file manager, which is
needed by recovery after crash.
The bj_pn_table has the following structure:
dcl 1 bj_pn_table based aligned,
2 max_n_entries fixed bin,
2 bj_path_to_uid_relation dim
(dm_system_data_$bj_max_n_journals
refer (bj_pn_table.max_n_entries)),
3 dir char (168),
3 entry char (32),
3 bj_uid bit (36);
where:
"max_n_entries" is the maximum number of entries in the
table. It is needed by recovery after crash. It is set
at dm_per_system initialization. Its value is copied
from dm_system_data_$bj_max_n_journals, and is equal to
the maximum number of bj_pste entries.
"bj_path_to_uid_relation" is an array with one entry for each
journal. Entry i contains the uid and pathname of the
journal described by entry i of the array of bj_pste's.
"dir" is the directory name of the journal.
"entry" is the entry name of the journal.
"bj_uid" is the unique id of the journal.
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
7.2.4 THE BJ_CHECK_IN_TABLE STRUCTURE
The before journal manager maintains a table showing, for
each journal that has a bj_pste entry, what are the processes in
which the journal is open. It is allocated in the same segment
as the bj_pst. It is not needed for recovery after crash and it
does not have to be flushed to disk after every modification. It
is initialized at dm_per_system initialization, with the rest of
the bj_pst segment.
The bj_check_in_table has the following structure:
dcl 1 bj_check_in_table based aligned,
2 max_n_processes fixed bin,
2 max_n_journals fixed bin,
2 process_id dim
(dm_system_data_$ bj_max_n_processes refer
(bj_check_in_table.max_n_processes))
bit (36),
2 cross_proc_bj dim
(dm_system_data_$bj_max_n_processes refer
(bj_check_in_table.max_n_processes),
dm_system_data_$bj_max_n_journals refer
(bj_check_in_table.max_n_journals))
bit (1) unaligned;
where:
"max_n_processes" is the maximum number of processes allowed
to do data management at the same time. This item is
initialized with the value found in
dm_data_$bj_max_n_processes.
"max_n_journals" is the maximum number of before journals
that can be described in the before journal tables, that
is the maximum number of bj_pste entries. This item is
initialized with the value found in
dm_data_$bj_max_n_journals.
"process_id" is an array of process ids. The number of
entries in this array is determined by
dm_data_$bj_max_n_processes. Each entry is either zero
or contains the value of a process id. The first time a
Multics Technical Bulletin MTB-560
DM: Before Journal Design
process uses the before journal manager, its process_id
is entered in this array. The index of the entry
assigned to this process is used to identify the process
in the cross_proc_bj table (See below).
"cross_proc_bj" is a 2 dimentional matrix. Each element
[p,j] has a one bit value; it is equal to "1"b if
process with index p in the table of process ids has
open the journal with index j in the array of bj_pste's.
One might think that a count of processes associated
with each journal would have been sufficent. In theory,
this is true; but in practice, it is not possible to
know, from a count, which processes are still alive and
which processes are dead. This is the reason why,
instead of a count, the list of process ids is kept for
each journal.
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
7.3 Per Transaction Information
For each transaction in progress, the before journal manager
maintains some information relevant to the transaction and the
Before Journal associated with it. This information is kept in a
system shared table which consists of a header, referred to as
the bj_txt, followed by an array of entries, referred to as
bj_txte's, with one entry for each transaction. The array of
bj_txte's is an array parallel to the array used by the
transaction manager to describe transactions. In fact, entry
number (i) in the array of bj_txte's must be regarded as the
extention of entry number (i) in the array of tm_tdte's in the
transaction definition table used by the transaction manager.
The bj_txt is created and initialized at dm_per_system
initialization.
Modifications to the structure are made by carefully written
programs, in such a way as to appear atomic, even if a process
fails in the middle of an update operation.
The PL/1 declarartions for the bj_txt and bj_txte are given
below, with a short description of each item in these structures.
7.3.1 THE BJ_TXT STRUCTURE.
dcl 1 bj_txt based aligned,
2 version fixed bin,
2 max_n_entries fixed bin,
2 n_entries_used fixed bin,
2 pad_header_to_32_words bit (36) dim (29),
2 entry dim (dm_system_data_$
max_n_transactions refer
(bj_txt.max_n_entries))
like bj_txte;
where:
"version" is a version number associated with this stucture.
"max_n_entries" is the maximum number of entries in this
table. It is initialized to the value of
dm_system_data$max_n_transactions, when the table is
created, at Data Management system initialization.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
"n_entries_used" is the number of entries currently used.
"entry" is an array each element of which has the structure
of bj_txte described below.
7.3.2 THE BJ_TXTE STRUCTURE.
dcl 1 bj_txte based aligned,
2 tid bit (36),
2 bj_uid bit (36),
2 entry_state aligned,
3 last_completed_operation char (4),
3 ok_to_write bit (1),
2 owner_info aligned,
3 process_id bit (36),
2 operator_info aligned,
3 process_id bit (36),
3 ppte_ptr ptr,
3 bj_oid bit (36),
2 records_info aligned,
3 curr_checkpoint_rec_id bit (36),
3 first_bj_rec_id bit (36),
3 last_bj_rec_id bit (36),
3 n_rec_written fixed bin (35),
3 n_bytes_written fixed bin (35),
3 last_fm_postcommit_handler_rec_id bit (36),
2 append_state aligned,
3 current_operation char (4),
3 pending_bj_rec_id bit (36),
3 pending_n_rec_written fixed bin (35),
3 pending_n_bytes_written fixed bin (35),
2 pad_entry_to_32_words bit (36) dim (13);
where:
"tid" is the transaction id of the transaction that has been
assigned this entry. It is initialized at the time a
transaction begins. The entry number as well as the
transaction id are assigned by the transaction manager
and passed as input argument to the before journal
manager.
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
"bj_uid" is the uid of the before journal used by this
transaction.
"last_completed_operation" is the last operation the before
journal manager performed for this transaction. It is
used for consistency checking.
"ok_to_write" is a switch which is set ot 1 at the beginning
of the transaction and turned off after a committed or
aborted mark has been written.
"owner_info.process_id" is the process id of the process
which started the transaction. It is kept in the
bj_txte even though the process may be dead.
"operator_info.process_id" is the process id of the process
operating on this transaction. In general this process
id is the same as the owner's process id. However, if
the process that started the transaction dies or
abandons the transaction while the transaction is still
in progress, it is the Data Management Daemon's
responsibility to abort that transaction. When
performing its duty on the transaction, the Daemon is
said to be the "operator" of the transaction, and before
being able to do any before journal operation, the
bj_txte.operator_info items have to be initialized with
information relevant to the Daemon process.
"operator_info.ppte_ptr" is a pointer to the bj_ppte
structure associated with the before journal used by
this transaction, in the process which is currently
operating the transaction.
"record_info.current_checkpoint_rec_id" is the record id of
the current checkpoint record. At the end of a
checkpoint, a checkpoint record is written in the before
journal.
"records_info.first_bj_rec_id" is the record id of the
earliest record written by this transaction and which is
still of interest to the transaction. The before
journal manager uses the following optimization: when a
transaction begins, the bj_txte is initialized, but no
"begin" record is actually written in the journal. The
first_bj_rec_id is null. If the transaction commits or
aborts without having written in the journal, no
commit/abort mark is actually written in the journal.
The first time a transaction writes a record (other than
commit or abort), first_bj_rec_id is set to be the
Multics Technical Bulletin MTB-560
DM: Before Journal Design
record id of the record. When a transaction is rolled
back, a "rolled_back" record is written in the journal,
and its record id becomes the first_bj_rec_id, unless
the transaction was rolled back to a checkpoint, in
which case first_bj_rec_id retains its value.
"records_info.last_bj_rec_id" is the record id of the last
record written in the journal by this transaction. If
no record was written yet, this item is null.
"records_info.n_rec_written" is the number of records written
by this transaction. It is used for consistency and
metering purposes.
"records_info.n_bytes_written" is the number of bytes this
transaction has written in the before journal. It is
used for consistency and metering purposes.
"records_info.last_fm_postcommit_handler_rec_id" is the
record id of the last file manager post commit handler
record written by this transaction. For economy of
mechanism, the information needed at post commit time is
stored in the before journal as post_commit_handler
records. The before journal manager threads them
together and keeps the end of the thread in this item of
the bj_txte.
"append_state" contains some information to implement the
atomicity of the append operation, with respect to the
bj_txte items. When an append operation is interrupted,
either the record is in the journal or it is not. The
storage manager part of the before journal manager is
capable of finding out; if the record was written, it
updates the bj_pste items accordingly. The
bj_txte.items relevant to the appended record are also
updated using the same method as for the bj_pste.
"append_state.current_operation" is a character string which
is null when no append is in progress. It is non-null
when an append is in progress.
"append_state.pending_bj_rec_id" is the record id of the
record which has just been appended. It appears here
just after the storage manager has successfully appended
the last element of a new record, indicating to the
bj_txte manager that the record is in the journal and
that the bj_txte items relevant to this record must be
updated. If the append operation is interrupted in the
little window after the record is appended and before
its record id appears in this item of the bj_txte, the
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
mechanism described for bj_pste.append_state causes the
bj_txte to be adjusted.
"append_state.pending_n_rec_written" is the value that
bj_txte.records_info.n_rec_written" should have after
the record is successfully appended. This item is
initialized before calling the storage manager to append
the record. If the append operation is interrupted
after the record appears in the journal, this item is
copied into bj_txte.records_info.n_rec_written by the
bj_txte manager adjustment mechanism.
"append_state.pending_n_bytes_written" is the value that
bj_txte.records_info.n_bytes_written should have after
the record is successfully appended. Used as the
previous item.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
8 TABLE USED BY THE HARDCORE
The dm_journal table is a ring 0 system table maintained by
the hardcore support programs for data management. Its purpose
is to provide page control with some of the data needed to
implement the "Write Ahead Log" (WAL) protocol.
It consists of three parts:
o A header with various constants and counters for metering.
o An array with one entry per before journal; the number of
entries in this array is equal to the maximum number of before
journals that can be open at the same time in the system.
o An array with one entry for each page that can be held in main
memory to honor the "WAL" protocol. The number of entries in
this array is equal to the maximum number of pages that page
control is willing to hold at the same time. It is the
responsibility of the before journal manager to regulate the
number of these temporarily wired pages by flushing before
journals as required.
The dm_journal table is in a hardcore segment whose name is
dm_journal_seg and whose ring brackets are 0, 2, 2. The
declaration for the entire table is given below, with a
description of only those items that are relevant to the before
journal manager.
dcl 1 dm_journal aligned based,
2 lock bit (36) aligned,
2 wait_event bit (36) aligned,
2 notify_sw bit (1) aligned,
2 n_journals fixed bin,
2 n_journals_inuse fixed bin,
2 max_held_pages_mem fixed bin,
2 n_held_pages_mem fixed bin,
2 max_held_per_journal fixed bin,
2 per_aste_pool (0:3) aligned,
3 threshold fixed bin,
3 n_active fixed bin,
2 free_list_relp bit (18) aligned,
2 synch_write_calls fixed bin (35),
2 synch_write_holds fixed bin (35),
2 synch_write_invalid fixed bin (35),
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
2 synch_write_tosses fixed bin (35),
2 unlink_calls fixed bin (35),
2 unlink_steps fixed bin (35),
2 activate_calls fixed bin (35),
2 deactivate_calls fixed bin (35),
2 activate_denied fixed bin (35),
2 set_stamp_calls fixed bin (35),
2 allocate_calls fixed bin (35),
2 free_calls fixed bin (35),
2 per_journal (n_dm_journals refer
(dm_journal.n_journals))
aligned
like dm_per_journal,
2 page_entry (max_dm_pages refer
(dm_journal.max_held_pages_mem))
aligned
like dm_page_entry;
dcl 1 dm_per_journal aligned based,
2 time_stamp fixed bin (71),
2 n_held fixed bin,
2 uid bit (36) aligned,
2 access_class bit (72) aligned,
2 entry_relp bit (18) aligned,
2 pad bit (36) aligned;
dcl 1 dm_page_entry aligned based,
2 fp bit (18) unal,
2 bp bit (18) unal,
2 cme_relp bit (18) unal,
2 journal_relp bit (18) unal;
where:
"dm_journal.n_journals" is the maximum number of before
journals that may be "active", i.e. that may have a
bj_pste entry, at the same time in the system.
"dm_journal.n_journal_inuse" is the actual number of before
journals currently active in the system.
Multics Technical Bulletin MTB-560
DM: Before Journal Design
"dm_journal.max_held_pages_mem" is the maximum number of held
pages that page control is willing to honor. It is a
constant defined at bootload time.
"dm_journal.n_held_pages_mem" is the number of pages
currently in main memory that page control would
normally have written out but actually did not, in order
to honor the "WAL" protocol.
"dm_journal.max_held_per_journal" is the result of the
division of dm_journal.max_held_pages_mem by
dm_journal.n_journals_inuse. The hardcore maintains
this item and updates it each time the number of
journals in use changes.
The before journal manager can read the dm_journal_seg_
from ring 2. It uses this item to make sure that the
total number of held pages never exceeds the threshold
allowed by page control. For each journal, it keeps a
count of the number of pages currently held by the
journal; when this count becomes greater than
max_held_per_journal, it flushes the journal.
"dm_journal.per_journal" is an array with one entry per
journal. By convention, for a given journal, the index
in this array is the same as the index of the bj_pste
entry. Each entry has a structure defined by the
"dm_per_journal" declaration.
"dm_per_journal.time_stamp" is a clock value indicating that
all before images produced in this journal up to this
time are known to be on disk; any before image produced
after this time is not on disk yet. Each time the
before journal manager flushes a journal, it knows the
value of this time stamp; it calls upon the hardcore to
enter the time stamp in the entry with the same index as
the bj_pste of the journal. When a file's control
interval is modified, a before image is first taken, say
at time T. This time T and the index of the journal are
then stored in the header of the CI to be modified.
Before writing out a page, page control compares the
time stored in the CI and the time stored in the
corresponding dm_per_journal entry. If the time in the
CI is greater, it means that the journal has not been
flushed far enough yet, and the page is to be held.
Otherwise, the page can be written out.
"dm_per_journal.n_held" is the number of pages currently in
main memory that page control would normally have
MTB-560 Multics Technical Bulletin
DM: Before Journal Design
written out but actually did not because of the time
stamp in this entry.
"dm_per_journal.uid" is the unique id of the before journal
described by this entry. By convention, if it is zero,
this entry is free. When the before journal manager
"activates" a new journal, it calls ring zero, with the
uid of the journal, to allocate an entry. Ring zero
finds a free entry in the dm_journal, initializes it
with the uid of the journal, and returns the index to
the before journal manager, which uses it as a bj_pste
index.
"dm_per_journal.access_class" is the AIM access class of the
process which activated the journal. In the current
implementation, there is only one dm_journal table for
the entire Multics system. It is not possible to use 2
data bases of different AIM classes in the same process.
It is possible to have data bases of different classes
in the system; however, this requires having one data
management system for each access class. Each data
management system would have its own set of bj_pst and
bj_txt tables. Since the dm_journal table is unique for
all AIM classes, it is necessary for ring 2 to ask ring
0 to allocate an entry and use the index assigned by
ring 0.
"dm_per_journal.entry_relp" is a relative pointer to the list
of pages held for this journal.
"page_entry" is an array with one entry for each page that
can be held. It is not relevant to the before journal
manager design.