Multics MTB-660


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

To:       Distribution

From:     Gregory A.  Baryza

Date:     20 August 1984

Subject:  New Pattern-Matching Routines Via Regular Expressions

1.  Abstract

This  document  describes  a  set  of  subroutine  interfaces and
commands which perform textual  pattern matching.  The facilities
described  here  include  those  supplied  by  the "search_file_"
entries used by qedx, emacs,  mail, and others.  In addition, the
syntax proposed  is such that upwardly  compatible extensions may
be made in the future without jeopardizing existing applications.

Comments on this MTB should be sent to the author -

     via Multics mail to:

        Baryza.Multics

     via posted mail to:

        Gregory A.  Baryza
        Honeywell Information Systems, Inc.
        Four Cambridge Center
        Cambridge, Massachusetts, U.S.A.   02142

     via telephone to:

        (HVN)-492-9315,
        (617)-492-9315

     via Multics forum on System-M:

        >udd>Multics>Baryza>Online_Meetings>Regular_Expressions
        >udd>m>Baryza>mtgs>rx

________________________________________

Multics project  internal documentation; not to  be reproduced or
distributed outside the Multics project.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

                        TABLE OF CONTENTS

Section    Page  Subject
=======    ====  =======

1             i  Abstract
2             1  Introduction
3             2  Definition of Regular Expressions
4             5  Commentary
4.1           5  . . The Null String
4.2           5  . . Bounded Iteration
4.3           6  . . Operator Precedence & Associativity
4.4           6  . . Strong vs Weak Alternatives
4.4.1         7  . . . . Strong Alternatives
4.4.2         7  . . . . Weak Alternatives
4.5           8  . . Recursively Defined Expressions
4.6          10  . . References
5            11  Input Conventions
6            13  The Pattern-Matching Process
6.1          13  . . Macro-components of Regular Expressions
6.2          14  . . A Model of the Matching Process
6.3          19  . . Labelling Substrings
7            21  Some Examples
7.1          21  . . Alternative Matches
7.2          24  . . Recursive Patterns
8            30  String versus Line Mode
9            32  Subroutines
9            33  . . rx_$get_escape_char
9            34  . . rx_$set_escape_char
9            36  . . rx_$translate
9            38  . . rx_$translate_old
9            40  . . rx_$execute
9            42  . . rx_$release
9            43  . . rx_$search
9            45  . . rx_$substitute
9            48  . . rx_$get_newline_count
9            49  . . rx_$sub_error_handler
10           51  Commands and Active Functions
10           52  . . rx_escape_char
10           53  . . rx
10           54  . . . . change
10           56  . . . . change_all
10           57  . . . . change_some
10           57  . . . . count
10           58  . . . . exclude
10           59  . . . . exclude_index
10           59  . . . . include
10           60  . . . . include_index
10           60  . . . . match_count
10           61  . . . . test


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

11           62  Include Files
11.1         62  . . rx_execute_ctl.incl.pl1
11.2         64  . . rx_match_info.incl.pl1
11.3         67  . . rx_error_info.incl.pl1
12           68  Comparative Timings
12.1         68  . . The Test Environment
12.1.1       69  . . . . Test 1
12.1.2       69  . . . . Test 2
12.1.3       69  . . . . Test 3
12.1.4       70  . . . . Test 4
12.1.5       70  . . . . Test 5
12.2         71  . . General Commentary on the Tests
12.3         72  . . Comparison Against A Compiled Program
13           73  Miscellaneous Topics
13.1         73  . . Use of National Characters
13.2         73  . . Installing it in the System
13.3         74  . . Starname Processing
13.4         74  . . Efficiency Considerations
13.4.1       74  . . . . Translator Optimizations
13.4.2       76  . . . . Sequences That Work Against Efficiency
13.5         77  . . Implementation Restrictions


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

2.  Introduction

String  pattern-matching in  PL/I and  other high-level languages
can be a chore at best.  Although  there is a great deal of power
available to handle the characters themselves, the implementation
of a  matching strategy given an  arbitrary pattern specification
is not straightforward.  Nonetheless, the efficient interrogation
of text  data is a  necessity in many  data-base applications, in
word-processing, and in command parsing.

This  document  is an(other)  attempt  to alleviate  some  of the
character  handling   burden  by  providing   a  pattern-matching
facility  employing  regular   expressions.   Unlike  the  common
Multics  line  editors,  however,  all  characters  appearing  in
regular expressions  are treated literally  except those preceded
by a  user-specified escape character.  It  is conceded that this
does   not  provide   a  smooth   transition  from   the  present
"search_file_" interface.   For that reason, a  special entry has
been included to provide for the same kind of pattern definitions
as are presently available.

These routines are intended to  provide a reasonably complete set
of   pattern  matching   capabilities  upon   which  other,  more
specialized or more specific, routines may be built.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

3.  Definition of Regular Expressions

A  regular expression(1)  is a pattern  which specifies  a set of |
character strings; it is said  to match certain strings.  Regular
expressions  are used  in searching  for text  strings satisfying
some  condition.  Regular  expressions are  defined rigorously as
follows.

    a)  An  ordinary  character  is  a  regular  expression which
        matches that character.

    b)  "^"  is a  regular expression which  matches the null(2)
        character before the first character of a string.

    c)  "$"  is  a  regular  expression which  matches  the null
        character after the last character of a string.

    d)  "."  is a regular  expression which matches any non-null
        character.

    e)  "[<string>]" is a regular  expression which matches any
        one of the characters in  the <string> and no others.  It
        is illegal to include  other regular expression sequences
        (like  "."   for  example)   within  the  characters  of
        <string>.

    f)  "[^<string>]"  is a  regular expression  which matches
        any one character but the characters of <string>.

________________________________________

(1) Strictly  speaking, the  definition describes  something more
    powerful  than a  recognizer for  a regular  grammar can cope
    with.  The use  of the term "regular expression"  here is due
    to historical antecedents.

(2) The  "null"  character referred  to  in this  context  is the
    character of length zero and  not the ASCII character at 0/0.
    In this document, the ASCII  character will be shown as <NUL>
    and  the   word  "null"  will  refer   to  empty  strings  or
    characters.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

    g)  A  regular  expression  followed  by  "*"  is  a regular
        expression  which matches  the largest  number (including
        zero) of adjacent occurrences of  the text matched by the
        regular expression.

    h)  A regular  expression followed by "<m:n>"  is a regular |
        expression which  matches at least  m but no  more than n |
        adjacent occurrences  of the text matched  by the regular |
        expression.   This  regular  expression  matches  as many |
        adjacent  occurrences as  it can.  The  semantics of this |
        bounded count matching is discussed below.                |

    i)  Two   adjacent   regular  expressions   form   a  regular
        expression which matches adjacent  occurrences of the the
        text matched by each of the regular expressions.

    j)  Two regular expressions separated  by "|" or "||" form
        a  regular expression  which matches the  text matched by
        either  of  the  regular expressions.   The  semantics of
        alternatives is discussed below.

    k)  A  regular  expression enclosed  by  "(" and  ")"  is a
        regular  expression which  matches the  same text  as the
        original  regular expression.   Meta-parentheses are used
        to  alter the  order of  evaluation implied  by "g", "h", |
        "i", and "j"; for  example, "a(b|c)d" will match "abd"
        or "acd", while "ab|cd" matches "ab" or "cd".

    l)  If "<regexp>" is a regular expression, "{<regexp>}A" is
        a  regular  expression,  where  A  is  any  ASCII graphic
        character.   This  regular  expression  matches  the same
        strings  as <regexp>;  it has certain  side effects which
        allow references to sub-portions of the string matched by
        the regular expression.

    m)  "&"  is a  regular expression  which recursively matches
        the  regular   expression  in  which   it  is  contained.
        Left-recursive  regular  expressions may  be  defined via
        this  route.  Care  must be  taken in  their use, however
        Their  effectiveness  depends  on the  actual  strings to
        which they applied.  Infinite loops are possible and will
        be detected and reported as errors at run-time.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

    n)  Nothing else is a regular expression.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

4.  Commentary

In  most  of the  examples  given throughout  this  document, the
meta-characters  which  make  up  regular  expressions  are shown
beginning  with the  backslash character,  "".  This  is not the
only representation.  The user is free to choose any character as
the  "lead-in"  or  "escape"  for  meta-characters.   See  "Input
Conventions" for more on representation.

4.1.  The Null String

According to  definition "g" above, a  regular expression has the
potential  for  matching  strings   of  zero  length.   That  the
expression "A*" should match "" is not surprising in itself.  It
seems most reasonable.

However, what should  "A*" match in the string  "BBBB"?  That is
to  say, how  many null strings  are there in  the string "BBBB"?
Certainly,  there  are any  number of  answers to  this question;
however,  for purposes  of this document  and implementation, the
regular expression "A*" matches the  single occurrence of a null
string which occurs (in this example)

     -- BEFORE the first character of the string

     -- BETWEEN each pair of characters in the string

     -- AFTER the last character of the string

Thus, if we were to substitute a "-" for each occurrence of "A*"
in "BBBB", the resulting string would be "-B-B-B-B-".
                                                                  |
                                                                  |
4.2.  Bounded Iteration                                           |
                                                                  |
Definition "h" above gives a way to explicitly control the number |
of  times  a particular  regular  expression match  is attempted. |
Both m and  n must be unsigned integers,  and the relation m <= n |
be true.
                                                                  |
                                                                  |
For ease of use, the following defaults are applied:              |
                                                                  |
     -- if m is omitted, its value is taken to be zero.           |
                                                                  |
     -- if n is  omitted, its value is taken  to be (effectively) |
        infinity.                                                 |


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

The   following   is  a   list   of  various   formats   and  the |
interpretations of each one:                                      |
                                                                  |
     A<1:5>            match any string containing at least one |
                         but no more than five A's in a row       |
                                                                  |
     B<:10>            matches any string of up to ten B's in a |
                         row (including zero)                     |
                                                                  |
     C<5:>             matches five or more C's in a row        |
                                                                  |
     D<8>              matches exactly eight D's                |
                                                                  |
     E<:>              is equivalent to "E*"                   |
                                                                  |
     F<>               is syntactically ambiguous and therefore |
                         will cause an error at compile time      |

4.3.  Operator Precedence & Associativity

The operators of the regular expression are listed below in order
of decreasing precedence (binding power).

    "*" and "<m:n>"    indefinite or bounded repetition        |

     concatenation        implied by juxtaposition of expressions

      "|" and "||"     pattern alternation

Furthermore, operators  of equal precedence  are left-associative
so that the  regular expression A|B||C is parsed  as if it had
been written (A|B)||C.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

4.4.  Strong vs Weak Alternatives

The  use of  "|" and  "||" in  regular expressions  allows the
implementation  of  alternative   regular  expressions  during  a
pattern-match attempt.

4.4.1.  Strong Alternatives

In matching the expression defined by "|", an attempt is made to
match  the regular  expression on the  left of the  "|".  If the
attempt succeeds,  the next match attempted  is the one following
the entire  pattern defined by  "|".  However, note  is taken of
the fact  that there is  an untried pattern  to the right  of the
"|".   Should  a regular  expression  later fail  to  match, the
"state of the match" will be restored  to what it was at the time
the "|" was  detected, and an attempt will be  made to match the
regular expression to the right of  the "|".  (Of course, if the
expression to the  left of the "|" fails,  the expression to the
right will be tried immediately).

This sequence of "remembering" and retrying will continue as long
as untried alternatives exist.

4.4.2.  Weak Alternatives

The  "||"  functions  somewhat differently.   It  still permits
alternative match  attempts, but once the  expression to the left
of the  "||" succeeds, any  untried expression to  the right of
the "||"  is disregarded.  Thus,  "||" acts locally  like the
SNOBOL4 "fence".

Another way to  think about weak alternatives is  that they are a |
way  of expressing  preferences for  the matches  which should be |
tried  first.  And  furthermore, the  preference specification is |
such that, once a particular match is made, none of the remaining |
options will be exercised.                                        |


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

As an example,  suppose we wish to parse  a FORTRAN-like "common" |
statement  which  consists of  either  the word  "common"  or its |
abbreviation "com",  followed by a variable  name.  An expression |
of the form:                                                      |

               (common||com) *[ul][a]*               |

will properly  match a statement  like "common X" or  "comY".  It |
will  not,  however,  parse  the  statement  consisting  only  of |
"common" as placing the variable "mon" in the common area.  There |
is no way to prevent this latter interpretation using only strong |
alternation.                                                      |

To  correctly interpret  the statement "common2"  as referring to |
the variable  "mon2" (variable names  must begin with  a letter), |
the expression should read                                        |

         (common *[ul]||com *[ul])[a]*          |


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

4.5.  Recursively Defined Expressions

The  use of  the "&"  allows regular  expressions to  be defined
recursively.   Its  most  common  use is  in  the  formulation of
patterns which are "balanced" in the SNOBOL4 sense of "BAL()".

When a "&" is encountered in the execution of a compiled regular
expression,  the  state  of the  match  is saved  and  the entire
expression is re-invoked on that  part of the target string which
remains  as  yet unmatched.   Of  course, this  re-invocation may
itself  cause  another state-save  and another  invocation.  This
recursive  "calling" may  occur an  arbitrary number  of times(1)
depending on the data in the target string.

At  some point,  a given invocation  will either  succeed or fail
without doing any  more calls.  If it fails,  any alternatives in
its caller  are tried just as  in the processing of  "|".  If it
succeeds, it updates the state of  the match on the target string
to reflect  the characters that  it matched.  It  then returns to
its "caller", who continues its  own match attempt from the newly
updated string positions.

As an example of this process, the strings matched by the regular
expression

     <&>|[abcdef][abcdef]*

in the following target string are underlined.

     <<<bad>> + <deaf>> + dead

One  assumption which  is made by  the run-time  routines is that
each recursively invoked attempt to  continue the match will make
some  amount of  progress or "fail".   If an  invocation does not
match any characters before  invoking another level of recursion,
the  run-time support  library will take  this as  evidence of an
error     condition     and     return     the     status    code
error_table_$fatal_error.

________________________________________

(1) The  recursion  level is  also constrained  by the  amount of
    scratch storage available for saved match states.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

4.6.  References

Further information  on regular expressions  and pattern-matching
are presented in the following articles.

     "A Theory  of Discrete Patterns and  their Implementation in
     SNOBOL4"
     J.  F.  Gimpel
     Communications of the ACM
     Vol.  16, No.  2 (Feb 1973), pp.91-100

     "Regular Expression Search Algorithm"
     K.  Thompson
     Communications of the ACM
     Vol.  11, No.  6 (June 1968), pp.419-422

     "Derivatives of Regular Expressions"
     J.  A.  Brzozowski
     Journal of the ACM
     Vol.  11, No.  4 (Oct 1964), pp.481-494

     Formal Languages and their Relation to Automata
     J.  Hopcroft and J.  Ullman
     Addison-Wesley Publishing Co.
     Reading Mass., 1969

     U.S.  Patent #3,568,156
     Issued to K.  Thompson, 1971

     "QED Text Editor"
     B.  W.  Kernighan, D.  M.  Ritchie, K.  L.  Thompson
     Computing Science Technical Report #5
     Bell Telephone Laboratories
     Murray Hill, N.  J.
     May 1972

     The SNOBOL4 Programming Language
     R.  E.  Griswold, J.  F.  Poage, I.  P.  Polansky
     Prentice-Hall
     Englewood Cliff, N.  J.
     1971


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

5.  Input Conventions

Escape sequences are used for three purposes:

     1) they are used to add special meanings to characters (i.e.
        make  them into  "meta-characters") so  that they  can be
        dealt with in a reasonably direct manner.

     2) they provide  a concise means  of representing frequently
        occurring character sequences  thus improving readability
        and decreasing programming error.

     3) they  are  used  to   represent  the  occurrence  of  the
        character  which   is  the  "escape"   character  itself.
        Instances  of  the  escape character  are  represented by
        "doubling" according to conventional practice.  If "" is
        the   escape   character,   a   backslash   character  is
        represented  as  a  "normal" character  by  the sequence,
        "\".

The  following  sequences  are  allowed  as  an  aid  in  program
maintenance and  readability.  These sequences  are recognized in |
either  case,  i.e.  both  "n" and  "N" are  taken to  mean the |
newline  character.   They  are  listed in  upper-case  below for |
brevity.                                                          |

     Char      Interpretation                                     |

       A       the set of ASCII letters and digits                |

       B       the ASCII backspace character                      |

       C       the  set  of ASCII  control characters  <NUL> thru |
               <US> and <DEL>                                     |

       D       the ASCII digits "0" thru "9"                      |

       E       the current escape character                       |

       F       the ASCII form-feed character                      |

       G       the    set    of    ASCII    graphic    characters |
               exclamation-point thru tilde                       |

       H       the ASCII digits "0" thru "9", characters "a" thru |
               "f", and characters "A" thru "F"                   |


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

       L       the ASCII characters "a" thru "z"                  |

       N       the ASCII newline character                        |

       O       the ASCII digits "0" thru "7"                      |

       S       the ASCII space character                          |

       T       the ASCII tab character                            |

       U       the ASCII characters A thru Z                      |

       V       the set of valid ASCII characters                  |

       W       the  ASCII  characters:  horizontal  tab, vertical |
               tab, form-feed, and space                          |

In addition, an occurrence of the escape character followed by up |
to three octal digits is interpreted as the character whose octal |
is  equal to  that specified.   Thus, assuming  that "?"   is the |
escape  character,  "?12",  "?n",  and "?012"  all  represent the |
newline  character;  and  "?101"  and "A"  are  equivalent.  This |
sequence may  also be used  to represent characters  "outside" of |
ASCII, e.g.  "?776".                                              |

NOTE:  All sequences employing the  escape character as a lead-in
character except those described in this document will be treated
as illegal.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

6.  The Pattern-Matching Process

As with many things, reference  material is not enough to provide
an  understanding  of  the  object  defined.   Descriptions which
suggest   appropriate  models   and  give   examples  are  highly
desirable.  This section attempts to provide an operative, though
slightly inaccurate, paradigm of  pattern-matching and gives some
examples with discussion.

6.1.  Macro-components of Regular Expressions

One of  the most important  cognitive steps in  understanding the
meaning of a regular expression  is the ability to "clump" groups
of  characters  into semantically  more meaningful  pieces.  This
takes  practice, but  is of great  utility.  It  is equivalent to
becoming comfortable  enough with Multics  starname processing to
read

     dl  bound_??*_.s.archive

as a request  to delete "all source archive  files in the working
directory whose names are bound_<something-or-other>_".

In  this  vein, the  following  table gives  some  simple regular
expressions along  with a suggested  way of thinking  about them.
Note  that  in  each case,  the  expression given  has  many more
components  (according   to  the  formal   definition)  than  are
described by the larger view.

Expression             Interpretation

^cat                  the  word  "cat"  at  the  beginning  of a
                       string

fox...*            the  word "fox"  followed by  at least two
                       characters(1)

^$                   a null string (contains no characters)

^.*$               all the characters of any string, even the
                       null one

________________________________________

(1) The word, "fox", followed by two characters, followed by zero
    or more characters implies at  least two characters after the
    word.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

declare|dcl           the start of a PL/I data declaration(1)

ca(n|b|t|re)s     any of the words:  "cans", "cabs", "cats",
                       or "cares"

ca[brn]s             any  of  the  words:   "cabs",  "cars", or
                       "cans"(2)

$[d][d]*        an integral dollar amount

"[01]*"b            a  PL/I  bit-string  of  arbitrary  length
                       (including ""b)

6.2.  A Model of the Matching Process

Once  we have  decomposed a  regular expression  string into some
reasonably high-level clumps,  talking about the pattern-matching
process becomes easier.  One way  to visualize what happens is to
watch the interaction among three items:

     1. an  ordered  list of  the  clumps making  up  the regular
        expression,

     2. a string  of characters which  is to be  searched for the
        desired pattern,

     3. a cursor indicating the "current" position in the string.

For  illustration purposes,  a cursor  will be  shown as pointing
between  the characters  of a string.   In special  cases, it can
also point  before the first  or after the last  character of the
string.  The sample  strings to be searched in  this section will
not contain blanks  in them, so we will depict  the position of a
cursor within a string schematically as

                   string:     ABCDEF GHIJKLMN
                                     A|
                   cursor:           |        

________________________________________

(1) Catenation of  expressions (such as "d",  "c", and "l") binds
    more  tightly  than alternation.   Hence, No  parentheses are
    necessary.

(2) Note  that tests  for membership  in a  set of  characters is
    often   more  readable   than  a   list  of  single-character
    alternatives.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

In this case, the cursor is said to be between "F" and "G", after
the "F", or  before the "G" depending on  which context makes the
most sense.

Once  we have  the ideas  of "clumps"  and "cursors", determining |
whether a given regular expression  can match a pattern somewhere
in a specified string is a simple process.  We must ask, for each
"clump" in the list, the following question:

    "Does the pattern I am  looking for occur immediately to
     the right of the cursor?"

Each  time the  answer to  the question  is "yes",  the cursor is
advanced  to the  right by  the appropriate  number of characters
Then, the next  clump in the list gets to  ask the same question.
When all  the clumps have,  in turn, answered  "yes", the regular
expression  has  matched  some   portion  (perhaps  all)  of  the
specified string.

This  simplified picture  ignores issues of  bookkeeping, such as
where the cursor  starts, how it is advanced,  whether it must be
remembered, and where  it is left when things  are done.  It also
does not touch  on how alternatives are handled,  or how patterns
like ".*" know how many  characters to match.  Those items will
be covered now.

Before the first clump takes its first look at the target string,
the  cursor  is  positioned before  (to  the left  of)  the first
character  of  the string.   After all  the clumps  have answered
"yes" and some portion of the target string has been matched, the
position of the  cursor is examined.  If the  cursor is not after
the last character of the string,  the list of clumps is asked to
try  and  match the  remaining  (as yet  unexamined) part  of the
string to the right of the cursor.

This approach has two main  consequences.  First, a given regular
expression  may match  several times at  several different places
within  the  target string.   Second,  if there  are two  or more
matches, none of them overlap with any other.(1)

________________________________________

(1) There  are a  few special  cases involving  "^" and  "$" in
    which this is not true.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

We have  not discussed what  happens when some clump  in the list
cannot find its pattern in the string to the right of the cursor.
In the  simplest case (a regular  expression that doesn't contain
"*", or  flavors of "|"),  the obvious happens.   The cursor is
restored to the position it had  before the first clump was asked
to decide.(1)  If  this point is after the  last character of the
string, the  regular expression gives up  and reports any matches
it might have already made.  Otherwise, the cursor is advanced to
the right  one position and the  clumps in the list  are asked to
decide again.

In the  more complicated case where  there are alternatives ("|"
or  "||")  in  a  regular  expression  or  where  an indefinite
repetition  ("*") occurs,  the actions  taken are  slightly more
complicated.  In each of these cases, the clump "remembers" where
it was when it found one of  the patterns it was supposed to look
for.   If one  of the  later clumps should  fail to  find its own
pattern, rather  than advancing the  cursor and starting  over at
the  head of  the clump  list, the most  recent clump  that has a
remembered choice is given a chance to try again.

In the case of an alternative, the clump tries to match the other
pattern.  In the  case of an indefinite repetition,  the match is
shortened by one group.  Then, the next clump after this point of
recovery is asked to try again.

In an obvious fashion, the  opportunities to "try again" provided
by  this  process  are  nested.  Failure  at  one  level  is only
reported back if  all the possibilities at this  and lower levels
have  been  exhausted.   This  allows  all  possibilities  to  be
systematically and reliably tried.

________________________________________

(1) This  is  not  necessarily   the  beginning  of  the  string.
    Successful matches  may have been  made in the  target string
    prior to this attempt.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

As an example, the following table  shows how this process may be
visualized.   The  regular  expression  which  will  be  used  to
illustrate the procedure is

                           or.*ten$

which  can  be divided  into  the clumps  given in  the following
table.

    Clump             Interpretation

    or                the word, "or"

    .*              some number of characters, maybe none

    ten$             The word, "ten", at the end of the string

Given   that  regular   expression  and   those  clumps   as  its
interpretation, the following table shows the target string (with
cursor), the "active"  clump, and a short explanation  of what is |
happening.

    Step  Target_String    Clump   Commentary

      1.   foreshorten     or      "or"   doesn't   match   here.
          A|                        Since this is the first clump,
          |                        advance  the  cursor  and  try
                                   again.

      2.  f oreshorten     or      There   is   a   match   here.
           A|                       Advance  the   cursor  by  the
           |                       length  of  the  pattern.  Get
                                   the next clump.

      3.  for eshorten     .*    This   matches   the   longest
             A|                     string it can.   It gobbles up
             |                     the  remainder of  the string,
                                   but remembers  there are other
                                   choices.  The  next clump gets
                                   control.

      4.  foreshorten      ten$   There  is no  possibility of a
                     A|             match.  Fall back.
                     |


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

      5.  foreshorte n     .*    The indefinite  repeat of step
                    A|              3 gives up a character, moving
                    |              the  cursor   back  one.   The
                                   match  attempt  restarts  with
                                   the next clump.

      6.  foreshorte n     ten$   Still no match.  Fail again.
                    A|
                    |

      7.  foreshort en     .*    Once  again,  a  character  is
                   A|               surrendered  from  the  series
                   |               grabbed  in  step 3.   And the
                                   next  clump  in the  list gets
                                   control.

      8.  foreshort en     ten$   There still aren't enough here
                   A|               to match.   This attempt fails
                   |               miserably.  But there is hope;
                                   some  characters  still remain
                                   from the match in step 3.

      9.  foreshor ten     .*    Another   character   whittled
                  A|                away in the name of progress.
                  |

     10.  foreshor ten     ten$   This one works!
                  A|
                  |

At the  completion of the  last step in the  preceding table, the
first and  third clumps have  matched copies of  themselves.  The
indefinite  repetition  has  matched  the  string  of characters,
"eshor".  Since the cursor will now be positioned to the right of
the  last character,  no further  attempts to  match this pattern
will be tried.

It should be noted once again, that  that this is only a model of
how pattern matching works.  In actuality, the compiler does many
optimizations to reduce the amount  of backtracking which must be
done.  However, the process shown above does provide a reasonably
accurate model.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

6.3.  Labelling Substrings

Those people  who are users  of the text editor  QEDX are already |
familiar with the concept of  a "label" for a regular expression.
In the context of the  "substitute" command, the occurrence of an
"&" on the right-hand side it  interpreted to mean a reference to
whatever string was matched by the regular expression on the left
side.  This is what allows the command

                           s/quick/&,/

to change the familiar sentence fragment

                  The quick brown fox jumped ...

into

                 The quick, brown fox jumped ...

Labelled  substrings  in  the  context of  this  facility  are an
extension  of  this interpretation.   They allow  references, not
only to the entire string matched,  but also to pieces of it.  In
the example of the preceding table, if the regular expression had
been written as

                         or{.*}Xten$

then "X" would name the sequence of characters "eshor".  Assuming
that QEDX supported this kind of facility, the command

                    s/or{.*}Xten$/rXed/                     |

would change "foreshorten" into the (dubious) word, "reshored".


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

The  "labelling" facility  listed here  allows up  to 94 labelled
substrings to be a part of each match; one for each ASCII graphic
character.   Nesting  is permitted  as in  "{{..*}X}Y" which
names  the  same substring  with  the tags  "X"  and "Y".   It is
illegal, however, to define a regular expression which causes the
same label character to be defined simultaneously to two separate
strings in the same match.  Thus,

                       {.....}X{.....}X

is  always  illegal (and  can,  in general,  only be  detected at
run-time), while

                     {[D]}X|{[L]}X

is correct usage (and will always work, in fact).


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

7.  Some Examples

This  section  contains  some  detailed examples  of  the pattern
matching  process.    While  they  do  not   illustrate  all  the
possibilities,  there are  enough here to  illustrate the general
methodology.

7.1.  Alternative Matches

This  example  demonstrates the  use  of alternatives  in pattern
matching.  The words  we are looking for are  "is" and "it".  The
pattern we will use for this is

                            i(s|t)

which is decomposed in the following way.

    Clump             Interpretation

    i                 the letter, "i"

    s|t              either  the letter  "s", or  the letter "t"
                      (note  that  the parenthesis  only indicate
                      the grouping and play no active role in the
                      match)

The  sentence  we will  apply  this pattern  to  is "This_is_it."
Underscores  have been  used so that  the cursor  position can be
shown unequivocally in the table.

    Step  Target_String    Clump   Commentary

      1.   This_is_it.     i       The character to  the right of
          A|                        the  cursor  does  not  match.
          |                        Since  there  are  more  left,
                                   advance  the  cursor  and  try
                                   again from the beginning.

      2.  T his_is_it.     i       The  same result  (and action)
           A|                       as in the previous step.
           |


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

      3.  Th is_is_it.     i       This  time,  the  character to
            A|                      the right is an "i".  Remember
            |                      this as the  start of a match.
                                   Advance  the   cursor  by  the
                                   length  of  the  pattern  (one
                                   character),  and try  the next
                                   clump.

      4.  Thi s_is_it.     s|t    Remember  this  as  a fallback
             A|                     point.  Try the alternative on
             |                     the left first.

      5.  Thi s_is_it.     s       The character to  the right is
             A|                     an "s".  Advance the cursor by
             |                     the length of the pattern (one
                                   character).  Since this is the
                                   end  of the  clumps, the match
                                   succeeds.  However,  there are
                                   characters left  in the string
                                   so restart the  match from the
                                   beginning   (and   forget  the
                                   saved fallback point).

      6.  This _is_it.     i       The  character  to  the  right
              A|                    doesn't  match.   Advance  the
              |                    cursor and start again.

      7.  This_ is_it.     i       The     character     matches.
               A|                   Remember  the   start  of  the
               |                   second  match in  this string.
                                   Advance  the   cursor  by  the
                                   length  matched,  and  try the
                                   second clump.

      8.  This_i s_it.     s|t    Remember  this  as  a fallback
                A|                  point.  Try the alternative on
                |                  the left first.

      9.  This_i s_it.     s       The character to  the right is
                A|                  an "s".  Advance the cursor by
                |                  the length of the pattern (one
                                   character).  Since this is the
                                   end of the  clumps, the second
                                   match    succeeds.    However,
                                   there  are characters  left in
                                   the  string   so  restart  the
                                   match from  the beginning (and
                                   forget   the   saved  fallback
                                   point again).


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

     10.  This_is _it.     i       Try the first clump again.  It
                 A|                 fails  to match  so we advance
                 |                 the cursor and start again.

     11.  This_is_ it.     i       This time  there is an  "i" to
                  A|                the   right  of   the  cursor.
                  |                Remember  the   start  of  the
                                   third     potential     match.
                                   Advance  the cursor  as before
                                   and move to the next clump.

     12.  This_is_i t.     s|t    Mark this cursor position as a
                   A|               fallback  point  and  try  the
                   |               clump on the left.

     13.  This_is_i t.     s       The character to  the right is
                   A|               not  an "s".   The match would
                   |               fail  if not  for the fallback
                                   point established  in step 12.
                                   In  this  case,  it  does  not
                                   involve  any  readjustment  of
                                   the cursor.

     14.  This_is_i t.     s|t    Having come back  to a restart
                   A|               position,   try  to   use  the
                   |               pattern  on   the  right  (but
                                   remember  that   there  is  no
                                   further recourse  if something
                                   fails after this).

     15.  This_is_i t.     t       This  pattern  also  succeeds.
                   A|               The  start  and  size  of  the
                   |               third   successful   match  is
                                   remembered.   The   cursor  is
                                   advanced  by  the size  of the
                                   match.     There    are   more
                                   characters  left.   The  great
                                   mandala continues.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

     16.  This_is_it .     i       The character to  the right of
                    A|              the cursor is not an "i".  The
                    |              cursor is  advanced.  However,
                                   because no more characters are
                                   left(1)  to  the right  of the
                                   cursor, no  further tries will
                                   be made.

The  results  returned  from   this  pattern-match  attempt  will
indicate  that  there were  three matches  in the  target string.
They occurred at positions 3, 6,  and 9.  All were two characters
long.

7.2.  Recursive Patterns

This   example    demonstrates   the   use    of   recursion   in
pattern-matching.  In this case, we will use a simplified form of
arithmetic expressions.   If we define  the arithmetic expression
as:

     1) a variable name, or

     2) a variable  name followed by  an operator followed  by an
        expression, or

     3) an expression enclosed in parentheses,

then the regular expression which matches this is given by

               [abcde][+-*/]&|(&)|[abcde]

where  the  variable names  have  been simplified  to  the single
characters chosen from  the first five letters of  the lower case
alphabet,  and the  operators to those  of addition, subtraction,
multiplication, and division.  This pattern is broken down as

    Clump             Interpretation

    [abcde]         a variable name

    [+-*/]          an arithmetic operator

________________________________________

(1) In  fact,  this  match  attempt  will  never  be  made.   The
    translator will recognize the fact that this pattern needs at
    least two characters to succeed  and pass this information to
    the  run-time.  At  this point,  there aren't  two characters
    left.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

    &                an occurrence of an expression

    (                 the character, "("

    &                an occurrence of an expression

    )                 the character, ")"

    [abcde]         a variable name

The alternation  clumps have been omitted  from the discussion as
previously.  As a test case, we will use the string

                            a+(b*(c))

which  is an  example of  most all  the items  we have informally
mentioned above.  To simplify  the following discussion, however,
the  alternations and  fallback points  will be  mentioned in the
commentary rather than being given explicitly as separate steps.

    Step  Target_String    Clump       Commentary

      1.   a+(b*(c))       [abcde]   There  is a  variable name
          A|                            to   the   right   of  the
          |                            cursor.    However,  there
                                       are       two      untried
                                       possibilities  to  use  if
                                       something    goes    wrong
                                       later.      Advance    the
                                       cursor,  and try  the next
                                       clump.

      2.  a +(b*(c))       [+-*/]    There  is  an  operator to
           A|                           the  right of  the cursor.
           |                           Advance  it by  the length
                                       of   the    pattern   (one
                                       character)  and   try  the
                                       next clump.

      3.  a+ (b*(c))       &          Remember  this  as  return
            A|                          point from  recursion.  Do
            |                          not  advance  the  cursor,
                                       but  "call"  ourselves  to
                                       start with the first clump
                                       again.

      4.  a+ (b*(c))       [abcde]   This  doesn't  match;  but
            A|                          there   are   alternatives
            |                          left to try.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

      5.  a+ (b*(c))       (           This  alternative matches.
            A|                          Advance the  cursor by the
            |                          length  of the  match (one
                                       character)     and    pass
                                       control to the next clump.
                                       Remember  that   there  is
                                       still one  other remaining
                                       alternative from step 3.

      6.  a+( b*(c))       &          This is  another recursive
             A|                         "call"  on  ourselves.  At
             |                         this    point,    we   are
                                       three-deep in invocations,
                                       the other  two having been
                                       started at steps 1 and 3.

      7.  a+( b*(c))       [abcde]   This one matches.  Advance
             A|                         the  cursor by  the length
             |                         of  the match  and try the
                                       next clump.  Remember that
                                       there   are    two   other
                                       alternative  patterns left
                                       untried   (besides   those
                                       pending from earlier).

      8.  a+(b *(c))       [+-*/]    This also  matches.  Looks
              A|                        like  we're on  a roll, so
              |                        advance  the  cursor again
                                       and continue with the next
                                       clump.

      9.  a+(b* (c))       &          Once   again,   we  "call"
               A|                       ourselves to  continue the
               |                       pattern.   This  will make
                                       the   fourth    level   of
                                       invocation.

     10.  a+(b* (c))       [abcde]   Nothing  here  like  that;
               A|                       maybe the next alternative
               |                       will have some luck.

     11.  a+(b* (c))       (           That   alternative   looks
               A|                       fine.  Step the cursor and
               |                       try the next clump.

     12.  a+(b*( c))       &          Another          recursive
                A|                      invocation.
                |


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

     13.  a+(b*( c))       [abcde]   This one matches  so we do
                A|                      the  customary  things  to
                |                      the  cursor  and  list  of
                                       clumps.    However,  there
                                       are  untried things  to do
                                       if we fail  to match later
                                       on (hint, hint).

     14.  a+(b*(c ))       [+-*/]    This  one  fails.  Restore
                 A|                     things to  where they were
                 |                     at  the point  of the last
                                       alternative   choice,  and
                                       retry  with  the  next one
                                       (if any) in line.

     15.  a+(b*( c))       (           This    is     the    next
                A|                      alternative  to  try (left
                |                      over   from    step   13).
                                       Unfortunately,   it   also
                                       fails.  So we try the last
                                       alternative  available  to
                                       us.  If that doesn't work,
                                       we  will   report  failure
                                       back  to the  last untried
                                       alternative  prior  to us,
                                       and so on.

     16.  a+(b*( c))       [abcde]   This   one   matches.   We
                A|                      advance the  cursor by the
                |                      length  of the  match (one
                                       character).       However,
                                       there  are no  clumps left
                                       in  the  list.   Since  we
                                       were called as a recursive
                                       invocation  from  step 13,
                                       we  return  control  there
                                       leaving  the   cursor  and
                                       match information alone.

     17.  a+(b*(c ))       )           This  matches   too.   The
                 A|                     cursor is  advanced.  This
                 |                     is the end of this portion
                                       of  the  pattern.  Control
                                       returns to the "caller" at
                                       step 9.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

     18.  a+(b*(c) )       <NONE>      There   is   no  successor
                  A|                    clump to  pass control to.
                  |                    The  pattern  which called
                                       us  from  step  9  was one
                                       which   ended    with   an
                                       invocation     of    "|&".
                                       Therefore,  the expression
                                       ends  at this  point.  The
                                       cursor is  not changed and
                                       this   "returns"   to  its
                                       caller at step 6.

     19.  a+(b*(c) )       )           This   is    the   pattern
                  A|                    following the "&" invoked
                  |                    by  step  6.   It matches,
                                       advances  the  cursor, and
                                       (since this is  the end of
                                       a  successful alternative)
                                       returns  to its  caller at
                                       step 3.

     20.  a+(b*(c))        <NONE>      The same  sequence happens
                   A|                   here that happened in step
                   |                   18.   The  alternative has
                                       successfully matched.  The
                                       only  difference  is  that
                                       the "caller"  was actually
                                       the    run-time    support
                                       software.    It  initiated
                                       this  match  attempt  with
                                       step  1.   Since  all  the
                                       clumps have  been used, we
                                       have   had   a  successful
                                       match.    Moreover,  there
                                       are no  further characters
                                       left in the string, so all
                                       processing is finished.

The results returned from this match will show a pattern detected
starting at position 1 which has a length of 9.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

There are  a few general  points to be noted  about the preceding
example and about recursion in general.

     1) If untried alternatives exist at  the end of a particular
        "invocation", they  will be discarded once  that level of
        recursion returns to  its "caller".  Untried alternatives
        from previous levels are still available, however.(1)

     2) Recursive expressions are easy to write, and even natural
        for  patterns  involving  nested  constructs.   They  can
        easily go awry if the  writer does not understand clearly
        how the matches will be made.

     3) The   use   of  labelled   substrings   within  recursive
        expressions will  almost always cause  the same character
        to  be  associated  with  more than  one  substring  in a
        pattern.   This is  an error  as far  as the  run-time is
        concerned.

     4) Because  of the  way recursion  works, we  cannot write a
        pattern to match <expr> <op> <expr> as "&[+-*/]&" The
        reason is  that the initial "&"  causes a left-recursion
        loop which will be detected  by the run-time and reported
        as an error.

________________________________________

(1) Untried  alternatives  mimic  the  action  of  PL/I condition
    handlers with respect to recursive procedures.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

8.  String versus Line Mode                                       |
                                                                  |
The discussion  up until now  has presumed that the  string to be |
matched was viewed as an  unbroken series of characters.  This is |
called "string" mode.   However, it is possible to  view a string |
to  be examined  as a  series of  "lines", that  is, sequences of |
characters delimted by ASCII  newline characters.(1)  By changing |
the way the  sequence of characters is viewed,  rapid searches of |
textual data are possible.                                        |
                                                                  |
                                                                  |
This  other  ("line")  model  of character  sequences  means that |
certain  regular  expressions  will have  to  perform differently |
depending on whether the character sequence is viewed as a single |
string, or a series of  lines.  The following discussion explains |
the differences.                                                  |
                                                                  |
                                                                  |
Item        String Mode                  Line Mode
                                                                  |
Ordinary    matches a newline if the     ditto
Character   ordinary       character
            itself is a newline
                                                                  |
.          matches   any  character     matches   any  character
            including a newline          except a newline
                                                                  |
[  ]      matches a newline if the     ditto
            newline  is  included in
            the  set  of  characters
            within the brackets
                                                                  |
[^  ]    matches  a newline  if a     never matches a newline
            newline is  not included
            in the set of characters
            within the brackets
                                                                  |
^          matches before the first     matches before the first
            character of the string      character of the string,
                                         or   immediately  before
                                         any  character  preceded
                                         by a newline
                                                                  |

________________________________________

(1) The entire body of a Multics mail text is an example of this.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

$          matches  after  the last     matches  after  the last
            character of a string        character  of  a string,
                                         or  before a  newline in
                                         the middle  or end(1) of
                                         the string

________________________________________

(1) The newline is stepped over in  this case, but is not counted
    as being in the part of the string that is matched.  Newlines
    may  only  be  included  in  matches  by  including  them  as
    "ordinary characters" in a pattern definition.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

9.  Subroutines

The following  pages describe subroutines  available to translate
strings  which  define regular  expressions into  Multics machine
code, and apply the patterns they define to "target" strings.     *

The  include  files  used  as  part  of  the  regular  expression
processing are described in a separate part of this document.

All  routines  described in  this  section are  interruptable and
re-entrant.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1
___________________                           ___________________

rx_$get_escape_char                           rx_$get_escape_char
___________________                           ___________________

NAME:  rx_$get_escape_char

This entry  returns the present  value of the  "escape" character
for regular expressions.

USAGE

declare rx_$get_escape_char entry (char(1) varying);

call rx_$get_escape_char (escape_char);

ARGUMENTS

escape_char
     is  the present  value of  the process-local  default escape
     character.  (Output)

NOTES:

If no default  escape character has been established  at the time
of  the call  to this entry  point, a zero-length  string will be
returned.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines
___________________                           ___________________

rx_$set_escape_char                           rx_$set_escape_char
___________________                           ___________________

NAME:  rx_$set_escape_char

This  entry  defines  the character  to  be used  as  the default
"escape"  character  when  one  is  not  supplied  in  a  call to
rx_$translate.  It also returns the value of the escape character
which was set prior to calling this entry.

USAGE

declare rx_$set_escape_char entry (char(1)    varying,    char(1)
                                   varying);

call rx_$set_escape_char (new_escape_char, old_escape_char);

ARGUMENTS

new_escape_char
     is  the  new  value  for  the  process-local  default escape
     character.  (Input)

old_escape_char
     is the  value of the process-local  default escape character
     before the call to this entry.  (Output)

NOTES:

The  present value  of the escape  character may  be cancelled by
setting new_escape_char to the null string.  All subsequent calls
to rx_$translate must then explicitly provide an escape character
if they wish regular expression meta-characters to be recognized.

If no default  escape character has been established  at the time
of the  call to this  entry point (or  it has been  cancelled), a
zero-length string will be returned.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1
___________________                           ___________________

rx_$set_escape_char                           rx_$set_escape_char
___________________                           ___________________

The user should choose a value  for the escape character which is
not  likely  to  be  typed in  the  normal  course  of specifying
patterns  to avoid  unexpected results.   Choosing a  period, for
example, is  probably a "bad"  idea because it  occurs frequently
and  using it  as the escape  character also prevents  its use in
matching an arbitrary character.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines
_____________                                       _____________

rx_$translate                                       rx_$translate
_____________                                       _____________

NAME:  rx_$translate

This subroutine  translates a regular  expression defining string
into an  internal form.  This  internal form can  subsequently be
applied to other strings by rx_$execute to determine whether they
"match" the pattern defined by the regular expression.

USAGE

declare rx_$translate entry (char(*), char(1) varying, ptr, fixed |
                             bin(35));                            |

call rx_$translate (rx_string, escape_char, form_p, code);        |

ARGUMENTS

rx_string
     is  the  string  defining  the  regular  expression pattern.
     (Input)

escape_char
     is the character which should be considered to be the escape
     or    "lead-in"    character    for    regular    expression
     meta-characters.  (Input)
                                                                  *
form_p
     is  a  pointer  to  the   translated  form  of  the  regular
     expression.  (Output)

code
     is a standard status code.  (Output)

NOTES:

If  escape_char  is  passed  in  as  a  zero-length  string,  the
character established  as the "default" escape  character will be
used.   If no  character has  previously been  established as the
default escape  character, no regular  expression meta-characters
will be recognized.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1
_____________                                       _____________

rx_$translate                                       rx_$translate
_____________                                       _____________

Processing of  "doubled" escape characters  takes precedence over
recognition of  meta-characters.  For example, if  "."  is chosen
as the escape character, then  ".."  is interpreted as matching a
period, not any non-null character.

The  return value  "form_p" can be  considered a  "ticket" to the |
translated form of the regular  expression.  It will be needed in |
subsequent calls  to rx_$execute or rx_$search  to accomplish the |
actual searching of  strings.  It is valid only  for the duration |
of  the process.   It can be  voided during  execution by calling |
rx_$release to return the space it occupies.                      |

If the string  defining the regular expression to  be compiled is |
malformed, or if the translator encounters an internal error, the |
return     code     will     be     set     to     the     value, |
error_table_$translation_failed.  In addition, rx_$translate will |
signal sub_err_ passing additional information about the location |
of  the error,  if known.  This  information is  described by the |
include file, rx_error_info.                                      |


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines
_________________                               _________________

rx_$translate_old                               rx_$translate_old
_________________                               _________________

NAME:  rx_$translate_old

This  entry point  performs a  translation function  identical to
that  defined  by  rx_$translate,  and  returns  the  same result
information.   The  difference  lies  in  the  syntax  of regular
expressions.   This  entry  point  accepts  the  form  of regular
expression defined for the "qedx" command.(1)  It also recognizes
the "c"  convention to strip  regular expression meta-characters
of their special meanings.

USAGE

declare rx_$translate_old entry (char(*), ptr, fixed bin(35));    |

call rx_$translate_old (rx_string, form_p, code);                 |

ARGUMENTS

rx_string
     is  the  string  defining  the  regular  expression pattern.
     (Input)
                                                                  *
form_p
     is  a  pointer  to  the   translated  form  of  the  regular
     expression.  (Output)

code
     is a standard status code.  (Output)

________________________________________

(1) See  manual #AG92,  Multics Programmer's  Manual, Command and
    Active Functions; "NOTES ON REGULAR EXPRESSIONS"


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1
_________________                               _________________

rx_$translate_old                               rx_$translate_old
_________________                               _________________

NOTES:

The  structure returned  as the  result of  translating the given
regular expression is  exactly the same as the  one documented by
rx_$translate.  There is no way that the run-time support routine
can know from what form the regular expression derived.
                                                                  *

This  entry is  being provided  to allow  the replacement  of the
interpreter presently included in the "search_file_" subroutine.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines
___________                                           ___________

rx_$execute                                           rx_$execute
___________                                           ___________

NAME:  rx_$execute

This  subroutine  applies  the  translated form  of  a previously
defined  regular  expression  pattern  to a  "target"  string and
returns information about any "matches" which occur.

USAGE

declare rx_$execute entry (ptr, char(*), ptr, ptr, fixed bin(35), |
                           fixed bin(35));                        |

call rx_$execute (form_p,  string,  control_info_p, match_info_p, |
                  match_count, code);                             |

ARGUMENTS

form_p
     is  a  pointer  to  the   translated  form  of  the  regular
     expression pattern.  (Input)

string
     is the string to be tested.  (Input)

control_info_p
     is a pointer to  a structure, rx_execute_ctl, describing the
     kind of match to be done and the results desired.  (Input)
                                                                  |
match_info_p                                                      |
     is  a  pointer  to  a  structure  describing  the  places in |
     "string"  where   the  regular  expression   represented  by |
     "form_p" matched.  This data  can be selectively returned to |
     the  user  under control  of  the options  expressed  in the |
     structure pointed to by "control_info_p".  (Input/Output)    |
                                                                  |
match_count                                                       |
     is the number of times the regular expression represented by |
     "form_p" matched the string given by "string".  (Output)     |

code
     is a standard status code.  (Output)


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1
___________                                           ___________

rx_$execute                                           rx_$execute
___________                                           ___________

NOTES:

The  input  argument,  form_p,  identifies  the  location  of the
translated regular expression.                                    *

The information specifying whether labels  are desired as part of
the returned info; whether one, all,  or a bounded set of matches
is desired; and similar information is contained in the structure
described by rx_execute_ctl.incl.pl1.

Upon completion  of the routine, information  about the places in |
the  string where  the pattern matched  is returned  to the user. |
This  information  is  contained  in the  segment  pointed  to by |
match_info_p and is described by  the structures contained in the |
include file, rx_match_info.incl.pl1 (q.v.).                      |

If  the  caller specifies  that  details on  the match  should be |
returned,  then a  temp segment  will be  allocated to  hold that |
information.   A  pointer  to  the segment  will  be  returned in |
match_info_p.  It  may be returned by  passing it to rx_$release. |
As  an  efficiency consideration,  however,  this segment  may be |
re-used  by passing  it to  subsequent calls  to rx_$execute thus |
saving the time of disposal and re-allocation.                    |
                                                                  |
                                                                  |
Errors  which  occur  during  the  attempted  application  of the |
translated  pattern  to the  target string  will be  reported via |
calls  to  sub_err_ so  that additional  explanatory text  can be |
given.                                                            |
                                                                  |
                                                                  |
All  non_zero  status  returns,  save   one,  from  the  call  to |
rx_$execute are fatal.   In the fatal cases, the  contents of the |
match  segment  are indeterminate.   A  returned status  value of |
error_table_$noalloc is indicative of  the single non-fatal case. |
It  signifies  that  the match  segment  was not  long  enough to |
contain all the match results.   In this case, the returned match |
data are consistent and properly threaded; and in most instances, |
application may begin again later on the substring after the last |
match described with no ill effect.                               |


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines
___________                                           ___________

rx_$release                                           rx_$release
___________                                           ___________

NAME:  rx_$release                                                |
                                                                  |
                                                                  |
This  subroutine  de-allocates the  storage obtained  for various |
return   arguments   from  the   rx_$translate   and  rx_$execute |
entrypoints.                                                      |
                                                                  |
                                                                  |
USAGE                                                             |
                                                                  |
                                                                  |
declare rx_$release entry (ptr, fixed bin(35));                   |
                                                                  |
call rx_$release (object_p, code);                                |
                                                                  |
                                                                  |
ARGUMENTS                                                         |
                                                                  |
                                                                  |
object_p                                                          |
     is  a pointer  to some object  returned by one  of the other |
     rx_$XXX entrypoints.  It can, for example, be the translated |
     regular   expression   form   returned   by   rx_$translate. |
     (Input/Output)                                               |
                                                                  |
code                                                              |
     is a standard status code.  (Output)                         |
                                                                  |
                                                                  |
NOTES:                                                            |
                                                                  |
                                                                  |
If  "object_p" is  a null pointer  upon entry,  then this routine |
will set the status code to zero and return.                      |
                                                                  |
                                                                  |
If "object_p" legitimately points at one of the items returned by |
an rx_$XXX  entrypoint, then the storage  occupied by that object |
will be de-allocated and "object_p" will be set to null.          |
                                                                  |
                                                                  |
In      all      other     cases,      the      error     status, |
error_table_$unimplemented_version, will be  returned and nothing |
will  be  done to  the input  pointer value  or the  locations it |
references.                                                       |


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1
__________                                             __________

rx_$search                                             rx_$search
__________                                             __________

NAME:  rx_$search                                                 |
                                                                  |
                                                                  |
This subroutine  searches for the  first occurrence of  a regular |
expression  in a  given string.   Either the  entire string  or a |
substring may be examined for the pattern.                        |
                                                                  |
                                                                  |
USAGE                                                             |
                                                                  |
                                                                  |
declare rx_$search entry (ptr,  char(*),   fixed  bin(21),  fixed |
                          bin(21), fixed  bin(21), fixed bin(21), |
                          fixed bin(35));                         |
                                                                  |
call rx_$search (form_p,    string    scan_start,    scan_length, |
                 match_start, match_length, code);                |
                                                                  |
                                                                  |
ARGUMENTS                                                         |
                                                                  |
                                                                  |
form_p                                                            |
     is  a  pointer  to  the   translated  form  of  the  regular |
     expression.  (Input)                                         |
                                                                  |
string                                                            |
     is the string to be searched.  (Input)                       |
                                                                  |
scan_start                                                        |
     is  the  position within  "string"  where searching  for the |
     pattern may commence.  (Input)                               |
                                                                  |
scan_length                                                       |
     is  the  number  of characters,  beginning  at "scan_start", |
     which  may be  examined for  the occurrence  of the pattern. |
     (Input)                                                      |
                                                                  |
match_start                                                       |
     is the index position relative  to the beginning of "string" |
     where the first match occurred.  (Output)                    |
                                                                  |
match_length                                                      |
     is  the  number  of  characters  in  "string",  beginning at |
     "match_start",  which  were included  in  the pattern-match. |
     (Output)                                                     |


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines
__________                                             __________

rx_$search                                             rx_$search
__________                                             __________

code                                                              |
     is a standard status code.  (Output)                         |
                                                                  |
                                                                  |
NOTES:                                                            |
                                                                  |
                                                                  |
If "scan_length"  is given as the  value -1, it is  taken to mean |
the rest of the string from "scan_start" to "length(string)".     |
                                                                  |
Other than using -1 for "scan_length", all substrings defined via |
the  use  of  "scan_start"   and  "scan_length"  must  be  wholly |
contained within "string" itself.                                 |
                                                                  |
                                                                  |
If the pattern  does not occur within the  string specified, both |
"match_start" and "match_length" will be set to zero.             |


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1
______________                                     ______________

rx_$substitute                                     rx_$substitute
______________                                     ______________

NAME:  rx_$substitute

This entry implements the algorithm  for modifying a string given
a  set  of  match  data derived  from  application  of  a regular
expression  to  the  string   and  a  change  specification.   In
subroutine  terms,  it performs  like  the qedx  "substitute" (s)
command.

USAGE

declare rx_$substitute entry (char(*),    ptr,   char(1) varying,
                              char(*),   char(*),  fixed bin(21),
                              fixed bin(35));

call rx_$substitute (input_string,   match_data_p,   escape_char,
                     change_spec,                  output_string,
                     output_string_used, code);

ARGUMENTS

input_string
     is the  string to which the  regular expression was applied.
     (Input)

match_data_p
     is  a pointer  to the  match data  structure, rx_match_info,
     returned from the call to rx_$execute.  (Input)

escape_char
     is the character  which can be used to  override the default
     escape  character  in  recognizing  references  to  labelled
     substrings in the match data.  (Input)

change_spec
     is a string  which describes what to substitute  for each of
     the matched portions of "input_string".  (Input)

output_string
     is the place  where the transformed string is  to be stored.
     (Output)


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines
______________                                     ______________

rx_$substitute                                     rx_$substitute
______________                                     ______________

output_string_used
     is a value that indicates how long the resultant string was.
     It  may   indicate  a  value  longer   than  the  length  of
     "output_string" (see NOTES below).  (Output)

code
     is a standard status code.  (Output)

NOTES:

The  algorithm  used to  affect the  transformation of  the input
string according to the change specifications is as follows:

A)   The  internal  escape  character  is  set  to  the  value of
     "escape_char".   If this  is a zero-length  string, then the
     default value returned by rx_$get_escape_char is used.

B)   The input string is broken down into a series of substrings.
     The  criteria for  the break  is that  each piece  is either
     wholly  described  by  one  of  the  elements  in  the match
     information (rx_match_data_template), or  does not intersect
     any of the matched parts.

C)   Once this has been done,  the substrings of the input string
     are  examined one  at a  time from  left to  right.  Each is
     processed according to the  rules described by the remainder
     of this section.

D)   If this substring  is not part of a  successful match, it is
     copied to  the output string  as is.  The  next substring to
     the right is then examined.

E)   Otherwise, each character of the change specification string
     is  examined in  turn.  If it  is not equal  to the internal
     escape character, the  change specification character itself
     is copied to  the output string.  And the  next character of
     the change specification is examined.

F)   If  the  current change  specification character  equals the |
     internal  escape  character,   the  following  character  is |
     checked  to  see  if it  a  second occurence  of  the escape |
     character.  If this  is so, a single instance  of the escape |
     character is copied to the output string.                    |


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1
______________                                     ______________

rx_$substitute                                     rx_$substitute
______________                                     ______________

G)   Otherwise,  if  the current  change  specification character |
     equals   the  internal   escape  character,   the  following |
     character is checked to see if it is a valid label character
     for regular expression sub-matches.

H)   If it is valid, a check is  made to see if a "label" by that
     name was defined during the  match.  If so, the substring of
     the  input  string matched  by  the label  is copied  to the
     output  string.   If not,  the  label character  is ignored;
     i.e.,  nothing is  copied to  the output  string.  In either
     case,   the   next   change   specification   character   is
     subsequently examined.

If both the  input escape character supplied to  this routine and
the  default escape  character obtained  from rx_$get_escape_char
are  zero-length  strings, no  labels will  be recognized  in the
change specification.  It is an error to end "change_spec_string"
with the escape character.

Under the rules outlined above,  a pair of escape characters will |
always cause  a single escape  character to appear  in the output |
string, even if there was  a labelled substring which was "named" |
by the escape character.                                          |

The return  value, "output_string_used", will  contain the length
of the transformed string in  all cases.  When "output_string" is
too small to contain the result,  "code" will be set to the value
error_table_$smallarg and "output_string_used" will have the size
needed to fully contain the  result.  Otherwise, the result is in
substr(output_string, 1, output_string_used).


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines
_____________________                       _____________________

rx_$get_newline_count                       rx_$get_newline_count
_____________________                       _____________________

NAME:  rx_$get_newline_count                                      |
                                                                  |
                                                                  |
This  functions  returns  the  maximum  number  of  lines  that a |
translated  regular  expression will  match  if it  applied  to a |
target string  in "line mode".  If the  expression would match an |
arbitrary number of lines, then  this function returns the value, |
-1.                                                               |
                                                                  |
                                                                  |
USAGE                                                             |
                                                                  |
                                                                  |
declare rx_$get_newline_count entry   (ptr)                       |
                              returns (fixed bin(21));            |
                                                                  |
nl_cnt = rx_$get_newline_count (form_p);                          |
                                                                  |
                                                                  |
ARGUMENTS                                                         |
                                                                  |
                                                                  |
form_p                                                            |
     is  a  pointer  to  the   translated  form  of  the  regular |
     expression pattern.  (Input)                                 |
                                                                  |
                                                                  |
NOTES:                                                            |
                                                                  |
                                                                  |
If the argument supplied is null,  or does not point to a version |
of  the  translated regular  expression,  then this  routine will |
signal the sub_error_ condition  by calling the sub_err_ routine. |
The text of the sub_error_info  structure will contain the reason |
for the error.   Upon return from the signal,  this function will |
return the value, -1.                                             |


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1
_____________________                       _____________________

rx_$sub_error_handler                       rx_$sub_error_handler
_____________________                       _____________________

NAME:  rx_$sub_error_handler                                      |
                                                                  |
                                                                  |
This  entry  provides  a  simple,  common  way  of  handling  the |
sub_error_ condition  raised when errors are  detected in various |
rx_  routines.   It provides  for obtaining  the most  recent rx_ |
error   frame,   interpreting  the   information   supplied,  and |
displaying it via standard output mechanisms.                     |
                                                                  |
                                                                  |
USAGE                                                             |
                                                                  |
                                                                  |
declare rx_$sub_error_handler entry (entry options(variable),     |
                                     char(*), fixed bin(35));     |
                                                                  |
call rx_$sub_error_handler (error_report_proc,      callers_name, |
                            code);                                |
                                                                  |
                                                                  |
ARGUMENTS                                                         |
                                                                  |
                                                                  |
error_report_proc                                                 |
     is  a procedure  which can  be called  to display  the error |
     information to the user (see NOTES below).  (Input)          |
                                                                  |
callers_name                                                      |
     is  the name  of the routine  on whose  behalf the condition |
     handling is taking place.  (Input)                           |
                                                                  |
code                                                              |
     is a standard status code.  (Output)                         |
                                                                  |
                                                                  |
NOTES:                                                            |
                                                                  |
                                                                  |
This entry  assumes that it  is being called  as the result  of a |
sub_error_  condition signalled  by an rx_  routine.  It searches |
backward on the stack for  the most recent, appropriate condition |
frame and interprets the information supplied.  It then calls the |
routines  given as  the value of  "error_report_proc" passing the |
interpreted error information.                                    |


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines
_____________________                       _____________________

rx_$sub_error_handler                       rx_$sub_error_handler
_____________________                       _____________________

This  routine assumes  that the "error_report_proc"  has the same |
calling    sequence    as     com_err_    and    its    siblings: |
com_err_$suppress_name,                          active_fnc_err_, |
active_fnc_err_$af_suppress_name.      This     routine     calls |
"error_report_proc"  passing it  the status code  supplied in the |
call to sub_err_,  the value given as "callers_name"  as the name |
of the reporting entity, and  ioa_ control strings and parameters |
describing the error condition.                                   |
                                                                  |
                                                                  |
The  return  status code  from this  entry indicates  whether the |
transmission of the error information to the user was successful. |
It      is      zero      if     it      was.       The     code, |
error_table_$action_not_performed, will be  returned if the frame |
could not be  located; and error_table_$unimplmented_version will |
be returned if the frame was  present but the "info_ptr" given to |
sub_err_ did not point to a valid instance of rx_error_info.  One |
other  source  of non-zero  return codes  is find_condition_info_ |
itself.                                                           |


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

10.  Commands and Active Functions

The  following   pages  describe  several   commands  and  active
functions built on the  regular expression subroutines documented
earlier.   They  allow manipulation  and testing  of command-line
arguments,  generalized selection,  and control  over the default
escape character.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines
______________                                     ______________

rx_escape_char                                     rx_escape_char
______________                                     ______________

SYNTAX AS A COMMAND:

   rx_escape_char {value}

SYNTAX AS AN ACTIVE FUNCTION:

   [rx_escape_char {value}]

FUNCTION:  sets/prints/returns  the default escape  character for
regular expression processing.

ARGUMENTS:

value
   gives  the  character  to  be  used  as  the  default  regular
   expression escape character.

NOTES:

Whether or not the first argument is supplied, the present escape
character   will  be   printed  (command)   or  returned  (active
function).   Unless "value"  is supplied, however,  no attempt to
modify the character saved as the default will be made.

If the string supplied for  "value" is longer than one character,
the  first  character  of  the argument  will  become  the escape
character.  In the case of command invocation, this instance will
be noted as a warning.

The default  escape character can  be cancelled by  supplying the
null string as the value of the argument.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1
__                                                             __

rx                                                             rx
__                                                             __

SYNTAX AS A COMMAND:

   rx opname regexp string1 ... stringN

SYNTAX AS AN ACTIVE FUNCTION:

   [rx opname regexp string1 ... stringN]

FUNCTION:  translates the regular  expression given as the second
argument  and,  depending  on  the  opname  given  as  the  first
argument,  "applies" the  translated expression  to the following
strings and returns a result.

ARGUMENTS:

opname
   designates a list of possible operations to perform on the the
   strings  given as  arguments.  Opnames  permitted, followed by
   their alternate forms where applicable are:

      change, chg
      change_all, chga
      change_some, chgs
      count, ct
      exclude, ex                                                 |
      exclude_index, exi                                          |
      include, in                                                 |
      include_index, ini                                          |
      match_count, mct
      test

   See  "List  of Operations"  below  for a  description  of each
   opname, its syntax, and  specific application.  Operations are
   arranged in the same order as above.

regexp
   is a string which defines a regular expression.  The syntax of |
   this  string  is the  syntax  governing input  strings  to the |
   entrypoint, rx_$translate.                                     |

stringi
   can  be  one or  more  arguments depending  on  the particular
   operation to be performed.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines
__                                                             __

rx                                                             rx
__                                                             __

LIST OF OPERATIONS:

Unless  otherwise  noted,  all   errors  detected  by  a  command
invocation  will  be  reported   via  com_err_;  active  function |
invocations   will    report   their   errors    via   calls   to
active_fnc_err_.

OPERATION:  change

SYNTAX AS A COMMAND

   rx change regexp change_spec string {strings}

SYNTAX AS AN ACTIVE FUNCTION

   [rx change regexp change_spec string {strings}]

FUNCTION:   applies the  regular expression  given as  the second
argument to its fourth  and following arguments.  Those arguments
which the regular expression matches are transformed according to
the  specifications  given as  the  third argument  and returned.
Arguments that the regular expression does not match are returned
unaltered.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1
__                                                             __

rx                                                             rx
__                                                             __

NOTES:

The  transformation  of arguments  takes  place according  to the
following  rules.(1)  For  purposes of this  example, the regular
expression  escape  character  is  assumed  to  be  the backslash
character, "".

     A)   each  occurrence of  "regexp" in "strN"  is replaced by
          "change_spec".

     B)   if a construct of  the form "{regexp2}X" is evaluated
          as  part  of  the  resulting  pattern-match,  then  any
          occurrence of  "X" in "change_spec" is  replaced by the
          string matched by "regexp2" during rule A above.

     C)   the character, "&", is predefined  to be a "label" for
          the characters matched by  "regexp".  Its occurrence in
          "change_spec" is treated as in rule B above.

     D)   preceding any character in "change_spec" by a "c" will
          cause  that  character  to be  copied  literally during
          replacement, thus  negating the effects of  rules B and
          C.

These  rules  imply that  the "change"  operation functions  in a |
manner  which is  similar to  the "substitute"  command(2) of the |
qedx  editor; regexp  plays the role  of the  left-hand side, and
change_spec  mimics the  right-hand side.   Each of  the argument
stringN's  are   analogues  of  lines  in   a  text  buffer.   If
"change_spec"  is a  null string,  any occurrence  of "regexp" in
each argument will be deleted.

________________________________________

(1) These are the same actions  as implemented by the subroutine,
    rx_$substitute.  In addition, if the default escape character
    is  not one  of the characters  "{", "}", or  "&", the user's
    regular expression  is augmented to  be preceded by  "{" and
    followed by "}&" before it  is translated.  This allows easy
    reference  to  the  entire   matched  string  by  the  change
    specification.

(2) The actions which take place  are similar.  The syntax of the
    regular  expression  is that  of  rx_$translate, not  that of
    rx_$translate_old (or qedx).

MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines
__                                                             __

rx                                                             rx
__                                                             __

As an example, the command line

     rx  change  "^f.{..*}X$"  X.fortran  [segs **]       |

will change all two-component names  whose first component is "f"
to  names  comprised  of  the  second  and  succeeding components
suffixed  by  ".fortran".   All  other names  are  returned asis.
Thus,  "f.foo.source"  becomes "foo.source.fortran";  but nothing
happens to "x.pl1".

OPERATION:  change_all

SYNTAX AS A COMMAND

   rx change_all regexp change_spec string {strings}

SYNTAX AS AN ACTIVE FUNCTION

   [rx change_all regexp change_spec string {strings}]

FUNCTION:   applies the  regular expression  given as  its second
argument to its fourth  and following arguments.  Those arguments
which the regular expression matches are transformed according to
the  specifications  given as  the  third argument  and returned.
This functions returns an error  diagnostic if any argument fails
to match the regular expression.

NOTES:

The  rules for  transforming arguments  are the  same as  for "rx
change" (q.v.).


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1
__                                                             __

rx                                                             rx
__                                                             __

OPERATION:  change_some

SYNTAX AS A COMMAND

   rx change_some regexp change_spec string {strings}

SYNTAX AS AN ACTIVE FUNCTION

   [rx change_some regexp change_spec string {strings}]

FUNCTION:   applies the  regular expression  given as  its second
argument to its fourth  and following arguments.  Those arguments
which the regular expression matches are transformed according to
the  specifications  given as  the  third argument  and returned.
Arguments that the regular expression does not match are returned
unaltered.   This  function  returns  an error  diagnostic  if no
argument can be matched by the regular expression.

NOTES:

The  rules for  transforming arguments  are the  same as  for "rx
change" (q.v.).


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines
__                                                             __

rx                                                             rx
__                                                             __

OPERATION:  count

SYNTAX AS A COMMAND

   rx count regexp string {strings}

SYNTAX AS AN ACTIVE FUNCTION

   [rx count regexp string {strings}]

FUNCTION:   applies the  regular expression  given as  its second
argument   to  its   third  and  subsequent   arguments.   On  an
argument-by-argument  basis,  it  returns   "1"  if  the  regular
expression  matched  at  least  once in  that  argument,  and "0"
otherwise.

OPERATION:  exclude

SYNTAX AS A COMMAND

   rx exclude regexp string {strings}

SYNTAX AS AN ACTIVE FUNCTION

   [rx exclude regexp string {strings}]

FUNCTION:   applies the  regular expression  given as  its second
argument to  its third and  subsequent arguments.  It  returns as
its result only those arguments  which do not "match" the regular
expression.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1
__                                                             __

rx                                                             rx
__                                                             __

OPERATION:  exclude_index

SYNTAX AS A COMMAND

   rx exclude_index regexp string {strings}

SYNTAX AS AN ACTIVE FUNCTION

   [rx exclude_index regexp string {strings}]

FUNCTION:   applies the  regular expression  given as  its second
argument to  its third and  subsequent arguments.  It  returns as
its result  the indices of  those arguments which  do not "match"
the  regular  expression.   The  first  string  in  the  list  is
considered to have an index value of 1.

OPERATION:  include

SYNTAX AS A COMMAND

   rx include regexp string {strings}

SYNTAX AS AN ACTIVE FUNCTION

   [rx include regexp string {strings}]

FUNCTION:   applies the  regular expression  given as  its second
argument to  its third and  subsequent arguments.  It  returns as
its  result  only  those  arguments  which  "match"  the  regular
expression.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines
__                                                             __

rx                                                             rx
__                                                             __

OPERATION:  include_index

SYNTAX AS A COMMAND

   rx include_index regexp string {strings}

SYNTAX AS AN ACTIVE FUNCTION

   [rx include_index regexp string {strings}]

FUNCTION:   applies the  regular expression  given as  its second
argument to  its third and  subsequent arguments.  It  returns as
its  result  the indices  of  those arguments  which  "match" the
regular expression.  The first string  to be tested is considered
to have an index value of 1.

OPERATION:  match_count

SYNTAX AS A COMMAND

   rx match_count regexp string {strings}

SYNTAX AS AN ACTIVE FUNCTION

   [rx match_count regexp string {strings}]

FUNCTION:   applies the  regular expression  given as  its second
argument to  its third and  subsequent arguments.  It  returns as
its result  the number of times  the regular expression "matched"
in each argument on an argument-by-argument basis.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1
__                                                             __

rx                                                             rx
__                                                             __

OPERATION:  test

SYNTAX AS A COMMAND

   rx test regexp string {strings}

SYNTAX AS AN ACTIVE FUNCTION

   [rx test regexp string {strings}]

FUNCTION:   applies  the regular  expression  given as  its first
second    to    all    its   subsequent    arguments.     On   an
argument-by-argument  basis,  it  returns "true"  if  the pattern
defined by the regular expression  appears in the given argument,
and "false" otherwise.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

11.  Include Files

This  section  contains the  declarations and/or  descriptions of
several include files available to the caller of rx_.
                                                                  *

11.1.  rx_execute_ctl.incl.pl1

This structure describes the parameters which control the pattern
match attempt and provide information on what kind of results are
desired.

dcl  1  rx_execute_ctl                aligned based,
        2  version                    char (8) aligned,
        2  select                     aligned,                    *
           3  restriction             fixed bin(35) aligned,      *
        2  flags                      aligned,
           3  match_detail_wanted     bit(1) unaligned,           |
           3  ignore_labels           bit(1) unaligned,
           3  restriction_is_bound    bit(1) unaligned,           *
           3  restriction_is_index    bit(1) unaligned,
           3  line_mode               bit(1) unaligned,           |
           3  PAD                     bit(31) unaligned;          |

where:

version
     is the identifier for this version of the structure.

                                                                  *
restriction
     is a value  which provides for the selection  of a subset of
     all  possible  matches  found  in  the  target  string.  The
     interpretation of this value depends upon the setting of the
     flag variables.

                                                                  *


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

match_detail_wanted
     is a flag which indicates  that the locations and lengths of
     each match should be placed in  a segment and the address of
     that  segment  should  be  returned  to  the  user.   In the
     description  of  rx_$execute,  this is  the  argument named,
     match_info_p.
ignore_labels
     when  set  to  "1"b  indicates that  any  labelled substring
     information  provided  by  compiled  code  for  "{ ... }X"
     sequences be left out of the match information structure.
                                                                  *
restriction_is_bound
     when  set to  "1"b indicates  that the  value of restriction
     should be taken  as an upper bound on  the number of matches
     reported  in the  match information  segment.  No  more than
     "restriction" instances of  the rx_match_info structure will
     be  placed  in  the  segment.   Less  than  this  amount may
     actually be returned, however.

restriction_is_index
     when set to "1"b indicates that only information on the N-th
     successful pattern match be returned.

line_mode                                                         |
     when set  to "1"b indicates  that the string  to be searched |
     for patterns  is to be  treated as if  it were made  up of a |
     series of lines separated by  newlines, rather than a single |
     string of characters.  The newlines in this case are treated |
     as delimiters for a sequences of strings.                    |

PAD
     is unused.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

11.2.  rx_match_info.incl.pl1

These structures  describe the information on  where matches have
been made in the target string.

dcl  1  rx_match_info            aligned based,
        2  version               char (8) aligned,
        2  template_count        fixed bin (35) aligned,
        2  PAD                   bit (36) aligned,
        2  first_rx_match_data   ptr aligned;                     |

dcl  1  rx_match_data_template   aligned based,
        2  link                  aligned,
           3  next               ptr aligned,
        2  match                 aligned,
           3  start              fixed bin(21) aligned,
           3  length             fixed bin(21) aligned,
        2  label_count           fixed bin(17) aligned,
        2  label                 aligned,
           3  name   (0                                           |
                      refer(rx_match_data_template.label_count))  |
                                 char(1) unaligned,
           3  start  (0                                           |
                      refer(rx_match_data_template.label_count))  |
                                 fixed bin(21) aligned,
           3  length (0                                           |
                      refer(rx_match_data_template.label_count))  |
                                 fixed bin(21) aligned;

where:

version
     is the identifier for the current version of this structure.

template_count
     is the  number of instances  of rx_match_data_template which
     have  been placed  in the segment.   It is  included here to
     make the match data self-describing.

PAD
     is presently unused and should be set to ""b.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

rx_match_data
     is the  pointer to the  first (if there were  any matches at |
     all)  of  a  series  of  linked  structures  describing  the
     patterns found.

next
     is a pointer  to the next instance of  this structure in the
     match  info  segment.   It  has  the  value  null()  if this
     instance is the last in the list.

start
     is the index  from the beginning of the  string to the start
     of the match.

length
     is  the number  of characters  which matched  the pattern on
     this attempt.

label_count
     is a count  of the number of labelled  substrings which were
     identified  during this  pattern match  attempt.  This value
     may      be      forced     to      zero      by     setting
     rx_execute_ctl.flags.ignore_labels.

name
     is the character which was specified to be used as the label
     for this substring.

start
     is the index from the beginning  of the target string to the
     first character of the substring.

length
     is the number of characters in the named substring.

NOTES:

Match data is  returned as a starting location  and a length.  To
clarify the  notion of starting location,  the following model is
used.   Given  a  string  of length  n,  there  are  n+1 possible
starting locations for a string match, numbered as follows:

       1. before the first character of the string

       2. between the first and second characters of the string

       3. between the second and third characters of the string


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

      ...

       n. between  the next-to-last  and the  final characters of
          the string

     n+1. after the last character of the string

The  length of  the match  is given  as the  number of characters
matched.  Thus, a match or  labeled substring which is identified
as starting at location 2 with a length of 0 (zero) specifies the
null string between the first and second characters.

Match information is returned in order from the leftmost match to
the rightmost.   Note, however, that two  succeeding instances of
rx_match_info may specify the same starting location depending on
the regular expression being used(1) for the match attempt.

Unlike match attempts which proceed from left to right, the order
of  production  of  labeled   substring  data  is  implementation
dependent.  The user is  cautioned against making any assumptions
about the order in which the labels are returned.

________________________________________

(1) Consider  that  the  expression  "^|."   will  return  two
    instances  of  rx_match_info starting  at  location 1  in the
    string "A".  One  will have a length of  zero, the other will
    have a length of one.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

                                                                  |
                                                                  |
11.3.  rx_error_info.incl.pl1                                     |
                                                                  |
Whenever  possible, the  rx_subroutine reporting  an error  via a |
call to  sub_err_ will supply  information to assist  the user in |
locating  the  source of  the error.   This takes  the form  of a |
formatted  string, and  a position  withing the  string where the |
error is.  The formatted string is pre-processed so that terminal |
dependencies  are edited  out; e.g.  backslashes  are doubled for |
printing,  control and  non-ASCII characters  are converted  to a |
"nnn"  representation.   The error  position reflects  the error |
position within this edited string.                               |
                                                                  |
                                                                  |
A  pointer  to  this  structure is  supplied  as  the information |
pointer on the call to sub_err_.                                  |
                                                                  |
                                                                  |
dcl  rx_error_info_string_len         fixed bin(21);              |
                                                                  |
dcl  1 rx_error_info                  aligned based,              |
       2 version                      char(8) aligned,            |
       2 string                       aligned,                    |
         3 len                        fixed bin(21),              |
         3 error_pos                  fixed bin(21),              |
         3 text          char (rx_error_info_string_len           |
                            refer (rx_error_info.string.len));    |
                                                                  |
                                                                  |
where:                                                            |
                                                                  |
                                                                  |
version                                                           |
     is the identifier for the current version of this structure. |
                                                                  |
len                                                               |
     is the length of the edited display string.                  |
                                                                  |
error_pos                                                         |
     is the position, relative to  the start of the string, where |
     the error occurred.  When error_pos  = 0, the error reported |
     applies to the entire string.                                |
                                                                  |
text                                                              |
     is the edited display string.                                |


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

12.  Comparative Timings

This section  provides some timing comparisons  for the functions
of  translation  and  execution  between  the  regular expression
facility   described   here  and   that  presently   provided  by
search_file_.(1)
                                                                  |
                                                                  |
The  timings  represented  here  are  those  of  the  software as |
described  by  the first  publication  of the  MTB.   They differ |
slightly, but not substantially, from those of later revisions.   |

12.1.  The Test Environment

The overall setup of each test is as follows:

     1) All  target  strings were  built from  randomly generated
        strings of  digits or alphabetic  characters.  The random
        number generator  "seed" was set at  the beginning of the
        tests to insure repeat results.

     2) Only the time used from  the beginning of the translation
        (or execution)  to the return  from the routine  has been
        accumulated.  No  processing times for  the returned data
        are included.  All times reported are in milliseconds.

     3) For  each  test,  the   translation  routine  was  called
        multiple times,  and the results averaged.   The same was
        done for the execution routine.  In both cases, the first
        timing run data was discarded.

     4) In all  tests, the number  of strings which  were matched
        was the same by both routines.

________________________________________

(1) The  search_file_  interface  does not  provide  for separate
    translation  and  execution.   Translation  was  simulated by
    applying  the pattern  to a zero-length  string; execution by
    making use of the "last expression" feature.


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

12.1.1.  Test 1

Objective:  match an entire string of characters.

Regular expression:  "^.*$"

Strings:                  1
Length (min):           100
Length (max):           100
Length (average):       100

Total Trials:            25
Total Matches:           25

                               rx_   search_file_         ratio
                                                               
Compilation                 10.100          0.328         30.79

Execution                    0.330          0.504          0.65

12.1.2.  Test 2

Objective:  match  the  last  character  in  a  string  of random
            length.

Regular expression:  ".$"

Strings:                100
Length (min):             1
Length (max):            99
Length (average):        50

Total Trials:            25
Total Matches:         2475

                               rx_   search_file_         ratio
                                                               
Compilation                  8.937          0.406         22.01

Execution                   38.512        323.394          0.12


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

12.1.3.  Test 3

Objective:  find  all  runs  of  9's in  digit  string  of random
            length.

Regular expression:  "99*"

Strings:                100
Length (min):             1
Length (max):            98
Length (average):        50

Total Trials:            25
Total Matches:        11300

                               rx_   search_file_         ratio
                                                               
Compilation                  8.596          0.271         31.72

Execution                  100.374        119.587          0.84

12.1.4.  Test 4

Objective:  find runs of ordered sequences of digits.

Regular expression:  "0.*1.*2.*3.*4.*5.*6.*7.*8.*9"

Strings:                100
Length (min):             1
Length (max):            97
Length (average):        49

Total Trials:            25
Total Matches:          500

                               rx_   search_file_         ratio
                                                               
Compilation                 41.026          1.221         33.60

Execution                1,867.058      2,697.056          0.69


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

12.1.5.  Test 5

Objective:  find all strings of random characters ending in "Zz".

Regular expression:  "Zz$"

Strings:                500
Length (min):             1
Length (max):           160
Length (average):        80

Total Trials:            25
Total Matches:           25

                               rx_   search_file_         ratio
                                                               
Compilation                  9.515          0.231         41.19

Execution                  120.327        100.171          1.20

12.2.  General Commentary on the Tests

As should  be obvious, the  translation times for  the new method
are  significantly longer  than those  of search_file_.   This is
primarily  due to  the increased complexity  which the translator
must be prepared to handle.  Furthermore, there are optimizations
in both the lexical scanning and code generation phases which are
performed  to  speed  up  the execution.   This  is  done  on the
expectation  that  translation  will  occur far  less  often than
execution.   Since  the storage  of  the compiled  code  is under
control  of  the caller,  it  is possible  to  provide additional
guarantees that this is so by  means of an "n"-level LRU cache of
expression strings and translated equivalents.

In four  out of the five  tests done, the execution  speed of the
new method was faster.  The exception has been analyzed and it is
slower because,  again, there is additional  mechanism present to
support the additional information-return possibilities.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

12.3.  Comparison Against A Compiled Program

Research,  using  an earlier  implementation  of a  facility like
this,  was done  comparing its performance  against PL/I programs
specially  written for  specific purposes.  A  description of the
work  was published  in the Proceedings  of HLSUA Forum XXXIV.(1)
The  timing  studies indicate  that,  at best,  the hand-tailored
program  offered a  30% performance advantage  while taking about
300% longer to write and debug.

________________________________________

(1) April 4-7, 1982; pp. 276-291


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

13.  Miscellaneous Topics

This section  contains various topics  related to the  subject of
the regular expression facility  described in this document which
do not logically fit in the exposition of other sections.

13.1.  Use of National Characters

The facility  described in this document  employs many characters
from the  ISO International Alphabet  #5 which are  designated as
"national"  characters.  In  view of the  evolving consensus away
from the  use of such  characters in Multics'  text processors, a
few words about their use here is appropriate.

The  regular  expressions  implemented  by  this  facility employ
"national"   characters  for   two  reasons.    First,  they  are
considered  as  part of  the  Multics standard  character set.(1)
Second,  it  is  difficult  (if not  impossible)  to  provide the
required functionality in such a  compact form without the use of
these, or other, "national" characters.

The author is open to suggestions and comments which will rectify
this situation.

13.2.  Installing it in the System

By use of the entry  point rx_$translate_old, this facility could
be  used  in a  re-written  "search_file_" to  provide equivalent
functionality  through  an   existing  interface.   In  addition,
"search_file_"  could also  be enhanced to  provide a multi-level
cache of  the "n" most recently  used regular expressions, rather
than the single level provided now.

Users could select which type of processing they wanted "old" vs.
"new"  via  an  interface  command  or  through  a process-global
switch.

________________________________________

(1) See MPM Reference Manual, AG91-03  Appendix A, (in draft form
    at the time of this  writing) which defines the characters of
    Multics to be those of ANSI X3.4-1968.


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

13.3.  Starname Processing

Given the  facility described here,  it is possible  to provide a
simple  transformation  from  Multics  "starnames"  into  regular
expressions.   This would  enable a  single facility  to underlie
both mechanisms.   An informal set  of rules which  describe this
transformation  are listed  below.  For  discussion purposes, the
escape  character is  assumed to  be the  accent grave character,
"`".

     1) All  occurrences of  the escape character  are doubled so
        that they match themselves literally.

     2) Any "?"  is transformed into "`.".

     3) The sequence "**" is changed to be "`.`*".

     4) The single "*" occurrence becomes "`[`^.`]`*".

     5) All other characters are copied as is.

     6) After the pattern is constructed,  it is prefixed by "`^"
        and suffixed by "`$".

Of  course,  this transformation  assumes  that the  starname was
well-formed  to  begin  with.   However, syntax  checking  can be
incorporated as part of the transformation process.  An advantage
of this method  is that the starname syntax  may later be relaxed
to allow items  like "*foo*" with much less  effort than re-doing
the present finite-state machine program.
                                                                  |
                                                                  |
13.4.  Efficiency Considerations                                  |
                                                                  |
Despite  attempts  to make  the  translated form  of  the regular |
expression  run  as quickly  as  possible, there  are  still some |
things that the  writer of the regular expression  can do to help |
out.  This  section discusses some of  the optimizations that the |
translator does  and gives some tips  on what the user  can do to |
avoid causing the execution to be slowed.                         |


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

13.4.1.  Translator Optimizations                                 |
                                                                  |
In  order   to  provide  for  efficient   pattern  matching,  the |
translator does  a number of optimizations  (described earlier as |
building "clumps").  In this way,  it can generate more effective |
code than  would otherwise be  the case if  the syntax definition |
definition  were  followed  slavishly.    The  semantics  of  the |
optimization  are identical  with that in  the formal definition, |
however.                                                          |
                                                                  |
                                                                  |
In the  following discussion, the escape  character is assumed to |
be the accent grave (`) character.                                |
                                                                  |
                                                                  |
     -- A  single  character  set  membership  test  followed  by |
        indefinite repetition is treated as indefinite repetition |
        of  the character.   That is, "`[X`]`*"  is translated as |
        "X`*".                                                    |
                                                                  |
     -- Strings of "`."s are combined  into a single test for the |
        specified number of characters remaining to be matched.   |
                                                                  |
     -- Strings of "`."s followed by a repetition are adjusted to |
        include  the length  of the  "`."-string less  one in the |
        bounds.   For  example, "`.`.`.`.`*"  becomes "`.`<3:`>"; |
        and  "`.`.`.`.`.`<:6`>"  is  translated  as  if  it  were |
        "`.`<4:10`>".                                             |
                                                                  |
     -- Strings of  ordinary characters are  combined into single |
        string comparisons.                                       |
                                                                  |
     -- Set membership ("`[  `]")  and set exclusion ("`[`^  `]") |
        strings are  transformed internally into  tables suitable |
        for  use  with  the  hardware  TCT  (Test  Character  and |
        Translate) instruction.                                   |
                                                                  |
     -- Notice is taken of initial character strings which can be |
        used  to  find places  where  the scan  can begin  in the |
        target strings.  Expressions  like "define`|declare" will |
        only  be profitably  begun in  target strings  which have |
        sequences of the letters "de".                            |
                                                                  |
     -- The translator remembers the  minimum and maximum lengths |
        which the pattern needs to succeed and passes this to the |
        execute phase.  The pattern  "AB`.`.`*C" needs at least a |
        four-character target string to succeed.                  |


MTB-660                                Multics Technical Bulletin
Revision 1                            Regular Expression Routines

     -- Items  which  are  indefinitely repeated,  and  which are |
        followed by character string patterns, will automatically |
        do  "backscanning"  to find  effective places  to resume. |
        For example,  in "`.`*AB", the "`.`*"  matches the entire |
        remainder  of  the  string.   Then  it  backs  up  to the |
        rightmost  occurrence  of "AB"  before  proceeding.  This |
        cuts down on excessive backtracking at run-time.          |
                                                                  |
     -- When  building  the  parse   tree,  notice  is  taken  of |
        "anchoring" of  the pattern at  either end.  And  this is |
        propagated to the root node.  Thus, "`^ABCDE`|`^VWXYZ" is |
        just  as  effective  as  "`^`(ABCDE`|VWXYZ`)".   The same |
        applies to the use of "`$".                               |
                                                                  |
                                                                  |
13.4.2.  Sequences That Work Against Efficiency                   |
                                                                  |
Despite the attempts at  producing efficient run-time code, there |
is a limited  amount of time and effort  that the translator will |
spend  achieving  this goal.   While the  assumption is  that the |
translated  pattern will  be applied many  more times  than it is |
translated,  it is  also assumed that  the source of  many of the |
strings given to the translator will be a human user.  Therefore, |
speed in translation also must be considered.  The following list |
of  situations  result  in  slower execution  speeds  because the |
translator is not yet good enough  or fast enough to resolve them |
into efficient equivalent sequences.                              |
                                                                  |
                                                                  |
As above, the escape character  used in the illustrations will be |
accent grave.                                                     |
                                                                  |
                                                                  |
     -- Items  enclosed  in  parentheses, which  are  followed by |
        repetition  forms,  are   translated  into  longer,  more |
        general, and  therefore slower code  sequences.  For this |
        reason, "X`*" is preferable  to "`(X`)*" even though they |
        have equivalent semantics.                                |
                                                                  |
     -- The  use of  alternation on single  characters results in |
        slower  execution  than   set  membership  tests.   Thus, |
        "`[XYZ`]" executes faster than "X`|Y`|Z".                 |
                                                                  |
     -- Recursion is slower than repetition.  The pattern, "A`*", |
        will run faster than "A`&".                               |
                                                                  |
     -- In  general,  recursion upsets  the anchoring  and target |
        string  length analysis  due to  the difficulty  of doing |
        efficient  flow  analysis.  Therefore,  it  executes more |
        slowly.                                                   |


Multics Technical Bulletin                                MTB-660
Regular Expression Routines                            Revision 1

13.5.  Implementation Restrictions                                |
                                                                  |
The  following  items  are  general notes  about  details  of the |
implementation  which  restrict in  some  way the  extent  of the |
translator or execution domains.                                  |
                                                                  |
                                                                  |
     -- The  translator  does all  of its  space allocation  in a |
        single segment area for efficiency.  This is not expected |
        to  cause  problems  since it  requires  several thousand |
        "clumps"  before  this  limit is  broached.   Most users' |
        regular expressions have on the order of a dozen or less. |
                                                                  |
     -- There   is  a   limit  to  the   amount  of  backtracking |
        (checkpoint) information which can be stored at execution |
        time.   A checkpoint  is taken when  alternation ("`|" or |
        "`|`|")  is done,  when repetition  is used  ("A`*"), and |
        when  recursion is  performed ("A`&")  at execution time. |
        There   is   room  for   about  32,000   checkpoints  per |
        pattern-match attempt.                                    |
                                                                  |
     -- Since  the match  data is  returned in  a single segment, |
        there is a  limit to the number of  times a given pattern |
        will be reported as  matching a particular target string. |
        This limit  is also affected  by the number  of "labelled |
        substrings" which  are defined.  If  all ninty-four label |
        characters   are  used   in  each   match  attempt,  then |
        approximately  900 matches  can be  reported back  to the |
        user.  If  no label characters  are used (or  if they are |
        specifically  ignored),  then slightly  more  than 52,000 |
        matches will be reported.                                 |
                                                                  |
     -- Because of limits on the  nature of the data available to |
        the   run-time   regarding   "long"   repetition   as  in |
        "`(ABCDE`)`<10:20`>" where the parentheses interfere with |
        optimization,   recursion   cannot   be   used   in  such |
        combinations.  Thus, the translator will refuse to accept |
        "`(/`&/`)`<14:16`>".    Simpler   repetition,   as   with |
        "X`<6:7`>, can still be used with recursion, however.     |
                                                                  |
        The exact conditions where this will cause a conflict are |
        difficult to delineate exactly.  Furthermore, this is not |
        expected to cause most users a problem.                   |
Home | History | People | Library | Sites | About | Site Map | Changes