documentation/info_segments/lex_string

  1 02/06/84  lex_string_
  2 
  3 The lex_string_ subroutine provides a facility for parsing an ASCII
  4 character string into tokens (character strings delimited by break
  5 characters) and statements (groups of tokens).  It supports the parsing
  6 of comments and quoted strings.  It parses an entire character string
  7 during one invocation, creating a chain of descriptors for the tokens
  8 and statements in a temporary segment.  The cost per token of
  9 lex_string_ is significantly lower than that of parse_file_ because the
 10 overhead of calling parse_file_ to obtain each token is eliminated.
 11 Therefore, the lex_string_ subroutine is recommended for translators
 12 that deal with moderate to large amounts of input.
 13 
 14 
 15 The descriptors generated when the lex_string_ subroutine parses a
 16 character string can be used as input to translators generated by the
 17 reduction_compiler command, as well as in other applications.  In
 18 addition, the information in the statement and token descriptors can be
 19 used in error messages printed by the lex_error_ subroutine.
 20 
 21 Refer to the Subroutines manual for details on the operation of the
 22 lex_string_ subroutine.
 23 
 24 
 25 Entry points in lex_string_:
 26    (List is generated by the help command)
 27 
 28 
 29 :Entry:  init_lex_delims:  02/06/84 lex_string_$init_lex_delims
 30 
 31 
 32 Function: constructs two character strings from the set of break
 33 characters and comment, quoting, and statement delimiters: one string
 34 contains the first character of every delimiter or break character
 35 defined by the language to be parsed; the second string contains a
 36 character of control information for each character in the first
 37 string.  These two character strings form the break tables that the
 38 lex_string_ subroutine uses to parse an input string.  The two strings
 39 (delimiter and control) are intended to be internal static variables
 40 of the program that calls lex_string_, initialized only once per
 41 process.  They can then be used in successive calls to
 42 lex_string_$lex.
 43 
 44 
 45 Syntax:
 46 declare lex_string_$init_lex_delims entry (char(*), char(*), char(*),
 47      char(*), char(*), bit(*), char(*) varying aligned,
 48      char(*) varying aligned, char(*) varying aligned,
 49      char(*) varying aligned);
 50 call lex_string_$init_lex_delims (quote_open, quote_close,
 51      comment_open, comment_close, statement_delim, Sinit, break_chars,
 52      ignored_break_chars, lex_delims, lex_control_chars);
 53 
 54 
 55 Arguments:
 56 quote_open
 57    is the character string delimiter that begins a quoted string.
 58    (Input).  It can contain up to four characters.  If it is a null
 59    character string, then quoted strings are not supported during the
 60    parsing of an input string.
 61 quote_close
 62    is the character string delimiter that ends a quoted string.
 63    (Input).  It can be the same character string as quote_open, and can
 64    contain up to four characters.
 65 comment_open
 66    is the character string delimiter that begins a comment.  (Input).
 67    It can contain up to four characters.  If it is a null character
 68    string, then comments are not supported during the parsing of a
 69    character string.
 70 
 71 
 72 comment_close
 73    is the character string delimiter that ends a comment.  (Input).  It
 74    can be the same character string as comment_open, and can contain up
 75    to four characters.
 76 statement_delim
 77    is the character string delimiter that ends a statement.  (Input).
 78    It can contain up to four characters.  If it is a null character
 79    string, then statements are not delimited during the parsing of a
 80    character string.
 81 
 82 
 83 Sinit
 84    is a bit string that controls the creation of statement descriptors
 85    and token descriptors for quoting delimiters.  (Input)  The bit
 86    string consists of two bits in the order listed below.
 87    Ssuppress_quoting_delims
 88       is "1"b if token descriptors for the quote opening and closing
 89       delimiters of a quoted string are to be suppressed.  A token
 90       descriptor is still created for the quoted string itself, and the
 91       quoted_string switch in this descriptor is turned on.  If
 92       Ssuppress_quoting_delims is "0"b, then token descriptors are
 93       returned for the quote opening and closing delimiters, as well as
 94       for the quoted string.
 95 
 96 
 97    Ssuppress_stmt_delims
 98       is "1"b if the token descriptor for a statement delimiter is to
 99       be suppressed.  The end_of_stmt switch in the descriptor of the
100       token that precedes the statement delimiter is turned on,
101       instead.  If Ssuppress_stmt_delims is "0"b, then a token
102       descriptor is returned for a statement delimiter, and the
103       end_of_stmt switch in this descriptor is turned on.
104 
105 
106 break_chars
107    is a character string containing all of the characters that can be
108    used to delimit tokens.  (Input).  The string can include characters
109    used also in the quoting, comment, or statement delimiters, and
110    should include any ASCII control characters that are to be treated
111    as delimiters.
112 ignored_break_chars
113    is a character string containing all of the break_chars that can be
114    used to delimit tokens but that are not tokens themselves.  (Input).
115    No token descriptors are created for these characters.
116 
117 
118 lex_delims
119    is an output character string containing all of the delimiters that
120    the lex_string_ subroutine uses to parse an input string.  (Output)
121    This string is constructed by the init_lex_delims entry from the
122    preceding arguments.  It must be long enough to contain all of the
123    break_chars, plus the first character of the quote_open delimiter,
124    the comment_open delimiter, and the statement_delim delimiter, plus
125    30 additional characters.  This length must not exceed 128
126    characters, the number of characters in the ASCII character set.
127 lex_control_chars
128    is an output character string containing one character of control
129    information for each character in lex_delims.  (Output).  This
130    string is also constructed by init_lex_delims from the preceding
131    arguments.  It must be as long as lex_delims.
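
Examples: the following is a minimal sketch of the once-per-process
initialization described above, not a definitive implementation.  The
names init_example, first_call, and NL, the PL/I-style delimiters, and
the particular break characters are assumptions made for illustration.
With six break characters, lex_delims needs at most 6 + 3 + 30 = 39
characters; the 128-character maximum is declared for simplicity.

```pli
init_example: procedure;

/* Hypothetical caller: every name except the entry point and its
   argument names is an illustrative assumption. */

dcl  lex_string_$init_lex_delims entry (char(*), char(*), char(*),
          char(*), char(*), bit(*), char(*) varying aligned,
          char(*) varying aligned, char(*) varying aligned,
          char(*) varying aligned);

dcl  NL char(1) internal static options (constant) initial ("
");
dcl  first_call bit(1) aligned internal static initial ("1"b);
dcl  break_chars char(8) varying aligned internal static;
dcl  ignored_break_chars char(4) varying aligned internal static;
dcl  lex_delims char(128) varying aligned internal static;
dcl  lex_control_chars char(128) varying aligned internal static;

     if first_call then do;
          /* space, newline, and four punctuation break characters */
          break_chars = " " || NL || ";,()";
          /* white space delimits tokens but gets no descriptors */
          ignored_break_chars = " " || NL;
          /* Sinit = "10"b: suppress quoting delimiter tokens,
             keep statement delimiter tokens */
          call lex_string_$init_lex_delims ("""", """", "/*", "*/",
               ";", "10"b, break_chars, ignored_break_chars,
               lex_delims, lex_control_chars);
          first_call = "0"b;
     end;

     /* calls to lex_string_$lex would follow here */

end init_example;
```

Keeping all four strings in internal static storage lets later calls
to lex_string_$lex in the same process reuse them without rebuilding
the break tables.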
132 
133 
134 :Entry:  lex:  02/06/84 lex_string_$lex
135 
136 
137 Function: parses an input string according to the delimiters, break
138 characters, and control information given as its arguments.  The input
139 string consists of two parts: the first part contains characters that
140 the parser ignores except for counting lines; the second part contains
141 the characters to be parsed.  Lines in the ignored part must still be
142 counted so that accurate line numbers can be stored in the token and
143 statement descriptors that are created for the parsed part of the
144 string.
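
Notes: the two-part division of the input can be sketched as below.
This is an illustration only.  It assumes, as the description above
suggests, that Pinput addresses the first character of the ignored
part.  The helper split_input and its parameters Ltotal and Lskip are
hypothetical names, and Ltotal is assumed to be nonnegative.

```pli
/* Hypothetical helper: divides an input of Ltotal characters so that
   the first Lskip characters are scanned only to count lines. */
split_input: procedure (Ltotal, Lskip, Lignored_input, Linput);

dcl  Ltotal fixed bin(21) parameter;          /* (Input) all characters */
dcl  Lskip fixed bin(21) parameter;           /* (Input) part to skip */
dcl  Lignored_input fixed bin(21) parameter;  /* (Output) */
dcl  Linput fixed bin(21) parameter;          /* (Output) */
dcl (max, min) builtin;

     /* keep the ignored length within 0..Ltotal */
     Lignored_input = min (max (Lskip, 0), Ltotal);
     /* everything after the ignored part is parsed */
     Linput = Ltotal - Lignored_input;

end split_input;
```

Passing both lengths, rather than advancing Pinput past the skipped
text, is what keeps the line numbers in the descriptors relative to
the start of the whole string.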
145 
146 
147 Syntax:
148 declare lex_string_$lex entry (ptr, fixed bin(21), fixed bin(21), ptr,
149      bit(*), char(*), char(*), char(*), char(*), char(*),
150      char(*) varying aligned, char(*) varying aligned,
151      char(*) varying aligned, char(*) varying aligned, ptr, ptr,
152      fixed bin(35));
153 call lex_string_$lex (Pinput, Linput, Lignored_input, Psegment,
154      Slex, quote_open, quote_close, comment_open, comment_close,
155      statement_delim, break_chars, ignored_break_chars, lex_delims,
156      lex_control_chars, Pfirst_stmt_desc, Pfirst_token_desc, code);
157 
158 
159 Arguments:
160 Pinput
161    is a pointer to the string to be parsed.  (Input)
162 Linput
163    is the length (in characters) of the second part of the input
164    string, the part that is actually to be parsed.  (Input)
165 Lignored_input
166    is the length (in characters) of the first part of the input string,
167    the part that is ignored except for line counting.  (Input).  This
168    length can be 0 if none of the input characters are to be ignored.
169 Psegment
170    is a pointer to a temporary segment created by the translator_temp_
171    subroutine.  (Input)
172 
173 
174 Slex
175    is a bit string that controls the creation of statement and comment
176    descriptors, the handling of doubled quotes within a quoted string,
177    and the interpretation of a comment_close delimiter that equals the
178    statement_delim.  (Input).  The bit string consists of four bits:
179    Sstatement_desc
180       is "1"b if statement descriptors are to be created along with the
181       token descriptors.  If Sstatement_desc is "0"b, or if the
182       statement delimiter is a null character string, then no statement
183       descriptors are created.
184    Scomment_desc
185       is "1"b if comment descriptors are to be created for any comments
186       that appear in the input string.  If Scomment_desc is "0"b,
187       comment_open is a null character string, or statement descriptors
188       are not being created, then no comment descriptors are created.
189 
190 
191    Sretain_doubled_quotes
192       is "1"b if doubled quote_close delimiters that appear within a
193       quoted string are to be retained.  If Sretain_doubled_quotes is
194       "0"b, then a copy of each quoted string containing doubled
195       quote_close delimiters is created in the temporary segment with
196       all doubled quote_close delimiters changed to single quote_close
197       delimiters.
198    Sequate_comment_close_stmt_delim
199       is "1"b if the comment_close and statement_delim character
200       strings are the same, and if the closing of a comment is to be
201       treated as the ending of the statement containing the comment.
202       This switch is useful for parsing line-oriented languages that
203       have only one statement per line and one comment per statement.
204 
205 
206 quote_open
207    is the character string delimiter that begins a quoted string.
208    (Input).  It can contain up to four characters.  If it is a null
209    character string, then quoted strings are not supported during the
210    parsing of an input string.
211 quote_close
212    is the character string delimiter that ends a quoted string.
213    (Input).  It can be the same character string as quote_open, and can
214    contain up to four characters.
215 comment_open
216    is the character string delimiter that begins a comment.  (Input).
217    It can contain up to four characters.  If it is a null character
218    string, then comments are not supported during the parsing of a
219    character string.
220 
221 
222 comment_close
223    is the character string delimiter that ends a comment.  (Input).  It
224    can be the same character string as comment_open, and can contain up
225    to four characters.
226 statement_delim
227    is the character string delimiter that ends a statement.  (Input).
228    It can contain up to four characters.  If it is a null character
229    string, then statements are not delimited during the parsing of a
230    character string.
231 break_chars
232    is a character string containing all of the characters that can be
233    used to delimit tokens.  (Input).  The string can include characters
234    used also in the quoting, comment, or statement delimiters, and
235    should include any ASCII control characters that are to be treated
236    as delimiters.
237 
238 
239 ignored_break_chars
240    is a character string containing all of the break_chars that can be
241    used to delimit tokens but that are not tokens themselves.  (Input).
242    No token descriptors are created for these characters.
243 lex_delims
244    is the character string initialized by lex_string_$init_lex_delims.
245    (Input)
246 lex_control_chars
247    is the character string initialized by lex_string_$init_lex_delims.
248    (Input)
249 Pfirst_stmt_desc
250    is a pointer to the first in the chain of statement descriptors.
251    (Output).  This is a null pointer on return if no statement
252    descriptors have been created.
253 
254 
255 Pfirst_token_desc
256    is a pointer to the first in the chain of token descriptors.
257    (Output).  This is a null pointer on return if no tokens were found
258    in the input string.
259 
260 
261 code
262    is one of the following status codes:  (Output)
263    0
264       the parsing was completed successfully.
265    error_table_$zero_length_seg
266       no tokens were found in the input string.
267    error_table_$no_stmt_delim
268       the input string did not end with a statement delimiter, when
269       statement delimiters were used in the parsing.
270    error_table_$unbalanced_quotes
271       the input string ended with a quoted string that was not
272       terminated by a quote_close delimiter.
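
Examples: the following sketch shows one way a translator might call
lex_string_$lex.  It is an illustration under stated assumptions, not
a definitive implementation.  The procedure name lex_example, the
names first_call and NL, the PL/I-style delimiters, and the choice of
Sinit and Slex values are hypothetical.  Psegment is assumed to have
been obtained by the caller from the translator_temp_ subroutine.

```pli
lex_example: procedure (Pinput, Linput, Psegment, Pfirst_stmt_desc,
          Pfirst_token_desc, code);

dcl  Pinput ptr parameter;              /* (Input) string to parse */
dcl  Linput fixed bin(21) parameter;    /* (Input) its length */
dcl  Psegment ptr parameter;            /* (Input) temporary segment */
dcl  Pfirst_stmt_desc ptr parameter;    /* (Output) */
dcl  Pfirst_token_desc ptr parameter;   /* (Output) */
dcl  code fixed bin(35) parameter;      /* (Output) */

dcl  lex_string_$init_lex_delims entry (char(*), char(*), char(*),
          char(*), char(*), bit(*), char(*) varying aligned,
          char(*) varying aligned, char(*) varying aligned,
          char(*) varying aligned);
dcl  lex_string_$lex entry (ptr, fixed bin(21), fixed bin(21), ptr,
          bit(*), char(*), char(*), char(*), char(*), char(*),
          char(*) varying aligned, char(*) varying aligned,
          char(*) varying aligned, char(*) varying aligned, ptr, ptr,
          fixed bin(35));

dcl  NL char(1) internal static options (constant) initial ("
");
dcl  first_call bit(1) aligned internal static initial ("1"b);
dcl  break_chars char(8) varying aligned internal static;
dcl  ignored_break_chars char(4) varying aligned internal static;
dcl  lex_delims char(128) varying aligned internal static;
dcl  lex_control_chars char(128) varying aligned internal static;

     if first_call then do;             /* build break tables once */
          break_chars = " " || NL || ";,()";
          ignored_break_chars = " " || NL;
          call lex_string_$init_lex_delims ("""", """", "/*", "*/",
               ";", "10"b, break_chars, ignored_break_chars,
               lex_delims, lex_control_chars);
          first_call = "0"b;
     end;

     /* Lignored_input = 0: parse the whole string.
        Slex = "1000"b: create statement descriptors only. */
     call lex_string_$lex (Pinput, Linput, 0, Psegment, "1000"b,
          """", """", "/*", "*/", ";", break_chars,
          ignored_break_chars, lex_delims, lex_control_chars,
          Pfirst_stmt_desc, Pfirst_token_desc, code);

     /* a nonzero code is one of the error_table_ values listed
        above; this sketch simply returns it to the caller */
     if code ^= 0 then return;

     /* the descriptor chains now start at Pfirst_stmt_desc and
        Pfirst_token_desc */

end lex_example;
```

The same delimiter and break character arguments are passed to both
entry points, so that the break tables built by init_lex_delims match
the strings given to lex.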