Previous Next Contents

2   ML-Yacc specifications

An ML-Yacc specification consists of three parts, each of which is separated from the others by a %% delimiter. The general format is:
{user declarations}
%%
{ML-Yacc declarations}
%%
{rules}
You can define values available in the semantic actions of the rules in the user declarations section. It is recommended that you keep the size of this section as small as possible and place large blocks of code in other modules.

The ML-Yacc declarations section is used to make a set of required declarations and a set of optional declarations. You must declare the nonterminals and terminals and the types of the values associated with them there. You must also name the parser and declare the type of position values. You should specify the set of terminals which can follow the start symbol and the set of non-shiftable terminals. You may optionally declare precedences for terminals, make declarations that will improve error-recovery, and suppress the generation of default reductions in the parser. You may declare whether the parser generator should create a verbose description of the parser in a ``.desc'' file. This is useful for finding the causes of shift/reduce errors and other parsing conflicts.

You may also declare whether the semantic actions are free of significant side-effects and always terminate. Normally, ML-Yacc delays the evaluation of semantic actions until the completion of a successful parse. This ensures that there will be no semantic actions to ``undo'' if a syntactic error-correction invalidates some semantic actions. If, however, the semantic actions are free of significant side-effects and always terminate, the results of semantic actions that are invalidated by a syntactic error-correction can always be safely ignored.

Parsers run faster and need less memory when it is not necessary to delay the evaluation of semantic actions. You are encouraged to write semantic actions that are free of side-effects and always terminate and to declare this information to ML-Yacc.

A semantic action is free of significant side-effects if it can be reexecuted a reasonably small number of times without affecting the result of a parse. (The reexecution occurs when the error-correcting parser is testing possible corrections to fix a syntax error, and the number of times reexecution occurs is roughly bounded, for each syntax error, by the number of terminals times the amount of lookahead permitted for the error-correcting parser).

The rules section contains the context-free grammar productions and their associated semantic actions.

2.1   Lexical Definitions

Comments have the same lexical definition as they do in Standard ML and can be placed anywhere in a specification.

All characters up to the first occurrence of a delimiting %% outside of a comment are placed in the user declarations section. After that, the following words and symbols are reserved:


of for = { } , * -> : | ( )
The following classes of ML symbols are used:
identifiers:
nonsymbolic ML identifiers, which consist of an alphabetic character followed by one or more alphabetic characters, numeric characters, primes ``''', or underscores ``_''.
type variables:
nonsymbolic ML identifier starting with a prime ``'''
integers:
one or more decimal digits.
qualified identifiers:
an identifer followed by a period.
The following classes of non-ML symbols are used:
% identifiers:
a percent sign followed by one or more lowercase alphabet letters. The valid % identifiers are:
%arg %eop %header %keyword %left %name %nodefault %nonassoc %nonterm %noshift %pos %prec %prefer %pure %right %start %subst %term %value %verbose
code:
This class is meant to hold ML code. The ML code is not parsed for syntax errors. It consists of a left parenthesis followed by all characters up to a balancing right parenthesis. Parentheses in ML comments and ML strings are excluded from the count of balancing parentheses.

2.2   Grammar

This is the grammar for specifications:
spec ::= user-declarations %% cmd-list %% rule-list
ML-type ::= nonpolymorphic ML types (see the Standard ML manual)
symbol ::= identifier
symbol-list ::= symbol-list symbol
  | e
symbol-type-list ::= symbol-type-list | symbol of ML-type
  | symbol-type list | symbol
  | symbol of ML-type
  | symbol
subst-list ::= subst-list | symbol for symbol
  | e
cmd ::= %arg (Any-ML-pattern) : ML-type
  | %eop symbol-list
  | %header code
  | %keyword symbol-list
  | %left symbol-list
  | %name identifier
  | %nodefault
  | %nonassoc symbol-list
  | %nonterm symbol-type list
  | %noshift symbol-list
  | %pos ML-type
  | %prefer symbol-list
  | %pure
  | %right symbol-list
  | %start symbol
  | %subst subst-list
  | %term symbol-type-list
  | %value symbol code
  | %verbose
cmd-list ::= cmd-list cmd
  | cmd
rule-prec ::= %prec symbol
  | e
clause-list ::= symbol-list rule-prec code
  | clause-list | symbol-list rule-prec code
rule ::= symbol : clause-list
rule-list ::= rule-list rule
  | rule

2.3   Required ML-Yacc Declarations

%name
You must specify the name of the parser with %name {name}.
%nonterm and %term
You must define the terminal and nonterminal sets using the %term and %nonterm declarations, respectively. These declarations are like an ML datatype definition. The type of the value that a symbol may carry is defined at the same time that the symbol is defined. Each declarations consists of the keyword (%term or %nonterm) followed by a list of symbol entries separated by a bar (``|''). Each symbol entry is a symbol name followed by an optional ``of <ML-type>''. The types cannot be polymorphic. Those symbol entries without a type carry no value. Nonterminal and terminal names must be disjoint and no name may be declared more than once in either declaration.

The symbol names and types are used to construct a datatype union for the values on the semantic stack in the LR parser and to name the values associated with subcomponents of a rule. The names and types of terminals are also used to construct a signature for a structure that may be passed to the lexer functor.

Because the types and names are used in these manners, do not use ML keywords as symbol names. The programs produced by ML-Yacc will not compile if ML keywords are used as symbol names. Make sure that the types specified in the %term declaration are fully qualified types or are available in the background environment when the signatures produced by ML-Yacc are loaded. Do not use any locally defined types from the user declarations section of the specification.

These requirements on the types in the %term declaration are not a burden. They force the types to be defined in another module, which is a good idea since these types will be used in the lexer module.
%pos
You must declare the type of position values using the %pos declaration. The syntax is %pos <ML-type>. This type MUST be the same type as that which is actually found in the lexer. It cannot be polymorphic.

2.4   Optional ML-Yacc Declarations

%arg
You may want each invocation of the entire parser to be parameterized by a particular argument, such as the file-name of the input being parsed in an invocation of the parser. The %arg declaration allows you to specify such an argument. (This is often cleaner than using ``global'' reference variables.) The declaration


%arg (Any-ML-pattern) : <ML-type>
specifies the argument to the parser, as well as its type. For example:


%arg (filename) : string


If %arg is not specified, it defaults to () : unit.
%eop and %noshift
You should specify the set of terminals that may follow the start symbol, also called end-of-parse symbols, using the %eop declaration. The %eop keyword should be followed by the list of terminals. This is useful, for example, in an interactive system where you want to force the evaluation of a statement before an end-of-file (remember, a parser delays the execution of semantic actions until a parse is successful).

ML-Yacc has no concept of an end-of-file. You must define an end-of-file terminal (EOF, perhaps) in the %term declaration. You must declare terminals which cannot be shifted, such as end-of-file, in the %noshift declaration. The %noshift keyword should be followed by the list of non-shiftable terminals. An error message will be printed if a non-shiftable terminal is found on the right hand side of any rule, but ML-Yacc will not prevent you from using such grammars.

It is important to emphasize that non-shiftable terminals must be declared. The error-correcting parser may attempt to read past such terminals while evaluating a correction to a syntax error otherwise. This may confuse the lexer.
%header
You may define code to head the functor {parser name}LrValsFun here. This may be useful for adding additonal parameter structures to the functor. The functor must be parameterized by the Token structure, so the declaration should always have the form:
%header (functor {parser name}LrValsFun(
                                structure Token : TOKEN
                                       ...) 
        )
%left,%right,%nonassoc
You should list the precedence declarations in order of increasing (tighter-binding) precedence. Each precedence declaration consists of % keyword specifying associativity followed by a list of terminals. The keywords are %left, %right, and %nonassoc, standing for their respective associativities.
%nodefault
The %nodefault declaration suppresses the generation of default reductions. If only one production can be reduced in a given state in an LR table, it may be made the default action for the state. An incorrect reduction will be caught later when the parser attempts to shift the lookahead terminal which caused the reduction. ML-Yacc usually produces programs and verbose files with default reductions. This saves a great deal of space in representing the LR tables,but sometimes it is useful for debugging and advanced uses of the parser to suppress the generation of default reductions.
%pure
Include the %pure declaration if the semantic actions are free of significant side-effects and always terminate.
%start
You may define the start symbol using the %start declaration. Otherwise the nonterminal for the first rule will be used as the start nonterminal. The keyword %start should be followed by the name of the starting nonterminal. This nonterminal should not be used on the right hand side of any rules, to avoid conflicts between reducing to the start symbol and shifting a terminal. ML-Yacc will not prevent you from using such grammars, but it will print a warning message.
%verbose


Include the %verbose declaration to produce a verbose description of the LALR parser. The name of this file is the name of the specification file with a ``.desc'' appended to it.

This file has the following format:
  1. A summary of errors found while generating the LALR tables.
  2. A detailed description of all errors.
  3. A description of the states of the parser. Each state is preceded by a list of conflicts in the state.

2.5   Declarations for improving error-recovery

These optional declarations improve error-recovery:

%keyword
Specify all keywords in a grammar here. The %keyword should be followed by a list of terminal names. Fixes involving keywords are generally dangerous; they are prone to substantially altering the syntactic meaning of the program. They are subject to a more rigorous parse check than other fixes.

%prefer
List terminals to prefer for insertion after the %prefer. Corrections which insert a terminal on this list will be chosen over other corrections, all other things being equal.
%subst
This declaration should be followed by a list of clauses of the form {terminal} for {terminal}, where items on the list are separated using a |. Substitution corrections on this list will be chosen over all other corrections except preferred insertion corrections (listed above), all other things being equal.
%change
This is a generalization of %prefer and %subst. It takes a the following syntax:
tokens1a -> tokens1b | tokens2a -> tokens2b etc.
where each tokens is a (possibly empty) seqence of tokens. The idea is that any instance of tokens1a can be ``corrected'' to tokens1b, and so on. For example, to suggest that a good error correction to try is IN ID END (which is useful for the ML parser), write,
       %change   ->  IN ID END
%value
The error-correction algorithm may also insert terminals with values. You must supply a value for such a terminal. The keyword should be followed by a terminal and a piece of code (enclosed in parentheses) that when evaluated supplies the value. There must be a separate %value declaration for each terminal with a value that you wish may be inserted or substituted in an error correction. The code for the value is not evaluated until the parse is successful.

Do not specify a %value for terminals without values. This will result in a type error in the program produced by ML-Yacc.

2.6   Rules

All rules are declared in the final section, after the last %% delimiter. A rule consists of a left hand side nonterminal, followed by a colon, followed by a list of right hand side clauses.

The right hand side clauses should be separated by bars (``|''). Each clause consists of a list of nonterminal and terminal symbols, followed by an optional %prec declaration, and then followed by the code to be evaluated when the rule is reduced.

The optional %prec consists of the keyword %prec followed by a terminal whose precedence should be used as the precedence of the rule.

The values of those symbols on the right hand side which have values are available inside the code. Positions for all the symbols are also available. Each value has the general form {symbol name}{n+1}, where {n} is the number of occurrences of the symbol to the left of the symbol. If the symbol occurs only once in the rule, {symbol name} may also be used. The positions are given by {symbol name}{n+1}left and {symbol name}{n+1}right. where {n} is defined as before. The position for a null rhs of a production is assumed to be the leftmost position of the lookahead terminal which is causing the reduction. This position value is available in defaultPos.

The value to which the code evaluates is used as the value of the nonterminal. The type of the value and the nonterminal must match. The value is ignored if the nonterminal has no value, but is still evaluated for side-effects.


Previous Next Contents