Introduction

1 Introduction

1.1 General

ML-Yacc is a parser generator for Standard ML modeled after the Yacc parser generator. It generates parsers for LALR languages, like Yacc, and has a similar syntax. The generated parsers use a different algorithm for recovering from syntax errors than parsers generated by Yacc. The algorithm is a partial implementation of an algorithm described in [1]. A parser tries to recover from a syntax error by making a single token insertion, deletion, or substitution near the point in the input stream at which the error was detected. The parsers delay the evaluation of semantic actions until parses are completed successfully. This makes it possible for parsers to recover from syntax errors that occur before the point of error detection, but it does prevent the parsers from affecting lexers in any significant way. The parsers can insert tokens with values and substitute tokens with values for other tokens. All symbols carry left and right position values which are available to semantic actions and are used in syntactic error messages.

ML-Yacc uses context-free grammars to specify the syntax of languages to be parsed. See [2] for definitions and information on context-free grammars and LR parsing. We briefly review some terminology here. A context-free grammar is defined by a set of terminals T, a set of nonterminals NT, a set of productions P, and a start nonterminal S. Terminals are interchangeably referred to as tokens. The terminal and nonterminal sets are assumed to be disjoint. The set of symbols is the union of the nonterminal and terminal sets. We use lower case Greek letters to denote a string of symbols. We use upper case Roman letters near the beginning of the alphabet to denote nonterminals. Each production gives a derivation of a string of symbols from a nonterminal, which we will write as A ® a. We define a relation between strings of symbols a and b, written a |- b and read as a derives b, if and only if a = d A g, b = d f g and there exists some production A ® f. We write the transitive closure of this relation as |-_*. We say that a string of terminals a is a valid sentence of the language, i.e. it is derivable, if the start symbol S |-_* a. The sequence of derivations is often visualized as a parse tree.

ML-Yacc uses an attribute grammar scheme with synthesized attributes. Each symbol in the grammar may have a value (i.e. attribute) associated with it. Each production has a semantic action associated with it. A production with a semantic action is called a rule. Parsers perform bottom-up, left-to-right evaluations of parse trees using semantic actions to compute values as they do so. Given a production P = A ® a, the corresponding semantic action is used to compute a value for A from the values of the symbols in a. If A has no value, the semantic action is still evaluated but the value is ignored. Each parse returns the value associated with the start symbol S of the grammar. A parse returns a nullary value if the start symbol does not carry a value.

The synthesized attribute scheme can be adapted easily to inherited attributes. An inherited attribute is a value which propagates from a nonterminal to the symbols produced by the nonterminal according to some rule. Since functions are values in ML, the semantic actions for the derived symbols can return functions which takes the inherited value as an argument.

1.2 Modules

ML-Yacc uses the ML modules facility to specify the interface between a parser that it generates and a lexical analyzer that must be supplied by you. It also uses the ML modules facility to factor out a set of modules that are common to every generated parser. These common modules include a parsing structure, which contains an error-correcting LR parser ¹, an LR table structure, and a structure which defines the representation of terminals. ML-Yacc produces a functor for a particular parser parameterized by the LR table structure and the representation of terminals. This functor contains values specific to the parser, such as the LR table for the parser ², the semantic actions for the parser, and a structure containing the terminals for the parser. ML-Yacc produces a signature for the structure produced by applying this functor and another signature for the structure containing the terminals for the parser. You must supply a functor for the lexing module parameterized this structure.

Figure 1 is a dependency diagram of the modules that summarizes this information. A module at the head of an arrow is dependent on the module at the tail.

parsing structure ¾® values for a particular parser

values for a particular parser ¾® lexical analyzer

parsing structure, ¾® particular parser

values for a particular parser,

lexical analyzer

Figure 1: Module Dependencies

1.3 Error Recovery

The error recovery algorithm is able to accurately recover from many single token syntax errors. It tries to make a single token correction at the token in the input stream at which the syntax error was detected and any of the 15 tokens ³ before that token. The algorithm checks corrections before the point of error detection because a syntax error is often not detected until several tokens beyond the token which caused the error.⁴

The algorithm works by trying corrections at each of the 16 tokens up to and including the token at which the error was detected. At each token in the input stream, it will try deleting the token, substituting other tokens for the token, or inserting some other token before the token.

The algorithm uses a parse check to evaluate corrections. A parse check is a check of how far a correction allows a parser to parse without encountering a syntax error. You pass an upper bound on how many tokens beyond the error point a parser may read while doing a parse check as an argument to the parser. This allows you to control the amount of lookahead that a parser reads for different kinds of systems. For an interactive system, you should set the lookahead to zero. Otherwise, a parser may hang waiting for input in the case of a syntax error. If the lookahead is zero, no syntax errors will be corrected. For a batch system, you should set the lookahead to 15.

The algorithm selects the set of corrections which allows the parse to proceed the farthest and parse through at least the error token. It then removes those corrections involving keywords which do not meet a longer minimum parse check. If there is more than one correction possible after this, it uses a simple heuristic priority scheme to order the corrections, and then arbitrarily chooses one of the corrections with the highest priority. You have some control over the priority scheme by being able to name a set of preferred insertions and a set of preferred substitutions. The priorities for corrections, ordered from highest to lowest priority, are preferred insertions, preferred substitutions, insertions, deletions, and substitutions.

The error recovery algorithm is guaranteed to terminate since it always selects fixes which parse through the error token.

The error-correcting LR parser implements the algorithm by keeping a queue of its state stacks before shifting tokens and using a lazy stream for the lexer. This makes it possible to restart the parse from before an error point and try various corrections. The error-correcting LR parser does not defer semantic actions. Instead, ML-Yacc creates semantic actions which are free of side-effects and always terminate. ML-Yacc uses higher-order functions to defer the evaluation of all user semantic actions until the parse is successfully completed without constructing an explicit parse tree. You may declare whether your semantic actions are free of side-effects and always terminate, in which case ML-Yacc does not need to defer the evaluation of your semantic actions.

1.4 Precedence

ML-Yacc uses the same precedence scheme as Yacc for resolving shift/reduce conflicts. Each terminal may be assigned a precedence and associativity. Each rule is then assigned the precedence of its rightmost terminal. If a shift/reduce conflict occurs, the conflict is resolved silently if the terminal and the rule in the conflict have precedences. If the terminal has the higher precedence, the shift is chosen. If the rule has the higher precedence, the reduction is chosen. If both the terminal and the rule have the same precedence, then the associativity of the terminal is used to resolve the conflict. If the terminal is left associative, the reduction is chosen. If the terminal is right associative, the shift is chosen. Terminals may be declared to be nonassociative, also, in which case an error message is produced if the associativity is need to resolve the parsing conflict.

If a terminal or a rule in a shift/reduce conflict does not have a precedence, then an error message is produced and the shift is chosen.

In reduce/reduce conflicts, an error message is always produced and the first rule listed in the specification is chosen for reduction.

1.5 Notation

Text surrounded by brackets denotes meta-notation. If you see something like {parser name}, you should substitute the actual name of your parser for the meta-notation. Text in a bold-face typewriter font (like this) denotes text in a specification or ML code.

parsing structure	¾®	values for a particular parser
values for a particular parser	¾®	lexical analyzer
parsing structure,	¾®	particular parser
values for a particular parser,
lexical analyzer