The RegExpSyntax structure provides an abstract-syntax-tree
representation of regular expressions. Its main purpose is to
provide communication between different frontends (implementing
different RE specification languages) and different backends
(implementing different compilation/searching algorithms).
It is also possible, however, to use it to specify a regular
expression directly for a backend engine.
Synopsis
signature REGEXP_SYNTAX
structure RegExpSyntax : REGEXP_SYNTAX
Interface
exception CannotCompile
structure CharSet : ORD_SET where type Key.ord_key = char
datatype syntax
= Group of syntax
| Alt of syntax list
| Concat of syntax list
| Interval of (syntax * int * int option)
| MatchSet of CharSet.set
| NonmatchSet of CharSet.set
| Char of char
| Begin
| End
val optional : syntax -> syntax
val closure : syntax -> syntax
val posClosure : syntax -> syntax
val fromRange : char * char -> CharSet.set
val addRange : CharSet.set * char * char -> CharSet.set
val allChars : CharSet.set
val alnum : CharSet.set
val alpha : CharSet.set
val ascii : CharSet.set
val blank : CharSet.set
val cntl : CharSet.set
val digit : CharSet.set
val graph : CharSet.set
val lower : CharSet.set
val print : CharSet.set
val punct : CharSet.set
val space : CharSet.set
val upper : CharSet.set
val word : CharSet.set
val xdigit : CharSet.set
Description
exception CannotCompile

This exception is meant to be raised by backends when they encounter a feature that they cannot handle.
structure CharSet : ORD_SET where type Key.ord_key = char

This substructure implements sets of 8-bit characters. Currently it is implemented using sorted lists (i.e., using the ListSetFn functor), but that may change in the future.
datatype syntax

This datatype defines the abstract syntax of regular expressions that is supported by the library. The constructors are defined as follows:

Group re
:: defines a match group (i.e., one that produces a corresponding match-tree node for the input matched by re).
Alt[re1, re2, …, ren]
:: matches any of re1, re2, …, ren. If the list is empty, then it matches nothing.
Concat[re1, re2, …, ren]
:: matches the concatenation of re1, re2, …, ren. If the list is empty, then it matches the empty string.
Interval(re, n, NONE)
:: matches re repeated at least n times.
Interval(re, n, SOME m)
:: matches re repeated from n to m times.
MatchSet cs
:: matches a single character that is in the set cs.
NonmatchSet cs
:: matches a single character that is not in the set cs.
Char c
:: matches the single character c.
Begin
:: matches the beginning of the input stream.
End
:: matches the end of the input stream.
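
As an illustration of how these constructors combine, the following is a minimal sketch that builds the tree for the pattern ^(ab|c)[0-9]{2,}$ by hand and then walks it the way a backend might. It assumes the SML/NJ regexp library has been loaded (for example with CM.make "$/regexp-lib.cm"); the names S, digits, re, countGroups, and sumGroups are hypothetical and exist only for this example.

```sml
(* Assumption: the regexp library is already loaded, e.g. via
 * CM.make "$/regexp-lib.cm" at the SML/NJ prompt. *)
structure S = RegExpSyntax

(* The set [0-9], built with the generic ORD_SET operations. *)
val digits = S.CharSet.addList (S.CharSet.empty, explode "0123456789")

(* Abstract syntax for the pattern ^(ab|c)[0-9]{2,}$ *)
val re : S.syntax =
      S.Concat [
          S.Begin,
          S.Group (S.Alt [S.Concat [S.Char #"a", S.Char #"b"], S.Char #"c"]),
          S.Interval (S.MatchSet digits, 2, NONE),
          S.End
        ]

(* A simple traversal: count the match groups in a tree. *)
fun countGroups (S.Group r) = 1 + countGroups r
  | countGroups (S.Alt rs) = sumGroups rs
  | countGroups (S.Concat rs) = sumGroups rs
  | countGroups (S.Interval (r, _, _)) = countGroups r
  | countGroups _ = 0
and sumGroups rs = List.foldl (fn (r, n) => n + countGroups r) 0 rs
```

A backend that cannot handle one of the constructors would typically raise CannotCompile from a traversal of this shape instead of returning a value.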

val optional : syntax -> syntax

optional re is equivalent to Interval(re, 0, SOME 1).
val closure : syntax -> syntax

closure re is equivalent to Interval(re, 0, NONE).
val posClosure : syntax -> syntax

posClosure re is equivalent to Interval(re, 1, NONE).
val fromRange : char * char -> CharSet.set

fromRange (c1, c2) returns the set containing the characters in the range from c1 to c2 (inclusive). This expression raises the Size exception if c2 < c1.
val addRange : CharSet.set * char * char -> CharSet.set

addRange (cs, c1, c2) adds the set of characters in the range from c1 to c2 (inclusive) to cs. This expression raises the Size exception if c2 < c1.
val allChars : CharSet.set

is the set of all 8-bit characters.
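
The helper functions described above can be sketched in use as follows. This is only an illustration under the assumption that the regexp library is loaded; the names S, lowerOrDigit, re, and reversedRaises are hypothetical.

```sml
(* Assumption: the regexp library is already loaded, e.g. via
 * CM.make "$/regexp-lib.cm" at the SML/NJ prompt. *)
structure S = RegExpSyntax

(* The set [a-z0-9]: start from the range a-z, then add the range 0-9. *)
val lowerOrDigit = S.addRange (S.fromRange (#"a", #"z"), #"0", #"9")

(* Abstract syntax for [a-z0-9]+!? : one or more characters from the
 * set, followed by an optional exclamation mark. *)
val re : S.syntax =
      S.Concat [
          S.posClosure (S.MatchSet lowerOrDigit),
          S.optional (S.Char #"!")
        ]

(* A reversed range is documented to raise Size; this evaluates to
 * true when the exception is raised. *)
val reversedRaises =
      (ignore (S.fromRange (#"z", #"a")); false) handle Size => true
```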
POSIX Character Classes
The RegExpSyntax structure predefines the following character sets,
which are part of the POSIX regular-expression standard (plus a couple
of extras):
val alnum : CharSet.set

is the set of letters and digits.
val alpha : CharSet.set

is the set of letters.
val ascii : CharSet.set

is the set of characters c such that 0 <= ord c <= 127.
val blank : CharSet.set

is the set of #"\t" and space.
val cntl : CharSet.set

is the set of nonprintable characters.
val digit : CharSet.set

is the set of decimal digits.
val graph : CharSet.set

is the set of visible characters (does not include space).
val lower : CharSet.set

is the set of lowercase letters.
val print : CharSet.set

is the set of printable characters (includes space).
val punct : CharSet.set

is the set of visible characters other than letters and digits.
val space : CharSet.set

is the set of #"\t", #"\r", #"\n", #"\v", #"\f", and space.
val upper : CharSet.set

is the set of uppercase letters.
val word : CharSet.set

is the set of letters, digits, and #"_".
val xdigit : CharSet.set

is the set of hexadecimal digits.
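
Because the predefined classes are ordinary CharSet.set values, they can be queried and extended with the ORD_SET operations. The sketch below assumes the regexp library is loaded; the names S, CS, identStart, and ident are hypothetical and chosen only for this example.

```sml
(* Assumption: the regexp library is already loaded, e.g. via
 * CM.make "$/regexp-lib.cm" at the SML/NJ prompt. *)
structure S = RegExpSyntax
structure CS = S.CharSet

(* Letters plus underscore: the characters that may start an identifier. *)
val identStart = CS.add (S.alpha, #"_")

(* Abstract syntax for an identifier: one start character followed by
 * any number of word characters (letters, digits, or underscore). *)
val ident : S.syntax =
      S.Concat [S.MatchSet identStart, S.closure (S.MatchSet S.word)]

(* Membership queries against the predefined classes. *)
val underscoreIsWord = CS.member (S.word, #"_")
val spaceIsGraph = CS.member (S.graph, #" ")
val spaceIsPrint = CS.member (S.print, #" ")
```

Per the descriptions above, space belongs to print but not to graph, so the last two queries are expected to differ.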