regexp(n) Tcl Built-In Commands regexp(n)_________________________________________________________________NAMEregexp - Match a regular expression against a string
SYNOPSISregexp ?switches? exp string ?matchVar? ?subMatchVar sub-
MatchVar ...?
_________________________________________________________________DESCRIPTION
Determines whether the regular expression exp matches part
or all of string and returns 1 if it does, 0 if it
doesn't.
If additional arguments are specified after string then
they are treated as the names of variables in which to
return information about which part(s) of string matched
exp. MatchVar will be set to the range of string that
matched all of exp. The first subMatchVar will contain
the characters in string that matched the leftmost paren-
thesized subexpression within exp, the next subMatchVar
will contain the characters that matched the next paren-
thesized subexpression to the right in exp, and so on.
If the initial arguments to regexp start with - then they
are treated as switches. The following switches are cur-
rently supported:
-nocase Causes upper-case characters in string to be
treated as lower case during the matching pro-
cess.
-indices Changes what is stored in the subMatchVars.
Instead of storing the matching characters from
string, each variable will contain a list of two
decimal strings giving the indices in string of
the first and last characters in the matching
range of characters.
-- Marks the end of switches. The argument follow-
ing this one will be treated as exp even if it
starts with a -.
If there are more subMatchVar's than parenthesized subex-
pressions within exp, or if a particular subexpression in
exp doesn't match the string (e.g. because it was in a
portion of the expression that wasn't matched), then the
corresponding subMatchVar will be set to ``-1 -1'' if
-indices has been specified or to an empty string other-
wise.
Tcl 1
regexp(n) Tcl Built-In Commands regexp(n)REGULAR EXPRESSIONS
Regular expressions are implemented using Henry Spencer's
package (thanks, Henry!), and much of the description of
regular expressions below is copied verbatim from his man-
ual entry.
A regular expression is zero or more branches, separated
by ``|''. It matches anything that matches one of the
branches.
A branch is zero or more pieces, concatenated. It matches
a match for the first, followed by a match for the second,
etc.
A piece is an atom possibly followed by ``*'', ``+'', or
``?''. An atom followed by ``*'' matches a sequence of 0
or more matches of the atom. An atom followed by ``+''
matches a sequence of 1 or more matches of the atom. An
atom followed by ``?'' matches a match of the atom, or the
null string.
An atom is a regular expression in parentheses (matching a
match for the regular expression), a range (see below),
``.'' (matching any single character), ``^'' (matching
the null string at the beginning of the input string),
``$'' (matching the null string at the end of the input
string), a ``\'' followed by a single character (matching
that character), or a single character with no other sig-
nificance (matching that character).
A range is a sequence of characters enclosed in ``[]''.
It normally matches any single character from the
sequence. If the sequence begins with ``^'', it matches
any single character not from the rest of the sequence.
If two characters in the sequence are separated by ``-'',
this is shorthand for the full list of ASCII characters
between them (e.g. ``[0-9]'' matches any decimal digit).
To include a literal ``]'' in the sequence, make it the
first character (following a possible ``^''). To include
a literal ``-'', make it the first or last character.
CHOOSING AMONG ALTERNATIVE MATCHES
In general there may be more than one way to match a regu-
lar expression to an input string. For example, consider
the command
regexp (a*)b* aabaaabb x y
Considering only the rules given so far, x and y could end
up with the values aabb and aa, aaab and aaa, ab and a, or
any of several other combinations. To resolve this poten-
tial ambiguity regexp chooses among alternatives using the
rule ``first then longest''. In other words, it considers
the possible matches in order working from left to right
across the input string and the pattern, and it attempts
Tcl 2
regexp(n) Tcl Built-In Commands regexp(n)
to match longer pieces of the input string before shorter
ones. More specifically, the following rules apply in
decreasing order of priority:
[1] If a regular expression could match two different
parts of an input string then it will match the one
that begins earliest.
[2] If a regular expression contains | operators then
the leftmost matching sub-expression is chosen.
[3] In *, +, and ? constructs, longer matches are cho-
sen in preference to shorter ones.
[4] In sequences of expression components the compo-
nents are considered from left to right.
In the example from above, (a*)b* matches aab: the (a*)
portion of the pattern is matched first and it consumes
the leading aa; then the b* portion of the pattern con-
sumes the next b. Or, consider the following example:
regexp (ab|a)(b*)c abc x y z
After this command x will be abc, y will be ab, and z will
be an empty string. Rule 4 specifies that (ab|a) gets
first shot at the input string and Rule 2 specifies that
the ab sub-expression is checked before the a sub-expres-
sion. Thus the b has already been claimed before the (b*)
component is checked and (b*) must match an empty string.
KEYWORDS
match, regular expression, string
Tcl 3