Metacharacters
Some characters have a special meaning to the searcher. These characters are called metacharacters. Although they may seem confusing at first, they add a great deal of flexibility and convenience to the searcher.
The period (.) is a commonly used metacharacter. It matches exactly one character, regardless of what the character is. For example, the regular expression:
2,.-Dimethylbutane
will match "2,2-Dimethylbutane" and "2,3-Dimethylbutane". Note that the period matches exactly one character; it will not match a string of characters, nor will it match the null string. Thus, "2,200-Dimethylbutane" and "2,-Dimethylbutane" will not be matched by the above regular expression.
But what if you wanted to search for a string containing a period? For example, suppose we wished to search for references to pi. The following regular expression would not work:
3.14 (THIS IS WRONG!)
This would indeed match "3.14", but it would also match "3514", "3f14", or even "3+14". In short, any string of the form "3x14", where x is any character, would be matched by the regular expression above. To get around this, we introduce a second metacharacter, the backslash (\). The backslash can be used to indicate that the character immediately to its right is to be taken literally. Thus, to search for the string "3.14", we would use:
3\.14 (This will work.)
This is called "quoting". We would say that the period in the regular expression above has been quoted. In general, whenever the backslash is placed before a metacharacter, the searcher treats the metacharacter literally rather than invoking its special meaning. (Unfortunately, the backslash is used for other things besides quoting metacharacters. Many "normal" characters take on special meanings when preceded by a backslash. The rule of thumb is, quoting a metacharacter turns it into a normal character, and quoting a normal character may turn it into a metacharacter.)
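The difference between the unquoted and the quoted period can be checked with a short sketch. It assumes that Python's standard "re" module behaves like the searcher for these two constructs; the variable names are illustrative only.

```python
import re

# Assumption: Python's `re` module stands in for the searcher here.
unquoted = re.compile(r"3.14")    # period matches any one character
quoted = re.compile(r"3\.14")     # quoted period matches only "."

candidates = ["3.14", "3514", "3f14", "3+14"]

# fullmatch() requires the whole candidate to fit the pattern.
unquoted_hits = [s for s in candidates if unquoted.fullmatch(s)]
quoted_hits = [s for s in candidates if quoted.fullmatch(s)]

print(unquoted_hits)  # all four candidates
print(quoted_hits)    # only "3.14"
```
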
Let's look at some more common metacharacters. We consider first the question mark (?). The question mark indicates that the character immediately preceding it may appear either zero times or one time. Thus
m?ethane
would match either "ethane" or "methane". Similarly,
comm?a
would match either "coma" or "comma".
Another metacharacter is the star (*). This
indicates that the character immediately to its left may be repeated any
number of times, including zero. Thus
ab*c
would match "ac", "abc", "abbc", "abbbc",
"abbbbbbbbc", and any string that starts with an
"a", is followed by a sequence of "b"'s, and ends
with a "c".
The plus (+) metacharacter indicates that
the character immediately preceding it may be repeated one or more
times. It is just like the star metacharacter, except it doesn't match
the null string. Thus
ab+c
would not match "ac", but it would match
"abc", "abbc", "abbbc", "abbbbbbbbc"
and so on.
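The question mark, star and plus differ only in how many repetitions they allow. The following sketch compares them on the same inputs, assuming Python's "re" module as a stand-in for the searcher; the helper name "matching" is hypothetical.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the whole pattern matches."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

samples = ["ac", "abc", "abbc", "abbbc"]

zero_or_one = matching(r"ab?c", samples)   # "?": zero or one "b"
zero_or_more = matching(r"ab*c", samples)  # "*": any number of "b"s
one_or_more = matching(r"ab+c", samples)   # "+": at least one "b"

print(zero_or_one)   # ['ac', 'abc']
print(zero_or_more)  # ['ac', 'abc', 'abbc', 'abbbc']
print(one_or_more)   # ['abc', 'abbc', 'abbbc']
```
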
Metacharacters may be combined. A common combination includes the
period and star metacharacters, with the star immediately following the
period. This is used to match an arbitrary string of any length,
including the null string. For example:
cyclo.*ane
would match "cyclodecane", "cyclohexane" and even
"cyclones drive me insane." Any string that starts with "cyclo",
is followed by an arbitrary string, and ends with "ane" will
be matched. Note that the null string will be matched by the period-star
pair; thus, "cycloane" would be matched by the above
expression.
If you wanted to search for articles on cyclodecane and cyclohexane,
but didn't want to match articles about how cyclones drive one insane,
you could string together three periods, as follows:
cyclo...ane
This would match "cyclodecane" and "cyclohexane",
but would not match "cyclones drive me insane." Only strings
eleven characters long which start with "cyclo" and end with
"ane" will be matched. (Note that "cyclopentane"
would not be matched, however, since cyclopentane has twelve characters,
not eleven.)
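The contrast between the period-star pair and a fixed run of periods can be sketched as follows. This assumes Python's "re" module as a stand-in for the searcher, and the sample sentence is written without its closing period so that it ends in "ane".

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the whole pattern matches."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

samples = [
    "cyclodecane",
    "cyclohexane",
    "cyclopentane",
    "cycloane",
    "cyclones drive me insane",
]

# ".*" accepts any run of characters, including none at all.
dot_star = matching(r"cyclo.*ane", samples)
# "..." accepts exactly three characters, no more and no fewer.
three_dots = matching(r"cyclo...ane", samples)

print(dot_star)    # every sample
print(three_dots)  # ['cyclodecane', 'cyclohexane']
```
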
Here are some more examples. These involve the backslash. Note that
the placement of backslash is important.
a\.*z
- Matches any string starting with "a", followed by a
series of periods (including the "series" of length zero),
and terminated by "z". Thus, "az", "a.z",
"a..z", "a...z" and so forth are all matched.
a.\*z
- (Note that the backslash and period are reversed in this regular
expression.)
Matches any string starting with an "a", followed by
one arbitrary character, and terminated with "*z". Thus,
"ag*z", "a5*z" and "a@*z" are all
matched. Only strings of length four, where the first character is
"a", the third "*", and the fourth
"z", are matched.
a\++z
- Matches any string starting with "a", followed by a
series of plus signs, and terminated by "z". There must be
at least one plus sign between the "a" and the
"z". Thus, "az" is not matched, but
"a+z", "a++z", "a+++z", etc. will be
matched.
a\+\+z
- Matches only the string "a++z".
a+\+z
- Matches any string starting with a series of "a"'s,
followed by a single plus sign and ending with a "z".
There must be at least one "a" at the start of the string.
Thus "a+z", "aa+z", "aaa+z" and so on
will match, but "+z" will not.
a.?e
- Matches "ace", "ale", "axe" and any
other three-character string beginning with "a" and ending
with "e"; will also match "ae".
a\.?e
- Matches "ae" and "a.e". No other string is
matched.
a.\?e
- Matches any four-character string starting with "a" and
ending with "?e". Thus, "ad?e", "a1?e"
and "a%?e" will all be matched.
a\.\?e
- Matches only "a.?e" and nothing else.
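The examples above can be checked mechanically. The sketch below pairs each pattern with a few strings that should and should not match, assuming Python's "re" module treats these constructs the way the searcher does; the "cases" table and the "check" helper are hypothetical names.

```python
import re

# (pattern, strings that should match, strings that should not)
cases = [
    (r"a\.*z", ["az", "a.z", "a...z"], ["abz"]),
    (r"a.\*z", ["ag*z", "a5*z", "a@*z"], ["az", "agz"]),
    (r"a\++z", ["a+z", "a++z"], ["az"]),
    (r"a\+\+z", ["a++z"], ["a+z", "a+++z"]),
    (r"a+\+z", ["a+z", "aa+z"], ["+z"]),
    (r"a.?e", ["ace", "ae"], ["abce"]),
    (r"a\.?e", ["ae", "a.e"], ["ace"]),
    (r"a.\?e", ["ad?e", "a%?e"], ["ade"]),
    (r"a\.\?e", ["a.?e"], ["ab?e", "ae"]),
]

def check(pattern, should_match, should_not_match):
    """True if the pattern accepts and rejects exactly as listed."""
    accepts = all(re.fullmatch(pattern, s) for s in should_match)
    rejects = not any(re.fullmatch(pattern, s) for s in should_not_match)
    return accepts and rejects

results = {pattern: check(pattern, yes, no) for pattern, yes, no in cases}

for pattern, ok in results.items():
    print(pattern, "behaves as described" if ok else "differs")
```
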
Earlier it was mentioned that the backslash can turn ordinary characters
into metacharacters, as well as the other way around. One such use of
this is the digit metacharacter, which is invoked by following
a backslash with a lower-case "d", like this: "\d".
The "d" must be lower case, for reasons explained
later. The digit metacharacter matches exactly one digit; that is,
exactly one occurrence of "0", "1", "2",
"3", "4", "5", "6",
"7", "8" or "9". For example, the regular
expression:
2,\d-Dimethylbutane
would match "2,2-Dimethylbutane",
"2,3-Dimethylbutane" and so forth. Similarly,
1\.\d\d\d\d\d
would match any six-digit floating-point number from 1.00000 to 1.99999
inclusive. We could combine the digit metacharacter with other
metacharacters; for instance,
a\d+z
matches any string starting with "a", followed by a string of
digits, followed by a "z". (Note that the plus is used, and
thus "az" is not matched.)
The letter "d" in the string "\d"
must be lower-case. This is because there is another metacharacter, the non-digit
metacharacter, which uses the uppercase "D". The non-digit
metacharacter looks like "\D" and matches any
character except a digit. Thus,
a\Dz
would match "abz", "aTz" or "a%z", but
would not match "a2z", "a5z" or
"a9z". Similarly,
\D+
matches any non-null string which contains no numeric
characters.
Notice that in changing the "d" from lower-case to
upper-case, we have reversed the meaning of the digit metacharacter.
This holds true for most other metacharacters of the format
backslash-letter.
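The digit and non-digit metacharacters can be tried out with a short sketch. It assumes Python's "re" module as a stand-in for the searcher; the re.ASCII flag is an added assumption that restricts "\d" to the ten digits "0" through "9", as described above.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the whole pattern matches."""
    # re.ASCII limits \d to the digits 0-9 (an assumption made here).
    return [s for s in candidates
            if re.fullmatch(pattern, s, flags=re.ASCII)]

digit_hits = matching(r"a\d+z", ["az", "a1z", "a2024z", "abz"])
non_digit_hits = matching(r"a\Dz", ["abz", "aTz", "a%z", "a2z"])
six_digit_hits = matching(r"1\.\d\d\d\d\d",
                          ["1.00000", "1.99999", "1.5", "2.00000"])

print(digit_hits)      # ['a1z', 'a2024z']
print(non_digit_hits)  # ['abz', 'aTz', 'a%z']
print(six_digit_hits)  # ['1.00000', '1.99999']
```
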
There are three other metacharacters in the backslash-letter format.
The first is the word metacharacter, which matches exactly one
letter, one number, or the underscore character (_). It is
written as "\w". Its opposite, "\W",
matches any one character except a letter, a number or the
underscore. Thus,
a\wz
would match "abz", "aTz", "a5z", "a_z",
or any three-character string starting with "a", ending with
"z", and whose second character was either a letter (upper- or
lower-case), a number, or the underscore. Similarly,
a\Wz
would not match "abz", "aTz",
"a5z", or "a_z". It would match "a%z",
"a{z", "a?z" or any three-character string starting
with "a" and ending with "z" and whose second
character was not a letter, number, or underscore. (This means the
second character must either be a symbol or a whitespace character.)
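A short sketch of the word and non-word metacharacters follows. It assumes Python's "re" module as a stand-in for the searcher, with re.ASCII added so that "\w" covers only unaccented letters, digits and the underscore.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the whole pattern matches."""
    # re.ASCII keeps \w to A-Z, a-z, 0-9 and "_" (an assumption).
    return [s for s in candidates
            if re.fullmatch(pattern, s, flags=re.ASCII)]

samples = ["abz", "aTz", "a5z", "a_z", "a%z", "a z"]

word_hits = matching(r"a\wz", samples)
non_word_hits = matching(r"a\Wz", samples)

print(word_hits)      # ['abz', 'aTz', 'a5z', 'a_z']
print(non_word_hits)  # ['a%z', 'a z']
```
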
The whitespace metacharacter matches exactly one character
of whitespace. (Whitespace is defined as spaces, tabs, newlines, or any
character which would not use ink if printed on a printer.) The
whitespace metacharacter looks like this: "\s".
Its opposite, which matches any character that is not
whitespace, looks like this: "\S". Thus,
a\sz
would match any three-character string starting with "a" and
ending with "z" and whose second character was a space, tab,
or newline. Likewise,
a\Sz
would match any three-character string starting with "a" and
ending with "z" whose second character was not a
space, tab or newline. (Thus, the second character could be a letter,
number or symbol.)
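The whitespace and non-whitespace metacharacters can be compared in the same way. This is a minimal sketch that assumes Python's "re" module behaves like the searcher for "\s" and "\S".

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the whole pattern matches."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# The second character is a space, a tab, a newline, or something else.
samples = ["a z", "a\tz", "a\nz", "abz", "a5z", "a%z"]

space_hits = matching(r"a\sz", samples)
non_space_hits = matching(r"a\Sz", samples)

print(space_hits == ["a z", "a\tz", "a\nz"])  # True
print(non_space_hits)                         # ['abz', 'a5z', 'a%z']
```
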
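The word boundary metacharacter matches the boundaries of
words; that is, it matches the position between a word character
(a letter, a number or the underscore) and a non-word character such
as whitespace or punctuation, as well as the very beginning and end of
the text when a word character sits there. It matches a position
rather than a character, so it adds nothing to the length of the
match. It looks like "\b". Its opposite, "\B", matches any position
that is not a word boundary. Thus: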
\bcomput
will match "computer" or "computing", but not
"supercomputer" since there are no spaces or punctuation
between "super" and "computer". Similarly,
\Bcomput
will not match "computer" or "computing",
unless it is part of a bigger word such as "supercomputer" or
"recomputing".
Note that the underscore (_) is considered a
"word" character. Thus,
\bcomputer
will not match "super_computer", since there is no word boundary
between the underscore and the "c".
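The behaviour of the word boundary metacharacter and its opposite can be sketched as follows. This assumes Python's "re" module as a stand-in for the searcher; re.search is used because the patterns only need to be found somewhere in the text, and the helper name "found_in" is hypothetical.

```python
import re

def found_in(pattern, text):
    """True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text, flags=re.ASCII) is not None

texts = ["computer", "supercomputer", "super computer", "super_computer"]

# \b requires a word boundary immediately before "comput".
at_boundary = {t: found_in(r"\bcomput", t) for t in texts}
# \B requires that there is no word boundary before "comput".
inside_word = {t: found_in(r"\Bcomput", t) for t in texts}

for t in texts:
    print(t, at_boundary[t], inside_word[t])
```

In this sketch the underscore behaves like a letter, so "super_computer" is treated the same way as "supercomputer".
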
There is one other metacharacter starting with a backslash, the octal
metacharacter. The octal metacharacter looks like this: "\nnn",
where each "n" is a digit from zero to seven. This is used for
specifying control characters that have no typed equivalent. For
example,
\007
would find all subjects with an embedded ASCII "bell"
character. (The bell is specified by an ASCII value of 7.) You will
rarely need to use the octal metacharacter.
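As a minimal sketch of the octal metacharacter, the following assumes that Python's "re" module reads "\007" the same way the searcher does, namely as the character with octal value 7. The sample strings are made up for illustration.

```python
import re

# Assumption: in Python's `re`, \007 in a pattern denotes the character
# with octal value 7, the ASCII bell.
bell = re.compile(r"\007")

with_bell = "line one\x07line two"   # contains an embedded bell
without_bell = "line one line two"   # contains no bell

has_bell = bell.search(with_bell) is not None
no_bell = bell.search(without_bell) is None

print(has_bell, no_bell)  # True True
```
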
There are three other metacharacters that may be of use. The first is
the braces metacharacter. This metacharacter follows a normal
character and contains two numbers separated by a comma (,)
and surrounded by braces ({}). It is like the star
metacharacter, except that the number of repetitions must be within
the minimum and maximum specified by the two numbers in braces.
Thus,
ab{3,5}c
will match "abbbc", "abbbbc" or "abbbbbc".
No other string is matched. Likewise,
.{3,5}pentane
will match "cyclopentane", "isopentane" or "neopentane",
but not "n-pentane", since "n-" is only two
characters long.
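The two braces examples can be verified with a short sketch, assuming Python's "re" module handles the {min,max} form the same way as the searcher.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the whole pattern matches."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Between three and five "b"s are allowed.
b_hits = matching(r"ab{3,5}c",
                  ["abbc", "abbbc", "abbbbc", "abbbbbc", "abbbbbbc"])

# Between three and five arbitrary characters before "pentane".
pentane_hits = matching(r".{3,5}pentane",
                        ["cyclopentane", "isopentane",
                         "neopentane", "n-pentane"])

print(b_hits)        # ['abbbc', 'abbbbc', 'abbbbbc']
print(pentane_hits)  # ['cyclopentane', 'isopentane', 'neopentane']
```
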
The alternative metacharacter is represented by a vertical bar (|).
It indicates an either/or behavior by separating two or more possible
choices. For example:
isopentane|cyclopentane
will match any subject containing the strings "isopentane" or
"cyclopentane" or both. However, it will not match
"pentane" or "n-pentane" or "neopentane".
The last metacharacter is the brackets metacharacter. The
bracket metacharacter matches one occurrence of any character inside the
brackets ([]). For example,
\s[cmt]an\s
will match "can", "man" and "tan", but not
"ban", "fan" or "pan". Similarly,
2,[23]-dimethylbutane
will match "2,2-dimethylbutane" or
"2,3-dimethylbutane", but not "2,4-dimethylbutane",
"2,23-dimethylbutane" or "2,-dimethylbutane". Ranges
of characters can be used by using the dash (-) within
the brackets. For example,
a[a-d]z
will match "aaz", "abz", "acz" or
"adz", and nothing else. Likewise,
textfile0[3-5]
will match "textfile03", "textfile04", or
"textfile05" and nothing else.
If you wish to include a dash within brackets as one of the
characters to match, instead of to denote a range, put the dash
immediately before the right bracket. Thus:
a[1234-]z
and
a[1-4-]z
both do the same thing. They both match "a1z",
"a2z", "a3z", "a4z" or "a-z",
and nothing else.
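The alternative and bracket metacharacters described above can be tried out together. The sketch assumes Python's "re" module as a stand-in for the searcher; the subject lines are invented for illustration and "matching" is a hypothetical helper.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the whole pattern matches."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Alternation: a subject is a hit if it contains either alternative.
subjects = [
    "isopentane synthesis",
    "cyclopentane and isopentane mixtures",
    "n-pentane boiling point",
    "neopentane",
]
alternation_hits = [s for s in subjects
                    if re.search(r"isopentane|cyclopentane", s)]

# Brackets: exactly one character from the listed set or range.
set_hits = matching(r"2,[23]-dimethylbutane",
                    ["2,2-dimethylbutane", "2,3-dimethylbutane",
                     "2,4-dimethylbutane", "2,23-dimethylbutane"])
range_hits = matching(r"a[a-d]z", ["aaz", "adz", "aez"])

# A dash placed just before the right bracket is a literal dash.
dash_samples = ["a1z", "a4z", "a-z", "a5z"]
listed = matching(r"a[1234-]z", dash_samples)
ranged = matching(r"a[1-4-]z", dash_samples)

print(alternation_hits)  # the first two subjects
print(set_hits)          # ['2,2-dimethylbutane', '2,3-dimethylbutane']
print(range_hits)        # ['aaz', 'adz']
print(listed == ranged)  # True
```
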
The bracket metacharacter can also be inverted by placing a caret (^)
immediately after the left bracket. Thus,
textfile0[^02468]
matches any ten-character string starting with "textfile0" and
ending with anything except an even digit. Inversion and ranges can be
combined, so that
\W[^f-h]ood\W
matches any four-letter word ending in "ood" except
for "food", "good" or "hood". (Thus
"mood" and "wood" would both be matched.)
Note that within brackets, ordinary quoting rules do not apply and
other metacharacters are not available. The only characters that can be
quoted in brackets are "[", "]", and "\". Thus,
[\[\\\]]abc
matches any four-character string ending with "abc" and starting
with "[", "]", or "\".