Regular expression syntax

Regular expression syntax has several basic rules and methods.

Using character sets

The pattern within the brackets of a regular expression defines a character set that is used to match a single character. For example, the regular expression " [A-Za-z] " specifies to match any single uppercase or lowercase letter enclosed by spaces. In the character set, a hyphen indicates a range of characters.

The regular expression " B[IAU]G " matches the strings “ BIG ”, “ BAG ”, and “ BUG ”, but does not match the string “ BOG ”.

If you specify the regular expression as " B[IA][GN] ", the concatenation of character sets creates a regular expression that matches the corresponding concatenation of characters in the search string. This regular expression matches a space, followed by “B”, followed by an “I” or “A”, followed by a “G” or “N”, followed by a trailing space. The regular expression matches “ BIG ”, “ BAG ”, “ BIN ”, and “ BAN ”.

The regular expression [A-Z][a-z]* matches any word that starts with an uppercase letter and is followed by zero or more lowercase letters. The special character * after the closing square bracket specifies to match zero or more occurrences of the character set.

Note: The * only applies to the character set that immediately precedes it, not to the entire regular expression.

A + after the closing square bracket specifies to find one or more occurrences of the character set. You interpret the regular expression " [A-Z]+ " as matching one or more uppercase letters enclosed by spaces. Therefore, this regular expression matches “ BIG ” and also matches “ LARGE ”, “ HUGE ”, “ ENORMOUS ”, and any other string of uppercase letters surrounded by spaces.
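The following sketch applies the character-set expressions described above with REFind. The search strings and variable names are illustrative assumptions; each comment gives the index that REFind returns, or 0 when nothing matches.

```cfml
<!--- " B[IAU]G " needs a space, "B", one of I/A/U, "G", and a space. --->
<cfset bigIndex = REFind(" B[IAU]G ", "A BIG dog")>
<!--- bigIndex is 2: the match starts at the space before "BIG". --->

<cfset bogIndex = REFind(" B[IAU]G ", "A BOG frog")>
<!--- bogIndex is 0: "O" is not in the set [IAU]. --->

<!--- One uppercase letter followed by zero or more lowercase letters. --->
<cfset wordIndex = REFind("[A-Z][a-z]*", "the Quick fox")>
<!--- wordIndex is 5: "Q" is the first uppercase letter. --->

<!--- One or more uppercase letters enclosed by spaces. --->
<cfset upperIndex = REFind(" [A-Z]+ ", "an ENORMOUS cat")>
<!--- upperIndex is 3: the match starts at the space before "ENORMOUS". --->
```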

Considerations when using special characters

Since a regular expression followed by an * can match zero instances of the regular expression, it can also match the empty string. For example,

<cfoutput> 
    REReplace("Hello","[T]*","7","ALL") - #REReplace("Hello","[T]*","7","ALL")#<BR> 
</cfoutput>

results in the following output:

REReplace("Hello","[T]*","7","ALL") - 7H7e7l7l7o7

The regular expression [T]* can match empty strings. It first matches the empty string before “H” in “Hello”. The “ALL” argument tells REReplace to replace all instances of an expression, so the empty string before “e” is matched next, and so on through the empty string before “o” and, finally, the empty string after “o”, which accounts for the trailing 7 in the output.

This result might be unexpected. The workarounds for these types of problems are specific to each case. In some cases you can use [T]+, which requires at least one “T”, instead of [T]*. Alternatively, you can specify an additional pattern after [T]*.

In the following example, the regular expression has a “W” at the end:

<cfoutput> 
    REReplace("Hello World","[T]*W","7","ALL") - 
        #REReplace("Hello World","[T]*W","7","ALL")#<BR> 
</cfoutput>

This expression results in the following more predictable output:

REReplace("Hello World","[T]*W","7","ALL") - Hello 7orld
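As a sketch of the other workaround mentioned above, the following example uses [T]+ instead of [T]*. The second search string is an illustrative assumption.

```cfml
<cfoutput>
    <!--- [T]+ needs at least one "T", so it never matches an empty string. --->
    REReplace("Hello","[T]+","7","ALL") - #REReplace("Hello","[T]+","7","ALL")#<BR>
    <!--- Each run of one or more uppercase "T" characters becomes a single 7. --->
    REReplace("Tea Time TTT","[T]+","7","ALL") - #REReplace("Tea Time TTT","[T]+","7","ALL")#<BR>
</cfoutput>
```

The first call leaves the string unchanged (Hello) because it contains no “T”. The second call produces 7ea 7ime 7.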

Finding repeating characters

In some cases, you might want to find a repeating pattern of characters in a search string. For example, the regular expression "a{2,4}" specifies to match two to four occurrences of “a”. Therefore, it matches "aa", "aaa", and "aaaa", but not a lone "a". In a longer run such as "aaaaa", it matches only the first four characters. In the following example, the REFind function returns an index of 6:

<cfset IndexOfOccurrence=REFind("a{2,4}", "hahahaaahaaaahaaaaahhh")> 
<!--- The value of IndexOfOccurrence is 6--->

The regular expression "[0-9]{3,}" specifies to match any integer number containing three or more digits: “123”, “45678”, and so on. However, this regular expression does not match a one-digit or two-digit number.

You use the following syntax to find repeating characters:

  1. {m,n}

    Where m is 0 or greater and n is greater than or equal to m. Match m through n (inclusive) occurrences.

    The expression {0,1} is equivalent to the special character ?.

  2. {m,}

    Where m is 0 or greater. Match at least m occurrences. The syntax {,n} is not allowed.

    The expression {1,} is equivalent to the special character +, and {0,} is equivalent to *.

  3. {m}

    Where m is 0 or greater. Match exactly m occurrences.
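The three forms above can be sketched as follows. The search strings and variable names are illustrative assumptions, not part of the original examples.

```cfml
<!--- {m}: exactly three digits. The two-digit run "12" is skipped. --->
<cfset exactIndex = REFind("[0-9]{3}", "ab12cd345")>
<!--- exactIndex is 7 --->

<!--- {m,}: three or more digits. "7" and "42" are too short. --->
<cfset atLeastIndex = REFind("[0-9]{3,}", "7 42 1234")>
<!--- atLeastIndex is 6 --->

<!--- {0,1} behaves like ?, so the "u" is optional. --->
<cfset withBraces = REReplace("color colour", "colou{0,1}r", "X", "ALL")>
<cfset withQuestion = REReplace("color colour", "colou?r", "X", "ALL")>
<!--- Both results are "X X" --->
```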

Case sensitivity in regular expressions

ColdFusion supplies case-sensitive and case-insensitive functions for working with regular expressions. REFind and REReplace perform case-sensitive matching and REFindNoCase and REReplaceNoCase perform case-insensitive matching.

You can build a regular expression that models case-insensitive behavior, even when used with a case-sensitive function. To make a regular expression case insensitive, substitute individual characters with character sets. For example, the regular expression [Jj][Aa][Vv][Aa], when used with the case-sensitive functions REFind or REReplace, matches all of the following string patterns:

  • JAVA

  • java

  • Java

  • jAva

  • All other combinations of case
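A minimal sketch comparing the two approaches, using an assumed search string:

```cfml
<!--- Character sets make the case-sensitive REFind ignore case. --->
<cfset setIndex = REFind("[Jj][Aa][Vv][Aa]", "I like jAvA")>
<!--- setIndex is 8 --->

<!--- The plain pattern fails with the case-sensitive function... --->
<cfset plainIndex = REFind("java", "I like jAvA")>
<!--- plainIndex is 0 --->

<!--- ...but succeeds with the case-insensitive function. --->
<cfset noCaseIndex = REFindNoCase("java", "I like jAvA")>
<!--- noCaseIndex is 8 --->
```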

Using subexpressions

Parentheses group parts of regular expressions into subexpressions that you can treat as a single unit. For example, the regular expression "ha" specifies to match a single occurrence of the string. The regular expression "(ha)+" matches one or more instances of “ha”.

In the following example, you use the regular expression "B(ha)+" to match the letter "B" followed by one or more occurrences of the string "ha":

<cfset IndexOfOccurrence=REFind("B(ha)+", "hahaBhahahaha")> 
<!--- The value of IndexOfOccurrence is 5 --->

You can use the special character | in a subexpression to create a logical "OR". You can use the following regular expression to search for the word "jelly" or "jellies":

<cfset IndexOfOccurrence=REFind("jell(y|ies)", "I like peanut butter and jelly")> 
<!--- The value of IndexOfOccurrence is 26--->
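To see which alternative of a subexpression matched, you can ask REFind to return subexpression positions through its optional fourth argument. The following is a sketch with an assumed search string; the variable names are hypothetical.

```cfml
<cfset searchString = "two jellies">
<!--- The fourth argument requests a structure with pos and len arrays. --->
<cfset result = REFind("jell(y|ies)", searchString, 1, "yes")>

<!--- Element 1 describes the whole match. --->
<cfset wholeMatch = Mid(searchString, result.pos[1], result.len[1])>
<!--- wholeMatch is "jellies" (pos 5, len 7) --->

<!--- Element 2 describes the first parenthesized subexpression. --->
<cfset ending = Mid(searchString, result.pos[2], result.len[2])>
<!--- ending is "ies" (pos 9, len 3) --->
```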

Using special characters

Regular expressions define the following list of special characters:

+ * ? . [ ^ $ ( ) { | \ 

In some cases, you use a special character as a literal character. For example, if you want to search for the plus sign in a string, you have to escape the plus sign by preceding it with a backslash:

"\+"

The following table describes the special characters for regular expressions:

Special Character

Description

\

A backslash followed by any special character matches the literal character itself, that is, the backslash escapes the special character.

For example, "\+" matches the plus sign, and "\\" matches a backslash.

.

A period matches any character, including newline.

To match any character except a newline, use [^#chr(13)##chr(10)#], which excludes the ASCII carriage return and line feed codes. The corresponding escape codes are \r and \n.

[ ]

A one-character character set that matches any of the characters in that set.

For example, "[akm]" matches an “a”, “k”, or “m”. A hyphen in a character set indicates a range of characters; for example, [a-z] matches any single lowercase letter.

If the first character of a character set is the caret (^), the regular expression matches any character except those in the set. It does not match the empty string.

For example, [^akm] matches any character except “a”, “k”, or “m”. The caret loses its special meaning if it is not the first character of the set.

^

If the caret is at the beginning of a regular expression, the matched string must be at the beginning of the string being searched.

For example, the regular expression "^ColdFusion" matches the string "ColdFusion lets you use regular expressions" but not the string "In ColdFusion, you can use regular expressions."

$

If the dollar sign is at the end of a regular expression, the matched string must be at the end of the string being searched.

For example, the regular expression "ColdFusion$" matches the string "I like ColdFusion" but not the string "ColdFusion is fun."

?

A character set or subexpression followed by a question mark matches zero or one occurrence of the character set or subexpression.

For example, xy?z matches either “xyz” or “xz”.

|

The OR character allows a choice between two regular expressions.

For example, jell(y|ies) matches either “jelly” or “jellies”.

+

A character set or subexpression followed by a plus sign matches one or more occurrences of the character set or subexpression.

For example, [a-z]+ matches one or more lowercase characters.

*

A character set or subexpression followed by an asterisk matches zero or more occurrences of the character set or subexpression.

For example, [a-z]* matches zero or more lowercase characters.

()

Parentheses group parts of a regular expression into subexpressions that you can treat as a single unit.

For example, (ha)+ matches one or more instances of “ha”.

(?x)

If at the beginning of a regular expression, it specifies to ignore whitespace in the regular expression and lets you use ## for end-of-line comments. You can match a space by escaping it with a backslash.

For example, the following regular expression includes comments, preceded by ##, that are ignored by ColdFusion:

reFind("(?x)
one ##first option
|two ##second option
|three\ point\ five ## note escaped spaces
", "three point five")

(?m)

If at the beginning of a regular expression, it specifies the multiline mode for the special characters ^ and $.

When used with ^, the matched string can be at the start of the entire search string or at the start of new lines, denoted by a linefeed character or chr(10), within the search string. For $, the matched string can be at the end of the search string or at the end of new lines.

Multiline mode does not recognize a carriage return, or chr(13), as a new line character.

The following example searches for the string “two” across multiple lines:

#reFind("(?m)^two", "one#chr(10)#two")#

This example returns 5 to indicate that it matched “two” after the chr(10) linefeed, which occupies position 4. Without (?m), the regular expression would not match anything, because ^ only matches the start of the string.

The (?m) modifier does not affect \A or \Z, which always match the start or end of the string, respectively. For information on \A and \Z, see Using escape sequences.

(?i)

If at the beginning of a regular expression for REFind(), it specifies to perform a case-insensitive compare.

For example, the following line would return an index of 1:

#reFind("(?i)hi", "HI")#

If you omit the (?i), the line would return an index of zero to signify that it did not find the regular expression.

(?=...)

If at the beginning of a regular expression, it specifies to use positive lookahead when searching for the regular expression.

If you prefix a subexpression with this, ColdFusion uses positive lookahead for that subexpression.

Positive lookahead tests for the parenthesized subexpression like regular parentheses, but does not include the contents in the match; it only tests whether the subexpression is present at that position in the search string.

For example, consider the expression to extract the protocol from a URL:

<cfset regex = "http(?=://)">
<cfset string = "http://">
<cfset result = reFind(regex, string, 1, "yes")>
<cfoutput>#mid(string, result.pos[1], result.len[1])#</cfoutput>

This example results in the string "http". The lookahead parentheses ensure that the "://" is there, but do not include it in the result. If you did not use lookahead, the result would include the extraneous "://".

Lookahead parentheses do not capture text, so backreference numbering will skip over these groups. For more information on backreferencing, see Using backreferences.

(?!...)

If at the beginning of a regular expression, it specifies to use negative lookahead. Negative lookahead is just like positive lookahead, as specified by (?=...), except that it tests for the absence of a match.

Lookahead parentheses do not capture text, so backreference numbering will skip over these groups. For more information on backreferencing, see Using backreferences.

(?:...)

If you prefix a subexpression with "?:", ColdFusion performs all operations on the subexpression except that it does not capture the corresponding text for use with a backreference.
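The following sketch exercises several entries from the table above. The search strings and variable names are illustrative assumptions; the last example relies on the \1 and \2 backreference syntax in the replacement string, which is covered in Using backreferences.

```cfml
<!--- Escape a special character to match it literally. --->
<cfset plusIndex = REFind("\+", "2+2=4")>
<!--- plusIndex is 2 --->

<!--- An unescaped period matches any character; an escaped one matches only ".". --->
<cfset anyChar = REReplace("a.b.c", ".", "-", "ALL")>
<!--- anyChar is "-----" --->
<cfset onlyDots = REReplace("a.b.c", "\.", "-", "ALL")>
<!--- onlyDots is "a-b-c" --->

<!--- ^ anchors the match to the start of the string, $ to the end. --->
<cfset startIndex = REFind("^ColdFusion", "In ColdFusion")>
<!--- startIndex is 0 --->
<cfset endIndex = REFind("ColdFusion$", "I like ColdFusion")>
<!--- endIndex is 8 --->

<!--- Negative lookahead: "http" that is not followed by "s". --->
<cfset plainHttp = REFind("http(?!s)", "https http")>
<!--- plainHttp is 7 --->

<!--- (?:...) groups the year without capturing it,
      so \1 is the month and \2 is the day. --->
<cfset dayMonth = REReplace("2024-01-15",
    "(?:[0-9]{4})-([0-9]{2})-([0-9]{2})", "\2/\1")>
<!--- dayMonth is "15/01" --->
```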

You must be aware of the following considerations when using special characters in character sets, such as [a-z]:

  • To include a hyphen (-) in the brackets of a character set as a literal character, you cannot escape it as you can other special characters because ColdFusion always interprets a hyphen as a range indicator. Therefore, if you use a literal hyphen in a character set, make it the last character in the set.

  • To include a closing square bracket (]) in the character set, escape it with a backslash, as in [1-3\]A-z]. You do not have to escape the ] character outside the character set designator.
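A short sketch of both considerations, using assumed search strings:

```cfml
<!--- A literal hyphen goes last in the set. --->
<cfset phoneMasked = REReplace("call 555-1234", "[0-9-]+", "N", "ALL")>
<!--- phoneMasked is "call N" --->

<!--- A closing square bracket inside the set is escaped with a backslash. --->
<cfset bracketMasked = REReplace("x1]y2z", "[1-3\]]", "_", "ALL")>
<!--- bracketMasked is "x__y_z" --->
```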

Using escape sequences

Escape sequences are special characters in regular expressions preceded by a backslash (\). You typically use escape sequences to represent special characters within a regular expression. For example, the escape sequence \t represents a tab character within the regular expression, and the \d escape sequence specifies any digit, as [0-9] does. ColdFusion escape sequences are case sensitive.

The following table lists the escape sequences that ColdFusion supports:

Escape Sequence

Description

\b

Specifies a boundary defined by a transition from an alphanumeric character to a nonalphanumeric character, or from a nonalphanumeric character to an alphanumeric character.

For example, the string " Big" contains a boundary defined by the space (nonalphanumeric character) and the "B" (alphanumeric character).

The following example uses the \b escape sequence in a regular expression to locate the string "Big" at the end of the search string and not the fragment "big" inside the word "ambiguous".

<cfset IndexOfOccurrence = reFindNoCase("\bBig\b", "Don't be ambiguous about Big.")>
<!--- The value of IndexOfOccurrence is 26 --->

When used inside a character set (for example [\b]), it specifies a backspace.

\B

Specifies a boundary defined by no transition of character type. For example, two alphanumeric characters in a row or two nonalphanumeric characters in a row; opposite of \b.

\A

Specifies a beginning of string anchor, much like the ^ special character.

However, unlike ^, you cannot combine \A with (?m) to specify the start of newlines in the search string.

\Z

Specifies an end of string anchor, much like the $ special character.

However, unlike $, you cannot combine \Z with (?m) to specify the end of newlines in the search string.

\n

Newline character

\r

Carriage return

\t

Tab

\f

Form feed

\d

Any digit, similar to [0-9]

\D

Any nondigit character, similar to [^0-9]

\w

Any alphanumeric character, or the underscore (_), similar to [[:word:]]

\W

Any nonalphanumeric character, except the underscore, similar to [^[:word:]]

\s

Any whitespace character including tab, space, newline, carriage return, and form feed. Similar to [ \t\n\r\f].

\S

Any nonwhitespace character, similar to [^ \t\n\r\f]

\xdd

A hexadecimal representation of a character, where d is a hexadecimal digit

\ddd

An octal representation of a character, where d is an octal digit, in the form \000 to \377
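The following sketch uses a few of the escape sequences from the table above. The search strings and variable names are illustrative assumptions.

```cfml
<!--- \d+ matches each run of digits. --->
<cfset digitsMasked = REReplace("Order 66, aisle 7", "\d+", "N", "ALL")>
<!--- digitsMasked is "Order N, aisle N" --->

<!--- \s+ collapses runs of spaces and tabs into one space. --->
<cfset squeezed = REReplace("a  b#chr(9)#c", "\s+", " ", "ALL")>
<!--- squeezed is "a b c" --->

<!--- \b keeps "cat" from matching inside "concat". --->
<cfset catIndex = REFind("\bcat\b", "concat cat")>
<!--- catIndex is 8 --->

<!--- \W matches the hyphen but not the underscore. --->
<cfset spaced = REReplace("snake_case-name", "\W", " ", "ALL")>
<!--- spaced is "snake_case name" --->
```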

Using character classes

In character sets within regular expressions, you can include a character class. You enclose the character class inside brackets, as the following example shows:

REReplace ("Adobe Web Site","[[:space:]]","*","ALL")

This code replaces all the spaces with *, producing this string:

Adobe*Web*Site

You can combine character classes with other expressions within a character set. For example, the regular expression [[:space:]123] searches for a space, 1, 2, or 3. The following example also uses a character class in a regular expression:

<cfset IndexOfOccurrence=REFind("[[:space:]][A-Z]+[[:space:]]",  
    "Some BIG string")> 
<!--- The value of IndexOfOccurrence is 5 --->

The following table shows the character classes that ColdFusion supports. Regular expressions using these classes match any Unicode character in the class, not just ASCII or ISO-8859 characters.

Character class    Matches

:alpha:            Any alphabetic character.
:upper:            Any uppercase alphabetic character.
:lower:            Any lowercase alphabetic character.
:digit:            Any digit. Same as \d.
:alnum:            Any alphabetic or numeric character.
:xdigit:           Any hexadecimal digit. Same as [0-9A-Fa-f].
:blank:            Space or a tab.
:space:            Any whitespace character. Same as \s.
:print:            Any alphanumeric, punctuation, or space character.
:punct:            Any punctuation character.
:graph:            Any alphanumeric or punctuation character.
:cntrl:            Any character not part of the character classes [:upper:], [:lower:], [:alpha:], [:digit:], [:punct:], [:graph:], [:print:], or [:xdigit:].
:word:             Any alphabetic or numeric character, plus the underscore (_). Same as \w.
:ascii:            The ASCII characters, in the hexadecimal range 0 - 7F.
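A few of these classes can be sketched in use as follows. The search strings and variable names are illustrative assumptions.

```cfml
<!--- Remove every punctuation character. --->
<cfset noPunct = REReplace("Hello, World!", "[[:punct:]]", "", "ALL")>
<!--- noPunct is "Hello World" --->

<!--- Find the first run of uppercase letters. --->
<cfset upperIndex = REFind("[[:upper:]]+", "the BIG one")>
<!--- upperIndex is 5 --->

<!--- Combine a class with a literal character in one set: a digit or an underscore. --->
<cfset mixedIndex = REFind("[[:digit:]_]", "ab_c1")>
<!--- mixedIndex is 3 --->

<!--- Replace each run of digits. --->
<cfset roomMasked = REReplace("Room 12b", "[[:digit:]]+", "N", "ALL")>
<!--- roomMasked is "Room Nb" --->
```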