20 Regular Expressions

Regular expressions are the text processing workhorse of Perl. With regular expressions, you can search strings for patterns, find out what matched the patterns, and substitute the matched patterns with new strings.

There are three different regular expression operators in Perl:

1. match          m{PATTERN}
2. substitute     s{OLDPATTERN}{NEWPATTERN}
3. transliterate  tr{OLD_CHAR_SET}{NEW_CHAR_SET}

Perl allows almost any delimiter in these operators, such as {} or () or // or ##. The most common forms are probably m// and s///, but I prefer to use m{} and s{}{} because they are clearer for me.

There are two ways to "bind" these operators to a string expression:

1. =~   pattern does match string expression
2. !~   pattern does NOT match string expression

Binding can be thought of as "Object Oriented Programming" for regular expressions. Generic OOP structure can be represented as

$subject -> verb ( adjectives, adverbs, etc );

Binding in regular expressions can be looked at in a similar fashion:

$string =~ verb ( pattern );

where "verb" is limited to 'm' for match, 's' for substitute, and 'tr' for transliterate.

You may see Perl code that simply looks like this:

/patt/;

This is functionally equivalent to this:

$_ =~ m/patt/;

Here are some examples:

# spam filter
my $email = "This is a great Free Offer\n";
if($email =~ m{Free Offer}) {
    $email="*deleted spam*\n";
}
print "$email\n";

# upgrade my car
my $car = "my car is a toyota\n";
$car =~ s{toyota}{jaguar};
print "$car\n";

# simple encryption, Caesar cypher
my $love_letter = "How I love thee.\n";
$love_letter =~ tr{A-Za-z}{N-ZA-Mn-za-m};
print "encrypted: $love_letter";
$love_letter =~ tr{A-Za-z}{N-ZA-Mn-za-m};
print "decrypted: $love_letter\n";

> *deleted spam*
> my car is a jaguar
> encrypted: Ubj V ybir gurr.
> decrypted: How I love thee.

The above examples all look for fixed patterns within the string.
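The examples above use only =~ with an explicit variable. The following is a minimal sketch of the other two forms described in this section: the !~ operator and a bare pattern that binds to $_. The list of subject lines and the names @subjects, @kept, and $changed are made up for illustration and are not part of the original examples.

```perl
use strict;
use warnings;

# Hypothetical sample data, invented for this sketch.
my @subjects = ('Free Offer inside', 'Meeting at noon', 'free offer again');

my @kept;
foreach (@subjects) {
    # A bare /patt/ binds to $_; same as: $_ =~ m/Free Offer/
    next if /Free Offer/;
    push @kept, $_;
}
print "kept: $_\n" for @kept;

# !~ is true when the pattern does NOT match the string.
my $car = "my car is a toyota";
print "not a jaguar\n" if $car !~ m{jaguar};

# s{}{} returns the number of substitutions it performed.
my $changed = ($car =~ s{toyota}{jaguar});
print "changed=$changed car=$car\n";
```

> kept: Meeting at noon
> kept: free offer again
> not a jaguar
> changed=1 car=my car is a jaguar

Note that 'free offer again' is kept because patterns are case sensitive unless told otherwise.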
Regular expressions also allow you to look for patterns with different types of "wildcards".

20.1 Variable Interpolation

The braces that surround the pattern act as double-quote marks, subjecting the pattern to one pass of variable interpolation as if the pattern were contained in double-quotes. This allows the pattern to be contained within variables and interpolated during the regular expression.

my $actual = "Toyota";
my $wanted = "Jaguar";
my $car = "My car is a Toyota\n";
$car =~ s{$actual}{$wanted};
print $car;

> My car is a Jaguar

20.2 Wildcard Example

In the example below, we process an array of lines, each containing the pattern {filename: } followed by one or more non-whitespace characters forming the actual filename. Each line also contains the pattern {size: } followed by one or more digits that indicate the actual size of that file.

my @lines = split "\n", <<"MARKER"
filename: output.txt size: 1024
filename: input.dat size: 512
filename: address.db size: 1048576
MARKER
;

foreach my $line (@lines) {
    ####################################
    # \S is a wildcard meaning
    # "anything that is not white-space".
    # the "+" means "one or more"
    ####################################
    if($line =~ m{filename: (\S+)}) {
        my $name = $1;
        ###########################
        # \d is a wildcard meaning
        # "any digit, 0-9".
        ###########################
        $line =~ m{size: (\d+)};
        my $size = $1;
        print "$name,$size\n";
    }
}

> output.txt,1024
> input.dat,512
> address.db,1048576

20.3 Defining a Pattern

A pattern can be a literal pattern such as {Free Offer}. It can contain wildcards such as {\d}. It can also contain metacharacters such as parentheses. Notice in the above example, the parentheses were in the pattern but did not occur in the string, yet the pattern matched.

20.4 Metacharacters

Metacharacters do not get interpreted as literal characters. Instead they tell Perl to interpret the metacharacter (and sometimes the characters around the metacharacter) in a different way.
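To make the difference between a metacharacter and a literal character concrete, here is a small sketch. The sample strings and the names $needle and @report are invented for illustration. It also shows \Q...\E, the standard Perl escape for quoting metacharacters inside an interpolated variable, which matters because of the interpolation described in 20.1.

```perl
use strict;
use warnings;

my @report;

# Sample strings invented for this sketch.
foreach my $input ('3.14', '3x14') {
    # Unescaped "." is a metacharacter: it matches any single character.
    my $meta    = ($input =~ m{3.14})  ? 'yes' : 'no';
    # "\." is a literal period: the backslash removes the special meaning.
    my $literal = ($input =~ m{3\.14}) ? 'yes' : 'no';
    push @report, "$input: meta=$meta literal=$literal";
}

# An interpolated variable is still treated as a pattern, so a "." inside
# it acts as a metacharacter unless it is quoted with \Q...\E.
my $needle   = 'a.b';
my $unquoted = ('axb' =~ m{$needle})     ? 'yes' : 'no';
my $quoted   = ('axb' =~ m{\Q$needle\E}) ? 'yes' : 'no';
push @report, "axb: unquoted=$unquoted quoted=$quoted";

print "$_\n" for @report;
```

> 3.14: meta=yes literal=yes
> 3x14: meta=yes literal=no
> axb: unquoted=yes quoted=no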
The following are metacharacters in Perl regular expression patterns:

\ | ( ) [ ] { } ^ $ * + ? .

\      (backslash) If the next character combined with this backslash forms a character class shortcut, then match that character class. If it is not a shortcut, then simply treat the next character as a non-metacharacter.
|      alternation: (patt1|patt2) means (patt1 OR patt2)
( )    grouping (clustering) and capturing
(?: )  grouping (clustering) only. no capturing. (somewhat faster)
.      match any single character (usually not "\n")
[ ]    define a character class, match any single character in class
*      (quantifier): match previous item zero or more times
+      (quantifier): match previous item one or more times
?      (quantifier): match previous item zero or one time
{ }    (quantifier): match previous item a number of times in given range
^      (position marker): beginning of string (or possibly after "\n")
$      (position marker): end of string (or possibly before "\n")

Examples below. Change the value assigned to $str and re-run the script. Experiment with what matches and what does not match the different regular expression patterns.

my $str = "Dear sir, hello and goodday! "
." dogs and cats and sssnakes put me to sleep."
." zzzz. Hummingbirds are ffffast. "
." Sincerely, John";

# | alternation
# match "hello" or "goodbye"
if($str =~ m{hello|goodbye}){warn "alt";}

# () grouping and capturing
# match 'goodday' or 'goodbye'
if($str =~ m{(good(day|bye))})
{warn "group matched, captured '$1'";}

# . any single character
# match 'cat' 'cbt' 'cct' 'c%t' 'c+t' 'c?t' ...
if($str =~ m{c.t}){warn "period";}

# [] define a character class: 'a' or 'o' or 'u'
# match 'cat' 'cot' 'cut'
if($str =~ m{c[aou]t}){warn "class";}

# * quantifier, match previous item zero or more
# match '' or 'z' or 'zz' or 'zzz' or 'zzzzzzzz'
if($str =~ m{z*}){warn "asterisk";}

# + quantifier, match previous item one or more
# match 'snake' 'ssnake' 'sssssssnake'
if($str =~ m{s+nake}){warn "plus sign";}

# ? quantifier, previous item is optional
# match only 'dog' and 'dogs'
if($str =~ m{dogs?}){warn "question";}

# {} quantifier, match previous, 3 <= qty <= 5
# match only 'fffast', 'ffffast', and 'fffffast'
if($str =~ m{f{3,5}ast}){warn "curly brace";}

# ^ position marker, matches beginning of string
# match 'Dear' only if it occurs at start of string
if($str =~ m{^Dear}){warn "caret";}

# $ position marker, matches end of string
# match 'John' only if it occurs at end of string
if($str =~ m{John$}){warn "dollar";}

> alt at ...
> group matched, captured 'goodday' at ...
> period at ...
> class at ...
> asterisk at ...
> plus sign at ...
> question at ...
> curly brace at ...
> caret at ...
> dollar at ...

20.5 Capturing and Clustering Parentheses

Normal parentheses will both cluster and capture the pattern they contain. Clustering affects the order of evaluation similar to the way parentheses affect the order of evaluation within a mathematical expression. Normally, multiplication has a higher precedence than addition. The expression "2 + 3 * 4" does the multiplication first and then the addition, yielding the result of "14". The expression "(2 + 3) * 4" forces the addition to occur first, yielding the result of "20".

Clustering parentheses work in the same fashion. The pattern {cats?} will apply the "?" quantifier to the letter "s", matching either "cat" or "cats". The pattern {(cats)?} will apply the "?" quantifier to the entire pattern within the parentheses, matching "cats" or the empty string.

20.5.1 $1, $2, $3, etc Capturing parentheses

Clustering parentheses will also capture the part of the string that matched the pattern within the parentheses. The captured values are accessible through some "magical" variables called $1, $2, $3, ... Each left parenthesis increments the number used to access the captured string. The left parentheses are counted from left to right as they occur within the pattern, starting at 1.
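The left-to-right counting rule is easiest to see with nested parentheses. The sketch below reuses the {(good(day|bye))} pattern from section 20.4; the sample string and the names $outer and $inner are made up for illustration.

```perl
use strict;
use warnings;

# Sample string invented for this sketch.
my $str = "hello and goodday";

#             $1   $2
#             |    |
if ($str =~ m{(good(day|bye))}) {
    # $1 belongs to the first "(" (the outer pair),
    # $2 belongs to the second "(" (the inner pair).
    print "outer \$1 = '$1'\n";
    print "inner \$2 = '$2'\n";
}

# In list context, a successful match returns the captured
# values in the same order as $1, $2, ...
my ($outer, $inner) = $str =~ m{(good(day|bye))};
print "list: $outer, $inner\n";
```

> outer $1 = 'goodday'
> inner $2 = 'day'
> list: goodday, day

Assigning the captures in list context is one way to copy them into named variables in the same statement that performs the match.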
my $test="Firstname: John Lastname: Smith";
############################################
#                   $1              $2
$test=~m{Firstname: (\w+) Lastname: (\w+)};
my $first = $1;
my $last = $2;
print "Hello, $first $last\n";

> Hello, John Smith

Because capturing takes a little extra time to store the captured result into the $1, $2, ... variables, sometimes you just want to cluster without the overhead of capturing. In the example below, we want to cluster "day|bye" so that the alternation symbol "|" will go with "day" or "bye". Without the clustering parentheses, the pattern would match "goodday" or "bye", rather than "goodday" or "goodbye". The pattern contains capturing parentheses around the entire pattern, so we do not need to capture the "day|bye" part of the pattern, therefore we use cluster-only parentheses.

if($str =~ m{(good(?:day|bye))})
{warn "group matched, captured '$1'";}

Cluster-only parentheses don't capture the enclosed pattern, and they don't count when determining which magic variable, $1, $2, $3 ..., will contain the values from the capturing parentheses.

my $test = 'goodday John';
##########################################
#             $1                $2
if($test =~ m{(good(?:day|bye)) (\w+)})
{ print "You said $1 to $2\n"; }

> You said goodday to John

20.5.2 Capturing parentheses not capturing

If a regular expression containing capturing parentheses does not match the string, the magic variables $1, $2, $3, etc will retain whatever PREVIOUS value they had from any PREVIOUS regular expression. This means that you MUST check to make sure the regular expression matches BEFORE you use the $1, $2, $3, etc variables.

In the example below, the second regular expression does not match, therefore $1 retains its old value of 'be'. Instead of printing out something like "Name is Horatio" or "Name is" and failing on an undefined value, Perl keeps the old value for $1 and prints "Name is 'be'".
my $string1 = 'To be, or not to be';
$string1 =~ m{not to (\w+)}; # matches, $1='be'
warn "The question is to $1";

my $string2 = 'that is the question';
$string2 =~ m{I knew him once, (\w+)}; # no match
warn "Name is '$1'";
# no match, so $1 retains its old value 'be'

> The question is to be at ./script.pl line 7.
> Name is 'be' at ./script.pl line 11.

20.6 Character Classes

The "." metacharacter will match any single character (by default, anything except "\n"). This is roughly equivalent to a character class that includes every possible character. You can easily define smaller character classes of your own using the square brackets []. Whatever characters are listed within the square brackets are part of that character class. Perl will then match any one character within that class.

[aeiouAEIOU]  any vowel
[0123456789]  any digit

20.6.1 Metacharacters Within Character Classes

Within the square brackets used to define a character class, all previously defined metacharacters cease to act as metacharacters and are interpreted as simple literal characters. Character classes have their own special metacharacters.

\   (backslash) Remove the special meaning of the next character, so that it is treated as a literal.
-   (hyphen) Indicates a consecutive character range, inclusively. [a-f] indicates the letters a,b,c,d,e,f. Character ranges are based on ASCII numeric values.
^   If it is the first character of the class, then this indicates the class is any character EXCEPT the ones in the square brackets.

Warning: [^aeiou] means anything but a lower case vowel. This is not the same as "any consonant". The class [^aeiou] will also match punctuation, numbers, and Unicode characters.

20.7 Shortcut Character Classes

Perl has shortcut character classes for some of the more common classes.
shortcut  class          description
\d        [0-9]          any *d*igit
\D        [^0-9]         any NON-digit
\s        [ \t\n\r\f]    any white*s*pace
\S        [^ \t\n\r\f]   any NON-whitespace
\w        [a-zA-Z0-9_]   any *w*ord character (valid perl identifier)
\W        [^a-zA-Z0-9_]  any NON-word character

20.8 Greedy (Maximal) Quantifiers

Quantifiers are used within regular expressions to indicate how many times the previous item occurs within the pattern. By default, quantifiers are "greedy" or "maximal", meaning that they will match as many characters as possible and still be true.

*          match zero or more times (match as much as possible)
+          match one or more times (match as much as possible)
?          match zero or one times (match as much as possible)
{count}    match exactly "count" times
{min,}     match at least "min" times (match as much as possible)
{min,max}  match at least "min" and at most "max" times (match as much as possible)

20.10 Position Assertions / Position Anchors

Inside a regular expression pattern, some symbols do not translate into a character or character class. Instead, they translate into a "position" within the string. If a position anchor occurs within a pattern, the pattern before and after that anchor must occur at a certain position within the string.

^   Matches the beginning of the string. If the /m (multiline) modifier is present, also matches just after any "\n" within the string.
$   Matches the end of the string, or just before a "\n" that ends the string. If the /m (multiline) modifier is present, also matches just before any "\n" within the string.
\A  Matches the beginning of the string only. Not affected by the /m modifier.
\z  Matches the end of the string only. Not affected by the /m modifier.
\Z  Matches the end of the string, or just before a "\n" if that is the last character in the string (as if the string had been chomp()ed). Not affected by the /m modifier.
\b  word "b"oundary. A word boundary occurs in four places:
    1) at a transition from a \w character to a \W character
    2) at a transition from a \W character to a \w character
    3) at the beginning of the string (if the first character is a \w character)
    4) at the end of the string (if the last character is a \w character)
\B  NOT \b
\G  Usually used with the /g modifier (you probably want the /c modifier too).
    Indicates the position just after the end of the last pattern match performed on the string. If this is the first regular expression being performed on the string, then \G will match the beginning of the string. Use the pos() function to get and set the current \G position within the string.

20.10.1 The \b Anchor

Use the \b anchor when you want to match a whole word pattern but not part of a word. This example matches "jump" but not "jumprope":

my $test1='He can jump very high.';
if($test1=~m{\bjump\b}) {
    print "test1 matches\n";
}