Lesson 22e: Regular Expression: Subpatterns

Subpatterns

You can extract and manipulate subpatterns in regular expressions.

To designate a subpattern, surround its part of the pattern with
parentheses (the same syntax as the grouping operator). This example
has just one subpattern, (.+) :

  /Who's afraid of the big bad w(.+)f/

You can combine parentheses and quantifiers to quantify entire
subpatterns:

  /Who's afraid of the big (bad )?wolf\?/;

This matches "Who's afraid of the big bad wolf?"
as well as "Who's afraid of the big wolf?"

This example also shows how to match a special character literally:
put a backslash (\) in front of it, as in \? above.
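
A minimal sketch of the optional subpattern in action. The test strings
in @lines are made up for illustration; they are not part of the lesson's
data.

```perl
use strict;
use warnings;

# Hypothetical test strings: only the first two should match.
my @lines = (
    "Who's afraid of the big bad wolf?",
    "Who's afraid of the big wolf?",
    "Who's afraid of the big bad wolf",    # no trailing question mark
);

my @matched;
for my $line (@lines) {
    # (bad )? makes the whole subpattern optional;
    # \? matches a literal question mark.
    push @matched, $line
        if $line =~ /Who's afraid of the big (bad )?wolf\?/;
}

print "$_\n" for @matched;
```

The third string is rejected because \? requires a literal "?" after
"wolf".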

Matching Subpatterns

Once a subpattern matches, you can refer to it later within the same
regular expression. The first subpattern becomes \1, the second \2,
the third \3, and so on.

  while (my $line = <FASTA>) {
    chomp $line;
    # Match against $line explicitly; a bare /.../ would test $_ instead.
    print "I'm scared!\n" if $line =~ /Who's afraid of the big bad w(.)\1f/;
  }

This loop will print "I'm scared!" for the following matching lines:

  • Who's afraid of the big bad woof
  • Who's afraid of the big bad weef
  • Who's afraid of the big bad waaf

but not

  • Who's afraid of the big bad wolf
  • Who's afraid of the big bad wife

In a similar vein, /\b(\w+)s love \1 food\b/ will match "dogs
love dog food", but not "dogs love monkey food".
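
The backreference behaviour can be checked with a short sketch like the
one below. The @sentences array is a made-up test set, not something
from the lesson.

```perl
use strict;
use warnings;

# Hypothetical sentences for illustration.
my @sentences = (
    "dogs love dog food",
    "dogs love monkey food",
    "cats love cat food",
);

# \1 must repeat exactly what (\w+) captured earlier in the same match.
my @hits = grep { /\b(\w+)s love \1 food\b/ } @sentences;

print "$_\n" for @hits;
```

Only the sentences in which the animal name repeats are kept, because \1
matches the captured text itself, not the pattern that captured it.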

Using Subpatterns Outside the Regular Expression Match

Outside the regular expression match statement, the matched
subpatterns (if any) can be found in the variables $1, $2, $3, and
so forth.

Example. Extract 50 base pairs upstream and 25 base pairs downstream
of the TATTAT consensus transcription start site:

  while (my $line = <FASTA> ) {
    chomp $line;
    next unless $line =~ /(.{50})TATTAT(.{25})/;
    my $upstream = $1;
    my $downstream = $2;
  }
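
To try this without a FASTA file, here is a self-contained sketch that
uses a synthetic sequence. The sequence in $seq is invented for
illustration: 50 A's upstream and 25 C's downstream of the site, padded
with G's.

```perl
use strict;
use warnings;

# Synthetic sequence (an assumption for this sketch, not real data).
my $seq = ('G' x 10) . ('A' x 50) . 'TATTAT' . ('C' x 25) . ('G' x 10);

my ($upstream, $downstream);
if ($seq =~ /(.{50})TATTAT(.{25})/) {
    # Read $1 and $2 only after a successful match; otherwise they
    # still hold values from an earlier match, if any.
    ($upstream, $downstream) = ($1, $2);
    print "upstream length   = ", length($upstream), "\n";
    print "downstream length = ", length($downstream), "\n";
}
```

Testing the match before reading $1 and $2 plays the same role as the
"next unless" guard in the loop above.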

Extracting Subpatterns Using Arrays

If you assign a regular expression match to an array or a list of
variables, the match returns a list of all the subpatterns that
matched. Here is an alternative implementation of the previous example:

  while (my $line = <FASTA> ) {
    chomp $line;
    my ($upstream,$downstream) = $line =~ /(.{50})TATTAT(.{25})/;
  }

If the regular expression doesn't match at all, then it returns an
empty list. Since assigning an empty list evaluates to FALSE, you can
use the assignment in a logical test:

  while (my $line = <FASTA> ) {
    chomp $line;
    next unless my($upstream,$downstream) = $line =~ /(.{50})TATTAT(.{25})/;
    print "upstream = $upstream\n";
    print "downstream = $downstream\n";
  }
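
The difference between a successful and a failed match in list context
can be seen in a small sketch. Both sequences below are invented for
illustration.

```perl
use strict;
use warnings;

# Hypothetical sequences: the first has the site with enough flanking
# bases on both sides, the second has no TATTAT at all.
my $with_site    = ('A' x 50) . 'TATTAT' . ('C' x 25);
my $without_site = 'ACGT' x 20;

my @found   = $with_site    =~ /(.{50})TATTAT(.{25})/;
my @missing = $without_site =~ /(.{50})TATTAT(.{25})/;

print "captures on match:    ", scalar(@found),   "\n";   # 2
print "captures on no match: ", scalar(@missing), "\n";   # 0
```

An empty @missing is what makes the "next unless" test in the loop above
skip lines that do not contain the site.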

Subpatterns and Greediness

By default, regular expressions are "greedy". They try to match as
much as they can. For example:

  $h = 'The fox ate my box of doughnuts';
  $h =~ /(f.+x)/;
  $subpattern = $1;

Because of the greediness of the match, $subpattern will
contain "fox ate my box" rather than just "fox".

To match the minimum number of times, put a ? after the quantifier,
like this:

  $h = 'The fox ate my box of doughnuts';
  $h =~ /(f.+?x)/;
  $subpattern = $1;

Now $subpattern will contain "fox". This is called lazy
matching.

Lazy matching works with any quantifier, such as +?, *? and {2,50}?.
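
A short side-by-side sketch of greedy and lazy quantifiers on the same
string. The variable names ($greedy, $lazy, $lazy_range) are
illustrative only.

```perl
use strict;
use warnings;

my $h = 'The fox ate my box of doughnuts';

# Greedy: .+ grabs as much as possible, then backs off to the last "x".
my ($greedy) = $h =~ /(f.+x)/;             # "fox ate my box"

# Lazy: .+? stops at the first "x" that lets the match succeed.
my ($lazy) = $h =~ /(f.+?x)/;              # "fox"

# A lazy range still needs its minimum: at least 2 characters must sit
# between the "o" and the "x", so the "x" right after "fo" cannot end
# the match.
my ($lazy_range) = $h =~ /(o.{2,50}?x)/;   # "ox ate my box"

print "$greedy\n$lazy\n$lazy_range\n";
```

The last case shows that a lazy quantifier takes the fewest characters
that still allow the whole pattern to match, which is not always the
shortest substring you might expect.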
