Lesson 24: System calls


You will often need to run commands in your script that you would normally run on the command line. For example, you might want to run a series of scripts from a single script.

There are two common ways to do this.

System

system is a function that takes the command line statement to be executed as its first argument and returns the exit status of that command. Output cannot be captured in a variable using system, but see backticks `` below for this feature.

  • 0 means the command ran without errors
  • -1 means the command could not be started (print "$!" for the reason)
  • any other value means the command ran but reported a failure

Code:

my $sys = system ("date");
print "sys: $sys\n";

Output:

%% ./system.pl
Wed May  9 07:55:04 PDT 2012
sys: 0
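
The return value of system packs more than a plain 0 or -1, so it can help to decode it. The following is a minimal sketch, not the only way to do it: it runs the perl interpreter itself (through the special variable $^X) as the child command so that the command is sure to exist, and the exit code 3 is an arbitrary value chosen for illustration.

```perl
use strict;
use warnings;

# $^X holds the path of the perl interpreter running this script,
# so the child command exists wherever this example is run.
my @command = ($^X, '-e', 'exit 3');

# List form: the arguments go straight to the program, with no shell in between.
my $status = system(@command);

my $exit_code;
if ($status == -1) {
    print "could not start the command: $!\n";
}
elsif ($status & 127) {
    printf "command was stopped by signal %d\n", $status & 127;
}
else {
    $exit_code = $status >> 8;    # the exit code is stored in the high byte
    print "command exited with code $exit_code\n";
}
```

Shifting the status right by 8 bits recovers the number the command passed to exit, which is usually what you want to test.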

Backticks ``

Backticks can be used to execute a command in your script. The result is the output of the command, which can be captured in a variable. You can then work with the contents of that variable.

Code:

my $output = `echo "using backticks is helpful"`;
##can do stuff to output
$output = uc ($output);
print "$output\n";

Output:

%% system.pl
USING BACKTICKS IS HELPFUL
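
Backticks can also be used in list context, where each line of output becomes one element of an array. The sketch below assumes a Unix-like shell where echo is available; the two echoed strings are made up for the example.

```perl
use strict;
use warnings;

# In list context, backticks return one element per line of output.
my @lines = `echo "first line"; echo "second line"`;
chomp @lines;                  # remove the trailing newline from every element

# $? holds the status of the command that just ran, as with system.
my $exit_code = $? >> 8;

print "got ", scalar(@lines), " lines, exit code $exit_code\n";
print "line: $_\n" for @lines;
```

Checking $? after the backticks lets you notice a failed command even though the return value is the output rather than the status.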

Exercises

  1. Create a script that uses a system call, using the system function, to run one of your already written scripts. Collect the output of the system function in a variable and print it to the screen. This output is a code to indicate the success of the call.
  2. Change the script to run your system call using backticks. Collect the output and print it to the screen.

Lesson 23: Subroutines or Custom functions


Is there a block of code that you use more than once in your script? If so you should write a subroutine.

  • A subroutine is a custom function
  • Allows you to reduce the chances of introducing an error into repetitive blocks of code
  • If you decide to change your block of code, you only have to change it in one place
  • Simplifies the flow of your script. Now you have a useful function name instead of many lines of code
  • You can pass arguments to the subroutine
  • You can have your subroutine return values

To make a subroutine

  1. Place the subroutine anywhere in your script. By convention, subroutines go at the bottom, below the code that calls them.
  2. Use the sub keyword.
  3. Give it an informative name.
  4. Arguments come in on a special array called '@_', or the magic carpet array. It is very similar to @ARGV.
  5. Use the return function to return values.

Code:

my $answer = doSomeMath (3 , 4 , 6);
print "the answer is $answer\n";
 
##subroutines
sub doSomeMath {
  my @numbers = @_;
  my $sum;
  foreach my $num (@numbers){
    ## adding $num to the previous value of $sum
    $sum += $num;
  }
  my $product = $sum * 2;
  foreach my $num (@numbers){
    ## multiplying each $num to the previous value of $product
    $product *= $num;
  }
  return $product;
}

Output:

%% perl sub.pl
the answer is 1872
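
A subroutine can also return more than one value. Here is a small sketch in the same style as doSomeMath; the name min_max is made up for this illustration.

```perl
use strict;
use warnings;

# min_max is a hypothetical example subroutine: it returns a two-element list.
sub min_max {
    my @numbers = @_;                       # copy the arguments out of @_
    my ($min, $max) = ($numbers[0], $numbers[0]);
    foreach my $num (@numbers) {
        $min = $num if $num < $min;
        $max = $num if $num > $max;
    }
    return ($min, $max);
}

# The returned list is unpacked into two scalars.
my ($smallest, $largest) = min_max(3, 4, 6);
print "smallest: $smallest, largest: $largest\n";
```

Running this prints "smallest: 3, largest: 6".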

Note:

  • Arrays can be passed into and out of a subroutine, but if more than one is passed in or out, the contents will be merged into one list.
  • Hashes are lists of key/value pairs so they can also be passed into and out of a subroutine.
  • More complicated data structures do not get passed around nicely unless you pass just a reference to the subroutine.
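
The points above can be sketched as follows. The subroutine names and the example numbers are invented for the illustration; the first subroutine shows two arrays merging into one list, the second keeps them apart by passing references.

```perl
use strict;
use warnings;

# Passed directly, the two arrays arrive in @_ as one flat list.
sub count_flat {
    my @all = @_;
    return scalar @all;
}

# Passed as references, each array stays separate.
sub count_each {
    my ($first_ref, $second_ref) = @_;
    return (scalar @$first_ref, scalar @$second_ref);
}

my @exons   = (120, 85, 300);
my @introns = (1500, 42);

my $flat = count_flat(@exons, @introns);
my ($n_exons, $n_introns) = count_each(\@exons, \@introns);

print "flat list has $flat elements\n";
print "by reference: $n_exons exons, $n_introns introns\n";
```

The flat call sees 5 elements and cannot tell where one array ended, while the reference version reports 3 and 2.
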

Exercises

    1. Create a factorial subroutine that takes one number as an argument, calculates the factorial of that number, and returns the result.

    Lesson 22i: Regular Expression: Restriction Digest Exercises

    Exercises

    1. The enzyme ApoI has a restriction site: R^AATTY where R and Y are degenerate nucleotides. See the IUPAC table to identify the nucleotide possibilities for the R and Y.

      Write a regular expression that will match occurrences of the site in a sequence. (hint: what are you going to do about the actual cut site, represented by the ‘^’?)

    2. Use the regular expression you just wrote to find all the restriction sites in the following sequence. Be sure to think about how to handle the newlines!
      GAATTCAAGTTCTTGTGCGCACACAAATCCAATAAAAACTATTGTGCACACAGACGCGAC
      TTCGCGGTCTCGCTTGTTCTTGTTGTATTCGTATTTTCATTTCTCGTTCTGTTTCTACTT
      AACAATGTGGTGATAATATAAAAAATAAAGCAATTCAAAAGTGTATGACTTAATTAATGA
      GCGATTTTTTTTTTGAAATCAAATTTTTGGAACATTTTTTTTAAATTCAAATTTTGGCGA
      AAATTCAATATCGGTTCTACTATCCATAATATAATTCATCAGGAATACATCTTCAAAGGC
      AAACGGTGACAACAAAATTCAGGCAATTCAGGCAAATACCGAATGACCAGCTTGGTTATC
      AATTCTAGAATTTGTTTTTTGGTTTTTATTTATCATTGTAAATAAGACAAACATTTGTTC
      CTAGTAAAGAATGTAACACCAGAAGTCACGTAAAATGGTGTCCCCATTGTTTAAACGGTT
      GTTGGGACCAATGGAGTTCGTGGTAACAGTACATCTTTCCCCTTGAATTTGCCATTCAAA
      ATTTGCGGTGGAATACCTAACAAATCCAGTGAATTTAAGAATTGCGATGGGTAATTGACA
      TGAATTCCAAGGTCAAATGCTAAGAGATAGTTTAATTTATGTTTGAGACAATCAATTCCC
      CAATTTTTCTAAGACTTCAATCAATCTCTTAGAATCCGCCTCTGGAGGTGCACTCAGCCG
      CACGTCGGGCTCACCAAATATGTTGGGGTTGTCGGTGAACTCGAATAGAAATTATTGTCG
      CCTCCATCTTCATGGCCGTGAAATCGGCTCGCTGACGGGCTTCTCGCGCTGGATTTTTTC
      ACTATTTTTGAATACATCATTAACGCAATATATATATATATATATTTAT
      
    3. Determine the site(s) of the cut in the above sequence. Print out the sequence with “^” at the cut site.

      Hints:
      Use subpatterns (parentheses and $1, $2) to find the cut site within the pattern.
      Use s///

      Example: if the pattern is GACGT^CT the following sequence

      AAAAAAAAGACGTCTTTTTTTAAAAAAAAGACGTCTTTTTTT

      would be cut like this:

      AAAAAAAAGACGT^CTTTTTTTAAAAAAAAGACGT^CTTTTTTT

    4. Now that you’ve done your restriction digest, determine the lengths of your fragments and sort them by length (in the same order they would separate on an electrophoresis gel).

      Hint: take a look at the split man page or think about storing your matches in an array. With one of these two approaches you should be able to convert this string:

      AAAAAAAAGACGT^CTTTTTTTAAAAAAAAGACGT^CTTTTTTT

      into this array:

      ("AAAAAAAAGACGT","CTTTTTTTAAAAAAAAGACGT","CTTTTTTT")

    Lesson 22h: Regular Expression: Testing your RE

    Testing Your Regular Expressions

    To be sure that you are getting what you think you want, you can use the following
    "Magic" Perl Automatic Match Variables $`, $&, and $'

    Everything before the pattern begins is stored in:

    $`

    Everything within the pattern is stored in:

    $&

    Everything after the pattern ends is stored in:

    $'

    Code:

    if ("Hello there, neighbor" =~ /\s(\w+),/){
            print "That actually matched '$&'.\n";
            print "That was ($`) ($&) ($').\n";
    }

    Output:

    That actually matched ' there,'.
    That was (Hello) ( there,) ( neighbor).
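
The same check works on sequence data. This is a small sketch with a made-up sequence; it copies the three match variables into named variables straight away, because the next successful match will overwrite them.

```perl
use strict;
use warnings;

my $seq = 'AAGAATTCTT';       # hypothetical test sequence
my ($before, $matched, $after);

if ($seq =~ /GAATTC/) {
    # Save the match variables before another match replaces them.
    ($before, $matched, $after) = ($`, $&, $');
    print "before: $before\n";
    print "match:  $matched\n";
    print "after:  $after\n";
}
```

Here the three parts are AA, GAATTC and TT, which joined together give back the original sequence.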
    

    Lesson 22g: Regular Expression : Global and other options

    Regular Expression Options

    Regular expression matches and substitutions have a whole set of
    options which you can toggle on by appending one or more of the i,
    m, s, g, e, o
    or x modifiers to the end of the operation.

    See Programming Perl
    Page 153 for more information. Some examples:

    $string = 'Big Bad WOLF!';
    print "There's a wolf in the closet!" if $string =~ /wolf/i;
    # i is used for a case insensitive match

    i
    Case insensitive match.

    g
    Global match (see below).

    e
    Evaluate right side of s/// as an expression.

    m
    Treat string as multiple lines. ^ and $ will match at start
    and end of internal lines, as well as at beginning and end of
    whole string. Use \A and \Z to match beginning and end of whole
    string when this is turned on.

    s
    Treat string as a single line. "." will match any character at
    all, including newline.

    x
    Ignore whitespace in the pattern and allow comments, so a long
    pattern can be spread over several lines.

    o
    Compile the pattern only once. Use this when a variable used in the pattern will never change, so perl will not interpolate the variable again.
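
To see several of these options at work, here is a short sketch. The FASTA text and the other strings are invented for the example, and each statement uses only the modifiers named in its comment.

```perl
use strict;
use warnings;

my $fasta = ">seq1\nACGT\n>seq2\nGGCC\n";

# m: ^ matches at the start of every line, so both headers are found (g finds them all)
my @headers = $fasta =~ /^>(\w+)/mg;

# s: "." also matches newline, so ".+?" can cross the line break
my ($record) = $fasta =~ /(>seq1.+?)>seq2/s;

# x: whitespace and comments inside the pattern are ignored
my $has_start = 'CCATGTT' =~ / A T G   # a start codon
                             /x ? 'yes' : 'no';

# e: the replacement is evaluated as Perl code
my $sizes = 'exon 10 exon 25';
$sizes =~ s/(\d+)/$1 * 3/ge;

print "headers: @headers\n";
print "start codon: $has_start\n";
print "sizes: $sizes\n";
```

The last substitution multiplies every number it finds by three, giving "exon 30 exon 75".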

    Global Matches

    Adding the g modifier to the pattern causes the match to be
    global. Called in a scalar context (such as an if or
    while statement), each evaluation finds the next match, so a
    while loop runs once for every match in the string.

    This will match all codons in a DNA sequence, printing them out on
    separate lines:

    Code:

      $sequence = 'GTTGCCTGAAATGGCGGAACCTTGAA';
      while ( $sequence =~ /(.{3})/g ) {
        print $1,"\n";
      }

    Output:

    GTT
    GCC
    TGA
    AAT
    GGC
    GGA
    ACC
    TTG
    

    If you perform a global match in a list context (e.g. assign
    its result to an array), then you get a list of all the subpatterns
    that matched from left to right. This code fragment gets arrays of
    codons in three reading frames:

    @frame1 = $sequence =~ /(.{3})/g;
    @frame2 = substr($sequence,1) =~ /(.{3})/g;
    @frame3 = substr($sequence,2) =~ /(.{3})/g;
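
The three frames can be printed with a small loop. This is just one way to wrap the idea; reading_frame is a made-up helper name and the sequence is the same one used in the codon example.

```perl
use strict;
use warnings;

my $sequence = 'GTTGCCTGAAATGGCGGAACCTTGAA';

# reading_frame is a hypothetical helper; $offset is 0, 1 or 2.
sub reading_frame {
    my ($seq, $offset) = @_;
    # A global match in list context returns every captured codon.
    my @codons = substr($seq, $offset) =~ /(.{3})/g;
    return @codons;
}

foreach my $offset (0 .. 2) {
    my @codons = reading_frame($sequence, $offset);
    print "frame ", $offset + 1, ": @codons\n";
}
```

Leftover bases at the end of a frame that do not fill a whole codon are simply dropped by the pattern.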

    The position of the most recent match can be determined by using the
    pos function. The pos function returns the position where the next
    attempt begins. Remember that pos returns a 0-based position: the first position is 0, not 1.

    Code:

    #file:pos.pl
    my $seq = "XXGGATCCXX";

    if ( $seq =~ /(GGATCC)/gi ){
      my $pos = pos($seq);
      print "Our Sequence: $seq\n";
      print '$pos = ', "1st position after the match: $pos\n";
      print '$pos - length($1) = 1st position of the match: ',($pos-length($1)),"\n";
      print '($pos - length($1))-1 = 1st position before the match: ',($pos-length($1)-1),"\n";
    }

    Output:

    ~]$ ./pos.pl
    Our Sequence: XXGGATCCXX
    $pos = 1st position after the match: 8
    $pos - length($1) = 1st position of the match: 2
    ($pos - length($1))-1 = 1st position before the match: 1
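
Combining pos with a global match in a while loop gives the location of every occurrence, not just the first. A minimal sketch, using a made-up sequence that contains the site twice:

```perl
use strict;
use warnings;

my $seq = 'XXGGATCCXXGGATCCXX';   # hypothetical sequence with two sites
my @starts;

# In scalar context, each pass through the loop resumes where the last match ended.
while ($seq =~ /GGATCC/g) {
    my $end   = pos($seq);             # first position after the match
    my $start = $end - length($&);     # first position of the match (0-based)
    push @starts, $start;
    print "match at $start, next search begins at $end\n";
}
```

This reports matches starting at positions 2 and 10.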
    

    Lesson 22f: Regular Expression Part VI s/// tr///

    String Substitution

    String substitution allows you to replace a pattern or character range
    with another one using the s/// and tr/// functions.

    The s/// Function

    s/// has two parts: the regular expression and the string
    to replace it with: s/expression/replacement/.

    $h = "Who's afraid of the big bad wolf?";
    $i = "He had a wife.";
     
    $h =~ s/w.+f/goat/;  # yields "Who's afraid of the big bad goat?"
    $i =~ s/w.+f/goat/;  # yields "He had a goate."

    If you extract pattern matches, you can use them in the replacement
    part of the substitution:

    $h = "Who's afraid of the big bad wolf?";
     
    $h =~ s/(\w+) (\w+) wolf/$2 $1 wolf/;
    # yields "Who's afraid of the bad big wolf?"

    Using a Variable in the Substitution Part

    Yes you can:

    $h = "Who's afraid of the big bad wolf?";
    $animal = 'hyena';
    $h =~ s/(\w+) (\w+) wolf/$2 $1 $animal/;
    # yields "Who's afraid of the bad big hyena?"
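
By default s/// replaces only the first match. Adding the g option replaces every match, and the substitution returns how many replacements were made. A short sketch with an invented DNA string:

```perl
use strict;
use warnings;

my $dna = 'ATGTTTGGT';    # hypothetical sequence
my $rna = $dna;           # work on a copy so $dna is left unchanged

# With g, every T is replaced; the return value is the number of replacements.
my $replaced = ($rna =~ s/T/U/g);

print "$dna -> $rna ($replaced substitutions)\n";
```

Here all five Ts are changed, giving AUGUUUGGU.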

    Translating Character Ranges

    The tr/// function allows you to translate one set of
    characters into another. Specify the source set in the first part of
    the function, and the destination set in the second part:

    $h = "Who's afraid of the big bad wolf?";
    $h =~ tr/ao/AO/; # yields "WhO's AfrAid Of the big bAd wOlf?";

    tr/// returns the number of characters transformed, which is
    sometimes handy for counting the number of a particular character
    without actually changing the string.

    This example counts N's in a series of DNA sequences:

    Code:

      while (my $line = <>) {
        chomp $line;   # assume one sequence per line
        my $count = $line =~ tr/Nn/Nn/;
        print "Sequence $line contains $count Ns\n";
      }

    Output:

    (~) 50% count_Ns.pl sequence_list.txt
    Sequence 1 contains 0 Ns
    Sequence 2 contains 3 Ns
    Sequence 3 contains 1 Ns
    Sequence 4 contains 0 Ns
    ...
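
Two other common uses of tr/// on sequences are counting G and C and building a reverse complement. This is a minimal sketch with a made-up sequence; an empty replacement list makes tr/// count without changing the string.

```perl
use strict;
use warnings;

my $seq = 'GGATCCN';     # hypothetical sequence

# Empty replacement list: tr only counts, the string is left as it is.
my $gc_count = ($seq =~ tr/GCgc//);

# Reverse the string, then swap each base for its complement.
my $revcomp = scalar reverse $seq;
$revcomp =~ tr/ACGTacgt/TGCAtgca/;

print "GC count: $gc_count\n";
print "reverse complement: $revcomp\n";
```

Characters that are not listed, such as N, pass through tr/// untouched.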
    

    Lesson 22e: Regular Expression: Subpatterns

    Subpatterns

    You can extract and manipulate subpatterns in regular expressions.

    To designate a subpattern, surround its part of the pattern with
    parentheses (same as with the grouping operator). This example has
    just one subpattern, (.+) :

     /Who's afraid of the big bad w(.+)f/

    You can combine parentheses and quantifiers to quantify entire
    subpatterns:

    /Who's afraid of the big (bad )?wolf\?/;

    This matches "Who's afraid of the big bad wolf?"
    as well as "Who's afraid of the big wolf?"

    This also shows how to literally match the special characters -- put a
    backslash (\) in front of them.

    Matching Subpatterns

    Once a subpattern matches, you can refer to it later within the same
    regular expression. The first subpattern becomes \1, the second \2,
    the third \3, and so on.

      while (my $line = <FASTA>) {
        chomp $line;
        print "I'm scared!\n" if $line =~ /Who's afraid of the big bad w(.)\1f/;
      }

    This loop will print "I'm scared!" for the following matching lines:

    • Who's afraid of the big bad woof
    • Who's afraid of the big bad weef
    • Who's afraid of the big bad waaf

    but not

    • Who's afraid of the big bad wolf
    • Who's afraid of the big bad wife

    In a similar vein, /\b(\w+)s love \1 food\b/ will match "dogs
    love dog food", but not "dogs love monkey food".

    Using Subpatterns Outside the Regular Expression Match

    Outside the regular expression match statement, the matched
    subpatterns (if any) can be found in the variables $1, $2, $3, and
    so forth.

    Example. Extract 50 base pairs upstream and 25 base pairs downstream
    of the TATTAT consensus transcription start site:

      while (my $line = <FASTA> ) {
        chomp $line;
        next unless $line =~ /(.{50})TATTAT(.{25})/;
        my $upstream = $1;
        my $downstream = $2;
      }

    Extracting Subpatterns Using Arrays

    If you assign a regular expression match to an array, it will
    return a list of all the subpatterns that matched. Alternative
    implementation of previous example:

      while (my $line = <FASTA> ) {
        chomp $line;
        my ($upstream,$downstream) = $line =~ /(.{50})TATTAT(.{25})/;
      }

    If the regular expression doesn't match at all, then it returns an
    empty list. Since an empty list is FALSE, you can use it in a logical
    test:

      while (my $line = <FASTA> ) {
        chomp $line;
        next unless my($upstream,$downstream) = $line =~ /(.{50})TATTAT(.{25})/;
        print "upstream = $upstream\n";
        print "downstream = $downstream\n";
      }
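
To try this without a FASTA file, the same idea can be run on a short string. The sketch below is only an illustration: the sequence is made up, and the flank lengths are shortened to 5 and 3 so that a small test string is enough.

```perl
use strict;
use warnings;

my $line = 'GGGCCCAAATATTATCGCGTT';   # hypothetical sequence containing TATTAT

# Shortened flanks (5 upstream, 3 downstream) so the example fits on one line.
my ($upstream, $downstream) = $line =~ /(.{5})TATTAT(.{3})/;

if (defined $upstream) {
    print "upstream = $upstream\n";
    print "downstream = $downstream\n";
}
else {
    print "no TATTAT site with enough flanking sequence\n";
}
```

If the site sits too close to either end of the string, the pattern fails and both variables are left undefined.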

    Subpatterns and Greediness

    By default, regular expressions are "greedy". They try to match as
    much as they can. For example:

    $h = 'The fox ate my box of doughnuts';
    $h =~ /(f.+x)/;
    $subpattern = $1;

    Because of the greediness of the match, $subpattern will
    contain "fox ate my box" rather than just "fox".

    To match the minimum number of times, put a ? after the quantifier,
    like this:

    $h = 'The fox ate my box of doughnuts';
    $h =~ /(f.+?x)/;
    $subpattern = $1;

    Now $subpattern will contain "fox". This is called lazy
    matching.

    Lazy matching works with any quantifier, such as +?, *? and {2,50}?.
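
The difference matters when a delimiter appears more than once. In this sketch the header format, with fields separated by |, is invented for the example:

```perl
use strict;
use warnings;

my $header = '>seq1|chr2|gene5';    # hypothetical header with | separators

# Greedy: .+ runs as far as it can while still leaving a | to match.
my ($greedy) = $header =~ />(.+)\|/;

# Lazy: .+? stops at the first | it can.
my ($lazy) = $header =~ />(.+?)\|/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy version captures "seq1|chr2", while the lazy version captures just "seq1".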

    Lesson 22d: Regular Expression: Alternatives and Grouping

    Alternatives and Grouping

    A set of alternative patterns can be specified with the | symbol:

    /wolf|sheep/;   # matches "wolf" or "sheep"

    Use parentheses to limit the alternatives to one part of the pattern:

    /big bad (wolf|sheep)/;
    # matches "big bad wolf" or "big bad sheep"
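
Alternatives are handy when several different words should count as the same thing. As a small sketch, this picks the three stop codons out of a made-up list of codons:

```perl
use strict;
use warnings;

my @codons = qw(ATG GCC TAA GGA TGA);   # hypothetical list of codons

# The parentheses keep ^ and $ applied to all three alternatives.
my @stops = grep { /^(TAA|TAG|TGA)$/ } @codons;

print "stop codons found: @stops\n";
```

This prints "stop codons found: TAA TGA".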

    Exercises

    1. Create a regular expression that will match a string with this pattern ATG followed by a C or a T. Test your regular expression with these two strings:
      • GCTGATGCGTTA
      • GCTATGGCT

    Lesson 22c: Regular Expression: Binding Operator and Variable Patterns

    Specifying the String to Match

    The Binding operator (=~) is used to "bind" the string to be searched and the pattern.

    $h = "Who's afraid of Virginia Woolf?";
    $h =~ /Woo?lf/;

    The one line version of the 'if statement' can be combined with a regular expression:

    $h = "Who's afraid of Virginia Woolf?";
    print "I'm afraid!\n" if $h =~ /Woo?lf/;

    There's also an equivalent "not match" operator !~, which
    reverses the sense of the match:

    $h = "Who's afraid of Virginia Woolf?";
    print "I'm not afraid!\n" if $h !~ /Woo?lf/;

    Matching with a Variable Pattern

    You can use a scalar variable for all or part of a regular
    expression. For example:

    $pattern = '/usr/local';
    print "matches" if $file =~ /^$pattern/;
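
One thing to watch for: characters in the variable that are special in regular expressions, such as ".", keep their special meaning. Wrapping the variable in \Q and \E turns them into literal characters. A minimal sketch with invented file names:

```perl
use strict;
use warnings;

my $pattern = 'chr1.fa';     # the "." is special inside a regular expression
my $file    = 'chr1xfa';     # hypothetical name that should NOT match

# Without \Q, "." matches any character, so this unwanted name matches.
my $loose = $file =~ /^$pattern\z/ ? 'matches' : 'no match';

# With \Q...\E, the "." must be a real dot.
my $exact = $file =~ /^\Q$pattern\E\z/ ? 'matches' : 'no match';

print "without \\Q: $loose\n";
print "with \\Q:    $exact\n";
```

The first test reports a match and the second does not.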

    Exercises

    1. Create a script with a regular expression within an if-statement.
    2. Design the regular expression to match an entire sentence, up to the ending period in the provided string.

      my $str = "This is a paragraph. A Paragraph is usually made up of more than one sentence.";
    3. Modify your regular expression to take '.', '?', and '!' into account as ending punctuation.

    Lesson 22a: Regular Expressions: Overview

    Regular expressions are a language you can use within perl to identify patterns in text.

    A regular expression is a string template against which you can match a piece of text. They are something like shell wildcard expressions, but much more powerful.

    Examples of Regular Expressions (more details to follow!!)

    This bit of code loops through each line of a file, finds all lines containing an EcoRI site, and bumps up a counter:

    Code:

    #!/usr/bin/perl -w
    #file: EcoRI1.pl
     
    use strict;
     
    my $filename = "example.fasta";
    open (FASTA , '<' , $filename) or die "Cannot open $filename: $!\n";
    my $sites;
     
    while (my $line = <FASTA>) {
      chomp $line;
     
      if ($line =~ /GAATTC/){
        print "Found an EcoRI site!\n";
        $sites++;
      }
    }
     
    if ($sites){
      print "$sites EcoRI sites total\n";
    }else{
      print "No EcoRI sites were found\n";
    }
     
    #note: if $sites is declared inside while loop you would not be able to
    #print it outside the loop

    Output:

    ~]$ ./EcoRI1.pl
    Found an EcoRI site!
    Found an EcoRI site!
    .
    .
    .
    Found an EcoRI site!
    Found an EcoRI site!
    34 EcoRI sites total
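
The script above counts lines that contain a site, so a line with two sites is counted once. The following sketch counts both lines and individual sites. It is only an illustration: the sequence data is made up and is read from a string through an in-memory filehandle, so no example.fasta file is needed.

```perl
use strict;
use warnings;

# Hypothetical data standing in for the contents of a FASTA file.
my $data = ">seq1\nAAGAATTCTT\nCCCCGGGG\nGAATTCGAATTC\n";

# Opening a reference to a string gives a filehandle that reads from memory.
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!\n";

my ($lines_with_site, $total_sites) = (0, 0);
while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^>/;                     # skip the header line
    my $count = () = $line =~ /GAATTC/g;       # number of matches on this line
    next unless $count;
    $lines_with_site++;
    $total_sites += $count;
}
close $fh;

print "$lines_with_site lines contain a site, $total_sites sites total\n";
```

For this data the result is 2 lines and 3 sites, because the last line holds two sites.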
    
    

    This does the same thing, but counts one type of methylation site (Pu-C-X-G) instead, using the pattern /[GA]C.?G/:

    [GA] - a G or an A
    C - followed by a C
    .? - followed by one of anything, but could be nothing
    G - followed by a G

    Code:

    #file:methy.pl
    # FASTA is opened and $sites is declared as in EcoRI1.pl
    while (my $line = <FASTA> ) {
      chomp $line;

      if ($line =~ /[GA]C.?G/){
        $sites++;
      }
    }
    if ($sites){
      print "$sites Methylation Sites total\n";
    }else{
      print "No Methylation Sites were found\n";
    }

    Output:

    ~]$ ./methy.pl
    723 Methylation Sites total
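
To get a feel for what /[GA]C.?G/ accepts, it helps to test it against a few short strings. The candidate strings in this sketch are made up for the purpose:

```perl
use strict;
use warnings;

my @candidates = qw(GCG ACTG ACTTG TCAG GCCG);   # hypothetical test strings

foreach my $dna (@candidates) {
    my $result = $dna =~ /[GA]C.?G/ ? 'match' : 'no match';
    print "$dna: $result\n";
}

# grep collects the strings for which the pattern matched.
my @matching = grep { /[GA]C.?G/ } @candidates;
print "matching: @matching\n";
```

GCG matches because .? is allowed to match nothing. ACTTG fails because there are two bases between the C and the G, and TCAG fails because no G or A is followed by a C.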