Lesson 24: System calls


You will often need to run commands in your script that you would normally run on the command line. For example, you might want to run a series of scripts from a single script.

There are two common ways to do this.

System

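system is a function that takes the command line statement to be executed as its first argument and returns the exit status of the command. Output cannot be captured in a variable using system, but see backticks `` below for this feature.

  • 0 means the command ran and exited without errors
  • -1 means the command could not be started (print "$!" for the reason)
  • any other value means the command ran but reported an error; the command's own exit code is the return value shifted right by 8 bits ($sys >> 8)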

Code:

my $sys = system ("date");
print "sys: $sys\n";

Output:

%% ./system.pl
Wed May  9 07:55:04 PDT 2012
sys: 0
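
The return values listed above can be checked in code. The following is a minimal sketch, not a complete error handler: run_command is a hypothetical helper name, and the example uses $^X (the path of the perl that is running the script) only so that it does not depend on any particular external program. It does not look at the low bits of the status, which report termination by a signal.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command and return its exit code,
# or die if the command could not be started at all.
sub run_command {
    my @command = @_;
    my $status  = system(@command);
    die "could not run '@command': $!\n" if $status == -1;
    return $status >> 8;    # the exit code is stored in the high byte
}

# $^X is the path of the perl that is running this script.
my $ok   = run_command($^X, '-e', 'exit 0');
my $fail = run_command($^X, '-e', 'exit 3');
print "first exit code: $ok\n";
print "second exit code: $fail\n";
```

Passing the command as a list, as done here, runs it without going through the shell.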

Backticks ``

Backticks can be used to execute a command in your script. The value returned is the output of the command, and it can be captured in a variable. Now you can do things to the contents of the variable.

Code:

my $output = `echo "using backticks is helpful"`;
##can do stuff to output
$output = uc ($output);
print "$output\n";

Output:

%% system.pl
USING BACKTICKS IS HELPFUL
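
Two details are worth seeing in code: the captured output keeps its trailing newline, and backticks behave differently in list context. This sketch assumes a Unix-like shell where echo is available and ";" separates commands.

```perl
use strict;
use warnings;

# Scalar context: the whole output comes back as one string,
# including the trailing newline that echo adds.
my $greeting = `echo hello`;
chomp $greeting;                 # remove the trailing newline
my $exit_code = $? >> 8;         # exit code of the command just run
print "[$greeting] exit code: $exit_code\n";

# List context: one element per line of output.
my @lines = `echo one; echo two`;
chomp @lines;
print scalar(@lines), " lines: @lines\n";
```

After a backtick command, $? holds the same kind of status value that system returns.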

Exercises

  1. Create a script that uses a system call, using the system function, to run one of your already written scripts. Store the return value of the system function in a variable and print it to the screen. This value is a code that indicates the success of the call.
  2. Change the script to run your system call using backticks. Collect your output and print it to the screen.

Lesson 23: Subroutines or Custom functions


Is there a block of code that you use more than once in your script? If so you should write a subroutine.

  • A subroutine is a custom function
  • Allows you to reduce the chances of introducing an error into repetitive blocks of code
  • If you decide to change your block of code, you only have to change it in one place
  • Simplifies the flow of your script. Now you have a useful function name instead of many lines of code
  • You can pass arguments to the subroutine
  • You can have your subroutine return values

To make a subroutine

  1. place the subroutine anywhere in your script; by convention, subroutines go below the code that calls them
  2. use the keyword sub
  3. give it an informative name.
  4. arguments come in on a special array called '@_' or the magic carpet array. It is very similar to @ARGV.
  5. use the return function to return values.

Code:

my $answer = doSomeMath (3 , 4 , 6);
print "the answer is $answer\n";
 
##subroutines
sub doSomeMath {
  my @numbers = @_;
  my $sum;
  foreach my $num (@numbers){
    ## adding $num to the previous value of $sum
    $sum += $num;
  }
  my $product = $sum * 2;
  foreach my $num (@numbers){
    ## multiplying each $num to the previous value of $product
    $product *= $num;
  }
  return $product;
}

Output:

%% perl sub.pl
the answer is 1872

Note:

  • Arrays can be passed into and out of a subroutine, but if more than one is passed in or out, the contents will be merged into one list.
  • Hashes are lists of key/value pairs so they can also be passed into and out of a subroutine.
  • More complicated data structures do not get passed around nicely unless you pass just a reference to the subroutine.
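
The notes above can be sketched as follows. This is only an illustration: longer_list and the two example arrays are made-up names. Passing references (made with a backslash) keeps the two arrays separate instead of merging them into one list in @_.

```perl
use strict;
use warnings;

# Hypothetical subroutine: takes two array references so that the
# two lists stay separate instead of being merged in @_.
sub longer_list {
    my ($first_ref, $second_ref) = @_;
    return @$first_ref >= @$second_ref ? $first_ref : $second_ref;
}

my @genes   = ('BRCA1', 'TP53');
my @enzymes = ('EcoRI', 'ApoI', 'BamHI');

# The backslash makes a reference to each array.
my $longest = longer_list(\@genes, \@enzymes);
print "longest list: @$longest\n";
```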

Exercises

    1. Create a factorial subroutine that takes one number as an argument, calculates the factorial of that number, and then returns the result.

    Lesson 22i: Regular Expression: Restriction Digest Exercises

    Exercises

    1. The enzyme ApoI has a restriction site: R^AATTY where R and Y are degenerate nucleotides. See the IUPAC table to identify the nucleotide possibilities for R and Y.

      Write a regular expression that will match occurrences of the site in a sequence. (hint: what are you going to do about the actual cut site, represented by the ‘^’?)

    2. Use the regular expression you just wrote to find all the restriction sites in the following sequence. Be sure to think about how to handle the newlines!
      GAATTCAAGTTCTTGTGCGCACACAAATCCAATAAAAACTATTGTGCACACAGACGCGAC
      TTCGCGGTCTCGCTTGTTCTTGTTGTATTCGTATTTTCATTTCTCGTTCTGTTTCTACTT
      AACAATGTGGTGATAATATAAAAAATAAAGCAATTCAAAAGTGTATGACTTAATTAATGA
      GCGATTTTTTTTTTGAAATCAAATTTTTGGAACATTTTTTTTAAATTCAAATTTTGGCGA
      AAATTCAATATCGGTTCTACTATCCATAATATAATTCATCAGGAATACATCTTCAAAGGC
      AAACGGTGACAACAAAATTCAGGCAATTCAGGCAAATACCGAATGACCAGCTTGGTTATC
      AATTCTAGAATTTGTTTTTTGGTTTTTATTTATCATTGTAAATAAGACAAACATTTGTTC
      CTAGTAAAGAATGTAACACCAGAAGTCACGTAAAATGGTGTCCCCATTGTTTAAACGGTT
      GTTGGGACCAATGGAGTTCGTGGTAACAGTACATCTTTCCCCTTGAATTTGCCATTCAAA
      ATTTGCGGTGGAATACCTAACAAATCCAGTGAATTTAAGAATTGCGATGGGTAATTGACA
      TGAATTCCAAGGTCAAATGCTAAGAGATAGTTTAATTTATGTTTGAGACAATCAATTCCC
      CAATTTTTCTAAGACTTCAATCAATCTCTTAGAATCCGCCTCTGGAGGTGCACTCAGCCG
      CACGTCGGGCTCACCAAATATGTTGGGGTTGTCGGTGAACTCGAATAGAAATTATTGTCG
      CCTCCATCTTCATGGCCGTGAAATCGGCTCGCTGACGGGCTTCTCGCGCTGGATTTTTTC
      ACTATTTTTGAATACATCATTAACGCAATATATATATATATATATTTAT
      
    3. Determine the site(s) of the cut in the above sequence. Print out the sequence with “^” at the cut site.

      Hints:
      Use subpatterns (parentheses and $1, $2) to find the cut site within the pattern.
      Use s///

      Example: if the pattern is GACGT^CT the following sequence

      AAAAAAAAGACGTCTTTTTTTAAAAAAAAGACGTCTTTTTTT

      would be cut like this:

      AAAAAAAAGACGT^CTTTTTTTAAAAAAAAGACGT^CTTTTTTT

    4. Now that you’ve done your restriction digest, determine the lengths of your fragments and sort them by length (in the same order they would separate on an electrophoresis gel).

      Hint: take a look at the split man page or think about storing your matches in an array. With one of these two approaches you should be able to convert this string:

      AAAAAAAAGACGT^CTTTTTTTAAAAAAAAGACGT^CTTTTTTT

      into this array:

      ("AAAAAAAAGACGT","CTTTTTTTAAAAAAAAGACGT","CTTTTTTT")

    Lesson 22h: Regular Expression: Testing your RE

    Testing Your Regular Expressions

    To be sure that you are getting what you think you want, you can use the following
    "Magic" Perl Automatic Match Variables: $`, $&, and $'

    Everything before the pattern begins is stored in:

    $`

    Everything within the pattern is stored in:

    $&

    Everything after the pattern ends is stored in:

    $'

    Code:

    if ("Hello there, neighbor" =~ /\s(\w+),/){
            print "That actually matched '$&'.\n";
            print "That was ($`) ($&) ($').\n";
    }

    Output:

    That actually matched ' there,'.
    That was (Hello) ( there,) ( neighbor).
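
As an alternative sketch, Perl 5.10 and later also provide the /p modifier, which stores the same three pieces in ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH}. The example reuses the string from above; on older perls, use the punctuation variables instead.

```perl
use strict;
use warnings;

my $text = "Hello there, neighbor";

# The /p modifier asks perl to preserve the pieces of the match in
# ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH}.
if ($text =~ /\s(\w+),/p) {
    print "before:  '${^PREMATCH}'\n";
    print "match:   '${^MATCH}'\n";
    print "after:   '${^POSTMATCH}'\n";
    print "group 1: '$1'\n";
}
```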
    

    Lesson 22g: Regular Expression : Global and other options

    Regular Expression Options

    Regular expression matches and substitutions have a whole set of
    options which you can toggle on by appending one or more of the i,
    m, s, g, e, o or x modifiers to the end of the operation.

    See Programming Perl
    Page 153 for more information. For example:

    $string = 'Big Bad WOLF!';
    print "There's a wolf in the closet!" if $string =~ /wolf/i;
    # i is used for a case insensitive match

    i
    Case insensitive match.

    g
    Global match (see below).

    e
    Evaluate right side of s/// as an expression.

    m
    Treat string as multiple lines. ^ and $ will match at start
    and end of internal lines, as well as at beginning and end of
    whole string. Use \A and \Z to match beginning and end of whole
    string when this is turned on.

    s
    Treat string as a single line. “.” will match any character at
    all, including newline.

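    o
    Compile the pattern only once. A variable used in the pattern is interpolated the first time the pattern is used, and perl will not interpolate it again even if the variable changes.

    x
    Ignore whitespace in the pattern and allow comments, so that complicated patterns can be laid out readably.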
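
A short sketch of the e, m, s, and x options in action. The strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# /e: the replacement is evaluated as Perl code, here doubling each number.
my $recipe = "2 eggs and 3 cups of flour";
(my $doubled = $recipe) =~ s/(\d+)/$1 * 2/ge;
print "$doubled\n";

# /m: ^ matches at the start of every line, not just the start of the string.
my $fasta = ">seq1\nACGT\n>seq2\nGGCC\n";
my @ids = $fasta =~ /^>(\S+)/mg;
print "@ids\n";

# /s: "." also matches newline, so .+ can run across lines.
my ($across) = "AAA\nTTT" =~ /(A.+T)/s;

# /x: whitespace inside the pattern is ignored.
my $is_phone = "376-8380" =~ /^ \d{3} - \d{4} $/x;
print "phone number\n" if $is_phone;
```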

    Global Matches

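    Adding the g modifier to the pattern causes the match to be
    global. Called in a scalar context (such as an if or
    while statement), each evaluation finds the next match, picking up
    where the previous one left off, so a while loop steps through
    every match in the string.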

    This will match all codons in a DNA sequence, printing them out on
    separate lines:

    Code:

      $sequence = 'GTTGCCTGAAATGGCGGAACCTTGAA';
      while ( $sequence =~ /(.{3})/g ) {
        print $1,"\n";
      }

    Output:

    GTT
    GCC
    TGA
    AAT
    GGC
    GGA
    ACC
    TTG
    

    If you perform a global match in a list context (e.g. assign
    its result to an array), then you get a list of all the subpatterns
    that matched from left to right. This code fragment gets arrays of
    codons in three reading frames:

    @frame1 = $sequence =~ /(.{3})/g;
    @frame2 = substr($sequence,1) =~ /(.{3})/g;
    @frame3 = substr($sequence,2) =~ /(.{3})/g;
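
The three statements above can be wrapped in a subroutine. This is a minimal sketch; codons_in_frame is a hypothetical name, and frames are numbered 1 to 3. Leftover bases at the end that do not fill a whole codon are dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: return the complete codons in reading frame 1, 2, or 3.
sub codons_in_frame {
    my ($sequence, $frame) = @_;
    my @codons = substr($sequence, $frame - 1) =~ /(.{3})/g;
    return @codons;
}

my $sequence = 'GTTGCCTGAAATGGCGGAACCTTGAA';
for my $frame (1 .. 3) {
    my @codons = codons_in_frame($sequence, $frame);
    print "frame $frame: @codons\n";
}
```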

    The position of the most recent match can be determined by using the
    pos function. The pos function returns the position where the next
    attempt begins. Remember that pos returns a 0-based position: the first position is 0, not 1.
    Code:

    #file:pos.pl
    my $seq = "XXGGATCCXX";

    if ( $seq =~ /(GGATCC)/gi ){
      my $pos = pos($seq);
      print "Our Sequence: $seq\n";
      print '$pos = ', "1st position after the match: $pos\n";
      print '$pos - length($1) = 1st position of the match: ',($pos-length($1)),"\n";
      print '($pos - length($1))-1 = 1st position before the match: ',($pos-length($1)-1),"\n";
    }

    Output:

    ~]$ ./pos.pl
    Our Sequence: XXGGATCCXX
    $pos = 1st position after the match: 8
    $pos - length($1) = 1st position of the match: 2
    ($pos - length($1))-1 = 1st position before the match: 1
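
Combining pos with a global match in a while loop gives the position of every match, not just the first. A minimal sketch, with site_positions as a hypothetical helper name and a made-up test sequence:

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based start position of every
# occurrence of $site in $seq (case insensitive).
sub site_positions {
    my ($seq, $site) = @_;
    my @starts;
    while ($seq =~ /\Q$site\E/gi) {
        # pos() is the position just after the match, so step back
        # by the length of the matched text ($&) to get its start.
        push @starts, pos($seq) - length($&);
    }
    return @starts;
}

my @starts = site_positions("XXGGATCCXXggatccX", "GGATCC");
print "matches start at: @starts\n";
```

The \Q and \E around $site make sure the site is matched literally, even if it contains characters that are special in a pattern.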
    

    Lesson 22f: Regular Expression Part VI s/// tr///

    String Substitution

    String substitution allows you to replace a pattern or character range
    with another one using the s/// and tr/// functions.

    The s/// Function

    s/// has two parts: the regular expression and the string
    to replace it with: s/expression/replacement/.

    $h = "Who's afraid of the big bad wolf?";
    $i = "He had a wife.";
     
    $h =~ s/w.+f/goat/;  # yields "Who's afraid of the big bad goat?"
    $i =~ s/w.+f/goat/;  # yields "He had a goate."

    If you extract pattern matches, you can use them in the replacement
    part of the substitution:

    $h = "Who's afraid of the big bad wolf?";
     
    $h =~ s/(\w+) (\w+) wolf/$2 $1 wolf/;
    # yields "Who's afraid of the bad big wolf?"

    Using a Variable in the Substitution Part

    Yes you can:

    $h = "Who's afraid of the big bad wolf?";
    $animal = 'hyena';
    $h =~ s/(\w+) (\w+) wolf/$2 $1 $animal/;
    # yields "Who's afraid of the bad big hyena?"
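
Two further behaviors of s/// are sketched below with made-up strings: with the g option it replaces every occurrence and returns the number of substitutions made, and with the r option (Perl 5.14 and later) it leaves the original string alone and returns the changed copy.

```perl
use strict;
use warnings;

# /g replaces every T; the return value is the number of substitutions.
my $rna   = "ATGCTTAGCTTA";
my $count = ($rna =~ s/T/U/g);
print "$rna ($count substitutions)\n";

# /r returns the new string and does not modify the original.
my $original = "Who's afraid of the big bad wolf?";
my $changed  = $original =~ s/wolf/sheep/r;
print "$original\n$changed\n";
```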

    Translating Character Ranges

    The tr/// function allows you to translate one set of
    characters into another. Specify the source set in the first part of
    the function, and the destination set in the second part:

    $h = "Who's afraid of the big bad wolf?";
    $h =~ tr/ao/AO/; # yields "WhO's AfrAid Of the big bAd wOlf?";

    tr/// returns the number of characters transformed, which is
    sometimes handy for counting the number of a particular character
    without actually changing the string.

    This example counts N's in a series of DNA sequences:

    Code:

      while (my $line = <>) {
        chomp $line;   # assume one sequence per line
        my $count = $line =~ tr/Nn/Nn/;
        print "Sequence $line contains $count Ns\n";
      }

    Output:

    (~) 50% count_Ns.pl sequence_list.txt
    Sequence 1 contains 0 Ns
    Sequence 2 contains 3 Ns
    Sequence 3 contains 1 Ns
    Sequence 4 contains 0 Ns
    ...
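
Another common use of tr/// on DNA is taking the reverse complement of a sequence. A minimal sketch, with reverse_complement as a hypothetical helper name; it handles only A, C, G, and T in either case and leaves any other character unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse-complement a DNA sequence.
sub reverse_complement {
    my ($seq) = @_;
    my $revcomp = reverse $seq;          # reverse the string first
    $revcomp =~ tr/ACGTacgt/TGCAtgca/;   # then swap each base for its partner
    return $revcomp;
}

print reverse_complement("GAATTCAAG"), "\n";
```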
    

    Lesson 22e: Regular Expression: Subpatterns

    Subpatterns

    You can extract and manipulate subpatterns in regular expressions.

    To designate a subpattern, surround its part of the pattern with
    parentheses (same as with the grouping operator). This example has
    just one subpattern, (.+) :

     /Who's afraid of the big bad w(.+)f/

    You can combine parentheses and quantifiers to quantify entire
    subpatterns:

    /Who's afraid of the big (bad )?wolf\?/;

    This matches "Who's afraid of the big bad wolf?"
    as well as "Who's afraid of the big wolf?"

    This also shows how to literally match the special characters -- put a
    backslash (\) in front of them.

    Matching Subpatterns

    Once a subpattern matches, you can refer to it later within the same
    regular expression. The first subpattern becomes \1, the second \2,
    the third \3, and so on.

      while (my $line = <FASTA>) {
        chomp $line;
        print "I'm scared!\n" if $line =~ /Who's afraid of the big bad w(.)\1f/;
      }

    This loop will print "I'm scared!" for the following matching lines:

    • Who's afraid of the big bad woof
    • Who's afraid of the big bad weef
    • Who's afraid of the big bad waaf

    but not

    • Who's afraid of the big bad wolf
    • Who's afraid of the big bad wife

    In a similar vein, /\b(\w+)s love \1 food\b/ will match "dogs
    love dog food", but not "dogs love monkey food".
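
A practical use of the same idea is spotting a word that was accidentally typed twice in a row. A minimal sketch, with doubled_word as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: return the first word that appears twice in a row,
# or undef if there is none.
sub doubled_word {
    my ($text) = @_;
    # \1 must match exactly the same text that (\w+) just captured.
    return $text =~ /\b(\w+)\s+\1\b/ ? $1 : undef;
}

my $found = doubled_word("the cut site is is here");
print "doubled word: $found\n" if defined $found;
```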

    Using Subpatterns Outside the Regular Expression Match

    Outside the regular expression match statement, the matched
    subpatterns (if any) can be found in the variables $1, $2, $3, and
    so forth.

    Example. Extract 50 base pairs upstream and 25 base pairs downstream
    of the TATTAT consensus transcription start site:

      while (my $line = <FASTA> ) {
        chomp $line;
        next unless $line =~ /(.{50})TATTAT(.{25})/;
        my $upstream = $1;
        my $downstream = $2;
      }

    Extracting Subpatterns Using Arrays

    If you assign a regular expression match to an array, it will
    return a list of all the subpatterns that matched. Alternative
    implementation of previous example:

      while ($line = <FASTA> ) {
        chomp $line;
        my ($upstream,$downstream) = $line =~ /(.{50})TATTAT(.{25})/;
      }

    If the regular expression doesn't match at all, then it returns an
    empty list. Since an empty list is FALSE, you can use it in a logical
    test:

      while (my $line = <FASTA> ) {
        chomp $line;
        next unless my($upstream,$downstream) = $line =~ /(.{50})TATTAT(.{25})/;
        print "upstream = $upstream\n";
        print "downstream = $downstream\n";
      }
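
To try this without a FASTA file, here is a self-contained sketch that uses a short made-up sequence and much shorter flanking regions. flanking is a hypothetical helper name, and the flank lengths are passed in as arguments instead of being fixed at 50 and 25.

```perl
use strict;
use warnings;

# Hypothetical helper: return the $up bases before and the $down bases
# after the first TATTAT in $seq, or an empty list if there is no match.
sub flanking {
    my ($seq, $up, $down) = @_;
    return $seq =~ /(.{$up})TATTAT(.{$down})/;
}

my $seq = "GGGCCCAAATATTATCCGGTTAA";
if (my ($upstream, $downstream) = flanking($seq, 3, 5)) {
    print "upstream = $upstream\n";
    print "downstream = $downstream\n";
}
```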

    Subpatterns and Greediness

    By default, regular expressions are "greedy". They try to match as
    much as they can. For example:

    $h = 'The fox ate my box of doughnuts';
    $h =~ /(f.+x)/;
    $subpattern = $1;

    Because of the greediness of the match, $subpattern will
    contain "fox ate my box" rather than just "fox".

    To match the minimum number of times, put a ? after the quantifier,
    like this:

    $h = 'The fox ate my box of doughnuts';
    $h =~ /(f.+?x)/;
    $subpattern = $1;

    Now $subpattern will contain "fox". This is called lazy
    matching.

    Lazy matching works with any quantifier, such as +?, *? and {2,50}?.
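
The difference matters most when a delimiter occurs more than once on a line. A small sketch with a made-up FASTA-style header:

```perl
use strict;
use warnings;

my $header = ">SEQ_ID_1 first [test] sequence [draft]";

# Greedy: .+ runs to the LAST closing bracket on the line.
my ($greedy) = $header =~ /\[(.+)\]/;

# Lazy: .+? stops at the FIRST closing bracket it can.
my ($lazy) = $header =~ /\[(.+?)\]/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```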

    Lesson 22d: Regular Expression: Alternatives and Grouping

    Alternatives and Grouping

    A set of alternative patterns can be specified with the | symbol:

    /wolf|sheep/;   # matches "wolf" or "sheep"

    Use parentheses with alternative patterns:

    /big bad (wolf|sheep)/;
    # matches "big bad wolf" or "big bad sheep"
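
Grouping matters as soon as alternatives are combined with anchors. In this sketch (stop_codons is a hypothetical helper name), the parentheses make ^ and $ apply to all three alternatives; without them, /^TAA|TAG|TGA$/ would anchor only the first and last alternative.

```perl
use strict;
use warnings;

# Hypothetical helper: split a sequence into codons and return the
# ones that are stop codons.
sub stop_codons {
    my ($seq) = @_;
    my @codons = $seq =~ /(.{3})/g;
    # The parentheses group the alternatives so that ^ and $ apply to each.
    return grep { /^(TAA|TAG|TGA)$/ } @codons;
}

my @stops = stop_codons('ATGAAATAGCCCTGA');
print "stop codons: @stops\n";
```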

    Exercises

    1. Create a regular expression that will match a string with this pattern ATG followed by a C or a T. Test your regular expression with these two strings:
      • GCTGATGCGTTA
      • GCTATGGCT

    Lesson 22c: Regular Expression: Binding Operator and Variable Patterns

    Specifying the String to Match

    The Binding operator (=~) is used to "bind" the string to be searched and the pattern.

    $h = "Who's afraid of Virginia Woolf?";
    $h =~ /Woo?lf/;

    The one line version of the 'if statement' can be combined with a regular expression:

    $h = "Who's afraid of Virginia Woolf?";
    print "I'm afraid!\n" if $h =~ /Woo?lf/;

    There's also an equivalent "not match" operator !~, which
    reverses the sense of the match:

    $h = "Who's afraid of Virginia Woolf?";
    print "I'm not afraid!\n" if $h !~ /Woo?lf/;

    Matching with a Variable Pattern

    You can use a scalar variable for all or part of a regular
    expression. For example:

    $pattern = '/usr/local';
    print "matches" if $file =~ /^$pattern/;
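
Two related tools are sketched below with made-up values: qr// compiles a pattern and stores it in a scalar, and \Q...\E switches off special characters in an interpolated variable, so that text such as a file name is matched literally.

```perl
use strict;
use warnings;

my $file = '/usr/local/bin/perl';

# qr// compiles a pattern once and stores it in a scalar.
my $prefix = qr{^/usr/local};
print "matches\n" if $file =~ $prefix;

# Without \Q...\E the "." in $name is a pattern character and matches
# any character; with \Q...\E it only matches a literal dot.
my $name    = 'seq.fa';
my $loose   = 'seqXfa' =~ /^$name$/     ? 1 : 0;
my $literal = 'seqXfa' =~ /^\Q$name\E$/ ? 1 : 0;
print "loose: $loose, literal: $literal\n";
```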

    Exercises

    1. Create a script with a regular expression within an if-statement.
    2. Design the regular expression to match an entire sentence, up to the ending period in the provided string.

      my $str = "This is a paragraph. A Paragraph is usually made up of more than one sentence.";
    3. Modify your regular expression to take '.', '?', and '!' into account as ending punctuation.

    Lesson 22b: Regular Expression: Atoms and Quantifiers

    Regular Expression Bits and Pieces

    A regular expression is normally delimited by two slashes ("/").
    Everything between the slashes is a pattern to match. Patterns can
    be made up of the following Atoms:

    1. Ordinary characters: a-z, A-Z, 0-9 and some punctuation. These
      match themselves.

    2. The "." character, which matches everything except the newline.

    3. A bracket list of characters, such as [AaGgCcTtNn], [A-F0-9], or
      [^A-Z] (the last means anything BUT A-Z).

    4. Certain predefined character sets:
      \d
      The digits [0-9]
      \w
      A word character [A-Za-z_0-9]
      \s
      White space [ \t\n\r]
      \D
      A non-digit
      \W
      A non-word
      \S
      Non-whitespace
    5. Anchors:
      ^
      Matches the beginning of the string
      $
      Matches the end of the string
      \b
      Matches a word boundary (between a \w and a \W)

    Examples:

    • /g..t/ matches "gaat", "goat", and "gotta get a goat" (twice)
    • /g[gatc][gatc]t/ matches "gaat", "gttt", "gatt", and
      "gotta get an agatt" (once)
    • /\d\d\d-\d\d\d\d/ matches 376-8380, and 5128-8181, but not
      055-98-2818.
    • /^\d\d\d-\d\d\d\d/ matches 376-8380 and 376-83801, but not
      5128-8181.
    • /^\d\d\d-\d\d\d\d$/ only matches when the whole string is a telephone number in this format (376-8380, but not 376-83801 or 5128-8181).
    • /\bcat/ matches "cat", "catsup" and "more catsup please"
      but not "scat".
    • /\bcat\b/ only matches text containing the word "cat".
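
The telephone number examples above can be checked with a short script. This is a sketch only; describe is a hypothetical helper that reports, for one string, whether each of the three patterns matches.

```perl
use strict;
use warnings;

# Hypothetical helper: test one string against the three patterns
# (anywhere, anchored at the start, anchored at both ends).
sub describe {
    my ($number) = @_;
    my @patterns = (
        qr/\d\d\d-\d\d\d\d/,
        qr/^\d\d\d-\d\d\d\d/,
        qr/^\d\d\d-\d\d\d\d$/,
    );
    return join ' ', map { $number =~ $_ ? 'yes' : 'no' } @patterns;
}

for my $number ('376-8380', '376-83801', '5128-8181', '055-98-2818') {
    print "$number: ", describe($number), "\n";
}
```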

    Quantifiers

    By default, an atom matches once. This can be modified by following
    the atom with a quantifier:

    ?
    atom matches zero times or exactly once
    *
    atom matches zero or more times
    +
    atom matches one or more times
    {3}
    atom matches exactly three times
    {2,4}
    atom matches between two and four times, inclusive
    {4,}
    atom matches at least four times

    Examples:

    • /goa?t/ matches "goat" and "got". Also any text that contains these words.
    • /g.+t/ matches "goat", "goot", and "grant", among others.
    • /g.*t/ matches "gt", "goat", "goot", and "grant", among others.
    • /^\d{3}-\d{4}$/ matches US telephone numbers (no extra text allowed).
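
A short sketch of two of the quantifiers on made-up data: {4,} to find runs of at least four Ts in a sequence, and ? to accept an optional letter.

```perl
use strict;
use warnings;

# Build a sequence with a run of 10 Ts and a run of 5 Ts.
my $seq = 'GCGA' . ('T' x 10) . 'GAAATCAAA' . ('T' x 5) . 'GG';

# {4,} means "at least four": collect every run of 4 or more Ts.
my @runs    = $seq =~ /(T{4,})/g;
my @lengths = map { length } @runs;
print "poly-T run lengths: @lengths\n";

# ? means "zero times or once": the "a" is optional.
for my $word ('goat', 'got', 'goaat') {
    my $result = $word =~ /^goa?t$/ ? 'matches' : 'does not match';
    print "$word $result\n";
}
```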

    Exercises:

    1. Design a pattern to recognize an email address.
    2. Design a pattern to recognize the id portion of a sequence in a FASTA file
      >SEQ_ID_1
      ATGCTGCGCGTGCATGATGCT
      >SEQ_ID_2
      CGCGTGCATGATGCTGCGCGT