Lesson 22b: Regular Expressions: Atoms and Quantifiers

Regular Expression Bits and Pieces

A regular expression is normally delimited by two slashes ("/").
Everything between the slashes is a pattern to match. Patterns can
be made up of the following Atoms:

  1. Ordinary characters: a-z, A-Z, 0-9 and some punctuation. These
    match themselves.

  2. The "." character, which matches any single character except the
    newline.

  3. A bracket list of characters, such as [AaGgCcTtNn], [A-F0-9], or
    [^A-Z] (the last means anything BUT A-Z).

  4. Certain predefined character sets:

    \d    A digit [0-9]
    \w    A word character [A-Za-z_0-9]
    \s    A whitespace character [ \t\n\r\f]
    \D    A non-digit [^0-9]
    \W    A non-word character
    \S    A non-whitespace character
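
  5. Anchors, which match a position rather than a character:

    ^     Matches the beginning of the string
    $     Matches the end of the string
    \b    Matches a word boundary (between a \w and a \W, or between
          a \w and the start or end of the string)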
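
The five kinds of atom can be tried out in any language with a regular expression library. The sketch below uses Python's re module, which is an assumption, since the lesson does not name a language. In Python the pattern is passed as a raw string (r"...") rather than written between slashes. The sample string and variable names are made up for illustration.

```python
import re

sample = "Gene_42 starts at 0x1F\n"

# 1. Ordinary characters match themselves.
literal = re.search(r"Gene", sample)

# 2. "." matches any single character except the newline.
any_char = re.findall(r".", sample)

# 3. A bracket list matches one character from the set.
hex_chars = re.findall(r"[A-F0-9]", sample)

# 4. Predefined character sets.
digits = re.findall(r"\d", sample)
spaces = re.findall(r"\s", sample)

# 5. Anchors match positions, not characters.
at_start = re.search(r"^Gene", sample)       # "Gene" is at the beginning
not_at_start = re.search(r"^starts", sample)  # "starts" is not
whole_word = re.search(r"\bat\b", sample)     # the separate word "at"
inside_word = re.search(r"\bart", sample)     # "art" only occurs inside "starts"

print(literal.group())               # Gene
print(len(any_char), len(sample))    # 22 23 (the newline is skipped)
print(hex_chars)                     # ['4', '2', '0', '1', 'F']
print(digits)                        # ['4', '2', '0', '1']
print(spaces)                        # [' ', ' ', ' ', '\n']
print(at_start is not None, not_at_start is None)   # True True
print(whole_word.start(), inside_word)               # 15 None
```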

Examples:

  • /g..t/ matches "gaat", "goat", and "gotta get a goat" (twice)
  • /g[gatc][gatc]t/ matches "gaat", "gttt", "gatt", and
    "gotta get an agatt" (once)
  • /\d\d\d-\d\d\d\d/ matches 376-8380 and 5128-8181 (the "128-8181"
    part), but not 055-98-2818.
  • /^\d\d\d-\d\d\d\d/ matches 376-8380 and 376-83801, but not
    5128-8181.
  • /^\d\d\d-\d\d\d\d$/ matches only strings that consist entirely of
    three digits, a hyphen, and four digits, such as 376-8380.
  • /\bcat/ matches "cat", "catsup" and "more catsup please"
    but not "scat".
  • /\bcat\b/ matches only text containing the word "cat".
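
The claims in the examples above can be checked mechanically. The following is a minimal sketch in Python (an assumed language; the lesson itself only shows slash-delimited patterns). The helper count_matches is hypothetical and simply counts non-overlapping matches.

```python
import re

def count_matches(pattern, text):
    """Return the number of non-overlapping matches of pattern in text."""
    return len(re.findall(pattern, text))

# /g..t/ finds "gott" and "goat" in the sentence.
twice = count_matches(r"g..t", "gotta get a goat")

# /g[gatc][gatc]t/ only finds "gatt".
once = count_matches(r"g[gatc][gatc]t", "gotta get an agatt")

# Without anchors, the pattern may match in the middle of the text.
unanchored = re.search(r"\d\d\d-\d\d\d\d", "5128-8181").group()
no_phone = re.search(r"\d\d\d-\d\d\d\d", "055-98-2818")

# "^" pins the match to the start; "$" also pins it to the end.
start_only = re.search(r"^\d\d\d-\d\d\d\d", "376-83801").group()
both_ends = re.search(r"^\d\d\d-\d\d\d\d$", "376-83801")

# "\b" requires a word boundary at that position.
scat = re.search(r"\bcat", "scat")
catsup = re.search(r"\bcat\b", "more catsup please")
cat_word = re.search(r"\bcat\b", "the cat sat")

print(twice, once)              # 2 1
print(unanchored, no_phone)     # 128-8181 None
print(start_only, both_ends)    # 376-8380 None
print(scat, catsup, cat_word is not None)   # None None True
```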

Quantifiers

By default, an atom matches once. This can be modified by following
the atom with a quantifier:

  ?        atom matches zero times or exactly once
  *        atom matches zero or more times
  +        atom matches one or more times
  {3}      atom matches exactly three times
  {2,4}    atom matches between two and four times, inclusive
  {4,}     atom matches at least four times
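
One way to see what each quantifier allows is to apply it to the single atom "a" and test strings of zero to six a's. This is a small sketch in Python (an assumed language). The function accepted_counts is a hypothetical helper, and re.fullmatch is used so that the whole string has to be consumed by the pattern.

```python
import re

def accepted_counts(quantifier, max_n=6):
    """Return every n in 0..max_n for which "a" repeated n times
    is matched in full by the pattern "a" + quantifier."""
    pattern = "a" + quantifier
    return [n for n in range(max_n + 1) if re.fullmatch(pattern, "a" * n)]

table = {q: accepted_counts(q) for q in ["?", "*", "+", "{3}", "{2,4}", "{4,}"]}

for quantifier, counts in table.items():
    print(quantifier, counts)
# ? [0, 1]
# * [0, 1, 2, 3, 4, 5, 6]
# + [1, 2, 3, 4, 5, 6]
# {3} [3]
# {2,4} [2, 3, 4]
# {4,} [4, 5, 6]
```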

Examples:

  • /goa?t/ matches "goat" and "got", as well as any text that contains
    either of them.
  • /g.+t/ matches "goat", "goot", and "grant", among others.
  • /g.*t/ matches "gt", "goat", "goot", and "grant", among others.
  • /^\d{3}-\d{4}$/ matches seven-digit US telephone numbers such as
    376-8380 (no extra text allowed).
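
The quantifier examples can be exercised the same way. Below is a short sketch in Python (an assumed language); is_local_number is a hypothetical helper name, not something defined by the lesson.

```python
import re

# "a?" makes the "a" optional, so both spellings are found.
optional_a = re.findall(r"goa?t", "the goat got out")

# ".+" needs at least one character between "g" and "t"; ".*" does not.
plus_on_gt = re.search(r"g.+t", "gt")
star_on_gt = re.search(r"g.*t", "gt")

def is_local_number(text):
    """True if text is three digits, a hyphen, and four digits, with
    nothing before or after. (Python's "$" also tolerates a single
    trailing newline.)"""
    return re.search(r"^\d{3}-\d{4}$", text) is not None

print(optional_a)                        # ['goat', 'got']
print(plus_on_gt, star_on_gt.group())    # None gt
print(is_local_number("376-8380"))       # True
print(is_local_number("376-83801"))      # False
print(is_local_number("call 376-8380"))  # False
```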

Exercises:

  1. Design a pattern to recognize an email address.
  2. Design a pattern to recognize the id portion of a sequence in a
    FASTA file, for example:
    >SEQ_ID_1
    ATGCTGCGCGTGCATGATGCT
    >SEQ_ID_2
    CGCGTGCATGATGCTGCGCGT
