Lab 17 - Regular Expressions

Due by 11:59pm on 2024-03-19.

Starter Files

Download lab17.zip. Inside the archive, you will find starter files for the questions in this lab.

Topics

Regular Expressions

Regular expressions are a way to describe sets of strings that meet certain criteria, and are incredibly useful for pattern matching.

The simplest regular expression is one that matches a sequence of characters, like aardvark to match any "aardvark" substrings in a string.
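As a minimal sketch of a literal pattern, the snippet below uses Python's re module (introduced later in this lab). The sample sentence and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text containing the literal sequence twice.
text = "An aardvark met another aardvark."

# A literal pattern matches exactly that sequence of characters.
matches = re.findall(r"aardvark", text)
print(matches)  # ['aardvark', 'aardvark']
```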

However, you typically want to look for more interesting patterns. We recommend using an online tool like regexr.com or regex101.com for trying out patterns, since you'll get instant feedback on the match results.

What would RegEx Match?

For each regular expression below (labeled Q:), choose the description of the strings it would fully match. If necessary, refer to the pattern reference further down, starting with "Character Classes".

Q: #[a-f0-9]{6}

Choose the number of the correct choice:
0) A hexadecimal color code with 3 letters and 3 numbers
1) A hexadecimal color code that starts with letters and ends with numbers, like #gg1234
2) Any 6-digit hexadecimal color code, like #fdb515
3) Any hexadecimal color code with 0-6 digits

Q: (fizz(buzz|)|buzz)

Choose the number of the correct choice:
0) Only fizzbuzz or buzz
1) Only fizzbuzzbuzz
2) Only fizz
3) Only fizzbuzz, fizz, and buzz
4) Only fizzbuzz

Q: [-+]?\d*\.?\d+

Choose the number of the correct choice:
0) Only signed numbers like +1000, -1.5
1) Only signed or unsigned integers like +1000, -33
2) Signed or unsigned numbers like +1000, -1.5, .051
3) Only unsigned numbers like 0.051


Character Classes

A character class makes it possible to search for any one of a set of characters. You can specify the set or use pre-defined sets.

Class	Description
`[abc]`	Matches a, b, or c
`[a-z]`	Matches any character between a and z
`[^A-Z]`	Matches any character that is not between A and Z.
`\w`	Matches any "word" character. For ASCII text, equivalent to `[A-Za-z0-9_]`
`\d`	Matches any digit. For ASCII text, equivalent to `[0-9]`.
`[0-9]`	Matches a single digit in the range 0 - 9. For ASCII text, equivalent to `\d`
`\s`	Matches any whitespace character (spaces, tabs, line breaks).
`.`	Matches any character besides new line.

Character classes can be combined, like in [a-zA-Z0-9].
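To see the classes in action, here is a small sketch using re.findall, which returns every match as a list. The sample strings and variable names are invented for this example; try your own strings to check your understanding.

```python
import re

# Each class matches ONE character at a time, so findall returns single characters.
abc_hits = re.findall(r"[abc]", "cabbage")             # ['c', 'a', 'b', 'b', 'a']
digit_hits = re.findall(r"\d", "Room 42b, Floor_3")    # ['4', '2', '3']
word_hits = re.findall(r"\w", "a_1 !")                 # ['a', '_', '1']
non_upper_hits = re.findall(r"[^A-Z]", "AbC d")        # ['b', ' ', 'd']
combined_hits = re.findall(r"[a-zA-Z0-9]", "x-9 Z_")   # ['x', '9', 'Z']
dot_hits = re.findall(r".", "a\nb")                    # ['a', 'b'] (the newline is skipped)

print(abc_hits, digit_hits, word_hits, non_upper_hits, combined_hits, dot_hits)
```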

Combining patterns

There are multiple ways to combine patterns together in regular expressions.

Combination	Description
`AB`	A match for A followed immediately by one for B. Example: `x[:;]y` matches "x:y" or "x;y"
`A|B`	Matches either A or B. Example: `\d+|Inf` matches either a sequence containing 1 or more digits or "Inf"
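The two combinations from the table can be tried out with a short sketch like this one. The test strings and variable names are assumptions chosen for illustration.

```python
import re

# Sequence: "x", then one of ":" or ";", then "y".
sequence_hits = re.findall(r"x[:;]y", "x:y x;y x,y")
print(sequence_hits)  # ['x:y', 'x;y']

# Alternation: either a run of digits or the literal text "Inf".
either_hits = re.findall(r"\d+|Inf", "12 Inf 7 NaN")
print(either_hits)  # ['12', 'Inf', '7']
```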

A pattern can be followed by one of these quantifiers to specify how many instances of the pattern can occur.

Quantifier	Description
`*`	0 or more occurrences of the preceding pattern. Example: `[a-z]*` matches any sequence of lower-case letters or the empty string.
`+`	1 or more occurrences of the preceding pattern. Example: `\d+` matches any non-empty sequence of digits.
`?`	0 or 1 occurrences of the preceding pattern. Example: `[-+]?` matches an optional sign.
`{1,3}`	Matches the specified quantity of the preceding pattern. {1,3} will match from 1 to 3 instances. {3} will match exactly 3 instances. {3,} will match 3 or more instances. Example: `\d{5,6}` matches either 5 or 6 digit numbers.
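One way to experiment with quantifiers is to check whether a pattern matches an entire string. The sketch below assumes a small helper named fully_matches (not part of the starter files) built on re.fullmatch.

```python
import re

def fully_matches(pattern, string):
    """Hypothetical helper: True if pattern matches the whole string."""
    return re.fullmatch(pattern, string) is not None

print(fully_matches(r"[a-z]*", ""))        # True:  * allows zero occurrences
print(fully_matches(r"\d+", ""))           # False: + needs at least one digit
print(fully_matches(r"[-+]?\d+", "-33"))   # True:  the sign is optional
print(fully_matches(r"[-+]?\d+", "33"))    # True
print(fully_matches(r"\d{5,6}", "12345"))  # True:  5 digits is within 5 to 6
print(fully_matches(r"\d{5,6}", "1234"))   # False: only 4 digits
print(fully_matches(r"\d{3,}", "1234567")) # True:  3 or more digits
```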

Groups

Parentheses create groups, much as they do in arithmetic expressions. For example, (Mahna)+ matches strings with 1 or more "Mahna", like "MahnaMahna". Without the parentheses, Mahna+ would match strings with "Mahn" followed by 1 or more "a" characters, like "Mahnaaaa".
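The difference between the grouped and ungrouped patterns can be checked with a quick sketch. It uses re.fullmatch, which returns None unless the whole string matches; the variable names are made up for this example.

```python
import re

# With the group, + repeats the whole word "Mahna".
grouped_repeat = re.fullmatch(r"(Mahna)+", "MahnaMahna")
# Without the group, + applies only to the final "a".
ungrouped_repeat = re.fullmatch(r"Mahna+", "MahnaMahna")
ungrouped_stretch = re.fullmatch(r"Mahna+", "Mahnaaaa")

print(grouped_repeat is not None)     # True
print(ungrouped_repeat is not None)   # False
print(ungrouped_stretch is not None)  # True
```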

Anchors

Anchor	Description
`^`	Matches the beginning of a string. Example: `^(I|You)` matches I or You at the start of a string.
`$`	Normally matches the empty string at the end of a string or just before a newline at the end of a string. Example: `(\.edu|\.org|\.com)$` matches .edu, .org, or .com at the end of a string.
`\b`	Matches a "word boundary", the beginning or end of a word. Example: `s\b` matches s characters at the end of words.
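Here is a short sketch of the three anchors using the examples from the table. The input strings and variable names are invented for illustration.

```python
import re

# ^ only allows a match at the very start of the string.
starts_with_you = bool(re.search(r"^(I|You)", "You are here"))
you_in_middle = bool(re.search(r"^(I|You)", "Are You here"))
print(starts_with_you, you_in_middle)  # True False

# $ only allows a match at the end of the string.
domain_at_end = re.search(r"(\.edu|\.org|\.com)$", "cs.byu.edu")
domain_in_middle = re.search(r"(\.edu|\.org|\.com)$", "byu.edu/cs")
print(domain_at_end.group(0), domain_in_middle)  # .edu None

# \b requires a word boundary: the s in "chase" is skipped.
word_final_s = re.findall(r"s\b", "dogs chase cats")
print(word_final_s)  # ['s', 's']
```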

Special characters

The following special characters are used above to denote types of patterns:

\ ( ) [ ] { } + * ? | $ ^ .

That means if you actually want to match one of those characters, you have to escape it using a backslash. For example, \(1\+3\) matches "(1+3)".
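The effect of escaping can be seen in a small sketch like the following. It also shows re.escape, a standard-library function that adds the backslashes for you; the variable names are assumptions for this example.

```python
import re

# Escaped: parentheses and plus are treated as literal characters.
escaped_hit = re.search(r"\(1\+3\)", "(1+3)")
print(escaped_hit.group(0))  # (1+3)

# Unescaped: ( ) form a group and + means "one or more 1s", so the
# pattern looks for text like "13" or "113" and finds nothing here.
unescaped_hit = re.search(r"(1+3)", "(1+3)")
print(unescaped_hit)  # None

# re.escape builds the escaped pattern from a literal string.
auto_escaped = re.escape("(1+3)")
print(auto_escaped)  # \(1\+3\)
```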

Using regular expressions in Python

Many programming languages have built-in functions for matching strings to regular expressions. We'll use the Python re module in CS 111, but you can also use similar functionality in SQL, JavaScript, Excel, shell scripting, etc.

The re.search function searches for a pattern anywhere in a string:

re.search(r"(Mahna)+", "Mahna Mahna Ba Dee Bedebe")

The call returns a match object, which is truthy in Python and can be inspected to find the matching strings. If the pattern is not found anywhere in the string, it returns None instead.
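As a sketch of how a match object might be inspected, the snippet below reuses the call above. The variable names are made up; group and span are standard match object methods.

```python
import re

match = re.search(r"(Mahna)+", "Mahna Mahna Ba Dee Bedebe")
if match:  # a match object is truthy
    whole = match.group(0)       # the full text that matched
    last_group = match.group(1)  # text captured by the first group
    position = match.span()      # (start, end) indices of the match
    print(whole, last_group, position)  # Mahna Mahna (0, 5)

# Only the first "Mahna" matches, because a space separates the two words.

missing = re.search(r"(Mahna)+", "Ba Dee Bedebe")
print(missing)  # None, which is falsy
```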

For more details, please consult the re module documentation or the re tutorial.

Want more? Try your hand at regex golf or read up on some powerful regex features at regular-expressions.info. I'm especially fond of non-capturing groups, backreferences, and lookaround.

Required Questions

Q1: CS Classes

On reddit.com, there is an /r/byu subreddit for discussions about everything BYU. However, there are so many CS-related posts that they are auto-tagged so that readers can choose to ignore them or read only them.

Write a regular expression that finds strings that resemble a CS class. The regular expression should match course names of the form “CS111”, “CS 111”, and “C S 111” with optional spaces after the C and the S. It should also match an optional “R” at the end of the number.

import re

def cs_classes(post):
    """
    Returns True if post contains a string of the form
    “CS111”, “CS 111”, or “C S 111” (with optional spaces after the C and the S),
    and False otherwise. An optional “R” at the end of the number is also matched.

    >>> cs_classes("Is it unreasonable to take CS111 in the summer?")
    True
    >>> cs_classes("how do I become a TA for C S 111? That job sounds so fun!")
    True
    >>> cs_classes("Can I take ECON101 as a CS major?")
    False
    >>> cs_classes("Should I do the lab lites or regular labs in EE16A?")
    False
    >>> cs_classes("What are some good CS upper division courses? I was thinking about C S111R")
    True
    """
    return bool(re.search(__________, post))

Q2: Roman Numerals

Write a regular expression that finds any string of letters that resembles a Roman numeral and isn't part of another word. A Roman numeral is made up of the letters I, V, X, L, C, D, M and is at least one letter long.

For the purposes of this problem, don't worry about whether or not a Roman numeral is valid. For example, "VIIIII" is not a Roman numeral, but it is fine if your regex matches it.

import re

def roman_numerals(text):
    """
    Finds any string of letters that could be a Roman numeral
    (made up of the letters I, V, X, L, C, D, M).

    >>> roman_numerals("Sir Richard IIV, can you tell Richard VI that Richard IV is on the phone?")
    ['IIV', 'VI', 'IV']
    >>> roman_numerals("My TODOs: I. Groceries II. Learn how to count in Roman IV. Profit")
    ['I', 'II', 'IV']
    >>> roman_numerals("I. Act 1 II. Act 2 III. Act 3 IV. Act 4 V. Act 5")
    ['I', 'II', 'III', 'IV', 'V']
    >>> roman_numerals("Let's play Civ VII")
    ['VII']
    >>> roman_numerals("i love vi so much more than emacs.")
    []
    >>> roman_numerals("she loves ALL editors equally.")
    []
    """
    return re.findall(__________, text)

Hints:

Use regex101.com or regexr.com for trying out patterns.

Anchors may be useful here.

Q3: Time for Times

You're given a body of text and told that within it are some times. Write a regular expression which, for a few examples, would match the following:

['05:24', '7:23', '23:59', '12:22', '00:00']

but would not match these invalid "times":

['05:64', '70:23']

import re

def match_time(text):
    """
    >>> match_time("At 05:24AM, I had sesame bagels with cream cheese before my coffee at 7:23.")
    ['05:24AM', '7:23']
    >>> match_time("At 23:59 I was sound asleep as the time turned to 00:00.")
    ['23:59', '00:00']
    >>> match_time("Mix water in a 1:2 ratio with chicken stock.")
    []
    >>> match_time("At 2:00 I pinged 127.0.0.1:80.")
    ['2:00']
    """
    return re.findall(__________, text)

Q4: Most Common Area Code

Write a function which takes in a body of text and finds the most common area code. Area codes must be part of a valid phone number.

To solve this problem, we will first write a regular expression which finds valid phone numbers and captures the area code. See the docstring of area_codes for specifics on what qualifies as a valid phone number.

import re

def area_codes(text):
    """
    Finds all phone numbers in text and captures the area code. Phone numbers
    have 10 digits total and may have parentheses around the area code, and
    hyphens or spaces after the third and sixth digits.

    >>> area_codes('(111) 111 1111, 1234567890 and 123 345 6789 should be matched.')
    ['111', '123', '123']
    >>> area_codes("1234567890 should, but 54321 and 654 456 78901 should not match")
    ['123']
    >>> area_codes("no matches for 12 3456 7890 or 09876-54321")
    []
    """
    return re.findall(__________, text)

Hint: You may find non-capturing groups or capturing groups helpful to use for this question.

For example, if we are trying to extract 'pple' from 'apple' using capturing groups:
>>> import re
>>> string = "apple"
>>> regex_pattern = r"a(pple)" # Anything within the parentheses is what will be returned
>>> re.findall(regex_pattern, string)
['pple']
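To compare the two kinds of groups, here is a small sketch of how re.findall behaves with each. The strings and variable names are invented for illustration and are unrelated to phone numbers.

```python
import re

# One capturing group: findall returns only the captured part.
captured = re.findall(r"a(pple)", "apple")
print(captured)  # ['pple']

# Non-capturing group (?:...): it groups without capturing,
# so findall returns the whole match.
whole_match = re.findall(r"a(?:pple)", "apple")
print(whole_match)  # ['apple']

# Mixing both: group with (?:...) and capture only what you need.
mixed = re.findall(r"(?:ap|am)(ple)", "apple ample")
print(mixed)  # ['ple', 'ple']

# Two capturing groups: findall returns a tuple per match.
pairs = re.findall(r"(a)(pple)", "apple")
print(pairs)  # [('a', 'pple')]
```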

Now that we can get an area code, we just need to find the most common area code from a list. You may find the list.count method useful.
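As a quick reminder of how list.count behaves, here is a sketch on an unrelated, made-up list. The names votes, red_count, and purple_count are assumptions for this example.

```python
# Hypothetical data, unrelated to area codes.
votes = ["red", "blue", "red", "green", "red", "blue"]

# count returns how many times a value appears in the list.
red_count = votes.count("red")
purple_count = votes.count("purple")  # values that never appear count as 0

print(red_count, purple_count)  # 3 0
```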

def most_common_code(text):
    """
    Takes in an input string which contains at least one phone number (and
    may contain more) and returns the most common area code among all phone
    numbers in the input. If there are multiple area codes with the same
    frequency, return the first one that appears in the input text.

    >>> most_common_code('(501) 333 3333')
    '501'
    >>> input_text = '''
    ... (123) 000 1234 and 12454, 098-123-0941, 123 451 0951 and 410-501-3021 has
    ... some phone numbers. '''
    >>> most_common_code(input_text)
    '123'
    """
    # Write your code here

Submit

Submit your lab17.py file to Gradescope through the window on the Canvas assignment page.