To use regular expressions fully in Guile Scheme or Guix we have to be able to use quantifiers. Quantifiers are metacharacters used to specify that the preceding element must be matched a specified number of times. Each quantifier specifies a different number of required matches. Quantifiers are used all the time when creating regular expressions - it's probably impossible to write a regex without using them!
In this post we'll explore all the quantifier metacharacters. We'll also look at the interval expression format which gives us further control over matching. The quantifiers in Guile Scheme are greedy meaning that they perform the longest match they can. This has implications for creating regexs so we'll also look at some strategies for dealing with greedy matching.
This is part 4 in a series called A side (aside?) of Guile, introducing the Guile Scheme language in the context of using it for packaging in Guix. As regular expressions is such a complex topic it's covered over three posts:
The quantifier metacharacters specify how many times the preceding character (or group) should be matched. There are a few of them:
Metacharacter | Explanation |
---|---|
? | Question mark indicates that the preceding character or group is optional. |
. | Dot matches any single character except the null character. |
* | Asterisk matches the preceding element or group zero or more times. |
+ | Plus sign indicates that the preceding character or group should match once or more. |
{ ... } | Squiggly brackets can be used to specify how many times a character should match. The Grep manual calls them an "interval expression". Effectively it's an alternative format to the other quantifiers as it's very flexible. |
All the quantifiers are greedy which means they'll match as much as possible within the search.
The question mark makes the preceding character or group optional. The classic example is if you want to find alternative spellings such as "colour" or "color":
1 => (map match:substring (list-matches "colou?r" "In the USA it's color, but to Brits it's colour")) 2 ("color" "colour") 3 4 => (map match:substring (list-matches "Nov(ember)?" "It is November at last, Nov 5th has fireworks!")) 5 ("November" "Nov")
The way this works is first the engine searches for each character 'c', 'o', 'l', 'o' and matches those, if it finds 'r' next that this is a match. Alternatively, it finds 'u' which is an optional match, finally it has to then find 'r' to fully match.
This can also be specified using the squiggly brackets interval expression form:
=> (map match:substring (list-matches "colou{0,1}r" "In the USA it's color, but to Brits it's colour")) ("color" "colour")
This says that the preceding character 'u', can be present 0 times or a maximum of once: technically you can do {,1} which is a GNU extension.
The dot will match any character. Unlike the other quantifiers it only matches once.
1 ;; matches any character with Xat 2 => (string-match ".at" "late") 3 #("late" (0 . 3)) 4 5 ;; dot metacharacter: matches any character, so in this case at the start of a word 6 => (map match:substring (list-matches ".at" "the cat, bat and sat late on the mat")) 7 ("cat" "bat" "sat" "lat" "mat")
The first example uses the dot to find any character and then 'at' for a match, consequently the back of 'late' matches this regexp.
It can also be specified using the squiggly brackets interval expression:
=> (map match:substring (list-matches "\\w{1,1}at" "the cat, bat and sat late on the mat")) ("cat" "bat" "sat" "lat" "mat")
In this case it's finding a alphanumeric word character \\w a minimum of once, and a maximum of once - followed by 'at'.
The asterisk matches zero or more occurrences of the preceding element or group. Just like the question mark quantifier it has the effect of making the character (or group) that's being searched for optional.
1 => (string-match "se*" "seen") 2 $1 = #("seen" (0 . 3)) 3 4 ;; this matches because the e is optional, if it's there "zero occurences" 5 ;; then that's fine - so really we're matching for anything with 's' at the start 6 => (string-match "se*" "sail") 7 8 => (map match:substring (list-matches "se*" "seen see seal")) 9 $1 = ("see" "see" "se") 10 11 => (map match:substring (list-matches "sa*l" "sail salve slow slain")) 12 $1 = ("sal" "sl" sl")
In the first example (line 2) the engine matches "see" in "seen" because the asterisk is matching the two e's in the word.
As the asterisk matches zero or more occurrences of the character it means it's removed from the search, so in the second example (line 6) we search for "se*", but it will match any string with "s" at the start. While it isn't a useful search like this it does illustrate the optional nature clearly.
The next example (line 8) shows searching in a string containing multiple matches, where some words contain "se" and others "see". The last example (line 11), shows a slightly more useful search where it searches for "sa*l" which will match "sal" and "sl" - so it matches the "sal" of "salve", "sl" of "slow" and "sl" of "slain".
The alternative is to specify this using a squiggly brackets interval expression:
=> (string-match "sla{0,}" "slain") #("slain" (0 . 3))
This specifies that it should find the 'a' a minimum of zero times (so it's optional), and an unlimited maximum number of times.
The plus matches one or more occurrences of the preceding character or group:
1 => (string-match "se+" "seen") 2 $1 = #("seen" (0 . 3)) 3 4 => (string-match "se+" "sail") 5 $1 = #f 6 7 => (map match:substring (list-matches "se+" "seen see seal")) 8 $1 = ("see" "see" "se") 9 10 (map match:substring (list-matches "sa+l" "sail salve slow slain")) 11 $1 = ("sal")
The plus means that a match on the character is required (at least once). In example 1 (line 1) it matches the "se" part of "seen". Notice that in example 2 (line 4) it's different from using the asterisk quantifier as it doesn't match, this is because the "e" must be in the search string, and it's not there in "sail".
Example 3 (line 7) is the same as using the asterisk quantifier. Example 4 (line 10) is different since the search requires that the search string has the "a", so it's searching for "sal" (or multiple a's), consequently it matches the part of "salve", but doesn't match "slow" etc.
Again, the squiggly brackets can be used to specify this:
=> (map match:substring (list-matches "ste{1,}" "peal deal steal steel paste")) ("ste" "stee" "ste")
The brackets specify that the preceding character 'e' must be there a minimum of once, and a maximum of unlimited (as it's blank).
We've seen the various quantifiers shortcuts (question mark, plus, etc), the alternative syntax is called an interval expression by the Grep manual. It's specified in brackets after an element, and it's the minimum and maximum times the preceding item will match:
Interval expression | Explanation |
---|---|
{n} | preceding item is matched exactly n times |
{min,max} | preceding items is matched min times, but not more than max times |
{min,} | preceding items is matched min or more times, the missing max is a shortcut for 'unlimited'. |
{,max} | preceding item is matched max times, the missing min is a shortcut for 'unlimited' |
Lets see some examples:
1 ;; find abbbc precisely 2 => (map match:substring (list-matches "ab{3}c" "ac abc abbc abbbcde")) 3 ("abbbc") 4 5 ;; find 'a' followed by at least two b's and a maximum of three 6 => (map match:substring (list-matches "ab{2,3}c" "ac abc abbc abbbcde")) 7 ("abbc" "abbbc")
The first one (line 2) specifies that the 'b' must be matched precisely three times. The second example (line 6) specifies that the 'b' can appear a minimum of twice, and a maximum of three times so this matches both 'abbc' and 'abbbc'.
Guile's quantifiers are greedy meaning that they will match as much as possible. This can be quite surprising as some common approaches won't work and there isn't a way to turn it off.
Let's look at a simple example using a quantifier and the way Guile's regex engine matches as much as possible:
1 => (match:substring (string-match "a+" "aaaa")) 2 ("aaaa" (0 . 4)) 3 4 => (match:substring (string-match "a+" "aaaa")) 5 "aaaa"
In this case (line 1) we've told it to match one or more occurrences of the letter "a", so notice it continues to the end of the string since using the plus quantifier matches once "or more".
The common strategy for avoiding this is to use lazy matching (more descriptively called 'reluctant'), but this isn't possible in Guile, the manual says "there is no support for non-greedy variants" [1]. So to be clear you can't convert to lazy matching by using a question mark after a quantifier - this does not work in Guile.
A good example of the challenge of greedy matching is say you want to search for each html tag as you're going to change it: this example is from Regular-expression.info's tutorial. A simple regexp might be this:
=> (match:substring (string-match "<.+>" "This is a <em>first</em> test")) "<em>first</em>"
However, this matches past the first end bracket because the quantifiers are greedy. It repeats the .+ as much as it can so it goes past the first > and only when it goes beyond the second one does it fail since the space is not a character.
One option is to negate the search:
;; An odd way to create something similar is to match all characters except for the ;; one that terminates the search (match:substring (string-match "<[^>]*>" "This is a <em>first</em> test")) "<em>"
This works because the engine starts by matching the first <. Next the [^>]* part means it searches for anything which is NOT >, so it matches the 'em' part of the html tag. It then moves right to the > which it's been told not to match, so this [^>]* greedy section of the regex finishes. The next part is that it should match a > which it does. So the full match is "<em>".
Other strategies are to use capture groups or to use character sets to be more precise about the match.
=> (match:substring (string-match "<\\w+>" "This is a <em>first</em> test")) $1 = "<em>"
This character set regex works by being more specific about what matches, as \\w+ will only match alphanumeric characters and the underscore. See the previous post for more details on character sets.
Capture groups are explained in the next post, but in short they let you split the regexp into smaller units which are each captured separately. A regexp always captures the correct number of capture groups: if you tell it to find two capture groups in the string, it will return two groups. This can be used to limit greediness.
=> (string-match "(a+)." "aaaa") $1 = #("aaaa" (0 . 4) (0 . 3))
The (a+) part tells Guile to find one or more instances of "a" and to put them into a capture group (the brackets). The dot tells it to find at least one other character. In order to satisfy this requirement the engine has to give up the last "a", so the capture group has three a's in it. To retrieve what was matched in the group you access the match structure's submatch like this:
=> (match:substring (string-match "(a+)." "aaaa") 1) "aaa"
A common situation where greediness comes into play is when matching an entire word, for this it's best to use a character class and special backslash expressions. Lets say we want to match on the first word in a string:
1 ;; using ".*+" will match everything 2 => (string-match ".*" "Ganymede is the largest moon in the Solar System") 3 #("Ganymede is the largest moon in the Solar System" (0 . 48)) 4 5 ;; use a character class for word, and divide on the space at the end of the word 6 => (match:substring (string-match "\\w*\\>" "Ganymede is the largest moon in the Solar System")) 7 "Ganymede"
The problem is that the quantifier .* is greedy so it matches the entire string. Instead we can be more precise about matching by using the special backslash shortcut \\w which matches on a word character (any alphanumeric character, along with underscore). The asterisk quantifier tells the engine to find this in zero or more occurrences so it continues to match multiple characters. When it gets to the end of the word, the special backslash shortcut \\> matches which is for an empty string at the end of a word.
We can also use this to find multiple matches within a string:
;; the * matches zero or more times - s.* will match sN or jus s ;; because Guile uses the greedy operator this just matches the WHOLE string => (list-matches "s.*" "sawn, sown, seed, translate") (#("sawn, sown, seed, translate" (0 . 27))) ;; find all instances of words starting with 's' (map match:substring (list-matches "\\<s\\w+\\>" "sawn, sown, seed, translate")) ("sawn" "sown" "seed")
In this example the regexp wants to match whole words beginning with 's', so we don't want to match 'slate' inside translate. To do this the regexp uses the special backslash shortcut of \\< to find the empty string at the beginning of a word; then the next character has to be 's'. The rest is the same as the previous example.
[1] | See the Regexp Functions section |
Quantifiers are quite easy to understand, but the impact of greedy matching can lead to some unexpected situations. Here's some resources to dig into this topic further:
Between this post on quantifiers and the previous one on the various character classes we're well on our way through regular expressions in Guile! Personally I like the interval expression format as it's very explicit and I'm forever checking if I've interpreted a quantifier correctly.
In the next post we'll look at how we can group and capture sections of a regular expression.
Do you have a favourite quantifier, or a trick for dealing with Guile Scheme's greedy matching? Get in touch by email (details on the About page), or I can be reached on Mastodon, futurile@mastodon.social.