A side of Guile: Regular expressions in Guile Scheme (Groups and escaping)

When using regular expressions in Guile Scheme or Guix it's useful to be able to pick one match from several options, which is the role of alternation. Capturing groups grab part of a match so that we can use it later, often to replace some text during a substitution, and they give us a lot of precision. Finally, to use regexps successfully we have to handle quoting (backslash escaping) so that characters are interpreted correctly. This is the source of a lot of confusion and often makes regexps look like a string salad as we end up adding more and more backslashes!

This is part 5 in a series called A side (aside?) of Guile, introducing the Guile Scheme language in the context of using it for packaging in Guix. As regular expressions are such a complex topic it's covered in three posts:

Regular expressions in Guile Scheme (part 1): introduces Guile's POSIX Extended Regular Expressions (ERE), the various metacharacters, character sets, POSIX character classes and the backslash shortcuts.
Regular expressions in Guile Scheme (part 2): covers quantifiers and greedy matching.

String anchors

String anchors are a different kind of match because they match a position within the string, rather than a character. If you've used Vim you'll be quite used to the concept:

String anchor	Explanation
^	Start of the string
$	End of the string
\`	Backtick is start of the string, same as ^
\'	Single quote is the end of the string, same as $
\b	See POSIX character classes for word boundaries

Let's see some examples:

 1 => (string-match "^[HC]at" "Cat fur was thick on the hat")
 2 #("Cat fur was thick on the hat" (0 . 3))
 3 
 4 => (string-match "^[HC]at" "Later that day, Cat fur was thick on the hat")
 5 ;; returns #f because the string starts with "Lat", not "Hat" or "Cat"
 6 
 7 ;; [hc]at$ matches hat or cat at the end of the line
 8 => (string-match "[hc]at$" "Later that day, Cat fur was thick on the hat")
 9 #("Later that day, Cat fur was thick on the hat" (41 . 44))

In the first example (line 1) the regexp searches for Hat or Cat at the start of the string. The ^ anchors the match to the start of the string, [HC] matches either H or C, and that character has to be followed by 'at'. The second example (line 4) demonstrates the anchor at work: there is no match, because the only Cat is later in the string. The third example (line 8) shows the same idea, but this time the $ anchor ties the match to the end of the string.
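Combining both anchors is a handy way to insist that the pattern covers the whole string. Here's a minimal sketch of that idea; the helper name hat-or-cat? is my own invention for illustration, not something from Guile:

```scheme
(use-modules (ice-9 regex))

;; hat-or-cat? is a hypothetical helper name used only for this sketch.
;; With ^ at the front and $ at the end nothing may come before or
;; after the word, so the pattern has to cover the entire string.
(define (hat-or-cat? str)
  ;; string-match returns a match object or #f, turn that into a boolean
  (if (string-match "^[HC]at$" str) #t #f))

(hat-or-cat? "Cat")      ; => #t
(hat-or-cat? "Hat")      ; => #t
(hat-or-cat? "Cat fur")  ; => #f, text follows the word
(hat-or-cat? "A Hat")    ; => #f, text precedes the word
```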

Word boundaries are also string anchors:

1 => (map match:substring (list-matches "\\bS\\w+" "Sapphire is my favourite"))
2 ("Sapphire")
3 
4 => (map match:substring (list-matches "\\w+$" "Sapphire is my favourite"))
5 ("favourite")

In the first example (line 1), \\b finds the start of a word, then the regexp looks for 'S', and \\w+ matches the rest of the word (alphanumeric characters and underscore). The second example (line 4) is the same idea, but this time \\w+ is tied to the end of the string with $, so it returns the last word.
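To see what the word boundary buys us, this small sketch compares a bare pattern with one wrapped in \\b on both sides. The helper all-matches and the sample text are made up for the example, and it assumes a GNU libc build of Guile where \\b is available:

```scheme
(use-modules (ice-9 regex))

;; Sample text invented for this sketch
(define text "cat concat cats")

;; all-matches is a hypothetical helper: every matched substring, in order
(define (all-matches pattern str)
  (map match:substring (list-matches pattern str)))

;; Without boundaries "cat" is also found inside "concat" and "cats"
(all-matches "cat" text)                           ; => ("cat" "cat" "cat")

;; With \\b on both sides only the standalone word matches
(all-matches "\\bcat\\b" text)                     ; => ("cat")
(map match:start (list-matches "\\bcat\\b" text))  ; => (0)
```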

Alternation

Use alternation to match a single regular expression from several options. We can think of it as an 'OR': it matches one option or the alternative.

=> (map match:substring (list-matches "carrots|onions" "I like carrots and onions, but not leeks"))
("carrots" "onions")

According to the RegexBuddy page on alternation there are performance implications if you use the GNU POSIX variant with multiple quantifiers, or a combination of quantifiers and alternation. The standard requires the longest match, which means the regex engine has to try all the combinations to find it.
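The longest-match rule is easy to see with two alternatives where one is a prefix of the other. This is a small sketch of my own (the pattern and the name longest are invented), and it assumes the POSIX leftmost-longest behaviour that Guile gets from the C library:

```scheme
(use-modules (ice-9 regex))

;; Both alternatives can match at position 0. A POSIX engine reports the
;; longest one, whatever order the alternatives are written in.
(define longest (string-match "on|onions" "onions"))

(match:substring longest)  ; => "onions"
(match:end longest)        ; => 6
```

An engine that takes the first alternative that works (Perl style) would stop at "on" instead, which is one reason patterns don't always carry over between implementations.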

Grouping and Capturing

Parentheses are used to group parts of a regular expression together. A group can be used to apply a quantifier or alternation to part of the regex. It also creates a submatch for the portion of the string that the group matched. I find this really useful for splitting up regexes and simplifying how I think about them.
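As a rough illustration of why grouping matters for alternation and quantifiers, here's a sketch using made-up strings. Alternation binds more loosely than anything else, so without a group each anchor belongs to only one of the alternatives:

```scheme
(use-modules (ice-9 regex))

;; Without a group this reads as "^cat" OR "dog$", so "catalog" matches
;; on its first three characters.
(define ungrouped (string-match "^cat|dog$" "catalog"))
(match:substring ungrouped)             ; => "cat"

;; With a group both anchors apply to the whole alternation
(string-match "^(cat|dog)$" "catalog")  ; => #f
(match:substring (string-match "^(cat|dog)$" "dog"))  ; => "dog"

;; A quantifier after a group repeats the whole group, not just one character
(match:substring (string-match "(ab)+" "xxababab"))   ; => "ababab"
```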

The other useful property is that a group can be captured and returned as the result of the regex: this is commonly used in Guix to replace part of a string.

To illustrate this let's say that we have an ordering system where we want to use a regex to retrieve parts of the order. The order number is something like A458-20251201: the first letter is either A or B, then a three digit number, then a hyphen and finally another number of indeterminate length. We can create a regex like this:

(string-match "^[AB][[:digit:]]{3}-[[:digit:]]*" "A458-20251201")
#("A458-20251201" (0 . 13))

(match:substring (string-match "^[AB][[:digit:]]{3}-[[:digit:]]*" "A458-20251201"))
"A458-20251201"

It uses the ^ anchor to find the start of the string, then [AB] looks for either A or B. Next it matches [[:digit:]]{3} three digits, then the hyphen and finally [[:digit:]]* matches all the other digits.

If we wanted to access the two parts of the order number separately we can use groups like this:

=> (string-match "^([AB][[:digit:]]{3})-([[:digit:]]*)" "A458-20251201")
#("A458-20251201" (0 . 13) (0 . 4) (5 . 13))

This is exactly the same regexp, but this time we've grouped sections of it with parentheses, which creates submatches. The result shows that it's captured the whole order number in (0 . 13), followed by the start and end positions of the two groups we defined, (0 . 4) and (5 . 13).

We can access these by group number:

=> (match:substring (string-match "^([AB][[:digit:]]{3})-([[:digit:]]*)" "A458-20251201") 1)
"A458"

The match structure contains everything that was matched. Index 0 is the entire match (so the entire string in this case), the first match group is index 1, the second group is index 2 and so forth. We can access them using match:substring and providing the number of the group that we want.
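Binding the match object to a name saves running the regexp again for every group. A short sketch using the same order number (the variable name order-match is my own), which also shows match:start and match:end from (ice-9 regex) returning the positions we saw in the result vector:

```scheme
(use-modules (ice-9 regex))

;; order-match is a name chosen for this sketch
(define order-match
  (string-match "^([AB][[:digit:]]{3})-([[:digit:]]*)" "A458-20251201"))

(match:substring order-match 0)  ; => "A458-20251201", the entire match
(match:substring order-match 1)  ; => "A458"
(match:substring order-match 2)  ; => "20251201"

;; The offsets of a group are available too
(match:start order-match 2)      ; => 5
(match:end order-match 2)        ; => 13
```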

We can see how many subexpressions the match contains with:

=> (match:count (string-match "^([AB][[:digit:]]{3})-([[:digit:]]*)" "A458-20251201"))
3

The match:count procedure returns the number of parenthesised subexpressions in the match object. The entire regular expression match counts as the first one, and any submatches that failed to match are also counted.
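The point about failed submatches is easiest to see with two groups where only one can take part in the match. This is a contrived sketch (the pattern and the name either are invented for illustration):

```scheme
(use-modules (ice-9 regex))

;; Only the second alternative can match the string "b"
(define either (string-match "(a)|(b)" "b"))

;; Whole match + group 1 + group 2, even though group 1 matched nothing
(match:count either)        ; => 3

(match:substring either 1)  ; => #f, this group did not take part
(match:substring either 2)  ; => "b"
```

So it's worth checking for #f before passing a group to something like string-append.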

Note that Guix's substitute* macro lets you bind the whole match and each match group to variable names. This is very useful for manipulating a string. For example, create a file called orders.txt and put A458-20251201 in it. We can use substitute* to alter it like this:

;; guix shell guile guile-readline guile-colorized -- guix repl
=> (use-modules (guix) (guix build utils))

(substitute* "orders.txt"
  (("^([AB][[:digit:]]{3})-([[:digit:]]*)" all part1 part2)
   (string-append part1 "-v2-" part2)))

This calls substitute* on the orders.txt file. The regexp splits the matching text into the two parenthesised subexpressions: the whole match is bound to the variable 'all' (the names can be anything you want), and the subexpressions are bound to part1 and part2. The string-append uses these variables to build the replacement text.

There's a similar function in guile-lib called match-bind for more general usage.
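If you just want to try the same rewrite on a string in plain Guile, without Guix or guile-lib, regexp-substitute/global from (ice-9 regex) can do it. This is only a sketch of the idea: the names order-rx and add-v2 are mine, and it works on an in-memory string rather than editing a file the way substitute* does.

```scheme
(use-modules (ice-9 regex))

;; order-rx and add-v2 are hypothetical names for this sketch
(define order-rx "^([AB][[:digit:]]{3})-([[:digit:]]*)")

(define (add-v2 order)
  ;; #f as the port means "return a string".
  ;; 'pre is the text before the match, 1 and 2 are the capture groups,
  ;; and 'post carries on with the text after the match.
  (regexp-substitute/global #f order-rx order
                            'pre 1 "-v2-" 2 'post))

(add-v2 "A458-20251201")  ; => "A458-v2-20251201"
(add-v2 "no order here")  ; => "no order here", returned unchanged
```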

One thing to be aware of is that many regexp implementations can group without capturing (typically written (?: ... )), but that's not the case for Guile's implementation. Also, remember that we covered using groups for lazy (reluctant) matching in the second post on regexs.

Grouping: back-references

Guile's regexp also supports back-references, so the contents of a capturing group can be used in a part of the regexp. If you're packaging for Guix in most cases what you're looking for is the subexpression naming capability that's part of the substitute* procedure.

At a basic level a back-reference is a way of referring back to the result of an earlier part of the regexp. Simply, the regexp captures a sub-expression by using parentheses, and that result can be accessed with a backslash and a number. It's useful wherever you have duplication or repetition within a string that's being searched for.

The best example is finding duplicate words, in fact it's the only example I can find!

=> (string-match "\\b(\\w+)\\s+\\1" "Paris in the the spring")
$51 = #("Paris in the the spring" (9 . 16) (9 . 12))

The regexp starts by looking for a word boundary using \\b and then there's a group ( ... ), which captures a run of alphanumeric characters using \\w+. After the end of the word it expects whitespace, using \\s+. The first capture group now holds a word, and \\1 refers to that result (the actual word), so the regexp checks whether the same text appears again. Consequently, as it moves through the string it eventually captures "the" and then checks for "the" again - which in this case it finds.
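Back-references work on any captured text, not just whole words. As a second small sketch (the word and the name doubled are just illustrations), this finds doubled letters by capturing one letter and immediately asking for it again. It assumes the same GNU libc behaviour that the duplicate-word example relies on:

```scheme
(use-modules (ice-9 regex))

;; ([[:alpha:]]) captures a single letter and \\1 requires that same
;; letter to follow straight away.
(define doubled
  (map match:substring
       (list-matches "([[:alpha:]])\\1" "bookkeeper")))

doubled  ; => ("oo" "kk" "ee")
```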

Summarising with fold

We've already seen match:substring used with the map procedure to return all matches, and how match:count reports the number of subexpressions in a single match. Another scenario is where we want to summarise all the matches.

Imagine that we have a string with the order backlog, and in this particular case we want to count all the "A" product orders that are in the backlog:

1  => (define order-backlog "A425-20250312, A440-20250906, B200-20251020, A458-20251201")
2  => (define waiting-cat-a (make-regexp "\\<A\\w+\\>-[[:digit:]]*"))
3 
4  => (fold-matches waiting-cat-a order-backlog 0 (lambda (match count) (+ 1 count)))
5  $ = 3

Line 2 uses make-regexp to compile the regexp, which means it can be reused multiple times without being recompiled. Then we call the fold-matches procedure with an anonymous function. As with other fold functions we provide an initial value (0) for the count variable that's passed to the anonymous function.
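The accumulator doesn't have to be a number. As a sketch of the same technique, this folds over the backlog and collects the date part of every "A" order into a list. The regexp and the names cat-a-order and cat-a-dates are my own, and the regexp is a slightly different one from above so that the date gets its own capture group:

```scheme
(use-modules (ice-9 regex))

(define order-backlog
  "A425-20250312, A440-20250906, B200-20251020, A458-20251201")

;; Group 1 is the product number, group 2 is the date part
(define cat-a-order
  (make-regexp "A([[:digit:]]{3})-([[:digit:]]+)"))

(define cat-a-dates
  ;; cons builds the list back to front, so reverse it at the end
  (reverse
   (fold-matches cat-a-order order-backlog '()
                 (lambda (m acc)
                   (cons (match:substring m 2) acc)))))

cat-a-dates  ; => ("20250312" "20250906" "20251201")
```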

Backslash escaping

Now we get to the knotty problem of backslash escapes and why regexs look like a symbol salad of tomfoolery.

The basic need for quoting arises when we want to use one of the regexp special characters as a literal in the search string: for example, how do we search for "$"? We do this by preceding the metacharacter with a backslash (\). This is often called quoting or a backslash escape.

Now the more complex part. We also have to account for the fact that before the regexp engine sees the pattern, Guile itself (the reader) processes the string. Guile also considers certain characters special, for example if Guile sees \n in a string it will replace it with a newline.

This means there are two levels of escaping to do:

Escape for Guile's reader
Escape for the regexp engine

An obvious special character that needs to be escaped is the quotation mark since Guile uses this to define a string. Imagine that we have a string like this:

"Bob said, "Yes, we need more eggs" and reached for the box."

Since we have quotes inside the string we need to escape them:

;; escape the inner quotes so that Guile's reader isn't confused by them
"Bob said, \"Yes, we need more eggs\" and reached for the box."

Practically, the Scheme code will look like this:

1 ;; guix shell guile guile-readline guile-colorized -- guix repl
2 => (use-modules (ice-9 regex))
3 
4 => (define dialog "Bob said, \"Yes, we need more eggs\" and reached for the box.")
5 
6 => (string-match "Bob said, \"Yes, we need more eggs\" and" dialog)
7 $1 = #("Bob said, \"Yes, we need more eggs\" and reached for the box." (0 . 38))

The next wrinkle is that the backslash itself is used by Guile Scheme in strings for things like newline (\n). So to use backslash to escape something in the regex we have to escape the backslash itself! If we had this:

;; A string with one special regex character in it
"Using exponent notation 3^3 means"

;; step 1: We escape the regex character (this is what the regexp engine needs to see)
"Using exponent notation 3\^3 means"

;; step 2: Then to escape the backslash itself for Guile's reader we add another one
"Using exponent notation 3\\^3 means"

The regexp engine treats '^' as an anchor, so we escape it with a backslash, which is what we do in the first step. However, as backslash is a special character for Guile's reader we have to escape it with its own backslash.
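A quick way to check what the regexp engine actually receives is to look at the string after the reader has finished with it. In this sketch (the name caret-pattern is mine) string-length and display show that the two backslashes in the source are a single character in memory:

```scheme
(use-modules (ice-9 regex))

;; caret-pattern is a name chosen for this sketch
(define caret-pattern "3\\^3")

;; Four characters in memory: 3, \, ^ and 3
(string-length caret-pattern)  ; => 4

;; display shows the string as the regexp engine will see it
(display caret-pattern)        ; prints 3\^3
(newline)

(define caret-match
  (string-match caret-pattern "Using exponent notation 3^3 means"))

(match:substring caret-match)  ; => "3^3"
(match:start caret-match)      ; => 24
```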

This leads to a more complex situation - what if we want to match some text with a backslash in it. For example, imagine that part of a string is "\heading" and we want to match on it:

;; the text we want to search contains a literal \heading
;;   \heading Bob and Mary go fishing

;; the backslash has to be escaped so that Guile's reader keeps it
(define search-string "\\heading Bob and Mary go fishing")

;; 1. escape the backslash for the regexp engine:
;;    to match \heading the engine needs to see ^\\heading*

;; 2. escape each of those backslashes for Guile's reader:
;;    \\ becomes \\\\ in the string literal
(string-match "^\\\\heading*" search-string)
#("\\heading Bob and Mary go fishing" (0 . 8))

As we can see, in the first step we escape the backslash in "\heading" with another backslash, so we end up with "\\heading". Then in the second step we have to prevent Guile's reader from handling them as special characters, so each backslash receives a further backslash - leading to a total of four!
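Counting characters makes the two levels a bit less mysterious. This sketch (the names heading-text and heading-pattern are invented) shows that the four backslashes in the source become two in memory, which the regexp engine then reads as one literal backslash:

```scheme
(use-modules (ice-9 regex))

;; heading-text and heading-pattern are names chosen for this sketch
(define heading-text "\\heading Bob and Mary go fishing")
(define heading-pattern "^\\\\heading")

;; The reader turns each \\ in the source into a single backslash
(string-length "\\heading")      ; => 8
(string-length heading-pattern)  ; => 10: ^, two backslashes, "heading"

(display heading-pattern)        ; prints ^\\heading
(newline)

(define heading-match (string-match heading-pattern heading-text))
(match:end heading-match)        ; => 8
(string-length (match:substring heading-match))  ; => 8, one backslash plus "heading"
```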

And, now you know why a lot of regexp searches look so wild!

Backslash escaping: regexp-quote procedure

The regexp-quote procedure will quote each special character found in str with a backslash:

(regexp-quote str)

If you've already escaped the string to get past Guile's evaluation of backslashes, then this function can be useful to convert all special characters to be escaped for the regexp engine.

=> (regexp-quote "\\heading")
"\\\\heading"

But it's of limited use on a whole pattern because it escapes every special character, including the ones you want to keep as metacharacters. So you need to apply it only to the part of the regexp that should be matched literally.
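One way to do that is to quote just the literal text and join it to the rest of the pattern with string-append. A minimal sketch, with a made-up price string and the invented name price-rx:

```scheme
(use-modules (ice-9 regex))

;; Only "$5.00" is quoted. The optional group around " each" is left
;; alone so that it keeps its regexp meaning.
(define price-rx
  (string-append "costs " (regexp-quote "$5.00") "( each)?"))

(match:substring (string-match price-rx "It costs $5.00 each"))
;; => "costs $5.00 each"

;; The quoted "." only matches a real dot now, so this fails
(string-match price-rx "It costs $5x00 each")  ; => #f
```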

Guile Regexp resources

There are a lot of good resources on regular expressions, but not that many on Guile or Scheme's use of them generally. Here's what I used:

Guile Manual: Regular Expressions
The manual covers the functions, using the match structure and explains backslash escaping.

Computerphile: introduction to Regular expressions
This is a fun video where Professor Brailsford introduces the history of regular expressions and the concepts.

Regex 101
Online regex parser, it's useful for seeing how something can be parsed in other languages.

Regular-expressions.info's tutorial
This is a very extensive tutorial, bear in mind that parts won't match Guile's implementation.

Rexegg - regular expressions tutorial
Similar extensive tutorial and site with lots of materials.

Grep manual
This covers GNU's implementation of ERE and GNU specific extensions.

Javascript and regular expression centric site
Slevithan's site is worth deep-diving if you'd like to know more about some key concepts.

Everything Regexp!

With this third post we've covered all the main points of Guile Scheme's POSIX Extended Regular Expressions (ERE) from the perspective of packaging in Guix. Parts we've missed out are primarily around handling a stream (e.g. a file) where a string might have multiple new lines within it and how you do regexs for that. We also haven't covered all the procedures, so check out the Guile manual as there are other options.

For Guix the main way we use regexs is to alter text, replacing (for example) part of a library string in a source with the version that's in Guix. We'll cover that in a future post.

I'd love to know if you found this post series a useful introduction to Guile Scheme's regex, is there anything that I missed out or you think needs clarification? Email me (details on the About page), or I can be reached on Mastodon, futurile@mastodon.social.

Posted in Tech Sunday 11 May 2025
Tagged with tech guix guile scheme
