Altering the source code of a package is a very common action when packaging something for Guix. To do that we often search through a source file using a regular expression, and then substitute some replacement text. It's so common that Guix has a special substitute* function for this. And it's used over 9500 times in the Guix repository! To use it well we need to understand Guile's POSIX Extended Regular Expressions, which is what we're going to spend some time on. As it's such a complex topic, it's spread over three posts.
This is part 3 in a series called A side (aside?) of Guile, introducing the Guile Scheme language in the context of using it for packaging in Guix.
Guile uses POSIX Extended Regular Expressions (often shortened to ERE) with GNU extensions. Regexp implementations vary across programming languages; the main missing element of Guile's implementation is that quantifiers are greedy and there's no easy way to make them lazy. All the details are covered in part 5.
These posts are written from the perspective of using regexps for packaging in Guix, and we'll cover some specific Guix capabilities in a later post. If you're using regular expressions in a general Guile Scheme program then see the manual for some functions that I'm not covering. If the built-in capabilities aren't sufficient then Scheme has more extensive regexp capabilities in SRFI-115: Scheme Regexp but Guile Scheme doesn't implement it. There's also IrRegular Expressions which is a portable library that could be used.
All the functions and match structures for regular expressions are in the (ice-9 regex) module. To access them from inside a REPL do:
$ guix shell guile guile-readline guile-colorized -- guix repl
Then in the REPL we can do:

    1 => (use-modules (ice-9 regex))
    2
    3 ;; a successful search
    4 => (string-match "test" "A test of the interface")
    5 $1 = #("A test of the interface" (2 . 6))
    6
    7 ;; a failing search
    8 => (string-match "fire" "Where there's smoke, there's a conflagration")
    9 $1 = #f

The string-match procedure accepts a pattern ("test") as the first parameter and searches for it in the supplied string ("A test of the interface"). Technically it compiles the pattern into a regexp structure and uses that against the supplied string.
If it's successful it returns a match structure which describes what was matched by the regular expression. In this case (line 5) it's showing that the match starts at character 2 and ends just before character 6. If there was no match it returns #f, which is the second example (line 8).
As the printed form on line 5 shows, a match structure always holds the string that was searched and the start and end positions of the match.
There are various fun things you can do with the returned match structure. For example, to see the actual matched text use the match:substring procedure:

    => (match:substring (string-match "test" "a test of the interface"))
    $1 = "test"
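
Beyond match:substring, the (ice-9 regex) module has a handful of other accessors for a match structure. This is a small sketch of the ones I'd expect to be most useful; the variable name m is just for illustration.

```scheme
(use-modules (ice-9 regex))

;; Keep the match structure so we can query it several times.
(define m (string-match "test" "A test of the interface"))

(match:start m)    ; → 2, index of the first matched character
(match:end m)      ; → 6, index just past the last matched character
(match:prefix m)   ; → "A ", the text before the match
(match:suffix m)   ; → " of the interface", the text after the match
(match:string m)   ; → "A test of the interface", the searched string
```

Note that the end position is exclusive, which is why (2 . 6) covers the four characters of "test".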
We often want a regexp to match multiple times. For this we can use the list-matches procedure, which returns a list of non-overlapping match structures:

    => (list-matches "test" "A test, for testers, of the testing example")
    (#("A test, for testers, of the testing example" (2 . 6))
     #("A test, for testers, of the testing example" (12 . 16))
     #("A test, for testers, of the testing example" (28 . 32)))

The list that's returned contains three match structures, one per match. To access the matched text, map match:substring over it:

    => (map match:substring (list-matches "test" "A test, for testers, of the testing example"))
    $1 = ("test" "test" "test")
It's also possible to use a subexpression to extract portions of a match into a group; those submatches are added to the match structure. Grouping and the functions for accessing submatches are covered in part 5.
There are lots of specific procedures to create regexps (e.g. make-regexp), substitute text (e.g. regexp-substitute) and do complex operations with matches (e.g. fold-matches) which are covered in the manual. However, Guix has its own substitution function, so this post is going to focus on these few simple procedures and then a later post will cover the specific Guix functions.
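
For a quick taste of those other procedures, here's a minimal sketch. The names mur-rx and text are mine, and I'm using regexp-substitute/global (the "replace every match" sibling of regexp-substitute) as it's the easier one to show in a single line.

```scheme
(use-modules (ice-9 regex))

;; Compile the pattern once so it can be reused.
(define mur-rx (make-regexp "mur"))
(define text "A murmuration of starlings murmured in the murky sky")

;; regexp-exec is the compiled counterpart of string-match.
(match:start (regexp-exec mur-rx text))                       ; → 2

;; fold-matches calls the procedure with each match and an accumulator;
;; here it simply counts the matches.
(fold-matches mur-rx text 0 (lambda (m count) (+ count 1)))   ; → 5

;; With #f as the port the result is returned as a new string.
;; 'pre and 'post stand for the text before and after each match.
(regexp-substitute/global #f "cat" "concatenate" 'pre "dog" 'post)
;; → "condogenate"
```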
The way that matching works in Guile's implementation is that the patterns and input strings are converted to the locale's encoding and then passed to the C library's regular expression routines. Note that the returned match structure always points to characters in the strings (not to individual bytes): one impact is that using characters that cannot be represented in the locale encoding can lead to "surprising results" as the manual puts it.
Also, nul characters (#\nul) cannot be used in regex patterns or input strings since the C functions treat that as the end of a string.
For general searching Guile's regexp can operate over multiple lines of text, again see the manual for more details on this.
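
As a small illustration of the line-based behaviour, this sketch uses the regexp/newline flag that make-regexp accepts. It relies on the ^ anchor, which isn't covered until part 5, and the log variable is just made-up sample text.

```scheme
(use-modules (ice-9 regex))

(define log "error: first\nwarning: second")

;; By default ^ only matches at the very start of the string.
(string-match "^warning" log)                       ; → #f

;; With regexp/newline, ^ also matches just after a newline.
(define line-rx (make-regexp "^warning" regexp/newline))
(match:start (regexp-exec line-rx log))             ; → 13
```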
As usual a regexp can match anywhere in a string. As a human it's slightly easier to understand how the matching works if we use words in the string as if it were some text from a book - so most of my examples do this. However, regular expressions actually examine the string as a continuous stream. I always "know" this, and yet find it difficult to reason about sometimes!
Let's start by looking at some simple matches:

    1 => (list-matches "cat" "ducat")
    2 $1 = (#("ducat" (2 . 5)))
    3
    4 => (list-matches "cat" "concatenate")
    5 $1 = (#("concatenate" (3 . 6)))
    6
    7 => (list-matches "mur" "A murmuration of starlings murmured in the murky sky")
    8 $1 = (#("A murmuration of starlings murmured in the murky sky" (2 . 5))
    9 #("A murmuration of starlings murmured in the murky sky" (5 . 8))
    10 #("A murmuration of starlings murmured in the murky sky" (27 . 30))
    11 #("A murmuration of starlings murmured in the murky sky" (30 . 33))
    12 #("A murmuration of starlings murmured in the murky sky" (43 . 46)))

This first example (line 1) shows a match for 'cat' at the end of the string "ducat" (line 2). The second example (line 4) shows the match within the middle of the string and the third demonstrates multiple matches within the string (formatted for this post).
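
If all we want are the positions, we can map match:start over the list instead of reading them out of the printed structures. The sketch below also shows that literal matching is case-sensitive unless we ask otherwise with the regexp/icase flag of make-regexp.

```scheme
(use-modules (ice-9 regex))

;; Just the start position of every match.
(map match:start
     (list-matches "mur" "A murmuration of starlings murmured in the murky sky"))
;; → (2 5 27 30 43)

;; Literal matches are case-sensitive by default.
(string-match "Cat" "concatenate")                      ; → #f

;; regexp/icase makes the compiled regexp ignore case.
(match:substring
 (regexp-exec (make-regexp "Cat" regexp/icase) "concatenate"))
;; → "cat"
```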
Often, we're searching strings for numbers or other Unicode characters:

    1 => (list-matches "194" "190484501132848557101194987939279746462")
    2 $1 = (#("190484501132848557101194987939279746462" (21 . 24)))
    3
    4 => (string-match "[$\u2713]" "Learn Guile regexp: ✓")
    5 $1 = #("Learn Guile regexp: ✓" (20 . 21))
    6
    7
    8 => (string-match "Attendees 45" "Attendees\t45")
    9 $1 = #f
    10
    11 => (string-match "Attendees\t45" "Attendees\t45")
    12 $1 = #("Attendees\t45" (0 . 12))

The first search (line 1) is for the number 194 within the long string of digits. The second example (line 4) searches for a Unicode character (a check mark), written with the \u2713 escape in Guile's string syntax. Lines 8 and 11 form one example: they show that whitespace has to be matched exactly, in this case a tab character within the text rather than a space.
The special characters provide ways to create a pattern that matches the text more closely. Creating these patterns is the magic of regex!
Metacharacter | Explanation |
---|---|
^ | Caret matches the start of the search string. In line-based matching it matches the start position of any line. A type of string anchor. |
. | Dot matches any single character except the null character. Within a bracket expression it matches a literal dot. A wildcard. |
[ ] | A bracket expression. Matches a single character that is contained within the brackets. If a hyphen is used it matches a range ([a-z] matches a through z). These can be mixed so [a-cx-z] matches a,b,c and x,y,z. The hyphen is treated as literal if it's last, or first (after the ^ if present), e.g. [abc-] [^-abc]. Backslash escapes are not allowed. The ] character can be included in a bracket expression if it's the first character (after the ^ if present), e.g. []abc] |
[^ ] | Matches a single character that is not within the brackets. [^abc] matches anything other than 'a', 'b' or 'c'. |
$ | Matches the ending position of the string, or the position just before a string-ending newline. A type of string anchor. |
( ) | Defines a marked subexpression, also called a capturing group. Guile doesn't have any non-capturing groups. |
\n | Matches what the nth marked subexpression matches. A back-reference. |
* | Asterisk matches the preceding element zero or more times. So ab*c matches ac, abc, abbbc. Guile's version is GREEDY. A type of quantifier. |
+ | Plus sign indicates that the regular expression should match one or more occurrences of the previous element. So ab+c matches abc abbc abbbc ... but not ac. Guile's version is GREEDY. A type of quantifier. |
? | Question mark indicates that the regular expression should match zero or one occurrences of the previous atom or regexp. So ab?c matches only ac or abc. Guile's version is GREEDY. A type of quantifier. |
{m,n} | Matches the preceding element at least m and not more than n times. So a{2,3} matches aa, aaa only. A type of quantifier, called an 'interval expression' by the Grep manual: the most flexible format. |
\| | The vertical bar is the alternation or set union operator. It matches either the expression before or after the operator. |
\ | Backslash escapes a metacharacter so that it matches literally, e.g. a literal dot, + or ?. Escaping is a complex topic due to interaction with Guile itself, see further details in part 5. |
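
To make the table a bit more concrete, here's a rough sketch that runs a few of these metacharacters against short strings. The all-matches helper is a name I've made up for this example; it just collects the text of every match.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper: return the text of every match as a list.
(define (all-matches pattern str)
  (map match:substring (list-matches pattern str)))

(all-matches "c.t" "cat cot cut ct")            ; → ("cat" "cot" "cut")
(all-matches "ab*c" "ac abc abbbc adc")         ; → ("ac" "abc" "abbbc")
(all-matches "ab+c" "ac abc abbbc")             ; → ("abc" "abbbc")
(all-matches "colou?r" "color colour colouur")  ; → ("color" "colour")
(all-matches "a{2,3}" "a aa aaa aaaa")          ; → ("aa" "aaa" "aaa")
(all-matches "cat|dog" "hotdog, bobcat")        ; → ("dog" "cat")

;; The backslash is doubled because Guile's reader also uses it.
(all-matches "1\\+1" "1+1=2")                   ; → ("1+1")

;; Greedy: .* grabs as much as it can, so we get one long match.
(all-matches "a.*c" "abcabc")                   ; → ("abcabc")
```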
The rest of this post covers character sets, part 4 explains quantifiers, and part 5 deals with anchors, capturing groups and escaping.
The square brackets allow us to match one of a set of characters. For example, if we want to find words that start with particular letters:

    1 ;; Example 1: find 'h'at or 'c'at
    2 => (string-match "[hc]at" "hat")
    3 $1 = #("hat" (0 . 3))
    4 => (string-match "[hc]at" "cat")
    5 $1 = #("cat" (0 . 3))
    6
    7 ;; Doesn't match on mat as 'm' isn't in the set
    8 => (string-match "[hc]at" "mat")
    9 $1 = #f
    10
    11 ;; [sz] matches apologise or apologize
    12 => (string-match "apologi[sz]e" "we apologize")

In the first example (line 1) we're looking for any string which has "hat" or "cat": it matches both, but fails on "mat" since that doesn't start with 'h' or 'c'. A common use for this sort of character set is alternative spellings (e.g. apologise), see example 2 (line 11).
We can also match multiple times within the string:

    ;; [hc]at matches hat or cat, but not sat, late or mat
    => (map match:substring (list-matches "[hc]at" "The cat in the hat, sat late on the mat"))
    ("cat" "hat")

In addition to specifying specific letters, we can specify a range of letters or numbers by using a hyphen. Imagine that we want to find anything in the range "b-f" followed by "at", which would cover words like bat and cat, but not hat, late or mat:

    ;; [b-f]at matches anything that starts with b, c, d, e or f
    => (map match:substring (list-matches "[b-f]at" "The cat in hat, right next to the bat, sat late on the mat"))
    ("cat" "bat")

It's particularly useful when matching a range of numbers.

    => (string-match "[$\u20AC][0-9][0-9]" "€50 Misc")
    #("€50 Misc" (0 . 3))

Recall from earlier that Guile's regexp engine can match any Unicode character, so the first set of square brackets, [$\u20AC], matches either a dollar sign or a euro sign. The second and third character sets each match any digit with [0-9].
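
Here's a short sketch of the other bracket expression rules from the metacharacter table: mixing two ranges in one set, a literal hyphen placed last, and a dot that is literal inside brackets. The sample strings are made up.

```scheme
(use-modules (ice-9 regex))

;; Two ranges in one set: each set matches one lowercase hex digit.
(map match:substring (list-matches "[0-9a-f][0-9a-f]" "ff 7g 0a"))
;; → ("ff" "0a")

;; A hyphen placed last in the set is a literal hyphen.
(match:substring (string-match "[0-9-]+" "released 2024-05-17"))
;; → "2024-05-17"

;; Inside brackets a dot is a literal dot, not "any character".
(match:substring (string-match "[0-9][.][0-9]" "3x0 then 3.0"))
;; → "3.0"
```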
The caret symbol is used to negate a character set, meaning it searches for everything that is not in the set.

    => (map match:substring (list-matches "[^h]at" "The cat in the hat, sat late on the mat"))
    ("cat" "sat" "lat" "mat")

In this example, we're matching on every instance of "[character]at" as long as the character is not an 'h', therefore excluding 'hat' from the result.

    ;; don't find words starting with b or c
    => (map match:substring (list-matches "\\<[^bc]\\w+" "bat cat battle cattle saddle basalt vassal"))
    ("saddle" "vassal")

This regexp looks a bit strange because we haven't covered the backslash shortcuts yet. But, for now the \\< matches on the start of a word, then the [^bc] part matches anything that is NOT 'b' or 'c'. Finally the \\w+ matches the rest of the word. The result is that we're finding words that don't begin with b or c.
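
A negated set followed by + is a handy way to grab "everything up to a delimiter". This is only a sketch with made-up input, but it's a pattern that comes up a lot when picking fields out of a line.

```scheme
(use-modules (ice-9 regex))

;; One or more characters that are not a comma: the comma-separated fields.
(map match:substring (list-matches "[^,]+" "421,brady,55,sarah"))
;; → ("421" "brady" "55" "sarah")

;; Runs of characters that are neither digits nor spaces.
(map match:substring (list-matches "[^0-9 ]+" "room 101 floor 3"))
;; → ("room" "floor")
```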
The POSIX character classes enable us to search for different types of characters and punctuation. Each class name is written between square brackets and colons, so for example [:blank:] is for a space or a tab. To use them they have to be placed inside a second pair of square brackets, but this isn't the same usage as a character set.

    1 => (string-match "[[:space:]]debate[[:blank:]]" "combat bat debate concatenate cat hat")
    2 #("combat bat debate concatenate cat hat" (10 . 18))
    3
    4 => (match:substring (string-match "[[:space:]]debate[[:blank:]]" "combat bat debate concatenate cat hat"))
    5 " debate "

The first example (line 1) starts by searching for a space with [[:space:]] then it looks for debate and finally any blank character by using [[:blank:]]. As we can see from line 4, when we retrieve the result using match:substring, it's finding both the word and the space around it.
The full list of POSIX character classes is:
Class | Explanation |
---|---|
[:alnum:] | Alphanumeric characters: letters (a-z, A-Z) and digits (0-9). The underscore is not included. |
[:alpha:] | Alphabetic characters |
[:blank:] | Space or tab |
[:cntrl:] | Control characters |
[:digit:] | Digits |
[:graph:] | Visible characters |
[:lower:] | Lowercase characters |
[:print:] | Visible characters and the space character |
[:punct:] | Punctuation characters, such as ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { } ~ and the vertical bar |
[:space:] | Whitespace characters, including spaces, tabs, new lines, carriage returns etc. |
[:upper:] | Uppercase letters |
[:xdigit:] | Hexadecimal digits: 0-9, a-f and A-F |
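
A few of these classes in action, as a minimal sketch with made-up strings. The last example shows that more than one class can sit inside the same outer brackets. I'm sticking to ASCII text here, since what counts as a letter depends on the locale.

```scheme
(use-modules (ice-9 regex))

;; A capital letter followed by one or more lowercase letters.
(map match:substring
     (list-matches "[[:upper:]][[:lower:]]+" "Guile and Guix use Scheme"))
;; → ("Guile" "Guix" "Scheme")

;; Every punctuation character.
(map match:substring (list-matches "[[:punct:]]" "Hello, world!"))
;; → ("," "!")

;; Two classes in one bracket expression: uppercase letters or digits.
(map match:substring (list-matches "[[:upper:][:digit:]]+" "GNU Guile 3"))
;; → ("GNU" "G" "3")
```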
For another example let's say we have a string with three-digit numbers in it and we want to return those numbers:

    => (map match:substring (list-matches "[[:digit:]][[:digit:]][[:digit:]]" "421,brady 55,sarah 9099,leah"))
    ("421" "909")

    ;; can also be shortened to this
    => (map match:substring (list-matches "[[:digit:]]{3}" "421,brady 55,sarah 9099,leah"))

Here are two ways that we can achieve this. The first shows using three character classes to match on the numbers. The second shortens the repetition of the search pattern by using quantification {3} - as you can imagine this specifies that [[:digit:]] is matched three times. We'll cover all the quantification metacharacters in the next post.
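
One thing to notice in that result is that "909" was pulled out of the four-digit number 9099. If we only want numbers that are exactly three digits long, one possible approach is to wrap the pattern in the \\< and \\> word shortcuts (GNU extensions, described in the table further on). Treat this as a sketch rather than the only way to do it.

```scheme
(use-modules (ice-9 regex))

;; Three digits anywhere, including inside a longer number.
(map match:substring
     (list-matches "[[:digit:]]{3}" "421,brady 55,sarah 9099,leah"))
;; → ("421" "909")

;; Three digits that form a whole word on their own.
(map match:substring
     (list-matches "\\<[[:digit:]]{3}\\>" "421,brady 55,sarah 9099,leah"))
;; → ("421")
```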
There is a set of special shortcuts that can be used for the character classes. These all start with a backslash, and as Guile's reader treats backslashes as special characters we have to escape them with a second backslash.
Backslash | Explanation |
---|---|
\b | Match the empty string at the edge of a word |
\B | Match the empty string provided it's not the edge of a word |
\< | Match the empty string at the beginning of a word |
\> | Match the empty string at the end of a word |
\w | Matches a character within a word, synonym for [_[:alnum:]] which means all a-z characters, 0-9 and the underscore. |
\W | Matches a character which is NOT within a word |
\s | Match whitespace, a synonym for [[:space:]] |
\S | Match non-whitespace, a synonym for [^[:space:]] |

    ;; find a word which starts with sal
    => (map match:substring (list-matches "\\bsal" "salve salivate basalt vassal "))
    ("sal" "sal")

In this example the \\b is used to match on the empty string at the start of a word, so it explicitly checks that the next character is alphanumeric. It then matches if the next characters are 'sal'. As we can see it's two backslashes, as one of them is to escape the Guile reader's treatment of backslashes.
Commonly, we want to actually find a whole word. For this we can use the \w backslash expression:

    1 ;; match whole words that start with "sal"
    2 => (map match:substring (list-matches "\\bsal\\w+" "salve salivate basalt vassal "))
    3 ("salve" "salivate")
    4
    5 ;; match the end of a word - then find the rest of the word
    6 => (map match:substring (list-matches "\\w+sal\\>" "salve salivate basalt vassal"))
    7 ("vassal")

In the first case (line 2) it matches the start of a word using the \\b shortcut, and checks that 'sal' follows it. The next backslash shortcut is \\w, which matches any word character (a-z, 0-9 and underscore). Finally the plus sign is a quantifier which tells it to match that one or more times, which takes it to the end of the word. The match ends when it hits the next space character, because a space is not a word character.
The second one shows using the backslash shortcut to match the end of a word, so it's a bit easier to read it from right to left. It finds the end of a word with \\>, before that the characters 'sal' have to be there, and then before that more alphanumeric characters (using \\w+ again).
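
The remaining shortcuts from the table work the same way. This is a small sketch with made-up strings; as these are GNU extensions, I'm assuming the GNU C library that Guix uses.

```scheme
(use-modules (ice-9 regex))

;; \S+ matches runs of non-whitespace, which splits text on spaces and tabs.
(map match:substring (list-matches "\\S+" "salve  salivate\tbasalt"))
;; → ("salve" "salivate" "basalt")

;; \B is the opposite of \b: 'sal' only when it is NOT at the edge of a word.
(map match:start (list-matches "\\Bsal" "salve basalt vassal"))
;; → (8 16)
```

The second result gives the positions of 'sal' inside basalt and vassal; the 'sal' at the start of salve is skipped because it sits on a word edge.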
It's often possible to shorten the regexp by using quantifiers to repeat character classes.

    ;; find the numbers 0-9 - uses + quantifier to find it 1 or more times
    => (map match:substring (list-matches "[0-9]+" "€50 Misc 501 Tickets"))
    ("50" "501")

In this example it uses the character set [0-9] to find a digit, and the + is a quantifier that specifies that a digit has to be there one or more times. Consequently, it would match '5', '50', '501' or a number of any length.
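
Since match:substring always returns strings, a common follow-up is to convert the matches into real numbers. Here's a minimal sketch; the line variable and its contents are invented for the example.

```scheme
(use-modules (ice-9 regex))

(define line "$50 Misc $501 Tickets")

;; A dollar sign (literal inside brackets) followed by one or more digits.
(map match:substring (list-matches "[$][0-9]+" line))
;; → ("$50" "$501")

;; Just the digits, converted to numbers and added up.
(define amounts
  (map (lambda (m) (string->number (match:substring m)))
       (list-matches "[0-9]+" line)))

amounts             ; → (50 501)
(apply + amounts)   ; → 551
```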
There are also 'Equivalence classes' and 'collating symbols' which I'm not covering as I don't use them, or understand them!
That's it for now - we've covered all the basics of Guile's POSIX Extended Regular Expressions (ERE) implementation. We can do literal matching and we've seen all the metacharacters that Guile's regex engine understands. We explored character sets, POSIX character classes and the special backslash shortcuts.
In the next post we'll dive into quantifiers so that we can specify precisely how many times something will match.
I'd love to know if you found this post a useful introduction to Guile Scheme's regex - or maybe you have a nice example of using regex? Email me (details on the About page), or I can be reached on Mastodon, futurile@mastodon.social.