Writing the Regexp


Prev		Next

CSS3 selectors are complex. For example, blockquote > div p^[31], div.stub *:not(:lang(fr))^[32], *|*[a|foo~="bar"], *|*[|class~="bar"]^[33], and stub ~ [|attribute^=start]:not([|attribute~=mid])[|attribute*=dle][|attribute$=end] ~ t^[34] are all valid CSS3 selectors. And while these are probably somewhat complicated for real-life applications, they are simple compared to what a CSS3 selector could be.

So how does one write a regular expression for something this complex? The answer, of course, is rather than trying to write the regular expression directly, you write a program to generate the regular expression. I have used this approach in the past, finding that it is generally not too difficult to manually convert a small EBNF grammar or other set of formal rules into a small program to generate a corresponding regular expression.^[35] Typically each non-terminal becomes a variable, defined in terms of constants (for the terminals) and the variables that have been defined so far (for the non-terminals).

As a trivial example, Example 2, “A Perl program that generates a regular expression” is a small Perl program that generates a POSIX extended regular expression that matches an integer, as defined by the EBNF provided in the Wikipedia page on EBNF.^[36]

Example 2. A Perl program that generates a regular expression

#!/usr/bin/env perl
#
# Copyleft 2019 Syd Bauman and Northeastern University Digital
# Scholarship Group.
# 
# No parameters; reads no input. Writes out a regular expression
# that matches an integer, where integer is defined by the EBNF
# in https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form:
# | digit excluding zero = "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
# | digit                = "0" | digit excluding zero ;
# | natural number = digit excluding zero, { digit } ;
# | integer = "0" | [ "-" ], natural number ;
# The resulting regexp is intended to be a POSIX ERE, but would
# also work as a PCRE or a W3C regular expression, and probably
# lots of others. (But not a POSIX BRE or an Emacs LISP regexp.)

$digit_sans_zero = "(1|2|3|4|5|6|7|8|9)"; # could be just "[1-9]" :-)
$digit = "(0|$digit_sans_zero)";
$natural_number = "($digit_sans_zero($digit)*)";
$integer = "(0|(-?$natural_number))";

print STDOUT "$integer\n";
exit 0;

While I am sure there has been much written on this general approach,^[37] I was not looking for general-purpose (regular) grammar to regular expression conversion, I was just looking to convert a particular grammar to a regular expression.

^[31]Nondeterministic matching of descendant and child combinators

^[32]NEGATED :lang() pseudo-class

^[33]Attribute space-separated value selector with declared namespace

^[34]Dynamic handling of attribute selectors

^[35]Of course, since an EBNF grammar can represent any context-free language (Chomsky Type 2), there are some EBNFs that cannot be represented by a regular language (Chomsky Type 3), although some regular expression languages (e.g., PCRE) have extensions that allow them to represent any context-free grammar.

^[36]Readers who are well versed in PCRE will know that the EBNF can be represented directly in the regular expression, e.g.:

(?(DEFINE)
          (?<digit_sans_zero> (1|2|3|4|5|6|7|8|9)           )
          (?<digit>           (0|(?&digit_sans_zero))       )
          (?<natural_number>  (?&digit_sans_zero)(?&digit)* )
          (?<integer>         (0|(-?(?&natural_number)))    )
          )^(?&integer)$

While this is impressive, and very useful in its own right, it is not helpful to me here as I am interested in generating a W3C regular expression, not in using PCRE.

^[37]A quick web search demonstrates that discussions of this topic come in all flavors, from stackoverflow posts to class slides, to full academic papers. See, e.g. [FA2RE], [TPoRE], [REBNF] [NFA2RE], [DFA2RE], and [REGGE].