CSS3 selectors are complex. For example,
blockquote > div p
[31],
div.stub *:not(:lang(fr))
[32],
*|*[a|foo~="bar"], *|*[|class~="bar"]
[33], and
stub ~ [|attribute^=start]:not([|attribute~=mid])[|attribute*=dle][|attribute$=end] ~ t
[34]
are all valid CSS3 selectors. And while these are probably somewhat complicated for
real-life applications, they are simple compared to what a CSS3 selector could be.
So how does one write a regular expression for something this complex? The answer, of course, is that rather than trying to write the regular expression directly, you write a program to generate it. I have used this approach in the past, and have found that it is generally not too difficult to manually convert a small EBNF grammar or other set of formal rules into a small program that generates a corresponding regular expression.[35] Typically each non-terminal becomes a variable, defined in terms of constants (for the terminals) and the variables that have been defined so far (for the non-terminals).
As a trivial example, Example 2, “A Perl program that generates a regular expression”, is a small Perl program that generates a POSIX extended regular expression matching an integer, as defined by the EBNF provided in the Wikipedia page on EBNF.[36]
Example 2. A Perl program that generates a regular expression
#!/usr/bin/env perl
#
# Copyleft 2019 Syd Bauman and Northeastern University Digital
# Scholarship Group.
#
# No parameters; reads no input. Writes out a regular expression
# that matches an integer, where integer is defined by the EBNF
# in https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form:
# | digit excluding zero = "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
# | digit = "0" | digit excluding zero ;
# | natural number = digit excluding zero, { digit } ;
# | integer = "0" | [ "-" ], natural number ;
# The resulting regexp is intended to be a POSIX ERE, but would
# also work as a PCRE or a W3C regular expression, and probably
# lots of others. (But not a POSIX BRE or an Emacs LISP regexp.)

$digit_sans_zero = "(1|2|3|4|5|6|7|8|9)"; # could be just "[1-9]" :-)
$digit = "(0|$digit_sans_zero)";
$natural_number = "($digit_sans_zero($digit)*)";
$integer = "(0|(-?$natural_number))";
print STDOUT "$integer\n";
exit 0;
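One way to gain some confidence in a generated expression is to build it and immediately try it against a few strings. The following is only a minimal sketch of that idea, written in Python rather than Perl; it mirrors the variables of Example 2, and the helper name `is_integer` and the sample strings are my own assumptions for illustration, not part of the original program.

```python
import re

# Each non-terminal of the EBNF becomes a variable, built only from
# constants (terminals) and from variables that are already defined.
digit_sans_zero = "(1|2|3|4|5|6|7|8|9)"
digit = f"(0|{digit_sans_zero})"
natural_number = f"({digit_sans_zero}({digit})*)"
integer = f"(0|(-?{natural_number}))"


def is_integer(text: str) -> bool:
    """Hypothetical helper: True if the whole string matches the expression."""
    return re.fullmatch(integer, text) is not None


if __name__ == "__main__":
    print(integer)
    # Strings the EBNF accepts, followed by strings it rejects.
    for sample in ["0", "7", "-42", "1000", "007", "-0", "", "4.2"]:
        print(repr(sample), is_integer(sample))
```

The check uses `re.fullmatch` so that the whole string must match, which plays the role that the `^` and `$` anchors would otherwise play. Strings such as `007` and `-0` are rejected because the EBNF allows a leading zero only for the single digit `0`.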
While I am sure much has been written on this general approach,[37] I was not looking for a general-purpose method of converting a (regular) grammar to a regular expression; I was just looking to convert one particular grammar to a regular expression.
[35] Of course, since an EBNF grammar can represent any context-free language (Chomsky Type 2), some EBNF grammars describe languages that are not regular (Chomsky Type 3) and so cannot be represented by a true regular expression, although some regular expression languages (e.g., PCRE) have extensions that allow them to represent any context-free grammar.
[36] Readers who are well versed in PCRE will know that the EBNF can be represented directly in the regular expression, e.g.:
(?(DEFINE)
  (?<digit_sans_zero> (1|2|3|4|5|6|7|8|9) )
  (?<digit> (0|(?&digit_sans_zero)) )
  (?<natural_number> (?&digit_sans_zero)(?&digit)* )
  (?<integer> (0|(-?(?&natural_number))) )
)^(?&integer)$
While this is impressive, and very useful in its own right, it is not helpful to me here as I am interested in generating a W3C regular expression, not in using PCRE.