https://buttondown.email/hillelwayne/archive/why-do-regexes-use-and-as-line-anchors/
Last week I fell into a bit of a rabbit hole: why do regular expressions use $ and ^ as line anchors?1
This talk brings up that they first appeared in Ken Thompson's port of the QED text editor. In his manual he writes: b) "^" is a regular expression which matches character at the beginning of a line.
c) "$" is a regular expression which matches character before the character (usually at the end of a line)
QED was the precursor to ed, which was instrumental in popularizing regexes, so a lot of its design choices stuck.
Okay, but then why did Ken Thompson choose those characters?
(Score: 1) by shrewdsheep on Thursday March 28 2024, @10:04AM (1 child)
By most definitions, you cannot even do that. Regular expression can only tokenize streams which is their use in many grammars (regular expressions do not have memory and cannot match opening/closing parentheses, for example, at least not to arbitrary levels). The concept of being turing complete is orthogonal to the Chomsky hierarchy you mention.
(Score: 3, Informative) by Anonymous Coward on Thursday March 28 2024, @03:22PM
In formal language theory, "parse" simply means "for a language (set of strings) L and a string w, determine whether or not w is a member the set L". A regular language is simply one that can be parsed by a deterministic finite automaton (DFA), which is exactly the set of languages that can be described by regular expressions with the alternation, concatenation and kleene closure operators.
Languages with balanced parentheses are trivially shown to be not regular if the parentheses can be nested to arbitrary depth. So the fact that regular expressions cannot match arbitrary nesting depths of open/close parentheses does not contradict GP's point, as such a language is not a regular language. On the other hand, balanced parentheses with a finite maximum nesting depth is regular (basically, the states of a DFA can be used to count things but only to a finite bound).
Nevertheless, if you're talking about real-world regex implementations on computers then the GP's statement is not true, because real-world regex implementations have different operators and are usually much more computationally powerful than a DFA. For example, perl-compatible regexes have recursive match operators and absolutely can match balanced parentheses:
pcregrep '^([(](?1)*[)])*$' <<'EOF'
()
(())
((()())())()
(((())())()
EOF
prints only lines with balanced parentheses:
()
(())
((()())())()
.NET regexes have an explicit "balancing group" operator which can be used for this sort of thing. It is also apparently possible with Java regexes [drregex.com] which have neither recursion nor balancing groups, but do have lookahead and forward reference operators.