https://buttondown.email/hillelwayne/archive/why-do-regexes-use-and-as-line-anchors/
Last week I fell into a bit of a rabbit hole: why do regular expressions use $ and ^ as line anchors?1
This talk brings up that they first appeared in Ken Thompson's port of the QED text editor. In his manual he writes: b) "^" is a regular expression which matches character at the beginning of a line.
c) "$" is a regular expression which matches character before the character (usually at the end of a line)
QED was the precursor to ed, which was instrumental in popularizing regexes, so a lot of its design choices stuck.
Okay, but then why did Ken Thompson choose those characters?
(Score: 2, Interesting) by Anonymous Coward on Thursday March 28 2024, @04:28AM (1 child)
A deterministic finite automaton (DFA) is in some sense simpler to understand than a regular expression and are probably a good way to introduce the concept of a regular language. However, regular expressions (in the formal language theory sense: with alternation, concatenation and Kleene closure operators) only correspond nicely to nondeterministic finite automata (NFA). Of course any NFA can be converted to an equivalent DFA but then it won't look much like the original regular expression anymore. I'm not really sure if introducing the concept of nondeterministic automata to someone struggling to understand regular expressions is really all that helpful -- It sounds a bit like the burrito effect [wordpress.com].
But more importantly, Unix regexes bear only a passing resemblance to their formal language cousins and they have operators which provide substantially more computational power than what is possible with an NFA. Here's a particularly extreme example: the following grep command matches any line whose length is a composite number greater than 1:
grep '^\(...*\)\1\1*$'
This regex is simply impossible to understand as a DFA (or NFA). I don't know if backreferences were supported in Thompson's original implementation but this works in UNIX V7 grep (ca. 1979). Furthermore, not even pushdown automata (equivalent in expressive power to general context-free grammars) can do this. Maybe you can do it with a linearly-bounded automaton (corresponding to some length-increasing grammar) but I did not attempt to prove that.
(Score: 3, Interesting) by krishnoid on Thursday March 28 2024, @09:59PM
They don't correspond directly, but understanding a DFA can help understand the shorthand for automata-theoretic regular expressions (Kleene star, et al). That makes it easier to extend DFAs to NFAs, and then add on the additional concepts used in actual programming-language regexes.
For example, DFAs don't have additional storage; the DFA is pretty much just the state machine description and the current state, with no knowledge of previous input. Parentheses and backreferences within a regex require additional storage, which are better understood when considered as a useful programming language extension, to the mathematically-pure DFA and corresponding regular expressions.