Stories
Slash Boxes
Comments

SoylentNews is people

posted by janrinok on Wednesday March 27, @08:12PM   Printer-friendly
from the I-didn't-know-that-... dept.

https://buttondown.email/hillelwayne/archive/why-do-regexes-use-and-as-line-anchors/

Last week I fell into a bit of a rabbit hole: why do regular expressions use $ and ^ as line anchors?1

This talk brings up that they first appeared in Ken Thompson's port of the QED text editor. In his manual he writes: b) "^" is a regular expression which matches character at the beginning of a line.

c) "$" is a regular expression which matches character before the character (usually at the end of a line)

QED was the precursor to ed, which was instrumental in popularizing regexes, so a lot of its design choices stuck.

Okay, but then why did Ken Thompson choose those characters?


Original Submission

 
This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
Display Options Threshold/Breakthrough Mark All as Read Mark All as Unread
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 1) by shrewdsheep on Thursday March 28, @10:04AM (1 child)

    by shrewdsheep (5215) on Thursday March 28, @10:04AM (#1350652)

    You cannot. Regular expressions are not Turing complete, you can parse only regular languages with them.

    By most definitions, you cannot even do that. Regular expression can only tokenize streams which is their use in many grammars (regular expressions do not have memory and cannot match opening/closing parentheses, for example, at least not to arbitrary levels). The concept of being turing complete is orthogonal to the Chomsky hierarchy you mention.

  • (Score: 3, Informative) by Anonymous Coward on Thursday March 28, @03:22PM

    by Anonymous Coward on Thursday March 28, @03:22PM (#1350698)

    you can parse only regular languages with them.

    By most definitions, you cannot even do that. Regular expression can only tokenize streams which is their use in many grammars (regular expressions do not have memory and cannot match opening/closing parentheses,

    In formal language theory, "parse" simply means "for a language (set of strings) L and a string w, determine whether or not w is a member the set L". A regular language is simply one that can be parsed by a deterministic finite automaton (DFA), which is exactly the set of languages that can be described by regular expressions with the alternation, concatenation and kleene closure operators.

    Languages with balanced parentheses are trivially shown to be not regular if the parentheses can be nested to arbitrary depth. So the fact that regular expressions cannot match arbitrary nesting depths of open/close parentheses does not contradict GP's point, as such a language is not a regular language. On the other hand, balanced parentheses with a finite maximum nesting depth is regular (basically, the states of a DFA can be used to count things but only to a finite bound).

    Nevertheless, if you're talking about real-world regex implementations on computers then the GP's statement is not true, because real-world regex implementations have different operators and are usually much more computationally powerful than a DFA. For example, perl-compatible regexes have recursive match operators and absolutely can match balanced parentheses:

    pcregrep '^([(](?1)*[)])*$' <<'EOF'
    ()
    (())
    ((()())())()
    (((())())()
    EOF

    prints only lines with balanced parentheses:

    ()
    (())
    ((()())())()

    .NET regexes have an explicit "balancing group" operator which can be used for this sort of thing. It is also apparently possible with Java regexes [drregex.com] which have neither recursion nor balancing groups, but do have lookahead and forward reference operators.