Stories
Slash Boxes
Comments

SoylentNews is people

SoylentNews is powered by your submissions, so send in your scoop. Only 18 submissions in the queue.
posted by janrinok on Wednesday March 27, @08:12PM   Printer-friendly
from the I-didn't-know-that-... dept.

https://buttondown.email/hillelwayne/archive/why-do-regexes-use-and-as-line-anchors/

Last week I fell into a bit of a rabbit hole: why do regular expressions use $ and ^ as line anchors?1

This talk brings up that they first appeared in Ken Thompson's port of the QED text editor. In his manual he writes: b) "^" is a regular expression which matches character at the beginning of a line.

c) "$" is a regular expression which matches character before the character (usually at the end of a line)

QED was the precursor to ed, which was instrumental in popularizing regexes, so a lot of its design choices stuck.

Okay, but then why did Ken Thompson choose those characters?


Original Submission

 
This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
Display Options Threshold/Breakthrough Mark All as Read Mark All as Unread
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 4, Funny) by Barenflimski on Wednesday March 27, @10:09PM (10 children)

    by Barenflimski (6836) on Wednesday March 27, @10:09PM (#1350566)

    Every time I have to regex, my brain turns to mush. I do it just enough to get good at it, and then I don't have to do it again for a year or more.

    With that being said, it is fairly useful for things. To think you can parse just about anything with keyboard characters I otherwise would never use, is pretty cool.

    As far as I'm concerned, every time I use Regex, I enter a rabbit hole that turns me into a robot and breaks my brain.

    Starting Score:    1  point
    Moderation   +2  
       Funny=2, Total=2
    Extra 'Funny' Modifier   0  
    Karma-Bonus Modifier   +1  

    Total Score:   4  
  • (Score: 5, Interesting) by Rosco P. Coltrane on Wednesday March 27, @10:31PM (3 children)

    by Rosco P. Coltrane (4757) on Wednesday March 27, @10:31PM (#1350573)

    Every time I have to regex, my brain turns to mush

    Regular expression are well worth learning. They can solve many problems you wouldn't think apply to them very elegantly.

    For instance, I have this software that displays information on a special display: The information is organized in pages of icons, with the two last icons being reserved for arrows to jump from page to page. The total list of icons changes regularly and I don't know how because it's fed from another process, but if the icons displayed on whatever page the user is currently on doesn't change, I don't want to disturb the display and jar the user each time the rest of the icons that aren't displayed are updated of course.

    So each time my software receives a new list of icons to display, it must try to find a sequence of icons that matches more or less what the user is currently seeing to silently update the page number the user is currently seeing rather than setting the page number back to 0 and update the display unexpectedly.

    Well, you can write a search algorithm to find the sequence of icons matching the currently displayed set of icons of course, or you can write a single regex: encode all the icons as a string of comma-separeted hashes, themselves newline-separated, with a line number at the beginning, then simply match the comma-separated list of icon hashes corresponding to the page currently displayed and recover the leading line number in the match to find the new current page number.

    In other words, one line of regex replaces an entire search function.

    Of course, that's just an example. They're also very suited for input processing - as in, one regex can match several types of lines with various different fields, validate the different lines and split the fields all in one match sequence - rather than an endless sequence of case... or if...else to process your input safely.

    I consider regular expressions - actually, regular expressionS, as there's more than one variant - a separate language any seasoned programmer should master completely. They will make your code more concise and more efficient in any language to a degree you can't even begin to suspect. And if you use vi - which uses its own slightly different variant of regular expressions, which makes total sense when you understand why - you'll become an editing speed demon too.

    Take the time to learn regular expressions: trust me on that one, it's an investment that will pay back a thousand times.

    • (Score: 2) by ChrisMaple on Thursday March 28, @03:32PM (2 children)

      by ChrisMaple (6964) on Thursday March 28, @03:32PM (#1350703)

      I use regexes occasionally, and they're a powerful time saver. However, long, complicated regexes are difficult to debug end seldom worth the effort. Reading someone else's regexes is even more difficult; like APL, it's a write-only language.

      • (Score: 2) by VLM on Thursday March 28, @04:32PM

        by VLM (445) on Thursday March 28, @04:32PM (#1350712)

        long, complicated regexes are difficult to debug end seldom worth the effort

        At some point it turns into "lets scrap it and replace with a simple parser"

        It would be interesting to see a compiler that uses regex instead of a parser design.

        You can regex a simple machine code assembler, but much more than the simplest and forget regex design it's parser time. Its pretty easy to make an assembler that uses regex to assemble a "nop" but "movb r0 (r1)+" (Its macro-11) would take a thundering lot of regex. That assembly language would, when thought about like C, be like write the contents of the first byte of variable/register r0 to memory using r1 as the pointer then increment the pointer for later use, essentially strcpy a single type char variable into a char array and get ready to copy the next char. C of course is just slightly fancied up PDP11 assembly, it's only on inferior processors that C looks like a separate language.

      • (Score: 2) by Rosco P. Coltrane on Thursday March 28, @05:19PM

        by Rosco P. Coltrane (4757) on Thursday March 28, @05:19PM (#1350727)

        Reading someone else's regexes is even more difficult; like APL, it's a write-only language.

        That's because programmers somehow forget all rules of readability when they write regexes for some reason.

        My regexes are spread over several lines and indented. You can read them just fine even if they're really complicated. I always take the time to make my code readable for everybody else as a common courtesy, and regexes are an integral part of my code, so they get the same treatment for the same purpose.

  • (Score: 4, Interesting) by krishnoid on Wednesday March 27, @11:53PM (2 children)

    by krishnoid (1156) on Wednesday March 27, @11:53PM (#1350598)

    If you start with the underlying computer science concept of a finite state machine [youtu.be], it becomes *way* easier to visualize -- assuming you can think in the manner of a finite state machine. I couldn't find a good example site for the diagrams corresponding to common regular expressions, but there might be some out there.

    • (Score: 2, Interesting) by Anonymous Coward on Thursday March 28, @04:28AM (1 child)

      by Anonymous Coward on Thursday March 28, @04:28AM (#1350629)

      If you start with the underlying computer science concept of a finite state machine [youtu.be], it becomes *way* easier to visualize -- assuming you can think in the manner of a finite state machine. I couldn't find a good example site for the diagrams corresponding to common regular expressions, but there might be some out there.

      A deterministic finite automaton (DFA) is in some sense simpler to understand than a regular expression and are probably a good way to introduce the concept of a regular language. However, regular expressions (in the formal language theory sense: with alternation, concatenation and Kleene closure operators) only correspond nicely to nondeterministic finite automata (NFA). Of course any NFA can be converted to an equivalent DFA but then it won't look much like the original regular expression anymore. I'm not really sure if introducing the concept of nondeterministic automata to someone struggling to understand regular expressions is really all that helpful -- It sounds a bit like the burrito effect [wordpress.com].

      But more importantly, Unix regexes bear only a passing resemblance to their formal language cousins and they have operators which provide substantially more computational power than what is possible with an NFA. Here's a particularly extreme example: the following grep command matches any line whose length is a composite number greater than 1:

          grep '^\(...*\)\1\1*$'

      This regex is simply impossible to understand as a DFA (or NFA). I don't know if backreferences were supported in Thompson's original implementation but this works in UNIX V7 grep (ca. 1979). Furthermore, not even pushdown automata (equivalent in expressive power to general context-free grammars) can do this. Maybe you can do it with a linearly-bounded automaton (corresponding to some length-increasing grammar) but I did not attempt to prove that.

      • (Score: 3, Interesting) by krishnoid on Thursday March 28, @09:59PM

        by krishnoid (1156) on Thursday March 28, @09:59PM (#1350767)

        They don't correspond directly, but understanding a DFA can help understand the shorthand for automata-theoretic regular expressions (Kleene star, et al). That makes it easier to extend DFAs to NFAs, and then add on the additional concepts used in actual programming-language regexes.

        For example, DFAs don't have additional storage; the DFA is pretty much just the state machine description and the current state, with no knowledge of previous input. Parentheses and backreferences within a regex require additional storage, which are better understood when considered as a useful programming language extension, to the mathematically-pure DFA and corresponding regular expressions.

  • (Score: 2) by Mojibake Tengu on Thursday March 28, @05:07AM (2 children)

    by Mojibake Tengu (8598) on Thursday March 28, @05:07AM (#1350630) Journal

    To think you can parse just about anything with keyboard characters...

    You cannot. Regular expressions are not Turing complete, you can parse only regular languages with them.

    https://en.wikipedia.org/wiki/Chomsky_hierarchy [wikipedia.org]

    Though Chomsky got already cancelled for being against the Cancel Cult so your own cybernetics reality may differ now...

    --
    Rust programming language offends both my Intelligence and my Spirit.
    • (Score: 1) by shrewdsheep on Thursday March 28, @10:04AM (1 child)

      by shrewdsheep (5215) on Thursday March 28, @10:04AM (#1350652)

      You cannot. Regular expressions are not Turing complete, you can parse only regular languages with them.

      By most definitions, you cannot even do that. Regular expression can only tokenize streams which is their use in many grammars (regular expressions do not have memory and cannot match opening/closing parentheses, for example, at least not to arbitrary levels). The concept of being turing complete is orthogonal to the Chomsky hierarchy you mention.

      • (Score: 3, Informative) by Anonymous Coward on Thursday March 28, @03:22PM

        by Anonymous Coward on Thursday March 28, @03:22PM (#1350698)

        you can parse only regular languages with them.

        By most definitions, you cannot even do that. Regular expression can only tokenize streams which is their use in many grammars (regular expressions do not have memory and cannot match opening/closing parentheses,

        In formal language theory, "parse" simply means "for a language (set of strings) L and a string w, determine whether or not w is a member the set L". A regular language is simply one that can be parsed by a deterministic finite automaton (DFA), which is exactly the set of languages that can be described by regular expressions with the alternation, concatenation and kleene closure operators.

        Languages with balanced parentheses are trivially shown to be not regular if the parentheses can be nested to arbitrary depth. So the fact that regular expressions cannot match arbitrary nesting depths of open/close parentheses does not contradict GP's point, as such a language is not a regular language. On the other hand, balanced parentheses with a finite maximum nesting depth is regular (basically, the states of a DFA can be used to count things but only to a finite bound).

        Nevertheless, if you're talking about real-world regex implementations on computers then the GP's statement is not true, because real-world regex implementations have different operators and are usually much more computationally powerful than a DFA. For example, perl-compatible regexes have recursive match operators and absolutely can match balanced parentheses:

        pcregrep '^([(](?1)*[)])*$' <<'EOF'
        ()
        (())
        ((()())())()
        (((())())()
        EOF

        prints only lines with balanced parentheses:

        ()
        (())
        ((()())())()

        .NET regexes have an explicit "balancing group" operator which can be used for this sort of thing. It is also apparently possible with Java regexes [drregex.com] which have neither recursion nor balancing groups, but do have lookahead and forward reference operators.