SoylentNews is people

posted by janrinok on Wednesday March 27, @08:12PM
from the I-didn't-know-that-... dept.

Last week I fell into a bit of a rabbit hole: why do regular expressions use $ and ^ as line anchors?

This talk brings up that they first appeared in Ken Thompson's port of the QED text editor. In his manual he writes:

b) "^" is a regular expression which matches character at the beginning of a line.

c) "$" is a regular expression which matches character before the character <nl> (usually at the end of a line)

QED was the precursor to ed, which was instrumental in popularizing regexes, so a lot of its design choices stuck.

Okay, but then why did Ken Thompson choose those characters?

This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
  • (Score: 5, Interesting) by Rosco P. Coltrane on Wednesday March 27, @10:31PM (3 children)

    Every time I have to regex, my brain turns to mush

    Regular expressions are well worth learning. They can elegantly solve many problems you wouldn't expect them to apply to.

    For instance, I have software that shows information on a special display. The information is organized in pages of icons, with the last two icons on each page reserved for arrows that jump from page to page. The full list of icons changes regularly, and I don't know in advance how, because it's fed from another process. But if the icons on the page the user is currently viewing don't change, I don't want to disturb the display and jar the user every time the icons that aren't displayed are updated.

    So each time my software receives a new list of icons, it must look for a sequence of icons that matches, more or less, what the user is currently seeing. That way it can silently update the current page number rather than resetting it to 0 and changing the display unexpectedly.

    Well, you can write a search algorithm to find the sequence of icons matching the currently displayed set, of course. Or you can write a single regex: encode the icons as newline-separated lines of comma-separated hashes, one line per page, with a line number at the beginning of each. Then simply match the comma-separated list of icon hashes for the page currently displayed, and recover the leading line number from the match to get the new current page number.

    In other words, one line of regex replaces an entire search function.
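A minimal Python sketch of the approach described above, under several assumptions: the function names, the `<page>:<hash>,<hash>` line layout, and the sample hashes are all hypothetical, and it does an exact match of the visible page rather than the "more or less" match the comment mentions.

```python
import re

def encode_pages(icon_hashes, per_page):
    """Encode icons as one line per page: '<page>:<hash>,<hash>,...'."""
    pages = [icon_hashes[i:i + per_page]
             for i in range(0, len(icon_hashes), per_page)]
    return "\n".join(f"{n}:{','.join(page)}" for n, page in enumerate(pages))

def find_page(encoded, visible_hashes):
    """Return the page number whose line holds exactly the visible hashes, or None."""
    wanted = ",".join(re.escape(h) for h in visible_hashes)
    # ^ and $ anchor the match to one whole line; the group captures the line number.
    match = re.search(rf"^(\d+):{wanted}$", encoded, flags=re.MULTILINE)
    return int(match.group(1)) if match else None

# Hypothetical data: seven icons, three per page.
encoded = encode_pages(["a1", "b2", "c3", "d4", "e5", "f6", "g7"], per_page=3)
current_page = find_page(encoded, ["d4", "e5", "f6"])
```

The search itself is the single `re.search` call. Everything else is encoding, which is the trade-off the comment describes: the data is reshaped so that the regex engine can do the searching.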

    Of course, that's just an example. They're also well suited to input processing: one regex can match several types of lines with different fields, validate each line, and split out the fields, all in one match, rather than needing an endless sequence of case... or if...else to process your input safely.
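As a rough illustration of the input-processing point, here is a Python sketch in which one pattern recognizes three line types, validates them, and splits out their fields. The SET/MOVE/QUIT formats and the `parse_line` helper are invented for this example and are not from the comment.

```python
import re

# One alternation branch per line type; named groups pull out the fields.
LINE = re.compile(
    r"^(?:"
    r"SET (?P<key>[a-z_]+)=(?P<value>\d+)"
    r"|MOVE (?P<x>-?\d+),(?P<y>-?\d+)"
    r"|(?P<quit>QUIT)"
    r")$"
)

def parse_line(line):
    """Return the fields of a valid line as a dict, or None if the line is invalid."""
    m = LINE.match(line)
    if m is None:
        return None
    # Groups from branches that did not match are None; drop them.
    return {name: text for name, text in m.groupdict().items() if text is not None}

fields = parse_line("SET speed=42")   # fields for a SET line
rejected = parse_line("MOVE 1;2")     # malformed separator, so no match
```

Which keys are present in the returned dict tells the caller which line type matched, so the dispatch that would otherwise be a chain of if...else is folded into the pattern.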

    I consider regular expressions - actually, regular expressionS, as there's more than one variant - a separate language any seasoned programmer should master completely. They will make your code more concise and more efficient in any language to a degree you can't even begin to suspect. And if you use vi - which uses its own slightly different variant of regular expressions, which makes total sense when you understand why - you'll become an editing speed demon too.

    Take the time to learn regular expressions: trust me on that one, it's an investment that will pay back a thousand times.

  • (Score: 2) by ChrisMaple on Thursday March 28, @03:32PM (2 children)

    I use regexes occasionally, and they're a powerful time saver. However, long, complicated regexes are difficult to debug and seldom worth the effort. Reading someone else's regexes is even more difficult; like APL, it's a write-only language.

    • (Score: 2) by VLM on Thursday March 28, @04:32PM

      long, complicated regexes are difficult to debug and seldom worth the effort

      At some point it turns into "let's scrap it and replace it with a simple parser"

      It would be interesting to see a compiler that uses regex instead of a parser design.

      You can regex a simple machine code assembler, but go much beyond the simplest one and you can forget a regex design: it's parser time. It's pretty easy to make an assembler that uses regex to assemble a "nop", but "movb r0 (r1)+" (it's MACRO-11) would take a thundering lot of regex. Thought about in C terms, that instruction writes the first byte of variable/register r0 to memory using r1 as the pointer, then increments the pointer for later use: essentially, copy a single char variable into a char array and get ready to copy the next char, as strcpy does. C of course is just slightly fancied-up PDP-11 assembly; it's only on inferior processors that C looks like a separate language.
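To make the "nop" versus "movb" contrast concrete, here is a toy Python sketch, not a real MACRO-11 assembler. It assumes the usual PDP-11 encodings (NOP as 000240 octal, MOVB as 11SSDD octal with 3 mode bits and 3 register bits per operand) and handles only two addressing modes; the helper names are made up for the example.

```python
import re

def operand(name):
    """Regex for one operand: register 'rN' or autoincrement '(rN)+'."""
    return rf"(?:r(?P<{name}_reg>[0-7])|\(r(?P<{name}_inc>[0-7])\)\+)"

NOP = re.compile(r"^\s*nop\s*$")
# Operands may be separated by a comma or, as in the comment, by whitespace only.
MOVB = re.compile(
    rf"^\s*movb\s+{operand('src')}(?:\s*,\s*|\s+){operand('dst')}\s*$"
)

def field(m, name):
    """Six-bit operand field: addressing mode in the high 3 bits, register in the low 3."""
    if m.group(f"{name}_reg") is not None:
        return (0 << 3) | int(m.group(f"{name}_reg"))   # mode 0: register
    return (2 << 3) | int(m.group(f"{name}_inc"))       # mode 2: autoincrement

def assemble(line):
    """Assemble one line into a 16-bit instruction word (assumed encodings)."""
    if NOP.match(line):
        return 0o000240
    m = MOVB.match(line)
    if m:
        return 0o110000 | (field(m, "src") << 6) | field(m, "dst")
    raise ValueError(f"cannot assemble: {line!r}")

word = assemble("movb r0 (r1)+")
```

Even with only two of the addressing modes, the operand pattern is already the bulk of the code, and every further mode, label, or expression adds another alternation branch. That growth is the point at which a small parser starts to look simpler.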

    • (Score: 2) by Rosco P. Coltrane on Thursday March 28, @05:19PM

      Reading someone else's regexes is even more difficult; like APL, it's a write-only language.

      That's because programmers somehow forget all rules of readability when they write regexes.

      My regexes are spread over several lines and indented. You can read them just fine even if they're really complicated. I always take the time to make my code readable for everybody else as a common courtesy, and regexes are an integral part of my code, so they get the same treatment for the same purpose.
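The comment does not say which language or regex flavor is involved. As one possible illustration, Python's `re.VERBOSE` flag ignores whitespace in the pattern and allows `#` comments, so a regex can be laid out over several indented lines. The time-of-day pattern below is a made-up example.

```python
import re

TIME_OF_DAY = re.compile(
    r"""
    ^
    (?P<hour>   [01]\d | 2[0-3] )     # hours, 00 to 23
    :
    (?P<minute> [0-5]\d )             # minutes, 00 to 59
    (?: : (?P<second> [0-5]\d ) )?    # optional seconds, 00 to 59
    $
    """,
    re.VERBOSE,
)

m = TIME_OF_DAY.match("07:05:09")
hour, minute, second = m.group("hour", "minute", "second")
```

The same pattern on one line would be `^(?P<hour>[01]\d|2[0-3]):(?P<minute>[0-5]\d)(?::(?P<second>[0-5]\d))?$`, which matches identically but is much harder to review.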