https://buttondown.email/hillelwayne/archive/why-do-regexes-use-and-as-line-anchors/
Last week I fell into a bit of a rabbit hole: why do regular expressions use $ and ^ as line anchors?
This talk brings up that they first appeared in Ken Thompson's port of the QED text editor. In his manual he writes:
b) "^" is a regular expression which matches character at the beginning of a line.
c) "$" is a regular expression which matches character before the character (usually at the end of a line)
QED was the precursor to ed, which was instrumental in popularizing regexes, so a lot of its design choices stuck.
Okay, but then why did Ken Thompson choose those characters?
(Score: 5, Touché) by dwilson98052 on Wednesday March 27 2024, @09:37PM (2 children)
... when people write them backwards...
it's ^ $
not $ ^
(Score: 3, Funny) by Rosco P. Coltrane on Wednesday March 27 2024, @10:12PM (1 child)
What if it's a regex in Arabic?
(Score: 4, Touché) by RamiK on Thursday March 28 2024, @11:17AM
^مرحبا.*$ matches مرحبا بالعالم fine.
compiling...
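The Arabic example above can be checked with a short Python sketch. It assumes Python's standard `re` module, which is not what the commenter used. Strings are stored in logical (typing) order, so `^` anchors at the first character typed and `$` at the last, regardless of which side of the screen they are drawn on.

```python
import re

hello = "مرحبا"      # "hello"
world = "بالعالم"    # "(to the) world"

# Same pattern as in the comment: starts with "hello", then anything.
pattern = re.compile("^" + re.escape(hello) + ".*$")

greeting = hello + " " + world   # the string from the comment
swapped = world + " " + hello    # same words, opposite logical order

print(bool(pattern.match(greeting)))  # True
print(bool(pattern.match(swapped)))   # False
```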
(Score: 3, Interesting) by Rosco P. Coltrane on Wednesday March 27 2024, @10:00PM (8 children)
His mind will be blown when he realizes ^ and $ move the cursor to the beginning and the end of the line too.
(Score: 4, Insightful) by janrinok on Wednesday March 27 2024, @10:46PM (7 children)
The same question arises - why did they choose those characters? It all seems to go back to the same origin. They were the only characters available on teletypes, which were a primary interface device when I started in the late 70s.
I am not interested in knowing who people are or where they live. My interest starts and stops at our servers.
(Score: 5, Interesting) by Rosco P. Coltrane on Wednesday March 27 2024, @11:00PM (2 children)
I wasn't answering the question, just pointing out that whatever weird choices people of yesteryear made for whatever reason still echo in software of today.
Vi of course is the next evolutionary step after ed, so it's normal that it uses ^ and $ for the same purpose as ed.
It's just that... Think about it: you can install vim on any modern system today - and I do mean ANY system: there's a port of vim for every OS known to man - and you can still hit ^ and $ for quick navigation.
I bet Ken Thompson picked those characters on a whim. I pick commands and command-line arguments on a whim too when I write utilities at my company, and years down the line, they've turned into a sort of de-facto "standard" within my company. It never ceases to amaze me.
Similarly, I bet Ken Thompson never ceases to be amazed that his split-second decisions of decades ago are used far and wide all over the world, on every OS, by millions of people, so long after he made those split-second decisions.
Still, whatever the reason, the fact is that Unix tools are very consistent. I learned those conventions at school and I still use them today, only a few years away from retirement. I would argue that this is the ultimate user-friendliness - and as the saying goes, Unix is very user-friendly, it's just very particular with which friends it chooses 🙂
(Score: 4, Informative) by KritonK on Thursday March 28 2024, @06:40AM (1 child)
Although I do use $ to go to the end of the line, I use 0 to go to the beginning of the line. It's much easier to type.
(Score: 4, Informative) by Geoff Clare on Friday March 29 2024, @11:03AM
They are actually slightly different. If the line is indented, 0 goes to the very beginning but ^ goes to the first character after the indent.
(Score: 2) by martyb on Wednesday March 27 2024, @11:11PM (3 children)
I can vouch for that! I learned to program using a (60?) column, continuous feed output (having 500 lines? inches?).
Earplugs were optional, but recommended! Then again, the computer was a multiprocessing, multi-user PDP-8/E having ~24KB of *core* memory!
Wit is intellect, dancing.
(Score: 4, Funny) by janrinok on Wednesday March 27 2024, @11:22PM
(Score: 4, Funny) by Rosco P. Coltrane on Wednesday March 27 2024, @11:44PM (1 child)
You kids had it easy. When I learned programming, we had to punch the cards with a hammer and a chisel!
(Score: 2) by turgid on Thursday March 28 2024, @09:43PM
In my day, we had Hovis.
I refuse to engage in a battle of wits with an unarmed opponent [wikipedia.org].
(Score: 4, Funny) by Barenflimski on Wednesday March 27 2024, @10:09PM (10 children)
Every time I have to regex, my brain turns to mush. I do it just enough to get good at it, and then I don't have to do it again for a year or more.
With that being said, it is fairly useful for things. To think you can parse just about anything with keyboard characters I otherwise would never use, is pretty cool.
As far as I'm concerned, every time I use Regex, I enter a rabbit hole that turns me into a robot and breaks my brain.
(Score: 5, Interesting) by Rosco P. Coltrane on Wednesday March 27 2024, @10:31PM (3 children)
Regular expressions are well worth learning. They can very elegantly solve many problems you wouldn't think they apply to.
For instance, I have this software that displays information on a special display: the information is organized in pages of icons, with the last two icons being reserved for arrows to jump from page to page. The total list of icons changes regularly, and I don't know how because it's fed from another process. But if the icons displayed on whatever page the user is currently on don't change, I don't want to disturb the display and jar the user each time the rest of the icons, the ones that aren't displayed, are updated.
So each time my software receives a new list of icons to display, it must try to find a sequence of icons that matches more or less what the user is currently seeing, so it can silently update the current page number rather than setting the page number back to 0 and updating the display unexpectedly.
Well, you can write a search algorithm to find the sequence of icons matching the currently displayed set of icons, of course. Or you can write a single regex: encode all the icons as a string of comma-separated hashes, themselves newline-separated, with a line number at the beginning of each line, then simply match the comma-separated list of icon hashes corresponding to the page currently displayed and recover the leading line number in the match to find the new current page number.
In other words, one line of regex replaces an entire search function.
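As a rough illustration of the idea described above, here is a minimal Python sketch. The function names, the `page:hash,hash` line layout, and the sample hashes are all assumptions made for this example, not the commenter's actual code. It also only finds an exact match of the visible page, not the "more or less" match mentioned above.

```python
import re

def encode_pages(icon_hashes, per_page):
    """One line per page: '<page number>:<hash>,<hash>,...' (hypothetical layout)."""
    pages = [icon_hashes[i:i + per_page]
             for i in range(0, len(icon_hashes), per_page)]
    return "\n".join(f"{n}:{','.join(page)}" for n, page in enumerate(pages))

def find_page(encoded, visible_hashes):
    """Return the page whose icons equal visible_hashes, or None if there is none."""
    # ^ and $ anchor on each line because of re.MULTILINE.
    pattern = r"^(\d+):" + re.escape(",".join(visible_hashes)) + r"$"
    match = re.search(pattern, encoded, flags=re.MULTILINE)
    return int(match.group(1)) if match else None

# The user was looking at icons e5 and f6; two new icons were added in front.
new_list = ["x0", "y1", "a1", "b2", "c3", "d4", "e5", "f6"]
encoded = encode_pages(new_list, per_page=2)
print(find_page(encoded, ["e5", "f6"]))  # 3
print(find_page(encoded, ["zz", "qq"]))  # None
```

The sketch assumes the hashes contain no commas or newlines, so a full-line match cannot straddle two pages.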
Of course, that's just an example. They're also very well suited for input processing - as in, one regex can match several types of lines with various different fields, validate the different lines and split the fields all in one match sequence - rather than an endless sequence of case... or if...else to process your input safely.
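A small Python sketch of that kind of input processing follows. The two-command input format (`SET key = number` and `DEL key`) is invented for illustration, as are the group names; the point is only that one pattern both validates the line and splits out its fields.

```python
import re

# Hypothetical two-command input format, invented for this sketch.
LINE = re.compile(
    r"^(?:SET\s+(?P<set_key>\w+)\s*=\s*(?P<set_value>\d+)"
    r"|DEL\s+(?P<del_key>\w+))$"
)

def parse_line(line):
    """Validate one input line and split its fields; None means invalid."""
    m = LINE.match(line)
    if m is None:
        return None
    if m.group("set_key") is not None:
        return ("set", m.group("set_key"), int(m.group("set_value")))
    return ("del", m.group("del_key"))

for line in ["SET retries = 3", "DEL retries", "SET retries = lots"]:
    print(parse_line(line))
```

Lines that fit neither alternative fall through to `None`, which stands in for the chain of if/else checks.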
I consider regular expressions - actually, regular expressionS, as there's more than one variant - a separate language any seasoned programmer should master completely. They will make your code more concise and more efficient in any language to a degree you can't even begin to suspect. And if you use vi - which uses its own slightly different variant of regular expressions, which makes total sense when you understand why - you'll become an editing speed demon too.
Take the time to learn regular expressions: trust me on that one, it's an investment that will pay back a thousand times.
(Score: 2) by ChrisMaple on Thursday March 28 2024, @03:32PM (2 children)
I use regexes occasionally, and they're a powerful time saver. However, long, complicated regexes are difficult to debug and seldom worth the effort. Reading someone else's regexes is even more difficult; like APL, it's a write-only language.
(Score: 2) by VLM on Thursday March 28 2024, @04:32PM
At some point it turns into "let's scrap it and replace it with a simple parser"
It would be interesting to see a compiler that uses regex instead of a parser design.
You can regex a simple machine code assembler, but for much more than the simplest, forget regex design; it's parser time. It's pretty easy to make an assembler that uses regex to assemble a "nop", but "movb r0 (r1)+" (it's MACRO-11) would take a thundering lot of regex. That assembly language would, when thought about like C, be like: write the contents of the first byte of variable/register r0 to memory using r1 as the pointer, then increment the pointer for later use; essentially, strcpy a single char variable into a char array and get ready to copy the next char. C of course is just slightly fancied-up PDP-11 assembly; it's only on inferior processors that C looks like a separate language.
(Score: 2) by Rosco P. Coltrane on Thursday March 28 2024, @05:19PM
That's because programmers somehow forget all rules of readability when they write regexes.
My regexes are spread over several lines and indented. You can read them just fine even if they're really complicated. I always take the time to make my code readable for everybody else as a common courtesy, and regexes are an integral part of my code, so they get the same treatment for the same purpose.
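For readers who have not seen a regex laid out this way, here is a minimal Python sketch using `re.VERBOSE`, which ignores whitespace in the pattern and allows comments. The timestamp format is an arbitrary example chosen for this sketch, not something from the comment above.

```python
import re

# The same pattern twice: once on a single line, once spread out and commented.
COMPACT = re.compile(r"^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2})(?::(\d{2}))?$")

READABLE = re.compile(r"""
    ^
    (\d{4}) - (\d{2}) - (\d{2})     # date: year, month, day
    T
    (\d{2}) : (\d{2})               # time: hours, minutes
    (?: : (\d{2}) )?                # optional seconds
    $
""", re.VERBOSE)

for text in ["2024-03-27T21:37", "2024-03-27T21:37:05", "2024-03-27 21:37"]:
    compact = COMPACT.match(text)
    readable = READABLE.match(text)
    # Both spellings accept or reject the same inputs.
    print(text, bool(compact), bool(readable))
```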
(Score: 4, Interesting) by krishnoid on Wednesday March 27 2024, @11:53PM (2 children)
If you start with the underlying computer science concept of a finite state machine [youtu.be], it becomes *way* easier to visualize -- assuming you can think in the manner of a finite state machine. I couldn't find a good example site for the diagrams corresponding to common regular expressions, but there might be some out there.
(Score: 2, Interesting) by Anonymous Coward on Thursday March 28 2024, @04:28AM (1 child)
A deterministic finite automaton (DFA) is in some sense simpler to understand than a regular expression and is probably a good way to introduce the concept of a regular language. However, regular expressions (in the formal language theory sense: with alternation, concatenation and Kleene closure operators) only correspond nicely to nondeterministic finite automata (NFA). Of course any NFA can be converted to an equivalent DFA but then it won't look much like the original regular expression anymore. I'm not really sure if introducing the concept of nondeterministic automata to someone struggling to understand regular expressions is really all that helpful -- It sounds a bit like the burrito effect [wordpress.com].
But more importantly, Unix regexes bear only a passing resemblance to their formal language cousins and they have operators which provide substantially more computational power than what is possible with an NFA. Here's a particularly extreme example: the following grep command matches any line whose length is a composite number greater than 1:
grep '^\(...*\)\1\1*$'
This regex is simply impossible to understand as a DFA (or NFA). I don't know if backreferences were supported in Thompson's original implementation but this works in UNIX V7 grep (ca. 1979). Furthermore, not even pushdown automata (equivalent in expressive power to general context-free grammars) can do this. Maybe you can do it with a linearly-bounded automaton (corresponding to some length-increasing grammar) but I did not attempt to prove that.
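The composite-length claim can be checked with a short Python sketch. It assumes Python's `re` module rather than grep, so the BRE escapes `\(` and `\)` become plain parentheses; the pattern is otherwise the same. The group captures two or more characters, and the backreference requires that exact chunk to repeat at least once more, so a line matches only if its length is a product of two numbers that are both at least 2.

```python
import re

# Python spelling of the grep pattern '^\(...*\)\1\1*$'.
COMPOSITE_LENGTH = re.compile(r"^(...*)\1\1*$")

def is_composite(n):
    """Reference check: n has a divisor strictly between 1 and n."""
    return any(n % d == 0 for d in range(2, n))

matching = [n for n in range(1, 21) if COMPOSITE_LENGTH.match("x" * n)]
print(matching)  # [4, 6, 8, 9, 10, 12, 14, 15, 16, 18, 20]
```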
(Score: 3, Interesting) by krishnoid on Thursday March 28 2024, @09:59PM
They don't correspond directly, but understanding a DFA can help understand the shorthand for automata-theoretic regular expressions (Kleene star, et al). That makes it easier to extend DFAs to NFAs, and then add on the additional concepts used in actual programming-language regexes.
For example, DFAs don't have additional storage; the DFA is pretty much just the state machine description and the current state, with no knowledge of previous input. Parentheses and backreferences within a regex require additional storage, which is better understood when considered as a useful programming-language extension to the mathematically pure DFA and corresponding regular expressions.
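To make the "state machine description plus current state" point concrete, here is a minimal table-driven DFA in Python. The language it accepts, `a(b|c)*d`, and the state names are arbitrary choices for this sketch. The only thing carried from one character to the next is the current state.

```python
import re
from itertools import product

# Transition table for the regular expression a(b|c)*d.
TRANSITIONS = {
    ("start", "a"): "body",
    ("body", "b"): "body",
    ("body", "c"): "body",
    ("body", "d"): "done",
}
ACCEPTING = {"done"}

def dfa_accepts(text):
    state = "start"
    for ch in text:
        # A missing table entry plays the role of an implicit dead state.
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING

print(dfa_accepts("abcbd"))  # True
print(dfa_accepts("abda"))   # False
```

Comparing `dfa_accepts` against `re.fullmatch(r"a(b|c)*d", text)` on every short string over the alphabet `abcd` is an easy way to convince yourself the table and the expression describe the same language.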
(Score: 2) by Mojibake Tengu on Thursday March 28 2024, @05:07AM (2 children)
You cannot. Regular expressions are not Turing complete, you can parse only regular languages with them.
https://en.wikipedia.org/wiki/Chomsky_hierarchy [wikipedia.org]
Though Chomsky got already cancelled for being against the Cancel Cult so your own cybernetics reality may differ now...
Rust programming language offends both my Intelligence and my Spirit.
(Score: 1) by shrewdsheep on Thursday March 28 2024, @10:04AM (1 child)
By most definitions, you cannot even do that. Regular expressions can only tokenize streams, which is their use in many grammars (regular expressions do not have memory and cannot match opening/closing parentheses, for example, at least not to arbitrary levels). The concept of being Turing complete is orthogonal to the Chomsky hierarchy you mention.
(Score: 3, Informative) by Anonymous Coward on Thursday March 28 2024, @03:22PM
In formal language theory, "parse" simply means "for a language (set of strings) L and a string w, determine whether or not w is a member of the set L". A regular language is simply one that can be parsed by a deterministic finite automaton (DFA), which is exactly the set of languages that can be described by regular expressions with the alternation, concatenation and Kleene closure operators.
Languages with balanced parentheses are trivially shown to be not regular if the parentheses can be nested to arbitrary depth. So the fact that regular expressions cannot match arbitrary nesting depths of open/close parentheses does not contradict GP's point, as such a language is not a regular language. On the other hand, balanced parentheses with a finite maximum nesting depth is regular (basically, the states of a DFA can be used to count things but only to a finite bound).
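The bounded-depth case can be sketched in Python using only concatenation, grouping and the star operator, which keeps it within the formal definition of a regular expression. The helper names are made up for this example. Each extra level of nesting wraps the previous pattern in one more layer, so the pattern grows with the depth limit and no single pattern covers every depth.

```python
import re
from itertools import product

def bounded_paren_regex(max_depth):
    """Build a regex for balanced parentheses nested at most max_depth deep."""
    pattern = ""                      # depth 0: only the empty string
    for _ in range(max_depth):
        # one more level: zero or more "(" <previous level> ")" groups
        pattern = r"(?:\(" + pattern + r"\))*"
    return re.compile(pattern)

def balanced_within(text, max_depth):
    """Reference check with an explicit counter; text must contain only ( and )."""
    depth = 0
    for ch in text:
        depth += 1 if ch == "(" else -1
        if depth < 0 or depth > max_depth:
            return False
    return depth == 0

depth2 = bounded_paren_regex(2)
print(bool(depth2.fullmatch("(())()")))   # True: never deeper than 2
print(bool(depth2.fullmatch("((()))")))   # False: reaches depth 3
```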
Nevertheless, if you're talking about real-world regex implementations on computers then the GP's statement is not true, because real-world regex implementations have different operators and are usually much more computationally powerful than a DFA. For example, perl-compatible regexes have recursive match operators and absolutely can match balanced parentheses:
pcregrep '^([(](?1)*[)])*$' <<'EOF'
()
(())
((()())())()
(((())())()
EOF
prints only lines with balanced parentheses:
()
(())
((()())())()
.NET regexes have an explicit "balancing group" operator which can be used for this sort of thing. It is also apparently possible with Java regexes [drregex.com] which have neither recursion nor balancing groups, but do have lookahead and forward reference operators.
(Score: 5, Insightful) by WizardFusion on Wednesday March 27 2024, @10:34PM (23 children)
Why do websites insist on just using the middle third of the screen with massive amounts of white space either side?
It's not like I am using a wide screen either, just standard 1080p laptop.
(Score: 5, Informative) by Rosco P. Coltrane on Wednesday March 27 2024, @10:40PM (18 children)
It's formatted for cellphone reading.
People browse on cellphones nowadays, and websites do this sort of garbage rather than write articles with text that reflows gracefully, because whatever pile of Javascript powers their website lets them impose portrait mode on everybody, and it's quicker and lazier to ensure their content displays well when it's rendered only in one mode.
(Score: 4, Insightful) by janrinok on Wednesday March 27 2024, @11:27PM (13 children)
We sometimes get complaints from those who browse our site using smartphones etc. Why don't we use Bootstrap? they ask - which is what most sites optimised for smartphones use. We like our quirky olde-worlde appearance.
(Score: 5, Insightful) by Rosco P. Coltrane on Wednesday March 27 2024, @11:41PM
Well, if you try modern here, you're gonna fail miserably 🙂
SN and /. are living internet history. They SHOULD look like the '90s, and if you try to modernize them, you'll lose what makes them what they are, and the audience that comes here to get their dose of the 90s. /. tried it and failed, and now they're more or less back to their old self and chugging along, doing what SN and /. do best; providing a taste of what the free internet was like a quarter century ago.
I want my mobile browser to render SN as my desktop browser does. It wouldn't be right if it tried to have a mobile version.
(Score: 4, Insightful) by krishnoid on Wednesday March 27 2024, @11:56PM (10 children)
How about a few CSS files, maybe even manually selectable -- phone, tablet, desktop? Then we don't need Javascript, and CSS is really powerful as it is.
(Score: 5, Touché) by janrinok on Wednesday March 27 2024, @11:59PM (3 children)
Are you volunteering for a job? :D
(Score: 2, Interesting) by shrewdsheep on Thursday March 28 2024, @10:11AM (2 children)
Why the quip? The thread earlier was lining up against any change to the website. In view of the future of the site, I strongly support CSS optimizations and the use of limited javascript to improve usability. The old style can and should always be retained as a tribute to the legacy of the site. The user base has to be broadened for long term survival of SN and I believe that appearance and usability are a good part of it when it comes to attracting new readership.
(Score: 4, Informative) by janrinok on Thursday March 28 2024, @10:54AM (1 child)
It was merely asking who would be doing this task? To have selectable CSS pages will require significant Perl code changes and new fields added to the database so that the user's choice is remembered between log-ins. The displays are created by templates which might have to be changed to cope with different CSS. It is not my area.
No offence was intended. I did include a grin!
(Score: 2, Interesting) by Anonymous Coward on Thursday March 28 2024, @04:42PM
User-selectable alternate stylesheets were an original design feature of CSS but it's unfortunate that the browser support today is completely useless.
The CSS2 specification actually says that user agents must provide an interface to change between alternate stylesheets [w3.org]. I don't know if this requirement persists in current specifications. Firefox has the choice in a hidden menu but it forgets your selection as soon as you reload the page or follow any link so it's basically unusable. Such a shame.
(Score: 3, Interesting) by bzipitidoo on Thursday March 28 2024, @12:24PM (5 children)
CSS is great, but should even that be necessary? Reflowing of text to fit window sizes is a core feature of plain old HTML. No more reliance on CR/LF for that. If a user is forced to scroll back and forth to view the full width, despite the site not using any JavaScript or set widths or whatever, that seems to me a problem with their environment, not the site. Note also that the browser has the final say over fonts. A site can give relative size differences; the user's system ultimately sets what size "big", "normal" and "small" are.
A common idea is a mobile version of the site. m.soylentnews.org
(Score: 3, Informative) by janrinok on Thursday March 28 2024, @12:50PM
(Score: 2) by owl on Thursday March 28 2024, @07:29PM (1 child)
Because far too many designers feel some hugely irrational need to control the layout to a level far more strict than "let html lay out the data based upon the viewport width".
Note that most of these designers are those from the "publishing" environment where positioning text in this corner of a page, and a highlight image over in this other spot, in order to leave room in three other places for three ad slots, was what they were trained to do.
I.e., they never even consider just letting the HTML lay itself out natively. They need to control where it goes, to the pixel, or they feel they have failed.
(Score: 2) by krishnoid on Thursday March 28 2024, @09:21PM
Designers are probably trained for traditional print media, and something that describes layouts and floats using declarative and markup languages is probably going to require a GUI for them to be able to do their thing. They're not coders, after all.
(Score: 3, Insightful) by maxwell demon on Friday March 29 2024, @07:39AM (1 child)
That's the absolute worst thing to do. If I follow a link, it should work equally well if I follow it on a desktop or on a laptop.
The Tao of math: The numbers you can count are not the real numbers.
(Score: 2) by maxwell demon on Friday March 29 2024, @07:41AM
Err, mistyped, and only noticed after submit: I of course meant desktop or phone.
And in addition I now have to wait with posting this correction due to the unreasonably long wait time enforced by the Rehash software.
(Score: 3, Interesting) by Tork on Thursday March 28 2024, @12:23AM
Again I wasn't intending to raise this issue, I don't think it really needs to be fixed, it just became on-topic today. On a side note I do appreciate that the font shrinks as you go down the branch. Reading through threads works pretty well. It's one of the reasons I go here first before the green site.
🏳️‍🌈 Proud Ally 🏳️‍🌈
(Score: 2) by stormreaver on Thursday March 28 2024, @12:54AM (3 children)
Three-column layouts, where the main content is in the center, were common well before cellphones became widespread. That layout is common because there is a point where horizontal reading becomes less pleasant than vertical reading.
(Score: 3, Informative) by aafcac on Thursday March 28 2024, @02:18AM (2 children)
Yes, also remember that Xerox had their monitors in portrait format back when they were developing their paperless offices for a reason.
It's also worth recognizing that in some places that format of multiple columns makes it easier to fold up to not interfere with other folks on the bus while reading the newspaper.
(Score: 2) by krishnoid on Thursday March 28 2024, @09:27PM (1 child)
And the qwerty keyboard layout is designed to keep the hammers from jamming [dvzine.org].
(Score: 2) by aafcac on Friday March 29 2024, @02:14PM
Yes, and unfortunately only one of those two things can be easily remedied as needed. It's one of the reasons why it's worth having a good monitor stand, it should allow you to rotate the screen 90 degrees if you're doing something that benefits from that. Or, just have the window manager tile the window to only take up half of the width.
The keyboard though is a much more annoying problem. Yes, you can use whatever map you like, so long as you have enough keys, but it's a whole thing to learn a new layout and if you're a decent typist it may not even be worth the effort.
(Score: 2) by krishnoid on Wednesday March 27 2024, @11:54PM
And why is there so much vertical whitespace when laptops are landscape-aspect-ratio, and small fonts? Oh, probably same reason.
(Score: 1) by zdammit on Thursday March 28 2024, @08:10AM (2 children)
I don't know about you but I find very wide text layouts hard to read and I like the sites that try to make it less wide. There is a typographic convention that says line width should be 2.x alphabets (x depends on who you ask).
(Score: 2) by VLM on Thursday March 28 2024, @04:40PM
It's been a battle for decades now, that source code should either be like literature or prose with short lines for the casual reader vs source code should be like a math formula where stretching one unitary concept into unneeded extra steps or extra lines is mere obfuscation to be avoided for the simplicity of one line for one concept.
(Score: 3, Touché) by owl on Thursday March 28 2024, @07:31PM
Then narrow your viewing window, and suddenly, you get "less wide" text layouts.
The only reason you get very wide layouts is you have your browser window sized to be "very wide". I.e., the problem is completely under your control.
(Score: 2) by bzipitidoo on Thursday March 28 2024, @05:12AM (7 children)
I've been studying the history of computation, and I can say that the reason for many of the decisions is a desperate hurry. Just as soon as it's good enough, get that new tech out the door before the competition does.
What do you do when in a big hurry? You don't make up something completely new, you take whatever existing ideas there are that are familiar to you and that can be most easily and quickly adapted to the new need. This is why ASCII so slavishly follows the layout of that 19th century invention, the typewriter. Except that for the letters, it kept to a far, far older convention: alphabetic order, which goes back about 3800 years. And why not? You figure that the pioneers of typing worked out a decent choice and arrangement of symbols, why reinvent that wheel? EBCDIC on the other hand ultimately can trace its ancestry to another 19th century invention, the telegraph. The typewriter itself is no exception to this principle, having been inspired by pianos. That is why our computer keyboard is called a "key" board and not a button board or some other term. "key" comes from pianos, which comes from the musical notion of a "key" in which a piece of music is played.
As to programming languages, they borrowed heavily from mathematical notation. The hurrying was a little different in this case. With the success of FORTRAN in the mid 1950s, computer scientists realized people would all go off and do their own thing if they didn't come up with a standard, quick. They produced ALGOL 58, then ALGOL 60. ALGOL was very influential, but it didn't nip in the bud efforts to make more programming languages.
(Score: 2) by hendrikboom on Thursday March 28 2024, @01:24PM (2 children)
And, of course, Algol 68. Which introduced an algebra of types, which became the type system of C after some syntax changes that made it harder to understand.
(Score: 2) by hendrikboom on Thursday March 28 2024, @01:48PM
And some semantic changes that made it significantly less secure. And easier to implement.
(Score: 3, Informative) by bzipitidoo on Thursday March 28 2024, @08:12PM
And then, after ALGOL 68, the language just vanished. Did you ever wonder what happened to it? Pascal, that's what. ALGOL W was an alternate vision for the next version of ALGOL that was rejected in favor of what became ALGOL 68. ALGOL W was then used as the basis for Pascal.
(Score: 2) by ElizabethGreene on Thursday March 28 2024, @02:44PM (2 children)
How do you pronounce ASCII and EBCDIC? Mine are Ass-key and Ebb-ka-dick.
(Score: 2) by tangomargarine on Thursday March 28 2024, @06:01PM
Ebs-dick/Eb-stick
"Is that really true?" "I just spent the last hour telling you to think for yourself! Didn't you hear anything I said?"
(Score: 2) by cereal_burpist on Saturday March 30 2024, @03:45AM
(Score: 2) by krishnoid on Thursday March 28 2024, @09:32PM
Citations requested. There's a lot of value in being able to distinguish which decisions were made quickly [computer.org] from ones that were released after a careful design process, and how they anti/correlate to the quality of the result.
(Score: 3, Interesting) by ElizabethGreene on Thursday March 28 2024, @02:41PM
I sincerely doubt these are the origins, but the way I remember them...
If you put a carrot on a stick, everyone follows it and the buck stops here.
(Score: 2) by VLM on Thursday March 28 2024, @04:16PM (6 children)
Speculation: Probably used an ASR-33
Facts: Noobs think ^ is a shift-6 but on an ASR-33 it's shift-N, so we're talking about shift-4 vs shift-N
Speculation: Ken ran out of keys. This is not a 100+ key modern keyboard, so you're lucky it's shift-4 vs shift-N and not even worse key combos.
(Score: 2) by maxwell demon on Friday March 29 2024, @07:47AM (2 children)
For me, ^ doesn't even need shift, it's the key left of the 1 key.
(Score: 2) by VLM on Friday March 29 2024, @07:39PM (1 child)
OK fine I give up, what is your keyboard? It's not an AZERTY thing... You remap the key for some reason?
(Score: 3, Informative) by cereal_burpist on Saturday March 30 2024, @03:54AM
(Score: 2) by janrinok on Friday March 29 2024, @07:54PM (1 child)
(Score: 2) by janrinok on Friday March 29 2024, @07:55PM
(Score: 2) by janrinok on Friday March 29 2024, @08:02PM