posted by hubie on Saturday July 27 2024, @06:05PM   Printer-friendly

https://blog.mattstuchlik.com/2024/07/21/fastest-memory-read.html

Summing ASCII Encoded Integers on Haswell at the Speed of memcpy turned out more popular than I expected, which inspired me to take on another challenge on HighLoad: Counting uint8s. I'm currently only #13 on the leaderboard, ~7% behind #1, but I've already learned some interesting things. In this post I'll describe my complete solution, including a surprising memory read pattern that achieves up to ~30% higher transfer rates on fully memory-bound, single-core workloads compared to naive sequential access, while apparently not being widely known.

As before, the program is tuned to the input spec and for the HighLoad system: Intel Xeon E3-1271 v3 @ 3.60GHz, 512MB RAM, Ubuntu 20.04. It only uses AVX2, no AVX512.

The Challenge

"Print the number of bytes whose value equals 127 in a 250MB stream of bytes uniformly sampled from [0, 255] sent to standard input."

Nothing much to it!



This discussion was created by hubie (1068) for logged-in users only, but now has been archived. No new comments can be posted.
  • (Score: 0) by Anonymous Coward on Saturday July 27 2024, @06:53PM (6 children)

    by Anonymous Coward on Saturday July 27 2024, @06:53PM (#1365930)

    Whatever happened to parallel ports? Why is serial all the rage?

    • (Score: 0) by Anonymous Coward on Saturday July 27 2024, @07:08PM

      by Anonymous Coward on Saturday July 27 2024, @07:08PM (#1365933)

      Copper prices went up.

    • (Score: 0) by Anonymous Coward on Saturday July 27 2024, @07:13PM

      by Anonymous Coward on Saturday July 27 2024, @07:13PM (#1365934)

      Cheaper to produce.

      Imagine the cost of 8x SATA controllers bundled together just to connect one drive. A 16-port SATA controller costs something like $500, new. The south bridge (or is it the north bridge?) typically has 4-6 ports total, and those might even be port-multiplied.

      Also, everyone is all about "thin" lately. Imagine 8 SATA ports in a portable computer nowadays, just to connect one drive.

    • (Score: 2, Insightful) by anubi on Sunday July 28 2024, @12:23AM

      by anubi (2828) on Sunday July 28 2024, @12:23AM (#1365950) Journal

      "Why is serial all the rage?"

      Connections are by far the biggest points of failure, as they are subject to real-world physical abuse and contamination. They are costly to implement. And you only get so many pins on a package...routing to and from them requires yet more precious PCB real estate. Physical size costs more money and decreases the functionality/(size, weight) ratio.

      --
      "Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
    • (Score: 5, Interesting) by tekk on Sunday July 28 2024, @12:28AM

      by tekk (5704) Subscriber Badge on Sunday July 28 2024, @12:28AM (#1365951)

      In addition to what other people said: lengths.

      When buses are clocked as fast as they are these days, it's very hard, at scale, to get the wires exactly the same length so that you don't miss the timing window for the signal.

    • (Score: 4, Informative) by Unixnut on Sunday July 28 2024, @09:38AM

      by Unixnut (5779) on Sunday July 28 2024, @09:38AM (#1365980)

      The faster you clock a parallel interface the harder it is to:

      1. Make sure all the signals arrive at the right time to constitute valid data (skew). The faster you clock an interface, the smaller the window in which all the signals must arrive to correctly represent the data.
      2. Keep RFI under control: each wire is an antenna, and running them in parallel next to each other means they interfere with one another (known as crosstalk [wikipedia.org]).

        The higher your clock speed the broader the RFI, which not only affects the signals in the cable, but can radiate out and interfere with other components (or other equipment if your machine case is not a good Faraday cage).

        This is one of the reasons they had to move from 40-wire to 80-wire IDE cables ("80-pin" is a misnomer: both cables had 40-pin connectors, but one had 80 wires and the other 40). Once ATA transfer rates increased beyond Ultra DMA/33 (33 MB/s), crosstalk became a big problem, so the 80-wire cables interleaved ground wires between the data signals to reduce interference enough to clock the bus higher reliably.

      3. Keep costs down: each signal needs its own copper wire, with its own manufacturing and tolerances, etc. Nowadays silicon is cheaper to manufacture.

      Due to the issues above, it became harder and harder to increase the performance of parallel interfaces. The way to increase performance back then was to add more signal wires rather than clock things higher, hence ever wider buses, as was done with SCSI: the first SCSI bus was 8 bits wide (retroactively named "Narrow SCSI"), then came 16-bit ("Wide SCSI") and 32-bit ("Ultra Wide SCSI") variants. Even then it was clear that you can't just keep adding wires to improve performance; at some point things would get silly.

      Serial interfaces largely avoid crosstalk, they have lower RFI (which can be reduced further with differential signalling and good cable shielding), they are cheaper to produce (fewer wires needed), and they improve cooling inside machines (less cabling to block airflow). However, silicon manufacturing had not yet reached the point where high-clock-speed serial interfaces were cheap and good enough to mass produce.

      Once the point was reached where we could design a serial interface with a high enough clock speed to match (or exceed) parallel interface performance for an acceptable cost, the writing was on the wall.

      The venerable parallel port was the first to go, replaced by USB (RS-232 was replaced at the same time, though it lingers on in the embedded, automation and manufacturing worlds), and now pretty much every interface in a computer is serial. In fact, I can't think of a single parallel interface on a modern PC; everything is serial, including CPU-to-CPU communication buses.

      Some motherboards come with a legacy parallel port as a set of pin headers, but that is it (and I doubt any of the boards made post-2020 do even that). Still, I miss the parallel port; it was excellent for quick, simple circuits to interface with the PC for little hacks. Nowadays I must use a microcontroller with a USB interface for the same experience, which is more complex to program.

    • (Score: 0) by Anonymous Coward on Sunday July 28 2024, @04:22PM

      by Anonymous Coward on Sunday July 28 2024, @04:22PM (#1366009)

      Parallel is only faster on paper. The more you increase transmission speed, the harder it is to keep all those lines in sync.

  • (Score: 0) by Anonymous Coward on Saturday July 27 2024, @08:20PM

    by Anonymous Coward on Saturday July 27 2024, @08:20PM (#1365939)

    Somebody needs to make cshift faster - it SUCKS ASS.
