Everything You Know about Latency is Wrong

posted by martyb on Wednesday December 16 2015, @06:23AM
from the it-all-adds-up dept.

Okay, maybe not everything you know about latency is wrong. But now that I have your attention, we can talk about why the tools and methodologies you use to measure and reason about latency are likely horribly flawed. In fact, they're not just flawed, they're probably lying to your face.

When I went to Strange Loop in September, I attended a workshop called "Understanding Latency and Application Responsiveness" by Gil Tene. Gil is the CTO of Azul Systems, which is most renowned for its C4 pauseless garbage collector and associated Zing Java runtime. While the workshop was four and a half hours long, Gil also gave a 40-minute talk called "How NOT to Measure Latency" which was basically an abbreviated, less interactive version of the workshop. If you ever get the opportunity to see Gil speak or attend his workshop, I recommend you do. At the very least, do yourself a favor and watch one of his recorded talks or find his slide decks online.

The remainder of this [linked] post is primarily a summarization of that talk. You may not get anything out of it that you wouldn't get out of the talk, but I think it can be helpful to absorb some of these ideas in written form. Plus, for my own benefit, writing about them helps solidify them in my head.


Original Submission

  • (Score: 5, Insightful) by deimios on Wednesday December 16 2015, @07:09AM

    by deimios (201) Subscriber Badge on Wednesday December 16 2015, @07:09AM (#277002) Journal

    OMG! Everything I know is a lie!!! I MUST read this article. Or not.

    • (Score: -1, Troll) by Anonymous Coward on Wednesday December 16 2015, @07:12AM

      by Anonymous Coward on Wednesday December 16 2015, @07:12AM (#277006)

      Well let's see now. If you're not willing to believe that you might be wrong, then you're just a pigheaded asshole. Good day to you, pig.

      • (Score: 5, Insightful) by Anonymous Coward on Wednesday December 16 2015, @07:38AM

        by Anonymous Coward on Wednesday December 16 2015, @07:38AM (#277014)

        I reviewed TFA.

        It appears they are talking about latency in web applications, where you make dozens if not hundreds of requests to external web servers. TFA argues that a typical page load will encounter at least one degenerate object load.

        Patient: It hurts when I load hundreds of objects from a dozen web-servers!
        Doctor: Stop doing that!

        • (Score: 5, Informative) by FatPhil on Wednesday December 16 2015, @02:33PM

          by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Wednesday December 16 2015, @02:33PM (#277092) Homepage
          Exactly. If a user will be annoyed by whichever is the worst of 100 things, then you should measure the worst of those hundred; all the faster ones are irrelevant. This is common sense, but apparently not well known. But as you say, you do need to ask yourself why you're doing 100 different things in order to handle one request.

          However, it's still a worthwhile talk, because it emphasises how easy it is to lie with badly collected samples. And if your test engine is lying to you, then *you must ignore it, no matter what it says* - it can make *worse* look *better*. It also emphasises that you must measure the right thing. Any test engine which tries to issue 100 requests per second, but stops issuing requests the moment the server locks up, or fails to count the requests it didn't even try to send, is *worse than useless*. Latency should be measured between *wanting to make the request* (wanting to go into the coffee shop, even though the queue is out of the door) and the response arriving (getting your coffee). (Which means that if you are forced to not make the request, the time is effectively infinite - put that in your mean calculation!)
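
          To make that concrete, here's a minimal plain-Java sketch (all names and numbers are mine, not from the talk) of a generator that measures against a fixed schedule, from the intended send time rather than the actual one:

          import java.util.concurrent.TimeUnit;

          public class ScheduledLoadGen {
              // Hypothetical stand-in for the real request/response round trip.
              static void sendAndAwaitResponse() throws InterruptedException {
                  TimeUnit.MILLISECONDS.sleep(5);   // pretend server work
              }

              public static void main(String[] args) throws InterruptedException {
                  long interval = TimeUnit.MILLISECONDS.toNanos(10);  // plan: 100 requests/second
                  long start = System.nanoTime();
                  for (int i = 0; i < 1000; i++) {
                      long intendedSend = start + i * interval;       // the fixed schedule
                      long wait = intendedSend - System.nanoTime();
                      if (wait > 0) TimeUnit.NANOSECONDS.sleep(wait);
                      sendAndAwaitResponse();                         // blocks until the reply
                      // Measure from the INTENDED send time: a locked-up server gets
                      // charged for the whole queue-out-of-the-door wait.
                      long latencyMs = (System.nanoTime() - intendedSend) / 1_000_000;
                      System.out.println(latencyMs + " ms");
                  }
              }
          }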

          ${DAYJOB} publishes a test engine as well as the server; I should now go and check that our test engine isn't braindead. After seeing the talk, even though there was nothing new or unobvious in it, I'm tempted to completely revisit that engine (I didn't write the original), and to ensure that the stats gathering in the server itself (which I recently rewrote to be "more accurate") actually makes sense. Fortunately we're in a "1 request, 1 response" scenario, so we don't have the "slowest of 100" problem.
          --
          Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
          • (Score: 0) by Anonymous Coward on Wednesday December 16 2015, @03:32PM

            by Anonymous Coward on Wednesday December 16 2015, @03:32PM (#277137)

            you do need to ask yourself why you're doing 100 different things in order to handle one request

            Exactly, this is what is wrong with the current web.

            • (Score: 0) by Anonymous Coward on Wednesday December 16 2015, @04:43PM

              by Anonymous Coward on Wednesday December 16 2015, @04:43PM (#277188)

              In general, yes. It's improved with 3rd-party content, and JavaScript, disabled.
              However, I dislike the multi-icon bitmap, sliced and diced at the client end, as a method of avoiding multiple image loads. Set your caching information correctly, so that once an image is loaded it never gets loaded again. Better - don't fill your webpage with lots of graphical fluff. Can I have just the facts, please, ma'am?

              • (Score: 0) by Anonymous Coward on Wednesday December 16 2015, @06:15PM

                by Anonymous Coward on Wednesday December 16 2015, @06:15PM (#277228)

                > However, I dislike the multi-icon bitmap, sliced and diced at the client end, as a method of avoiding multiple image loads.

                Since you chose not to explain why you think that: Cool story bro!

                • (Score: 2) by FatPhil on Wednesday December 23 2015, @10:30AM

                  by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Wednesday December 23 2015, @10:30AM (#280154) Homepage
                  Presumably it's because it breaks the logic of what an image is. Each image is an atomic thing, logically - it's not a peephole into a larger thing. Sticking them all together and then using peepholes breaks that logic.
                  --
                  Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
    • (Score: 2) by bradley13 on Wednesday December 16 2015, @07:37AM

      by bradley13 (3053) Subscriber Badge on Wednesday December 16 2015, @07:37AM (#277013) Homepage Journal

      I just skimmed TFA, and it really is worth looking at.

      Just one example: lots and lots of testing code is written as a loop like this:

      for (a zillion tests) {
          start_time = current_time
          run test
          latency = current_time - start_time
      }

      Suppose your system is processing 1000 tests/second, freezes for 10 seconds, and then resumes at 1000/second. Those 10 seconds should impact 10,000 tests. But the way this code is written, the freeze will only impact one single test, because your testing code is frozen along with the system. (Gil calls this "coordinated omission".) Since only one test out of zillions is affected, it gets disregarded as noise, and your system gets falsely certified as able to process 1000 transactions/second.
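
      One common correction - it's the approach Gil's own HdrHistogram library takes with recordValueWithExpectedInterval(), though the plain-Java toy below is just my own illustration - is to back-fill the samples the frozen loop never issued:

      import java.util.ArrayList;
      import java.util.List;

      public class BackfillDemo {
          public static void main(String[] args) {
              long expectedIntervalNs = 1_000_000L;        // plan: 1000 tests/second
              List<Long> latencies = new ArrayList<>();
              long measured = 10_000_000_000L;             // the one observed 10-second stall
              latencies.add(measured);
              // Every test that *should* have run during the stall would have
              // started one interval later and stalled correspondingly less.
              for (long missing = measured - expectedIntervalNs;
                   missing >= expectedIntervalNs;
                   missing -= expectedIntervalNs) {
                  latencies.add(missing);
              }
              // ~10,000 samples instead of 1: the freeze now shows up in the percentiles.
              System.out.println(latencies.size() + " samples recorded");
          }
      }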

      --
      Everyone is somebody else's weirdo.
      • (Score: 2) by frojack on Wednesday December 16 2015, @09:05AM

        by frojack (1554) Subscriber Badge on Wednesday December 16 2015, @09:05AM (#277019) Journal

        Well, that still makes the headline wrong.

        If you don't know about the freeze (and at some level all code has freezes, waiting on one thing or another), then measuring at a different level of detail simply gives you information at a different level of detail. For some purposes that's just fine. For some purposes that is what you actually want, because you can't always control everything. Knowing about every millisecond and where it comes from doesn't help you if you can't avoid it.

        In general, we know about latency at the level where we need to control it - the level we can actually influence.

        --
        No, you are mistaken. I've always had this sig.
      • (Score: 2) by TheRaven on Wednesday December 16 2015, @09:51AM

        by TheRaven (270) on Wednesday December 16 2015, @09:51AM (#277026) Journal
        That isn't wrong, but it's measuring something different. Not having read TFA, it sounds like what he's actually talking about is tail latency. If you're doing everything with distributed systems, this is what you measure because it's the thing that affects overall system throughput. The canonical example of this (which spawned a load of poorly thought out papers on ugly hacks to Java VMs in various distributed systems conferences) is the issue that Twitter had with GC pauses: they would send requests to a load of machines, then wait for the response before being able to reply to the end user. Their response speed was limited by the slowest of the workers to respond. If one of them hit a GC pause at the time of the request, this could be a really long time. With enough machines, the probability of at least one of them hitting a GC pause approached 1.
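
        The arithmetic behind that last sentence, with made-up numbers: if each worker independently has a 1% chance of being mid-pause when the fan-out request lands, then at least one worker stalls the whole reply with probability 1 - 0.99^n:

        public class FanOutStall {
            public static void main(String[] args) {
                double p = 0.01;  // assumed per-worker chance of a GC pause
                for (int n : new int[]{1, 10, 100, 500}) {
                    System.out.printf("n=%d workers: P(at least one pause) = %.3f%n",
                                      n, 1 - Math.pow(1 - p, n));
                }
                // prints ~0.010, 0.096, 0.634, 0.993 - with enough machines it approaches 1
            }
        }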
        --
        sudo mod me up
    • (Score: 3, Funny) by DeathMonkey on Wednesday December 16 2015, @07:46PM

      by DeathMonkey (1380) on Wednesday December 16 2015, @07:46PM (#277266) Journal

      OMG! Everything I know is a lie!!! I MUST read this article. Or not.
       
      We have articles now?

  • (Score: 0) by Anonymous Coward on Wednesday December 16 2015, @07:10AM

    by Anonymous Coward on Wednesday December 16 2015, @07:10AM (#277004)

    "IT SLOW!" they scream.

    That's all, folks. That's everything you will ever know about latency. It's everything anyone ever WANTS to know about latency.

    Slow. Slow! SLOW!

  • (Score: 0) by Anonymous Coward on Wednesday December 16 2015, @09:40AM

    by Anonymous Coward on Wednesday December 16 2015, @09:40AM (#277024)

    i wish someone would do a study about the latency when i'm in the car and on the road ...

  • (Score: -1, Troll) by Anonymous Coward on Wednesday December 16 2015, @12:06PM

    by Anonymous Coward on Wednesday December 16 2015, @12:06PM (#277042)

    Because it mentions that one of the causes of latency spikes is garbage collection, and we're told by the high priests of Java that GC never causes any slowdowns, ever.

  • (Score: 1, Funny) by Anonymous Coward on Wednesday December 16 2015, @01:28PM

    by Anonymous Coward on Wednesday December 16 2015, @01:28PM (#277069)

    headline: Everything You Know about Latency is Wrong
    first sentence: Okay, maybe not everything you know about latency is wrong.

    excuse me while i draw dicks all over this garbage
    8====D~~~
    8====D~~~
    8====D~~~
    8====D~~~

  • (Score: 5, Informative) by kurenai.tsubasa on Wednesday December 16 2015, @02:17PM

    by kurenai.tsubasa (5227) on Wednesday December 16 2015, @02:17PM (#277083) Journal

    Went to TFA. I feel like I've just read Timecube but with C# snippets and graphs.

    I think I get it! When your apping app requests 1 page view, 2 major page view axes are created are the opposite sides of the internet. Where the 2 major page view forces join, synergy creates 2 new minor page view points we recognize as median and mean. The 4-equidistant page view points can be considered as Latency Square imprinted upon the circle of Internet. In a single rotation of the Internet cloud, each Page View corner point rotates through the other 3-view Page points, thus creating 16 corners, 96 HTTP GET requests and 4-simultaneous 24-second Lags within a single rotation of Internet – equated to a Higher Order of 99th Percentile Latency Cube!

    • (Score: 0) by Anonymous Coward on Wednesday December 16 2015, @03:59PM

      by Anonymous Coward on Wednesday December 16 2015, @03:59PM (#277158)

      If you alter the page on your end, does it instantly alter via entanglement the page view on the other side of the internet?

    • (Score: 2) by Hyperturtle on Wednesday December 16 2015, @11:05PM

      by Hyperturtle (2824) on Wednesday December 16 2015, @11:05PM (#277372)

      You're already modded high but I wanted to say that a) you referenced the time cube and b) +1

  • (Score: 2) by Fnord666 on Wednesday December 16 2015, @03:23PM

    by Fnord666 (652) on Wednesday December 16 2015, @03:23PM (#277128) Homepage
    From TFA:

    The median is the number that 99.9999999999% of response times will be worse than. This is why median latency is irrelevant.

    Wait, what?

    • (Score: 0) by Anonymous Coward on Wednesday December 16 2015, @05:08PM

      by Anonymous Coward on Wednesday December 16 2015, @05:08PM (#277204)

      you missed the preamble.

      IF a request is considered satisfied when the final result is back from 40 independent sub-requests whose latencies all share the same distribution, then the median is ... (take the log2 of (1 - his number); that will tell you what the "40" really is. I eyeballed it.)
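
      Spelling out that arithmetic (the numbers are mine, and the 40 is eyeballed too):

      public class MedianCheck {
          public static void main(String[] args) {
              // Chance that ALL 40 independent sub-requests beat their own median:
              double allFast = Math.pow(0.5, 40);                 // ~9.1e-13
              System.out.println(1 - allFast);                    // ~0.999999999999
              // Recovering the "40" from his number: log2(1 - 0.999999999999)
              System.out.println(Math.log(1e-12) / Math.log(2));  // ~-39.9
          }
      }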

      However, this is equivocation - his "request" is both the outer request and the inner request.

      He also gets a bit anti-means. His claim that a mean of a percentile is meaningless is no more valid than saying that the mean of anything is meaningless. It's not a percentile, it's an approximation to a likely expected percentile, in the same way that the average amount of drink consumed on a booze cruise will be the amount of drink consumed by someone on a booze cruise. Averages are just averages, get over it; they are neither meaningless nor wrong.

      FatPhil (AC as browsing on his phone, while incidentally in the karaoke of a booze cruise ship surrounded by drunk Finns - or "Finns" as I like to call them, our four-legged northern neighbours.)

      • (Score: 0) by Anonymous Coward on Wednesday December 16 2015, @05:39PM

        by Anonymous Coward on Wednesday December 16 2015, @05:39PM (#277218)

        Ooops, karaoke is doing my nut in... s/will be the amount/will *not* be the amount/

        FFFFFUUUUUU!!!!!!!! now in the cafe surrounded by noisy little crack monkeys...

  • (Score: 3, Informative) by Snotnose on Wednesday December 16 2015, @03:37PM

    by Snotnose (1623) on Wednesday December 16 2015, @03:37PM (#277143)

    I've spent half my career working with Real Time Systems. You know, the ones where if you miss an interrupt Bad Things (tm) happen (for example, your stepper motor is forever more 1 step behind). TFA is essentially saying that you can't throw out the worst case reading. In my experience, it's the worst case reading that you care about. You need to figure out why it happens and how to ensure it will never be a problem in your system.

    I'm reminded of the first (and last) time I had to use Microsoft's Real Time OS, Windows CE, in about 2000. Called Microsoft to get the maximum interrupt latency. They couldn't guarantee one. Tried to get the max stack usage for a process. They couldn't guarantee one. Tried to get the maximum RAM usage of a process. They couldn't guarantee one. Tried to get the maximum amount of time it would take for a task switch. They couldn't guarantee one. In fact, for every basic requirement of an RTOS, Microsoft couldn't guarantee a value. Rather ironic that they marketed the OS as WinCE. Needless to say, maybe 1/3 of the way through the project we realized WinCE was killing us, and the project got cancelled.

    --
    Relationship status: Available for curbside pickup.
    • (Score: 1) by ThePhilips on Wednesday December 16 2015, @05:13PM

      by ThePhilips (5677) on Wednesday December 16 2015, @05:13PM (#277208)

      TFA is essentially saying that you can't throw out the worst case reading. In my experience, it's the worst case reading that you care about.

      The same experience here. Though I have another experience too: not only does the typical developer not care about the worst case, but customers don't care about it either. Customers often want a nice benchmark number to reaffirm their buying decision.

      I personally (having lots of background in network programming) prefer to think about latencies in terms of "depth of the queue". The deeper your queues - the longer items stay in the queue - the higher the potential/worst-case latencies.

      Java's generic GC is a perfect fit for this model: suck up gigabytes of RAM, then spend 30-90 seconds laundering it.

      Context-switch latencies are the same too: though task-off + task-on times are fairly short and nearly constant, the length of the queue of tasks waiting for a free CPU to run on is the source of the variable latencies. The more tasks - the longer the queue - the higher the latencies. (That's why the old RT axiom: 1 RT task = real time, 2 or more RT tasks = that's not real time anymore.) (For fun, under Windows, enable the "Thread Count" column in the Task Manager - and marvel at the counts. Decades of unfixed crappy async APIs do that to your applications and your system.)

      The deep queues also explain the wavy latency curves: the queues swing gradually between full (high latency) and empty (low latency) states.
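
      A toy model of that intuition (numbers made up): with a single consumer draining a FIFO queue, an item that arrives behind d queued items waits roughly d service times before its own service even starts.

      public class QueueDepthLatency {
          public static void main(String[] args) {
              double serviceMs = 2.0;  // assumed per-item service time
              for (int depth : new int[]{0, 10, 100, 1000}) {
                  // wait behind everything already queued, plus own service time
                  double latencyMs = depth * serviceMs + serviceMs;
                  System.out.printf("queue depth %4d -> ~%.0f ms%n", depth, latencyMs);
              }
          }
      }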

  • (Score: 0) by Anonymous Coward on Wednesday December 16 2015, @09:37PM

    by Anonymous Coward on Wednesday December 16 2015, @09:37PM (#277316)

    Everything You Know about Latency is Wrong

    Hopefully, the people who did have more of a clue than I do.

  • (Score: 2) by kaganar on Wednesday December 16 2015, @11:01PM

    by kaganar (605) on Wednesday December 16 2015, @11:01PM (#277369)

    No, I didn't watch TFV (which seems concerned with layer-7 "latency", not layer 3 or below), but here's a common latency issue that people don't seem to think about, yet I encounter it all the time, in different situations at various locations:

    A path through a network seems slow -- everything you do is slow, there are even some timeouts for services you're using -- but you see this:

    C:\Windows\Sucks>ping -t xxx.xxx.xxx.xxx
    Pinging someserver.blah [xxx.xxx.xxx.xxx] with 32 bytes of data:
    Reply from xxx.xxx.xxx.xxx: bytes=32 time=16ms TTL=44
    Reply from xxx.xxx.xxx.xxx: bytes=32 time=17ms TTL=44
    Reply from xxx.xxx.xxx.xxx: bytes=32 time=17ms TTL=44
    ...

    And then for shits you try this out (same thing, but a packet size with a more representative payload length):

    C:\Windows\Sucks>ping -t -l 2048 xxx.xxx.xxx.xxx
    Pinging someserver.blah [xxx.xxx.xxx.xxx] with 2048 bytes of data:
    Reply from xxx.xxx.xxx.xxx: bytes=2048 time=1924ms TTL=44
    Request timed out
    Reply from xxx.xxx.xxx.xxx: bytes=2048 time=2319ms TTL=44
    ...

    Just take one guess which of the two your ISP's technical support rep is going to run. I'm not an expert, but this seems to make some intuitive sense: if you're going to have bit-flipping problems because of line noise, what's more likely to get through unscathed without needing retransmission: long messages or short ones? Once I explain that, I usually get a bit more cooperation than "Huh, must be your computer. Anything else I can help you with?"