
posted by martyb on Wednesday December 16 2015, @06:23AM   Printer-friendly
from the it-all-adds-up dept.

Okay, maybe not everything you know about latency is wrong. But now that I have your attention, we can talk about why the tools and methodologies you use to measure and reason about latency are likely horribly flawed. In fact, they're not just flawed, they're probably lying to your face.

When I went to Strange Loop in September, I attended a workshop called "Understanding Latency and Application Responsiveness" by Gil Tene. Gil is the CTO of Azul Systems, which is most renowned for its C4 pauseless garbage collector and associated Zing Java runtime. While the workshop was four and a half hours long, Gil also gave a 40-minute talk called "How NOT to Measure Latency" which was basically an abbreviated, less interactive version of the workshop. If you ever get the opportunity to see Gil speak or attend his workshop, I recommend you do. At the very least, do yourself a favor and watch one of his recorded talks or find his slide decks online.

The remainder of this [linked] post is primarily a summary of that talk. You may not get anything out of it that you wouldn't get out of the talk, but I think it can be helpful to absorb some of these ideas in written form. Plus, for my own benefit, writing about them helps solidify them in my head.


Original Submission

 
  • (Score: 3, Informative) by Snotnose on Wednesday December 16 2015, @03:37PM

    by Snotnose (1623) on Wednesday December 16 2015, @03:37PM (#277143)

    I've spent half my career working with Real Time Systems. You know, the ones where if you miss an interrupt Bad Things (tm) happen (for example, your stepper motor is forever more 1 step behind). TFA is essentially saying that you can't throw out the worst case reading. In my experience, it's the worst case reading that you care about. You need to figure out why it happens and how to ensure it will never be a problem in your system.
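
    To make that concrete, here's a minimal Java sketch (my illustration, not from TFA or the comment; the sample values are invented) of how an average or even a 99th percentile can look perfectly healthy while the single worst-case reading is the one that blows a real-time deadline:

        import java.util.Arrays;

        public class WorstCase {
            public static void main(String[] args) {
                // 10,000 interrupt-latency samples in microseconds: almost all nominal,
                // plus one stray outlier of the kind a real-time system cannot ignore.
                long[] samplesUs = new long[10_000];
                Arrays.fill(samplesUs, 50);      // typical latency: 50 us
                samplesUs[7_000] = 12_000;       // one 12 ms hiccup

                long[] sorted = samplesUs.clone();
                Arrays.sort(sorted);

                double avg = Arrays.stream(samplesUs).average().orElse(0);
                long p99 = sorted[(int) (sorted.length * 0.99) - 1];
                long max = sorted[sorted.length - 1];

                // Prints: avg=51.2 us  p99=50 us  max=12000 us
                // Against a 1 ms deadline, only the max reveals the missed interrupt.
                System.out.printf("avg=%.1f us  p99=%d us  max=%d us%n", avg, p99, max);
            }
        }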

    I'm reminded of the first (and last) time I had to use Microsoft's Real Time OS, Windows CE, about 2000. Called Microsoft to get the maximum interrupt latency. They couldn't guarantee one. Tried to get the max stack usage for a process. They couldn't guarantee one. Tried to get the maximum RAM usage of a process. They couldn't guarantee one. Tried to get the maximum amount of time it would take for a task switch. They couldn't guarantee one. In fact, for every basic requirement of an RTOS Microsoft couldn't guarantee a value. Rather ironic they marketed the OS as WinCE. Needless to say, maybe 1/3 of the way through the project we realized WinCE was killing us and the project got cancelled.

    --
    My ducks are not in a row. I don't know where some of them are, and I'm pretty sure one of them is a turkey.
  • (Score: 1) by ThePhilips on Wednesday December 16 2015, @05:13PM

    by ThePhilips (5677) on Wednesday December 16 2015, @05:13PM (#277208)

    TFA is essentially saying that you can't throw out the worst case reading. In my experience, it's the worst case reading that you care about.

    The same experience here, though I have another observation too: not only does the typical developer not care about the worst case, customers don't care about it either. Customers often want a nice benchmark number to reaffirm their buying decision.

    I personally (having a lot of background in network programming) prefer to think about latencies in terms of the "depth of the queue": the deeper your queues, the longer items sit in them, and the higher the potential/worst-case latencies.
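
    As a back-of-the-envelope illustration of that view (mine, not the commenter's; the per-item service time is an assumption), the wait an item sees behind a queue drained at a fixed rate is roughly queue depth times service time:

        public class QueueDepthLatency {
            public static void main(String[] args) {
                double serviceTimeMs = 0.2;              // assume 200 us to drain one item
                int[] queueDepths = {1, 100, 10_000};    // shallow vs. deep queues

                for (int depth : queueDepths) {
                    // An item arriving behind `depth` queued items waits for all of them.
                    double worstCaseWaitMs = depth * serviceTimeMs;
                    System.out.printf("depth=%6d  worst-case wait ~ %.1f ms%n",
                                      depth, worstCaseWaitMs);
                }
                // depth=     1  worst-case wait ~ 0.2 ms
                // depth=   100  worst-case wait ~ 20.0 ms
                // depth= 10000  worst-case wait ~ 2000.0 ms
            }
        }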

    Java's generic GC fits that picture all too well: suck up gigabytes of RAM, then spend 30-90 seconds laundering it.

    Context-switch latencies are the same story: the task-off + task-on times are fairly short and nearly constant, but the length of the queue of tasks waiting for a free CPU to run on is the source of the variable latencies. The more tasks, the longer the queue and the higher the latencies. (Hence the old RT axiom: 1 RT task = real time; 2 or more RT tasks = not real time anymore.) (For fun, enable the "Thread Count" column in the Windows Task Manager and marvel at the counts. Decades of unfixed, crappy async APIs do that to your applications and your system.)

    Deep queues also explain the wavy latency curves: the queues swing gradually between full (high-latency) and empty (low-latency) states.
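
    To see that swinging in action, here is a toy simulation (my own sketch, purely illustrative) where arrivals oscillate around a fixed drain rate; the queue depth and the per-item wait rise and fall together, producing exactly that wavy latency curve:

        import java.util.ArrayDeque;

        public class WavyLatency {
            public static void main(String[] args) {
                ArrayDeque<Integer> queue = new ArrayDeque<>();  // stores each item's arrival tick
                int drainPerTick = 10;                           // fixed service capacity

                for (int tick = 0; tick < 400; tick++) {
                    // Arrival rate swings sinusoidally around the drain rate.
                    int arrivals = (int) Math.round(10 + 8 * Math.sin(tick / 20.0));
                    for (int i = 0; i < arrivals; i++) queue.add(tick);

                    int lastWait = 0;
                    for (int i = 0; i < drainPerTick && !queue.isEmpty(); i++) {
                        lastWait = tick - queue.poll();          // time that item sat queued
                    }
                    if (tick % 40 == 0) {
                        System.out.printf("tick=%3d  depth=%4d  wait=%3d%n",
                                          tick, queue.size(), lastWait);
                    }
                }
                // Queue depth and observed wait climb while arrivals outpace the drain,
                // then fall back as the queue empties: the wavy curve in miniature.
            }
        }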