
posted by janrinok on Sunday July 21 2019, @05:22AM
from the I-have-felt-this-pain dept.

I've had some occasions of late to peer through the looking glass into a world that I hadn't seen much of previously. Specifically, I'm talking about the world of so-called "cloud" stuff, where you basically pay someone else to build and run stuff for you, instead of doing it yourself.

I'll skip the analysis of build vs. buy and just jump straight to the point where you've chosen "buy". Then you've had a whole bunch of fun outages caused by something going wrong with their services. Finally, you reach the point of a sit-down talk with the vendor to figure things out. Maybe they send some sales people too, or perhaps it's just engineers. You talk for a while, and before long, you realize what happened.

[...] This becomes obvious when talking about some problem you experienced at the hands of their system. The whole time, their dashboard stayed green because from their point of view, they had tremendous availability. We're talking 99.999% here! Totally legit!

Meanwhile, you were having a really bad day. Nothing was working. Your business was in shambles. Your customers were at your throat yelling for action, and all you could do was point at the vendor. What happened?

Well, this is the point where you find out that their "99.999%" availability is for their entire system. They see that, and they're good. It's not a problem! Everything is fine.

This also completely misses the fact that for you, everything was failing. It doesn't matter though, since your worst day still won't move the needle on their fail-o-meter. They won't see you. They won't have any idea anything even happened until you complain weeks later. You are the bug on the windscreen of the locomotive. The train has no idea you were ever there.

The problem is that they weren't monitoring from the customer's perspective. Had they done that, it would have been clear that oodles of requests from some subset of customers were failing. They would have also realized that certain customers had all of their requests failing. For those customers, there were no nines to be had that day.

Seriously, if you have a multi-tenant system, you owe it to your customers to monitor it from their point of view. Otherwise, how can you possibly know when you've done something that'll leave them in the cold?
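For concreteness, here is a minimal sketch of what monitoring from the customer's point of view could look like. The tenant names, the request-log shape, and the alert threshold are invented for illustration, not taken from the article:

    from collections import defaultdict

    # Roll up request outcomes per tenant instead of fleet-wide, so a customer
    # whose requests are all failing shows up even when the global number looks fine.
    def per_tenant_availability(request_log):
        """request_log: iterable of (tenant_id, succeeded) pairs."""
        counts = defaultdict(lambda: [0, 0])  # tenant_id -> [ok, total]
        for tenant_id, succeeded in request_log:
            counts[tenant_id][1] += 1
            counts[tenant_id][0] += 1 if succeeded else 0
        return {tenant: ok / total for tenant, (ok, total) in counts.items()}

    log = [("acme", True), ("acme", True), ("acme", True),
           ("initech", False), ("initech", False)]  # fleet-wide: 60% success
    for tenant, availability in sorted(per_tenant_availability(log).items()):
        if availability < 0.999:  # assumed alert threshold
            print(f"ALERT: {tenant} at {availability:.1%} availability")

The point is that the alert keys off the worst tenant, not the global ratio: the fleet-wide dashboard can stay green while one customer's availability is zero.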


Original Submission

 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2, Informative) by hwertz on Sunday July 21 2019, @04:35PM (1 child)

    by hwertz (8141) on Sunday July 21 2019, @04:35PM (#869662)

    I've never seen this... With every shared hosting or "cloud" provider I've seen, the 4 or 5 "9s" apply to your instance. With that said... if the machine stays up but whatever you set up on there keeps crashing out, well, then it's like you say: you have downtime but they don't.

    To maintain uptime, they do expect your services to be set up so they can be yanked off one machine and moved onto another (since an individual computer can of course fail). This is complicated to do right, so many people don't do it right -- they put whatever into a cloud provider's systems, then act like it's the cloud provider's fault when migrating the stuff to another computer fails. This isn't like an old IBM mainframe where your processes are just seamlessly moved to another system -- they expect you to handle having your processes shut down (maybe not even cleanly), started up somewhere else, and then (hopefully) fed the same transaction info if they were in the middle of performing a transaction... and you'd better not let that transaction go through twice! This is very complicated, but if you don't get it right it's considered the user's problem, not the cloud provider's.
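    One common way to keep a replayed transaction from going through twice is an idempotency key. The comment doesn't name a mechanism, so treat this as a hedged sketch only; SQLite stands in here for whatever durable, replica-shared store a real deployment would use:

        import sqlite3

        # Record the idempotency key and apply the effect in one database
        # transaction, so a replay after an unclean shutdown becomes a no-op.
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE applied (idempotency_key TEXT PRIMARY KEY)")
        db.execute("CREATE TABLE ledger (idempotency_key TEXT, amount INTEGER)")

        def apply_transaction(key, amount):
            try:
                with db:  # commits on success, rolls back on any exception
                    db.execute("INSERT INTO applied VALUES (?)", (key,))
                    db.execute("INSERT INTO ledger VALUES (?, ?)", (key, amount))
                return "applied"
            except sqlite3.IntegrityError:
                return "duplicate ignored"  # same key seen before: do nothing

        print(apply_transaction("txn-42", 100))  # applied
        print(apply_transaction("txn-42", 100))  # duplicate ignored (the replay)

    The process picks the key once per transaction, so however many times the work is shut down and retried elsewhere, the effect lands at most once.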

    What IS the cloud provider's fault?

    In my view,
    1) It'd be way nicer if things could be moved more seamlessly. Current PC technology does finally allow for replicating what IBM did in the 1980s on those dinosaurs, just moving the whole VM over, but I don't know if any cloud provider actually does it. Honestly, some of the existing setups are simply overcomplicated.

    2) If you look at some of these storage and database type services, the service level agreement might promise 99.999% availability but set no limit on how long the service takes to respond (realistically, some of these services will "usually" take 1/10th of a second or whatever, but sometimes take 20 or 30 seconds PER TRANSACTION instead, and apparently on a daily basis, not on some "whoops, we almost crashed our cloud this one time" basis). And in some cases it's considered "not a failure" for it to FAIL 2 or 3 times in a row, as long as it works on the 4th attempt. That is simply crap.
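    To put a rough number on how forgiving that kind of SLA is, here is a back-of-the-envelope check; the 20% per-attempt failure rate is invented purely for illustration:

        # How a "it only counts as a failure if N attempts in a row fail" rule
        # inflates the headline availability number. Illustrative numbers only.
        per_attempt_failure = 0.20  # assume 1 in 5 attempts fails outright

        for attempts_allowed in (1, 2, 3, 4):
            counted_failure = per_attempt_failure ** attempts_allowed
            print(f"{attempts_allowed} attempt(s): "
                  f"reported availability {1 - counted_failure:.3%}")

        # 1 attempt(s): reported availability 80.000%
        # 4 attempt(s): reported availability 99.840%
        # A service that fails one call in five can still report ~99.8% if the
        # SLA lets it fail three times before the fourth try has to work.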

  • (Score: 0) by Anonymous Coward on Monday July 22 2019, @12:00AM

    by Anonymous Coward on Monday July 22 2019, @12:00AM (#869771)

    5 9's is a marketing term.

    You can have a service that has hundreds of moving parts (as we called them when I did this). One of them is down. Is that a down situation? Does that count against the 9's? It is a method call that one guy uses once a month, and he did not call it during that time.

    What you really want to know is what sort of transaction rates they handle. What is their turnaround time per method? What is their failure rate per method? Etc., etc.
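    As a hedged illustration of those per-method numbers (the method names and timings below are made up), the bookkeeping is basically a nearest-rank percentile and an error ratio for each method:

        import math
        from collections import defaultdict

        samples = defaultdict(list)  # method name -> list of (seconds, ok) pairs

        def record(method, seconds, ok):
            samples[method].append((seconds, ok))

        def report():
            for method, calls in samples.items():
                times = sorted(t for t, _ in calls)
                p99 = times[math.ceil(0.99 * len(times)) - 1]  # nearest-rank p99
                failures = sum(1 for _, ok in calls if not ok) / len(calls)
                print(f"{method}: p99 {p99:.2f}s, failure rate {failures:.1%}")

        record("GetObject", 0.09, True)
        record("GetObject", 25.0, True)  # "usually" fast, sometimes 25 seconds
        record("PutObject", 0.12, False)
        report()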

    What I said above is for *that* particular type of service (SaaS). Oh, and more than likely they do not know those numbers. Then they point at the other type of uptime as *the* metric.

    If you are just chucking your boxes out in the 'cloud', you have a whole different set of requirements: box uptime, network latency, storage latency. Basically the same sort of thing as everything in the stat counters in your favorite OS, which can be overwhelming for anyone.

    Then add to that your own services. Are the H-1B guys you and they hired any good? I have worked with some wicked smart ones, and some where I wonder how they manage to velcro their shoes in the morning.

    Then on top of those, what are their maintenance windows (they all have them)? What is the effect on you during those times? That can look like anything from reduced service, to not working at all, to no impact, depending on what infrastructure is being upgraded. It *will* vary.

    Reliability is not something you can usually farm out. You build it.

    Hell, the place I work has its own 'cloud'. Just last week they upgraded all of our machines, both production and failover. They broke them all. Luckily it was an easy fix. But what is your bring-up-from-scratch procedure like? If you farm it out, you take on their oddities and their downtimes. They sell you 99.999 but the reality is more like 95% on just about everything, just like when you owned the hardware. You build reliability yourself and expect the whole thing to disappear on you at any moment.

    "just moving the whole VM over"

    vSphere, which many of them are built on, has had that for years. It works OK, but it is *not* seamless. There will be an interruption of service at some point. The big ones have their own way of doing it.