from the another-one-strikes-again-and-another-ones-down-and-another-ones-down-and-another-one-strikes-again dept.
We're updating our story about the outage with new details as we have them. Microsoft and CrowdStrike both say that "the affected update has been pulled,"
[...]
If rebooting multiple times isn't fixing your problem, Microsoft recommends restoring your systems using a backup from before 4:09 UTC on July 19 (just after midnight on Friday, Eastern time), when CrowdStrike began pushing out the buggy update. CrowdStrike says a reverted version of the file was deployed at 5:27 UTC. If these simpler fixes don't work, you may need to boot your machines into Safe Mode so you can manually delete the file that's causing the BSOD errors. For virtual machines, Microsoft recommends attaching the virtual disk to a known-working repair VM so the file can be deleted, then reattaching the virtual disk to its original VM.
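As a rough illustration of the time window described above, here is a minimal Python sketch that checks whether a backup predates the bad push, and whether a file timestamp falls between the push and the revert. The 04:09 and 05:27 UTC times come from the text; the calendar date (19 July 2024, the Friday of the outage) and both function names are assumptions of this sketch, not anything Microsoft or CrowdStrike published.

```python
from datetime import datetime, timedelta, timezone

# Times are from the summary above; the calendar date is an assumption.
BAD_PUSH_START = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
REVERT_DEPLOYED = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)


def _to_utc(moment: datetime) -> datetime:
    """Normalise to UTC, refusing naive timestamps rather than guessing."""
    if moment.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return moment.astimezone(timezone.utc)


def backup_predates_bad_update(backup_time: datetime) -> bool:
    """True if the backup was taken before the buggy update began rolling out."""
    return _to_utc(backup_time) < BAD_PUSH_START


def file_in_bad_window(modified: datetime) -> bool:
    """True if a file timestamp falls between the bad push and the revert."""
    return BAD_PUSH_START <= _to_utc(modified) < REVERT_DEPLOYED
```

Timestamps are compared in UTC on purpose: "just after midnight Eastern" and "04:09 UTC" are the same instant, and mixing the two is an easy way to restore the wrong backup.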
[...]
Before you can delete the file on those systems, you'll need the recovery key that unlocks those encrypted disks and makes them readable (normally, this process is invisible, because the system can just read the key stored in a physical or virtual TPM module). This can cause problems for admins who aren't using key management to store their recovery keys, since (by design!) you can't access a drive without its recovery key. Cryptography and infrastructure engineer Tony Arcieri on Mastodon compared being without that key to a "self-inflicted ransomware attack," where an attacker encrypts the disks on your systems and withholds the key until they get paid.
And even if you do have a recovery key, your key management server might also be affected by the CrowdStrike bug.
(Score: 5, Insightful) by captain normal on Saturday July 20, @04:39PM (22 children)
Clear your system of MS products.
The Musk/Trump interview appears to have been hacked, but not a DDOS hack...more like A Distributed Denial of Reality.
(Score: 5, Insightful) by Unixnut on Saturday July 20, @05:32PM (10 children)
and anything "cloud" while you are at it. As shown, it is a massive single point of failure.
(Score: 4, Touché) by RS3 on Saturday July 20, @06:01PM (9 children)
I've always thought that. But I don't understand the "thinking" of the decision makers who push critical business infrastructure onto "cloud" and Microsoft products.
Well, thinking about it, maybe they've had (very) incompetent in-house IT people and "cloud" seems simpler.
Also there's the blame-game thing- when things crash, it's easier to blame someone else rather than your own IT people.
(Score: 5, Insightful) by owl on Saturday July 20, @06:25PM
And there you go answering your own "non-understanding". You push it onto the cloud so you can finger "the cloud" for anything that goes wrong, providing CYA for yourself.
(Score: 4, Insightful) by sjames on Saturday July 20, @10:10PM
Unfortunately, they then put their incompetent staff in charge of the cloud set-up...
Perhaps the problem is in management (the ones who hired the incompetent people who hired the incompetent staff), but good luck getting them to order themselves fired.
(Score: 5, Insightful) by Unixnut on Sunday July 21, @02:33PM (6 children)
They did it at all my previous places before the IT team (myself included) were laid off.
From what I've gathered, the main reasons are non-technical in nature, specifically:
1. Shifting costs from large capital expenditure (buying servers, building datacentres, etc...) to a simple monthly operational cost.
   - It results in a reduction in costs (no need for an IT department anymore and all its staff, except some help-desk people, so a cost reduction in salaries, benefits and any office space + ancillaries).
   - The accountants prefer operational expenditure because it's simpler than dealing with capital expenditure, depreciation, tax laws, etc....
   - The C-suite love it because it makes the books look good. IT is a cost to businesses, not an income generator, so reducing it makes their profits higher (or losses lower), which can boost the share price, increase dividends, and result in larger bonuses for management.
   - Shareholders love it because it makes the share price go up and increases dividends (for the above-mentioned reasons).
2. It's flexible.
   - When you design your local system you need to do capacity planning. You can design for peak usage (expensive upfront cost, and may result in a lot of systems being idle most of the time), or for max system utilisation (which means you can't quickly add capacity for sudden peaks).
   - You also have to plan for capacity needs for some time in advance (usually 2-5 years in advance) to make sure you don't run out of capacity as the business grows.
   - With the cloud this is not necessary: you can scale up your capacity when needed, and scale it down when you don't need it (or need to save some money short term).
3. If it all goes tits up you can blame someone else.
   - If an internal system caused an outage there would be egg on the faces of the management, they would be held to account for what happened, there would be losses for the business, etc...
   - However, if a cloud provider goes offline, you just shrug your shoulders and say "not my fault, it's with our supplier to resolve". It is very good as a CYA strategy. It also applies to things like security patches, regulations, handling data, etc...
   - All these rules and regulations get offloaded to another company, and you just have to make sure you ticked the boxes you need to make sure they meet some internal compliance requirements for vetting when you onboard them. There is little to no ongoing paperwork or responsibility.
4. If there is an outage, it does not just affect you.
   - If you have an outage local to your IT, it puts you at a disadvantage to your competition, who can take advantage of the situation. However, if your competitors also have an outage because you are all cloud based, then neither of you benefits.
Logic would say then that keeping your IT in house would give you a competitive advantage, but that advantage needs to offset the business advantages in points 1-3 above, which for most it doesn't.
There is a perverse logic that the more people move to cloud, the more it makes sense to move to cloud yourself. Think about it. If you were an online retailer what good is having your website up and running and your local systems able to process orders if all your customers can't buy anything because the banking system is in the cloud and their cards don't work online?
Sure your competitors are fully down but if potential customers can't actually pay for anything then you can't take advantage of the situation to your benefit. At that point you are in the same boat as your competitors, except you are spending more money keeping your system running. You might as well save on capex and go cloud yourself.
At the end of the day, centralisation is always more resource efficient (i.e. cheaper) than decentralisation. It has lots of downsides, including single points of failure, loss of resiliency, monoculture, centralised control etc... but short term it always is easier, which is why lots of companies do it (most think very short term unfortunately).
Also don't forget the cloud companies are pushing for it as well. Some actually make their software "cloud only" so you don't have a choice, others make it very easy to onboard you to the cloud but hard to leave, or hard to even have a hybrid set up. Once you move some things to the cloud, the system is designed to "encourage" you to move more and more there. At the end of the day having a monthly recurring income stream is in their interest, so of course they will push for it.
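The capacity-planning point above (provision for peak and leave machines idle, versus renting elastic capacity) can be put in rough numbers. This is a minimal Python sketch; the demand profile, the per-server-hour rates and all function names are hypothetical and chosen only to make the arithmetic visible.

```python
def on_prem_cost(hourly_demand, rate_per_server_hour):
    """Sized for peak: you pay for peak capacity every hour, idle or not."""
    peak = max(hourly_demand)
    return peak * len(hourly_demand) * rate_per_server_hour


def cloud_cost(hourly_demand, rate_per_server_hour):
    """Elastic: you pay only for the servers actually in use each hour."""
    return sum(hourly_demand) * rate_per_server_hour


def utilisation(hourly_demand):
    """Fraction of peak-sized capacity that is actually used."""
    return sum(hourly_demand) / (max(hourly_demand) * len(hourly_demand))


if __name__ == "__main__":
    # Hypothetical day: 20 quiet hours needing 2 servers, 4 busy hours needing 10.
    demand = [2] * 20 + [10] * 4
    print(on_prem_cost(demand, 1.0))   # owned hardware at a nominal rate of 1.0
    print(cloud_cost(demand, 2.0))     # cloud at twice the per-hour rate
    print(utilisation(demand))
```

With these made-up numbers the peak-sized setup is busy only a third of the time, so the cloud stays cheaper until its hourly rate is about three times the on-prem one (the reciprocal of the utilisation). A flatter demand curve shrinks that margin quickly.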
(Score: 2) by RS3 on Sunday July 21, @06:29PM (4 children)
Awesome answer, thank you.
I admin a small hosting operation. There are a couple of "e-commerce" sites which link to some kind of online payment thing- I'm not sure- I don't do the code / sites themselves. Several times customers wanted to buy something but the payment cloud thing didn't work. But the site itself (our server) still did, and they were able to contact the site's owner and pay some other way. All of your points are correct and spot-on regarding $, but I kind of like having more direct control of my own functionality. In my use case, the cost is extremely low for the owner. I don't have time (thankfully) to delve into (soapbox) my feelings about corporate IT, but I'll summarize by saying it tends to be super political, and very inefficient. I'm all about efficiency, and prove it in my work. The admin function is barely a job- not even part-part time, but there's nearly 100% uptime. :) ISP has been pretty good too.
(Score: 3, Insightful) by bmimatt on Sunday July 21, @08:20PM (3 children)
If you run a service/website at a meaningful scale, you can retain much of the control by going hybrid. You'd set up a copy of your hosting on some cloud and use it to expand footprint dynamically, basically on-demand, when traffic reaches a certain level. In that kind of setup, you'd be basically using the cloud for 'spillover' traffic and also as a DR site. The biggest question with that setup, and perhaps a drawback, is that you need to load-balance the environments and would probably need something like Cloudflare, which may or may not be a 'good' thing.
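The spillover idea can be sketched in a few lines of Python: prefer the local environment, and send a request to the cloud copy only when local is full or down. The `Pool` class, its fields and the `route` function are all hypothetical names for illustration; a real setup would do this in a load balancer or DNS layer rather than in application code.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    capacity: int          # max concurrent requests this environment should take
    healthy: bool = True   # flipped to False when the environment is down
    active: int = 0        # requests currently being served


def route(local: Pool, cloud: Pool) -> str:
    """Pick an environment: local first, cloud for spillover or disaster recovery."""
    for pool in (local, cloud):
        if pool.healthy and pool.active < pool.capacity:
            pool.active += 1
            return pool.name
    raise RuntimeError("no capacity available in either environment")


if __name__ == "__main__":
    local = Pool("local", capacity=2)
    cloud = Pool("cloud", capacity=1)
    print([route(local, cloud) for _ in range(3)])
```

The same ordering covers the DR case: marking the local pool unhealthy sends everything to the cloud copy without any other change.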
(Score: 2) by RS3 on Sunday July 21, @08:53PM (2 children)
Yes, really good thoughts. I'd run rsync and keep local as warm backup. I'd be doing that now, but there's no budget. As I alluded, it's not my primary job. I kind of wish it was, as I'm pretty good at it, esp. the efficiency part. Some will scoff / lambaste me, but one server is 20+ years old. Rock solid. Certainly not fast, but it. just. runs. :)
(Score: 2) by bmimatt on Monday July 22, @06:44PM (1 child)
Hahahha! In a previous life I was responsible for some FreeBSD boxes - about half of them had 8 years of uptime at some point (they were in-office at a client site, fully fire-walled off).
(Score: 2) by RS3 on Tuesday July 23, @01:15AM
I'm a bit embarrassed to admit I've never tried FreeBSD, or any other BSD, well, not that I'm aware of. Maybe it's running in a router, or my washing machine?
Servers are running an aging Linux CentOS 6, fully updated, somewhat augmented, very tweaked and tuned (I can't keep my fingers out of things- gotta have fast and efficient). I've had more than 365 days uptime. One is currently at 388 days. They run so well there's no reason to mess. I log in, check stuff, make any changes needed (mostly Apache sites), maybe run a diag / status utility on RAID controller, and otherwise live my life. Oh yeah, the one runs some WordPress blogs, so that gets updated every now and then. I don't allow any "automated" updating, nor any clients to update, and I doubt they'd know how or even care.
Uptime would be much longer but for a couple of rare power outages, generator didn't run, or ran out of propane, or UPS batteries died. I don't have physical access any more so building's owner has to deal with / check that stuff. Long complicated story / situation. Ridiculous situation really.
(Score: 2) by pdfernhout on Tuesday July 23, @03:28PM
... where the farming villagers are being shaken down by bandits, and they decide to hire (hungry) samurai to protect them. But they immediately face the next problem of the issue (resulting from their limited security knowledge) of "What makes a good samurai warrior?"
https://en.wikipedia.org/wiki/Seven_Samurai [wikipedia.org]
Arguably, it is the same with hiring IT staff. What makes a good IT person? If management can't answer that, then they may think it is safer to pick a cloud provider who presumably have good IT people on staff?
Of course, that may also be kicking the can down the road, in terms of understanding what makes a good cloud provider.
(My supervisor at IBM Research circa 1999 suggested that movie analogy in relation to the challenge of hiring good contract programmers.)
The biggest challenge of the 21st century: the irony of technologies of abundance used by scarcity-minded people.
(Score: 3, Informative) by RamiK on Saturday July 20, @05:59PM
Regrettably anything CrowdStrike-related is likely to be owned by your boss rather than you.
compiling...
(Score: 4, Touché) by https on Saturday July 20, @07:37PM (8 children)
BZZZZT. In this case, MS isn't involved. It's purely Cloudstrike.
I am absolutely befuddled how anybody with money trusted George Kurtz after McAfee.
Offended and laughing about it.
(Score: 2) by krishnoid on Saturday July 20, @09:04PM
In this case it makes me ask if this is the kind of failure that coding in Rust would have minimized -- was it a segv or the like, or something else?
(Score: 2) by RS3 on Saturday July 20, @09:09PM (4 children)
Not speaking for OP, but my thought / question is: would you need Cloudstrike if you didn't run Microsoft's OSes?
(Score: 5, Informative) by linuxrocks123 on Saturday July 20, @09:38PM
You don't "need" this company's bullshit product on any operating system. However, CrowdStrike is available for Linux.
(Score: 3, Informative) by Anonymous Coward on Sunday July 21, @12:53AM (2 children)
Plenty of people/orgs run Windows and don't and didn't run CrowdStrike and had no such problems or problems that CrowdStrike claims to prevent.
Here are more examples of CrowdStrike quality:
https://access.redhat.com/solutions/7068083 [redhat.com]
https://www.mail-archive.com/debian-kernel@lists.debian.org/msg136186.html [mail-archive.com]
You'd need AV or similar too on Linux if you were catering for users who'd click through warnings etc (in a recent Windows "vulnerability" the exploit involves users clicking through at least one warning if not more). There are similar "vulnerabilities" on Desktop Linux if you count those that require users to click through warnings.
And you'd have the same difficulty recovering with Linux if you remove root access and enforce hardware drive encryption, TPM etc to make your system "secure" so that strangers and non-admins can't tamper with the stuff easily, and then your AV caused kernel panics that prevented a successful boot. Would Linux actually be easier to recover from in such a scenario (nonadmin with hardware drive encryption and hardware locked down)?
Of course if your system wasn't so "secure" "anyone" could just attach the drive to another computer and fix the problem. But hey Linux is more secure than Windows right?
(Score: 4, Informative) by RS3 on Sunday July 21, @03:13AM (1 child)
Good ideas and questions. I'm not expert on drive encryption. I don't use it as I'm much much more worried about not being able to access my own data, than I am about someone else accessing my data.
That said, I'd suspect there are some encryptions that rely on / are motherboard / TPM / something specific, such that if you pull the drive, you can not decrypt it. I would never use such a thing. I've had to replace defective motherboards.
It's kind of a security weakness in Linux, I suppose, at least for ext2-4, in that you can pull a drive and mount it as a secondary drive in another machine and access everything. Windows somewhat honors some security / ownership of files on a secondarily-mounted drive. You can usually reset those permissions, but sometimes Windows warns you that you may lose data but I forget why.
Again, not drive encryption expert but under Linux you have many filesystems available, some with sophisticated encryption and ownership / permissions, so YMMV.
Linux kernel is much more secure inherently, and its architecture engenders much more manageable and more easily configurable security in the surrounding OS administration.
Security is pretty multi-layered thing. There are many types of scenarios for administration. If you have multiple user logins on one machine, you have to be much more careful about locking permissions so user 1 doesn't mess with user 2's files. I'm too tired to write more, there are so many scenarios. Nothing is perfect, but a quite strong one is the "sandbox" concept where you'd have a hypervisor / host OS and guest OSes. Try to keep important data somewhere on a server and/or other backup locations, so if the user trashes the guest OS, no big deal, you just copy the saved standard image. For that matter you can re-image a hard drive (SSD) if the user trashes it. I haven't experienced it, but I've heard of companies where the IT people will re-image company computers- on a schedule and/or at random, so keep your data elsewhere.
(Score: 0) by Anonymous Coward on Monday July 22, @04:06AM
But if nonadmins being able to do such stuff so easily is a "compliance checkbox" issue[1] then good luck with that, might be even harder to recover from than Windows...
[1] Imagine if a selectable older kernel has a vulnerability (not that unlikely), so someone with nonadmin access can boot that kernel so that they can pwn the machine to get local admin etc.
(Score: 4, Insightful) by anubi on Saturday July 20, @11:02PM
Fact check: verified
https://www.businessinsider.com/crowdstrike-ceo-george-kurtz-tech-outage-microsoft-mcafee-2024-7?op=1 [businessinsider.com]
I actually have his book, "Hacking Exposed", 2nd edition.
That book is one of the main reasons I fought management over centralization. I saw this as an example of the "having all your eggs in one basket" thing. I am quite taken aback by how clever people can be to things selectively available.
I figured since that XP fiasco, Kurtz would be the Safest one to trust about testing before deploying as he had experienced first hand the fallout of poor judgement, where others have only read about it and ticked the correct boxes on college exams.
I believe what I have been shown is a variant of that episode of "Mayday: Airplane Disasters" where the pilot became so distracted over a malfunctioning landing gear indicator that he ignored the altimeter and met with terrain... a total loss of plane, crew and passengers.
In this case, so concerned about keeping the hacker out that he ignored resiliency from coding errors. This should have resulted in an automatic rollback along with an error report. Not a BSOD!
With the "sophistication" of today's "advanced" operating system software, it should literally take a hardware failure like an internal CPU gate fail to trip one off. "Traps" exist for every conceivable irresolvable conflict and the OS should handle it gracefully.
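To make "automatic rollback along with an error report" concrete, here is a minimal user-space sketch in Python: keep the previous file, install the new one, validate it, and restore the old copy if validation fails. It is only an analogy, since a kernel driver cannot work this way literally. `apply_with_rollback` and the `validate` callback are hypothetical names, and nothing here reflects how CrowdStrike's updater is actually built.

```python
import logging
import shutil
from pathlib import Path

log = logging.getLogger("updater")


def apply_with_rollback(target: Path, new_content: bytes, validate) -> bool:
    """Install new_content at target; restore the previous file if validate() raises."""
    backup = target.with_name(target.name + ".bak")
    shutil.copy2(target, backup)       # keep the last known-good version
    target.write_bytes(new_content)
    try:
        validate(target)               # hypothetical check; raises on bad content
    except Exception as exc:
        shutil.copy2(backup, target)   # roll back to the known-good version
        log.error("update rejected, rolled back: %s", exc)  # the error report
        return False
    return True
```

The important part is the ordering: the known-good copy exists before the new file is written, so a failed check always has something to fall back to.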
What happened here is precisely why I am so reticent to design business software into production machines.
If a business system fails, it's a nice vacation with pay for all involved.
If a production system fails, it's apt to take a LOT of very expensive hardware out as well as a helluva mess of misproduced product.
There is a lot of difference between an airport full of people who won't get a flight, and a refinery full of explosive petrochemicals losing control.
I am sure that tradeoff is considered in Corporate Boardrooms as the Executives decide which paradigm best fits their needs.
A lot of us keep our companies out of this by knowing how the stuff works.
Monsanto (hybrid seed corn) is just the ticket for mega-farms. But it's very vulnerable to blight. And it won't reproduce. It's a mule. While the "prepper" heirloom seeds are of all sorts of varieties, much more resilient to monogenomic blight, and the corn produces viable seeds.
I live in a world where scientists are tasked with gain-of-function of diseases... Apparently to make a market for their treatment (not a cure - a treatment - subscription model). I have no idea what countries are engaged in research to find ways to reciprocate for international "punishments". I get the idea the next conflagration will be settled with very destructive technologies targeting centralized systems.
Food, internet, transportation.
All this centralization is making things easier for the "enemy" as it concentrates our "vitals" into a "heart" that can be taken out with a precision targeted attack.
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
(Score: 2) by captain normal on Monday July 22, @04:59AM
From the original article summary:
"The Register has found numerous accounts of Windows 10 PCs crashing, displaying the Blue Screen of Death, then being unable to reboot."
The Musk/Trump interview appears to have been hacked, but not a DDOS hack...more like A Distributed Denial of Reality.
(Score: 1, Informative) by Anonymous Coward on Sunday July 21, @12:42AM
And install more CrowdStrike stuff since that's not the real problem here? 🤣
https://access.redhat.com/solutions/7068083 [redhat.com]
https://www.mail-archive.com/debian-kernel@lists.debian.org/msg136186.html [mail-archive.com]
(Score: 5, Funny) by Anonymous Coward on Saturday July 20, @06:56PM
Remember, if you've got your big stomping boots on, kick your MicroSoft rep *16* times. That's one boot and 15 re-boots. That should get him working!
(Score: 4, Interesting) by sjames on Saturday July 20, @10:14PM (12 children)
The fundamental question is why did CloudStrike even need to do anything that could prevent the damned OS from booting? Why did the OS even allow it?
(Score: 4, Insightful) by RS3 on Sunday July 21, @12:58AM (2 children)
Purely speculating, but based on facts: to do their "security" thing they need to load a kernel-level (ring 0) driver: "csagent.sys".
You have to admit, all those 8.5M+ computers are quite safe now. :-}
But seriously, it appears to be a Windows architecture weakness that most drivers are integral to the kernel / boot. Although I run Linux, life has steered me away from some of the intricate details, like would Linux still boot and run even if you tried to load a buggered driver? I suppose I could try it but not in the mood.
Point is: a good OS should boot itself, needing only a few very critical drivers, then load driver modules as needed. I remember when driver modules began in the Linux kernel and I remember thinking that Linux was going to quickly dominate OSes. Well, it turns out the decision-makers don't seem to care about good solid correct OS kernel architecture design. Sigh.
(Score: 5, Informative) by sjames on Sunday July 21, @01:36AM
As a sometime developer of kernel modules, I can say that by far the most common result of loading a bad module is that you get an OOPS and everything else keeps working.
Based on what I have seen, neither MacOS nor Linux will even allow what CrowdStrike is doing in the Windows kernel, nor should they.
Further, it looks like the "channel files" more or less act like a script interpreted by in-kernel code which also seems like a less than great idea. Apparently there aren't enough sanity checks to make sure it doesn't crash the kernel.
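For a sense of what "sanity checks" on such a file could look like, here is a minimal Python sketch that validates a content file completely before anything interprets it. The layout (a magic value, a record count, fixed-size records) and every constant are invented for illustration; the real channel-file format is not described in this thread, so treat all of it as an assumption.

```python
import struct

# Hypothetical layout: 4-byte magic, 2-byte record count, then 8-byte records.
MAGIC = b"CHNL"
HEADER = struct.Struct("<4sH")   # magic, record count (little-endian)
RECORD = struct.Struct("<II")    # rule id, parameter index
MAX_PARAMS = 20                  # hypothetical size of the interpreter's parameter table


def parse_channel_file(blob: bytes):
    """Return the list of (rule_id, param_index) records, or raise ValueError."""
    if len(blob) < HEADER.size:
        raise ValueError("truncated header")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if len(blob) != HEADER.size + count * RECORD.size:
        raise ValueError("length does not match record count")
    records = []
    for i in range(count):
        rule_id, param_index = RECORD.unpack_from(blob, HEADER.size + i * RECORD.size)
        if param_index >= MAX_PARAMS:
            # Reject here, instead of indexing past the table later.
            raise ValueError("parameter index out of range")
        records.append((rule_id, param_index))
    return records
```

The point of the design is that every length and index is checked against known bounds up front, so a malformed file is refused as a whole rather than crashing whatever consumes it.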
(Score: 3, Interesting) by owl on Sunday July 21, @02:07PM
The identical result can happen with a Linux kernel driver, because they too run at ring 0, identical to the rest of the kernel, and so poking at unmapped memory will cause a general protection fault and result in either a kernel oops or a kernel panic.
And, reports are that CrowdStrike broke Linux [neowin.net] a few months back (for those whose auditors demand it be installed), but that breakage didn't get the press notice of this one a few days ago.
(Score: 5, Insightful) by Tork on Sunday July 21, @01:00AM (8 children)
*I just know I'm going to accidentally call it "CloudStrike" and not catch it. My apologies in advance.
** I'm sorry... I did more venting than directly answering your question. Yesterday fucking sucked for me and it was way worse for my coworkers, whom I consider friends, in IT. We have concrete floors where I work and I'm still sore from it. I don't know how they do it. Without them we wouldn't have a business.
🏳️🌈 Proud Ally 🏳️🌈
(Score: 5, Funny) by aafcac on Sunday July 21, @01:37AM (2 children)
Don't worry, they're already working on porting this vulnerability to systemd. Perhaps while they're at it, they could arrange for it to brick the motherboard by screwing up the UEFI.
(Score: 2) by Unixnut on Sunday July 21, @02:37PM (1 child)
For those unaware, that feature was implemented in systemd back in 2014 from what I remember, since then you are advised to set /boot/efi read-only, or not mount it at all unless you need to make changes.
(Score: 2) by aafcac on Sunday July 21, @04:23PM
Which doesn't make it any less stupid. Not everything needs to be a readable filesystem. I get that Apple did that with their iPods back in the day, but for some things it just makes more sense to not directly interact with it. The fact that it can be set up in such a damaging way is just stupid, people are going to do it, or some genius up stream is going to accidentally change the setting or perhaps a cracker does it for the lulz.
(Score: 3, Interesting) by sjames on Sunday July 21, @02:21AM (2 children)
I saw my error a few minutes later, so no apology needed. Other acceptable thinkos are OnStrike and ClownStrike.
I felt really grateful yesterday that I only admin Linux boxes, but I do feel for the Windows admins. Due to excessive automation and IT being on a permanent austerity budget, they just don't have enough boots to put on the ground when something like this happens. At least Linux offers a functional serial console.
If this seemed bad, just wait for the inevitable Solarwinds like hack where somebody other than the company manages to push a malicious update.
(Score: 2) by JoeMerchant on Sunday July 21, @03:18PM (1 child)
We flew cross country yesterday on Southwest. Two hops, both flights left within 30 minutes of originally scheduled departure times, and both were 100% full no doubt due to the thousands of cancellations on other airlines.
I did see some BSODs in the airport, but just on an art installation, nothing functional seemed to be affected anymore.
🌻🌻 [google.com]
(Score: 2) by sjames on Sunday July 21, @08:22PM
According to the news, things are still not so good at Hartsfield. The departure and arrival screens are back to normal, but they still have hundreds of flight cancellations on Sunday and many more delays. It's going to take a few days to clear the backlog.
For whatever reason, Southwest seems to have been less affected than other airlines. Perhaps they didn't use CrowdStrike on every PC.
(Score: 1) by anubi on Sunday July 21, @02:21AM
I am sure there was a good time for all. I sincerely hope the company has at least awarded you guys pizza on the house for at least a decade of years.
This is going to be especially interesting if all your disks were encrypted...and the keyserver is also on an encrypted disk!
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
(Score: 2) by RS3 on Sunday July 21, @02:48AM
Sorry you had to do all of that. I do a bit of crawling, including under desks, cars, whatever, and almost always grab a pad of some sort. I have some foam pads, pieces of carpet, even cardboard is better than concrete (or worse, like gravel or coarse macadam (blacktop)).
Your story made me think of good old days of the "thin client", or maybe a more intelligent workstation but boot from server- no local hard disk / SSD. One boot image fix and you've fixed possibly hundreds of workers' work environment.
(Score: 2) by corey on Monday July 22, @11:37AM (2 children)
Been interesting reading the stories and anecdotes here. Sorry to those here who admin windows boxes who were affected by this. Not your fault.
I’m interested to see what happens next. I mean there is some serious lost business from this: millions, billions. I wonder where the first shot (lawsuit) will be fired from. Not to mention the personal stress on all the big and small businesses employees from this. I wonder if CrowdStrike will exist in a year’s time.
I worked as a Linux dev and PHP coder (LAMP) back when I was in my last years of uni. Enjoyed it but gee things have changed. Everything’s cloud, virtualised, weird (node.js), outsourced, etc. Glad I’m not doing that any more, it looks horrible now.
(Score: 2) by bart9h on Monday July 22, @06:11PM (1 child)
Yes, it is totally their fault for 1) using Windows and 2) trusting CrowdStrike.
(Score: 2) by corey on Monday July 22, @10:35PM
From what I’ve read and heard, at least those IT folk who have spoken about it, they are forced to use CS either by management or to tick a box. And besides, this was a CS problem (+MS) but the IT guys in their customers have to pick up the pieces. This situation is good though, a lot will learn from this and make change for the better (not use CS and/or Windows), and CS will not be deploying partially tested patches.