
SoylentNews is people

Meta
posted by martyb on Friday May 21 2021, @12:25AM

As many of you noticed, we had a site crash today, from around 1300 until 2200 UTC (2021-05-20).

A HUGE thank you goes to mechanicjay who spent the whole time trying to get our ndb (cluster) working again. It's an uncommon configuration, which made recovery especially challenging... there's just not a lot of documentation about it on the web.

I reached out and got hold of The Mighty Buzzard on the phone, then put him in touch with mechanicjay, who got us back up and running using backups.

Unfortunately, we had to go all the way back to April 14 to get a working backup. (I don't know all the details, but it appears something went sideways on neon.)

We're all wiped out right now. When we have rested and had a chance to discuss things, we'll post an update.

In the meantime, please join me in thanking mechanicjay and TMB for all they did to get us up and running again!

 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 5, Insightful) by Tork on Friday May 21 2021, @01:09AM (33 children)

    by Tork (3914) Subscriber Badge on Friday May 21 2021, @01:09AM (#1137439)
    Just wanted to second the show of gratitude to TMB. Thanks, man.
    --
    🏳️‍🌈 Proud Ally 🏳️‍🌈
  • (Score: -1, Flamebait) by Anonymous Coward on Friday May 21 2021, @06:03AM (31 children)

    by Anonymous Coward on Friday May 21 2021, @06:03AM (#1137461)

    It was his crap setup that probably caused this in the first place. AND I predicted in advance that it would probably happen the next time Linode did network maintenance. But instead of listening to and learning from the guy who has run high availability clusters professionally, he got all high and mighty about how I was misinformed. I just hope that whoever is in charge this time actually reads the manual and doesn't make basic mistakes like he did.

    • (Score: 0) by Anonymous Coward on Friday May 21 2021, @06:10AM (20 children)

      by Anonymous Coward on Friday May 21 2021, @06:10AM (#1137464)

      *yawn*

      Have you ever volunteered? No? OK then, we can safely ignore your prescience that can't be proved. Like every other medium who predicts catastrophes, you're probably just another charlatan.

      • (Score: 0) by Anonymous Coward on Friday May 21 2021, @06:27AM (19 children)

        by Anonymous Coward on Friday May 21 2021, @06:27AM (#1137470)

        Except it was documented on this site and I'm neither the first nor the only one to notice the problem pattern with their setup. And why would I have volunteered to butt heads with someone who apparently hasn't read the docs and doesn't understand the apparent basics of HA clusters like this?

        • (Score: 3, Touché) by janrinok on Friday May 21 2021, @08:36AM (18 children)

          by janrinok (52) Subscriber Badge on Friday May 21 2021, @08:36AM (#1137488) Journal

          I'm not about to go searching through every AC comment to find where you claim it is "documented".

          And it appears that you have already reached the conclusion of what caused the problem and who is responsible. If it was a full hard drive then, I believe, somebody on the current team should have been responsible for checking for that possible occurrence and taking the appropriate action before we get to a critical stage. And in such a case that person would be responsible - not TMB who is no longer managing our hardware and therefore cannot be responsible for our day-to-day running. And when the site was first set up there were quite a few people responsible for structuring the hardware to meet the requirements of the code that we inherited. TMB was only one of those involved.

          True, we can restructure the entire hardware and software configuration but that takes qualified people and there are precious few of them volunteering to give up any of their time to help this site. Fortunately we had 2 people - mechanicjay and TMB - who have given willingly and freely of their personal time over the last day or so. If you are volunteering to help, then you must understand that we cannot simply accept the word of an AC - you will have to be a little bit more approachable than that. You will be welcome in that event but, if not, then you can continue to rant on here as an AC and we will continue to take everything you say with a healthy pinch of salt.

          The site will crash again - of course, I don't know when or why. Making such a claim does not qualify me to operate as a sysadmin on this site, however by making this forecast I have now got exactly the same provable credentials as yourself.

          • (Score: -1, Flamebait) by Anonymous Coward on Friday May 21 2021, @09:44AM (17 children)

            by Anonymous Coward on Friday May 21 2021, @09:44AM (#1137497)

            Keep making excuses. You don’t need to search for any “documentation” - the site crashes on a regular basis, and everyone knows it.

            And as a parent poster pointed out, nobody is going to volunteer to fix it if it means continually arguing with Mr “Proud I Don’t Need An Education” Buzztard.

            But stick with the current “plan”. Where you’ve built up so much technical debt that recovery is impossible. Because there’s no fool like an old fool.

            • (Score: 4, Informative) by janrinok on Friday May 21 2021, @10:21AM (12 children)

              by janrinok (52) Subscriber Badge on Friday May 21 2021, @10:21AM (#1137502) Journal

              You've read something that I didn't say.

              We ARE looking at how to improve both the system configuration and the software that we use. We are not, however, simply throwing everything away to start from scratch again. The system can be simplified which should result in a more robust site. As mechanicjay has stated elsewhere, some of the software that is installed to provide resilience is actually causing more problems than it is intended to solve. We can get rid of that straight away. A content management system should be able to work with any chosen database, and that includes MySQL, so there is nothing that I am aware of to suggest that MySQL cannot fulfil the role we are asking of it. People may have their own personal preferences but changing the database will require changes to the perl code which will all need writing and testing.

              There is a problem with documentation but that is also linked to the lack of staff that we currently have, and that goes back at least 3 years. What few staff we have are currently kept busy keeping the site going and, although we are aware of areas where work needs to be done it can only be done by those who understand the system configuration. You can only write more documentation when you have people who understand what it is they are documenting. We need more sysadmins because system failures do not occur when the only active sysadmin is sat at his computer with nothing better to do.

              We need more programmers who are prepared to volunteer to help support the site. It doesn't matter which language the site is written in, we will still need programmers to do that work. We always need more editors - although even with just the handful that we currently have we are looking relatively well manned compared with every other part of the support team. QA is a one-man team - MartyB again, who actually fills several more roles in the team at the same time.

              You definitely have got a bug about TMB though - in case you missed it he is no longer part of the support team although he remains a member of our community and he kindly gave advice to mechanicjay during the last 24 hours. If you have had your nose put out of joint during your earlier discussions with him then that is a personal matter between you two.

              "But stick with the current “plan”. Where you’ve built up so much technical debt that recovery is impossible. Because there’s no fool like an old fool."

              So you can see that, far from your claim, we are not sticking with the current plan. With the limited resources that we have we will make progress at as fast a rate as is possible. When the site first went active there were 20-30 active participants who were all contributing to keeping the site going. I reckon that we have less than 10 available today. Rather than sitting back and criticising as an AC, wouldn't you prefer to join the team and help fix some of the problems?

              • (Score: -1, Redundant) by Anonymous Coward on Friday May 21 2021, @11:05AM (11 children)

                by Anonymous Coward on Friday May 21 2021, @11:05AM (#1137513)
                Geeklog works just fine with MySQL and MariaDB, among others. Slash, in its current state, obviously is way too far gone to bother fixing. There comes a time in many software projects where you have to throw everything out and do it right, using the lessons learned.

                But keep telling yourself that slash can be fixed. You need proper devs, something you haven’t had in years, who wouldn’t put up with wishful thinking but will speak the hard truths borne of the confidence of experience.

                There are other CMS that will also do a decent job. But NONE OF THEM ARE WRITTEN IN PERL. So the language issue is entirely relevant. Because software has to be maintained. And nobody wants to use Perl for large projects any more. Not when there are better alternatives.

                If your code is so great why isn’t anyone else running it? Because it’s an unmaintained pile of patches over patches.

                The month of corrupted backups is just a symptom, another red flag. But keep making excuses - reality will continue to bite, and bite increasingly harder. There are hard decisions to be made, and you either make them now or events will make them for you. The future waits for nobody.

                • (Score: 2) by janrinok on Friday May 21 2021, @11:37AM (6 children)

                  by janrinok (52) Subscriber Badge on Friday May 21 2021, @11:37AM (#1137520) Journal

                  Rehash was not the cause of the latest crash (as far as we can ascertain) - it is not where the focus is at present. That is not to say it will never be replaced but, for the time being, it is still working as expected. Currently, we have not got the resources to replace Rehash with a different language or package. If it ain't broke, don't fix it.

                  You are focussing on an area that is not causing us a problem at the moment. The system configuration is where we continue to encounter problems and that is where mechanicjay is currently concentrating his efforts.

                  • (Score: 0) by Anonymous Coward on Friday May 21 2021, @01:58PM (5 children)

                    by Anonymous Coward on Friday May 21 2021, @01:58PM (#1137534)

                    "If it ain't broke, don't fix it."

                    More often than not, this actually means "the fix is too hard, so it cannot possibly be broken".

                    • (Score: 2) by janrinok on Friday May 21 2021, @02:08PM (4 children)

                      by janrinok (52) Subscriber Badge on Friday May 21 2021, @02:08PM (#1137536) Journal

                      Rehash is working today as advertised, the system configuration isn't - with very limited resources which one would you work on first?

                      • (Score: 0) by Anonymous Coward on Friday May 21 2021, @02:34PM (3 children)

                        by Anonymous Coward on Friday May 21 2021, @02:34PM (#1137544)
                        It broke … again. And nobody else is using the code anyway. Why not? Because it’s fragile, and too much Perl causes brain damage. Someone mentioned pipedot.org as an example - a site that has been inactive since April 2017.

                        Another post mentioned using a separate process to update story counts. Someone needs to learn to code better, and brush up on SQL as well. Just because the original devs didn’t know how to do it right is no excuse to preserve shit like that. This is 2021, not 1995.

                        TMB fücked up by not using a LIMIT clause in SQL that would have avoided time-outs under load. Experienced devs will ALWAYS seek ways that guarantee the most efficient use of resources because they don’t want intermittent bugs. Rehash is a total hash. Either learn to code or get something that other people are maintaining because it’s widely used, in a language that is widely used for web development. But you won’t. You will continue to ignore the red flags.
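To illustrate the LIMIT point in general terms, here is a minimal sketch using Python's built-in sqlite3 module. The table, column, and function names are hypothetical; they are not the rehash schema, and this is not the query the commenter is referring to.

```python
import sqlite3

# Hypothetical in-memory table standing in for a large comments table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (cid INTEGER PRIMARY KEY, sid INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO comments (sid, body) VALUES (?, ?)",
    [(1, f"comment {i}") for i in range(10_000)],
)

def recent_comments(conn, sid, page_size=50):
    """Return at most page_size of the newest comments for one story."""
    return conn.execute(
        "SELECT cid, body FROM comments WHERE sid = ? "
        "ORDER BY cid DESC LIMIT ?",
        (sid, page_size),
    ).fetchall()

rows = recent_comments(conn, sid=1)
print(len(rows))  # bounded by page_size, however large the table is
```

A bounded query caps the rows returned per request no matter how large the table grows. Whether that would have avoided the time-outs mentioned above is the commenter's claim, not something this sketch demonstrates.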

                        Why the resistance to a clean-sheet rethink of the site? Articles, user comments, and user journals are the only essentials. The polls suck, but most CMS packages contain poll functionality, so keep polls if you must. But do you really want to waste part of your life dealing with stupid complaints about unfair moderation? What a time sink! Dump it. It’s far from essential, and keeping it didn’t preserve slashdot’s ability to generate the slashdot effect.

                        If you think that user moderation is the killer feature that keeps people on the site, well, it ain’t working here, same as it didn’t on the green site. Is it SO hard to grab a copy of geeklog and skin it so it looks the way you want while still allowing the essentials - stories, comments, and journals? It’s a one-day job (with breaks).

                        What do you have to lose at this point?

                        • (Score: 3, Insightful) by janrinok on Friday May 21 2021, @04:39PM (1 child)

                          by janrinok (52) Subscriber Badge on Friday May 21 2021, @04:39PM (#1137584) Journal

                          We are, at this very moment, discussing options on a private channel. And ALL of our resources are currently working on recovering from yesterday, or keeping the site going today.

                          The only thing that is causing a problem (repeatedly) is one element of the system configuration that is not providing us with any benefit whatsoever - so that is what we are currently working on removing. The rest of the site is working just as we want it to. Let me explain it in an auto analogy - which is the traditional way of doing things around here. What you are suggesting is that we currently have a flat tire but you are recommending that we also paint the car, change the upholstery and fit a new engine too.

                          If it can be done in a day I will await your contribution by, shall we say, Sunday evening? Show me something working to convince me - not just make ridiculous suggestions that we haven't got the resources to complete anyway.

                          • (Score: 0) by Anonymous Coward on Friday May 21 2021, @10:46PM

                            by Anonymous Coward on Friday May 21 2021, @10:46PM (#1137643)

                            "The only thing that is causing a problem (repeatedly) is one element of the system configuration that is not providing us with any benefit whatsoever - so that is what we are currently working on removing."

                            Perhaps that is for the best since you all apparently don't know how to use it properly. Quite a number of people use it under higher loads with better uptimes, after all.

                        • (Score: 0) by Anonymous Coward on Saturday May 22 2021, @05:03PM

                          by Anonymous Coward on Saturday May 22 2021, @05:03PM (#1137762)

                          "...and too much Perl causes brain damage."

                          Ah yes, but only when it comes to inferior brains

                • (Score: 5, Insightful) by martyb on Friday May 21 2021, @02:30PM (3 children)

                  by martyb (76) Subscriber Badge on Friday May 21 2021, @02:30PM (#1137543) Journal

                  One thing to keep in mind is the "heritage" of our code.

                  I was with /. before it even had userids! I've witnessed all kinds of attacks on the site. Page-widening trolls. Actual SPAM comments. Mod bombs. Whatever creative nerds could come up with, they threw it at /. and changes were made to mitigate them. It stood up under heavy fire.

                  Slashcode begat rehash which is the open-source, freely available code that powers this site. Our foundation is solid.

                  Also, there is MUCH MUCH more going on behind the scenes. I dare say the admin interface has AT LEAST as much going on as what is presented to the community. Quite possibly twice (or thrice) as much. Every once in a while I find yet-another setting or configuration that could be tweaked!

                  The foundation is solid.

                  Admittedly, the site would benefit from some tuning. When SoylentNews started, it was difficult to foresee what areas would grow fastest and what needed to be allocated. I mean, here is my first comment on the site: comment 255 [soylentnews.org], and here I am replying to comment number 1,137,513!

                  Remember, too, this site is run by volunteers in their spare time.

                  Sure, it would be wonderful to have paid, full-time staff monitoring the site 24/7/365 like on Reddit or the like. How much would that cost per year? 3 x 8-hour shifts per day x 365 days-per-year is 8760 hours. At $15.00 per hour (dirt cheap for these kinds of skills!) that works out to $131,400 per year! And that does not even include server hosting costs! More realistically, at just $30.00 per hour, that works out to $1,839,600 per year! And that does not even include hosing costs.

                  SoylentNews gets by on just $7,000 for an entire year! And that includes the annual costs of being incorporated, filing taxes, hosting expenses, everything!
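A quick check of the staffing arithmetic above, as a sketch only. It assumes one person on duty per shift unless told otherwise, which the comment does not state; the function name and the people_per_shift parameter are illustrative.

```python
HOURS_PER_YEAR = 3 * 8 * 365  # three 8-hour shifts a day, every day of the year

def annual_staff_cost(hourly_rate, people_per_shift=1):
    """Yearly wage bill for round-the-clock coverage (hosting not included)."""
    return HOURS_PER_YEAR * hourly_rate * people_per_shift

print(HOURS_PER_YEAR)            # 8760
print(annual_staff_cost(15))     # 131400
print(annual_staff_cost(30))     # 262800
print(annual_staff_cost(15, 7))  # 919800
print(annual_staff_cost(30, 7))  # 1839600
```

With single coverage the totals are $131,400 at $15.00 per hour and $262,800 at $30.00 per hour. The larger figures quoted in this thread, $919,800 and $1,839,600, are exactly seven times those totals; the thread does not say what that factor represents.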

                  .

                  --
                  Wit is intellect, dancing.
                  • (Score: 4, Informative) by martyb on Friday May 21 2021, @02:38PM

                    by martyb (76) Subscriber Badge on Friday May 21 2021, @02:38PM (#1137546) Journal

                    Oops! Hit submit instead of preview.

                    s/hosing/hosting/

                    s/131,400/919,800/

                    There's prolly some more; it was a LONG day yesterday!

                    --
                    Wit is intellect, dancing.
                  • (Score: 0) by Anonymous Coward on Friday May 21 2021, @03:07PM (1 child)

                    by Anonymous Coward on Friday May 21 2021, @03:07PM (#1137552)
                    Marty, I like you, but seriously, there are SO many flaws in your post.

                    Yes, it’s expensive keeping a full-time dev on the payroll. That’s why hobby sites like soylent don’t do that - they use widely used open source CMS packages that have proper documentation, a developer community, and use a broadly used language combo - the most popular being written in php and any MySQL or PostgreSQL variant.

                    Not slash. Not rehash. Perl lost the race a long time ago.

                    What do users want? Articles, the ability to post comments, and journals. Plenty of CMS packages using php and a database server can do that without the legacy of Perl.

                    Grab a copy of geeklog and play around with it. You should be able to have a functional site with stories, comments, and journals. And of course, the administrative backend contains all the functionality you want to hide from users.

                    About Geeklog

                    Geeklog is an open source application for managing dynamic web content. It is written in PHP and supports MySQL or PostgreSQL as the database backend.

                    "Out of the box", Geeklog is a CMS, or a blog engine with support for comments, trackbacks, multiple syndication formats, spam protection, and all the other vital features of such a system.

                    The core Geeklog distribution can easily be extended by the many community developed plugins and other add-ons to radically alter its functionality. Available plugins include forums, image galleries, and many more.

                    This is what you use when you can’t afford to keep a team of developers on staff. Php has a wide user base, so you might actually attract developers, because nobody wants to screw around with Perl. The whole “TMTOWTDI” is a bug, not a feature.

                    You might even want to give the site a new, fresher look.

                    Seriously, give it a try. Take a shitbox computer, install Linux or FreeBSD on it, and give geeklog a try. It worked for groklaw under traffic you can only dream of. Don’t be fooled by groklaw’s blah appearance. If you know HTML and CSS, and have any graphics talent, you can make it look clean and modern and spiffy as all. Icons for stories? Screw that - real images or graphics that the text wraps around. (You can still keep the topic icons if you must, but they’re really dated).

                    As for the whole editorial process, you’d best run a private copy for the editors to edit submissions before someone posts them to the main site. I get that the subs queue is there so people can check before submitting a story, but multiple submissions are a good thing if they contain more information. You’ll probably end up dropping ICQ if editors can see what their proposed stories and edits and included graphics look like, and other editors can cut n paste and change and tweak it, and see the changes right there in the thread in their comments.

                    • (Score: 1, Insightful) by Anonymous Coward on Saturday May 22 2021, @11:04AM

                      by Anonymous Coward on Saturday May 22 2021, @11:04AM (#1137719)

                      Those php sites get pwned every now and then too.

                      Geeklog's security track record is crap and the types of vulnerabilities are not confidence inspiring: https://www.google.com/search?q=%22Geeklog%22+exploit [google.com]

                      Go get a clue. If you're not going to spend much time and money on a site you don't pick shit that needs to be patched every month.

            • (Score: 1) by khallow on Saturday May 22 2021, @11:35PM (3 children)

              by khallow (3766) Subscriber Badge on Saturday May 22 2021, @11:35PM (#1137836) Journal

              "And as a parent poster pointed out, nobody is going to volunteer to fix it if it means continually arguing with Mr “Proud I Don’t Need An Education” Buzztard."

              Even when TMB was in house, nobody was continually arguing with Mr. "Proud". I wonder how much else of your narrative is just as imaginary?

              • (Score: 0) by Anonymous Coward on Sunday May 23 2021, @06:09AM (2 children)

                by Anonymous Coward on Sunday May 23 2021, @06:09AM (#1137905)

                And then you wonder why a system that has a five nines guarantee in a 2/2/2 setup doesn't even have two.
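For scale, here is a rough sketch of what "nines" mean in downtime terms, applied to the outage window given in the story (about 1300 to 2200 UTC, so roughly nine hours). It assumes a 365-day year and counts only that one outage; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year

def allowed_downtime_minutes(nines):
    """Yearly downtime budget for N nines of availability (5 -> 99.999%)."""
    return HOURS_PER_YEAR * 60 * 10 ** -nines

def availability_percent(downtime_hours):
    """Availability over one year, given total hours of downtime."""
    return (1 - downtime_hours / HOURS_PER_YEAR) * 100

print(round(allowed_downtime_minutes(5), 2))       # five nines, minutes per year
print(round(allowed_downtime_minutes(2) / 60, 1))  # two nines, hours per year
print(round(availability_percent(9), 3))           # one nine-hour outage
```

Five nines leaves a budget of roughly five minutes of downtime per year, while two nines allows about 87.6 hours. A single nine-hour outage therefore blows through a five-nines budget many times over, though on its own it stays within two nines; any other downtime during the year would count on top of that.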

                • (Score: 1) by khallow on Sunday May 23 2021, @01:14PM (1 child)

                  by khallow (3766) Subscriber Badge on Sunday May 23 2021, @01:14PM (#1137942) Journal

                  "And then you wonder why a system that has a five nines guarantee in a 2/2/2 setup doesn't even have two."

                  I already know of real world systems - the Space Shuttle, that failed that hard. There's no wondering over here.

                  • (Score: 0) by Anonymous Coward on Monday May 24 2021, @12:37AM

                    by Anonymous Coward on Monday May 24 2021, @12:37AM (#1138082)

                    Because operating on the edge of science and technology at the extremes of risk with single points of failure meeting Swiss Cheese model of reality is directly analogous to running an incorrectly deployed bog-standard cluster deployment that is failing to meet its uptime guarantees despite hundreds of thousands of deployments operating successfully in worse conditions when they do deploy it correctly.

                    Right.

    • (Score: 5, Insightful) by Tork on Friday May 21 2021, @06:15AM (9 children)

      by Tork (3914) Subscriber Badge on Friday May 21 2021, @06:15AM (#1137465)

      Whether he is blameless or not the fact is he quit. I've butted heads with him before, so it's not like I'm being charitable when I say: I can't think of a reason to fault him for lending a hand.

      --
      🏳️‍🌈 Proud Ally 🏳️‍🌈
      • (Score: -1, Troll) by Anonymous Coward on Friday May 21 2021, @06:20AM (8 children)

        by Anonymous Coward on Friday May 21 2021, @06:20AM (#1137468)

        I'm not faulting him for lending a hand, it's everything else that got this site to this point.

        • (Score: 0, Disagree) by Anonymous Coward on Friday May 21 2021, @11:28AM (7 children)

          by Anonymous Coward on Friday May 21 2021, @11:28AM (#1137519)
          It’s Stockholm Syndrome. Or Battered Spouse Syndrome. TMB shit on everything for so long that people were grateful when he stopped/quit. He quit, and donations that had been stalled since the beginning of the month started again, as people voted their approval of his leaving with their money.

          Money talks.

          • (Score: 0) by Anonymous Coward on Friday May 21 2021, @02:17PM

            by Anonymous Coward on Friday May 21 2021, @02:17PM (#1137540)

            What utter nonsense. This site is run by volunteers and the staff (including TMB) have been excellent.

          • (Score: 2) by janrinok on Friday May 21 2021, @02:21PM (4 children)

            by janrinok (52) Subscriber Badge on Friday May 21 2021, @02:21PM (#1137542) Journal

            MartyB is responsible for recording donations, and he has been rather busy for the last year or two - not TMB. Anything else you want to accuse TMB of while you are at it?

            • (Score: 1, Interesting) by Anonymous Coward on Friday May 21 2021, @03:19PM (3 children)

              by Anonymous Coward on Friday May 21 2021, @03:19PM (#1137558)
              Whoosh! TMB quit, and donations, which had been stalled, started again. Talk about missing the point.

              Plenty of people saw buzz as a large part of the problem. After a week of “playing nice” he packed it in.

              If you had even ONE serious developer on tap you could have a different CMS up and running in a day, stories, comments, user journals, stupid polls, etc. ONE DAY. Take a few weeks to get user feedback, tweak things a bit, etc.

              But NOBODY with a clue wants to maintain a mess of obsolete Perl that has no documentation. It’s one of those cases where you have to close your eyes and ignore the sunk costs of the time and emotions tied up in rehash.

              Software development is brutal. You can’t be sentimental over old code. You need to “knife the baby” on a regular basis to progress. Otherwise we’d still be stuck with GWBASIC.

              • (Score: 3, Informative) by janrinok on Friday May 21 2021, @04:49PM

                by janrinok (52) Subscriber Badge on Friday May 21 2021, @04:49PM (#1137587) Journal

                The Perl code is working as expected. This is a system configuration problem. Stop trying to fix the wrong problem.

              • (Score: 5, Informative) by DECbot on Saturday May 22 2021, @03:06AM

                by DECbot (832) on Saturday May 22 2021, @03:06AM (#1137680) Journal

                To add to what Janrinok has said, slashcode regularly handled a traffic load that would take mainstream static sites offline and DDoS ISPs, with fewer resources than what people throw at a regular WordPress or Drupal site today. The perl code is proven, mysql is proven, but there is a configuration issue causing stability issues with the SN implementation of the slashcode. That is what needs to be addressed. It's not time to replace the horse cause it threw an ill-fitting shoe.

                --
                cats~$ sudo chown -R us /home/base
              • (Score: 2) by bzipitidoo on Sunday May 23 2021, @03:08AM

                by bzipitidoo (4388) on Sunday May 23 2021, @03:08AM (#1137879) Journal

                One day, for one person to do a migration to another CMS? You're dreaming. I am not familiar with CMSes, but if it's anything like a database migration, you're talking weeks at the least. Yeah, one expert could migrate a little toy of a database in one day, but a real one with millions of rows and complexities such as stored procedures and non-standard SQL, no way.

          • (Score: 2) by Tork on Friday May 21 2021, @04:03PM

            by Tork (3914) Subscriber Badge on Friday May 21 2021, @04:03PM (#1137574)

            "TMB shit on everything for so long that people were grateful when he stopped/quit."

            He 'shit' on everything by babbling just like I do. If I couldn't take his silliness I certainly wouldn't be able to deal with the AC-holes around here. (I'll leave it ambiguous for now whether or not I'm including you in that.)

            I'm not 'in love with my abuser', my panties are simply wrinkle-resistant.

            --
            🏳️‍🌈 Proud Ally 🏳️‍🌈
  • (Score: 2) by Reziac on Saturday May 22 2021, @02:29AM

    by Reziac (2489) on Saturday May 22 2021, @02:29AM (#1137665) Homepage

    Third. And to whoever else stayed up all night getting the site back up. Made my day.

    --
    And there is no Alkibiades to come back and save us from ourselves.