SoylentNews is people

posted by janrinok on Thursday May 17 2018, @06:10PM
from the draining-the-queue dept.

[Update: I do not have access to the mail queue, but the server dashboard shows that, as of 23:00 UTC, beryllium has returned to normal disk and CPU levels. That said, I see a gap in the daily story headlines and daily story e-mails that were sent to me. We are continuing to monitor the situation. Please let us know if you have any outstanding issues. --martyb]

We have been open with the community since the outset, and in keeping with that practice: we just fixed an issue with the site.

On or about May 9th, our mail server, beryllium, stopped sending out e-mails. The cause was the antivirus handler failing to be loaded, so all outgoing mail that would be processed by that handler ended up waiting indefinitely.

Many thanks to mechanicjay for debugging and fixing the issue!

Impact: If you signed up for e-mails from this site (such as notifications of comment replies or moderation, or of a subscription being low or expired), these have been delayed. It may take some time for the queue to be processed and for all pending e-mails to be sent out.

I well remember when SoylentNews launched and each day brought a seemingly endless supply of crashes and failures. It is a tribute to our volunteer staff that site issues now happen so rarely!



 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 3, Interesting) by datapharmer (2702) on Thursday May 17 2018, @07:22PM (#680847)

    It would be interesting to know what actions are being taken (if any) to prevent a regression or similar issue from happening in the future. Is a job sending an outbound email periodically to a monitoring server that can alert if it doesn't get messages delivered at a set interval? Is a monitor being set on the message queue to alert if the queue size exceeds a normal level? Is a cron job running every hour or two to check that the required services are launched and responsive in the case they are no longer set to run on boot or in case they crash? Is the work to set up automated monitors and mitigations not worth the headache of setting them up given available sysadmin resources?
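
    As a rough illustration of the first two ideas (a periodic test e-mail and a queue-size alarm), here is a minimal Python sketch. The thresholds and function names are made up for the example and are not anything the site is known to run. How the queue size and the arrival time of the last test e-mail are obtained depends on the mail server, so both are simply passed in as values.

```python
from datetime import datetime, timedelta

# Hypothetical values for illustration; tune them to the server's normal load.
MAX_QUEUE_SIZE = 200
HEARTBEAT_INTERVAL = timedelta(hours=1)


def queue_too_large(queue_size, limit=MAX_QUEUE_SIZE):
    """Return True when more messages are waiting than the limit allows."""
    return queue_size > limit


def heartbeat_overdue(last_received, now, interval=HEARTBEAT_INTERVAL, grace=2.0):
    """Return True when the last test e-mail is older than grace * interval."""
    return now - last_received > interval * grace


def check(queue_size, last_received, now):
    """Collect human-readable problems; an empty list means all is well."""
    problems = []
    if queue_too_large(queue_size):
        problems.append(f"mail queue holds {queue_size} messages")
    if heartbeat_overdue(last_received, now):
        problems.append(
            f"no test e-mail received since {last_received:%Y-%m-%d %H:%M}"
        )
    return problems
```

    A cron job could call check() with the current queue size and the time the last test e-mail arrived, and notify someone whenever the returned list is not empty.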

  • (Score: 2) by martyb (76) on Thursday May 17 2018, @07:57PM (#680862)

    It would be interesting to know what actions are being taken (if any) to prevent a regression or similar issue from happening in the future. Is a job sending an outbound email periodically to a monitoring server that can alert if it doesn't get messages delivered at a set interval? Is a monitor being set on the message queue to alert if the queue size exceeds a normal level? Is a cron job running every hour or two to check that the required services are launched and responsive in the case they are no longer set to run on boot or in case they crash?

    Excellent questions!

    Way back in the early days of the site, when outages were a regular occurrence, someone had set up Icinga to monitor a whole slew of things. Problem was, it chewed up resources both locally (aggregating data) and remotely (monitoring all the various services). When the person who set that up left, it fell into disuse. After some server consolidations to reduce site costs, many of the monitors needed to be retargeted. Further, it became a resource that itself needed to be monitored to see if it was running correctly, too. As a result, it was removed from operation.

    Is the work to set up automated monitors and mitigations not worth the headache of setting them up given available sysadmin resources?

    Short answer: Ding! Ding! Ding!

    Longer answer: We now enjoy uptimes in the range of hundreds of days. Basically, only Linode-required reboots interfere with that. (Xen -> KVM, free VM memory/storage upgrade, data center migration, Meltdown/Spectre mitigations.) So, it's just a lot easier to deal with issues as they arise than to plan for any and all potential issues.

    Post Mortem: It seems this issue arose just after a Linode-required reboot of beryllium (see: Soylentnews Server Reboot Schedule May 5-10 [Update 5] [soylentnews.org]). Knowing mechanicjay, he's probably already made updates to ensure an autostart of these services on reboot.

    It just so happened that TMB has been away for a few days for some IRL stuff. Of course, I only noticed we had issues the day after he left.

    --
    Wit is intellect, dancing.