

posted by martyb on Monday August 20 2018, @11:45AM   Printer-friendly
from the teamwork++ dept.

We strive for openness about site operations here at SoylentNews. This story continues in that tradition.

tl;dr: We believe all services are now functioning properly and all issues have been attended to.

Problem Symptoms: At 1212 UTC on Sunday, 2018-08-19, I learned that some pages on the site were returning 50x error codes. Sometimes, choosing 'back' in the browser and resubmitting the page would work. Oftentimes, it did not. We also started receiving reports of problems with our RSS and Atom feeds.

Read on past the break if you are interested in the steps taken to isolate and correct the problems.

Problem Isolation: As many of you may be aware, TheMightyBuzzard is away on vacation. I logged onto our IRC (Internet Relay Chat) channel on Sunday morning (at 1212 UTC) and saw that chromas had posted (at 0224 UTC) that there had been reports of problems with the RSS and Atom feeds we publish. I also noticed that one of our bots, Bender, was double-posting notifications of stories appearing on the site.

While I was investigating Bender's loquaciousness, chromas popped into IRC (at 1252 UTC) and informed me that he was getting 502 and 503 error codes when he tried to load index.rss using a variety of browsers. I tried and found no issues using Pale Moon. We tried a variety of wget requests from different servers. To our surprise, we received incomplete replies, which then triggered multiple retries, even when accessing the feed from one of our own SoylentNews servers. So, we surmised, it was probably not a communications issue.
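(For the curious, the spot checks were along these lines; the exact invocations are lost to IRC scrollback, so treat this as a sketch rather than a transcript. A 502 or 503 from nginx shows up immediately in the response headers, and an incomplete reply shows up as a length mismatch followed by automatic retries.)

    # Fetch the feed and print the server response headers; the body is discarded.
    wget -S -O /dev/null https://soylentnews.org/index.rss

    # Or just report the HTTP status code with curl.
    curl -s -o /dev/null -w '%{http_code}\n' https://soylentnews.org/index.rss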

At 1340 UTC, SemperOss (our newest sysadmin staff member... Hi!) joined IRC and reported that he, too, was getting retry errors. Unfortunately, his account setup has not yet been completed, leaving him with access to only one server (boron). Fortunately for us, he has a solid background in sysops. We combined his knowledge and experience with my access privileges and set about isolating the problem.

(Aside: If you have ever tried to isolate and debug a problem remotely, you know how frustrating it can be. SemperOss had to relay commands to me through IRC. I would pose questions until I was certain of the correct command syntax and intent. Next, I would issue the command and report back the results, again via IRC. On several occasions, chromas piped up with critical observations and suggestions — plus some much-needed humorous commentary! It could have been an exercise in frustration with worn patience and frazzled nerves. In reality, there was only professionalism as we pursued various possibilities and examined outcomes.)

From the fact that we were receiving 50x errors, SemperOss surmised we probably had a problem with nginx. We looked at the logs on sodium (which runs Ubuntu), one of our two load balancers, but nothing seemed out of the ordinary. Well, let's try the other load balancer, magnesium (running Gentoo). Different directory structure, it seems, but we tracked down the log files and discovered that access.log had grown to over 8 GB... and thus depleted all free space on /dev/root, the main file system of the machine.
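(Roughly, the hunt looked like this; the paths are from memory and the nginx log directory differs between our Ubuntu and Gentoo boxes, so take the specifics as illustrative.)

    # Confirm which filesystem is actually full
    df -h /

    # Find the biggest offenders under /var/log, staying on one filesystem
    du -xh /var/log | sort -h | tail -n 20

    # The culprit: an access.log that had never been rotated
    ls -lh /var/log/nginx/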

That's not a good thing, but at least we finally knew what the problem was!

Problem Resolution: So, we renamed the original access.log file and created a new one for nginx to write to. Next up came a search for a box with sufficient free space to copy the file to. SemperOss reported more than enough space free on boron. We hit a few hiccups with ACLs and rsync, so we moved the file to /tmp and tried rsync again, which produced the same ACL error messages. Grrrr. SemperOss suggested I try pulling the file over to /tmp on boron using scp. THAT worked! A few minutes later, the copy was complete. Yay!
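(A sketch of the manual rotation and copy. Paths and ownership are illustrative — the log file's owner will be www-data, nginx, or root depending on the distribution and configuration.)

    # On magnesium: set the oversized log aside and give nginx a fresh, empty file
    mv /var/log/nginx/access.log /var/log/nginx/access.log.old
    touch /var/log/nginx/access.log
    chown nginx:nginx /var/log/nginx/access.log

    # On boron: pull the old log over, since pushing with rsync kept failing on ACLs
    scp magnesium:/tmp/access.log.old /tmp/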

But we still had the original, over-sized log file to deal with. No problemo. I ssh'd back over to magnesium, did an rm of the old access.log (now safely copied to boron), and... we were still at 100% usage. Doh! We needed to bounce nginx so it would release its hold on the file's inode and the space could actually be reclaimed. Easy peasy; /etc/init.d/nginx restart and... voila! We were back down to 67% in use.
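(The inode dance, in command form. lsof +L1 lists files that are open but already unlinked, which is exactly the state the deleted access.log was in.)

    # The space is not reclaimed while nginx still holds the deleted file open
    lsof +L1 | grep nginx

    # Restart nginx (OpenRC on our Gentoo box) to release the old inode
    /etc/init.d/nginx restart

    # Usage drops once the handle is closed
    df -h /

In hindsight, nginx -s reopen (which signals the master process with SIGUSR1 to reopen its log files) would likely have freed the space without a full restart.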

Finally! Success! We're done, right?
Umm, no.

Did you see what we missed? The backup copy of access.log was now sitting in /tmp on boron, which meant the next system restart would wipe it. So, a simple mv from /tmp to my ~/tmp, and the file was in a safe place.

By 1630 UTC, we had performed some checks, loading various RSS and Atom feeds, and all seemed well. We were unable to reproduce the 50x errors, either.
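(The after-the-fix spot check, roughly; feed paths other than index.rss are illustrative.)

    # Each feed should now return 200 with a complete body
    for feed in index.rss index.atom; do
        curl -sS -o /dev/null -w "%{http_code}  %{size_download} bytes  $feed\n" \
            "https://soylentnews.org/$feed"
    done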

And we're still not done.

Why/how did the log file get so large in the first place? There was no log rotation in place for it on magnesium. That log file had entries going back to 2017-06-20. At the moment, we have more than sufficient space to allow us to wait until TMB returns from vacation. (We checked free disk space on all of our servers.) The plan is to look over all log files and ensure rotation is in place so as to avoid a recurrence of this issue.
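(As a starting point for that discussion, a minimal logrotate policy for nginx might look something like the following. This is a sketch only; the final configuration, and whether we use logrotate at all on the Gentoo box, is TMB's call when he is back.)

    # /etc/logrotate.d/nginx  (illustrative)
    /var/log/nginx/*.log {
        weekly
        rotate 8
        compress
        delaycompress
        missingok
        notifempty
        sharedscripts
        postrotate
            # Ask nginx to reopen its logs so the rotated inode is released
            test -r /run/nginx.pid && kill -USR1 "$(cat /run/nginx.pid)"
        endscript
    }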

Problem Summary: An oversized log file had consumed all free space on one of our servers. We believe we have fixed it, and that all services are now functioning properly and all issues have been attended to.

Conclusion: Please join me in thanking chromas and SemperOss for all the time they gave up on a Sunday to isolate the problem and come up with a solution. Special mention to Fnord666, who, we later learned, was silently lurking and willing to jump in had he sensed we needed any help. Thank you for having our backs! Further, please join me in publicly welcoming SemperOss to the team and wishing him well in his efforts here!

Lastly, this is an all-volunteer, non-commercial site — nobody is paid anything for their efforts in support of the site. We are, therefore, entirely dependent on the community for financial support. Please take a moment to consider subscribing to SoylentNews, whether by starting a new subscription, extending an existing subscription, or making a gift subscription to someone else on the site. Any amount entered in the payment amount field above and beyond the minimum is especially appreciated!


Original Submission

 
  • (Score: 2) by NewNic (6420) on Monday August 20 2018, @08:40PM (#723905) Journal

    Add "logrotate" to your USE variable, then re-emerge all affected packages.

    emerge --newuse @world.

    --
    lib·er·tar·i·an·ism ˌlibərˈterēənizəm/ noun: Magical thinking that useful idiots mistake for serious political theory