We strive for openness about site operations here at SoylentNews. This story continues in that tradition.
tl;dr: We believe all services are now functioning properly and all issues have been attended to.
Read on past the break if you are interested in the steps taken to isolate and correct the problems.
===== EXTENDED COPY =====
Problem Summary: I learned at 1312 UTC on Sunday, 2018-08-19, that some pages on the site were returning 50x error codes. Sometimes, choosing 'back' in the browser and resubmitting the page would work. Oftentimes, it did not. We also started receiving reports of problems with our RSS [soylentnews.org] and Atom [soylentnews.org] feeds.
Problem Isolation: As many of you may be aware, TheMightyBuzzard is away on vacation [soylentnews.org]. I logged onto our IRC [soylentnews.org] (Internet Relay Chat) Sunday morning (at 1313 UTC) when I saw chromas had posted (at 0324 UTC) there had been reports of problems with the RSS and Atom feeds we publish. I also noticed that one of our bots, Bender, was double-posting notifications of stories appearing on the site.
While I was investigating Bender's loquaciousness, chromas popped into IRC (at 1352 UTC) and reported that he was getting 502 and 503 error codes when he tried to load index.rss in a variety of browsers. I tried and found no issues when using Pale Moon. We then ran a variety of wget [gnu.org] requests from different servers and were able to reproduce the problem: incomplete replies, followed by multiple retries, even when requesting the feed from one of our own SoylentNews servers. So, we surmised, it was probably not a communications issue.
At 1440 UTC, SemperOSS (our newest sysadmin staff member!) joined IRC and reported that he, too, was getting retry errors. Unfortunately, his account setup had not yet been completed. Fortunately for us, he has a solid background in sysops. We combined his knowledge with my access privileges and commenced to isolate the problem.
From the fact that we were receiving 50x errors, SemperOSS surmised we probably had a problem with nginx. We looked at the logs on sodium (which runs Ubuntu), one of our two load balancers, but nothing seemed out of the ordinary. Well, let's try the other load balancer, magnesium (running Gentoo). Different directory structure, it seems, but we tracked down the log files and discovered that access.log had grown to over 8 GB... and thus depleted all free space on /dev/root.
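Tracing a full filesystem down to one oversized file is a matter of a few standard tools. Here is a sketch of the hunt; the real commands ran against / and nginx's log directory on magnesium, but this version builds a scratch directory so it is runnable anywhere:

```shell
#!/bin/sh
# Sketch: trace a full filesystem to the file that filled it.
# $DEMO stands in for the real filesystem root on magnesium.
DEMO=$(mktemp -d)
mkdir -p "$DEMO/nginx"
# Create a 2 MB stand-in for the runaway access.log.
dd if=/dev/zero of="$DEMO/nginx/access.log" bs=1024 count=2048 2>/dev/null

df -h "$DEMO"                    # step 1: confirm which filesystem is (nearly) full
du -sh "$DEMO"/*                 # step 2: walk the tree toward the biggest directory
ls -lh "$DEMO/nginx/access.log"  # step 3: the offending file
rm -rf "$DEMO"
```

In practice you repeat the `du -sh` step one directory level at a time until the big consumer falls out.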
That's not a good thing, but at least we know what the problem is!
Problem Resolution: So, we renamed the original access.log file and created a new one for nginx to write to. Next up came a search for a box with enough free space to copy the file to. SemperOSS reported more than enough space free on boron. We hit a few hiccups with ACLs and rsync, so we moved the file to /tmp and tried rsync again, only to get the same ACL error messages. Grrrr. SemperOSS suggested I pull the file over to /tmp on boron using scp instead. THAT worked! A few minutes later, the copy was complete. Yay!
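The manual rotation step looks roughly like this. The paths and file names here are illustrative stand-ins (the demo uses a scratch directory), not the exact locations on magnesium:

```shell
#!/bin/sh
# Sketch: set the old log aside and give nginx a fresh, empty file.
LOGDIR=$(mktemp -d)                    # stand-in for nginx's log directory
echo "old entries" > "$LOGDIR/access.log"

mv "$LOGDIR/access.log" "$LOGDIR/access.log.save"  # keep the data for later analysis
: > "$LOGDIR/access.log"                           # new, empty file for nginx to write to

# When rsync trips over ACL errors, pulling the file from the destination
# box with plain scp (as we ended up doing) is a reasonable fallback, e.g.:
#   scp magnesium:/path/to/access.log.save /tmp/
ls -l "$LOGDIR"
rm -rf "$LOGDIR"
```

Note that until nginx is told otherwise, it keeps writing to the renamed file, which matters in the next step.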
But, we still had the original, over-sized log file to deal with. No problemo. I ssh'd back over to magnesium and did an rm of the old access.log and... we were still at 100% usage. Doh! We needed to bounce nginx so it would release its hold on the file's inode, letting the space actually be reclaimed. Easy peasy; /etc/init.d/nginx restart and... voila! We were back down to 67% in use.
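That "rm, but still at 100%" moment is the classic deleted-but-open file: Unix only frees a file's blocks once the last name *and* the last open descriptor are gone. A minimal, Linux-specific demonstration (the file name is just whatever mktemp hands back):

```shell
#!/bin/sh
# Deleting a file a process still holds open does not free its space.
f=$(mktemp)
exec 3>"$f"              # hold the file open on descriptor 3 (as nginx held its log)
echo "log data" >&3
rm "$f"                  # unlink the name; the inode and its blocks live on

readlink /proc/$$/fd/3   # on Linux, the target is shown suffixed with " (deleted)"

exec 3>&-                # close the descriptor; now the kernel reclaims the blocks
```

Restarting nginx closes and reopens its log descriptors, which is exactly why the `/etc/init.d/nginx restart` finally brought the disk usage down.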
Finally! Success! We're done, right?
Did you see what we missed? The backup copy of access.log was now sitting in /tmp on boron, which means the next system restart would wipe it. So, a simple mv from /tmp to my ~/tmp and the file was in a safe place.
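Since /tmp is typically wiped at boot (or swept by cleanup timers), anything parked there is one reboot away from oblivion. The rescue move, sketched with stand-in paths ($DEST plays the role of ~/tmp):

```shell
#!/bin/sh
# Sketch: move a backup out of volatile /tmp into persistent storage.
SRC=$(mktemp /tmp/access.log.XXXXXX)   # stand-in for the rescued log in /tmp
echo "saved log" > "$SRC"

DEST=$(mktemp -d)                      # stand-in for a home-directory target like ~/tmp
mv "$SRC" "$DEST/access.log.save"      # now it survives a reboot

ls -l "$DEST"
rm -rf "$DEST"
```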
By 1730 UTC, we had performed checks loading various RSS and Atom feeds, and all seemed well. We were unable to reproduce the 50x errors, either.
And we're still not done.
Why/how did the log file get so large in the first place? There was no log rotation in place for it on magnesium; that log file had entries going back to 2017-06-20. At the moment, we have more than enough free space to wait until TMB returns from vacation. The plan is that we will then look over all the log files and ensure rotation is in place, so as to avoid a recurrence of this issue.
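On Linux boxes this is normally logrotate's job. A sketch of the sort of stanza that would prevent a recurrence follows; the path, schedule, and retention values are illustrative, not what will actually land on magnesium. The postrotate step sends nginx SIGUSR1, its reopen-logs signal, so it starts writing to the fresh file instead of holding the old inode open:

```
/var/log/nginx/*.log {
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        [ -f /run/nginx.pid ] && kill -USR1 "$(cat /run/nginx.pid)"
    endscript
}
```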
Please join me in thanking chromas and SemperOSS for all the time they gave up on a Sunday to isolate the problem and come up with a solution. Special mention to fnord666 who we later learned followed along, silently, but was willing to jump in had he sensed we needed any help. Thanks for having our backs! Further, please join me in publicly welcoming SemperOSS to the team and wishing him well on his efforts here!