Meta
posted by martyb on Monday August 20 2018, @11:45AM
from the teamwork++ dept.

We strive for openness about site operations here at SoylentNews. This story continues in that tradition.

tl;dr: We believe all services are now functioning properly and all issues have been attended to.

Problem Symptoms: I learned at 1212 UTC on Sunday 2018-08-19 that some pages on the site were returning 50x error codes. Sometimes, choosing 'back' in the browser and resubmitting the page would work; oftentimes, it did not. We also started receiving reports of problems with our RSS and Atom feeds.

Read on past the break if you are interested in the steps taken to isolate and correct the problems.

Problem Isolation: As many of you may be aware, TheMightyBuzzard is away on vacation. I logged onto our IRC (Internet Relay Chat) channel Sunday morning (at 1212 UTC) and saw that chromas had posted (at 0224 UTC) that there had been reports of problems with the RSS and Atom feeds we publish. I also noticed that one of our bots, Bender, was double-posting notifications of stories appearing on the site.

While I was investigating Bender's loquaciousness, chromas popped into IRC (at 1252 UTC) and informed me that he was getting 502 and 503 error codes when he tried to load index.rss using a variety of browsers. I tried and found no issues using Pale Moon. We then tried a variety of wget requests from different servers. To our surprise, we received incomplete replies, which then caused multiple retries, even when accessing the feed from one of our own SoylentNews servers. So, we surmised, it was probably not a communications issue.
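For the curious, the spot checks looked roughly like this (reconstructed from memory; the exact options are illustrative):

    # Fetch the feed, discard the body, and show the response headers/status
    wget --server-response --output-document=/dev/null https://soylentnews.org/index.rss

    # The same check with curl, printing only the HTTP status code
    curl --silent --output /dev/null --write-out '%{http_code}\n' https://soylentnews.org/index.rss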

At 1340 UTC, SemperOss (our newest sysadmin staff member... Hi!) joined IRC and reported that he, too, was getting retry errors. Unfortunately, his account setup had not been completed, leaving him with access to only one server (boron). Fortunately for us, he has a solid background in sysops. We combined his knowledge and experience with my access privileges and commenced to isolate the problem.

(Aside: If you have ever tried to isolate and debug a problem remotely, you know how frustrating it can be. SemperOss had to relay commands to me through IRC. I would pose questions until I was certain of the correct command syntax and intention. Next, I would issue the command and report back the results; again in IRC. On several occasions, chromas piped up with critical observations and suggestions — plus some much-needed humorous commentary! It could have been an exercise in frustration with worn patience and frazzled nerves. In reality, there was only professionalism as we pursued various possibilities and examined outcomes.)

From the fact that we were receiving 50x errors, SemperOss surmised we probably had a problem with nginx. We looked at the logs on sodium (which runs Ubuntu), one of our two load balancers, but nothing seemed out of the ordinary. Well, let's try the other load balancer, on magnesium (running Gentoo). Different directory structure, it seems, but we tracked down the log files and discovered that access.log had grown to over 8GB... and thus depleted all free space on /dev/root, the main file system of the machine.
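For reference, the hunt for the space hog boiled down to a couple of stock commands (the nginx log path shown here is the usual default and is an assumption on my part):

    # Which filesystem is full?
    df -h

    # Which directories under /var are eating the space?
    du -xh /var 2>/dev/null | sort -h | tail -20

    # And the culprit itself
    ls -lh /var/log/nginx/access.log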

That's not a good thing, but at least we finally knew what the problem was!

Problem Resolution: So, we renamed the original access.log file and created a new one for nginx to write to. Next up came a search for a box with sufficient space to copy the file to. SemperOss reported more than enough free space on boron. We had a few hiccups with ACLs and rsync, so we moved the file to /tmp and tried rsync again, which resulted in the same ACL error messages. Grrrr. SemperOss suggested I try pulling the file over to /tmp on boron using scp. THAT worked! A few minutes later, the copy was complete. Yay!
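The transfer itself was unremarkable; roughly this, run from boron (the renamed file name is illustrative):

    # Pull the renamed log from magnesium into /tmp on boron over ssh
    scp magnesium:/tmp/access.log.old /tmp/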

But we still had the original, over-sized log file to deal with. No problemo. I ssh'd back over to magnesium and did an rm of the old (renamed) access.log and... we were still at 100% usage. Doh! We needed to bounce nginx so it would release its hold on the file's inode before the space could actually be reclaimed. Easy peasy; /etc/init.d/nginx restart and... voila! We were back down to 67% in use.
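(A note for anyone following along at home: lsof will confirm when a deleted file is still pinned open by a running process, which is exactly the state we were in before the restart. We did not need it this time, but it is a handy check:)

    # List open files whose on-disk link count is zero (deleted but still held open)
    lsof +L1

    # Or look specifically at what nginx is holding
    lsof -p "$(pidof nginx | tr ' ' ',')" | grep -i deleted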

Finally! Success! We're done, right?
Umm, no.

Did you see what we missed? The backup copy of access.log was now sitting in /tmp on boron, which means the next system restart would wipe it. So, a simple mv from /tmp to my ~/tmp and the file was in a safe place.

By 1630 UTC, we had performed checks loading various RSS and Atom feeds, and all seemed well. We were unable to reproduce the 50x errors, either.

And we're still not done.

Why/how did the log file get so large in the first place? There was no log rotation in place for it on magnesium. That log file had entries going back to 2017-06-20. At the moment, we have more than sufficient space to allow us to wait until TMB returns from vacation. (We checked free disk space on all of our servers.) The plan is we will look over all log files and ensure rotation is in place so as to avoid a recurrence of this issue.
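As a sketch of what that rotation could look like (a hypothetical logrotate snippet; the real path, schedule, and retention will be whatever the team settles on when TMB is back):

    # /etc/logrotate.d/nginx  (illustrative only)
    /var/log/nginx/*.log {
        weekly
        rotate 8
        compress
        delaycompress
        missingok
        notifempty
        sharedscripts
        postrotate
            # have nginx reopen its logs so the old inode is actually released
            /etc/init.d/nginx reload >/dev/null 2>&1 || true
        endscript
    }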

Problem Summary: An oversized log file consumed all free space on one of our servers. We believe we have fixed it, that all services are now functioning properly, and that all issues have been attended to.

Conclusion: Please join me in thanking chromas and SemperOss for all the time they gave up on a Sunday to isolate the problem and come up with a solution. Special mention to Fnord666 who, we later learned, was silently lurking and willing to jump in had he sensed we needed any help. Thank you for having our backs! Further, please join me in publicly welcoming SemperOss to the team and wishing him well in his efforts here!

Lastly, this is an all-volunteer, non-commercial site — nobody is paid anything for their efforts in support of the site. We are, therefore, entirely dependent on the community for financial support. Please take a moment and consider subscribing to SoylentNews, whether with a new subscription, by extending an existing subscription, or by making a gift subscription to someone else on the site. Any amount entered in the payment amount field above and beyond the minimum is especially appreciated!


Original Submission

  • (Score: 4, Insightful) by Slartibartfast on Monday August 20 2018, @11:58AM (21 children)

    by Slartibartfast (5104) on Monday August 20 2018, @11:58AM (#723727)

    I *always* put /var/log on a different partition on production systems.

    $.02.
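    For illustration, the sort of /etc/fstab entry that does it (device name and options are hypothetical):

        # /etc/fstab -- give logs their own filesystem so they can't fill /
        /dev/vg0/varlog   /var/log   ext4   defaults,noatime   0  2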

    • (Score: 2) by isostatic on Monday August 20 2018, @12:25PM (13 children)

      by isostatic (365) on Monday August 20 2018, @12:25PM (#723731) Journal

      I always have monitoring flagging up when the drive is filling up

      • (Score: 2) by martyb on Monday August 20 2018, @01:12PM (9 children)

        by martyb (76) Subscriber Badge on Monday August 20 2018, @01:12PM (#723745) Journal

        I always have monitoring flagging up when the drive is filling up

        If you don't mind my asking, what do you use for monitoring?

        WAY back when, we had Icinga(sp?) in place, but the dev who installed and maintained it departed and it was finally disabled. From my perspective, it seemed pretty slick, but also a bit fiddly. Ideally, we need something that is pretty much set-and-forget. Would be nice to have a 'dashboard' where one could see the state-of-the-world at a glance. Possibly with indicators of: nominal, marginal, and critical.

        --
        Wit is intellect, dancing.
        • (Score: 2) by isostatic on Monday August 20 2018, @01:37PM (3 children)

          by isostatic (365) on Monday August 20 2018, @01:37PM (#723749) Journal

          Used nagios for donkey's years; it pulls in disk details via SNMP and goes critical when a partition hits 80% -- the standard preseed adds those requirements, including a randomised SNMP string, and a deb which calls to the central monitoring server and registers the new machine.

          Also have munin-node installed on each machine, which pulls and tracks tons of historical data, again installed by default, although I haven't added the automatic configuration yet. All the cool kids use ansible nowadays.
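          Not the SNMP route described above, but for a quick local equivalent the stock Nagios plugin can be run by hand (the plugin path varies by distro):

              # Warn below 20% free, go critical below 10% free, on /var/log
              /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /var/log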

          • (Score: 2) by RS3 on Monday August 20 2018, @02:45PM (1 child)

            by RS3 (6367) on Monday August 20 2018, @02:45PM (#723776)

            Too lazy / not enough time to research, but from what I remember:

            - ansible is a bit of work to install / configure, but better for big installs (many servers).

            - Nobody talks about webmin- I found it almost self-installs, finds things.

            - chef and puppet seem to be hot.

            - I know I tried puppet but I don't remember my conclusions...

            - same for nagios...

            • (Score: 4, Interesting) by isostatic on Monday August 20 2018, @04:50PM

              by isostatic (365) on Monday August 20 2018, @04:50PM (#723820) Journal

              I use clusterssh for many servers - I just installed a new munin plugin (via apt from a local repo) that I'd written onto 39 machines across 4 continents. 38 worked fine, but one of them didn't. Took about 30 seconds, and each one has an entry in the auth.log file showing me logging in, then running sudo apt-get update, then sudo apt-get install my-plugin.

              If someone has a problem with the server next week they will know exactly what was run, and where, rather than "sudo /tmp/ansible.asdfaiib9", with that /tmp file no longer existing. Far easier to debug. Additionally, in my experience with ansible and apt, things do go wrong, and it's a right pain to work out why.

              But I wouldn't want to use clusterssh to manage 1300 machines -- indeed I didn't in my old job, but then change control procedures are so onerous nothing actually gets done in bulk so it makes no difference.
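              For anyone who hasn't used it, the workflow is roughly this (hostnames are made up; my-plugin is the package mentioned above):

                  # clusterssh opens one terminal per host plus a single shared input window
                  cssh web01.example.com web02.example.com web03.example.com

                  # typed once, echoed to every host:
                  sudo apt-get update && sudo apt-get install my-plugin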

          • (Score: 2) by boltronics on Tuesday August 21 2018, @08:43AM

            by boltronics (580) on Tuesday August 21 2018, @08:43AM (#724082) Homepage Journal

            I have a similar approach. I use icinga2 (having the node installed in the image, not using SNMP) and munin-node too. I also make good use of rsyslog and RainerScript to manage logging centrally. Everything is managed with Salt.

            --
            It's GNU/Linux dammit!
        • (Score: 3, Funny) by Fnord666 on Monday August 20 2018, @02:18PM

          by Fnord666 (652) on Monday August 20 2018, @02:18PM (#723761) Homepage
          Is anyone using collectd and cacti?
        • (Score: 2) by RS3 on Monday August 20 2018, @02:37PM (1 child)

          by RS3 (6367) on Monday August 20 2018, @02:37PM (#723774)

          As mentioned above, it's good to partition the drive so that /tmp, /var/log, /home, /var/spool, /var/lib/mysql (really, many /var/lib directories), and possibly many others are separate, depending on the use of the machine. This can get messy. On some servers I just have a 10-100x oversized disk and put most stuff in one partition (or lvm if the distro insists...) and keep an eye on it.

          lvm is messy for me, but you can resize things if needed. (Not a big fan of lvm. I don't get the need. gparted does a nice job if needed. I know - you can add disks and span with lvm. I'm sure it's a panacea for many - I just haven't needed it.)

          I haven't yet used quota, but it will cap usage at a user's allowed limit. Probably not useful here.

          Another approach (I haven't yet used): start with one big partition and create files that are virtual filesystems and mount them. It's pretty easy to change, resize, etc.

          I typically log in at least daily, run: df, free, uptime, ps aux, alpine, yum -v update, and a few others depending...

          I do what you did: start new log files for apache, etc., but gzip (or whatever) the old ones. They're text, so the compression ratio is 'uge.

          I'm sure there's something to automate that, but I don't know it offhand; it would be fairly easy to write a cron script to start new log files and compress the old ones, even if it's fairly static, i.e. hard-coded directories and file names.
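          Something like this, roughly (everything here is illustrative: the log path is hard-coded as described, and the graceful reload is only needed if the daemon keeps the file open):

              #!/bin/sh
              # naive nightly rotation for a single hard-coded log file
              LOG=/var/log/httpd/access_log
              STAMP=$(date +%Y%m%d)
              cp "$LOG" "$LOG.$STAMP" && : > "$LOG"    # copy, then truncate in place
              gzip "$LOG.$STAMP"
              # apachectl graceful                     # uncomment if the daemon should reopen its logs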

          I've messed with some automations but haven't kept them- not enough servers running to justify, and I don't like what they do to my config files. webmin, ansible, chef, puppet, etc. https://www.linuxtechi.com/top-7-tools-automate-linux-admin-task/ [linuxtechi.com] and there are more.

          If you're running a major hosting server where accounts are being generated and removed frequently and/or by non-admins, you'd use one of the above mentioned fine automations.

          Bottom line: if and when I get time I'll be trying some of the above automations on one of my test machines.

          • (Score: 2) by boltronics on Tuesday August 21 2018, @08:33AM

            by boltronics (580) on Tuesday August 21 2018, @08:33AM (#724081) Homepage Journal

            LVM is wonderful. I manage a bunch of Xen servers all running different types of DomUs, so LVM makes it a snap to reallocate space for those as required.

            I also found it a life saver when upgrading an old Asterisk PBX that has dedicated hardware (this was some years back). The upgrade required more changes to our Asterisk server than I had expected and I didn't have the time to immediately figure it out, but I fortunately took an LVM snapshot prior to running a dist-upgrade. Rolling back was just a matter of adjusting a kernel boot argument in Grub until the problem was researched (which could be analysed by mounting the original upgraded volume). Far quicker than booting from a live thumb drive and restoring a recent backup over the network!

            My usual plan of attack when using LVM is to create logical volumes for everything of a reasonable size, and monitor everything with Icinga2. When warnings start showing up, I check if everything looks reasonable and fix the issue or increase the logical volume size and expand the filesystem (which ext4 can easily do online). Without LVM, it's difficult to manage allocating some random partition extra space from your HDD if required, unless the partition either happens to be just before the free space, or you're using a filesystem that supports multiple block devices, or using mdadm to do it or something else equally hacky.
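            For reference, the grow-a-volume dance described here is only a couple of commands (volume names are hypothetical; ext4 can be grown while mounted):

                # add 5 GiB to the logical volume, then grow the ext4 filesystem online
                lvextend -L +5G /dev/vg0/varlog
                resize2fs /dev/vg0/varlog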

            You could go with other solutions (btrfs, ZFS, etc.) but those are not as common as LVM, and AFAIK won't help when it comes to other filesystems such as swap (unless you have a swap image in a file on your filesystem...).

            --
            It's GNU/Linux dammit!
        • (Score: 1, Funny) by Anonymous Coward on Monday August 20 2018, @03:30PM

          by Anonymous Coward on Monday August 20 2018, @03:30PM (#723796)

          zabbix is quite simple to setup

        • (Score: 1, Interesting) by Anonymous Coward on Tuesday August 21 2018, @04:15PM

          by Anonymous Coward on Tuesday August 21 2018, @04:15PM (#724237)

          A script that runs every 5 minutes to check free disk space, and every 30 minutes for very large files, sends a report by email and uploads it to a DB, which outputs to an HTML page showing % free disk space and highlighting disks with less than 20% free, with different colors and flashing as it gets close to 1%. These modern tools are nice, but for this kind of thing it's possible to hack together something quickly that runs on a schedule.
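          The disk-space half of that really is only a few lines; a bare-bones sketch (threshold and recipient are made up):

              #!/bin/sh
              # mail a warning for any filesystem over 80% full
              FULL=$(df -P | awk 'NR>1 && $5+0 > 80 {print $6, $5}')
              [ -n "$FULL" ] && printf '%s\n' "$FULL" | mail -s "Disk space warning on $(hostname)" admin@example.com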

      • (Score: 1, Interesting) by Anonymous Coward on Monday August 20 2018, @02:14PM (2 children)

        by Anonymous Coward on Monday August 20 2018, @02:14PM (#723758)

        If SHTF, the partition may fill up quicker than your monitoring can flag.
        On really important systems I would consider a dedicated syslog server.

        • (Score: 2) by isostatic on Monday August 20 2018, @04:56PM (1 child)

          by isostatic (365) on Monday August 20 2018, @04:56PM (#723823) Journal

          Crucial logs get syslogged out, although as UDP, which by its nature isn't 100% reliable (and of course the syslog servers themselves can keel over).

          Had bad experience with syslog over tcp in the past.
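          For reference, in rsyslog's classic syntax the UDP/TCP choice is a one-character difference (hostname is illustrative):

              # /etc/rsyslog.d/remote.conf -- forward everything to a central box
              *.* @loghost.example.com:514     # single @ = UDP, fire and forget
              #*.* @@loghost.example.com:514   # double @@ = TCP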

          I have a lot of bespoke monitoring though which isn't really suitable for syslog. I've looked into elasticsearch, but the output just isn't as manipulable as text files with grep/awk/perl/etc.

          • (Score: 0) by Anonymous Coward on Monday August 20 2018, @07:39PM

            by Anonymous Coward on Monday August 20 2018, @07:39PM (#723881)

            Had bad experience with syslog over tcp in the past.

            Elaborate?

    • (Score: 3, Informative) by zocalo on Monday August 20 2018, @12:51PM (3 children)

      by zocalo (302) on Monday August 20 2018, @12:51PM (#723741)
      Yep, good practice from the old days and still just as relevant today - along with separate /home and /tmp partitions to prevent each of the most common types of disk space chewing screw-up from bringing down the whole system. You used to be able to have a dedicated /usr partition on *NIX boxen as well, mounted readonly (and sometimes even configured as the only partition the system would execute code from!) to help prevent bad code getting into the system as well, but then along came systemd and a dev with zero understanding of the reason core unix design principles are the way they are...
      --
      UNIX? They're not even circumcised! Savages!
      • (Score: 1, Informative) by Anonymous Coward on Monday August 20 2018, @02:26PM (1 child)

        by Anonymous Coward on Monday August 20 2018, @02:26PM (#723767)

        Not quite. First, there is still good unix around that does not use SystemD. Second, usr was for stuff that was only needed after boot up, and could be put on an nfs server and shared. Directories elsewhere contain executables, like sbin for privileged executables.
        Pretty much everything except for parts of var and tmp can be made read only; current Linux helps with the var bit by linking into run, a memory file system. tmp can be made to live there too.

        • (Score: 1, Informative) by Anonymous Coward on Monday August 20 2018, @10:06PM

          by Anonymous Coward on Monday August 20 2018, @10:06PM (#723944)

          /sbin is for statically linked executables.

      • (Score: 1, Funny) by Anonymous Coward on Monday August 20 2018, @02:59PM

        by Anonymous Coward on Monday August 20 2018, @02:59PM (#723781)

        You can still run stuff without x bit, though.

    • (Score: 2) by martyb on Monday August 20 2018, @01:26PM (2 children)

      by martyb (76) Subscriber Badge on Monday August 20 2018, @01:26PM (#723748) Journal

      I *always* put /var/log on a different partition on production systems.
      $.02.

      That makes sense! It leads me to wonder if there is an agreed-upon partitioning scheme that leads to the least possible hurt?

      (I say 'hurt', because solving one problem (out-of-space) might lead to creating another problem WRT where programs expect to be able to find things.)

      --
      Wit is intellect, dancing.
      • (Score: 2) by isostatic on Monday August 20 2018, @02:04PM (1 child)

        by isostatic (365) on Monday August 20 2018, @02:04PM (#723755) Journal

        Containing an out-of-space issue to one partition is a tradeoff: any given partition becomes more likely to run out of space.

        If a partition fills up, I'd rather the machine shut itself down than continued in an unknown state

        • (Score: 2) by Slartibartfast on Monday August 20 2018, @02:12PM

          by Slartibartfast (5104) on Monday August 20 2018, @02:12PM (#723757)

          Different strokes for different circumstances, of course, but I prefer the system up, and Nagios alarming, over a potentially critical, service-impacting system going down arbitrarily. If the logs are full, it's *real* likely the RCA points at whatever is spamming the log.

  • (Score: 5, Informative) by Fnord666 on Monday August 20 2018, @12:31PM (2 children)

    by Fnord666 (652) on Monday August 20 2018, @12:31PM (#723732) Homepage
    Major kudos to martyb, SemperOss and chromas for taking time on their Sunday to chase this issue down and get it resolved quickly. As martyb has said, this is an all-volunteer site. It's not just a group of volunteers though, it's a team. They work together, often wearing multiple hats at any given time, to get the job done and keep our site running smoothly. Please accept my heartfelt thanks to you and everyone that keeps this site up and running.
    • (Score: 4, Insightful) by kazzie on Monday August 20 2018, @02:09PM (1 child)

      by kazzie (5309) Subscriber Badge on Monday August 20 2018, @02:09PM (#723756)

      Seconded. Also to martyb for the professional yet enthralling writeup!

      • (Score: 2) by AnonTechie on Monday August 20 2018, @08:25PM

        by AnonTechie (2275) on Monday August 20 2018, @08:25PM (#723897) Journal

        Thirded .. is that even a word ? I found the problem description and the path to a solution quite an interesting read. Appreciate the hard work of the volunteers who devoted their Sunday to resolve SN website problems. All the best guys ...

        --
        Albert Einstein - "Only two things are infinite, the universe and human stupidity, and I'm not sure about the former."
  • (Score: 4, Interesting) by RamiK on Monday August 20 2018, @03:11PM (2 children)

    by RamiK (1813) on Monday August 20 2018, @03:11PM (#723788)

    See here for the ticket: https://bugzilla.mozilla.org/show_bug.cgi?id=1477667#c0 [mozilla.org]

    And here for an alternative: https://addons.mozilla.org/en-US/firefox/addon/livemarks/ [mozilla.org]

    There should be others aside from the standalone readers.

    --
    compiling...
    • (Score: 2) by takyon on Monday August 20 2018, @06:20PM (1 child)

      by takyon (881) <takyonNO@SPAMsoylentnews.org> on Monday August 20 2018, @06:20PM (#723859) Journal

      In previous comments, I've noted that Live Bookmarks is one of the few things that Firefox does right over Chrome. It just worked, no fancy (or buggy, or spying) extension needed. No ugly XML output by default when opening links to RSS feeds.

      It's not sustainable and nobody wants to maintain it in current shape, nor we have the resources to write it from scratch (and maintain it then).

      That they can't figure out how to maintain the code for it is really telling. Those hundreds of millions of dollars Mozilla has extracted from Google and Yahoo! must be spent on hookers and blow.

      As pointed out, it's only a matter of time before news sites remove RSS buttons over this. Although Firefox doesn't have the usage share clout it once had, it still clocks in at #2 or #3 and Chrome already does nothing with feed XML.

      And of course they can't take any criticism on bug reports, so they delete comments for "advocacy" and lock it down.

      --
      [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
      • (Score: 3, Insightful) by RamiK on Monday August 20 2018, @07:06PM

        by RamiK (1813) on Monday August 20 2018, @07:06PM (#723874)

        As pointed out, it's only a matter of time before news sites remove RSS buttons over this.

        Consider the other side of the coin: Doing away with feeds would potentially do wonders for old-school news aggregator sites / communities like Soylent :)

        --
        compiling...
  • (Score: 2) by ilPapa on Monday August 20 2018, @05:43PM

    by ilPapa (2366) on Monday August 20 2018, @05:43PM (#723843) Journal

    You guys do good work. Carry on.

    --
    You are still welcome on my lawn.
  • (Score: 2) by NewNic on Monday August 20 2018, @08:40PM

    by NewNic (6420) on Monday August 20 2018, @08:40PM (#723905) Journal

    Add "logrotate" to your USE variable, then re-emerge all affected packages.

    emerge --newuse @world.
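    Concretely, something like this (the make.conf line is purely illustrative; a per-package entry in /etc/portage/package.use works as well):

        # /etc/portage/make.conf -- add logrotate to the existing USE line
        USE="<existing flags> logrotate"

        # rebuild anything whose USE flags changed
        emerge --ask --newuse @world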

    --
    lib·er·tar·i·an·ism ˌlibərˈterēənizəm/ noun: Magical thinking that useful idiots mistake for serious political theory