SoylentNews is people

Meta
posted by martyb on Sunday February 14 2021, @04:47AM
from the Constants-aren't-and-variables-won't dept.

[2021-02-14 15:53:00 UTC: UPDATE added need to check apache log before doing a slash -restart]

We seem to have experienced some difficulties with the SoylentNews site.

I've noticed that both the number of hits and comments for each story do not seem to be updating.

Corrective measures taken:

  1. "Bounce" the servers: I doubted it would help, but it does no harm to try, so why not? As expected, it did not help.
    This is my personal "bounce" script:
    cat ~/bin/bounce

    #!/bin/bash
    servers='hydrogen fluorine'
    for server in ${servers} ; do echo Accessing: ${server} &&  rsh ${server} /home/bob/bin/bounce ; done

    Which, in turn, runs the following script on each of the above servers:

    cat /home/bob/bin/bounce

    #!/bin/bash
    sudo /etc/init.d/varnish restart
    sudo -u slash /srv/soylentnews.org/apache/bin/apachectl -k restart

  2. Restart slash: For those who are unaware, slash has its own internal implementation of what is, effectively, cron. It periodically fires off tasks that support the site's operations. Restarting it can have side effects, though, so first check the apache error_log.

    # Go to the appropriate server:
    ssh fluorine
    # Ensure the apache log is not showing issues:
    tail -f /srv/soylentnews.org/apache/logs/error_log
    # Restart slash:
    sudo /etc/init.d/slash restart
    >> slashd slash has no PID file
    >> Sleeping 10 seconds in a probably futile attempt to be clean: ok.
    >> Starting slashd slash: ok PID = 3274

    NB: this failed to run to a successful conclusion when I originally tried it a few hours ago. I gave it one more try while writing this story... it seemed to run okay this time?!
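
    The "check the log first" step can be sketched as a small guard. This is only an illustration: the function name log_looks_clean, the 50-line window, and the severity pattern are assumptions, not part of the site's actual tooling. The pattern is meant to match apache's "[error]" style markers and the newer "[module:error]" form.

```shell
#!/bin/bash
# Sketch only: decide whether the tail of a log looks healthy enough
# to go ahead with a restart. Name, window and pattern are assumptions.

# log_looks_clean FILE [LINES]
# Exit 0 if none of the last LINES lines carry an error-or-worse marker,
# exit 1 if at least one does, exit 2 if FILE cannot be read.
log_looks_clean() {
    local logfile=$1
    local lines=${2:-50}
    [ -r "$logfile" ] || return 2
    if tail -n "$lines" "$logfile" |
        grep -Eq '\[([a-z_0-9]+:)?(error|crit|alert|emerg)\]'; then
        return 1
    fi
    return 0
}

# Possible use on the server, with the paths from the steps above:
#   log_looks_clean /srv/soylentnews.org/apache/logs/error_log &&
#       sudo /etc/init.d/slash restart
```

    A guard like this only looks at recent lines, so an old error further up the log will not block the restart; reading the log by eye, as above, remains the safer check.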

Things appear to be running okay now. Please reply in the comments if anything else is amiss. Alternatively, mention it in the #dev channel on IRC (Internet Relay Chat), or send an email to admin (at) soylentnews (dot) org.

We now return you to the ongoing discussion of: teco or ed?



 
This discussion has been archived. No new comments can be posted.
  • (Score: 0) by Anonymous Coward on Monday February 15 2021, @05:18PM (#1113205)

    I don't mean to sound grumpy, but how fucking difficult is it to document your interdependencies and write a runbook?

    Figure it out! What depends upon what? Draw some diagrams. You built this thing and you can't draw a simple hierarchical diagram of its operation?

    Basically, you are looking for dependencies.

    I prefer to illustrate the dependencies as a 'stack', in case I am being managed by someone who understands wedding cakes better than they do software and hardware interdependencies.

    At the bottom of the stack is the grounding. No reliable ground, no reliable power. Several times I've been involved in diagnosing computer problems in buildings built along the bayshore. Those salty marshes and tides play hell with electrical grounding.

    Next is the power. You can't boot without power. It can't be spiky power and the power needs to be delivered as sine waves, not triangles or squares.

    Next is the hardware. When is the last time you ran memtest86 on your servers? When is the last time you did a dd(1) of each hard drive in its entirety to assure yourself there were no bad blocks in use? Do you have diagnostic CDROMs for your servers? Do you ever use them? Memtest86 is the FIRST thing I do on ALL of my computers.
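
    A full-drive dd(1) read can be wrapped in a few lines. This is a rough sketch: the function name read_check and the device names in the usage comment are placeholders, and bs=1M is the GNU dd spelling of the block size.

```shell
#!/bin/bash
# Sketch only: read a device (or file) from start to finish and throw
# the data away. dd exits nonzero if a block cannot be read.

# read_check TARGET
read_check() {
    local target=$1
    if dd if="$target" of=/dev/null bs=1M 2>/dev/null; then
        echo "read ok: $target"
    else
        echo "READ ERROR: $target" >&2
        return 1
    fi
}

# Hypothetical use, as root; the device names are placeholders:
#   for disk in /dev/sda /dev/sdb; do read_check "$disk"; done
```

    Because it only reads, this can be run against a mounted drive, though it will compete with the server's normal disk traffic while it runs.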

    Make sure the computer isn't clogged with dust! Make sure the fans are running! Make sure the hard drives aren't making horrible noises, too.

    Next, system resources. Make sure you aren't running out of disk space! Cleanup can and should be automated.
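
    The disk space check is easy to script. A minimal sketch, assuming POSIX "df -P" output (use percentage in column 5, mount point in column 6) and a made-up function name, usage_over_threshold; mount points containing spaces would need more careful parsing. It only reports; what to clean up is site-specific.

```shell
#!/bin/bash
# Sketch only: list filesystems at or above a given use percentage.

# usage_over_threshold LIMIT
# Reads `df -P` output on stdin and prints "MOUNTPOINT PERCENT%" for
# every filesystem whose use percentage is >= LIMIT.
usage_over_threshold() {
    local limit=$1
    awk -v limit="$limit" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)          # "95%" becomes "95"
            if (pct + 0 >= limit) print $6 " " pct "%"
        }'
}

# Possible use from cron:
#   df -P | usage_over_threshold 90
```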

    Next, the database. No point in starting a web server when your database is tits up. The database comes before the web server.

    Next, the business logic (AKA 'middleware'). Are you using Java? Is the JVM running? Make sure the business logic is in communication with the database. Refer to your diagram. Identify test points and create tests for those test points. Automate it. You should be able to run a shell script and see that your business logic is working, that all of the required processes are in the process table and acting normally. Ideally it should be written as an /etc/init.d or /etc/rc.d script. Refer to other such scripts for tips on how to achieve quality start/stop scripts.
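
    One way to script the "is it in the process table" test is via PID files. The sketch below assumes each service writes its PID to a file; the function names and the paths in the usage comment are hypothetical. Note that kill -0 sends no signal and only checks that the process exists, and that it also fails for processes owned by another user unless run as root.

```shell
#!/bin/bash
# Sketch only: check that the process named in each PID file is alive.

# pid_alive PIDFILE
# Exit 0 if PIDFILE holds a numeric PID that is in the process table.
pid_alive() {
    local pidfile=$1 pid
    [ -r "$pidfile" ] || return 1
    pid=$(head -n 1 "$pidfile")
    case $pid in
        ''|*[!0-9]*) return 1 ;;   # empty or not a plain number
    esac
    kill -0 "$pid" 2>/dev/null
}

# check_services NAME=PIDFILE...
# Prints one status line per service; exits nonzero if any is down.
check_services() {
    local failed=0 spec
    for spec in "$@"; do
        if pid_alive "${spec#*=}"; then
            echo "ok: ${spec%%=*}"
        else
            echo "DOWN: ${spec%%=*}"
            failed=1
        fi
    done
    return "$failed"
}

# Hypothetical use (these paths are placeholders, not real locations):
#   check_services db=/var/run/db.pid middleware=/var/run/middleware.pid
```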

    (Odds are good that the business logic is where the startups get complicated; that would indicate that a better understanding of your business logic's interdependencies is called for. It may also be appropriate to invest in a Nagios server, to monitor interdependencies in a graphical fashion. And some cron jobs, to make sure certain pieces are running and to restart them if they are not.)

    Finally, the web server. If you're convinced you have correctly started your database and your business logic is working correctly and you have content to serve, then you can start your web server.
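
    The ordering above (database, then business logic, then web server) can be enforced by a runner that stops at the first failure. A sketch, with the caveat that start_in_order and the step names in the usage comment are invented for illustration; each step is assumed to be a command or shell function that exits nonzero when its layer did not come up.

```shell
#!/bin/bash
# Sketch only: run startup steps bottom-up and refuse to continue
# past a layer that failed.

# start_in_order STEP...
start_in_order() {
    local step
    for step in "$@"; do
        echo "starting: $step"
        if ! "$step"; then
            echo "FAILED: $step; not starting anything above it" >&2
            return 1
        fi
    done
    return 0
}

# Hypothetical use, where each name is a function you define yourself:
#   start_in_order start_database start_middleware start_webserver
```

    Keeping the order in one argument list means the runbook and the script cannot drift apart: the list is the stack diagram, read bottom to top.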

    There are other processes I have not addressed, such as DNS and user authentication. If you are using an LDAP database to manage users, for instance, and that creates another dependency (i.e., you can't start processes until you can log in, and you can't log in until the LDAP database is restarted), then you need to include those in your diagrams, startup sequences and runbooks.

    Programmers adore complexity and messing with new versions but sysadmins adore consistency and reliability. When programmers are in charge of things, they tend to get horribly complicated, and when things go tits up, programmers tend to stand around and say "it SHOULD do this", relying upon some written document somewhere, whereas the sysadmin will observe, "it is NOT doing what it says it will do", and will happily rip it out and replace it with a small shell script, which is more reliable.

    I haven't followed Soylent News' architectural design that closely but I hope there is a staging environment, and maybe a bug-tracking infrastructure.

    My $0.02

    The goal of your documentation should be to make it simple for you to restart the system after a night of heavy drinking OR to walk a clever ten-year-old child through doing the same thing.