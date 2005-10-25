We are aware of the significant number of 50x responses that users are experiencing from the site. The problem was recognised about 1 week ago and there is much investigative work going on behind the scenes.
The actual cause is difficult to identify. As of Saturday there is new software in place which is trying to find how often the 50x responses occur, while trying to correlate the occurrences with other functions in Rehash. This is a time-consuming process. Some users have been assisting by reporting on IRC #soylent when they receive such a response. If you would like to help, please report when the 50x response was received, with a precise time so that we can find the corresponding query in the server logs; exactly what you were doing that appeared to trigger it; and how long the problem lasted. If you also know your own IP address it would be very helpful, but we understand that many of you will be reluctant to give this information.
In most cases the problem clears itself in less than 10 seconds but there have been periods of unresponsiveness that have lasted several minutes in some rare cases.
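A precise time matters because it lets a report be matched against the server logs. The sketch below shows one way such a lookup could work. It assumes an Apache "combined"-style access log, which may not be what the site actually writes; the function name `find_5xx_near`, the sample IP addresses, and the request paths are all made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed Apache "combined"-style line; the real log format may differ.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) '
)


def find_5xx_near(lines, reported_at, window_seconds=30):
    """Return (timestamp, ip, request, status) for 5xx entries near reported_at."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        status = int(m.group("status"))
        if not 500 <= status <= 599:
            continue
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        if abs(ts - reported_at) <= window:
            hits.append((ts, m.group("ip"), m.group("request"), status))
    return hits


# Hypothetical log excerpt (documentation-range IPs, invented requests).
SAMPLE_LOG = [
    '203.0.113.7 - - [05/Oct/2025:17:03:10 +0000] "GET /article.pl?sid=1 HTTP/1.1" 200 5120',
    '203.0.113.7 - - [05/Oct/2025:17:03:42 +0000] "GET /comments.pl?sid=1 HTTP/1.1" 503 312',
    '198.51.100.9 - - [05/Oct/2025:17:20:00 +0000] "GET /index.pl HTTP/1.1" 502 312',
]

# A user reports an error at "about 17:03:30 UTC".
reported = datetime(2025, 10, 5, 17, 3, 30, tzinfo=timezone.utc)
matches = find_5xx_near(SAMPLE_LOG, reported)
for ts, ip, request, status in matches:
    print(ts.isoformat(), ip, status, request)
```

With a 30-second window only the 503 entry is matched; a report of "sometime Sunday afternoon" would need a window so wide that it stops being useful, which is why the exact time (and, if offered, the IP address) helps.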
FIXED - at least until we find out that it isn't.... [Added at 2025-10-05 19:00Z--JR] See also here.
(Score: 3, Funny) by SomeGuy on Sunday October 05, @05:03PM (1 child)
So it's not our glorious over hyped AI overlords training on how to produce endless mind-curdling streams of rancid vile?
(Score: 2) by JoeMerchant on Sunday October 05, @10:34PM
I don't so much receive a 50x reply as I just get no response from the site.
However, it wouldn't hurt to ask Claude Code if it has any insight into the problem. Be prepared to cough up $20 for the monthly Pro plan (or try a free month on Alphabet's competing Jules product). Given access to read a copy of the site configuration, it might just have something helpful to say. I wouldn't let it modify the config without extensive review...
(Score: 2) by istartedi on Sunday October 05, @05:04PM
Glad to know you're on it. Happy to post those "Guru meditation" numbers somewhere if you think it'll help.
(Score: 2) by driverless on Sunday October 05, @05:05PM
Can't say I've noticed any myself, maybe I've just been lucky.
(Score: 5, Interesting) by RS3 on Sunday October 05, @05:41PM (4 children)
For me it's mostly just nothing - no response. Otherwise, it's most often an error message from Varnish Cache, saying something about a timeout on the backend server. If / when it happens again I'll capture the page and post it here with the date + time.
As mentioned elsewhere there may be a problem with database connections not being closed. I'm no expert on rehash admin, but there should be a timeout setting for database connections.
I do admin web and WordPress servers, and there are some bugs that keep database connections open. It doesn't happen enough or long enough to cause a problem for the systems / sites I admin, but it's always bugged me that it happens.
My hunch is that it's kind of like race conditions: some Internet router / packet timeouts cut off the TCP connection and the webserver (Apache) doesn't know the client closed the page, so webserver (php / perl) keeps the database connection open. This reminds me of many other similar problems with file sharing / contention. Again, it might be as simple as setting a timeout.
BTW, what I've done in the systems I admin is set a low number of allowed connections to prevent the database from spawning too many processes and gobbling up all RAM. I don't use varnish nor any other cache other than memcached which seems to work (and a php cache for the WordPress). I don't look too deeply in the server logs to see how much the sites are getting tagged by crawlers. At some point I may have to start blocking the AI crawlers. I'm hoping someone is making some kind of software / plugin / firewall rules / something to block them.
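The two ideas in this comment (cap the number of connections, and make sure each one is closed even when a request goes wrong) can be sketched in a few lines. This is an illustration only, not how Rehash or Apache handles connections: `BoundedPool`, `PoolExhausted`, and `FakeConnection` are hypothetical names, and the fake connection stands in for a real database handle.

```python
import threading
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no connection slot frees up within the timeout."""


class BoundedPool:
    def __init__(self, connect, max_connections, acquire_timeout):
        self._connect = connect  # hypothetical factory returning a connection-like object
        self._slots = threading.BoundedSemaphore(max_connections)
        self._acquire_timeout = acquire_timeout

    @contextmanager
    def connection(self):
        # Wait a bounded time for a slot instead of piling up forever.
        if not self._slots.acquire(timeout=self._acquire_timeout):
            raise PoolExhausted("no free database connection slot")
        conn = None
        try:
            conn = self._connect()
            yield conn
        finally:
            # Always close and release, even if the request handler raised.
            try:
                if conn is not None:
                    conn.close()
            finally:
                self._slots.release()


class FakeConnection:
    """Stand-in for a real database connection (illustration only)."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


pool = BoundedPool(FakeConnection, max_connections=2, acquire_timeout=0.05)

with pool.connection() as first:
    with pool.connection() as second:
        try:
            with pool.connection():
                third_refused = False
        except PoolExhausted:
            third_refused = True  # the cap of 2 was enforced

# Both slots are free again here, and the earlier connections were closed.
with pool.connection() as again:
    reusable = not again.closed
```

The trade-off is the one described above: a low cap protects RAM, but once every slot is busy, further requests fail fast instead of queueing, so the cap has to be sized against real traffic.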
(Score: 3, Informative) by JoeMerchant on Sunday October 05, @11:17PM (3 children)
I've had once or twice where it was really slow but eventually responded like normal.
Then I've had three or four times like your cache message.
The vast majority of the time I just don't get a response at all. I like using SN in part because it is usually fast to load and light on bandwidth. Lately, if it hasn't loaded after 5 seconds I just assume it is borked again.
(Score: 1) by anubi on Monday October 06, @06:32AM (2 children)
I was having identical response last week.
It's been working great today.
Same backend error page about a varnish server.
Or the standard browser timeout.
(Score: 3, Informative) by janrinok on Monday October 06, @06:38AM (1 child)
(Score: 2, Informative) by anubi on Monday October 06, @08:55AM
As many problems as I was having last week, I tried two different networks ( T-Mobile, TracFone ), four different browsers ( brave, Vivaldi, ddg, chrome ), multiple websites, and it was definitely correlating here. I checked Kolie's journal and saw sysadmins were already aware.
I did not want to pester anyone here with the obvious. My guess was a "drive-by software upgrade", or some AI decided to scrape the place.
You had to be on it. Up and down, up and down.
For days. Just like me and some loose wire in the transmission controller in my van.
Intermittent problem. It's as hard as catching mice. A lot of waiting, change something, and wait some more, hoping that is it, only it's not.
Thank you and the rest of SN for getting this place back up.
Another thing I thank you for is your tenacity.
Given some of the stuff I just saw you and some of the other sysadmins putting up with, I salute you. I don't have near your tolerance for that level of disrespect. I remain very grateful to the SN team for inviting my visits. I am old and this is the only social platform where I feel I fit in: A bunch of scientists of various specialties. I think a lot are a lot like me.
(Score: 5, Informative) by kolie on Sunday October 05, @06:00PM (7 children)
As of 10/5 1800 zulu it's been assumed that most of this has been resolved for the time being and we are back in a stable, quick, working condition.
(Score: 2) by RS3 on Sunday October 05, @06:37PM (4 children)
Thank you, so much. I'm curious what did you find / fix?
(Score: 5, Informative) by kolie on Sunday October 05, @08:27PM (3 children)
There is a lot of stuff that kind of all blends into one issue.
Most of the performance code in slashcode/rehash is off; there is very little in the way of caching/optimizations turned on. There is constant drift between the apache/perl of when rehash was written and the systems it's running on now - the base OS moves, and the assumptions apache/perl made when it was originally written drift. Memory usage shifts. The site used to run on a clustered DB and we've gone down to a single instance.
The 500s were tracked down to the site erroring because it failed to open database connections. Not much has changed on the site but suddenly 150 DB connections isn't enough; something's gone weird in the stack but it is what it is. We removed that limit. Now the site is using more resources, and all the stuff above compounds. Now resource limits are being hit - we get weird behavior, chunks loading, long responses, DB connections taking a while to complete because of higher numbers....
I made some optimizations in some places - throttled/blocked some anomalous site requests - set harder limits on some of the docker vm resources - set cpu limits on the backup queries - removed locking during backup sequences - optimized the storage layout - changed nice priority levels on various processes.
For reference the load numbers in top were north of 250-300 at some times. Initial changes got this down to 50 - and we are now seeing 2-3 again.
(Score: 2, Informative) by pTamok on Sunday October 05, @09:19PM
Thank you for working on the issue and taking the time and trouble to give an informative reply.
(Score: 2) by JoeMerchant on Sunday October 05, @11:19PM
I would suspect you have a DB connection "leak" (like a memory leak). Can you track the age / activity of the connections?
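One way to look for such a leak is to list connections that have sat idle for a long time. The sketch below assumes the data comes from something like MySQL's `SHOW PROCESSLIST`, where idle connections report the command `Sleep` and a time in seconds; whether the site's database exposes exactly this is an assumption. The `ConnectionInfo` type, the `suspected_leaks` helper, the user name, and the sample rows are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConnectionInfo:
    conn_id: int
    user: str
    command: str           # e.g. "Sleep" or "Query", as in a MySQL-style process list
    seconds_in_state: int  # how long the connection has been in that state


def suspected_leaks(connections, idle_threshold_seconds=300):
    """Return idle connections older than the threshold, oldest first."""
    suspects = [
        c for c in connections
        if c.command == "Sleep" and c.seconds_in_state >= idle_threshold_seconds
    ]
    return sorted(suspects, key=lambda c: c.seconds_in_state, reverse=True)


# Invented snapshot of a process list (not real data from the site).
snapshot = [
    ConnectionInfo(11, "rehash", "Query", 2),
    ConnectionInfo(12, "rehash", "Sleep", 45),
    ConnectionInfo(13, "rehash", "Sleep", 4200),
    ConnectionInfo(14, "rehash", "Sleep", 900),
]

leaks = suspected_leaks(snapshot)
for c in leaks:
    print(f"connection {c.conn_id} idle for {c.seconds_in_state}s")
```

Taking a snapshot like this every few minutes and watching whether the same connection ids keep ageing would distinguish a genuine leak from ordinary short-lived idle connections.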
(Score: 2) by jelizondo on Monday October 06, @02:10AM
Thank you. I know the time you spend troubleshooting for us could be spent on other activities and I wish I could help, but alas, I can't. So please, accept my thanks.
(Score: 2) by hubie on Sunday October 05, @07:26PM
Thank you!
(Score: 1) by khallow on Sunday October 05, @09:17PM
(Score: 0) by Anonymous Coward on Sunday October 05, @06:51PM (2 children)
I thought it couldn't handle how badass my comment was.
(Score: 3, Funny) by turgid on Sunday October 05, @07:33PM (1 child)
WHAT IS MY NAME?
(Score: 0) by Anonymous Coward on Sunday October 05, @07:46PM
We are Anonymous.
(Score: 2) by bzipitidoo on Monday October 06, @12:22AM (1 child)
I saw quite a few 503 errors over the past week, and just figured the site was down and that the SN team would soon have it up again. Wondered if the site had been DDOSed.
Yes, I'll make a note of it the next time I get such an error.
But if it's not an attack, I find it curious that such resource problems are happening at all, when they weren't earlier. Did SN recently upgrade the server side? New version of Apache, or the database?
(Score: 1, Funny) by Anonymous Coward on Monday October 06, @05:16AM
No, not DDOSed, it was slashdotted. :)
(Score: 1) by fen on Monday October 06, @08:18AM (3 children)
So the issue count went from 2 to 100? Soylent had 1.98 problems and wrote a classic rap song after the problem count increased. Yes I know it's HTTP code 500-599, but didn't parse it right.
(Score: 2) by janrinok on Monday October 06, @10:02AM (1 child)
It wasn't only 503, there were other 5xx codes including 502.
[nostyle RIP 06 May 2025]
(Score: 2) by janrinok on Monday October 06, @10:04AM
[nostyle RIP 06 May 2025]
(Score: 1) by DECbot on Monday October 06, @10:12PM
...but a 418 [mozilla.org] ain't one.
(Score: 2) by mcgrew on Monday October 06, @04:35PM (2 children)
Will do. Had a rash of them last time I was here. Also, whatever is causing that problem may be disallowing me from commenting on this article [soylentnews.org] marked "This discussion was created by mrpg (5708) for logged-in users only." Except I am logged in, as you can see from this comment.
(Score: 2) by janrinok on Monday October 06, @05:29PM (1 child)
Can you please try to comment in several different journals and main page stories? You may have just found a bug.
[nostyle RIP 06 May 2025]
(Score: 2) by mcgrew on Tuesday October 07, @08:18PM
Sure. One clue might be that after refreshing the page later after I was through commenting on other stories, I was able to comment. So at least there's a simple and easy workaround. I'll let you know if I see it again, I usually open articles that look interesting in tabs.
