SoylentNews is people

posted by janrinok on Thursday February 02, @08:24AM
from the in-Russia-search-engine-gives-you-their-data dept.

https://arstechnica.com/information-technology/2023/01/massive-yandex-code-leak-reveals-russian-search-engines-ranking-factors/

Nearly 45GB of source code files, allegedly stolen by a former employee, have revealed the underpinnings of Russian tech giant Yandex's many apps and services. It also revealed key ranking factors for Yandex's search engine, the kind almost never revealed in public.
[...]
As detailed by Buraks (in two threads), Yandex's engine favors pages that:

  • Aren't too old
  • Have a lot of organic traffic (unique visitors) and less search-driven traffic
  • Have fewer numbers and slashes in their URL
  • Have optimized code rather than "hard pessimization," with a "PR=0"
  • Are hosted on reliable servers
  • Happen to be Wikipedia pages or are linked from Wikipedia
  • Are hosted or linked from higher-level pages on a domain
  • Have keywords in their URL (up to three)
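
For anyone who wants to play with the URL-related items in the list above (fewer numbers and slashes, up to three keywords), here is a toy sketch in Python. It is not Yandex's code and nothing in it comes from the leak: the function names `url_features` and `toy_url_score`, the weights, and the simple substring matching are all assumptions made up for illustration.

```python
from urllib.parse import urlparse


def url_features(url, query_terms):
    """Extract three toy URL signals loosely modeled on the listed factors."""
    path = urlparse(url).path
    # Digits and slashes are counted in the path only, not in the host name.
    digits = sum(ch.isdigit() for ch in path)
    slashes = path.count("/")
    lowered = url.lower()
    # Naive substring match over the whole URL, capped at three hits
    # because the list says "up to three" keywords count.
    hits = sum(1 for term in set(query_terms) if term.lower() in lowered)
    return {"digits": digits, "slashes": slashes, "keyword_hits": min(3, hits)}


def toy_url_score(url, query_terms):
    """Combine the signals with hypothetical weights (illustration only)."""
    f = url_features(url, query_terms)
    return 1.0 * f["keyword_hits"] - 0.1 * f["digits"] - 0.2 * f["slashes"]
```

With these made-up weights, `https://example.com/yandex-leak` scores higher for the terms `yandex` and `leak` than `https://example.com/blog/2023/01/yandex-leak`, simply because the second URL has more digits and slashes in its path.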

I'm not sure how much these differ from our own search engines. Does anyone have any insights?


Original Submission

This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2, Interesting) by Anonymous Coward on Thursday February 02, @08:30AM (13 children)

    by Anonymous Coward on Thursday February 02, @08:30AM (#1289841)
    Yandex image search often produces more interesting results than Google's when searching by image. For example, with Yandex the images found are often actually similar, even if none of them is the exact image I am searching for.

    Google seems intent on making their image search worse. I've had cases where a Google image search for an image that worked before stopped working for weeks/months then started working again.
    • (Score: 5, Informative) by Opportunist on Thursday February 02, @10:19AM (11 children)

      by Opportunist (5545) on Thursday February 02, @10:19AM (#1289849)

      That's mostly due to Google being very averse to finding "inappropriate" pictures. And I don't mean porn. I mean things like, say, labeling black people as monkeys [wired.com] or finding mugshots when looking for "three black teens" [washingtonpost.com].

      Yandex just doesn't give a fuck about that. Which results in better results. Yes, sometimes they're fucking racist, but the rest of the results are spot on.

      • (Score: 2) by Rich on Thursday February 02, @02:33PM (10 children)

        by Rich (945) on Thursday February 02, @02:33PM (#1289861) Journal

        Yes, sometimes they're fucking racist

        To the point of .. ahem .. misunderstanding .. the desire to no longer call certain technical dependencies a "master-slave" relation (teaser: they kept the "master"):

        https://mezha.media/en/2023/01/27/yandex-repository-leaked-racist-expressions-in-the-code-censorship-and-how-russians-communicate-with-the-smart-speaker/ [mezha.media]

        But, wrt top post, their image search really is top notch. I wonder if its logic has been disclosed in the leak, too, and if we will read about how it works so well.

        • (Score: 2) by looorg on Thursday February 02, @03:12PM (7 children)

          by looorg (578) on Thursday February 02, @03:12PM (#1289865)

          Seems to mostly just be childish and stupid variable names, don't know why you would want to call processes niggers. All it appears to do is to call pkill and wait for it to terminate all processes. If anything it probably just shows that in the beginnings there was probably just a few people working on it, or even less, and they just wrote things and didn't care that others would eventually read it or that whatever they wrote would be around decades later.

          Once upon a time in ye olden age I worked for a company that had named all their servers after female pornstars. It was all young lads there, a list of suitable length was needed to name the servers, and apparently they landed on that. I don't know who decided it or how, but there it was. Jenna has gone down ... Management didn't know, care, or understand. It wasn't something they saw anyway. Until they did. Then it was a lot less fun and a lot of work.

          • (Score: 3, Interesting) by Rich on Thursday February 02, @05:48PM (6 children)

            by Rich (945) on Thursday February 02, @05:48PM (#1289891) Journal

            Jenna has gone down

            Epic!

            Until they did. Then it was a lot less fun and a lot of work.

            Couldn't you just have explained that the servers are tagged with names in alphabetic order, like storms or other meteorological phenomena?!

            (I remember to this day that a worn out floppy on an Apple IIgs will return Error 39 ($27). I diligently caught all errors that were expected or encountered and displayed appropriate messages. We didn't ever see an IO Error and that was mapped to a generic error dialog that never ever appeared during testing. It had "So what?" as dismissal button. I heard about it at lunch with customers, and one customer was not really amused. Ouch...)

            • (Score: 2) by looorg on Thursday February 02, @07:35PM (4 children)

              by looorg (578) on Thursday February 02, @07:35PM (#1289915)

              While that might have worked somewhere else, it didn't really work here. First, there were printed labels with the name attached to each machine, and some of them (not all) even came with little pictures (headshots, not that kind!). It was fine; after all, nobody was in the basement server hall, so it wasn't like it could be spotted by accident by someone who just passed by. If they had been local names, there were obvious gaps in the a-z listing, and while some might have passed for local names (Nina, Julia ...), most of them were clearly somewhat weird and foreign; there just are not a lot of girls named Jenna, Amber, Tabatha and Kobe etc. around here.

              Also, management were men too, so while one or two names might not have given it away, just been considered weird, once a few more became known it wasn't like they couldn't figure out the naming convention either. After all, they knew these names too. It's not like they got mad about it or sacked anyone. At a meeting they just asked why the servers were named after female pornstars. What was weird was that it had been going on for years before they found out and/or decided that it was a problem that had to be fixed. It was deemed unprofessional and all machine names henceforth had to follow the corporate naming convention that they had come up with. The porn-names lived on as aliases tho, out of sight out of mind or something. Remembering those names was just so much simpler than the system they came up with.

              • (Score: 3, Funny) by Opportunist on Thursday February 02, @09:02PM (2 children)

                by Opportunist (5545) on Thursday February 02, @09:02PM (#1289930)

                At a meeting they just asked why the servers were named after female pornstars.

                Answer: Huh? These are porn stars? I'm amazed at your knowledge of various things, sir, I didn't know that...

                Trust me, that problem would never rear its ugly head again.

                • (Score: 0) by Anonymous Coward on Thursday February 02, @09:54PM (1 child)

                  by Anonymous Coward on Thursday February 02, @09:54PM (#1289942)

                  Do it with gay male porn stars and GUARANTEED your secret will be safe.

              • (Score: 0) by Anonymous Coward on Friday February 03, @06:01AM

                by Anonymous Coward on Friday February 03, @06:01AM (#1289981)

                they just asked why the servers were named after female pornstars

                It was deemed unprofessional

                Aren't pornstars pros though? Naming them after twitch streamers would be unprofessional...
                😏

            • (Score: 3, Funny) by Opportunist on Thursday February 02, @09:00PM

              by Opportunist (5545) on Thursday February 02, @09:00PM (#1289928)

              As long as the error doesn't end up in a message box reading "This shit can't happen, fuck the requirement that every error message box needs a text"...

        • (Score: 2) by Opportunist on Thursday February 02, @08:55PM (1 child)

          by Opportunist (5545) on Thursday February 02, @08:55PM (#1289926)

          Well, not exactly the most professional way to name functions and variables, but then again, looking down my naming conventions... I mean, the worker nodes of my Kubernetes cluster here are called the "proles", and the firewall is called "boomer" (because it's doing the gatekeeping here)...

          But jokes aside, I never really got the whole master/slave problem. I'm fairly sure the first thing that comes to mind for most people these days when hearing about master and slave isn't the abolition question but rather the BDSM one. So maybe calling them "top" and "bottom" would be more socially acceptable, just to stay in the general terminology?

          • (Score: 1, Touché) by Anonymous Coward on Friday February 03, @06:04AM

            by Anonymous Coward on Friday February 03, @06:04AM (#1289982)

            I dunno I think master-slave can be appropriate terms if the slave has to do most of the work the master orders no matter what.

            What next, stop calling FreeBSD jails jails because they are jails?

    • (Score: 2) by Unixnut on Friday February 03, @05:02PM

      by Unixnut (5779) on Friday February 03, @05:02PM (#1290047)

      > Google seems intent on making their image search worse. I've had cases where a Google image search for an image that worked before stopped working for weeks/months then started working again.

      Meh, Google has been intent on making all their search worse for years now, if not more than a decade. Roughly at the point they switched from being a "search company" to being an "advertising company".

      Their goal is not to give you the best search, but to get the most money from advertising to you, and it shows in the quality of their search results. Sometimes giving you poor search results gets them more advertising, as you have to click through more search pages (more pages = more ads loaded), and maybe have to search for the same thing multiple times with different terms (multiple page loads = more ads loaded).

      As such I've not used Google for years now. Funnily enough, my "go to" search engine nowadays is Yandex, by far one of the better search engines out there. My main pet peeve with it is the fact every few times I search, it asks me to fill in a captcha to prove I am not a bot.

      Presumably this is because I don't allow JavaScript in my web browser, but damn, surely they can find a better way to deal with that. Yahoo is my second choice, and it doesn't suffer this issue, but I suspect that is because they track me, so they know I am not a bot despite no JavaScript. I guess a plus for Yandex for not tracking me, but it does get tiring to keep having to re-do the captchas.

  • (Score: 2, Insightful) by shrewdsheep on Thursday February 02, @08:52AM (2 children)

    by shrewdsheep (5215) on Thursday February 02, @08:52AM (#1289843)

    Looking at "organic traffic" means they need to rely on some spyware to get those numbers. Does Yandex offer a toolbar or an app? Still those numbers must be heavily biased as compared to a random sample.

  • (Score: 4, Interesting) by looorg on Thursday February 02, @02:53PM

    by looorg (578) on Thursday February 02, @02:53PM (#1289864)

    Some of this is probably also viable knowledge for other Search Engine Optimization games. I would be somewhat surprised if a lot of the engines and their various indices don't converge on a somewhat similar set of points regarding how they list things. So if these are true for some, more or less, current version of Yandex, I would be very surprised if they weren't also somewhat true for, say, Google. With the possible exception that Yandex in some regard might be more pure or, if you will, doesn't have to give a fuck about various shifting western sensibilities regarding race, politics and social issues.

    That said, SEO has become a guessing game of hot garbage, and just doing what was previously normal searching is now basically nightmare fuel. Almost everything gets misinterpreted on some level or funneled here and there and everywhere.

    Still, a model based on 1922 ranking factors is horrible by any standard, certainly so for one that is supposed to answer usually fairly basic and trivial search queries. Something must be quite messed up in there. I thought Russians were supposed to be good at maths and optimization. When did that change?

    Not that they won't have their own censorship settings in there somewhere; I'm not sure how the Kremlin feels about all the memes of Putin riding bears while bare chested etc.

    The number of UNUSED and DEPRECATED tags just means that it is changing and evolving over time. While I have not checked, I'm sure that if you are geolocated inside Russia and search for info on the current military operation in Ukraine, you will get a very different version of events than someone outside etc.

  • (Score: 2) by quietus on Thursday February 02, @06:58PM (1 child)

    by quietus (6328) on Thursday February 02, @06:58PM (#1289904) Journal

    The Yandex source leak was probably organized by somebody on the inside, and the point was most likely not to notify the rest of the world about how better to rank in Yandex.

    The point might have been to show that Yandex censors results [meduza.io] which may show the Z symbol and images of Putin in contexts that may embarrass the current regime.

    While the leaked image filtering code, ImgPatch, is most often used to filter out pornography, its second most common use is to filter out images of Putin, in association with terms like bullshitter, balding, dickhead, top thief, scumbag of all Rus, dick in a Spacesuit, dickhead in an ice hole, what do pedophiles look like, when will he croak, strange creature waves his hand, and grandpa in his bunker. It is also used to filter out images with the symbol Z in them, in association with terms associated with the nazi regime.

    • (Score: 0) by Anonymous Coward on Thursday February 02, @09:59PM

      by Anonymous Coward on Thursday February 02, @09:59PM (#1289943)

      Good point. The mysterious air of power, where things just happen that favor them, is having its lid blown off. Look how much micromanaging bullshit is necessary behind the scenes to have one pouty billionaire.
