posted by takyon on Wednesday November 21 2018, @04:00PM
from the found-and-lost dept.

The privacy-oriented search engine Findx has shut down: https://privacore.github.io/

The reasons cited are:

  • While people are starting to understand the importance of privacy, it is a major hurdle to get them to select a different search engine.
  • Search engines eat resources like crazy, so operating costs are non-negligible.
  • Some sites (including e.g. github) use a whitelist in robots.txt, blocking new crawlers.
  • The amount of spam, link-farms, referrer-linking, etc. is beyond your worst nightmare.
  • Returning good results takes a long time to fine-tune.
  • Monetizing is nearly impossible because advertising networks want to know everything about the users, going against privacy concerns.
  • Buying search results from other search engines is impossible until you have at least x million searches/month. Getting x million searches/month is impossible unless you buy search results from other search engines (or sink a lot of cash into making it yourself).
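The robots.txt point is easy to demonstrate: a whitelist-style file names the crawlers it allows and disallows everyone else, so any new bot is shut out by default. A minimal sketch using Python's standard urllib.robotparser (the bot names here are illustrative):

```python
# A whitelist-style robots.txt allows named crawlers and blocks all
# others, locking out any new bot by default. Bot names illustrative.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "/some/page"))  # an allowed crawler
print(parser.can_fetch("FindxBot", "/some/page"))   # an unlisted newcomer
```

If a site takes this approach, an unlisted crawler is blocked regardless of how well-behaved it is.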

So what do you soylentils think can be done to increase privacy for ordinary users, search-engine-wise?

Disclaimer: I worked at Findx.


  • (Score: 5, Informative) by Apparition on Wednesday November 21 2018, @04:06PM (5 children)

    by Apparition (6835) on Wednesday November 21 2018, @04:06PM (#764779) Journal

    I use a combination of Startpage and DuckDuckGo. Together, they work well enough for me. The only "major" search engine I've used in years is Bing, and that's for Bing Maps in particular.

    • (Score: 4, Informative) by ikanreed on Wednesday November 21 2018, @04:51PM (1 child)

      by ikanreed (3164) Subscriber Badge on Wednesday November 21 2018, @04:51PM (#764800) Journal

      I use duckduckgo, but not exclusively.

      2 reasons.
      1. It's real easy to switch search engines when your default is ddg. !g is really easy to type when you need it.
      2. "django SuspiciousMultipartForm exception javascript" fairly often gives less useful results than "!g django SuspiciousMultipartForm exception javascript". The complexity of some searches can easily overwhelm their comparatively simple algorithm.

      But if you're still going straight to google for anything? Why?

      • (Score: 4, Insightful) by Anonymous Coward on Wednesday November 21 2018, @05:34PM

        by Anonymous Coward on Wednesday November 21 2018, @05:34PM (#764833)

        You know, the !g bit is absolutely brilliant.

        It does 2 things.

        The user usually gets what they need immediately, so they are happy. Maybe a bit irked that DDG did not find it.

        DDG gets feedback on what to look for: go through the search strings, find the !g's, and focus your results on finding those things, as those are the holes in the system.

    • (Score: 4, Informative) by takyon on Wednesday November 21 2018, @05:09PM (2 children)

      by takyon (881) <takyonNO@SPAMsoylentnews.org> on Wednesday November 21 2018, @05:09PM (#764817) Journal

      I use https://searx.me/ [searx.me] in addition to the GOOG. Searx appears to use a donation funding model.

      --
      [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
      • (Score: 0) by Anonymous Coward on Wednesday November 21 2018, @06:55PM (1 child)

        by Anonymous Coward on Wednesday November 21 2018, @06:55PM (#764886)

        Is searx a search engine? I got the impression that it's a "meta" search -- sending searches out to various actual search engines (which are user controllable). I've been using the searx.org instance which often finds what I'm looking for.

        This seems like a decent way to get around the advertising & tracking on the big sites...at least until they learn to filter out requests from searx instances.

        • (Score: 2) by isj on Wednesday November 21 2018, @07:22PM

          by isj (5249) on Wednesday November 21 2018, @07:22PM (#764905) Homepage

          Searx is a meta search engine. It doesn't have its own index. Instead it calls out to google/yahoo/bing/yandex/... depending on configuration.
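          That fan-out-and-merge behaviour can be sketched in a few lines; the backends below are stand-in functions, not real search APIs:

```python
# Minimal sketch of what a meta search engine does: fan the query out
# to each configured backend, then merge the ranked lists. The two
# backends here are stand-in functions, not real search APIs.
from concurrent.futures import ThreadPoolExecutor

def backend_a(query):
    # Stand-in for e.g. a Bing query; returns (url, score) pairs.
    return [("https://a.example/1", 0.9), ("https://a.example/2", 0.5)]

def backend_b(query):
    # Stand-in for e.g. a Yandex query.
    return [("https://a.example/1", 0.8), ("https://b.example/3", 0.7)]

def meta_search(query, backends):
    scores = {}
    with ThreadPoolExecutor() as pool:
        for results in pool.map(lambda b: b(query), backends):
            for url, score in results:
                # URLs returned by several backends accumulate score.
                scores[url] = scores.get(url, 0.0) + score
    return sorted(scores, key=scores.get, reverse=True)

print(meta_search("privacy", [backend_a, backend_b]))
```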

  • (Score: 4, Funny) by Knowledge Troll on Wednesday November 21 2018, @04:15PM (5 children)

    by Knowledge Troll (5948) on Wednesday November 21 2018, @04:15PM (#764786) Homepage Journal

    Now that Google seems intent on destroying the utility of their web search, DDG is getting a lot closer to parody with them.

    • (Score: 4, Interesting) by Anonymous Coward on Wednesday November 21 2018, @05:23PM

      by Anonymous Coward on Wednesday November 21 2018, @05:23PM (#764824)

      I switched to duckduckgo, and have used it pretty much exclusively (I occasionally do a !g search if I'm desperate, but it rarely helps), for almost a decade now. I don't remember exactly when, and I don't remember exactly what they did, but google was fucking around with the main search UI and I was not happy with it. I think at the time people were talking about ixquick and duckduckgo on the green site. I gave both a try for a while and liked duckduckgo better.

      I never looked back: Duckduckgo was offering pretty much exactly what got me using google almost exclusively in late 90s: a no-nonsense box to type words in and get no-nonsense results.

      This is literally the first time I have ever heard of findx, so it doesn't surprise me that they had trouble getting people to switch.

    • (Score: 1) by nitehawk214 on Wednesday November 21 2018, @07:12PM (2 children)

      by nitehawk214 (1304) on Wednesday November 21 2018, @07:12PM (#764900)

      "closer to parody" hehe

      But the problem I have with DDG is its image search is even more useless than Bing.

      --
      "Don't you ever miss the days when you used to be nostalgic?" -Loiosh
      • (Score: 2) by toddestan on Thursday November 22 2018, @06:15PM (1 child)

        by toddestan (4982) on Thursday November 22 2018, @06:15PM (#765267)

        I've actually found Bing's image search to be pretty decent. One of the problems with DDG's is that it seems to be hard-coded to only give you like the first 150 images it finds no matter how many it actually finds. Google's reverse image search is one of the very few things I'll use Google for anymore.

        • (Score: 0) by Anonymous Coward on Friday November 23 2018, @03:39AM

          by Anonymous Coward on Friday November 23 2018, @03:39AM (#765425)

          I use tineye.com for that. It works without Javascript :)

    • (Score: 2) by eravnrekaree on Thursday November 22 2018, @03:05PM

      by eravnrekaree (555) on Thursday November 22 2018, @03:05PM (#765212)

      s/parody/parity/

  • (Score: 5, Insightful) by Anonymous Coward on Wednesday November 21 2018, @04:18PM (8 children)

    by Anonymous Coward on Wednesday November 21 2018, @04:18PM (#764788)

    First up: this is the first I've ever heard of you. A bit of telling people you exist helps.

    • (Score: 0) by Anonymous Coward on Wednesday November 21 2018, @04:42PM

      by Anonymous Coward on Wednesday November 21 2018, @04:42PM (#764795)

      me too. never heard of findx and i've looked at using Yacy like 10 times...

    • (Score: 3, Interesting) by ledow on Wednesday November 21 2018, @04:58PM (2 children)

      by ledow (5567) on Wednesday November 21 2018, @04:58PM (#764808) Homepage

      Agreed.

      No idea who these people are and they're already dead in the water.

      "While people are starting to understand the importance of privacy it is a major hurdle to get them to select a different search engine." - sure, if I've never heard of you.
      "Search engines eat resources like crazy, so operating costs are non-negligible." - depends on what you're doing. If you're just indexing the Internet a rack or two of servers and a decent backbone connection in a datacentre would be a start. That's far from prohibitively expensive but, hey, maybe you could have asked people to help you out and had software they could run to contribute to your indexing efforts, etc. - you know, like collaboration, open-source, projects like Streetmap, etc.
      "Some sites (including e.g. github) use a whitelist in robots.txt, blocking new crawlers." - Sure, that's a problem - for any search engine whatsoever. So don't index github and also DON'T use their services, like you are for your shutdown message!
      "The amount of spam, link-farms, referrer-linking, etc. is beyond your worst nightmare." - Probably. That's what graph analysis is for, you only need suck in the HTML and index it.
      "Returning good results takes a long time to fine-tune." - Welcome to the problem of search engines... it's not about plucking data from a database, but apart finding human-relevant data within it. Can't be solved by brute-force alone, so why would you try?
      "Monetizing is nearly impossible because advertising networks want to know everything about the users, going against privacy concerns." - Advertising is just one way to monetise. If you were relying on it, you picked a really bad business model.
      "Buying search results from other search engines is impossible until you have least x million searches/month. Getting x million searches/month is impossible unless you buy search results from other search engines (or sink a lot of cash into making it yourself)." - Why on earth would you buy other people's data, or expect millions of searches a month as a nobody that no-one's heard of?

      Honestly, if you were going to do this, really seriously, then I'd have had a SETI@Home-like software that people could (voluntarily) run on their desktops or servers, which would help index sites (at the very least URL 1 contains links to URLs 2, 3 and 4 and contains these keywords - that's a ton of processing and network bandwidth saved right there) and report back to a central server. That server would almost certainly be a Elastic Cloud or similar service so it could grow with demand and cost nothing when idle (and I suspect their servers were doing far more indexing than ever serving results). That would have to have a business model not "we'll stick adverts in and hope for the best" (especially when claiming to be privacy-conscious!). And then I would spend the rest of the time/effort/money getting the word out on geeky sites, into things like Linux and open-source community, try to get deals with TorBrowser and similar to just be a simple "other search engine" in some way.
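      The client-side summarising step described above ("URL 1 contains links to URLs 2, 3 and 4 and contains these keywords") can be sketched with Python's standard html.parser; the page content here is made up:

```python
# Reduce a fetched page to outgoing links plus keyword counts, so a
# volunteer client only reports a tiny summary to the central server.
# Uses only the standard library; the page content is invented.
from collections import Counter
from html.parser import HTMLParser

class PageSummary(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []          # hrefs found on the page
        self.words = Counter()   # crude keyword histogram

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.update(w.lower() for w in data.split() if w.isalpha())

page = '<html><body>Privacy news <a href="/2">more</a> <a href="/3">archive</a></body></html>'
summary = PageSummary()
summary.feed(page)
print(summary.links)   # the outgoing links
print(summary.words)   # the keyword counts
```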

      These people set up a minuscule search engine a year ago between the three of them and expect to be inside Firefox and challenging the big boys with zero money... it's just laughable.

      • (Score: 1, Insightful) by Anonymous Coward on Wednesday November 21 2018, @05:24PM

        by Anonymous Coward on Wednesday November 21 2018, @05:24PM (#764826)

        Your idea is interesting but could be used to manipulate results very easily. Client-controlled indexing is very tricky to get just right. Even things like SETI had cheaters, and there was nothing to gain there but 'internet points'.

        You would have to be very careful to make it so your results would be verified in some way (double the crawling of each site, plus a verification pass). Using a cloud service is a decent idea, but you need to be careful early on, as the thing could spider quite quickly (meaning your costs are large, as you point out). You also need to look out for SEO tricks, such as having 2 servers look like 100 servers serving 10 million pages that point to 1 site to make it look important. And you need to watch out for malicious actors: some sites run honeypots that mire spiders, because they have had issues with them in the past.

        Also, that they led off with censorship would not sit well with privacy-conscious people either. It is a form of manipulation most people in that crowd will react violently to. Basically they missed their audience. It would be like having a store that could sell the most amazing drinks in the world, but you refuse to make them, and everything else on your menu is pretty much the same as everyone else's. You have nothing new to offer, so people will shop with what they know. Poof, you are out of business, all because you for some reason decided you know what your customers want, even though they came in every day saying 'hey, make me this drink'. When your audience is saying "I want X" and you tell them "nope, only have Y, too bad", they *will* go elsewhere.
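        The double-crawl verification idea can be sketched as redundant assignment: hand each URL to two clients chosen by hashing, and accept a report only when the independent summaries agree, so a lone cheater cannot poison the index. All names and structures here are hypothetical:

```python
# Sketch of the double-crawl verification idea: each URL is assigned to
# two independent clients via hashing, and a crawl report is accepted
# only when both submitted summaries agree, so a lone cheater cannot
# poison the index. All names here are hypothetical.
import hashlib

def assign(url, clients, redundancy=2):
    # Deterministic, evenly spread assignment: order clients by a hash
    # of (client, url) and take the first two.
    ordered = sorted(clients,
                     key=lambda c: hashlib.sha256((c + url).encode()).hexdigest())
    return ordered[:redundancy]

def accept(reports):
    # reports maps client -> summary; accept only a unanimous summary.
    values = list(reports.values())
    if len(values) >= 2 and all(v == values[0] for v in values):
        return values[0]
    return None

clients = ["alice", "bob", "carol", "dave"]
workers = assign("https://example.com/", clients)

honest = {w: {"links": ["/a"], "words": 42} for w in workers}
cheat = dict(honest)
cheat[workers[0]] = {"links": ["/spam"], "words": 1}  # one client lies

print(accept(honest))  # the agreed summary
print(accept(cheat))   # None: reports disagree
```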

        But like I said they needed to 'hit the streets' as they call it in the advertising world. They needed to tell others that they even existed. Dropping a note on SD and hacker news once 2 years ago does not count. It means posting a lot about it. Blogging about it. What sort of challenges are you having? What sort of tech stack are you using and why? Are you building your own or just gluing something together? Getting your blogs picked up by the typical news aggregators. Tell the world why you are special. Tossing up a web page does not mean people know about you. You know about you, but no one else does. When you work for a largish company you usually do not have to worry about such things. But if you are a small company, you personally do or you hire someone to do just that.

        it's just laughable
        Their business plan was not great. But hopefully they 'fail upwards'. Meaning they learned what not to do and maybe some things to do. Most businesses fail. Use that for your next venture. Good luck!

      • (Score: 3, Insightful) by Pino P on Thursday November 22 2018, @01:45AM

        by Pino P (4721) on Thursday November 22 2018, @01:45AM (#765026) Journal

        "While people are starting to understand the importance of privacy it is a major hurdle to get them to select a different search engine." - sure, if I've never heard of you.

        If you were running that business, how would you have advertised?

        That's what graph analysis is for, you only need suck in the HTML and index it.

        I thought Google LLC still had the exclusive license to the PageRank patent.

        "Monetizing is nearly impossible because advertising networks want to know everything about the users, going against privacy concerns." - Advertising is just one way to monetise. If you were relying on it, you picked a really bad business model.

        If you were running that business, how would you have raised revenue instead?

    • (Score: 0) by Anonymous Coward on Wednesday November 21 2018, @05:41PM (2 children)

      by Anonymous Coward on Wednesday November 21 2018, @05:41PM (#764839)

      Where is that search engine where you find search engines?

    • (Score: 1, Touché) by Anonymous Coward on Wednesday November 21 2018, @09:16PM

      by Anonymous Coward on Wednesday November 21 2018, @09:16PM (#764957)

      You're not the only one, wikipedia for example: https://en.wikipedia.org/w/index.php?title=FindX&redirect=no [wikipedia.org]

      It's not about them
  • (Score: 5, Interesting) by rigrig on Wednesday November 21 2018, @04:49PM (7 children)

    by rigrig (5129) <soylentnews@tubul.net> on Wednesday November 21 2018, @04:49PM (#764798) Homepage

    we made the decision to not index known porn sites – too many of such pages contain malicious code

    Maybe the assumption that the target audience (people who value their privacy) would like built-in censorship was a bit flawed?

    (And doesn't leaving out known porn sites mean the unknown ones get higher rankings? I'd expect those to be even worse, malware-wise.)

    --
    No one remembers the singer.
    • (Score: 3, Informative) by bzipitidoo on Wednesday November 21 2018, @05:07PM (4 children)

      by bzipitidoo (4388) on Wednesday November 21 2018, @05:07PM (#764815) Journal

      Seems searching for porn would be a top use for privacy. Wouldn't do to have those weird fetishes outed to the wrong people.

      • (Score: 0) by Anonymous Coward on Wednesday November 21 2018, @05:35PM (3 children)

        by Anonymous Coward on Wednesday November 21 2018, @05:35PM (#764834)

        Reptilian themed porn is nothing to be ashamed of.

        • (Score: 0) by Anonymous Coward on Wednesday November 21 2018, @06:03PM (2 children)

          by Anonymous Coward on Wednesday November 21 2018, @06:03PM (#764851)

          Neither is furry themed porn.

          • (Score: 0) by Anonymous Coward on Wednesday November 21 2018, @08:27PM (1 child)

            by Anonymous Coward on Wednesday November 21 2018, @08:27PM (#764933)

            But god help you if you go for that human guy on girl stuff, probably have to check the Wayback Machine. [archive.org]

            • (Score: 2) by MostCynical on Wednesday November 21 2018, @08:55PM

              by MostCynical (2589) on Wednesday November 21 2018, @08:55PM (#764947) Journal

              Porn, not baby-making lessons!

              --
              "I guess once you start doubting, there's no end to it." -Batou, Ghost in the Shell: Stand Alone Complex
    • (Score: 4, Informative) by isj on Wednesday November 21 2018, @06:59PM (1 child)

      by isj (5249) on Wednesday November 21 2018, @06:59PM (#764891) Homepage

      It wasn't meant as censorship. It was a pragmatic solution to dealing with the huge number of copy-pasted, keyword-stuffing, referral-linking porn sites. The number of such sites is huge, and we didn't have room for them in our index.

      The list of filtered-out sites is available: https://github.com/privacore/filter-lists/blob/master/adult/findx-adult.txt [github.com] (182526 entries)
      You can use that for "research purposes" if you like :-)

      • (Score: 2, Funny) by Anonymous Coward on Wednesday November 21 2018, @10:15PM

        by Anonymous Coward on Wednesday November 21 2018, @10:15PM (#764970)

        >too big to show

        You tease.

  • (Score: 3, Insightful) by Anonymous Coward on Wednesday November 21 2018, @04:49PM (11 children)

    by Anonymous Coward on Wednesday November 21 2018, @04:49PM (#764799)

    First, sorry about the loss of your job and hope you'll find something soon (if you are out of work...)

    But like someone else above, I'd never heard of findx either. DuckDuckGo is the privacy-protecting search engine that was my go-to. What did findx do that DDG didn't? Even then I end up using Google a lot because DDG doesn't find what I'm looking for and the Goog does. When deadlines are bearing down and search is essential I'm afraid I wimp out on privacy for the sake of Getting It Done.

    • (Score: 0) by Anonymous Coward on Wednesday November 21 2018, @04:58PM

      by Anonymous Coward on Wednesday November 21 2018, @04:58PM (#764807)

      This is the candid response you wish you had had BEFORE starting on Findx, rather than survey data or hunches that people are concerned with privacy, so there must be a business-supportable market for it. Well, people are concerned, but not enough to sacrifice results.

    • (Score: 4, Informative) by isj on Wednesday November 21 2018, @06:53PM (8 children)

      by isj (5249) on Wednesday November 21 2018, @06:53PM (#764884) Homepage

      What did findx do that DDG didn't?

        - Findx had its own index. DDG doesn't. So DDG is at the mercy of Bing/Yandex. Also, as far as I know DDG doesn't do its own ranking (I could be wrong), so that ranking (or worst case: bias) is out of their control.
        - Findx was based in Europe. DDG is in the US.
        - Findx had a more nuanced approach to multiple languages than any other search engine (copy of my blog post: http://i1.dk/privacore_findx_blog/2018-07-05-lanuage-support-in-findx/) [i1.dk]
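        As a flavour of what per-language handling involves, here is a naive stopword-based language guesser; this is purely illustrative and not how Findx actually detected languages:

```python
# Naive stopword-overlap language guesser. Purely illustrative - this
# is not how Findx actually detected languages, and real detectors use
# character n-grams and much larger word lists.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to"},
    "da": {"og", "det", "er", "til", "af"},
    "de": {"und", "der", "die", "ist", "zu"},
}

def guess_language(text):
    words = set(text.lower().split())
    # Pick the language whose stopwords overlap the text the most.
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

print(guess_language("the cat is out of the bag"))
print(guess_language("det er godt og rart til tider"))
```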

      • (Score: 3, Interesting) by Runaway1956 on Wednesday November 21 2018, @08:05PM (6 children)

        by Runaway1956 (2926) Subscriber Badge on Wednesday November 21 2018, @08:05PM (#764923) Journal

        That more nuanced approach to language? Does that translate to a more nuanced approach to non-Cyrillic languages? I'm sure that most of us, over the years, have noticed that Opera browser isn't very popular in western nations - but Opera is the browser of choice among people who use the Cyrillic alphabet. For search engines, those folk have Yandex, of course. No, I'm not being a smart alec here, I'm just wondering what you focused on, and maybe what you failed to focus on.

        As for your question about Soylentil's solutions: DDG is my most used search these days. I'll use Yandex sometimes, and if I've failed to get relevant hits, I'll sometimes go to Google. I refuse to use Bing under any circumstances, and my opinion of Yahoo is almost as dismal.

        • (Score: 0) by Anonymous Coward on Wednesday November 21 2018, @08:29PM (1 child)

          by Anonymous Coward on Wednesday November 21 2018, @08:29PM (#764934)

          I'm sure that most of us, over the years, have noticed that Opera browser isn't very popular in western nations - but Opera is the browser of choice among people who use the Cyrillic alphabet.

          Suspicions intensify

          • (Score: 2) by Runaway1956 on Wednesday November 21 2018, @08:36PM

            by Runaway1956 (2926) Subscriber Badge on Wednesday November 21 2018, @08:36PM (#764937) Journal

            Fine. Why don't you define those suspicions, categorize them, and spell them out for us. Meanwhile, I think I have some paint drying that needs to be watched.

        • (Score: 3, Informative) by isj on Wednesday November 21 2018, @11:12PM

          by isj (5249) on Wednesday November 21 2018, @11:12PM (#764985) Homepage

          That more nuanced approach to language? Does that translate to a more nuanced approach to non-Cyrillic languages?

          Findx focused on the languages used in Western/Central Europe plus English, using Latin-derived scripts. And Greek.
          We didn't have the resources to crawl/index/verify other scripts or languages.

        • (Score: 2) by legont on Thursday November 22 2018, @01:24AM (2 children)

          by legont (4179) on Thursday November 22 2018, @01:24AM (#765019)

          I do use Yandex often not only because it is somewhat better than Google, which it is, but simply because I care much less about Putin sorting my dirty laundry than, you know, locals.

          However, it is the first time I've heard that Opera is popular in Cyrillic world. How come?

          --
          "Wealth is the relentless enemy of understanding" - John Kenneth Galbraith.
          • (Score: 2) by Runaway1956 on Thursday November 22 2018, @05:28AM (1 child)

            by Runaway1956 (2926) Subscriber Badge on Thursday November 22 2018, @05:28AM (#765074) Journal

            I have only noted the fact of Opera's popularity in a specific region. I am a poor monolingual American, so I have zero ideas why those people like Opera over anything. I can only look on, and listen while others discuss the reasons. And, if the reasons were stated for me, clearly and succinctly, I probably still wouldn't understand.

            • (Score: 3, Interesting) by legont on Thursday November 22 2018, @08:50PM

              by legont (4179) on Thursday November 22 2018, @08:50PM (#765318)

              I did a little asking around. Yes, to my surprise, Opera was popular among "privacy conscious" folks of somewhat eastern origins. The most popular feature was the built-in free VPN. However, Opera was sold in 2016 to Chinese buyers, who removed the feature, so the "sophisticated" users left.

              --
              "Wealth is the relentless enemy of understanding" - John Kenneth Galbraith.
      • (Score: 2) by isj on Wednesday November 21 2018, @11:15PM

        by isj (5249) on Wednesday November 21 2018, @11:15PM (#764987) Homepage

        Above link should have been without the closing parenthesis: http://i1.dk/privacore_findx_blog/2018-07-05-lanuage-support-in-findx/ [i1.dk]

    • (Score: 3, Touché) by toddestan on Thursday November 22 2018, @06:27PM

      by toddestan (4982) on Thursday November 22 2018, @06:27PM (#765272)

      I guess I'm the opposite. I don't have time to sift through page after page after page of bullshit irrelevant search results that Google gives me. If Duckduckgo didn't find it, it won't be in the pile of garbage that I'll get from typing the same query into Google.

  • (Score: 0) by Anonymous Coward on Wednesday November 21 2018, @05:27PM

    by Anonymous Coward on Wednesday November 21 2018, @05:27PM (#764829)

    Never lost "X", why would I search for it? Surely alt-tech sites would pay for an integrated, privacy respecting search?

  • (Score: 1, Interesting) by Anonymous Coward on Wednesday November 21 2018, @06:00PM

    by Anonymous Coward on Wednesday November 21 2018, @06:00PM (#764850)

    At this point the only real architecture that makes sense for data that is expected to be free from governmental intrusion is p2p storage. Probably something like torrent, with the records authenticated via blockchain to prevent people from hijacking the search algo or creating MITM access points.

  • (Score: 0) by Anonymous Coward on Wednesday November 21 2018, @06:23PM

    by Anonymous Coward on Wednesday November 21 2018, @06:23PM (#764864)

    But you don't index porn. I figured out why you went out of business.

  • (Score: 3, Interesting) by istartedi on Wednesday November 21 2018, @06:24PM (13 children)

    by istartedi (123) on Wednesday November 21 2018, @06:24PM (#764865) Journal

    Search results from Google et al. are 99.9% junk at least, and I'm not sure how many 9s after that, but that's beside the point. IMHO, curated search needs to come back. IIRC, Yahoo stood for something like "Yet Another Hierarchical Officious Oracle", or something like that. When I'm searching, I'm going to end up on Wikipedia a big chunk of the time. I'm going to find a handful of web sites like Stack Overflow that have their own search function. Google's "returned 1,245,555 results" is really pretty useless. A good search gets you the top 10 or 20 that might be helpful. The long tail, like most tails, has an asshole under it.

    So. If 1000 people curated 100 links by checking their validity once a month I don't think that would be a terribly high bar of volunteer effort. The whole thing could be one file. That's 100,000 web sites that don't suck. I'm willing to bet that would give us several 9s of goodness instead of the several 9s of crap that most search results contain.
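    The arithmetic holds up: 1000 curators maintaining 100 links each is 100,000 entries, small enough to live in one flat file and search in memory. A sketch with made-up entries:

```python
# One flat file of curated (url, keywords) entries is tiny: 100,000
# lines at ~100 bytes each is about 10 MB, easily searched in memory.
# The entries below are made up for illustration.
curated = [
    ("https://en.wikipedia.org/", "wikipedia encyclopedia reference"),
    ("https://stackoverflow.com/", "programming questions answers"),
    ("https://soylentnews.org/", "technology news community"),
]

def search(query, index):
    terms = query.lower().split()
    # Rank entries by how many query terms appear in their keywords.
    scored = [(sum(t in kw for t in terms), url) for url, kw in index]
    return [url for score, url in sorted(scored, reverse=True) if score]

print(search("programming questions", curated))
```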

    --
    Appended to the end of comments you post. Max: 120 chars.
    • (Score: 5, Interesting) by isj on Wednesday November 21 2018, @07:08PM (3 children)

      by isj (5249) on Wednesday November 21 2018, @07:08PM (#764897) Homepage

      Google actually has 20,000+ freelancers reviewing results and essentially curating them. So you have to compete with 20,000 people doing this more-or-less full-time.

      Note: the Google reviewers only have approximately 10 seconds to review a search result entry - that's why some of the automated-translation spam-blogs slip through.

      I can't find the source right now but looking for "google review guidelines" or the like should turn up the guidelines for the results reviewers.

      • (Score: 2) by legont on Thursday November 22 2018, @01:27AM (2 children)

        by legont (4179) on Thursday November 22 2018, @01:27AM (#765020)

        So, Google degraded to the old time Yahoo?

        --
        "Wealth is the relentless enemy of understanding" - John Kenneth Galbraith.
        • (Score: 2) by isj on Thursday November 22 2018, @01:43AM (1 child)

          by isj (5249) on Thursday November 22 2018, @01:43AM (#765024) Homepage

          Not exactly. From what I can gather they use their ultra-secret ranking algorithm, but then validate/fine-tune it with human reviewers.

          • (Score: 2) by legont on Thursday November 22 2018, @02:30AM

            by legont (4179) on Thursday November 22 2018, @02:30AM (#765044)

            This sounds like old-school censorship to me. No wonder politicians are demanding it be bent one way or another. Once they started doing it, any authority's request seems reasonable.

            --
            "Wealth is the relentless enemy of understanding" - John Kenneth Galbraith.
    • (Score: 3, Insightful) by darkfeline on Wednesday November 21 2018, @08:10PM (2 children)

      by darkfeline (1030) on Wednesday November 21 2018, @08:10PM (#764925) Homepage

      > Search results from Google et. al. are 99.9% junk at least

      Keep in mind that's after:

      "The amount of spam, link-farms, referrer-linking, etc. is beyond your worst nightmare."

      Returning good results is hard. I think you're spoiled by how good modern search engines are and demanding the impossible.

      > If 1000 people curated 100 links by checking their validity once a month I don't think that would be a terribly high bar of volunteer effort.

      How many links do you think you need to serve all of the queries done globally, or even in one country like the US? A million? Ten million? So you're going to rely on a hundred million volunteers curating links for free, right?

      --
      Join the SDF Public Access UNIX System today!
      • (Score: 2) by toddestan on Thursday November 22 2018, @06:38PM (1 child)

        by toddestan (4982) on Thursday November 22 2018, @06:38PM (#765277)

        Returning good results is hard. I think you're spoiled by how good modern search engines are and demanding the impossible.

        It's not that difficult because there's a lot of search engines out there that do better than Google. I will grant that Google has the problem of being the search engine that everyone games with their spammy SEO tricks, but even Bing is on par or better than Google for a lot of searches.

        Though I don't demand anything of Google. I'll just use those other search engines instead.

        How many links do you think you need to serve all of the queries done globally, or even in one country like the US? A million? Ten million? So you're going to rely on a hundred million volunteers curating links for free, right?

        It's probably not as big of a problem as you think. My guess is a handful of servers run most of the spammy link farms out there. You block those and you'd make a big difference. Of course you'd have the long tail to deal with and you'd never totally eliminate the problem, but for little effort I bet they could make a pretty good dent in it.

        Google could also throw their weight around too. If there are specific hosting companies or IP blocks that seemed to host a large number of spammy sites, just downrank every site that's hosted there. I bet that would fix the problem real quick.

        • (Score: 2) by isj on Thursday November 22 2018, @08:39PM

          by isj (5249) on Thursday November 22 2018, @08:39PM (#765315) Homepage

          My guess is a handful of servers run most of the spammy link farms out there.

          Sounds like your hands have thousands of fingers. You freak :-)

          On a more serious note: there are quite a few more spam/seo/link-farm/... sites and operators than a handful. Think how many clandestine SEO companies there are. There are at least as many operators of link farms.

          A small organisation, substandard.org, identified several link-farm / pagerank-aggregation sites just for the Danish ccTLD. Some were using link text to boost the ranking of sites offering competing products etc., one of them a large retail chain. It is illegal to use link farms in Denmark (a simplification), so they of course reported those link farms to the appropriate authorities. Soon after, substandard.org got DDoSed. I think they are still being DDoSed now, 1½ years later. So someone doesn't like it if you bring down link-farms.

    • (Score: 3, Insightful) by Runaway1956 on Wednesday November 21 2018, @08:21PM

      by Runaway1956 (2926) Subscriber Badge on Wednesday November 21 2018, @08:21PM (#764929) Journal

      My experience with Google isn't like that. When I do a Google search, my more-or-less relevant results are always close to 100%. (Sorry, I'm not doing the numbers to see just how close to 100%.) For starters, you block Google's ad servers at the router. Block adsense. Block googletagservices and googletagmanager and googleanalytics. Those google services which I want to use still work with all that crap blocked. I should probably take a screenshot of my Google searches, and post it somewhere for people to see. uBlock and uBlock origin are just two of the script blockers that offer to block that stuff for you.

      My web surfing is, at a minimum, 85% ad free and tracker free. The advertising assholes ruined my internet experience almost two decades ago, and I started learning then how to stop that nonsense. In the time since then, I've actually taken ownership of my desktop, and my internet experience. Some of you may not be old enough to really appreciate MySpace. It was godawful horrible. Geocities had some crap that was nearly as bad. The wider internet was trying hard to be just as bad, with their insane banners, popups, popunders, etc ad nauseum. As I say, I took ownership, and blocked every bit of it.

      Google is less glaringly horrible than most of that crap was. But, still, it's none of Google's business whether I like Rice Krispies, or Cocoa Puffs, or Wheaties. It is far less of their business what kind of car I drive, or where I shop for auto parts, or much of anything else. So, I prevent Google learning anything that I can prevent.

      And, to top that all off - I'm not even paranoid. I know that Google isn't out to get me, and I don't worry about anything like that. I am simply aware that Google is a prying corporate entity, and I am also aware that I don't have to permit them to pry into my life.

      But, I believe that Google's core - their search engine - is just about the best in the world. When anything or everything else fails me, I go to Google. If I can't find anything relevant with Google, then I presume that whatever I am looking for has been "sanitized", and I won't find it without some special insider help.

    • (Score: 2) by bobthecimmerian on Wednesday November 21 2018, @09:01PM (3 children)

      by bobthecimmerian (6834) on Wednesday November 21 2018, @09:01PM (#764949)

      It's easy to downplay the difficulty of getting search right. But to me, two examples of the skill behind Google search are searches for software code problems, and searches for sales of items that aren't tremendously mainstream. There are probably five hundred mainstream products for which searching on Google directly vs. searching on Amazon.com, Walmart.com, and eBay.com gives identical results, but for anything outside that set Google is reasonably relevant and the rest go right off the rails. Bing doesn't hold up, either.

      I'm not defending the company's business model or ethics. I'm just saying that matching them at their own game is not trivial.

      • (Score: 4, Interesting) by istartedi on Wednesday November 21 2018, @09:30PM

        by istartedi (123) on Wednesday November 21 2018, @09:30PM (#764959) Journal

        It's easy to downplay the difficulty of getting search right

        Fair enough. Matching Google step-for-step would be daunting so why try? Right now I can go to Google and type "How high is the Burj Khalifa" and it comes back with "2,717′, 2,722′ to tip" in a heartbeat.

        That's pretty smart. A lot's happening under the hood to make that happen because it's capable of answering very generic queries like that. You don't even have to think.

        OTOH, if we had a hierarchy of something like /lists/buildings/tall, or /architecture/tall buildings we could find some pages devoted to this kind of thing and probably get the height easily--it would just take a bit more thought and time on the part of the end user.

        It would definitely be the kind of trade-off that people make all the time when they go "off the grid" to some degree. Growing a garden vs. fresh veggies from the store.

        --
        Appended to the end of comments you post. Max: 120 chars.
      • (Score: 3, Interesting) by isj on Thursday November 22 2018, @12:06AM (1 child)

        by isj (5249) on Thursday November 22 2018, @12:06AM (#764995) Homepage

        I'm just saying that matching them at [google's] own game is not trivial.

        I agree. That doesn't mean that there isn't room for improvement in google's results.

        Examples I can think of I encountered in my work at Findx:

        Bias toward shops
        If you search for a single word, e.g. "plasterboard", the google results will have a strong bias toward shops where you can buy it. No reviews. No building codes. No evaluations by consumer organisations. So if you search for a single word, google thinks you want to buy stuff.
        Still vulnerable to SEO
        An acquaintance noticed that google never showed links to where you could buy the cheapest plasterboards. So apparently the sites with SEO and link farms made it to page 1 every time, but the most useful link for the user was buried on page 3. There isn't much quality difference between plasterboards, so wouldn't the cheapest be the "best" result?
        Handling of compound words
        I noticed that google's handling of compound words isn't that great. They claim they solved "the Swedish problem" (which is what they called the compound-words challenge) in 2006. But I recently saw that a newspaper's front page had a new compound word in an article link, and the article had the compound word in a different inflection. Google did have the main article crawled (verified with a search for other unique words), but couldn't find it using the compound word. Only after 3 days did it work. I'm not sure what is going on there, but I have a suspicion that analyzing compound words and generating inflections is done offline and in batch, and there is some lag there. If you're curious, it was the Danish word "smølfedomptør".
        Old documents ignored?
        I noticed that findx could find an old usenet post that google couldn't. It was a 10-year-old post made available on a webpage. No clue why google didn't find it. So google apparently doesn't crawl everything, or they drop old documents.
        Apparently doesn't use third-party quality indicators
        When looking to buy something, google apparently doesn't use third-party quality seals/approvals/badges (at least we couldn't find any indication that it does). Many countries have consumer organisations that provide badges to well-behaved webshops. That is a useful ranking parameter.

        One more note on compound words: if you want to handle Danish/Norwegian/German/Swedish/Icelandic/Finnish/Russian (and to some extent Italian) you have to deal with compound words. Findx solved it for Danish using a morphological dictionary (STO [cst.ku.dk]). I did some (incomplete) analysis of Danish webpages and it seemed that up to 10-30% of the unique words were compounds made on the spot. So you can never have a complete dictionary for languages that easily form compounds, and you have to deal with them some other way.
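        To make the dictionary approach concrete: a toy compound splitter over a set of known words might look like the sketch below. This is my own illustration, not Findx code; a real system would use a morphological dictionary such as STO plus frequency scoring to pick the best among several candidate splits.

```python
def split_compound(word, lexicon, linkers=("s", "e")):
    """Try to split a compound into known lexicon words, allowing an
    optional one-letter linking element ("fuge-s"/"fuge-e") between
    parts, as in Danish and German. Returns a list of parts, or None.
    Toy sketch only: real splitters score competing splits by frequency."""
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(len(word) - 1, 2, -1):  # prefer the longest head
        head = word[:i]
        if head not in lexicon:
            continue
        rest = word[i:]
        tail = split_compound(rest, lexicon, linkers)
        if tail:
            return [head] + tail
        # allow a linking element between head and the remainder
        if rest[:1] in linkers:
            tail = split_compound(rest[1:], lexicon, linkers)
            if tail:
                return [head] + tail
    return None
```

The greedy longest-head recursion handles made-on-the-spot compounds as long as the component words are in the lexicon, which is exactly why the 10-30% novel-compound rate matters: the components are known even when the compound itself is not.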

        • (Score: 2) by bobthecimmerian on Sunday November 25 2018, @03:41PM

          by bobthecimmerian (6834) on Sunday November 25 2018, @03:41PM (#766174)

          Thanks for the detailed response. Everything you wrote makes sense. For what it's worth, I'm sorry FindX failed. I too was unaware of it, and I had tried Yacy and Searx and a few other options that have since disappeared.

    • (Score: 0) by Anonymous Coward on Wednesday November 21 2018, @09:06PM

      by Anonymous Coward on Wednesday November 21 2018, @09:06PM (#764951)

      OK, I'm in. Where do I sign up to help curate?

      I'd be happy to check the special interest sites that I visit often, maybe even one or two others that were randomly assigned. I'd even be happy to send some data back--for example, I use EFF's Privacy Badger which reports # of trackers, could pass that number along to the database.

      How will shills be kept out? One bad apple (a curator paid to plug certain sites) could poison the database... No one wants another Yelp (uggggh).

      The searx project could be a good source for code -- they are set up for anyone to host their own instance.

      Can you

  • (Score: 0) by Anonymous Coward on Wednesday November 21 2018, @06:58PM (6 children)

    by Anonymous Coward on Wednesday November 21 2018, @06:58PM (#764889)

    If someone today wanted to set up their own search engine on a university server (assuming bandwidth is no object), what are some good algorithms and techniques to manage spidering without infinite recursion or doing something stupid to piss off server owners, identification of spammy sites, storage, and searching your storage? You're sure as hell not using SQL to store the results, it has to be something custom. I once had a fulltext indexed field that took a minute to return a response.

    Can someone recommend papers, design talks, code examples?

    • (Score: 4, Interesting) by isj on Wednesday November 21 2018, @07:14PM

      by isj (5249) on Wednesday November 21 2018, @07:14PM (#764902) Homepage

      No, you definitely don't want to store it in a standard SQL database, even one with full-text search.

      Look for "inverted indexes".

      Depending on the goal of your search engine you may be able to reduce the index size with:
          - lemmatization
          - stemming
          - if word order doesn't matter then store occurrences only once per document

      If the document set is relatively uniform (say, a set of scientific papers, or a set of children's books, but not a mix of both) then you can use the BM25 ranking algorithm to get reasonably good results.
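      To make the inverted-index-plus-BM25 idea concrete, here is a toy sketch (my own illustration with made-up names; a real engine adds postings compression, positional data, and on-disk storage):

```python
import math
from collections import defaultdict

class TinyIndex:
    """Minimal in-memory inverted index with BM25 ranking (sketch only)."""

    def __init__(self, k1=1.5, b=0.75):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> token count
        self.k1, self.b = k1, b

    def add(self, doc_id, text):
        tokens = text.lower().split()      # real code would lemmatize/stem here
        self.doc_len[doc_id] = len(tokens)
        for t in tokens:
            self.postings[t][doc_id] = self.postings[t].get(doc_id, 0) + 1

    def search(self, query):
        n = len(self.doc_len)
        avgdl = sum(self.doc_len.values()) / n
        scores = defaultdict(float)
        for t in query.lower().split():
            docs = self.postings.get(t, {})
            if not docs:
                continue
            # BM25 idf; the 1 + ... form keeps it positive for common terms
            idf = math.log(1 + (n - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / avgdl)
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + norm)
        return sorted(scores.items(), key=lambda kv: -kv[1])
```

Note how the lemmatization/stemming mentioned above would plug into `add()`: mapping each token to a canonical form before it hits the postings lists is exactly what shrinks the index.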

    • (Score: 1, Disagree) by Anonymous Coward on Wednesday November 21 2018, @11:23PM (1 child)

      by Anonymous Coward on Wednesday November 21 2018, @11:23PM (#764988)

      Google makes its own optimized hardware for its search engine. Like it or not, that is an advantage that a startup is going to have a hard time competing with. They are really fucking smart. If you are going to compete you're going to need to do something that is totally different from the ground up. That is something that can be done IMHO, but it is a hard nut to crack. And you sure as shit aren't going to crack it by taking an off the shelf SQL server and thinking you're going to make a search engine. Not gonna happen.

      You have to completely rethink not just search, but the underlying systems that make the Internet function. Google's competitors in the hardware/OS/application space are already doing that, and they all seem to have decided that fracturing the network to kick Google off of it is the only way of competing. That is probably true for a big company, but not for a small one.

      • (Score: 0) by Anonymous Coward on Thursday November 22 2018, @12:07AM

        by Anonymous Coward on Thursday November 22 2018, @12:07AM (#764996)

        The only way to do it is to grow a true AI, such as a brain in a box, and train it to be smarter and more useful than any search engine or voice assistant ever conceived of.

        Of course, Google could do it first.

    • (Score: 2) by eravnrekaree on Thursday November 22 2018, @05:12PM (1 child)

      by eravnrekaree (555) on Thursday November 22 2018, @05:12PM (#765250)

      The problem with a run-of-the-mill MySQL setup is that anything that uses a single monolithic SQL server would crash and burn under the load. It needs to be highly distributed; the phrase is "scale out": hundreds of SQL server instances in a farm. That means a lot of load balancing, mirroring, slicing, distribution, etc., going on in a kind of mesh architecture that avoids a single point of load. Sharding has been used to deal with these sorts of things as well. SQL may not even be the big problem; the big problem is implementations of it, which usually revolve around a single monolithic server (or a few) and where replication is primitive. Most of the scale-out research, however, has gone into NoSQL databases.

      There is Gigablast which is on GitHub that did implement an open source search engine.

      • (Score: 3, Informative) by isj on Thursday November 22 2018, @05:34PM

        by isj (5249) on Thursday November 22 2018, @05:34PM (#765252) Homepage

        There is Gigablast which is on GitHub that did implement an open source search engine.

        Findx used the gigablast open-source-search-engine. In hindsight that was a mistake. Email me if you want details; I'm not going to rant about it here.

        Regarding sharding: yes, you have to shard. You also have to build in the assumption that shards will fail, so you need redundancy and a way to deal with inconsistencies.

    • (Score: 2) by quietus on Thursday November 22 2018, @06:18PM

      by quietus (6328) on Thursday November 22 2018, @06:18PM (#765268) Journal

      Start with the book Introduction to Information Retrieval (Manning, Raghavan and Schütze, Cambridge Press). Study graph theory & algorithms thoroughly*. After that, some (most?) of the latest research around the subject of Information Retrieval/Search (engines) can be found, publicly available, here [iw3c2.org] (the older material can be directly accessed, but the newer material requires ACM membership).

      If you want to delve a bit deeper into the mathematical background of graph theory, or anything else [mathematically] interesting you encounter, start out with Dover Press books.

  • (Score: 0) by Anonymous Coward on Wednesday November 21 2018, @07:06PM

    by Anonymous Coward on Wednesday November 21 2018, @07:06PM (#764896)

    Not good news, if the company was as it says it was. It's probably very tough to compete in the search business, and I have to say as well that I've never heard of Findx.

    Personally I use Startpage and sometimes Google, if I think Startpage isn't showing me everything. Although most of the time I can't find anything more with Google than I can with Startpage.

    I am a bit concerned about how Startpage and DDG actually make their money.

  • (Score: 3, Insightful) by Anonymous Coward on Wednesday November 21 2018, @08:15PM

    by Anonymous Coward on Wednesday November 21 2018, @08:15PM (#764926)

    What the world needs is a new hastalavista.box.sk.

  • (Score: 2) by eravnrekaree on Thursday November 22 2018, @02:29PM

    by eravnrekaree (555) on Thursday November 22 2018, @02:29PM (#765193)

    DuckDuckGo does not paginate results in a way that lets you jump between result pages by page number. Loading additional pages into a single page makes memory use skyrocket and does not make things easier to use. I wanted to view one page of results at a time and be able to jump to specific pages of results at will. This is really a bummer and makes the thing hard to use.

(1)