
posted by takyon on Wednesday November 21 2018, @04:00PM
from the found-and-lost dept.

The privacy-oriented search engine Findx has shut down: https://privacore.github.io/

The reasons cited are:

  • While people are starting to understand the importance of privacy, it is a major hurdle to get them to select a different search engine.
  • Search engines eat resources like crazy, so operating costs are non-negligible.
  • Some sites (including e.g. GitHub) use a whitelist in robots.txt, blocking new crawlers (see the example robots.txt after this list).
  • The amount of spam, link-farms, referrer-linking, etc. is beyond your worst nightmare.
  • Returning good results takes a long time to fine-tune.
  • Monetizing is nearly impossible because advertising networks want to know everything about the users, going against privacy concerns.
  • Buying search results from other search engines is impossible until you have at least x million searches/month. Getting x million searches/month is impossible unless you buy search results from other search engines (or sink a lot of cash into making it yourself).
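
For readers unfamiliar with the whitelisting problem mentioned above, here is a hypothetical whitelist-style robots.txt: only the named crawlers are allowed in, and any new engine is shut out no matter how well it behaves. The crawler names are just illustrative.

```
# Hypothetical whitelist-style robots.txt (illustrative crawler names).
# Named crawlers may index everything:
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

# Everyone else, including any new search engine, is blocked entirely:
User-agent: *
Disallow: /
```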

So what do you soylentils think can be done to increase privacy for ordinary users, search-engine-wise?

Disclaimer: I worked at Findx.


Original Submission

 
  • (Score: 3, Interesting) by istartedi on Wednesday November 21 2018, @06:24PM (13 children)

    by istartedi (123) on Wednesday November 21 2018, @06:24PM (#764865) Journal

    Search results from Google et al. are 99.9% junk at least, and I'm not sure how many 9s after that, but that's beside the point. IMHO, curated search needs to come back. IIRC, Yahoo originally stood for "Yet Another Hierarchical Officious Oracle", or something like that. When I'm searching, I'm going to end up on Wikipedia a big chunk of the time. I'm going to find a handful of web sites like Stack Overflow that have their own search function. Google's "returned 1,245,555 results" is really pretty useless. A good search gets you the top 10 or 20 that might be helpful. The long tail, like most tails, has an asshole under it.

    So. If 1000 people curated 100 links each by checking their validity once a month, I don't think that would be a terribly high bar of volunteer effort. The whole thing could be one file. That's 100,000 web sites that don't suck. I'm willing to bet that would give us several 9s of goodness instead of the several 9s of crap that most search results contain.
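
    A rough sketch of what that monthly validity check could look like, assuming the curated list is just a flat text file of URLs (the file name and concurrency settings below are made up for illustration):

    ```python
    # Hypothetical monthly link-validity check for a flat file of curated URLs.
    # Assumes one URL per line in "curated_links.txt"; names are illustrative.
    import concurrent.futures
    import urllib.request

    def is_alive(url, timeout=10):
        """Return True if the URL answers with an HTTP status below 400."""
        req = urllib.request.Request(url, method="HEAD",
                                     headers={"User-Agent": "curation-check/0.1"})
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status < 400
        except Exception:
            return False

    def main():
        with open("curated_links.txt") as f:
            urls = [line.strip() for line in f if line.strip()]
        # Check links concurrently and flag the dead ones for a human curator.
        with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
            for url, ok in zip(urls, pool.map(is_alive, urls)):
                if not ok:
                    print("needs review:", url)

    if __name__ == "__main__":
        main()
    ```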

    --
    Appended to the end of comments you post. Max: 120 chars.
  • (Score: 5, Interesting) by isj on Wednesday November 21 2018, @07:08PM (3 children)

    by isj (5249) on Wednesday November 21 2018, @07:08PM (#764897) Homepage

    Google actually has 20,000+ freelancers reviewing search results and essentially curating them. So you have to compete with 20,000 people doing this more or less full-time.

    Note: the Google reviewers only have approximately 10 seconds per search result entry - that's why some of the automated-translation spam blogs slip through.

    I can't find the source right now but looking for "google review guidelines" or the like should turn up the guidelines for the results reviewers.

    • (Score: 2) by legont on Thursday November 22 2018, @01:27AM (2 children)

      by legont (4179) on Thursday November 22 2018, @01:27AM (#765020)

      So, Google has degraded to the old-time Yahoo?

      --
      "Wealth is the relentless enemy of understanding" - John Kenneth Galbraith.
      • (Score: 2) by isj on Thursday November 22 2018, @01:43AM (1 child)

        by isj (5249) on Thursday November 22 2018, @01:43AM (#765024) Homepage

        Not exactly. From what I can gather they use their ultra-secret ranking algorithm, but then validate/fine-tune it with human reviewers.

        • (Score: 2) by legont on Thursday November 22 2018, @02:30AM

          by legont (4179) on Thursday November 22 2018, @02:30AM (#765044)

          This sounds like old-school censorship to me. No wonder politicians are demanding it be bent one way or another. Once they started doing it, any authority's request seems reasonable.

          --
          "Wealth is the relentless enemy of understanding" - John Kenneth Galbraith.
  • (Score: 3, Insightful) by darkfeline on Wednesday November 21 2018, @08:10PM (2 children)

    by darkfeline (1030) on Wednesday November 21 2018, @08:10PM (#764925) Homepage

    > Search results from Google et al. are 99.9% junk at least

    Keep in mind that's after:

    "The amount of spam, link-farms, referrer-linking, etc. is beyond your worst nightmare."

    Returning good results is hard. I think you're spoiled by how good modern search engines are and demanding the impossible.

    > If 1000 people curated 100 links by checking their validity once a month I don't think that would be a terribly high bar of volunteer effort.

    How many links do you think you need to serve all of the queries done globally, or even in one country like the US? A million? Ten million? So you're going to rely on a hundred million volunteers curating links for free, right?

    --
    Join the SDF Public Access UNIX System today!
    • (Score: 2) by toddestan on Thursday November 22 2018, @06:38PM (1 child)

      by toddestan (4982) on Thursday November 22 2018, @06:38PM (#765277)

      > Returning good results is hard. I think you're spoiled by how good modern search engines are and demanding the impossible.

      It's not that difficult, because there are plenty of search engines out there that do better than Google. I will grant that Google has the problem of being the search engine that everyone games with their spammy SEO tricks, but even Bing is on par with or better than Google for a lot of searches.

      Though I don't demand anything of Google. I'll just use those other search engines instead.

      > How many links do you think you need to serve all of the queries done globally, or even in one country like the US? A million? Ten million? So you're going to rely on a hundred million volunteers curating links for free, right?

      It's probably not as big of a problem as you think. My guess is a handful of servers run most of the spammy link farms out there. You block those and you'd make a big difference. Of course you'd have the long tail to deal with and you'd never totally eliminate the problem, but for little effort I bet they could make a pretty good dent in it.

      Google could also throw their weight around. If there are specific hosting companies or IP blocks that seem to host a large number of spammy sites, just downrank every site that's hosted there. I bet that would fix the problem real quick.
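
      A minimal sketch of that downranking idea, assuming the search engine already keeps a list of suspect hosting ranges (the CIDR blocks and the penalty factor below are placeholders, not real data):

      ```python
      # Hypothetical downranking of results hosted in known-spammy IP ranges.
      # The CIDR blocks and penalty factor are placeholders, not real data.
      import ipaddress
      import socket
      from urllib.parse import urlparse

      SUSPECT_NETWORKS = [ipaddress.ip_network(n)
                          for n in ("192.0.2.0/24", "198.51.100.0/24")]
      PENALTY = 0.1  # multiply the score of anything hosted in a suspect range

      def hosted_in_suspect_range(url):
          """Resolve the result's host and check it against the suspect ranges."""
          host = urlparse(url).hostname
          try:
              addr = ipaddress.ip_address(socket.gethostbyname(host))
          except (socket.gaierror, ValueError, TypeError):
              return False
          return any(addr in net for net in SUSPECT_NETWORKS)

      def adjust_score(url, score):
          """Downrank, rather than delist, results from suspect hosting blocks."""
          return score * PENALTY if hosted_in_suspect_range(url) else score
      ```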

      • (Score: 2) by isj on Thursday November 22 2018, @08:39PM

        by isj (5249) on Thursday November 22 2018, @08:39PM (#765315) Homepage

        > My guess is a handful of servers run most of the spammy link farms out there.

        Sounds like your hands have thousands of fingers. You freak :-)

        On a more serious note: there are quite a few more than a handful of spam/SEO/link-farm sites and operators. Think how many clandestine SEO companies there are; there are at least as many operators of link farms.

        A small organisation, substandard.org, identified several link-farm / pagerank-aggregation sites just for the Danish ccTLD. Some were using link text to boost the ranking of sites offering competing products, etc. One of them was a large retail chain. Using link farms is (to simplify) illegal in Denmark, so they of course reported those link farms to the appropriate authorities. Soon after, substandard.org got DDoSed. I think they are still being DDoSed now, 1½ years later. So someone doesn't like it when you bring down link farms.

  • (Score: 3, Insightful) by Runaway1956 on Wednesday November 21 2018, @08:21PM

    by Runaway1956 (2926) Subscriber Badge on Wednesday November 21 2018, @08:21PM (#764929) Journal

    My experience with Google isn't like that. When I do a Google search, my more-or-less relevant results are always close to 100%. (Sorry, I'm not doing the numbers to see just how close to 100%.) For starters, you block Google's ad servers at the router. Block adsense. Block googletagservices and googletagmanager and googleanalytics. Those google services which I want to use still work with all that crap blocked. I should probably take a screenshot of my Google searches, and post it somewhere for people to see. uBlock and uBlock origin are just two of the script blockers that offer to block that stuff for you.
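
    For anyone wanting to try the same, a hosts-file-style blocklist (or the equivalent DNS overrides at the router) along these lines does most of it. The hostnames worth blocking change over time, so treat this as an illustrative starting point rather than a complete list:

    ```
    # /etc/hosts-style entries (or router DNS overrides) - illustrative, not exhaustive
    0.0.0.0 googleadservices.com
    0.0.0.0 googlesyndication.com
    0.0.0.0 googletagservices.com
    0.0.0.0 googletagmanager.com
    0.0.0.0 google-analytics.com
    0.0.0.0 doubleclick.net
    ```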

    My web surfing is, at a minimum, 85% ad free and tracker free. The advertising assholes ruined my internet experience almost two decades ago, and I started learning then how to stop that nonsense. In the time since then, I've actually taken ownership of my desktop and my internet experience. Some of you may not be old enough to really appreciate MySpace. It was godawful horrible. Geocities had some crap that was nearly as bad. The wider internet was trying hard to be just as bad, with their insane banners, popups, popunders, etc., ad nauseam. As I say, I took ownership, and blocked every bit of it.

    Google is less glaringly horrible than most of that crap was. But, still, it's none of Google's business whether I like Rice Krispies, or Cocoa Puffs, or Wheaties. It is far less of their business what kind of car I drive, or where I shop for auto parts, or much of anything else. So, I prevent Google from learning anything that I can prevent.

    And, to top that all off - I'm not even paranoid. I know that Google isn't out to get me, and I don't worry about anything like that. I am simply aware that Google is a prying corporate entity, and I am also aware that I don't have to permit them to pry into my life.

    But, I believe that Google's core - their search engine - is just about the best in the world. When anything or everything else fails me, I go to Google. If I can't find anything relevant with Google, then I presume that whatever I am looking for has been "sanitized", and I won't find it without some special insider help.

  • (Score: 2) by bobthecimmerian on Wednesday November 21 2018, @09:01PM (3 children)

    by bobthecimmerian (6834) on Wednesday November 21 2018, @09:01PM (#764949)

    It's easy to downplay the difficulty of getting search right. But to me, two examples of the skill behind Google search are searches for software code problems and searches for items that aren't tremendously mainstream. For maybe five hundred mainstream products, searching on Google directly versus searching on Amazon.com, Walmart.com, and eBay.com gives identical results; for anything outside that set, Google is reasonably relevant and the rest go right off the rails. Bing doesn't hold up, either.

    I'm not defending the company's business model or ethics. I'm just saying that matching them at their own game is not trivial.

    • (Score: 4, Interesting) by istartedi on Wednesday November 21 2018, @09:30PM

      by istartedi (123) on Wednesday November 21 2018, @09:30PM (#764959) Journal

      > It's easy to downplay the difficulty of getting search right.

      Fair enough. Matching Google step-for-step would be daunting so why try? Right now I can go to Google and type "How high is the Burj Khalifa" and it comes back with "2,717′, 2,722′ to tip" in a heartbeat.

      That's pretty smart. A lot's happening under the hood to make that happen because it's capable of answering very generic queries like that. You don't even have to think.

      OTOH, if we had a hierarchy of something like /lists/buildings/tall, or /architecture/tall buildings we could find some pages devoted to this kind of thing and probably get the height easily--it would just take a bit more thought and time on the part of the end user.

      It would definitely be the kind of trade-off that people make all the time when they go "off the grid" to some degree. Growing a garden vs. fresh veggies from the store.
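
      As a toy illustration of the hierarchy idea, the whole directory could literally be one nested mapping that the user walks by path; the categories and the single entry below are invented placeholders:

      ```python
      # Toy hierarchical directory: walk a path like "architecture/tall buildings".
      # The categories and the single entry are invented for illustration.
      DIRECTORY = {
          "architecture": {
              "tall buildings": [
                  ("Wikipedia: list of tallest buildings",
                   "https://en.wikipedia.org/wiki/List_of_tallest_buildings"),
              ],
          },
      }

      def lookup(path):
          """Return whatever curated entries sit under a slash-separated path."""
          node = DIRECTORY
          for part in path.strip("/").split("/"):
              node = node[part]
          return node

      print(lookup("/architecture/tall buildings"))
      ```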

      --
      Appended to the end of comments you post. Max: 120 chars.
    • (Score: 3, Interesting) by isj on Thursday November 22 2018, @12:06AM (1 child)

      by isj (5249) on Thursday November 22 2018, @12:06AM (#764995) Homepage

      > I'm just saying that matching them at [google's] own game is not trivial.

      I agree. That doesn't mean that there isn't room for improvement in google's results.

      Examples I can think of that I encountered in my work at Findx:

      Bias toward shops
      If you search for a single word, e.g. "plasterboard", the google results will have a strong bias toward shops where you can buy it. No reviews. No building codes. No evaluations by consumer organisations. So if you search for a single word, google assumes you want to buy stuff.
      Still vulnerable to SEO
      An acquaintance noticed that google never showed links to where you could buy the cheapest plasterboards. So apparently the sites with SEO and link farms made it to page 1 every time, but the most useful link for the user was buried on page 3. There isn't much quality difference between plasterboards, so wouldn't the cheapest be the "best" result?
      Handling of compound words
      I noticed that google's handling of compound words isn't that great. They claim they solved "the Swedish problem" (which is what they called the compound-word challenge) in 2006. But I recently saw a newspaper's front page with a new compound word in an article link, while the article itself had the compound word in a different inflection. Google did have the main article crawled (verified by searching for other unique words in it), but couldn't find it using the compound word. Only after 3 days did it work. I'm not sure what is going on there, but I have a suspicion that analyzing compound words and generating inflections is done offline and in batch, and there is some lag there. If you're curious, the word was the Danish "smølfedomptør".
      Old documents ignored?
      I noticed that findx could find an old usenet post that google couldn't. It was a 10-year-old post made available on a webpage. No clue why google didn't find it. So google apparently doesn't crawl everything, or they drop old documents.
      Apparently doesn't use third-party quality indicators
      When you are looking to buy something, google apparently doesn't use third-party quality seals/approvals/badges (at least we couldn't find any indication that it does). Many countries have consumer organisations that provide badges to well-behaved webshops. That would be a useful ranking parameter.

      One more note on compound words: if you want to handle Danish/Norwegian/German/Swedish/Icelandic/Finnish/Russian (and to some extent Italian) you have to deal with compound words. Findx solved it for Danish using a morphological dictionary (STO [cst.ku.dk]). I did some (incomplete) analysis of Danish webpages and it seemed that up to 10-30% of the unique words were compound words formed on the spot. So you can never have a complete dictionary for languages that easily form compounds, and you have to deal with them in some other way.
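
      To make the compound-word point concrete, here is a bare-bones recursive splitter of the kind you would build on top of a morphological dictionary; the tiny word list is obviously a stand-in for a real resource like STO:

      ```python
      # Bare-bones recursive decompounder: split an unknown word into known parts.
      # The mini word list stands in for a real morphological dictionary such as STO.
      KNOWN = {"smølfe", "smølf", "domptør", "plade", "gips"}

      def decompound(word, parts=None):
          """Return a list of known parts covering the word, or None if impossible."""
          parts = parts or []
          if word == "":
              return parts
          for cut in range(len(word), 0, -1):   # prefer the longest known prefix
              head, tail = word[:cut], word[cut:]
              if head in KNOWN:
                  # Danish often inserts a linking "e" or "s" between parts.
                  for rest in (tail, tail[1:] if tail[:1] in ("e", "s") else None):
                      if rest is None:
                          continue
                      result = decompound(rest, parts + [head])
                      if result is not None:
                          return result
          return None

      print(decompound("smølfedomptør"))   # -> ['smølfe', 'domptør']
      ```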

      • (Score: 2) by bobthecimmerian on Sunday November 25 2018, @03:41PM

        by bobthecimmerian (6834) on Sunday November 25 2018, @03:41PM (#766174)

        Thanks for the detailed response. Everything you wrote makes sense. For what it's worth, I'm sorry FindX failed. I too was unaware of it, and I had tried Yacy and Searx and a few other options that have since disappeared.

  • (Score: 0) by Anonymous Coward on Wednesday November 21 2018, @09:06PM

    by Anonymous Coward on Wednesday November 21 2018, @09:06PM (#764951)

    OK, I'm in. Where do I sign up to help curate?

    I'd be happy to check the special interest sites that I visit often, maybe even one or two others that were randomly assigned. I'd even be happy to send some data back--for example, I use EFF's Privacy Badger which reports # of trackers, could pass that number along to the database.

    How will shills be kept out? One bad apple (a curator paid to plug certain sites) could poison the database... No one wants another Yelp (uggggh).

    The searx project could be a good source for code -- they are set up for anyone to host their own instance.

    Can you