
SoylentNews is people

posted by takyon on Wednesday November 21 2018, @04:00PM
from the found-and-lost dept.

The privacy-oriented search engine Findx has shut down: https://privacore.github.io/

The reasons cited are:

  • While people are starting to understand the importance of privacy, it is a major hurdle to get them to select a different search engine.
  • Search engines eat resources like crazy, so operating costs are non-negligible.
  • Some sites (including e.g. github) use a whitelist in robots.txt, blocking new crawlers.
  • The amount of spam, link-farms, referrer-linking, etc. is beyond your worst nightmare.
  • Returning good results takes a long time to fine-tune.
  • Monetizing is nearly impossible because advertising networks want to know everything about the users, going against privacy concerns.
  • Buying search results from other search engines is impossible until you have at least x million searches/month. Getting x million searches/month is impossible unless you buy search results from other search engines (or sink a lot of cash into making it yourself).
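
To illustrate the robots.txt point: the sketch below uses a made-up whitelist-style robots.txt (not the actual file of github or any other site) and Python's standard urllib.robotparser. The crawler name "NewCrawler" and the helper allowed() are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist-style robots.txt: one named crawler is allowed,
# every other user agent is disallowed.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

established = allowed("Googlebot", "https://example.com/some/page")
newcomer = allowed("NewCrawler", "https://example.com/some/page")
```

Under these assumptions the whitelisted crawler is admitted and the newcomer is turned away before it has fetched a single page.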

So what do you soylentils think can be done to increase privacy for ordinary users, search-engine-wise?

Disclaimer: I worked at Findx.


Original Submission

 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 0) by Anonymous Coward on Wednesday November 21 2018, @06:58PM (6 children)

    by Anonymous Coward on Wednesday November 21 2018, @06:58PM (#764889)

    If someone today wanted to set up their own search engine on a university server (assuming bandwidth is no object), what are some good algorithms and techniques to manage spidering without infinite recursion or doing something stupid to piss off server owners, identification of spammy sites, storage, and searching your storage? You're sure as hell not using SQL to store the results, it has to be something custom. I once had a fulltext indexed field that took a minute to return a response.

    Can someone recommend papers, design talks, code examples?

  • (Score: 4, Interesting) by isj on Wednesday November 21 2018, @07:14PM

    by isj (5249) on Wednesday November 21 2018, @07:14PM (#764902) Homepage

    No, you definitely don't want to store it in a standard SQL database, even with full-text search.

    Look for "inverted indexes".

    Depending on the goal of your search engine you may be able to reduce the index size with:
        - lemmatization
        - stemming
        - if word order doesn't matter then store occurrences only once per document
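
    A minimal sketch of an inverted index along those lines, in Python. The suffix stripper is a toy stand-in for a real stemmer or lemmatizer, and all function names and sample documents are made up for illustration. Because each term maps to a set of document ids, an occurrence is stored only once per document.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def naive_stem(token):
    """Toy suffix stripper; a real engine would use a proper stemmer or lemmatizer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def build_index(docs):
    """Map each stemmed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[naive_stem(token)].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = [naive_stem(t) for t in tokenize(query)]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "Crawlers index pages",
    2: "A crawler respects robots.txt",
    3: "Indexing pages takes storage",
}
index = build_index(docs)
hits = search(index, "indexing pages")
```

    Lookup cost depends on the length of the posting sets for the query terms, not on the total amount of text, which is why this beats scanning a full-text SQL column.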

    If the document set is relatively uniform (say, a set of scientific papers, or a set of children's books, but not a mix of both) then you can use the BM25 ranking algorithm to get reasonably good results.
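
    For reference, a small sketch of BM25 scoring in Python. It uses one common variant of the formula (the "+1" form of the IDF term) with the customary defaults k1=1.5 and b=0.75; those parameter values, the function name and the sample documents are assumptions for illustration, not anything Findx used.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score every tokenized document in docs against query_terms with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat around the garden".split(),
    "search engines rank documents".split(),
]
scores = bm25_scores(["cat"], docs)
```

    Both of the first two documents mention "cat" once, but the shorter one scores higher because of the length normalization; the third document scores zero.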

  • (Score: 1, Disagree) by Anonymous Coward on Wednesday November 21 2018, @11:23PM (1 child)

    by Anonymous Coward on Wednesday November 21 2018, @11:23PM (#764988)

    Google makes its own optimized hardware for its search engine. Like it or not, that is an advantage that a startup is going to have a hard time competing with. They are really fucking smart. If you are going to compete you're going to need to do something that is totally different from the ground up. That is something that can be done IMHO, but it is a hard nut to crack. And you sure as shit aren't going to crack it by taking an off the shelf SQL server and thinking you're going to make a search engine. Not gonna happen.

    You have to completely rethink not just search, but the underlying systems that make the Internet function. Google's competitors in the hardware/OS/application marketspace are already doing that, and they all seem to have decided that fracturing the network to kick Google off of it is the only way of competing with them. That is probably true for a big company, but not for a small one.

    • (Score: 0) by Anonymous Coward on Thursday November 22 2018, @12:07AM

      by Anonymous Coward on Thursday November 22 2018, @12:07AM (#764996)

      The only way to do it is to grow a true AI, such as a brain in a box, and train it to be smarter and more useful than any search engine or voice assistant ever conceived of.

      Of course, Google could do it first.

  • (Score: 2) by eravnrekaree on Thursday November 22 2018, @05:12PM (1 child)

    by eravnrekaree (555) on Thursday November 22 2018, @05:12PM (#765250)

    The problem with a run-of-the-mill MySQL setup is that anything that uses a single monolithic SQL server would crash and burn under the load. It would need to be highly distributed; the term is "scale out": hundreds of SQL server instances over a farm. That means a lot of load balancing, mirroring, slicing, distribution, etc. in a kind of mesh architecture that avoids a single point of load. Sharding has been used to deal with these sorts of things as well. SQL itself may not even be the big problem; the big problem is the implementations of it, which usually revolve around a single monolithic server or a few, and where replication is primitive. Most of the research into scale-out, however, has gone into NoSQL databases.

    There is Gigablast which is on GitHub that did implement an open source search engine.

    • (Score: 3, Informative) by isj on Thursday November 22 2018, @05:34PM

      by isj (5249) on Thursday November 22 2018, @05:34PM (#765252) Homepage

      There is Gigablast which is on GitHub that did implement an open source search engine.

      Findx used the gigablast open-source-search-engine. In hindsight that was a mistake. Email me if you want details. I'm not going to rant about it here.

      Regarding sharding: Yes, you have to shard. You also have to build in the assumption that shards will fail, so you need redundancy and a way to deal with inconsistencies.
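
      A toy sketch of that idea in Python, assuming hash-based shard assignment and a fixed number of replicas per shard. The Cluster class, the shard and replica counts, and the in-memory dicts standing in for nodes are all hypothetical; a real system would also need to repair replicas that missed writes.

```python
import hashlib

NUM_SHARDS = 4  # assumed values, for illustration only
REPLICAS = 2

def shard_for(key, num_shards=NUM_SHARDS):
    """Pick a shard from a stable hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

class Cluster:
    def __init__(self, num_shards=NUM_SHARDS, replicas=REPLICAS):
        # stores[shard][replica] is a plain dict standing in for one node's storage.
        self.stores = [[{} for _ in range(replicas)] for _ in range(num_shards)]
        self.down = set()  # (shard, replica) pairs that are currently unavailable

    def put(self, key, value):
        """Write to every live replica of the key's shard; return how many were written."""
        shard = shard_for(key, len(self.stores))
        written = 0
        for r, store in enumerate(self.stores[shard]):
            if (shard, r) not in self.down:
                store[key] = value
                written += 1
        if written == 0:
            raise RuntimeError("no live replica for shard %d" % shard)
        return written

    def get(self, key):
        """Read from the first live replica that has the key, skipping failed or stale ones."""
        shard = shard_for(key, len(self.stores))
        for r, store in enumerate(self.stores[shard]):
            if (shard, r) in self.down:
                continue
            if key in store:
                return store[key]
        return None

cluster = Cluster()
cluster.put("example.com/a", "doc A")
shard = shard_for("example.com/a")
cluster.down.add((shard, 0))  # simulate the first replica failing
value = cluster.get("example.com/a")
```

      The read still succeeds because the second replica holds a copy. A write made while one replica is down leaves that replica stale, which is the kind of inconsistency that has to be detected and repaired later.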

  • (Score: 2) by quietus on Thursday November 22 2018, @06:18PM

    by quietus (6328) on Thursday November 22 2018, @06:18PM (#765268) Journal

    Start with the book Introduction to Information Retrieval (Manning, Raghavan and Schütze, Cambridge University Press). Study graph theory & algorithms thoroughly*. After that, some (most?) of the latest research around the subject of Information Retrieval/Search (engines) can be found, publicly available, here [iw3c2.org] (the older material can be directly accessed, but the newer material requires ACM membership).

    If you want to delve a bit deeper into the mathematical background of graph theory or anything else [mathematically] interesting you encounter, start out with Dover Press books.