While people are starting to understand the importance of privacy, it is still a major hurdle to get them to switch to a different search engine.
Search engines eat resources like crazy, so operating costs are non-negligible.
Some sites (e.g. GitHub) use a whitelist in robots.txt, blocking new crawlers.
The amount of spam, link-farms, referrer-linking, etc. is beyond your worst nightmare.
Returning good results takes a long time to fine-tune.
Monetizing is nearly impossible because advertising networks want to know everything about the users, going against privacy concerns.
Buying search results from other search engines is impossible until you have at least x million searches/month. Getting x million searches/month is impossible unless you buy search results from other search engines (or sink a lot of cash into making it yourself).
So what do you soylentils think can be done to increase privacy for ordinary users, search-engine-wise?
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
(Score: 0) by Anonymous Coward on Wednesday November 21 2018, @06:58PM (6 children)
If someone today wanted to set up their own search engine on a university server (assuming bandwidth is no object), what are some good algorithms and techniques to manage spidering without infinite recursion or doing something stupid to piss off server owners, identification of spammy sites, storage, and searching your storage? You're sure as hell not using SQL to store the results, it has to be something custom. I once had a fulltext indexed field that took a minute to return a response.
Can someone recommend papers, design talks, code examples?
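On the spidering part specifically, the loop-avoidance and politeness pieces are small enough to sketch. This is an illustrative skeleton, not code from any real crawler: a visited set of canonicalized URLs prevents infinite recursion, robots.txt is cached per host, and a per-host delay avoids hammering anyone's server.

```python
import time
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse

class Crawler:
    """Breadth-first crawler skeleton: a visited set prevents infinite
    recursion, robots.txt is honored, and a per-host delay keeps the
    crawler polite."""

    def __init__(self, delay=1.0):
        self.visited = set()     # canonical URLs already seen
        self.frontier = deque()  # URLs waiting to be fetched
        self.robots = {}         # cached robots.txt parser per host
        self.last_hit = {}       # last fetch time per host
        self.delay = delay

    def allowed(self, url):
        host = urlparse(url).netloc
        if host not in self.robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"https://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                rp.allow_all = True  # unreachable robots.txt: allow
            self.robots[host] = rp
        return self.robots[host].can_fetch("mybot", url)

    def enqueue(self, base, href):
        url, _ = urldefrag(urljoin(base, href))  # resolve, drop #fragment
        if url not in self.visited:
            self.visited.add(url)  # mark at enqueue time, not fetch time
            self.frontier.append(url)

    def throttle(self, url):
        host = urlparse(url).netloc
        wait = self.last_hit.get(host, 0) + self.delay - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self.last_hit[host] = time.monotonic()
```

Canonicalizing before the visited-set check (resolving relative links, stripping fragments) is what actually stops the infinite recursion; without it, `/page`, `/page#top` and `./page` count as three different URLs.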
(Score: 4, Interesting) by isj on Wednesday November 21 2018, @07:14PM
No, you definitely don't want to store it in a standard SQL database, even with full-text search.
Look for "inverted indexes".
Depending on the goal of your search engine you may be able to reduce the index size with:
- lemmatization
- stemming
- if word order doesn't matter then store occurrences only once per document
If the document set is relatively uniform (say, a set of scientific papers, or a set of children's books, but not a mix of both), then you can use the BM25 ranking algorithm to get reasonably good results.
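A toy sketch of that combination — an inverted index queried with BM25 scoring — fits in a few lines. The k1 and b values below are the common textbook defaults, not tuned numbers:

```python
import math
from collections import Counter, defaultdict

class BM25Index:
    """Toy inverted index with BM25 ranking."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> length in tokens

    def add(self, doc_id, text):
        tokens = text.lower().split()      # real systems stem/lemmatize here
        self.doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        if not self.doc_len:
            return []
        n = len(self.doc_len)
        avg_len = sum(self.doc_len.values()) / n
        scores = defaultdict(float)
        for term in query.lower().split():
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + (n - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: -kv[1])
```

Note that with this naive tokenizer a query for "index" will not match a document containing "indexes" — exactly the gap that the stemming/lemmatization steps above close.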
(Score: 1, Disagree) by Anonymous Coward on Wednesday November 21 2018, @11:23PM (1 child)
Google makes its own optimized hardware for its search engine. Like it or not, that is an advantage that a startup is going to have a hard time competing with. They are really fucking smart. If you are going to compete you're going to need to do something that is totally different from the ground up. That is something that can be done IMHO, but it is a hard nut to crack. And you sure as shit aren't going to crack it by taking an off the shelf SQL server and thinking you're going to make a search engine. Not gonna happen.
You have to completely rethink not just search, but the underlying systems that make the Internet function. Google's competitors in the hardware/OS/application marketspace are already doing that, and they all seem to have decided that fracturing the network to kick Google off of it is the only way of competing with them. That is probably true for a big company, but not for a small one.
(Score: 0) by Anonymous Coward on Thursday November 22 2018, @12:07AM
The only way to do it is to grow a true AI, such as a brain in a box, and train it to be smarter and more useful than any search engine or voice assistant ever conceived of.
Of course, Google could do it first.
(Score: 2) by eravnrekaree on Thursday November 22 2018, @05:12PM (1 child)
The problem with a run-of-the-mill MySQL setup is that anything built on a single monolithic SQL server would crash and burn under the load. It would need to be highly distributed; the term is "scale out": hundreds of SQL server instances across a farm, which means a lot of load balancing, mirroring, slicing, and distribution in a kind of mesh architecture that avoids a single point of load. Sharding has been used to deal with this sort of thing as well. SQL itself may not even be the big problem; the big problem is its implementations, which usually revolve around a single monolithic server (or a few) with primitive replication. Most of the scale-out research, however, has gone into NoSQL databases.
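The routing side of sharding is the easy part to sketch: each document ID hashes deterministically to one backend instance, so every frontend agrees on placement without any coordination. The shard names below are placeholders, not a real deployment:

```python
import hashlib

class ShardRouter:
    """Hash-based shard routing: a document ID maps deterministically
    to one of N backend instances."""

    def __init__(self, shards):
        self.shards = shards  # e.g. a list of connection strings

    def shard_for(self, doc_id):
        # Stable hash (unlike Python's hash(), which varies per process)
        h = hashlib.sha1(str(doc_id).encode()).digest()
        return self.shards[int.from_bytes(h[:8], "big") % len(self.shards)]

router = ShardRouter(["db-0", "db-1", "db-2", "db-3"])
target = router.shard_for("https://example.com/page")  # writes go to one shard
# Term queries, by contrast, fan out to all shards and merge the results.
```

The caveat with plain modulo is that adding a shard remaps almost every document; consistent hashing (or fixed virtual buckets reassigned to nodes) is the usual fix.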
There is Gigablast, an open-source search engine whose implementation is on GitHub.
(Score: 3, Informative) by isj on Thursday November 22 2018, @05:34PM
Findx used the Gigablast open-source-search-engine. In hindsight that was a mistake. Email me if you want details; I'm not going to rant about it here.
Regarding sharding: Yes, you have to shard. You also have to build in the assumptions that shards will fail, so you need redundancy and a way to deal with inconsistencies.
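The "shards will fail" assumption might look like this in miniature: each logical shard has several replicas, and a read skips dead ones instead of failing. Replicas are modeled as plain callables purely for illustration:

```python
class ReplicatedShard:
    """Redundancy per shard: read from the first replica that answers;
    a failed replica is skipped, not fatal."""

    def __init__(self, replicas):
        self.replicas = replicas  # callables standing in for replica nodes

    def read(self, term):
        errors = []
        for replica in self.replicas:
            try:
                return replica(term)
            except ConnectionError as e:
                errors.append(e)  # dead replica: try the next one
        raise RuntimeError(f"all replicas failed: {errors}")
```

The inconsistency problem mentioned above starts exactly here: once writes can land on some replicas and not others, the replicas diverge, so you also need versioning (or another reconciliation scheme) to decide which copy wins.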
(Score: 2) by quietus on Thursday November 22 2018, @06:18PM
Start with the book Introduction to Information Retrieval (Manning, Raghavan and Schütze, Cambridge Press). Study graph theory & algorithms thoroughly*. After that, some (most?) of the latest research around the subject of Information Retrieval/Search (engines) can be found, publicly available, here [iw3c2.org] (the older material can be directly accessed, but the newer material requires ACM membership).
If you want to delve a bit deeper into the mathematical background of graph theory or anything else [mathematically] interesting you encounter, start out with Dover Press books.