While people are starting to understand the importance of privacy, it is a major hurdle to get them to switch to a different search engine.
Search engines eat resources like crazy, so operating costs are non-negligible.
Some sites (e.g. GitHub) use a whitelist in robots.txt, blocking new crawlers.
The amount of spam, link-farms, referrer-linking, etc. is beyond your worst nightmare.
Returning good results takes a long time to fine-tune.
Monetizing is nearly impossible because advertising networks want to know everything about the users, going against privacy concerns.
Buying search results from other search engines is impossible until you have at least x million searches/month. Getting x million searches/month is impossible unless you buy search results from other search engines (or sink a lot of cash into building the index yourself).
So what do you soylentils think can be done to increase privacy for ordinary users, search-engine-wise?
No, you definitely don't want to store it in a standard SQL database, even with full-text search.
Look for "inverted indexes".
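For anyone who hasn't met the term: an inverted index maps each term to the documents (and optionally positions) where it occurs, so a query only touches the postings for its own terms. A rough sketch in Python, assuming a naive whitespace tokenizer and in-memory dicts; the names `tokenize`, `build_index` and `search` are made up for illustration, and a real engine would use compressed on-disk postings instead.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'()"

def tokenize(text):
    # Naive tokenizer (an assumption): split on whitespace, strip punctuation, lowercase.
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def build_index(docs):
    # term -> list of (doc_id, position) postings
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].append((doc_id, pos))
    return index

def search(index, query):
    # AND semantics: return the ids of documents containing every query term.
    terms = tokenize(query)
    if not terms:
        return set()
    result = None
    for term in terms:
        ids = {doc_id for doc_id, _ in index.get(term, [])}
        result = ids if result is None else result & ids
    return result

docs = {
    1: "The quick brown fox",
    2: "The lazy dog",
    3: "Quick dog tricks",
}
index = build_index(docs)
print(search(index, "quick dog"))
```

Lookup cost depends on the length of the postings lists for the query terms, not on the total number of documents, which is the main reason to prefer this over scanning rows in a general-purpose database.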
Depending on the goal of your search engine you may be able to reduce the index size with:
- lemmatization
- stemming
- if word order doesn't matter then store occurrences only once per document
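To give a feel for how much these tricks save, here is a toy comparison in Python. Heavy caveat: `crude_stem` is a hypothetical suffix-stripper written just for this example, not a real stemmer or lemmatizer (Porter/Snowball-style stemmers are the usual choice), and the two sample documents are invented.

```python
from collections import defaultdict

def crude_stem(word):
    # Toy stemmer (assumption): strip a few common English suffixes,
    # keeping at least three characters of the word.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def build_positional_index(docs):
    # Baseline: one posting per occurrence, no stemming.
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index

def build_compact_index(docs):
    # Reduced: stem each word and store each document at most once per term.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[crude_stem(word)].add(doc_id)
    return index

def posting_count(index):
    return sum(len(postings) for postings in index.values())

docs = {
    1: "crawl crawling crawled links",
    2: "crawls links links indexed",
}
positional = build_positional_index(docs)
compact = build_compact_index(docs)
print(len(positional), posting_count(positional))
print(len(compact), posting_count(compact))
```

The compact index has fewer distinct terms and fewer postings, but it can no longer answer phrase or proximity queries, which is why this only works when word order doesn't matter.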
If the document set is relatively uniform (say, a set of scientific papers, or a set of children's books, but not a mix of both) then you can use the BM25 ranking algorithm to get reasonably good results.
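For reference, a minimal BM25 scorer looks roughly like this. It is a sketch under assumptions: whitespace tokenization, the commonly quoted defaults k1=1.5 and b=0.75 (tune them for your corpus), and the "+1 inside the log" IDF variant that keeps scores non-negative. The function name `bm25_scores` and the sample documents are made up.

```python
import math
from collections import Counter

def bm25_scores(docs, query, k1=1.5, b=0.75):
    # docs: doc_id -> text. Returns doc_id -> BM25 score for the query.
    if not docs:
        return {}
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    query_terms = query.lower().split()
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores[doc_id] = score
    return scores

docs = {
    "a": "inverted index search engine",
    "b": "search engine privacy",
    "c": "children book about a dog",
}
scores = bm25_scores(docs, "inverted index")
for doc_id in sorted(scores, key=scores.get, reverse=True):
    print(doc_id, round(scores[doc_id], 3))
```

The length normalization is the part that assumes a uniform collection: it compares each document against the average length, which means little when the corpus mixes very different kinds of documents.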
(Score: 4, Interesting) by isj on Wednesday November 21 2018, @07:14PM