While people are starting to understand the importance of privacy, it is a major hurdle to get them to switch to a different search engine.
Search engines eat resources like crazy, so operating costs are non-negligible.
Some sites (including e.g. GitHub) use a whitelist in robots.txt, blocking new crawlers.
The amount of spam, link-farms, referrer-linking, etc. is beyond your worst nightmare.
Returning good results takes a long time to fine-tune.
Monetizing is nearly impossible because advertising networks want to know everything about the users, going against privacy concerns.
Buying search results from other search engines is impossible until you have at least x million searches/month. Getting x million searches/month is impossible unless you buy search results from other search engines (or sink a lot of cash into building the index yourself).
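To make the robots.txt point concrete, here is a minimal sketch of how a whitelist-style robots.txt shuts out a crawler nobody has heard of. The file contents and the crawler name ExampleNewBot are made up for illustration (this is not GitHub's actual robots.txt); the parsing uses Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist-style robots.txt: named crawlers get an empty
# Disallow (meaning "everything is allowed"), everyone else hits the
# catch-all block and is denied the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some/page"
    for agent in ("Googlebot", "bingbot", "ExampleNewBot"):
        print(agent, can_crawl(ROBOTS_TXT, agent, page))
```

A polite new crawler that honours such a file never gets to index the site at all, no matter how well it behaves.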
So what do you soylentils think can be done to increase privacy for ordinary users, search-engine-wise?
(Score: 2) by bobthecimmerian on Wednesday November 21 2018, @09:01PM
It's easy to downplay the difficulty of getting search right. But to me two examples of the skill behind Google search are searches for software code problems, and searches for sales of items that aren't tremendously mainstream. There are probably five hundred mainstream products for which searching on Google directly vs searching on Amazon.com, Walmart.com, and eBay.com gives identical results, but for anything outside that set Google is reasonably relevant and the rest go right off the rails. Bing doesn't hold up, either.
I'm not defending the company's business model or ethics. I'm just saying that matching them at their own game is not trivial.
(Score: 4, Interesting) by istartedi on Wednesday November 21 2018, @09:30PM
"It's easy to downplay the difficulty of getting search right"
Fair enough. Matching Google step-for-step would be daunting so why try? Right now I can go to Google and type "How high is the Burj Khalifa" and it comes back with "2,717′, 2,722′ to tip" in a heartbeat.
That's pretty smart. A lot's happening under the hood to make that happen because it's capable of answering very generic queries like that. You don't even have to think.
OTOH, if we had a hierarchy of something like /lists/buildings/tall, or /architecture/tall buildings we could find some pages devoted to this kind of thing and probably get the height easily--it would just take a bit more thought and time on the part of the end user.
It would definitely be the kind of trade-off that people make all the time when they go "off the grid" to some degree. Growing a garden vs. fresh veggies from the store.
(Score: 3, Interesting) by isj on Thursday November 22 2018, @12:06AM
"I'm just saying that matching them at [google's] own game is not trivial."
I agree. That doesn't mean that there isn't room for improvement in Google's results.
Examples I can think of that I encountered in my work at Findx:
Bias toward shops
If you search for a single word, e.g. "plasterboard", the Google results will have a strong bias toward shops where you can buy it. No reviews. No building codes. No evaluation by consumer organisations. So if you search for a single word, Google thinks you want to buy stuff.
Still vulnerable to SEO
An acquaintance noticed that Google never showed links to where you could buy the cheapest plasterboards. So apparently the sites with SEO and link farms made it to page 1 every time, but the most useful link for the user was buried on page 3. There isn't much quality difference in plasterboards, so wouldn't the cheapest be the "best" result?
Handling of compound words
I noticed that Google's handling of compound words isn't that great. They claim they solved "the Swedish problem" (which is what they called the compound words challenge) in 2006. But I recently saw that a newspaper's front page had a new compound word in an article link, and the article had the compound word in a different inflection. Google did have the main article crawled (verified by searching for other unique words), but couldn't find it using the compound word. Only after 3 days did it work. I'm not sure what is going on there, but I have a suspicion that analyzing compound words and generating inflections is done offline and in batch, and there is some lag there. If you're curious, it was the Danish word "smølfedomptør".
Old documents ignored?
I noticed that Findx could find an old Usenet post that Google couldn't find. It was a 10-year-old post made available on a webpage. No clue why Google didn't find it. So Google apparently doesn't crawl everything, or they drop old documents.
Apparently doesn't use third-party quality indicators
When looking to buy something, Google apparently doesn't use third-party quality seals/approval/badges (at least we couldn't find any indications that it does). Many countries have consumer organisations that provide badges to well-behaved webshops. That is a useful ranking parameter.
One more note on compound words: If you want to handle Danish/Norwegian/German/Swedish/Icelandic/Finnish/Russian (and to some extent Italian) you have to deal with compound words. Findx solved it for Danish using a morphological dictionary (STO [cst.ku.dk]). I did some (incomplete) analysis of Danish webpages and it seemed that 10-30% of the unique words were compound words made on the spot. So you can never have a complete dictionary for languages that easily form compounds, and you have to deal with them in some other way.
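As a rough illustration of what "some other way" could look like, here is a minimal sketch of dictionary-based compound splitting. This is not Findx's actual implementation: the tiny lexicon, the linking elements ("s" and "e") and the minimum part length are all assumptions made for the example, and a real system would load a proper morphological dictionary instead.

```python
from functools import lru_cache
from typing import List, Optional, Tuple

# Toy lexicon (assumption); a real system would load a morphological dictionary.
LEXICON = frozenset({"smølf", "domptør", "gips", "plade", "bil", "dør"})
# Linking elements that may sit between two parts (assumed for illustration).
LINKERS = ("", "s", "e")
# Shortest part we accept, to avoid splitting into meaningless fragments.
MIN_PART = 3


def split_compound(word: str) -> Optional[List[str]]:
    """Split word into known lexicon parts, or return None if it cannot be done."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def solve(start: int) -> Optional[Tuple[str, ...]]:
        rest = word[start:]
        if rest in LEXICON:
            return (rest,)
        # Try every known head, then an optional linker, then recurse on the tail.
        for end in range(start + MIN_PART, len(word) - MIN_PART + 1):
            head = word[start:end]
            if head not in LEXICON:
                continue
            for linker in LINKERS:
                if not word.startswith(linker, end):
                    continue
                tail = solve(end + len(linker))
                if tail is not None:
                    return (head,) + tail
        return None

    parts = solve(0)
    return list(parts) if parts is not None else None


if __name__ == "__main__":
    for w in ("smølfedomptør", "gipsplade", "xyzzy"):
        print(w, split_compound(w))
```

An indexer could then store a never-before-seen compound under each of its parts as well as under the full word, so a query for one part still finds the page without the compound ever being in the dictionary.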
(Score: 2) by bobthecimmerian on Sunday November 25 2018, @03:41PM
Thanks for the detailed response. Everything you wrote makes sense. For what it's worth, I'm sorry Findx failed. I too was unaware of it, and I had tried YaCy and Searx and a few other options that have since disappeared.