SoylentNews Comments | The Exaggerated Promise of So-Called Unbiased Data Mining

The Exaggerated Promise of So-Called Unbiased Data Mining

posted by chromas on Monday January 14 2019, @02:22AM

from the Surely-you-jest,-Dr.-Feynman dept.

Probably not that good of an article, but it actually exists, only at Wired, so it is certain that it probably is worth reading. But only if you go in with no preconceptions.

Nobel laureate Richard Feynman once asked his Caltech students to calculate the probability that, if he walked outside the classroom, the first car in the parking lot would have a specific license plate, say 6ZNA74. Assuming every number and letter are equally likely and determined independently, the students estimated the probability to be less than 1 in 17 million. When the students finished their calculations, Feynman revealed that the correct probability was 1: He had seen this license plate on his way into class. Something extremely unlikely is not unlikely at all if it has already happened.
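As a rough check on the students' figure, here is a minimal sketch that counts the plates sharing the digit/letter pattern of 6ZNA74. The pattern (digit, three letters, two digits), the 10-digit and 26-letter alphabets, and the helper name `plate_combinations` are assumptions for illustration; the article does not say exactly how the students set up the calculation.

```python
from fractions import Fraction

# Assumption: every plate follows the same digit/letter layout as the example.
PLATE = "6ZNA74"

def plate_combinations(plate: str) -> int:
    """Count equally likely plates that share this plate's digit/letter pattern."""
    total = 1
    for ch in plate:
        # 10 choices for a digit position, 26 for a letter position.
        total *= 10 if ch.isdigit() else 26
    return total

combos = plate_combinations(PLATE)
prior = Fraction(1, combos)  # chance of naming this plate before looking

print(combos)  # 17576000
print(prior < Fraction(1, 17_000_000))  # True

# Once the plate has been seen, nothing is left to estimate.
observed = Fraction(1, 1)
```

Under those assumptions the count is 10 x 26^3 x 10^2 = 17,576,000, which matches "less than 1 in 17 million". Feynman's point is that this number only describes a guess made before looking.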

Bayesian probability is all well and good, until it runs up against actuality. But the point here is all about having a Beautiful Mind or π, and seeing patterns everywhere, and how if you see them in Big Data, the patterns are bigger. But no less crazy.

The Feynman trap—ransacking data for patterns without any preconceived idea of what one is looking for—is the Achilles heel of studies based on data mining. Finding something unusual or surprising after it has already occurred is neither unusual nor surprising. Patterns are sure to be found, and are likely to be misleading, absurd, or worse.

This approach to "science" can certainly lead to interesting results, as in this particular study:

A standard neuroscience experiment involves showing a volunteer in an MRI machine various images and asking questions about the images. The measurements are noisy, picking up magnetic signals from the environment and from variations in the density of fatty tissue in different parts of the brain. Sometimes they miss brain activity; sometimes they suggest activity where there is none.
A Dartmouth graduate student used an MRI machine to study the brain activity of a salmon as it was shown photographs and asked questions. The most interesting thing about the study was not that a salmon was studied, but that the salmon was dead. Yep, a dead salmon purchased at a local market was put into the MRI machine, and some patterns were discovered. There were inevitably patterns—and they were invariably meaningless.
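To see why noisy measurements alone can "suggest activity where there is none", here is a small simulation sketch. It is not the Dartmouth analysis: the voxel count, the number of scans, the Gaussian noise, and the threshold of 2.0 are all made-up assumptions, and `count_false_activations` is a hypothetical helper.

```python
import random
import statistics

def count_false_activations(n_voxels=5000, n_scans=20, threshold=2.0, seed=1):
    """Count pure-noise 'voxels' whose task-vs-rest statistic looks active."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_voxels):
        # Both conditions are drawn from the same noise: there is no real signal.
        rest = [rng.gauss(0.0, 1.0) for _ in range(n_scans)]
        task = [rng.gauss(0.0, 1.0) for _ in range(n_scans)]
        diff = statistics.fmean(task) - statistics.fmean(rest)
        # Standard error of the difference in means (Welch-style).
        se = ((statistics.variance(task) + statistics.variance(rest)) / n_scans) ** 0.5
        if abs(diff / se) > threshold:
            hits += 1
    return hits

false_hits = count_false_activations()
print(f"{false_hits} of 5000 noise-only voxels crossed the threshold")
```

With a cutoff near the usual 5% level, a few hundred of the 5000 signal-free voxels should still cross it, so some "pattern" is all but guaranteed unless the analysis corrects for the number of tests.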

Brings to mind (brains!) a certain Irish myth of the Salmon of Knowledge, and the parallel formation of the posthumous Salmon of Doubt by Douglas Adams.

The problem has become endemic nowadays because powerful computers are so good at plundering Big Data. Data miners have found correlations between Twitter words or Google search queries and criminal activity, heart attacks, stock prices, election outcomes, Bitcoin prices, and soccer matches. You might think I am making these examples up. I am not.
There are even stronger correlations with purely random numbers. It is Big Data Hubris to think that data-mined correlations must be meaningful. Finding an unusual pattern in Big Data is no more convincing (or useful) than finding an unusual license plate outside Feynman's classroom.
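The claim about purely random numbers is easy to try. The sketch below searches many random series for the one that best matches a random target. The series length, the number of candidates, and the function names are illustrative assumptions, not anything taken from the article.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def strongest_random_correlation(n_points=20, n_candidates=5000, seed=0):
    """Return the largest-magnitude correlation between a random target
    and any of n_candidates independent random series."""
    rng = random.Random(seed)
    target = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        r = pearson(target, candidate)
        if abs(r) > abs(best):
            best = r
    return best

best_r = strongest_random_correlation()
print(f"strongest correlation found among pure noise: {best_r:.2f}")
```

None of the series is related to the target, yet the best of 5000 will typically show a correlation that would look impressive if reported alone. The search itself produces the "finding".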

New Myth: Big Data and the MRIed Dead Salmon of Pattern Imagination.

Re:Overfitting (Score: 3, Interesting) by Arik (4543) on Monday January 14 2019, @01:31PM (#786428)

"Unless you're the almighty Gosh, _we_ don't know whether it will be heads or tails. We don't try to know what the next of your coin tosses will be."

Yes, that's what I said.

"It feels like you're convoluting the issue."

That may be, but in fact I'm doing the opposite.

"Big Data works only in Big Generalities. It works in statistics, not in absolutes."

Very true. And that's problematic because most people don't understand statistics. I mean, frankly, *I* definitely don't understand statistics. Not completely, or anything like it. But I have a basic grounding, and with that, it's quite conspicuous that most people really do not have even that. But what makes it worse is virtually everyone *thinks* they understand it. All you have to do to see this being used to manipulate people is turn on the tv, or turn off adblock. Or listen to just about any politician or political candidate. Fundamental confusions related to statistics are reliable tools in the hands of marketers who probably don't even understand what they are doing themselves.

"Its promise is never that it will be reality, all anyone is hoping for is that it will be a good match, the more data we give it"

Oh, no, the press releases tend to cross the line. But even the more believable claim is still suspect, likely false. They're still limited by GIGO. This is cargo cult statistics, really, just keep throwing bad data into the mix and hope the algorithm magically turns it good. It doesn't work that way. More garbage in just makes for more garbage out.

"The Stallman quote is a bad one."

You mean the Feynman quote?

"He said probability one -- not accounting for someone leaving early, or even leaving before him, or being late out of the previous class and not having left yet. The probability is not one, and we/he do not know whether the car is currently there or not, or will be there at the end of class or not."

Yes, we do, as the instance to which he referred had already occurred - he knew, and we know, that the probability was indeed 1.00 - actual fact versus speculative estimation.

"The big data that we use is only as biased as the data provided, and that's the best anyone can do."

??? No it's not, and you're not making any sense. You don't need huge amounts of data to do a good statistical analysis - relatively small datasets are not a problem if they are clean. The real challenge is cleaner data and more accurate analysis.
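A quick way to see the point about clean versus large datasets is a toy simulation. Everything here is an assumption made for illustration: the true mean of 0, the Gaussian noise, the bias of 0.5 standing in for "garbage", and the helper `mean_abs_error`.

```python
import random

def mean_abs_error(sample_size, bias, trials=50, seed=2):
    """Average error when estimating a true mean of 0 from samples
    that all carry the same systematic bias."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) + bias for _ in range(sample_size)]
        total += abs(sum(sample) / sample_size)
    return total / trials

small_clean = mean_abs_error(sample_size=100, bias=0.0)
large_dirty = mean_abs_error(sample_size=10_000, bias=0.5)

print(f"100 clean points:     typical error {small_clean:.3f}")
print(f"10,000 biased points: typical error {large_dirty:.3f}")
```

More data shrinks the random scatter but leaves the bias untouched, so the large dirty sample converges confidently on the wrong answer while the small clean one lands close to the truth.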

--
If laughter is the best medicine, who are the best doctors?

Starting Score:	1		point
Moderation		+1
Interesting=1, Disagree=1, Total=2
Extra 'Interesting' Modifier		0
Karma-Bonus Modifier		+1

Total Score:		3
