If a scientist wanted to forecast ocean currents to understand how pollution travels after an oil spill, she could use a common approach that looks at currents traveling between 10 and 200 kilometers. Or, she could choose a newer model that also includes smaller-scale currents. This might be more accurate, but it could also require learning new software or running new computational experiments. How can she know whether the new method will be worth the time, cost, and effort?
A new approach developed by MIT researchers could help data scientists answer this question, whether they are looking at statistics on ocean currents, violent crime, children's reading ability, or any number of other types of datasets.
The team created a new measure, known as the "c-value," that helps users choose between techniques by answering the question: "Is it likely that the new method is more accurate for this data than the common approach?"
Traditionally, statisticians compare methods by averaging a method's accuracy across all possible datasets. But just because a new method is better on average across all datasets doesn't mean it will actually provide a better estimate on one particular dataset. Averages are not application-specific.
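The gap between average-case and dataset-specific accuracy can be seen in a classic setting related to the one the paper studies: estimating a multivariate normal mean. The James-Stein shrinkage estimator has lower *expected* squared error than the raw observation whenever the dimension is at least 3, yet on any individual dataset it can still lose. The simulation below is a minimal illustration of that point, not code from the paper; the dimension, prior, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10            # dimension of the unknown mean vector
n_datasets = 2000

def james_stein(y):
    """Positive-part James-Stein shrinkage estimate of a d-dimensional
    normal mean with unit variance, shrinking toward the origin."""
    dim = y.size
    shrink = max(0.0, 1.0 - (dim - 2) / (y @ y))
    return shrink * y

wins = 0
for _ in range(n_datasets):
    theta = rng.normal(0, 2, size=d)   # true mean, drawn fresh each run
    y = theta + rng.normal(size=d)     # one noisy observation of theta
    err_raw = np.sum((y - theta) ** 2)              # raw estimate: y itself
    err_js = np.sum((james_stein(y) - theta) ** 2)  # shrinkage estimate
    wins += err_js < err_raw

print(f"James-Stein beat the raw observation on {wins}/{n_datasets} datasets")
```

Even though James-Stein dominates in average risk, `wins` comes out strictly between 0 and `n_datasets`: the "better on average" method loses on a nontrivial fraction of individual datasets, which is exactly the gap the c-value is meant to address.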
So, researchers from MIT and elsewhere created the c-value, which is a dataset-specific tool. A high c-value means it is unlikely a new method will be less accurate than the original method on a specific data problem.
[...] The c-value is designed to help with data problems in which researchers seek to estimate an unknown parameter using a dataset, such as estimating average student reading ability from a dataset of assessment results and student survey responses. A researcher has two estimation methods and must decide which to use for this particular problem.
[...] "In our case, we are assuming that you conservatively want to stay with the default estimator, and you only want to go to the new estimator if you feel very confident about it. With a high c-value, it's likely that the new estimate is more accurate. If you get a low c-value, you can't say anything conclusive. You might have actually done better, but you just don't know," Broderick explains.
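The conservative decision rule Broderick describes can be sketched in a few lines. Note that this shows only how a c-value would be *used* once computed; computing the c-value itself requires the derivations in the paper and is not shown. The function name and the 0.95 threshold are illustrative assumptions, not from the source.

```python
def choose_estimate(default_est, new_est, c_value, threshold=0.95):
    """Conservative selection rule: keep the default estimate unless the
    c-value gives high confidence that the new estimate is more accurate.

    Illustrative sketch only -- the c-value computation itself is
    problem-specific and comes from the paper's theory, not this code.
    """
    if c_value >= threshold:
        return new_est       # high c-value: new estimate likely more accurate
    return default_est       # low c-value: inconclusive, stay conservative
```

A low c-value does not mean the new method is worse, only that nothing conclusive can be said, so the rule falls back to the default.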
The ultimate goal is to create a measure that is general enough for many more data analysis problems, and while there is still a lot of work to do to realize that objective, Broderick says this is an important and exciting first step in the right direction.
Journal Reference:
Brian L. Trippe, Sameer K. Deshpande, & Tamara Broderick, Confidently Comparing Estimates with the c-value [open], J. Am. Stat. Assoc., 2023. DOI: https://doi.org/10.1080/01621459.2022.2153688
(Score: 1, Interesting) by Anonymous Coward on Monday March 13 2023, @03:37PM
There's a space next to the Akaike information criterion, adjusted R², and robust estimators for this new c-value.
(Score: 2, Interesting) by shrewdsheep on Monday March 13 2023, @03:55PM
The practical impact of this paper is probably minimal. Having looked at the paper, the approach assumes a parametric model, i.e., you assume you know the distribution up to a parameter (a few numbers). They then use the well-known example of a multivariate normal to construct a better estimate. The c-value cannot be computed generically for a new estimator; it requires theoretical work to be derived every time. I have not read the paper in full, so maybe they discuss this issue.
(Score: 1, Funny) by Anonymous Coward on Monday March 13 2023, @06:24PM (1 child)
When their partner has failed to climax.
(Score: 0) by Anonymous Coward on Monday March 13 2023, @11:21PM
Partner? You must be doing it wrong.