Wednesday, August 8, 2012

Statistics is hard: Type 1 error edition

Matt Yglesias notes that data dredging is a major potential downside to "Big Data":

Given a nice big dataset and a good computer, you can come up with any number of correlations that hold up at a 95 percent confidence interval, about 1 in 20 of which will be completely spurious.
But that isn't really right at all. The share of discovered correlations that are spurious is probably far higher than 1 in 20.

Let's start with where the 5% spurious figure comes from. If we test a true null hypothesis that there is no linear correlation between two variables at the 5% level, then we will find a "significant" correlation between the two uncorrelated variables about 5% of the time. If we test a bunch of true null hypotheses that pairs of variables are uncorrelated, about 5% of those pairs will show up in our list of significantly correlated pairs.
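To see that in action, here is a minimal simulation sketch (the sample size, number of pairs, and use of numpy/scipy are my own illustrative choices): generate many pairs of genuinely uncorrelated variables, test each pair at the 5% level, and count how often a "significant" correlation turns up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_pairs = 10_000   # truly uncorrelated variable pairs to test
n_obs = 100        # observations per variable (arbitrary choice)
alpha = 0.05       # significance level

false_positives = 0
for _ in range(n_pairs):
    x = rng.normal(size=n_obs)
    y = rng.normal(size=n_obs)          # independent of x by construction
    r, p_value = stats.pearsonr(x, y)   # test for linear correlation
    if p_value < alpha:
        false_positives += 1

# Roughly 5% of the true null hypotheses get rejected.
print(false_positives / n_pairs)        # ~0.05
```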

But what percentage of the correlations on the list are spurious?

100% by assumption!

The key parameter for estimating how many of the correlations are spurious is the proportion of pairs that are truly uncorrelated; call it p. The other thing that matters is how often true correlations are detected: the power function of the test, together with the distribution of correlation coefficients among the variables that really are correlated.

Let's assume for simplicity that all false null hypotheses are rejected 100% of the time. Then we can use Bayes rule to figure out the percentage of spurious correlations:

P(spurious | null rejected) = P(null rejected | spurious) P(spurious) / P(null rejected)

From earlier, P(spurious) = p and P(null rejected | spurious) = 0.05.

P(null rejected) = correct rejections of the 1 - p false nulls plus incorrect rejections of 5% of the p true nulls = (1 - p) + 0.05p

P(spurious | null rejected) = 0.05p / (1 - p + 0.05p)
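Plugging a few values of p into that formula makes the point. This is just a quick sketch; the values of p are illustrative, and it keeps the assumption above that every real correlation is detected with certainty.

```python
# Fraction of "discovered" correlations that are spurious, assuming
# every real correlation is detected (power = 1).
def spurious_share(p, alpha=0.05):
    return alpha * p / ((1 - p) + alpha * p)

for p in [0.5, 20/39, 0.9, 0.99, 0.999]:
    print(f"p = {p:.3f} -> {spurious_share(p):.1%} of findings spurious")

# p = 0.500 ->  4.8% of findings spurious
# p = 0.513 ->  5.0% of findings spurious   (the 20/39 threshold)
# p = 0.900 -> 31.0% of findings spurious
# p = 0.990 -> 83.2% of findings spurious
# p = 0.999 -> 98.0% of findings spurious
```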

If p > 20/39, then more than 1 in 20 of the correlations we find will be spurious. And p close to 1 seems most plausible, since with a big dataset we can test all sorts of bizarre correlations, like the relation between the number of vowels in your last name and cheddar consumption on a given day--for each day of the year. That alone gives us 365 hypotheses to test, we could add 365 more for each other type of cheese, and that is just the tip of the iceberg. How many of those hypotheses reflect a genuine relationship?
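To put a rough number on it: even if none of those 730 vowel-and-cheese hypotheses were real, a 5% test would still hand us about 0.05 × 730 ≈ 36 "significant" correlations. A minimal simulation sketch, with the sample size and the assumption of zero real effects chosen purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_tests = 730   # 365 days of cheddar plus 365 of a second cheese
n_people = 200  # hypothetical sample size

vowels = rng.integers(1, 6, size=n_people)       # vowels in last name
hits = 0
for _ in range(n_tests):
    consumption = rng.normal(size=n_people)      # unrelated to vowels
    r, p_value = stats.pearsonr(vowels, consumption)
    hits += p_value < 0.05

# Expect roughly 36 spurious "discoveries", even though none of the
# 730 tested relationships is real.
print(hits)
```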

