Suppose that a data mining situation has 40 features that attempt to predict a numerical target value, and individual regressions are run for all 40 of them, with a p-value of .05 as the threshold for significance. How many features would you expect to show significance at this level, even if none are in reality related at all to the target?
- I would really appreciate it if you could explain why as well. Thank you.
The threshold on the p-value is 0.05, so each coefficient is tested with a test of size α = 0.05. A type I error occurs when we conclude that a feature is significant even though it is not. The probability of a type I error is precisely the size of the test, α = 0.05.
Note that a type II error cannot occur here because, in reality, the alternative hypothesis (that the feature is significant) is not true for any of the features.
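Now there are 40 features, and each can be wrongly concluded to be significant with probability α = 0.05.
Thus we can expect about 40 x 0.05 features to show false significance. By linearity of expectation, this holds whether or not the 40 tests are independent of one another.
Thus the expected number is: 40 x α = 40 x 0.05 = 2 (Ans.)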
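If you want to check the arithmetic empirically, here is a minimal simulation sketch. It assumes that, when a feature is truly unrelated to the target, its p-value is uniformly distributed on [0, 1] (the standard behaviour of a correctly specified test), and for simplicity it treats the 40 tests as independent. The function names, the seed, and the number of trials are illustrative choices, not part of the original question.

```python
import random


def count_false_positives(n_features, alpha, rng):
    """Count how many null features look 'significant' in one experiment.

    Assumption: under the null hypothesis each p-value is Uniform(0, 1),
    so a feature is falsely flagged with probability alpha.
    """
    return sum(rng.random() < alpha for _ in range(n_features))


def average_false_positives(n_trials=20000, n_features=40, alpha=0.05, seed=0):
    """Average the false-positive count over many simulated experiments."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = sum(
        count_false_positives(n_features, alpha, rng) for _ in range(n_trials)
    )
    return total / n_trials


expected = 40 * 0.05                   # analytic answer: n_features * alpha
simulated = average_false_positives()  # empirical average over many experiments

print("expected:", expected)
print("simulated:", simulated)
```

The simulated average should land close to 2. Any single experiment can still show 0, 1, 3, or more falsely significant features; 2 is only the long-run average.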