1) When we fit a model to data, which is typically larger?
a) Test Error
b) Training Error
2) What are reasons why test error could be LESS than training error? (Pick all that apply)
a) By chance, the test set has easier cases than the training set.
b) The model is highly complex, so training error systematically overestimates test error.
c) The model is not very complex, so training error systematically overestimates test error.
3) Suppose we want to use cross-validation to estimate the error of the following procedure:
Step 1: Find the k variables most correlated with y
Step 2: Fit a linear regression using those variables as predictors
We will estimate the error for each k from 1 to p, and then choose the best k.
True or false: a correct cross-validation procedure will possibly choose a different set of k variables for every fold.
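To make the procedure in question 3 concrete, here is a minimal sketch of the CORRECT cross-validation, with the screening step repeated inside every training fold (Python with numpy and scikit-learn assumed; X and y are placeholder data arrays):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cv_error_for_k(X, y, k, n_splits=10):
    """Estimate test MSE for: screen top-k variables, then fit OLS.

    The screening step runs inside each training fold, so different
    folds are free to select different sets of k variables.
    """
    fold_errors = []
    for train_idx, test_idx in KFold(n_splits, shuffle=True).split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_te, y_te = X[test_idx], y[test_idx]
        # Step 1 (inside the fold): k variables most correlated with y.
        corr = np.array([abs(np.corrcoef(X_tr[:, j], y_tr)[0, 1])
                         for j in range(X_tr.shape[1])])
        top_k = np.argsort(corr)[-k:]
        # Step 2: linear regression on the selected variables only.
        model = LinearRegression().fit(X_tr[:, top_k], y_tr)
        pred = model.predict(X_te[:, top_k])
        fold_errors.append(np.mean((y_te - pred) ** 2))
    return np.mean(fold_errors)
```

Running this for each k from 1 to p and taking the k with the smallest estimate completes the procedure; note that nothing in it forces the folds to screen the same k variables.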
4) Suppose that we perform forward stepwise regression and use cross-validation to choose the best model size.
Using the full data set to choose the sequence of models is the WRONG way to do cross-validation (we need to redo the model selection step within each training fold). If we do cross-validation the WRONG way, which of the following is true?
a) The selected model will probably be too complex
b) The selected model will probably be too simple
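By contrast, the WRONG way in question 4 fixes the sequence of models on the full data before cross-validating, so every fold's "test" points have already influenced the selection. A sketch of that mistake, using scikit-learn's SequentialFeatureSelector as a stand-in for forward stepwise regression:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def wrong_way_cv(X, y, sizes):
    """WRONG: forward selection sees ALL of the data first, so the
    later cross-validation is contaminated by that peek."""
    scores = {}
    for m in sizes:
        # Model sequence chosen on the full X, y -- the information leak.
        sfs = SequentialFeatureSelector(
            LinearRegression(), n_features_to_select=m).fit(X, y)
        X_sel = sfs.transform(X)
        # CV is then run on the already-selected variables.
        scores[m] = -cross_val_score(
            LinearRegression(), X_sel, y,
            scoring="neg_mean_squared_error", cv=10).mean()
    return min(scores, key=scores.get)  # size with smallest CV error
```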
5) One way of carrying out the bootstrap is to average equally over all possible bootstrap samples from the original data set (where two bootstrap data sets are considered different if they contain the same data points but in a different order). Unlike the usual implementation of the bootstrap, this method has the advantage of not introducing extra noise from random resampling. (You can use "^" to denote power, as in "n^2")
To carry out this implementation on a data set with n data points, how many bootstrap data sets would we need to average over?
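For reference, since order matters here, each of the n draws can independently be any of the n original points, so the count is n * n * ... * n (n factors), i.e. n^n bootstrap data sets.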
6) If we have n data points, what is the probability that a given data point does not appear in a bootstrap sample?
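Each single draw misses a given point with probability (n-1)/n, and the n draws are independent, so the probability the point never appears is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 for large n. A quick numerical check (a Python sketch assuming numpy; n = 100 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# 100,000 bootstrap samples of size n, drawn as rows of indices.
samples = rng.integers(0, n, size=(100_000, n))
# Fraction of samples in which point 0 never appears.
print((samples != 0).all(axis=1).mean())  # empirical, about 0.366
print((1 - 1 / n) ** n)                   # exact: 0.3660...
```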
7) If we use ten-fold cross-validation as a means of model selection, the cross-validation estimate of test error is:
a) biased upward
b) biased downward
c) unbiased
d) potentially any of the above
8) Why can't we use the standard bootstrap for some time series data? (Pick all that apply)
a) The data points in most time series aren't i.i.d.
b) Some points will be used twice in the same sample
c) The standard bootstrap doesn't accurately mimic the real-world data-generating mechanism
Answer to question 7: (d).
If we use ten-fold cross-validation as a means of model selection, the cross-validation estimate of test error can be biased upward, biased downward, or unbiased, depending on which effect dominates.
There are competing biases. On one hand, the cross-validated estimate is based on models trained on smaller training sets than the full data set (nine tenths of it in each fold), so we will tend to overestimate the test error of the model fit on the full data.
On the other hand, cross-validation gives a noisy estimate of test error for each candidate model, and we select the model with the best estimate. This makes us more likely to choose a model whose estimate happens to be smaller than its true test error, so we may underestimate test error. In any given case, either source of bias may dominate the other.
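The downward pull from selection can be seen in a tiny simulation (a Python sketch, assuming ten equally good candidate models whose CV estimates are the true error plus artificial noise):

```python
import numpy as np

rng = np.random.default_rng(0)
true_error = 1.0                      # every candidate's true test error
n_models, n_repeats = 10, 100_000
# Noisy CV estimates: true error plus independent noise per model.
estimates = true_error + 0.1 * rng.standard_normal((n_repeats, n_models))
# Picking the best-looking model biases the reported error downward.
print(estimates.min(axis=1).mean())   # about 0.85, below the true 1.0
```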