We have used the 1987 baseball salary data to illustrate linear
regression. In this project, we consider
the 1992 baseball salary data set, which is available from
http://www.amstat.org/publications/jse/datasets/baseball.dat.txt
This data set (of dimension 337 × 18) contains salary
information and performance measures for
337 Major League Baseball players in 1992. More detailed
information can be found at
http://www.amstat.org/publications/jse/datasets/baseball.txt
The data set contains the following variables.
Table 1: Variable Description for the 1992 Baseball Salary Data

Var      Columns   Description
salary   1 – 4     Salary (in thousands of dollars)
X1       6 – 10    Batting average
X2       12 – 16   On-base percentage (OBP)
X3       18 – 20   Number of runs
X4       22 – 24   Number of hits
X5       26 – 27   Number of doubles
X6       29 – 30   Number of triples
X7       32 – 33   Number of home runs
X8       35 – 37   Number of runs batted in (RBI)
X9       39 – 41   Number of walks
X10      43 – 45   Number of strike-outs
X11      47 – 48   Number of stolen bases
X12      50 – 51   Number of errors
X13      53        Indicator of “free agency eligibility”
X14      55        Indicator of “free agent in 1991/2”
X15      57        Indicator of “arbitration eligibility”
X16      59        Indicator of “arbitration in 1991/2”
ID       61 – 79   Player’s name (in quotation marks)

The data set can be input into R by reading directly from the website, with the following R commands:
baseball <- read.table(
  file = "http://www.amstat.org/publications/jse/datasets/baseball.dat.txt",
  header = FALSE,
  col.names = c("salary", "x1", "x2", "x3", "x4", "x5",
                "x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13",
                "x14", "x15", "x16", "ID"))
# Add the log-transformed salary as a new (19th) column
baseball$logsalary <- log(baseball$salary)
# Remove salary (column 1) and ID (column 18); logsalary is kept
baseball <- baseball[, -c(1, 18)]
dim(baseball)
head(baseball)
Complete the project by following the specific instructions
given below.
1. Starting with the whole model that includes all predictors
(i.e., X1, X2, ..., X16), apply one model selection procedure of
your choice to select your best model. (Use either "Best Subset
Selection" or "Regularization" methods.)
(a) Provide the fitting results from your ‘best’ model, i.e., the
table of Parameter Estimates and the ANOVA table.
(b) Obtain the resultant R² and interpret.
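
One possible way to carry out step 1 with best subset selection is sketched below. It assumes the leaps package is installed and uses BIC as the selection criterion; both are choices made here for illustration, not requirements of the project. If the data frame baseball has not been created by the commands above, the sketch falls back on simulated stand-in data with the same column names, in which case the printed numbers carry no meaning.

```r
library(leaps)  # assumption: the 'leaps' package is installed

# Fallback only: simulated stand-in data with the same layout as `baseball`
# (16 predictors x1..x16 plus logsalary). These are NOT the real salaries.
if (!exists("baseball")) {
  set.seed(1)
  n <- 337
  baseball <- as.data.frame(matrix(rnorm(n * 16), nrow = n, ncol = 16))
  names(baseball) <- paste0("x", 1:16)
  baseball$logsalary <- 6 + 0.8 * baseball$x8 + 0.5 * baseball$x13 +
    rnorm(n, sd = 0.5)
}

# Exhaustive best subset search over models with 1 to 16 predictors
subsets <- regsubsets(logsalary ~ ., data = baseball, nvmax = 16)
subset_summary <- summary(subsets)

# Pick the model size with the smallest BIC (an illustrative choice;
# adjusted R-squared or Mallows' Cp are also reported in subset_summary)
best_size <- which.min(subset_summary$bic)
selected <- names(coef(subsets, id = best_size))[-1]  # drop "(Intercept)"

# Refit the selected model with lm() to obtain the usual output
best_formula <- reformulate(selected, response = "logsalary")
best_fit <- lm(best_formula, data = baseball)

# (a) Parameter estimates and ANOVA tables
print(summary(best_fit)$coefficients)
print(anova(best_fit))  # sequential sums of squares, one row per term
null_fit <- lm(logsalary ~ 1, data = baseball)
print(anova(null_fit, best_fit))  # overall F test against intercept-only model

# (b) R-squared of the selected model
r2 <- summary(best_fit)$r.squared
print(r2)
```

Because the response is log salary, the R² from this fit is the proportion of variation in log salary (not salary in dollars) accounted for by the selected predictors.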