The goals of data screening are as follows.
- Accuracy of data entry- We have to cross-check
whether the data were entered correctly or whether there are typing
errors, collection errors, or similar mistakes.
- Dealing with missing data- We have to identify
the values that are missing from the collected data and then
analyze whether their number and their effect are significant. If
they are not significant, we can proceed with further computations.
Otherwise, we have to arrange to collect or estimate those values
(all or some of them, as far as possible and as required) using
different approaches.
- Handling outliers- By examining the gathered data
as a whole, we have to notice whether there are any outliers and,
if possible, cross-check them. If cross-checking is not possible,
we have to assess the effect of those outliers on the overall data
and, if required, exclude them from further computations.
- Test of assumptions- Assumptions made earlier,
such as normality, linearity, uniformity, symmetry and others, have
to be checked while data screening is performed.
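
As a rough numeric companion to the assumption checks listed above, the sketch below computes skewness and excess kurtosis, both of which are 0 for a normal distribution and close to 0 for symmetric, normal-like samples. The function name and the sample values are hypothetical, and the moment-based formulas are one common choice; this is not a substitute for plots or formal tests.

```python
import statistics

def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Made-up sample with one large value, so the skewness comes out positive.
sample = [1, 1, 1, 2, 10]
skew, kurt = skewness_and_excess_kurtosis(sample)
print(f"skewness={skew:.2f}, excess kurtosis={kurt:.2f}")
```

A clearly positive or negative skewness suggests that the symmetry (and hence normality) assumption deserves a closer look.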
Errors in data entry-
We have to cross-check the data after entering them, so that errors
in data entry can be avoided (or reduced). Furthermore, whenever an
outlying value is observed, we have to take special care to
cross-check whether it was entered correctly.
Outliers-
Outliers in a set of data can be identified by simply inspecting
the data values or through plots such as a scatter plot, a
histogram, and others. For these we have to check whether
- these occurred due to a data entry error
- these are cases which are not part of the population at all
- these are real cases which are genuinely different from the
others
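
Before deciding which of the explanations above applies, candidate outliers first have to be flagged. Besides inspecting plots, a simple numeric screen can be used. The sketch below applies the common 1.5 x IQR rule (the rule behind box-plot whiskers), which is an assumption of this example and not a method named in these notes; the function name and data are hypothetical.

```python
import statistics

def flag_outliers_iqr(values, k=1.5):
    """Return (index, value) pairs lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]

# Made-up measurements; the last value looks like a possible entry error.
measurements = [10, 12, 11, 13, 12, 11, 95]
print(flag_outliers_iqr(measurements))
```

Each flagged value is only a candidate: it still has to be cross-checked against the source records before it is corrected, kept, or excluded.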
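For each outlier we have to analyze its leverage (how far the case
lies from the others), its discrepancy (how far it lies from the
overall pattern of the data) and its influence (how much it changes
the results) on the data set.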
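
To make leverage, discrepancy and influence concrete, the sketch below computes one common measure of each for a simple one-predictor linear regression: the hat value for leverage, the internally studentized residual for discrepancy, and Cook's distance for influence. The choice of a regression setting, these particular measures, the function name and the data are all assumptions of this example.

```python
import math

def regression_diagnostics(x, y):
    """Per-case leverage, discrepancy and influence for the fit y = a + b*x."""
    n = len(x)
    p = 2  # number of fitted parameters (intercept and slope)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance
    rows = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - x_mean) ** 2 / sxx   # leverage (hat value)
        r = e / math.sqrt(s2 * (1 - h))        # studentized residual
        d = (r ** 2 / p) * h / (1 - h)         # Cook's distance
        rows.append({"leverage": h, "discrepancy": r, "influence": d})
    return rows

# Made-up data: the last case lies far from the others on x.
x = [1, 2, 3, 4, 5, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 8.0]
for i, row in enumerate(regression_diagnostics(x, y)):
    print(i, {name: round(value, 3) for name, value in row.items()})
```

In this made-up data set the last case has by far the largest leverage and Cook's distance, so it is the one whose inclusion or exclusion would change the fitted line the most.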
Missing data-
We have to note the missing data and check whether they are missing
at random or not. By creating two groups, one with missing data and
the other without, we can perform a t-test to examine whether there
is any difference between the groups. If the difference is
significant, we have to proceed through any of the following
processes.
- Cases or variables with missing data may be deleted.
- Missing values may be estimated during analysis. Replacement
can be based on prior knowledge or on the estimated mean (which
does not change the mean but reduces the standard deviation).
- Missing values may be estimated using a regression approach
(though it is time consuming).
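
The remark above that mean replacement keeps the mean but shrinks the standard deviation can be checked on a small made-up sample. In this sketch missing values are represented by None, and the function name is hypothetical.

```python
import statistics

def impute_mean(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Made-up scores with two missing entries.
scores = [4.0, None, 6.0, 8.0, None, 10.0]
observed = [v for v in scores if v is not None]
filled = impute_mean(scores)

print(statistics.mean(observed), statistics.mean(filled))    # both 7.0
print(statistics.stdev(observed), statistics.stdev(filled))  # about 2.58, then 2.0
```

The filled-in values sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing the number of cases, which is why the standard deviation drops.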
After reconstructing the data set, we have to perform the analysis
again.
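
The randomness check described in this section (splitting the cases by whether a value is missing and comparing the two groups with a t-test) can be sketched as follows. The example assumes that the groups are compared on another, fully observed variable, and it uses Welch's unequal-variance form of the t statistic; the variable names and records are hypothetical.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    va = statistics.variance(group_a) / na  # squared standard error, group a
    vb = statistics.variance(group_b) / nb  # squared standard error, group b
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Made-up records of (age, income); income is None where it is missing.
records = [(25, 30.0), (31, None), (38, 42.0), (45, None),
           (52, 55.0), (29, 33.0), (60, None), (41, 47.0)]
age_missing = [age for age, income in records if income is None]
age_present = [age for age, income in records if income is not None]

t_stat, df = welch_t(age_missing, age_present)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

The statistic still has to be compared with a critical value of the t distribution for the computed degrees of freedom; a library routine such as `scipy.stats.ttest_ind` with `equal_var=False` returns the p-value directly.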