When training a machine learning model with some dataset, what are some assumptions we are making about the data? What are some things that it is important for us not to assume? Please give a few examples for each.
Checking model assumptions is essential prior to building a model that will be used for prediction. If assumptions are not met, the model may inaccurately reflect the data and will likely result in inaccurate predictions. Each model has different assumptions that must be met, so checking assumptions is important both in choosing a model and in verifying that it is the appropriate model to use.
Diagnostics
Diagnostics are used to evaluate the model assumptions and to find out whether any observations have a large, undue influence on the analysis. They help confirm that the model you use is actually appropriate for the data you are analyzing. Diagnostics is an overarching name that covers the other topics under model assumptions. It may include exploring the model's basic statistical assumptions, examining the structure of the model by considering more, fewer, or different explanatory variables, or looking for observations that the model represents poorly (outliers) or that have a disproportionate effect on the regression model's predictions (influential points).
Diagnostics can take many forms. Some are numerical. The statsmodels package reports many of them at once through the summary() method of a fitted model.
With this summary, we can see important values such as R-squared (R²), the F-statistic, and many others. You can also analyze a model with a graphical diagnostic, such as a plot of the residuals against the fitted (predicted) values.
In the residuals-versus-fitted plot for our weight-height dataset, using height as the predictor, the points look random for the most part. However, as the fitted values increase, so does the range of the residuals. This means that the model's errors are more variable at the high end of its predictions. The residuals also tend to be more negative there. This does not mean that a linear model is incorrect, but it is something to investigate, and it may point to a way to change or improve the model.
Another residual plot you can make is a scale-location plot. This plot shows whether the residuals are spread equally along the range of fitted values. Residuals that have the same finite variance everywhere are called homoscedastic. To build the plot, you plot the square root of the absolute standardized residuals against the fitted values. Randomly spread points around a roughly horizontal trend indicate that the constant-variance assumption is reasonable.
In this plot, we want randomly scattered points in a horizontal band. This would indicate that the data is homoscedastic, meaning the spread of the errors is roughly equal across the range of the independent variables. Our trend line is mostly horizontal at the beginning but seems to slope upwards near the end, meaning that the variance may not be equal everywhere. This is likely the same issue we saw in the residuals-versus-fitted plot, and it is another indicator that something may need to be changed in our model.
When fitting a regression model, you want to make sure that your residuals are relatively random. If they are not, the form of regression you chose may not be correct. For example, if you fit a linear regression and the residual plot shows a clear pattern, such as a curve, that indicates the relationship between the predictor and the response is probably not linear.