STA 138 Fall 2015

Homework 6 - Due Wednesday, November 18th

1. Use the flu.csv dataset, as we did in Homework 4 and 5.
   (a) Find the best model using forward-backward subset selection using AIC, and report the best fitting model.
   (b) Find the best model using forward-backward subset selection using BIC, and report the best fitting model.
   (c) Report the AIC and BIC from the model in (b).
   (d) Using the model from (b), estimate the probability that a female aged 54 with an awareness score of 83 would get a flu shot (notice you may not have to use all of the information given based on your model).

2. Continue with problem 1, and use the model found in problem 1, part (b).
   (a) Use the Hosmer-Lemeshow goodness of fit test with g = 8 to test how well our model is fitting. State the null and alternative hypotheses, the value of the test statistic, the p-value, and your conclusion.
   (b) Plot a histogram of the standardized residuals. Does it appear that the assumption that they are standard normal holds? Why or why not?
   (c) Are any values of the standardized residuals larger than 3? If so, identify what combination of X variables it was for.
   (d) Which observation(s) most influenced the change in the coefficients (had the largest DFbeta)? List the observation and the corresponding values of the predictors.

3. Continue with problem 1, and use the model found in problem 1, part (b).
   (a) Find the value of AUC, the 95% confidence interval for AUC, and plot the ROC.
   (b) Does this value of AUC suggest that the model has fit the data well? Explain your answer.
   (c) Fit the full model (including all predictors) and repeat (a) for the full model.
   (d) What does (c) suggest about AUC and adding predictors, if anything?

4. Online you will find an expanded dataset largework.csv. It has the following columns:
   Column 1. gender: 1 indicates the subject was male, 0 indicates female.
   Column 2. age: the age of the subject.
   Column 3. marriage: with levels 1 = married, 2 = widowed, 3 = divorced, 5 = never married.
   Column 4. min: minutes of sedentary activity per week.
   Column 5. chol: total cholesterol.
   Column 6. sysbp: systolic blood pressure measurement.
   Column 7. height: height of the subject.
   Column 8. y: 1 indicates the subject was obese, 0 otherwise.
   Again, assume our response variable is obese.
   (a) Display the model formula for the "best" model using forward subset selection and BIC.
   (b) Display the model formula for the "best" model using backward subset selection and BIC.
   (c) Display the model formula for the "best" model using backward-forward subset selection and BIC.
   (d) Display the model formula for the "best" model using forward-backward subset selection and BIC.

5. Continue with problem 4.
   (a) Display the model formula for the "best" model using all subset selection and BIC.
   (b) Display the model formula for the "best" model using all subset selection and AIC.
   (c) For the best model in (a), find the value of AUC, the 95% confidence interval for AUC, and plot the ROC.
   (d) For the best model in (b), find the value of AUC, the 95% confidence interval for AUC, and plot the ROC.

6. Continue with problem 4, but remove the column for marriage. This can be done by the following (assuming you called your data largework):
       they = as.factor(largework$y)
       thex = as.matrix(largework[,-c(3,8)])
   (a) Using the lasso penalty, find the best model according to AUC. Write down the estimated logistic regression model.
   (b) Using the ridge penalty, find the best model according to AUC. Write down the estimated logistic regression model.
   (c) What do you think explains the difference in the models chosen here, compared to the models selected in 5 (a) and (b)?
   (d) If we had used either AIC or BIC, do you think the models would have been larger or smaller than the ones chosen in (a) and (b) of this problem?
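For reference, the selection criteria named in problems 1, 4, and 5 and the fitted probability asked for in problem 1(d) can be written out as below. This is a sketch in standard notation that the assignment itself does not define: \hat{L} (the maximized likelihood), k (the number of estimated coefficients, intercept included), n (the number of observations), and \hat\beta_j (the fitted coefficients) are all assumed symbols.

```latex
% Assumed notation (not from the assignment):
%   \hat{L}      maximized likelihood of the fitted logistic model
%   k            number of estimated coefficients, including the intercept
%   n            number of observations
%   \hat\beta_j  fitted coefficient of predictor x_j
\begin{align*}
\mathrm{AIC} &= -2\log\hat{L} + 2k, \\
\mathrm{BIC} &= -2\log\hat{L} + k\log n, \\
\hat{\pi}(x) &= \frac{\exp\bigl(\hat\beta_0 + \sum_j \hat\beta_j x_j\bigr)}
                     {1 + \exp\bigl(\hat\beta_0 + \sum_j \hat\beta_j x_j\bigr)}.
\end{align*}
```

Here log is the natural logarithm, so log n exceeds 2 once n is at least 8, and on any dataset of that size BIC charges more per coefficient than AIC.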
