Analysis of Categorical Data
These three problems are from an undergraduate Analysis of Categorical Data course, and I would like a solution with explanations. I have attached the task, and the data files (CSV) are available through the Dropbox links below. The task must be done in R.
https://www.dropbox.com/s/y9hr6ubkqmapfs2/SAheart.csv?dl=0
https://www.dropbox.com/s/i8moqoqdmk8vqc0/drugs.csv?dl=0
https://www.dropbox.com/s/ecrmar40fanp4e1/cancer.csv?dl=0
All code should be in the appendix. You may include tables, numbers, etc., but code should not be in the body of the report.
Separate problems 1, 2, and 3 into separate reports, but turn them in stapled together (in order). Every group must complete all three problems.
Problem 1: Coronary Heart Disease
In the file SAheart.csv on Smartsite, you will find a dataset on the occurrence of Coronary Heart Disease (CHD) in South Africa. This dataset has the following columns:
Column 1: sdp: Systolic blood pressure
Column 2: tobacco: Cumulative tobacco (in kg).
Column 3: ldl: Low density lipoprotein cholesterol.
Column 4: adiposity: Amount of fat found in adipose tissue.
Column 5: famhist: An indicator variable, where 1 indicates they do have a family history of heart disease, and 0 indicates they do not.
Column 6: typea: A measure of type A personality, where the higher the measure the more likely it is the person has a type A personality.
Column 7: obesity: An obesity measure, where the higher the measure the more obese the person is.
Column 8: alcohol: An indicator variable, where 1 indicates they do consume alcohol, 0 indicates they do not.
Column 9: age: The age of the subject in years.
Column 10: chd: Your response variable. chd = 1 indicates a person did have Coronary Heart Disease, 0 indicates they did not.
For this problem, perform the following tasks:
1. Model selection using lasso and AUC, as well as model selection using BIC.
2. For the model using BIC, interpretation of coefficients, including confidence intervals.
3. For both models, goodness of fit tests, identification of outliers/leverage points.
4. For both models, predictive power assessment.
5. Select the “best” model if the goal is to have a simple model. Use the above parts to help make your decision, and/or perform a model comparison test.
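Tasks 1 and 4 above rely on two quantities: BIC for model selection and AUC for predictive power. The following is a minimal sketch of how each is computed, written in Python only to show the arithmetic (the assignment itself calls for R). The function names `auc` and `bic` and the toy labels and scores are illustrative assumptions, not values from SAheart.csv.

```python
import math

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the share of (positive, negative) pairs in which the positive
    case receives the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bic(log_lik, n_params, n_obs):
    """BIC = k * ln(n) - 2 * log-likelihood; smaller values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_lik

# Toy data (hypothetical, not taken from SAheart.csv)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_labels, toy_scores))  # 3 of the 4 pairs are ranked correctly
print(bic(log_lik=-10.0, n_params=2, n_obs=100))
```

Because the BIC penalty grows with ln(n), it tends to favor smaller models than AIC does, which is relevant when the stated goal is a simple model.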
Problem 2: Drug use
The next dataset is taken from Hosmer and Lemeshow (2000) and is copyrighted by John Wiley & Sons Inc.
The text file drugs.csv under Resources contains data on drug use in high risk drug patients submitted to a facility for rehabilitation. The data has the following columns:
Column 1: age: The age of the subject upon entry.
Column 2: beck: The Beck depression score of the subject (continuous).
Column 3: ivhx: A factor variable indicating whether they never used IV drugs (1), previously used IV drugs (2), or recently used IV drugs (3).
Column 4: drugtreat: The number of prior drug treatments.
Column 5: race: The race of the subject, where 0 = white, 1 = other.
Column 6: treat: Length of treatment assignment (1 = long, 0 = short).
Column 7: site: The treatment location (1 = site B, 0 = site A).
Column 8: drug: Your response variable. 1 if they remained drug free for 12 months, 0 otherwise.
For this problem, perform the following tasks:
1. Model selection using both AIC and BIC.
2. For the model using BIC, interpretation of coefficients, including confidence intervals.
3. For both models, goodness of fit tests, identification of outliers/leverage points.
4. For both models, predictive power assessment.
5. Select the best model if the goal is to have a model that is good at prediction. Use the above parts to help make your decision, and/or perform a model comparison test.
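For task 2, a logistic regression coefficient is usually interpreted on the odds scale: exponentiating the estimate gives an odds ratio, and exponentiating the interval endpoints gives its confidence interval. Below is a minimal sketch of that arithmetic using a Wald interval, again in Python for illustration only (the assignment requires R). The helper names `odds_ratio_ci` and `aic`, and the coefficient and standard error used, are hypothetical and not estimates from drugs.csv.

```python
import math
from statistics import NormalDist

def odds_ratio_ci(beta, se, level=0.95):
    """Odds ratio exp(beta) with a Wald interval exp(beta +/- z * se)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def aic(log_lik, n_params):
    """AIC = 2k - 2 * log-likelihood; smaller values are preferred."""
    return 2.0 * n_params - 2.0 * log_lik

# Hypothetical coefficient and standard error (not fitted to drugs.csv)
or_hat, lower, upper = odds_ratio_ci(beta=0.5, se=0.2)
print(round(or_hat, 3), round(lower, 3), round(upper, 3))
print(aic(log_lik=-10.0, n_params=2))
```

An interval that excludes 1 suggests the predictor is associated with the odds of remaining drug free at the chosen level. Other interval types, such as profile likelihood intervals, can give slightly different limits.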
Problem 3: Cancer in Cities in China
The text file cancer.csv under Resources contains data on lung cancer and smoking status in two Chinese cities.
The columns of the data are:
Column 1: Freq: The count for each row.
Column 2: City: With levels Beijing and Shanghai.
Column 3: lung: yes indicates lung cancer, no indicates no lung cancer.
Column 4: smoking: Smoker indicates they are a smoker, NonSmoker indicates they are not a smoker.
The goal is to understand how the city and smoking status may affect lung cancer. Find the best fitting model, describe what that model suggests about the relationship between the three variables, and use it to estimate relevant odds ratios. Be sure to include goodness of fit measures, and interpretation when possible.
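Two calculations recur when working with a three-way table like this one: the odds ratio of a 2x2 slice (for example, smoking by lung cancer within one city) and the likelihood-ratio statistic G^2 that compares observed counts with a model's fitted counts. The sketch below shows both in Python for illustration only (the assignment requires R). The function names and all counts are hypothetical and are not taken from cancer.csv.

```python
import math

def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

def g_squared(observed, fitted):
    """Likelihood-ratio statistic G^2 = 2 * sum(O * ln(O / E))."""
    return 2.0 * sum(o * math.log(o / f)
                     for o, f in zip(observed, fitted) if o > 0)

# Hypothetical counts for one city (NOT taken from cancer.csv):
# rows = Smoker / NonSmoker, columns = lung yes / lung no
print(odds_ratio(40, 60, 10, 90))     # (40 * 90) / (60 * 10)
print(g_squared([10, 30], [20, 20]))  # observed vs. fitted cell counts
```

If the selected log-linear model has no three-way interaction, its fitted smoking by lung cancer odds ratio is the same in both cities, so a single estimate (with its interval) summarizes that association.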

