GitHub Repo: https://github.com/mccurcio/Logistic_with_R ## 1. Executive Summary ## 2. Experimental Results ```{r LoadData, message=FALSE, warning=FALSE, include=FALSE} # DO NOT SHOW IN FINAL REPORT options(tinytex.verbose = FALSE) options(digits = 5) #Load Libraries Libraries <- c("knitr", "readr", "aod") for (p in Libraries) { library(p, character.only = TRUE) } #Load Data df <- read_csv("CHD_preprocessed.csv", col_types = cols(male = col_factor(levels = c("0", "1")), education = col_factor(levels = c("0", "1")), currentSmoker = col_skip(), BPMeds = col_skip(), prevalentStroke = col_factor(levels = c("0", "1")), prevalentHyp = col_factor(levels = c("0", "1")), diabetes = col_factor(levels = c("0", "1")), TenYearCHD = col_factor(levels = c("0", "1")))) ``` ### 2.1 Logistic Regression Model ```{r cache=TRUE} mylogit <- glm(TenYearCHD ~ male + age + education + cigsPerDay + prevalentStroke + prevalentHyp + diabetes + totChol + sysBP + diaBP + BMI + heartRate + glucose, data = df, family = "binomial") summary(mylogit) ``` - **7 most significant variables** - Seven predictors have $\small \alpha < 0.05$. They are significant and associated with acquiring cardiovascular disease. | Rank | Risk Factor | | :--- | :-------------------------- | | 1 | Prevalence of Stroke1 | | 2 | Male1 | | 3 | Prevalence of Hypertension1 | | 4 | Age | | 5 | Cigarettes Per Day | | 6 | Systolic Blood Pressure | | 7 | Glucose |
### 2.2 Wald Test: Do The Seven Factors Fit Our Model - The Wald Chi-Square Test can help determine if our proposed model is valuable and significant. - The Wald test generates a *P-value << 0.001*. - Therefore, we conclude the seven (7) parameters are significant and useful in describing cardiovascular disease. ```{r cache=TRUE} wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = c(2, 3, 5, 6, 7, 10, 14)) ```
### 2.3 Determination of Odds for Seven Variables - We can calculate the odds of acquiring cardiovascular disease for each of the seven variables. - By holding all other values constant we create a dataframe that investigates the odds given Prevalence of Stroke, for example. ```{r Stroke, cache=TRUE} strok_test <- with(df, data.frame(male = "0", age = mean(age), education = "0", cigsPerDay = 0, # Non-smoker prevalentHyp = "0", diabetes = "0", totChol = mean(totChol), sysBP = mean(sysBP), diaBP = mean(diaBP), BMI = mean(BMI), heartRate = mean(heartRate), glucose = mean(glucose), prevalentStroke = c("0", "1")) ) # Convert prevalentStroke from Numeric to FACTOR strok_test$prevalentStroke <- as.factor(strok_test$prevalentStroke) # str(strok_test) strok_test$prevalentStroke <- predict(mylogit, newdata = strok_test, type = "response") #strok_test$prevalentStroke ``` ### 2.4 **Odds Given Prevalence Of Stroke In family history.** 1. WITH Prevalence of Stroke: `r strok_test$prevalentStroke[[2]]` 2. NO Prevalence of Stroke: `r strok_test$prevalentStroke[[1]]` - **Odds = `r strok_test$prevalentStroke[[2]]/strok_test$prevalentStroke[[1]]`**
### 2.5 **Odds Given For Male Vs Female** ```{r MF, cache=TRUE} male_test <- with(df, data.frame(male = c("0","1"), # Factor of Interest age = mean(age), education = "0", cigsPerDay = 0, prevalentHyp = "0", diabetes = "0", totChol = mean(totChol), sysBP = mean(sysBP), diaBP = mean(diaBP), BMI = mean(BMI), heartRate = mean(heartRate), glucose = mean(glucose), prevalentStroke = "0")) # REMEMBER convert male_test from numeric to FACTOR male_test$male <- as.factor(male_test$male) male_test$male <- predict(mylogit, newdata = male_test, type = "response") ``` 1. Males: `r male_test$male[[2]]` 2. Female: `r male_test$male[[1]]` - **Odds = `r male_test$male[[2]]/male_test$male[[1]]`**
### 2.6 Odds Prevalence of Hypertension In Family History ```{r hyperT, cache=TRUE} hyperT_test <- with(df, data.frame(male = "0", age = mean(age), education = "0", cigsPerDay = 0, prevalentHyp = c("0","1"), # Factor of Interest diabetes = "0", totChol = mean(totChol), sysBP = mean(sysBP), diaBP = mean(diaBP), BMI = mean(BMI), heartRate = mean(heartRate), glucose = mean(glucose), prevalentStroke = "0")) # REMEMBER convert male_test from numeric to FACTOR hyperT_test$prevalentHyp <- as.factor(hyperT_test$prevalentHyp) hyperT_test$prevalentHyp <- predict(mylogit, newdata = hyperT_test, type = "response") ``` 1. WITH Prevalence of Hypertension: `r hyperT_test$prevalentHyp[[2]]` 2. NO Prevalence of Hypertension: `r hyperT_test$prevalentHyp[[1]]` - **Odds = `r hyperT_test$prevalentHyp[[2]]/hyperT_test$prevalentHyp[[1]]`**
### 2.7 Odds Given Age ```{r Age, cache=TRUE} age_test <- with(df, data.frame(male = "0", age = c(20,30,40,50,60,70,80), education = "0", cigsPerDay = 0, prevalentHyp = "0", diabetes = "0", totChol = mean(totChol), sysBP = mean(sysBP), diaBP = mean(diaBP), BMI = mean(BMI), heartRate = mean(heartRate), glucose = mean(glucose), prevalentStroke = "0")) age_test$age <- predict(mylogit, newdata = age_test, type = "response") ``` | Age (years) | Probability Given Age | Odds Compared to 20 yr old | | ----------: | :-------------------: | :-------------------------------------: | | 20 | `r age_test$age[[1]]` | `r age_test$age[[1]]/age_test$age[[1]]` | | 30 | `r age_test$age[[2]]` | `r age_test$age[[2]]/age_test$age[[1]]` | | 40 | `r age_test$age[[3]]` | `r age_test$age[[3]]/age_test$age[[1]]` | | 50 | `r age_test$age[[4]]` | `r age_test$age[[4]]/age_test$age[[1]]` | | 60 | `r age_test$age[[5]]` | `r age_test$age[[5]]/age_test$age[[1]]` | | 70 | `r age_test$age[[6]]` | `r age_test$age[[6]]/age_test$age[[1]]` | | 80 | `r age_test$age[[7]]` | `r age_test$age[[7]]/age_test$age[[1]]` |
### 2.8 Odds Given Number Of Cigarettes Per Day ```{r Cigarettes, cache=TRUE} cigs_per_day <- with(df, data.frame(male = "0", age = mean(age), education = "0", cigsPerDay = c(0,10,20,30,40), prevalentHyp = "0", diabetes = "0", totChol = mean(totChol), sysBP = mean(sysBP), diaBP = mean(diaBP), BMI = mean(BMI), heartRate = mean(heartRate), glucose = mean(glucose), prevalentStroke = "0")) cigs_per_day$cigsPerDay <- predict(mylogit, newdata = cigs_per_day, type = "response") #cigs_per_day$cigsPerDay ``` | Age (years) | Probability Given Age | Odds Compared to Zero Cigarettes Per Day | | ----------: | :------------------------------: | :-----------------------------------------------------------: | | 0 | `r cigs_per_day$cigsPerDay[[1]]` | `r cigs_per_day$cigsPerDay[[1]]/cigs_per_day$cigsPerDay[[1]]` | | 10 | `r cigs_per_day$cigsPerDay[[2]]` | `r cigs_per_day$cigsPerDay[[2]]/cigs_per_day$cigsPerDay[[1]]` | | 20 | `r cigs_per_day$cigsPerDay[[3]]` | `r cigs_per_day$cigsPerDay[[3]]/cigs_per_day$cigsPerDay[[1]]` | | 30 | `r cigs_per_day$cigsPerDay[[4]]` | `r cigs_per_day$cigsPerDay[[4]]/cigs_per_day$cigsPerDay[[1]]` | | 40 | `r cigs_per_day$cigsPerDay[[5]]` | `r cigs_per_day$cigsPerDay[[5]]/cigs_per_day$cigsPerDay[[1]]` |
### 2.9 Odds Given Systolic Blood Pressure ```{r BP, cache=TRUE} summary(df$sysBP) # Min. 1st Qu. Median Mean 3rd Qu. Max. # 83.5 117.0 128.0 132.4 144.0 295.0 sysBP_calc <- with(df, data.frame(male = "0", age = mean(age), education = "0", cigsPerDay = 0, prevalentHyp = "0", diabetes = "0", totChol = mean(totChol), sysBP = c(117, 128, 144, 295), diaBP = mean(diaBP), BMI = mean(BMI), heartRate = mean(heartRate), glucose = mean(glucose), prevalentStroke = "0")) sysBP_calc$sysBP <- predict(mylogit, newdata = sysBP_calc, type = "response") #sysBP_calc$sysBP ``` | Systolic BP | Probability Given Systolic BP | Odds Systolic BP | | ----------: | :---------------------------: | :---------------------------------------------: | | 117 | `r sysBP_calc$sysBP[[1]]` | `r sysBP_calc$sysBP[[1]]/sysBP_calc$sysBP[[1]]` | | 128 | `r sysBP_calc$sysBP[[2]]` | `r sysBP_calc$sysBP[[2]]/sysBP_calc$sysBP[[1]]` | | 144 | `r sysBP_calc$sysBP[[3]]` | `r sysBP_calc$sysBP[[3]]/sysBP_calc$sysBP[[1]]` | | Max 295 | `r sysBP_calc$sysBP[[4]]` | `r sysBP_calc$sysBP[[4]]/sysBP_calc$sysBP[[1]]` |
### 2.10 Odds Given Glucose Levels ```{r Glucose, cache=TRUE} summary(df$glucose) # Min. 1st Qu. Median Mean 3rd Qu. Max. # 40 72 80 82 85 394 glucose_calc <- with(df, data.frame(male = "0", age = mean(age), education = "0", cigsPerDay = 0, prevalentHyp = "0", diabetes = "0", totChol = mean(totChol), sysBP = mean(sysBP), diaBP = mean(diaBP), BMI = mean(BMI), heartRate = mean(heartRate), glucose = c(72, 80, 85, 394), prevalentStroke = "0")) glucose_calc$glucose <- predict(mylogit, newdata = glucose_calc, type = "response") # glucose_calc$glucose. # 0.094843 0.100852 0.110194 0.239738 ``` | Glucose | Probabilities | Odds Given Glucose | | -------: | :-----------: | :-------------------: | | 72 | 0.094843 | 1 | | 80 | 0.100852 | `r 0.100852/0.094843` | | 85 | 0.110194 | `r 0.110194/0.094843` | | Max 394 | 0.239738 | `r 0.239738/0.094843` |
## IV. Conclusion 1. #### Notes - For analysis help https://stats.idre.ucla.edu/r/dae/logit-regression/ - For interpretation help https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/ #### Wald test info https://www.mbaskool.com/business-concepts/statistics/6916-wald-test.html https://www.statology.org/wald-test-in-r/ https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faqhow-are-the-likelihood-ratio-wald-and-lagrange-multiplier-score-tests-different-andor-similar/ https://handwiki.org/wiki/Wald_test https://questionerlab.com/what-is-the-use-of-wald-test-in-logistic-regression https://stats.oarc.ucla.edu/r/dae/logit-regression/ https://bookdown.org/mike/data_analysis/wald-test.html https://bookdown.org/mike/data_analysis/hypothesis-testing.html#wald-test - For analysis help https://stats.idre.ucla.edu/r/dae/logit-regression/ - For interpretation help https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/