GitHub Repo: https://github.com/mccurcio/Logistic_with_R
## 1. Executive Summary
## 2. Experimental Results
```{r LoadData, message=FALSE, warning=FALSE, include=FALSE}
# DO NOT SHOW IN FINAL REPORT
options(tinytex.verbose = FALSE)
options(digits = 5)
#Load Libraries
Libraries <- c("knitr", "readr", "aod")
for (p in Libraries) {
library(p, character.only = TRUE)
}
#Load Data
df <- read_csv("CHD_preprocessed.csv",
col_types = cols(male = col_factor(levels = c("0", "1")),
education = col_factor(levels = c("0", "1")),
currentSmoker = col_skip(),
BPMeds = col_skip(),
prevalentStroke = col_factor(levels = c("0", "1")),
prevalentHyp = col_factor(levels = c("0", "1")),
diabetes = col_factor(levels = c("0", "1")),
TenYearCHD = col_factor(levels = c("0", "1"))))
```
### 2.1 Logistic Regression Model
```{r cache=TRUE}
mylogit <- glm(TenYearCHD ~ male + age + education +
cigsPerDay + prevalentStroke + prevalentHyp +
diabetes + totChol + sysBP + diaBP + BMI +
heartRate + glucose,
data = df,
family = "binomial")
summary(mylogit)
```
- **7 most significant variables**
- Seven predictors have $\small \alpha < 0.05$. They are significant and associated with acquiring cardiovascular disease.
| Rank | Risk Factor |
| :--- | :-------------------------- |
| 1 | Prevalence of Stroke1 |
| 2 | Male1 |
| 3 | Prevalence of Hypertension1 |
| 4 | Age |
| 5 | Cigarettes Per Day |
| 6 | Systolic Blood Pressure |
| 7 | Glucose |
### 2.2 Wald Test: Do The Seven Factors Fit Our Model
- The Wald Chi-Square Test can help determine if our proposed model is valuable and significant.
- The Wald test generates a *P-value << 0.001*.
- Therefore, we conclude the seven (7) parameters are significant and useful in describing cardiovascular disease.
```{r cache=TRUE}
wald.test(b = coef(mylogit),
Sigma = vcov(mylogit),
Terms = c(2, 3, 5, 6, 7, 10, 14))
```
### 2.3 Determination of Odds for Seven Variables
- We can calculate the odds of acquiring cardiovascular disease for each of the seven variables.
- By holding all other values constant we create a dataframe that investigates the odds given Prevalence of Stroke, for example.
```{r Stroke, cache=TRUE}
strok_test <- with(df, data.frame(male = "0",
age = mean(age),
education = "0",
cigsPerDay = 0, # Non-smoker
prevalentHyp = "0",
diabetes = "0",
totChol = mean(totChol),
sysBP = mean(sysBP),
diaBP = mean(diaBP),
BMI = mean(BMI),
heartRate = mean(heartRate),
glucose = mean(glucose),
prevalentStroke = c("0", "1"))
)
# Convert prevalentStroke from Numeric to FACTOR
strok_test$prevalentStroke <- as.factor(strok_test$prevalentStroke)
# str(strok_test)
strok_test$prevalentStroke <- predict(mylogit,
newdata = strok_test,
type = "response")
#strok_test$prevalentStroke
```
### 2.4 **Odds Given Prevalence Of Stroke In family history.**
1. WITH Prevalence of Stroke: `r strok_test$prevalentStroke[[2]]`
2. NO Prevalence of Stroke: `r strok_test$prevalentStroke[[1]]`
- **Odds = `r strok_test$prevalentStroke[[2]]/strok_test$prevalentStroke[[1]]`**
### 2.5 **Odds Given For Male Vs Female**
```{r MF, cache=TRUE}
male_test <- with(df, data.frame(male = c("0","1"), # Factor of Interest
age = mean(age),
education = "0",
cigsPerDay = 0,
prevalentHyp = "0",
diabetes = "0",
totChol = mean(totChol),
sysBP = mean(sysBP),
diaBP = mean(diaBP),
BMI = mean(BMI),
heartRate = mean(heartRate),
glucose = mean(glucose),
prevalentStroke = "0"))
# REMEMBER convert male_test from numeric to FACTOR
male_test$male <- as.factor(male_test$male)
male_test$male <- predict(mylogit, newdata = male_test, type = "response")
```
1. Males: `r male_test$male[[2]]`
2. Female: `r male_test$male[[1]]`
- **Odds = `r male_test$male[[2]]/male_test$male[[1]]`**
### 2.6 Odds Prevalence of Hypertension In Family History
```{r hyperT, cache=TRUE}
hyperT_test <- with(df, data.frame(male = "0",
age = mean(age),
education = "0",
cigsPerDay = 0,
prevalentHyp = c("0","1"), # Factor of Interest
diabetes = "0",
totChol = mean(totChol),
sysBP = mean(sysBP),
diaBP = mean(diaBP),
BMI = mean(BMI),
heartRate = mean(heartRate),
glucose = mean(glucose),
prevalentStroke = "0"))
# REMEMBER convert male_test from numeric to FACTOR
hyperT_test$prevalentHyp <- as.factor(hyperT_test$prevalentHyp)
hyperT_test$prevalentHyp <- predict(mylogit, newdata = hyperT_test, type = "response")
```
1. WITH Prevalence of Hypertension: `r hyperT_test$prevalentHyp[[2]]`
2. NO Prevalence of Hypertension: `r hyperT_test$prevalentHyp[[1]]`
- **Odds = `r hyperT_test$prevalentHyp[[2]]/hyperT_test$prevalentHyp[[1]]`**
### 2.7 Odds Given Age
```{r Age, cache=TRUE}
age_test <- with(df, data.frame(male = "0",
age = c(20,30,40,50,60,70,80),
education = "0",
cigsPerDay = 0,
prevalentHyp = "0",
diabetes = "0",
totChol = mean(totChol),
sysBP = mean(sysBP),
diaBP = mean(diaBP),
BMI = mean(BMI),
heartRate = mean(heartRate),
glucose = mean(glucose),
prevalentStroke = "0"))
age_test$age <- predict(mylogit, newdata = age_test, type = "response")
```
| Age (years) | Probability Given Age | Odds Compared to 20 yr old |
| ----------: | :-------------------: | :-------------------------------------: |
| 20 | `r age_test$age[[1]]` | `r age_test$age[[1]]/age_test$age[[1]]` |
| 30 | `r age_test$age[[2]]` | `r age_test$age[[2]]/age_test$age[[1]]` |
| 40 | `r age_test$age[[3]]` | `r age_test$age[[3]]/age_test$age[[1]]` |
| 50 | `r age_test$age[[4]]` | `r age_test$age[[4]]/age_test$age[[1]]` |
| 60 | `r age_test$age[[5]]` | `r age_test$age[[5]]/age_test$age[[1]]` |
| 70 | `r age_test$age[[6]]` | `r age_test$age[[6]]/age_test$age[[1]]` |
| 80 | `r age_test$age[[7]]` | `r age_test$age[[7]]/age_test$age[[1]]` |
### 2.8 Odds Given Number Of Cigarettes Per Day
```{r Cigarettes, cache=TRUE}
cigs_per_day <- with(df, data.frame(male = "0",
age = mean(age),
education = "0",
cigsPerDay = c(0,10,20,30,40),
prevalentHyp = "0",
diabetes = "0",
totChol = mean(totChol),
sysBP = mean(sysBP),
diaBP = mean(diaBP),
BMI = mean(BMI),
heartRate = mean(heartRate),
glucose = mean(glucose),
prevalentStroke = "0"))
cigs_per_day$cigsPerDay <- predict(mylogit, newdata = cigs_per_day, type = "response")
#cigs_per_day$cigsPerDay
```
| Age (years) | Probability Given Age | Odds Compared to Zero Cigarettes Per Day |
| ----------: | :------------------------------: | :-----------------------------------------------------------: |
| 0 | `r cigs_per_day$cigsPerDay[[1]]` | `r cigs_per_day$cigsPerDay[[1]]/cigs_per_day$cigsPerDay[[1]]` |
| 10 | `r cigs_per_day$cigsPerDay[[2]]` | `r cigs_per_day$cigsPerDay[[2]]/cigs_per_day$cigsPerDay[[1]]` |
| 20 | `r cigs_per_day$cigsPerDay[[3]]` | `r cigs_per_day$cigsPerDay[[3]]/cigs_per_day$cigsPerDay[[1]]` |
| 30 | `r cigs_per_day$cigsPerDay[[4]]` | `r cigs_per_day$cigsPerDay[[4]]/cigs_per_day$cigsPerDay[[1]]` |
| 40 | `r cigs_per_day$cigsPerDay[[5]]` | `r cigs_per_day$cigsPerDay[[5]]/cigs_per_day$cigsPerDay[[1]]` |
### 2.9 Odds Given Systolic Blood Pressure
```{r BP, cache=TRUE}
summary(df$sysBP)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 83.5 117.0 128.0 132.4 144.0 295.0
sysBP_calc <- with(df, data.frame(male = "0",
age = mean(age),
education = "0",
cigsPerDay = 0,
prevalentHyp = "0",
diabetes = "0",
totChol = mean(totChol),
sysBP = c(117, 128, 144, 295),
diaBP = mean(diaBP),
BMI = mean(BMI),
heartRate = mean(heartRate),
glucose = mean(glucose),
prevalentStroke = "0"))
sysBP_calc$sysBP <- predict(mylogit, newdata = sysBP_calc, type = "response")
#sysBP_calc$sysBP
```
| Systolic BP | Probability Given Systolic BP | Odds Systolic BP |
| ----------: | :---------------------------: | :---------------------------------------------: |
| 117 | `r sysBP_calc$sysBP[[1]]` | `r sysBP_calc$sysBP[[1]]/sysBP_calc$sysBP[[1]]` |
| 128 | `r sysBP_calc$sysBP[[2]]` | `r sysBP_calc$sysBP[[2]]/sysBP_calc$sysBP[[1]]` |
| 144 | `r sysBP_calc$sysBP[[3]]` | `r sysBP_calc$sysBP[[3]]/sysBP_calc$sysBP[[1]]` |
| Max 295 | `r sysBP_calc$sysBP[[4]]` | `r sysBP_calc$sysBP[[4]]/sysBP_calc$sysBP[[1]]` |
### 2.10 Odds Given Glucose Levels
```{r Glucose, cache=TRUE}
summary(df$glucose)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 40 72 80 82 85 394
glucose_calc <- with(df, data.frame(male = "0",
age = mean(age),
education = "0",
cigsPerDay = 0,
prevalentHyp = "0",
diabetes = "0",
totChol = mean(totChol),
sysBP = mean(sysBP),
diaBP = mean(diaBP),
BMI = mean(BMI),
heartRate = mean(heartRate),
glucose = c(72, 80, 85, 394),
prevalentStroke = "0"))
glucose_calc$glucose <- predict(mylogit, newdata = glucose_calc, type = "response")
# glucose_calc$glucose.
# 0.094843 0.100852 0.110194 0.239738
```
| Glucose | Probabilities | Odds Given Glucose |
| -------: | :-----------: | :-------------------: |
| 72 | 0.094843 | 1 |
| 80 | 0.100852 | `r 0.100852/0.094843` |
| 85 | 0.110194 | `r 0.110194/0.094843` |
| Max 394 | 0.239738 | `r 0.239738/0.094843` |
## IV. Conclusion
1.
#### Notes
- For analysis help
https://stats.idre.ucla.edu/r/dae/logit-regression/
- For interpretation help
https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/
#### Wald test info
https://www.mbaskool.com/business-concepts/statistics/6916-wald-test.html
https://www.statology.org/wald-test-in-r/
https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faqhow-are-the-likelihood-ratio-wald-and-lagrange-multiplier-score-tests-different-andor-similar/
https://handwiki.org/wiki/Wald_test
https://questionerlab.com/what-is-the-use-of-wald-test-in-logistic-regression
https://stats.oarc.ucla.edu/r/dae/logit-regression/
https://bookdown.org/mike/data_analysis/wald-test.html
https://bookdown.org/mike/data_analysis/hypothesis-testing.html#wald-test
- For analysis help
https://stats.idre.ucla.edu/r/dae/logit-regression/
- For interpretation help
https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/