Last updated: 2021-03-05
Checks: 6 1
Knit directory: esoph-micro-cancer-workflow/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish
to commit the R Markdown file and build the HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200916)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 092bf33. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/figure/
Ignored: data/
Unstaged changes:
Modified: analysis/ordered-logit-model.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/ordered-logit-model.Rmd
) and HTML (docs/ordered-logit-model.html
) files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | cb97ca6 | noah-padgett | 2021-02-25 | updated figures violin |
Rmd | fe971b9 | noah-padgett | 2021-02-13 | violin plot scale fixed |
html | fe971b9 | noah-padgett | 2021-02-13 | violin plot scale fixed |
Q3: Is fuso associated with tumor stage (pTNM) in either data set? Does X bacteria predict stage? Multivariable w/ age, sex, BMI, history of Barrett's
Add to this analysis:
TCGA drop “not reported” from tumor stage.
The goal of this analysis is still to answer the question posed above. However, the methods by which this is done has been expanded upon and the model has been built to be as interpretable as possible. In the results, a deeper explanation of the results is given to help explain the effects of adding covariates and how to interpret the parameters. In general, the model is set up to model how the probability of being in higher tumor stage depends on the individual characteristics of the study participants. Let \(TS\) represents the tumor stage with levels \(j = 1, 2, ..., J\) (TS have levels 0, 1, I, II, III, and IV). We aim to model the odds of being less than or equal to a particular category \[\frac{Pr(TS\leq j)}{Pr(TS> j)}, j = 1, 2, ..., J-1\] We aim to model the logit, which is \[log\left(\frac{Pr(TS\leq j)}{Pr(TS> j)}\right)=logit\left(Pr(TS\leq j)\right)=\beta_{0j}-\left(\eta_1x_1+\eta_2x_2+...+\eta_px_p\right)\] where the terms on the right represent how R parameterizes the ordered logit model with \(\beta_{0j}\) being the intercept/threshold parameter separating category \(j\) from \(j+1\). The \(\eta_p\) regression weights represent the change in logits. A major assumption of the ordered logit model, as parameterized here, is the parallel lines assumption. This means the even though the intercepts differ separate the categories, the regression weights are equal for each successive category.
# in long format
table(dat.16s$tumor.stage)
0 1 I II III IV
11088 264 6336 13464 6072 2640
# by subject
dat <- dat.16s %>% filter(OTU == "Fusobacterium_nucleatum")
table(dat$tumor.stage)
0 1 I II III IV
42 1 24 51 23 10
sum(table(dat$tumor.stage)) # sample size match
[1] 151
mean.dat <- dat.16s.s %>%
group_by(tumor.stage, OTU) %>%
summarize(M = mean(Abundance))
`summarise()` has grouped output by 'tumor.stage'. You can override using the `.groups` argument.
ggplot(dat.16s.s, aes(x=tumor.stage, y=Abundance))+
geom_violin()+
geom_jitter(alpha=0.25,width = 0.25)+
geom_point(data=mean.dat, aes(x=tumor.stage, y = M), size=2, alpha =0.9, color="red")+
labs(x="Tumor Stage", y="Percent Relative Abundance",
title="Distribution of abundance across tumor stage",
subtitle="Red dot is average percent relative abundance")+
scale_y_continuous(trans="pseudo_log")+
facet_wrap(.~OTU, nrow=1, scales="free")+
theme_classic()
Stage “1” has only 1 unique sample and will be dropped from subsequent analyses. And remove NA values (7).
dat.16s.s <- dat.16s.s %>%
filter(tumor.stage != "1")%>%
mutate(tumor.stage = droplevels(tumor.stage, exclude=c("1",NA)))
Now, transform data into a useful wide-format representation to avoid inflating sample size. The long-format above is very useful for making plots, but less useful for the ordered logistic regression to follow.
dat.16s.wide <- dat.16s.s %>%
dplyr::select(OTU, Abundance, Sample.ID, BarrettsHist, bar.c, female, female.c, age.c,BMI.n, bmi.c, sample_type, tumor.cat, tumor.stage) %>%
pivot_wider(names_from = OTU, values_from = Abundance)
This in the intercept only model. Implying that we will only estimate the unique intercepts for the \(J-1\) categories.
## fit ordered logit model and store results 'm'
# use only 1 bacteria
fit <- fit0 <- MASS::polr(tumor.stage ~ 1, data = filter(dat.16s.s, OTU =="Fusobacterium nucleatum"), Hess=TRUE)
## view a summary of the model
summary(fit)
Call:
MASS::polr(formula = tumor.stage ~ 1, data = filter(dat.16s.s,
OTU == "Fusobacterium nucleatum"), Hess = TRUE)
No coefficients
Intercepts:
Value Std. Error t value
0|I -0.945 0.182 -5.194
I|II -0.241 0.164 -1.467
II|III 1.266 0.197 6.421
III|IV 2.639 0.327 8.062
Residual Deviance: 445.35
AIC: 453.35
# save fitted probabilities
# these represent the estimated probability of each tumor stage
pp <- fitted(fit)
pp[1,]
0 I II III IV
0.27997 0.15998 0.34002 0.15335 0.06668
The model can be more fully described as \[\begin{align*} logit\left(Pr(TS\leq 0)\right)&=-0.95\\ logit\left(Pr(TS\leq I)\right)&=-0.24\\ logit\left(Pr(TS\leq II)\right)&=1.27\\ logit\left(Pr(TS\leq III)\right)&=2.64\\ \end{align*}\]
Interpreting the intercepts only model is relatively straightforward. The main interpretation we can obtain from this base model is the probability of each category. We obtain these probabilities by applying the inverse-logit transformation to obtain each of the cumulative probabilities and then use the relevant probabilities to compute each individual category probability. That is, \[\begin{align*} Pr(TS = 0)&= Pr(TS\leq 0) = 0.28\\ Pr(TS = I)&=Pr(TS\leq I) - Pr(TS\leq 0)= 0.44-0.28 =0.16\\ Pr(TS = II)&=Pr(TS\leq II) - Pr(TS\leq I) = 0.78 - 0.44 = 0.34 \\ Pr(TS = III)&=Pr(TS\leq III) - Pr(TS\leq II) =0.93-0.78=0.15\\ Pr(TS = IV)&=1 - Pr(TS\leq IV) =1-0.93=0.07,\\ \end{align*}\] which, in all is not very informative. More information is definately gained as predictors/covariates are introduced.
This model expands on the previous model to include OTU abundance as a predictor of TS. OTU abundance is captured by four variables as predictors of TS (the four OTUs listed at the top of this document).
dat0 <- dat.16s.wide
## fit ordered logit model and store results 'm'
fit <- fit1 <- MASS::polr(tumor.stage ~ 1+ `Streptococcus spp.` + `Fusobacterium nucleatum` + `Prevotella melaninogenica` + `Campylobacter concisus`, data = dat.16s.wide, Hess=TRUE)
## view a summary of the model
summary(fit)
Call:
MASS::polr(formula = tumor.stage ~ 1 + `Streptococcus spp.` +
`Fusobacterium nucleatum` + `Prevotella melaninogenica` +
`Campylobacter concisus`, data = dat.16s.wide, Hess = TRUE)
Coefficients:
Value Std. Error t value
`Streptococcus spp.` 0.000488 0.00577 0.0846
`Fusobacterium nucleatum` 0.013240 0.01245 1.0639
`Prevotella melaninogenica` 0.015590 0.01532 1.0178
`Campylobacter concisus` -0.037641 0.07027 -0.5357
Intercepts:
Value Std. Error t value
0|I -0.821 0.274 -3.001
I|II -0.106 0.267 -0.397
II|III 1.424 0.290 4.907
III|IV 2.803 0.390 7.186
Residual Deviance: 442.49
AIC: 458.49
anova(fit1, fit0) # Chi-square difference test
Likelihood ratio tests of ordinal regression models
Response: tumor.stage
Model
1 1
2 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` + `Prevotella melaninogenica` + `Campylobacter concisus`
Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
1 146 445.3
2 142 442.5 1 vs 2 4 2.864 0.5809
# obtain approximate p-values
ctable <- coef(summary(fit))
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))
Value Std. Error t value p value
`Streptococcus spp.` 0.0004884 0.005772 0.08461 9.326e-01
`Fusobacterium nucleatum` 0.0132397 0.012445 1.06385 2.874e-01
`Prevotella melaninogenica` 0.0155899 0.015317 1.01780 3.088e-01
`Campylobacter concisus` -0.0376410 0.070265 -0.53570 5.922e-01
0|I -0.8207942 0.273501 -3.00106 2.690e-03
I|II -0.1057784 0.266764 -0.39652 6.917e-01
II|III 1.4243029 0.290278 4.90669 9.263e-07
III|IV 2.8030741 0.390066 7.18615 6.664e-13
# obtain CIs
(ci <- confint(fit)) # CIs assuming normality
Waiting for profiling to be done...
2.5 % 97.5 %
`Streptococcus spp.` -0.01089 0.01180
`Fusobacterium nucleatum` -0.01124 0.03852
`Prevotella melaninogenica` -0.01508 0.04568
`Campylobacter concisus` -0.19516 0.09332
## OR and CI
exp(cbind(OR = coef(fit), ci))
OR 2.5 % 97.5 %
`Streptococcus spp.` 1.0005 0.9892 1.012
`Fusobacterium nucleatum` 1.0133 0.9888 1.039
`Prevotella melaninogenica` 1.0157 0.9850 1.047
`Campylobacter concisus` 0.9631 0.8227 1.098
# save fitted logits
pp <- fitted(fit)
# predictive data
dat0 <- cbind(dat0, predict(fit, newdata = dat.16s.wide, "probs"))
## melt data set to long for ggplot2
dat1 <- dat0 %>%
pivot_longer(
cols=`0`:`IV`,
names_to = "TS",
values_to = "Pred.Prob"
) %>%
pivot_longer(
cols=`Streptococcus spp.`:`Campylobacter concisus`,
names_to="OTU",
values_to="Abundance"
)
## plot predicted probabilities across Abundance values for each level of OTU
## facetted by tumor.stage
ggplot(dat1, aes(x = Abundance, y = Pred.Prob, color=TS, group=TS, linetype=TS)) +
geom_line() +
facet_grid(OTU ~., scales="free")+
labs(y="Probability of Tumor Stage",
x="Relative Abundance (%)",
title="Tumor stage likelihood with bacteria abundance",
color="Tumor Stage", linetype="Tumor Stage")+
theme(
panel.grid = element_blank()
)
The results from the model when the abundance of the four OTUs are included in the model becomes a bit more complicated. First of all, the model with OTU abundances did not fit better in terms of AIC, BIC, and no difference was found from a chi-square difference test.
For interpretation purposes, the following is how to interpret the effects of OTU abundance on predicting tumor stage. First, the logit model is \[\begin{align*} logit\left(Pr(TS\leq 0)\right)&=-0.95\\ logit\left(Pr(TS\leq I)\right)&=-0.24\\ logit\left(Pr(TS\leq II)\right)&=1.27\\ logit\left(Pr(TS\leq III)\right)&=2.64\\ \end{align*}\] While interpreting the odds ratios can be done as follows. First, take Streptococcus spp., the OR was 1.00 95% CI (0.99, 1.01). This implies that for individuals who had a 1% higher relative abundance of Streptococcus spp., the odds of being in a higher tumor stage is multiplied 1.00 times (i.e., 0% increase on average) holding the abundance of other OTUs constant. This means that the knowledge of relative abundance of Strepto was not informative over the base probability we estimated in model 0.
For Fusobacterium nucleatum (OR = 1.01, 95% CI [0.99, 1.04]) and Prevotella melaninogenica (OR = 1.01, 95% CI [0.99, 1.05]), the interpretation is as follows because the OR is the same. For individuals who had a 1% higher relative abundance of this OTU (Fuso. or Prevo.), the odds of being in a higher tumor stage is multiplied 1.01 times (i.e., about a 1% increase on average) holding the abundance of other OTUs constant.
dat0 <- dat.16s.wide
## fit ordered logit model and store results 'm'
fit <- fit2 <- MASS::polr(tumor.stage ~ 1+ `Streptococcus spp.` + `Fusobacterium nucleatum` + `Prevotella melaninogenica` + `Campylobacter concisus` + age.c + female.c + bmi.c + bar.c, data = dat.16s.wide, Hess=TRUE)
## view a summary of the model
summary(fit)
Call:
MASS::polr(formula = tumor.stage ~ 1 + `Streptococcus spp.` +
`Fusobacterium nucleatum` + `Prevotella melaninogenica` +
`Campylobacter concisus` + age.c + female.c + bmi.c + bar.c,
data = dat.16s.wide, Hess = TRUE)
Coefficients:
Value Std. Error t value
`Streptococcus spp.` 0.00250 0.00589 0.4240
`Fusobacterium nucleatum` 0.01447 0.01264 1.1453
`Prevotella melaninogenica` 0.01515 0.01595 0.9501
`Campylobacter concisus` -0.03350 0.07537 -0.4444
age.c -0.00415 0.01371 -0.3025
female.c -0.01082 0.43529 -0.0248
bmi.c -0.01602 0.02275 -0.7043
bar.c 0.62262 0.34717 1.7934
Intercepts:
Value Std. Error t value
0|I -0.776 0.276 -2.812
I|II -0.051 0.270 -0.187
II|III 1.500 0.295 5.080
III|IV 2.892 0.394 7.334
Residual Deviance: 438.63
AIC: 462.63
anova(fit2, fit0) # Chi-square difference test
Likelihood ratio tests of ordinal regression models
Response: tumor.stage
Model
1 1
2 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` + `Prevotella melaninogenica` + `Campylobacter concisus` + age.c + female.c + bmi.c + bar.c
Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
1 146 445.3
2 138 438.6 1 vs 2 8 6.715 0.5677
# obtain approximate p-values
ctable <- coef(summary(fit))
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))
Value Std. Error t value p value
`Streptococcus spp.` 0.002499 0.005893 0.42397 6.716e-01
`Fusobacterium nucleatum` 0.014473 0.012636 1.14533 2.521e-01
`Prevotella melaninogenica` 0.015152 0.015948 0.95008 3.421e-01
`Campylobacter concisus` -0.033496 0.075370 -0.44442 6.567e-01
age.c -0.004149 0.013715 -0.30249 7.623e-01
female.c -0.010816 0.435289 -0.02485 9.802e-01
bmi.c -0.016024 0.022751 -0.70430 4.812e-01
bar.c 0.622623 0.347169 1.79343 7.290e-02
0|I -0.775974 0.275914 -2.81238 4.918e-03
I|II -0.050542 0.270417 -0.18691 8.517e-01
II|III 1.500106 0.295276 5.08035 3.767e-07
III|IV 2.892116 0.394360 7.33369 2.239e-13
# obtain CIs
(ci <- confint(fit)) # CIs assuming normality
Waiting for profiling to be done...
2.5 % 97.5 %
`Streptococcus spp.` -0.009111 0.01405
`Fusobacterium nucleatum` -0.010312 0.04018
`Prevotella melaninogenica` -0.016973 0.04624
`Campylobacter concisus` -0.199242 0.10805
age.c -0.031171 0.02275
female.c -0.873271 0.83890
bmi.c -0.061385 0.02842
bar.c -0.056145 1.30794
## OR and CI
exp(cbind(OR = coef(fit), ci))
OR 2.5 % 97.5 %
`Streptococcus spp.` 1.0025 0.9909 1.014
`Fusobacterium nucleatum` 1.0146 0.9897 1.041
`Prevotella melaninogenica` 1.0153 0.9832 1.047
`Campylobacter concisus` 0.9671 0.8194 1.114
age.c 0.9959 0.9693 1.023
female.c 0.9892 0.4176 2.314
bmi.c 0.9841 0.9405 1.029
bar.c 1.8638 0.9454 3.699
# save fitted logits
pp <- fitted(fit)
# predictive data
dat0 <- cbind(dat0, predict(fit, newdata = dat.16s.wide, "probs"))
## melt data set to long for ggplot2
dat1 <- dat0 %>%
pivot_longer(
cols=`0`:`IV`,
names_to = "TS",
values_to = "Pred.Prob"
) %>%
pivot_longer(
cols=`Streptococcus spp.`:`Campylobacter concisus`,
names_to="OTU",
values_to="Abundance"
)
## plot predicted probabilities across Abundance values for each level of OTU
## facetted by tumor.stage
ggplot(dat1, aes(x = Abundance, y = Pred.Prob, color=TS, group=TS, linetype=TS)) +
geom_line() +
facet_grid(OTU ~., scales="free")+
labs(y="Probability of Tumor Stage",
x="Relative Abundance (%)",
title="Tumor stage likelihood with bacteria abundance",
color="Tumor Stage", linetype="Tumor Stage")+
theme(
panel.grid = element_blank()
)
No significant changes. Everything is still insignificant. For interpretation purposes, the results in from this model can be interpreted the same as the previous because all covariates were mean-centered.
dat.16s.wide <- dat.16s.wide %>%
mutate(
T0 = I(as.numeric(tumor.stage) >= 1),
TI = I(as.numeric(tumor.stage) >= 2),
TII = I(as.numeric(tumor.stage) >= 3),
TIII = I(as.numeric(tumor.stage) >= 4),
TIV = I(as.numeric(tumor.stage) >= 5),
)
apply(dat.16s.wide[,c('TI','TII','TIII', 'TIV')],2, table)
TI TII TIII TIV
FALSE 42 66 117 140
TRUE 108 84 33 10
summary(glm(TI ~ 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` + `Prevotella melaninogenica` + `Campylobacter concisus`+ age.c + female.c + bmi.c + bar.c, data = dat.16s.wide, family=binomial(link="logit")))
Call:
glm(formula = TI ~ 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` +
`Prevotella melaninogenica` + `Campylobacter concisus` +
age.c + female.c + bmi.c + bar.c, family = binomial(link = "logit"),
data = dat.16s.wide)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.001 -1.186 0.624 0.828 1.259
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.49816 0.30842 1.62 0.106
`Streptococcus spp.` 0.01025 0.00768 1.34 0.182
`Fusobacterium nucleatum` 0.06398 0.04394 1.46 0.145
`Prevotella melaninogenica` 0.01499 0.02613 0.57 0.566
`Campylobacter concisus` 0.00351 0.11131 0.03 0.975
age.c 0.00969 0.01783 0.54 0.587
female.c -0.16395 0.51003 -0.32 0.748
bmi.c 0.01611 0.03096 0.52 0.603
bar.c 0.88774 0.44663 1.99 0.047 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 177.89 on 149 degrees of freedom
Residual deviance: 163.73 on 141 degrees of freedom
AIC: 181.7
Number of Fisher Scoring iterations: 6
summary(glm(TII ~ 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` + `Prevotella melaninogenica` + `Campylobacter concisus`+ age.c + female.c + bmi.c + bar.c, data = dat.16s.wide, family=binomial(link="logit")))
Call:
glm(formula = TII ~ 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` +
`Prevotella melaninogenica` + `Campylobacter concisus` +
age.c + female.c + bmi.c + bar.c, family = binomial(link = "logit"),
data = dat.16s.wide)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.489 -1.212 0.704 1.108 1.499
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.000384 0.283602 0.00 0.999
`Streptococcus spp.` 0.003275 0.006798 0.48 0.630
`Fusobacterium nucleatum` 0.073311 0.038078 1.93 0.054 .
`Prevotella melaninogenica` -0.001872 0.020207 -0.09 0.926
`Campylobacter concisus` -0.059713 0.088063 -0.68 0.498
age.c -0.014604 0.015986 -0.91 0.361
female.c 0.402529 0.486388 0.83 0.408
bmi.c 0.006519 0.026552 0.25 0.806
bar.c 0.486287 0.399061 1.22 0.223
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 205.78 on 149 degrees of freedom
Residual deviance: 195.34 on 141 degrees of freedom
AIC: 213.3
Number of Fisher Scoring iterations: 5
summary(glm(TIII ~ 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` + `Prevotella melaninogenica` + `Campylobacter concisus`+ age.c + female.c + bmi.c + bar.c, data = dat.16s.wide, family=binomial(link="logit")))
Call:
glm(formula = TIII ~ 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` +
`Prevotella melaninogenica` + `Campylobacter concisus` +
age.c + female.c + bmi.c + bar.c, family = binomial(link = "logit"),
data = dat.16s.wide)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.347 -0.756 -0.576 -0.202 2.107
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.26111 0.36325 -3.47 0.00052 ***
`Streptococcus spp.` -0.00693 0.00900 -0.77 0.44116
`Fusobacterium nucleatum` -0.01641 0.02223 -0.74 0.46031
`Prevotella melaninogenica` 0.03267 0.02188 1.49 0.13532
`Campylobacter concisus` -0.40193 0.36903 -1.09 0.27608
age.c 0.00600 0.01879 0.32 0.74944
female.c -0.32657 0.63515 -0.51 0.60714
bmi.c -0.10841 0.04514 -2.40 0.01632 *
bar.c 0.80964 0.49326 1.64 0.10071
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 158.07 on 149 degrees of freedom
Residual deviance: 144.20 on 141 degrees of freedom
AIC: 162.2
Number of Fisher Scoring iterations: 7
summary(glm(TIV ~ 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` + `Prevotella melaninogenica` + `Campylobacter concisus`+ age.c + female.c + bmi.c + bar.c, data = dat.16s.wide, family=binomial(link="logit")))
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Call:
glm(formula = TIV ~ 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` +
`Prevotella melaninogenica` + `Campylobacter concisus` +
age.c + female.c + bmi.c + bar.c, family = binomial(link = "logit"),
data = dat.16s.wide)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0213 -0.3107 -0.1561 -0.0027 2.2971
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.53618 0.87137 -4.06 4.9e-05 ***
`Streptococcus spp.` -0.00368 0.01603 -0.23 0.8184
`Fusobacterium nucleatum` -2.65756 2.07129 -1.28 0.1995
`Prevotella melaninogenica` 0.15010 0.05747 2.61 0.0090 **
`Campylobacter concisus` 0.42004 0.33434 1.26 0.2090
age.c -0.01818 0.03911 -0.46 0.6420
female.c -1.50786 1.16883 -1.29 0.1970
bmi.c -0.28740 0.10923 -2.63 0.0085 **
bar.c 0.16138 0.94158 0.17 0.8639
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 73.479 on 149 degrees of freedom
Residual deviance: 47.240 on 141 degrees of freedom
AIC: 65.24
Number of Fisher Scoring iterations: 13
# in long format
table(dat.rna.s$tumor.stage)
I II III IV
84 284 204 32
# by subject
dat <- dat.rna.s %>% filter(OTU == "Fusobacterium nucleatum")
table(dat$tumor.stage)
I II III IV
21 71 51 8
sum(table(dat$tumor.stage))
[1] 151
nrow(dat) # matches but there's missing data
[1] 173
sum(is.na(dat$Abundance))
[1] 107
# number of non-missing abundnance data
nrow(dat) - sum(is.na(dat$Abundance))
[1] 66
# matches
dat0 <- dat %>%
filter()
table(dat0$tumor.stage)
I II III IV
21 71 51 8
# plot functions
#root function
root<-function(x){
x <- ifelse(x < 0, 0, x)
x**(0.2)
}
#inverse root function
invroot<-function(x){
x**(5)
}
mean.dat <- dat.rna.s %>%
group_by(tumor.stage, OTU) %>%
summarize(M = mean(Abundance, na.rm=T))
`summarise()` has grouped output by 'tumor.stage'. You can override using the `.groups` argument.
ggplot(dat.rna.s, aes(x=tumor.stage, y=Abundance))+
geom_violin()+
geom_jitter(alpha=0.25,width = 0.25)+
geom_point(data=mean.dat, aes(x=tumor.stage, y = M), size=2, alpha =0.9, color="red")+
labs(x="Tumor Stage", y="Percent Relative Abundance",
title="Distribution of abundance across tumor stage",
subtitle="Red dot is average percent relative abundance")+
scale_y_continuous(
trans=scales::trans_new("root", root, invroot),
breaks=c(0, 0.001,0.01, 0.1, 1,10,50),
labels = c(0, 0.001,0.01, 0.1, 1,10,50),
limits = c(0, 50)
)+
facet_wrap(.~OTU, nrow=1, scales="free")+
theme_classic()
Warning: Removed 428 rows containing non-finite values (stat_ydensity).
Warning: Removed 461 rows containing missing values (geom_point).
Stage IV only has two cases, remove for analysis.
dat.rna.s <- dat.rna.s %>%
filter(is.na(Abundance) == F, tumor.stage!="IV")%>%
mutate(
tumor.stage = factor(
tumor.stage,
levels = c("I", "II", "III"), ordered=T
)
)
Now, transform data into a useful wide-format representation to avoid inflating sample size. The long-format above is very useful for making plots, but less useful for the ordered logistic regression to follow.
dat.rna.wide <- dat.rna.s %>%
dplyr::select(OTU, Abundance, ID, BarrettsHist, bar.c, female.c, age.c, bmi.c, tumor.stage) %>%
pivot_wider(names_from = OTU, values_from = Abundance)
dat.rna.wide$age.c[is.na(dat.rna.wide$age.c)]<-0
This in the intercept only model. Implying that we will only estimate the unique intercepts for the \(J-1\) categories.
## fit ordered logit model and store results 'm'
fit <- fit0 <- MASS::polr(tumor.stage ~ 1, data = dat.rna.wide, Hess=TRUE)
## view a summary of the model
summary(fit)
Call:
MASS::polr(formula = tumor.stage ~ 1, data = dat.rna.wide, Hess = TRUE)
No coefficients
Intercepts:
Value Std. Error t value
I|II -1.276 0.326 -3.909
II|III 0.804 0.292 2.757
Residual Deviance: 115.42
AIC: 119.42
# save fitted probabilities
# these represent the estimated probability of each tumor stage
pp <- fitted(fit)
pp[1,]
I II III
0.2182 0.4727 0.3091
This model expands on the previous model to include OTU abundance as a predictor of TS. OTU abundance is captured by four variables as predictors of TS (the four OTUs listed at the top of this document).
dat0 <- dat.rna.wide
## fit ordered logit model and store results 'm'
fit <- fit1 <- MASS::polr(tumor.stage ~ 1+ `Streptococcus spp.` + `Fusobacterium nucleatum` + `Prevotella melaninogenica` + `Campylobacter concisus`, data = dat.rna.wide, Hess=TRUE)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## view a summary of the model
summary(fit)
Call:
MASS::polr(formula = tumor.stage ~ 1 + `Streptococcus spp.` +
`Fusobacterium nucleatum` + `Prevotella melaninogenica` +
`Campylobacter concisus`, data = dat.rna.wide, Hess = TRUE)
Coefficients:
Value Std. Error t value
`Streptococcus spp.` 2.7963 4.913 0.569
`Fusobacterium nucleatum` -0.0525 0.159 -0.330
`Prevotella melaninogenica` -0.6643 1.628 -0.408
`Campylobacter concisus` 7.2512 15.257 0.475
Intercepts:
Value Std. Error t value
I|II -1.210 0.367 -3.296
II|III 0.884 0.345 2.565
Residual Deviance: 114.53
AIC: 126.53
anova(fit1, fit0) # Chi-square difference test
Likelihood ratio tests of ordinal regression models
Response: tumor.stage
Model
1 1
2 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` + `Prevotella melaninogenica` + `Campylobacter concisus`
Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
1 53 115.4
2 49 114.5 1 vs 2 4 0.8931 0.9255
# obtain approximate p-values
ctable <- coef(summary(fit))
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))
Value Std. Error t value p value
`Streptococcus spp.` 2.79631 4.9131 0.5692 0.5692510
`Fusobacterium nucleatum` -0.05254 0.1594 -0.3296 0.7416751
`Prevotella melaninogenica` -0.66427 1.6278 -0.4081 0.6832171
`Campylobacter concisus` 7.25117 15.2571 0.4753 0.6345980
I|II -1.21043 0.3673 -3.2957 0.0009819
II|III 0.88364 0.3445 2.5646 0.0103282
# obtain CIs
(ci <- confint(fit)) # CIs assuming normality
Waiting for profiling to be done...
2.5 % 97.5 %
`Streptococcus spp.` -7.0912 13.0226
`Fusobacterium nucleatum` -0.3907 0.2666
`Prevotella melaninogenica` -4.6019 2.3536
`Campylobacter concisus` -19.8730 43.8222
## OR and CI
exp(cbind(OR = coef(fit), ci))
OR 2.5 % 97.5 %
`Streptococcus spp.` 16.3841 8.324e-04 4.525e+05
`Fusobacterium nucleatum` 0.9488 6.766e-01 1.306e+00
`Prevotella melaninogenica` 0.5146 1.003e-02 1.052e+01
`Campylobacter concisus` 1409.7570 2.340e-09 1.076e+19
# predictive data
pp <- as.data.frame(predict(fit, newdata = dat.rna.wide, "probs"))
dat0 <- cbind(dat0, pp)
## melt data set to long for ggplot2
dat1 <- dat0 %>%
pivot_longer(
cols=`I`:`III`,
names_to = "TS",
values_to = "Pred.Prob"
) %>%
pivot_longer(
cols=`Fusobacterium nucleatum`:`Campylobacter concisus`,
names_to="OTU",
values_to="Abundance"
)
## plot predicted probabilities across Abundance values for each level of OTU
## facetted by tumor.stage
ggplot(dat1, aes(x = Abundance, y = Pred.Prob, color=TS, group=TS, linetype=TS)) +
geom_line() +
facet_grid(OTU ~., scales="free")+
labs(y="Probability of Tumor Stage",
x="Relative Abundance (%)",
title="Tumor stage likelihood with bacteria abundance",
color="Tumor Stage", linetype="Tumor Stage")+
scale_x_continuous(
trans=scales::trans_new("root", root, invroot),
breaks=c(0, 0.001,0.01, 0.1, 1,10,25),
labels = c(0, 0.001,0.01, 0.1, 1,10,25),
limits = c(0, 25)
)+
theme(
panel.grid = element_blank()
)
dat0 <- dat.rna.wide
## fit ordered logit model and store results 'm'
fit <- fit2 <- MASS::polr(tumor.stage ~ 1+ `Streptococcus spp.` + `Fusobacterium nucleatum` + `Prevotella melaninogenica` + `Campylobacter concisus` + age.c + female.c + bmi.c + bar.c, data = dat.rna.wide, Hess=TRUE)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## view a summary of the model
summary(fit)
Call:
MASS::polr(formula = tumor.stage ~ 1 + `Streptococcus spp.` +
`Fusobacterium nucleatum` + `Prevotella melaninogenica` +
`Campylobacter concisus` + age.c + female.c + bmi.c + bar.c,
data = dat.rna.wide, Hess = TRUE)
Coefficients:
Value Std. Error t value
`Streptococcus spp.` 0.8852 5.2418 0.169
`Fusobacterium nucleatum` -0.1576 0.1786 -0.882
`Prevotella melaninogenica` 1.0219 1.9421 0.526
`Campylobacter concisus` -6.2926 17.9404 -0.351
age.c -0.0150 0.0233 -0.645
female.c -2.4460 0.8541 -2.864
bmi.c 0.0144 0.0456 0.315
bar.c -2.0994 1.1047 -1.900
Intercepts:
Value Std. Error t value
I|II -1.644 0.451 -3.645
II|III 1.086 0.374 2.900
Residual Deviance: 97.52
AIC: 117.52
anova(fit2, fit0) # Chi-square difference test
Likelihood ratio tests of ordinal regression models
Response: tumor.stage
Model
1 1
2 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` + `Prevotella melaninogenica` + `Campylobacter concisus` + age.c + female.c + bmi.c + bar.c
Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
1 53 115.42
2 45 97.52 1 vs 2 8 17.9 0.022
# obtain approximate p-values
ctable <- coef(summary(fit))
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))
Value Std. Error t value p value
`Streptococcus spp.` 0.88522 5.24183 0.1689 0.8658937
`Fusobacterium nucleatum` -0.15762 0.17864 -0.8823 0.3776030
`Prevotella melaninogenica` 1.02186 1.94212 0.5262 0.5987809
`Campylobacter concisus` -6.29262 17.94037 -0.3508 0.7257744
age.c -0.01502 0.02329 -0.6451 0.5188823
female.c -2.44604 0.85409 -2.8639 0.0041846
bmi.c 0.01437 0.04562 0.3150 0.7527811
bar.c -2.09943 1.10468 -1.9005 0.0573691
I|II -1.64361 0.45090 -3.6452 0.0002672
II|III 1.08557 0.37430 2.9003 0.0037283
# obtain CIs
(ci <- confint(fit)) # CIs assuming normality
Waiting for profiling to be done...
2.5 % 97.5 %
`Streptococcus spp.` -9.44506 11.77218
`Fusobacterium nucleatum` -0.51454 0.20082
`Prevotella melaninogenica` -3.09305 4.88336
`Campylobacter concisus` -41.54235 31.43536
age.c -0.06142 0.03073
female.c -4.24838 -0.84159
bmi.c -0.07663 0.10587
bar.c -4.39018 0.11094
## OR and CI
exp(cbind(OR = coef(fit), ci))
OR 2.5 % 97.5 %
`Streptococcus spp.` 2.42352 7.908e-05 1.296e+05
`Fusobacterium nucleatum` 0.85418 5.978e-01 1.222e+00
`Prevotella melaninogenica` 2.77834 4.536e-02 1.321e+02
`Campylobacter concisus` 0.00185 9.086e-19 4.490e+13
age.c 0.98509 9.404e-01 1.031e+00
female.c 0.08664 1.429e-02 4.310e-01
bmi.c 1.01447 9.262e-01 1.112e+00
bar.c 0.12253 1.240e-02 1.117e+00
# predictive data
pp <- as.data.frame(predict(fit, newdata = dat.rna.wide, "probs"))
dat0 <- cbind(dat0, pp)
## melt data set to long for ggplot2
dat1 <- dat0 %>%
pivot_longer(
cols=`I`:`III`,
names_to = "TS",
values_to = "Pred.Prob"
) %>%
pivot_longer(
cols=`Fusobacterium nucleatum`:`Campylobacter concisus`,
names_to="OTU",
values_to="Abundance"
)
## plot predicted probabilities across Abundance values for each level of OTU
## facetted by tumor.stage
ggplot(dat1, aes(x = Abundance, y = Pred.Prob, color=TS, group=TS, linetype=TS)) +
geom_line() +
facet_grid(OTU ~., scales="free")+
labs(y="Probability of Tumor Stage",
x="Relative Abundance (%)",
title="Tumor stage likelihood with bacteria abundance",
color="Tumor Stage", linetype="Tumor Stage")+
scale_x_continuous(
trans=scales::trans_new("root", root, invroot),
breaks=c(0, 0.001,0.01, 0.1, 1,10,25),
labels = c(0, 0.001,0.01, 0.1, 1,10,25),
limits = c(0, 25)
)+
theme(
panel.grid = element_blank()
)
dat.rna.wide <- dat.rna.wide %>%
mutate(
TI = I(as.numeric(tumor.stage) >= 1),
TII = I(as.numeric(tumor.stage) >= 2),
TIII = I(as.numeric(tumor.stage) >= 3)
)
apply(dat.rna.wide[,c('TII','TIII')],2, table)
TII TIII
FALSE 12 38
TRUE 43 17
summary(glm(TII ~ 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` + `Prevotella melaninogenica` + `Campylobacter concisus`+ age.c + female.c + bmi.c + bar.c, data = dat.rna.wide, family=binomial(link="logit")))
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Call:
glm(formula = TII ~ 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` +
`Prevotella melaninogenica` + `Campylobacter concisus` +
age.c + female.c + bmi.c + bar.c, family = binomial(link = "logit"),
data = dat.rna.wide)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.157 0.000 0.138 0.333 1.762
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0420 0.9230 0.05 0.9637
`Streptococcus spp.` 184.3652 108.3115 1.70 0.0887 .
`Fusobacterium nucleatum` -1.0753 0.8912 -1.21 0.2276
`Prevotella melaninogenica` 26.7602 51.5574 0.52 0.6037
`Campylobacter concisus` 99.8426 493.2974 0.20 0.8396
age.c -0.0457 0.0458 -1.00 0.3189
female.c -3.4180 1.3056 -2.62 0.0088 **
bmi.c 0.0139 0.0561 0.25 0.8049
bar.c -2.2659 1.3256 -1.71 0.0874 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 57.706 on 54 degrees of freedom
Residual deviance: 27.629 on 46 degrees of freedom
AIC: 45.63
Number of Fisher Scoring iterations: 9
summary(glm(TIII ~ 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` + `Prevotella melaninogenica` + `Campylobacter concisus`+ age.c + female.c + bmi.c + bar.c, data = dat.rna.wide, family=binomial(link="logit")))
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Call:
glm(formula = TIII ~ 1 + `Streptococcus spp.` + `Fusobacterium nucleatum` +
`Prevotella melaninogenica` + `Campylobacter concisus` +
age.c + female.c + bmi.c + bar.c, family = binomial(link = "logit"),
data = dat.rna.wide)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5739 -0.7366 -0.0001 0.9631 2.2206
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.52e+00 6.71e+02 -0.01 0.993
`Streptococcus spp.` -2.28e+01 1.09e+01 -2.10 0.036 *
`Fusobacterium nucleatum` -1.62e-01 2.70e-01 -0.60 0.548
`Prevotella melaninogenica` -3.18e+01 1.36e+01 -2.33 0.020 *
`Campylobacter concisus` 2.10e+02 9.21e+01 2.29 0.022 *
age.c 7.13e-03 3.00e-02 0.24 0.812
female.c -1.86e+01 2.69e+03 -0.01 0.994
bmi.c -8.57e-02 8.76e-02 -0.98 0.328
bar.c -1.87e+01 3.20e+03 -0.01 0.995
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 68.021 on 54 degrees of freedom
Residual deviance: 45.884 on 46 degrees of freedom
AIC: 63.88
Number of Fisher Scoring iterations: 18
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] cowplot_1.1.1 dendextend_1.14.0 ggdendro_0.1.22 reshape2_1.4.4
[5] car_3.0-10 carData_3.0-4 gvlma_1.0.0.3 patchwork_1.1.1
[9] viridis_0.5.1 viridisLite_0.3.0 gridExtra_2.3 xtable_1.8-4
[13] kableExtra_1.3.1 MASS_7.3-53 data.table_1.13.6 readxl_1.3.1
[17] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.3 purrr_0.3.4
[21] readr_1.4.0 tidyr_1.1.2 tibble_3.0.6 ggplot2_3.3.3
[25] tidyverse_1.3.0 lmerTest_3.1-3 lme4_1.1-26 Matrix_1.2-18
[29] vegan_2.5-7 lattice_0.20-41 permute_0.9-5 phyloseq_1.34.0
[33] workflowr_1.6.2
loaded via a namespace (and not attached):
[1] minqa_1.2.4 colorspace_2.0-0 rio_0.5.16
[4] ellipsis_0.3.1 rprojroot_2.0.2 XVector_0.30.0
[7] fs_1.5.0 rstudioapi_0.13 farver_2.0.3
[10] lubridate_1.7.9.2 xml2_1.3.2 codetools_0.2-16
[13] splines_4.0.3 knitr_1.31 ade4_1.7-16
[16] jsonlite_1.7.2 nloptr_1.2.2.2 broom_0.7.4
[19] cluster_2.1.0 dbplyr_2.1.0 BiocManager_1.30.10
[22] compiler_4.0.3 httr_1.4.2 backports_1.2.1
[25] assertthat_0.2.1 cli_2.3.0 later_1.1.0.1
[28] htmltools_0.5.1.1 prettyunits_1.1.1 tools_4.0.3
[31] igraph_1.2.6 gtable_0.3.0 glue_1.4.2
[34] Rcpp_1.0.6 Biobase_2.50.0 cellranger_1.1.0
[37] vctrs_0.3.6 Biostrings_2.58.0 rhdf5filters_1.2.0
[40] multtest_2.46.0 ape_5.4-1 nlme_3.1-149
[43] iterators_1.0.13 xfun_0.20 ps_1.5.0
[46] openxlsx_4.2.3 rvest_0.3.6 lifecycle_0.2.0
[49] statmod_1.4.35 zlibbioc_1.36.0 scales_1.1.1
[52] hms_1.0.0 promises_1.1.1 parallel_4.0.3
[55] biomformat_1.18.0 rhdf5_2.34.0 curl_4.3
[58] yaml_2.2.1 stringi_1.5.3 highr_0.8
[61] S4Vectors_0.28.1 foreach_1.5.1 BiocGenerics_0.36.0
[64] zip_2.1.1 boot_1.3-25 rlang_0.4.10
[67] pkgconfig_2.0.3 evaluate_0.14 Rhdf5lib_1.12.1
[70] labeling_0.4.2 tidyselect_1.1.0 plyr_1.8.6
[73] magrittr_2.0.1 R6_2.5.0 IRanges_2.24.1
[76] generics_0.1.0 DBI_1.1.1 foreign_0.8-80
[79] pillar_1.4.7 haven_2.3.1 whisker_0.4
[82] withr_2.4.1 mgcv_1.8-33 abind_1.4-5
[85] survival_3.2-7 modelr_0.1.8 crayon_1.4.1
[88] rmarkdown_2.6 progress_1.2.2 grid_4.0.3
[91] git2r_0.28.0 reprex_1.0.0 digest_0.6.27
[94] webshot_0.5.2 httpuv_1.5.5 numDeriv_2016.8-1.1
[97] stats4_4.0.3 munsell_0.5.0