Last updated: 2021-03-20

Checks: 7 0

Knit directory: delta-sift-polydfe/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20210313)

The command set.seed(20210313) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 7952a07

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 7952a07. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Unstaged changes:
    Deleted:    analysis/S1_Stat_models_deltaDIFT_DFEs.Rmd
    Modified:   analysis/_site.yml

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/S1_Stat_models_deltaSIFT_DFEs.Rmd) and HTML (docs/S1_Stat_models_deltaSIFT_DFEs.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	52a94dd	Thomas Bataillon	2021-03-20	Build site.
Rmd	3fe9d78	Thomas Bataillon	2021-03-20	Publish the initial files + S1 & S2 + readme for myproject
html	040ba4c	Thomas Bataillon	2021-03-13	Build site.
html	afad139	Thomas Bataillon	2021-03-13	Build site.
html	343029d	Thomas Bataillon	2021-03-13	Build site.
html	706a03c	Thomas Bataillon	2021-03-13	Build site.
Rmd	66965e5	Thomas Bataillon	2021-03-13	Publish the initial files for myproject

brief overview of analysis and updates :

We read in the data made available in the lastest version of the poyDFE outputs summary by Jun Chen.
We define a covariate \(\delta\) as the change in discretized SIFT scores
Rationale for conditioning DFE on \(\delta\) is to illustrate that change in SIFT scores are a powerful way to capture the expected effect of mutations and the fact that DFEs are quite different. There is a sharp divide between \(\delta \leq 0\) and \(\delta >0\)
we explore the amount of variation explained by \(\delta\) for
- pin/piS
- other DFE property: probability of sizable amount of beneficial mutations ( eg \(Prob[ p_b >0.1 ]\))

Reading the data

dfe_sift <- read.table("Data/summary_polyDFE_siftCate_new.txt",header=T)
dim(dfe_sift)

[1] 507  26

names(dfe_sift)

 [1] "species"     "group"       "category"    "cat04"       "from"       
 [6] "to"          "mutation"    "gradient"    "eps"         "Sd"         
[11] "beta"        "pb"          "Sb"          "alpha"       "PiS"        
[16] "PiN"         "PiNPiS"      "syn_counts"  "nsyn_counts" "Lsyn"       
[21] "Lnsyn"       "D1"          "D2"          "D3"          "D4"         
[26] "D5"

dfe_sift$delta <- dfe_sift$to - dfe_sift$from
head( dfe_sift) %>% kable(digits = 2)

species	group	category	from	to	mutation	gradient	eps	Sd	beta	pb	Sb	alpha	PiN	PiNPiS	syn_counts	nsyn_counts	Lsyn	Lnsyn	D1	D2	D3	D4	D5	delta
Ayon	tree	0:DEL-DEL	-2	-2	sdelet	0.00	0.32	-2282.42	1.91	0.43	16.08	1.00	0.00	0.75	14155	129	3659915	46187	0.00	0.00	0.00	0.05	0.52	0
Ayon	tree	0:DEL-EXDEL	-2	-3	delet	0.00	0.00	-7.12	2.33	0.00	0.23	0.00	0.00	0.29	14155	86	3659915	74160	0.00	0.31	0.69	0.00	0.00	-1
Ayon	tree	0:DEL-STABLE	-2	0	benefit	0.21	0.24	-0.65	0.01	0.50	95.49	0.99	0.01	10.12	14155	706	3659915	20133	0.50	0.00	0.00	0.00	0.00	2
Ayon	tree	0:DEL-TOL	-2	-1	benefit	0.00	0.21	-8.37	0.01	0.50	97.07	0.99	0.00	2.46	14155	210	3659915	23847	0.49	0.01	0.00	0.00	0.00	1
Ayon	tree	0:EXDEL-DEL	-3	-2	benefit	0.01	0.02	-26.45	0.01	0.50	90.72	0.99	0.01	6.03	14155	14	3659915	669	0.49	0.01	0.00	0.00	0.00	1
Ayon	tree	0:EXDEL-EXDEL	-3	-3	sdelet	0.00	0.30	-308.69	0.03	0.02	2.85	0.07	0.00	0.98	14155	6	3659915	1970	0.85	0.06	0.05	0.02	0.00	0

Filtering data :

Removing obs with too high \(\epsilon_{anc}\)

I n what follows I use the subset obtained with the condition filter(cat04==0) to generate df_tofit used throughout for fitting . NB Filtering on eps means 15% data obs removed but I think it is needed in order to avoid weird outliers

dim(dfe_sift)

[1] 507  27

species_low_nchr <- c("Qmango", "Shua", "Bnana")
dfe_sift <- dfe_sift %>%  
  filter(!(species %in% species_low_nchr)) %>%
  filter(eps <0.1)  # 
dim(dfe_sift)

[1] 387  27

dim(dfe_sift)

[1] 387  27

(507-387)/507 #data loss

[1] 0.2366864

dfe_sift %>% filter(cat04==0) -> df_tofit
dim(df_tofit)

[1] 336  27

length(unique(df_tofit$species))

[1] 23

Dependent variable : \(\frac{P_0}{P_4}\)

We use the corrected \(\frac{P_0}{P_4}\) analogous to \(\frac{\pi_N}{\pi_S}\)

We correct by adding +1 to every count: \(\pi_N =\frac{n_N + 1}{L_N + 1}\) and \(\pi_S =\frac{n_S + 1}{L_S + 1}\) we can then directly use a log scale

df_tofit$PiNPiScor <- (df_tofit$nsyn_counts+1)*(df_tofit$Lsyn+1)/((df_tofit$Lnsyn+1)*(df_tofit$syn_counts+1)) 

df_tofit %>%
  filter(cat04==0) %>%
  ggplot(aes(x=delta, y=PiNPiScor, weight=(1+nsyn_counts)) )+ 
  # geom_point(aes(size = nsyn_counts)) + 
  geom_jitter(width = 0.2, aes(size = nsyn_counts)) + 
  scale_y_log10() +
  # facet_wrap(~ group) + 
  geom_smooth(method = "loess", formula ="y~ x", se =T, aes(color=NULL))+
  geom_smooth(method = "lm", formula ="y~ x", se =T, aes(color=NULL), color="orange")+
  theme_minimal(base_size = 20)+
  xlab(expression(paste("Change in SIFT score, " , delta)))+
  ylab("corrected P0/P4")+
  scale_color_colorblind()+
  theme(legend.position = "none")+
  NULL

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

# ggsave("manuscript_GBE/manuscript_GBE_Jun2/figures_update_Feb2020/corrected_p0p4_delta.pdf", device = "pdf")

A (linear) fit for each species separately

df_tofit %>%
  filter(cat04==0) %>%
  ggplot(aes(x=delta, y=PiNPiScor, color=species, weight=sqrt((1+nsyn_counts))) )+ 
  # geom_point(aes(size = nsyn_counts)) + 
  geom_jitter(width = 0.2, aes(size = nsyn_counts)) + 
  scale_y_log10() +
  # facet_wrap(~ group) + 
  geom_smooth(method = "lm", formula ="y~ x", se =F)+
  theme_minimal(base_size = 20)+
  xlab(expression(paste("Change in SIFT score, " , delta)))+
  ylab("corrected P0/P4")+
  scale_color_viridis_d(direction = -1)+
  theme(legend.position = "none")+
  NULL

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

# ggsave("manuscript_GBE/manuscript_GBE_Jun2/figures_update_Feb2020/corrected_p0p4_delta_manyslopes.pdf",device = "pdf")

Linear (mixed) models

dim(df_tofit)

[1] 336  28

df_tofit$weights <- sqrt(df_tofit$nsyn_counts)

qplot(log10(df_tofit$PiNPiScor), bins = 25)

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

# ggplot(df_tofit, 
#        aes(x=delta, y = log10(PiNPiScor), 
#            size = )) + 
#   geom_jitter(width = 0.1)+ 
#   geom_smooth(method="lm", se =T)+ NULL
# 

naive0 <- lm(log(PiNPiScor) ~ 1 , 
            data = df_tofit)
naive <- lm(log(PiNPiScor) ~ delta , 
            data = df_tofit) 
summary(naive)


Call:
lm(formula = log(PiNPiScor) ~ delta, data = df_tofit)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0530 -0.6227 -0.0312  0.5849  3.8786 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.08764    0.05902  -1.485    0.139    
delta        1.05637    0.03635  29.065   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.081 on 334 degrees of freedom
Multiple R-squared:  0.7167,    Adjusted R-squared:  0.7158 
F-statistic: 844.8 on 1 and 334 DF,  p-value: < 2.2e-16

# plot(naive)

AIC(naive)

[1] 1010.021

mixed_1 <-lmer(formula = log(PiNPiScor) ~ delta +(1|species) , 
     data = df_tofit, REML =F )

merTools::modelInfo(mixed_1)

  n.obs n.lvls     AIC    sigma
1   336      1 1011.67 1.070842

lmerTest::ranova(mixed_1)

ANOVA-like table for random-effects: Single term deletions

Model:
log(PiNPiScor) ~ delta + (1 | species)
              npar  logLik    AIC     LRT Df Pr(>Chisq)
<none>           4 -501.84 1011.7                      
(1 | species)    3 -502.01 1010.0 0.35022  1      0.554

mixed_2 <-lmer(formula = log(PiNPiScor) ~  delta +(0 + delta |species), 
     data = df_tofit,  REML =F )

AIC(mixed_2)

[1] 971.8783

merTools::modelInfo(mixed_2)

  n.obs n.lvls      AIC    sigma
1   336      1 971.8783 0.963824

plotREsim(REsim(mixed_1)) # estimated random effects (intercepts) for each sp

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

summary(mixed_2)

Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
  method [lmerModLmerTest]
Formula: log(PiNPiScor) ~ delta + (0 + delta | species)
   Data: df_tofit

     AIC      BIC   logLik deviance df.resid 
   971.9    987.1   -481.9    963.9      332 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.6383 -0.5812 -0.0412  0.5825  2.9437 

Random effects:
 Groups   Name  Variance Std.Dev.
 species  delta 0.08705  0.2950  
 Residual       0.92896  0.9638  
Number of obs: 336, groups:  species, 23

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)  -0.08130    0.05268 313.74534  -1.543    0.124    
delta         1.06093    0.06956  23.23021  15.252 1.35e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
      (Intr)
delta 0.017

plotREsim(REsim(mixed_2)) # estimated random effects (slopes) for each sp relative to the common slope

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

Summary of fits

vec_aic = c(AIC(naive0), AIC(naive), AIC(mixed_1), AIC(mixed_2))
models_names = c("lm no slope", "lm slope", "random intercept", "random slope") 
df_table <- tibble(models_names, vec_aic)
knitr::kable(df_table, digits = 2, col.names = c("Models for P0/P4 ratio", "AIC"))

Models for P0/P4 ratio	AIC
lm no slope	1431.74
lm slope	1010.02
random intercept	1011.67
random slope	971.88

1 - (deviance(mixed_2) /(1011.67-1))

[1] 0.04629771

Tech check: influence of eps

naive0 <- lm(log(PiNPiScor) ~ eps , 
            data = df_tofit)
summary(naive0)


Call:
lm(formula = log(PiNPiScor) ~ eps, data = df_tofit)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.0381 -1.2568 -0.4835  0.8376  7.0708 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -0.2153     0.1270  -1.695    0.091 .
eps           6.0722     5.3377   1.138    0.256  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.027 on 334 degrees of freedom
Multiple R-squared:  0.00386,   Adjusted R-squared:  0.0008773 
F-statistic: 1.294 on 1 and 334 DF,  p-value: 0.2561

naive <- lm(log(PiNPiScor) ~ eps + delta , 
            data = df_tofit) 
summary(naive)


Call:
lm(formula = log(PiNPiScor) ~ eps + delta, data = df_tofit)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1126 -0.5786 -0.0008  0.5477  3.7524 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.005279   0.067501   0.078  0.93771    
eps         -7.866433   2.858775  -2.752  0.00625 ** 
delta        1.073022   0.036498  29.399  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.071 on 333 degrees of freedom
Multiple R-squared:  0.7229,    Adjusted R-squared:  0.7213 
F-statistic: 434.5 on 2 and 333 DF,  p-value: < 2.2e-16

anova(naive, test = "F")

Analysis of Variance Table

Response: log(PiNPiScor)
           Df Sum Sq Mean Sq  F value  Pr(>F)    
eps         1   5.32    5.32   4.6393 0.03197 *  
delta       1 990.98  990.98 864.3094 < 2e-16 ***
Residuals 333 381.80    1.15                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

car::vif(naive)

Registered S3 methods overwritten by 'car':
  method                          from
  influence.merMod                lme4
  cooks.distance.influence.merMod lme4
  dfbeta.influence.merMod         lme4
  dfbetas.influence.merMod        lme4

     eps    delta 
1.028283 1.028283

# plot(naive)

AIC(naive)

[1] 1004.466

mixed_1 <-lmer(formula = log(PiNPiScor) ~ eps + delta +(1|species) , 
     data = df_tofit, REML =F )

merTools::modelInfo(mixed_1)

  n.obs n.lvls      AIC    sigma
1   336      1 1005.487 1.053509

lmerTest::ranova(mixed_1)

ANOVA-like table for random-effects: Single term deletions

Model:
log(PiNPiScor) ~ eps + delta + (1 | species)
              npar  logLik    AIC     LRT Df Pr(>Chisq)
<none>           5 -497.74 1005.5                      
(1 | species)    4 -498.23 1004.5 0.97877  1     0.3225

mixed_2 <-lmer(formula = log(PiNPiScor) ~  eps + delta +(0 + delta |species), 
     data = df_tofit,  REML =F )

AIC(mixed_2)

[1] 964.0875

merTools::modelInfo(mixed_2)

  n.obs n.lvls      AIC     sigma
1   336      1 964.0875 0.9486668

plotREsim(REsim(mixed_1)) # estimated random effects (intercepts) for each sp

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

summary(mixed_2)

Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
  method [lmerModLmerTest]
Formula: log(PiNPiScor) ~ eps + delta + (0 + delta | species)
   Data: df_tofit

     AIC      BIC   logLik deviance df.resid 
   964.1    983.2   -477.0    954.1      331 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.7382 -0.5563 -0.0578  0.5964  3.2749 

Random effects:
 Groups   Name  Variance Std.Dev.
 species  delta 0.08844  0.2974  
 Residual       0.89997  0.9487  
Number of obs: 336, groups:  species, 23

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)   0.01738    0.06055 317.42835   0.287  0.77422    
eps          -8.29817    2.63066 324.65735  -3.154  0.00176 ** 
delta         1.07896    0.06999  23.48516  15.415 8.86e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
      (Intr) eps   
eps   -0.516       
delta  0.056 -0.081

plotREsim(REsim(mixed_2)) # estimated random effects (slopes) for each sp relative to the common slope

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

vec_aic = c(AIC(naive0), AIC(naive), AIC(mixed_1), AIC(mixed_2))
models_names = c("lm no slope", "lm slope", "random intercept", "random slope") 
df_table <- tibble(models_names, vec_aic)
knitr::kable(df_table, digits = 2, col.names = c("Models for P0/P4 ratio", "AIC"))

Models for P0/P4 ratio	AIC
lm no slope	1432.44
lm slope	1004.47
random intercept	1005.49
random slope	964.09

Dependent variable : \(p_b\)

Here if we want to analyse variation in \(p_b\), we probably need to discretize the data and go a logistic regression on the two categories

Overview

qplot(df_tofit$pb, bins = 30, fill = (df_tofit$pb>0.1)) + 
  theme_minimal(base_size = 15)

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

df_tofit <- mutate(df_tofit, benef_mut = ifelse(pb<0.1,  0, 1)) 
table(df_tofit$benef_mut) %>% kable(col.names = c("benef_mut", "n"))

benef_mut	n
0	194
1	142

  df_tofit %>%
  ggplot(aes(x=delta, y=benef_mut, color=species, weight=sqrt((1+nsyn_counts))) )+ 
  # geom_point(aes(size = nsyn_counts)) + 
  geom_jitter( width = 0.15, height = 0.05, aes(size = nsyn_counts)) + 
  geom_smooth(method = "glm",
              method.args=list(family="binomial"), se=T, aes(color="black"), size = 2) + 
  theme_minimal(base_size = 20)+
  xlab(expression(paste("Change in SIFT score, " , delta)))+
  ylab(expression(paste("Probability of " , p[b], " > 0.1")))+
  scale_color_viridis_d()+
  theme(legend.position = "none")+
  NULL

`geom_smooth()` using formula 'y ~ x'

Warning in eval(family$initialize): non-integer #successes in a binomial glm!

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

A loess fit for each species separately

Linear (mixed) models

# df_tofit<- df_tofit[-c(247),] # 1 mega outlier
dim(df_tofit)

[1] 336  30

df_tofit$weights <- sqrt(df_tofit$nsyn_counts)


naive0 <- glm(benef_mut  ~ 1 , family = "binomial",
            data = df_tofit) 
summary(naive0)


Call:
glm(formula = benef_mut ~ 1, family = "binomial", data = df_tofit)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.048  -1.048  -1.048   1.312   1.312  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -0.3120     0.1104  -2.825  0.00472 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 457.71  on 335  degrees of freedom
Residual deviance: 457.71  on 335  degrees of freedom
AIC: 459.71

Number of Fisher Scoring iterations: 4

naive0$deviance

[1] 457.7148

naive <- glm(benef_mut  ~ delta , family = "binomial",
            data = df_tofit) 
summary(naive)


Call:
glm(formula = benef_mut ~ delta, family = "binomial", data = df_tofit)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6586  -0.5524  -0.2926   0.8434   2.9893  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.4760     0.1519  -3.133  0.00173 ** 
delta         1.3268     0.1418   9.355  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 457.71  on 335  degrees of freedom
Residual deviance: 280.73  on 334  degrees of freedom
AIC: 284.73

Number of Fisher Scoring iterations: 5

anova(naive0, naive, test="Chisq")

Analysis of Deviance Table

Model 1: benef_mut ~ 1
Model 2: benef_mut ~ delta
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       335     457.71                          
2       334     280.73  1   176.99 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(naive)

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

1 - (naive$deviance/naive0$deviance) # pseudo R^2

[1] 0.3866719

AIC(naive)

[1] 284.7294

mixed_1 <-glmer(benef_mut ~ delta +(1|species) , family = "binomial",
     data = df_tofit )

AIC(mixed_1)

[1] 281.085

summary(mixed_1)

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: benef_mut ~ delta + (1 | species)
   Data: df_tofit

     AIC      BIC   logLik deviance df.resid 
   281.1    292.5   -137.5    275.1      333 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.2346 -0.3862 -0.1435  0.3912  6.8791 

Random effects:
 Groups  Name        Variance Std.Dev.
 species (Intercept) 0.4739   0.6884  
Number of obs: 336, groups:  species, 23

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.5253     0.2166  -2.425   0.0153 *  
delta         1.4335     0.1589   9.022   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
      (Intr)
delta -0.175

merTools::modelInfo(mixed_1)

  n.obs n.lvls     AIC sigma
1   336      1 281.085     1

mixed_2 <-glmer(formula = benef_mut ~  delta +(0 + delta |species), family = "binomial",
     data = df_tofit)

AIC(mixed_2)

[1] 262.415

summary(mixed_2)

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: benef_mut ~ delta + (0 + delta | species)
   Data: df_tofit

     AIC      BIC   logLik deviance df.resid 
   262.4    273.9   -128.2    256.4      333 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.6050 -0.3534 -0.0513  0.3376  3.5818 

Random effects:
 Groups  Name  Variance Std.Dev.
 species delta 1.315    1.147   
Number of obs: 336, groups:  species, 23

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.5321     0.1656  -3.214  0.00131 ** 
delta         2.0944     0.4051   5.169 2.35e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
      (Intr)
delta -0.130

merTools::modelInfo(mixed_2)

  n.obs n.lvls     AIC sigma
1   336      1 262.415     1

plotREsim(REsim(mixed_1)) # estimated random effects (intercepts) for each sp

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

plotREsim(REsim(mixed_2)) # estimated random effects (slopes) for each sp relative to the common slope

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

Summary of fits

vec_aic = c(AIC(naive0), AIC(naive), AIC(mixed_1), AIC(mixed_2))
models_names = c("glm no slope", "glm with slope", "glm with random intercept", "glm with random slope") 
df_table <- tibble(models_names, vec_aic)
knitr::kable(df_table, digits = 2, col.names = c("Models for proportion beneficial ", "AIC"))

Models for proportion beneficial	AIC
glm no slope	459.71
glm with slope	284.73
glm with random intercept	281.08
glm with random slope	262.41

Some pseudo R^2 to compare with the linear models on the piNpiS ratio Using the deviance of mixed_2 and comparing it to the most naive model (glm no slope) aka naive0 yields a pseudo R^2:

 1 -(deviance(mixed_2)/naive0$deviance)

[1] 0.516144

tech check with eps included in the model

naive0 <- glm(benef_mut  ~ eps , family = "binomial",
            data = df_tofit) 
summary(naive0)


Call:
glm(formula = benef_mut ~ eps, family = "binomial", data = df_tofit)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7952  -0.9635  -0.9568   1.3507   1.4153  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.5438     0.1298  -4.191 2.78e-05 ***
eps          19.7114     5.7119   3.451 0.000559 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 457.71  on 335  degrees of freedom
Residual deviance: 444.68  on 334  degrees of freedom
AIC: 448.68

Number of Fisher Scoring iterations: 4

naive0$deviance

[1] 444.6848

naive <- glm(benef_mut  ~ eps + delta , family = "binomial",
            data = df_tofit) 
summary(naive)


Call:
glm(formula = benef_mut ~ eps + delta, family = "binomial", data = df_tofit)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5701  -0.5137  -0.2721   0.6350   2.9211  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.6695     0.1854  -3.611 0.000306 ***
eps          12.6527     6.7851   1.865 0.062214 .  
delta         1.3042     0.1418   9.196  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 457.71  on 335  degrees of freedom
Residual deviance: 277.12  on 333  degrees of freedom
AIC: 283.12

Number of Fisher Scoring iterations: 5

anova(naive0, naive, test="Chisq")

Analysis of Deviance Table

Model 1: benef_mut ~ eps
Model 2: benef_mut ~ eps + delta
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       334     444.68                          
2       333     277.12  1   167.56 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

1 - (naive$deviance/naive0$deviance) # pseudo R^2

[1] 0.3768174

AIC(naive)

[1] 283.1198

mixed_1 <-glmer(benef_mut ~ eps + delta +(1|species) , family = "binomial",
     data = df_tofit )

AIC(mixed_1)

[1] 278.2478

summary(mixed_1)

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: benef_mut ~ eps + delta + (1 | species)
   Data: df_tofit

     AIC      BIC   logLik deviance df.resid 
   278.2    293.5   -135.1    270.2      332 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.9130 -0.3566 -0.1126  0.4241  5.7448 

Random effects:
 Groups  Name        Variance Std.Dev.
 species (Intercept) 0.5647   0.7515  
Number of obs: 336, groups:  species, 23

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.7972     0.2639  -3.021  0.00252 ** 
eps          16.6332     7.7992   2.133  0.03295 *  
delta         1.4224     0.1604   8.868  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
      (Intr) eps   
eps   -0.493       
delta -0.210  0.063

merTools::modelInfo(mixed_1)

  n.obs n.lvls      AIC sigma
1   336      1 278.2478     1

mixed_2 <-glmer(formula = benef_mut ~ eps +  delta +(0 + delta |species), family = "binomial",
     data = df_tofit)

AIC(mixed_2)

[1] 263.2358

summary(mixed_2)

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: benef_mut ~ eps + delta + (0 + delta | species)
   Data: df_tofit

     AIC      BIC   logLik deviance df.resid 
   263.2    278.5   -127.6    255.2      332 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.4725 -0.3508 -0.0542  0.3461  3.8218 

Random effects:
 Groups  Name  Variance Std.Dev.
 species delta 1.219    1.104   
Number of obs: 336, groups:  species, 23

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.6571     0.2032  -3.234  0.00122 ** 
eps           8.1923     7.5323   1.088  0.27676    
delta         2.0437     0.3969   5.149 2.62e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
      (Intr) eps   
eps   -0.573       
delta -0.064 -0.089

merTools::modelInfo(mixed_2)

  n.obs n.lvls      AIC sigma
1   336      1 263.2358     1

plotREsim(REsim(mixed_1)) # estimated random effects (intercepts) for each sp

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

plotREsim(REsim(mixed_2)) # estimated random effects (slopes) for each sp relative to the common slope

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

car::vif(naive)

     eps    delta 
1.001045 1.001045

 df_tofit %>%
  ggplot(aes(x=eps, y=benef_mut, color=species))+ 
  # geom_point(aes(size = nsyn_counts)) + 
  geom_jitter( width = 0.0025, height = 0.05, aes(size = nsyn_counts)) + 
  geom_smooth(method = "glm",
              method.args=list(family="binomial"), se=T, aes(color=NULL), color="black") + 
  theme_minimal(base_size = 15)+
  xlab("eps")+
  ylab("Probability of p_b >0.1 ")+
  scale_color_viridis_d()+
  theme(legend.position = "none")+
  NULL

`geom_smooth()` using formula 'y ~ x'

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

car::vif(naive)

     eps    delta 
1.001045 1.001045

 df_tofit %>%
  ggplot(aes(x=delta, y=benef_mut, color=species))+ 
  # geom_point(aes(size = nsyn_counts)) + 
  geom_jitter( width = 0.0025, height = 0.05, aes(size = nsyn_counts)) + 
  geom_smooth(method = "glm",
              method.args=list(family="binomial"), se=T, aes(color=NULL), color="black") + 
  theme_minimal(base_size = 15)+
  xlab(expression(paste("Change in SIFT score, " , delta)))+
  ylab("Probability of p_b >0.1 ")+
  scale_color_viridis_d()+
  theme(legend.position = "none")+
  NULL

`geom_smooth()` using formula 'y ~ x'

Version	Author	Date
706a03c	Thomas Bataillon	2021-03-13

 # updated pseudo R2
  1 -(deviance(mixed_2)/naive0$deviance)

[1] 0.5023703

sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lmerTest_3.1-3  merTools_0.5.2  arm_1.11-2      MASS_7.3-51.6  
 [5] lme4_1.1-23     Matrix_1.2-18   knitr_1.30      magrittr_1.5   
 [9] dplyr_1.0.2     ggthemes_4.2.0  ggplot2_3.3.2   workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] nlme_3.1-148        fs_1.5.0            RColorBrewer_1.1-2 
 [4] rprojroot_1.3-2     numDeriv_2016.8-1.1 tools_4.0.2        
 [7] TMB_1.7.18          backports_1.1.8     R6_2.4.1           
[10] rpart_4.1-15        Hmisc_4.4-2         mgcv_1.8-31        
[13] colorspace_1.4-1    nnet_7.3-14         withr_2.2.0        
[16] tidyselect_1.1.0    gridExtra_2.3       curl_4.3           
[19] compiler_4.0.2      git2r_0.27.1        htmlTable_2.1.0    
[22] labeling_0.3        scales_1.1.1        checkmate_2.0.0    
[25] mvtnorm_1.1-1       blme_1.0-5          stringr_1.4.0      
[28] digest_0.6.25       foreign_0.8-80      minqa_1.2.4        
[31] rmarkdown_2.3       rio_0.5.16          base64enc_0.1-3    
[34] jpeg_0.1-8.1        pkgconfig_2.0.3     htmltools_0.5.0    
[37] fastmap_1.0.1       highr_0.8           readxl_1.3.1       
[40] htmlwidgets_1.5.2   rlang_0.4.7         rstudioapi_0.11    
[43] shiny_1.5.0         generics_0.0.2      farver_2.0.3       
[46] broom.mixed_0.2.6   zip_2.1.1           car_3.0-9          
[49] Formula_1.2-4       Rcpp_1.0.5          munsell_0.5.0      
[52] abind_1.4-5         lifecycle_0.2.0     stringi_1.4.6      
[55] whisker_0.4         yaml_2.2.1          carData_3.0-4      
[58] plyr_1.8.6          grid_4.0.2          promises_1.1.1     
[61] forcats_0.5.0       crayon_1.3.4        lattice_0.20-41    
[64] haven_2.3.1         splines_4.0.2       hms_0.5.3          
[67] pillar_1.4.6        boot_1.3-25         reshape2_1.4.4     
[70] codetools_0.2-16    glue_1.4.1          evaluate_0.14      
[73] latticeExtra_0.6-29 data.table_1.13.0   png_0.1-7          
[76] vctrs_0.3.2         nloptr_1.2.2.2      httpuv_1.5.4       
[79] foreach_1.5.1       cellranger_1.1.0    gtable_0.3.0       
[82] purrr_0.3.4         tidyr_1.1.1         openxlsx_4.2.2     
[85] xfun_0.19           mime_0.9            xtable_1.8-4       
[88] broom_0.7.0         coda_0.19-4         later_1.1.0.1      
[91] survival_3.1-12     viridisLite_0.3.0   tibble_3.0.3       
[94] iterators_1.0.13    cluster_2.1.0       statmod_1.4.34     
[97] ellipsis_0.3.1

Supplementary text for GBE ms GBE-210319

Hunting for beneficial mutations

J Chen, T. Bataillon, S Glemin, M. Lascoux

Last update: 2021-03-20

brief overview of analysis and updates :

Reading the data

Filtering data :

Dependent variable : \(\frac{P_0}{P_4}\)

A (linear) fit for each species separately

Linear (mixed) models

Summary of fits

Tech check: influence of eps

Dependent variable : \(p_b\)

Overview

A loess fit for each species separately

Linear (mixed) models

Summary of fits

tech check with eps included in the model