Testing resistance to pesticides

# Bayesian and frequentist approaches to binomial dose responses in R

For a given species, a simple mortality response to environmental conditions can represented with the probabilistic outcome (death), which occurs with probabilty $$p$$. This simple process is know as a Bernoulli random variable.

A motivating example is how a pest responds to increasing doses of a pesticide. Invertebrate pests cause 10-20% of yield losses in modern food systems. While cultural practices such as crop rotatation and biological control through beneficial insects increasingly form a core component of effective and sustainable management, pesticides remain a widely used tool. The following data set from Maino et al. (2018) shows the mortality response of two populations to the pesticide omethoate.

library(tidyverse)
library(rethinking)
source('data/useful_scripts/mydarktheme.R')
options(mc.cores = parallel::detectCores() - 1)
rstan_options(auto_write = TRUE)
## # A tibble: 14 x 6
## # Groups:   treatment, pop [2]
##    treatment pop           dose alive  dead total
##    <chr>     <chr>        <dbl> <int> <int> <int>
##  1 omethoate control      0.007    52     0    52
##  2 omethoate control      0.07     49     0    49
##  3 omethoate control      0.7      53     3    56
##  4 omethoate control      7        36    13    49
##  5 omethoate control     70         0    51    51
##  6 omethoate control    700         0    48    48
##  7 omethoate control   7000         0    51    51
##  8 omethoate resistant    0.007    49     0    49
##  9 omethoate resistant    0.07     47     0    47
## 10 omethoate resistant    0.7      50     0    50
## 11 omethoate resistant    7        51     0    51
## 12 omethoate resistant   70         9    37    46
## 13 omethoate resistant  700         4    45    49
## 14 omethoate resistant 7000         0    52    52

The Bernoulli mortality probability $$p$$ changes with the set of conditions represented by the design matrix $$X$$, where rows are observations and columns are covariates. It is convienient to assume $$p$$ follows the relationship:

$p = \frac{1}{1 + e^{-X\boldsymbol{b}}}$

where $$b$$ is a vector of coefficients. In this equation, $$p$$ is constrained to values between 0 and 1. The $$Xb$$ component is just a linear equation of the form $$b_1x_1 + b_2x_2 + ... +b_nx_n$$ which can be solved through rearrangement:

$\ln \frac{p}{1 - p} = logit(p) = X\boldsymbol{b}$

With the logit link function and the linear equation specifying the relationship of binomial parameter $$p$$ with its covariates, we now have the ingredients for a general linear model. For $$X$$ we allow a unique intercept and dose slope for each population, which is specified usign the following formula notation.

m1 = glm(cbind(dead, alive) ~ pop/log(dose) - 1, family = binomial('logit'),
data = d )

We can print the model matrix can be observed using using the model.matrix function.

model.matrix(m1)
##    popcontrol popresistant popcontrol:log(dose) popresistant:log(dose)
## 1           1            0           -4.9618451              0.0000000
## 2           1            0           -2.6592600              0.0000000
## 3           1            0           -0.3566749              0.0000000
## 4           1            0            1.9459101              0.0000000
## 5           1            0            4.2484952              0.0000000
## 6           1            0            6.5510803              0.0000000
## 7           1            0            8.8536654              0.0000000
## 8           0            1            0.0000000             -4.9618451
## 9           0            1            0.0000000             -2.6592600
## 10          0            1            0.0000000             -0.3566749
## 11          0            1            0.0000000              1.9459101
## 12          0            1            0.0000000              4.2484952
## 13          0            1            0.0000000              6.5510803
## 14          0            1            0.0000000              8.8536654
## attr(,"assign")
## [1] 1 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$pop ## [1] "contr.treatment" The summary output tells us that one population called “resistant” appears to have a higher intercept. summary(m1) ## ## Call: ## glm(formula = cbind(dead, alive) ~ pop/log(dose) - 1, family = binomial("logit"), ## data = d) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.66626 -0.37482 -0.02094 0.29396 2.08559 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## popcontrol -3.7440 0.6145 -6.093 1.11e-09 *** ## popresistant -5.7053 0.9141 -6.242 4.33e-10 *** ## popcontrol:log(dose) 1.6188 0.2548 6.354 2.09e-10 *** ## popresistant:log(dose) 1.5064 0.2253 6.686 2.29e-11 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 817.127 on 14 degrees of freedom ## Residual deviance: 28.145 on 10 degrees of freedom ## AIC: 50.208 ## ## Number of Fisher Scoring iterations: 7 This can be confirmed visually by plotting the model with 95% prediction intervals. m1pred = expand.grid( pop = c('control','resistant'), dose = 10^seq(-2, 4, length = 100) ) m1pred$logdose = log(m1pred$dose) pred = predict(m1, newdata = m1pred, type = 'response' , se.fit = TRUE) m1pred$p = pred$fit m1pred$se = pred$se.fit p_freq = ggplot() + geom_point(data = d, aes(dose, dead/total, colour = pop)) + geom_line(data = m1pred, aes(dose, p, colour = pop)) + geom_ribbon(data = m1pred, aes(dose, ymin = p - 1.96*se, max = p + 1.96*se, fill = pop), alpha = 0.5) + mydarktheme + xlab('omethoate (mg/L)') + ggtitle('Frequentist') + scale_x_log10() p_freq We can also rearrange the data into a Bernoulli format (each row is an individual trial) using the following1. db = d %>% rowwise() %>% mutate(outcome = list(rep(c(0, 1), c(alive, dead)))) %>% dplyr::select(-alive,-dead, -total) %>% unnest() head(db) ## # A tibble: 6 x 4 ## treatment pop dose outcome ## <chr> <chr> <dbl> <dbl> ## 1 omethoate control 0.007 0 ## 2 omethoate control 0.007 0 ## 3 omethoate control 0.007 0 ## 4 omethoate control 0.007 0 ## 5 omethoate control 0.007 0 ## 6 omethoate control 0.007 0 The model coefficients are the same despite our trial size changing to n = 1. m1 = glm(outcome ~ pop/log(dose) - 1, family = binomial('logit'), data = db ) summary(m1) ## ## Call: ## glm(formula = outcome ~ pop/log(dose) - 1, family = binomial("logit"), ## data = db) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.89083 -0.06234 -0.00392 0.04577 2.94434 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## popcontrol -3.7440 0.6145 -6.093 1.11e-09 *** ## popresistant -5.7053 0.9141 -6.242 4.33e-10 *** ## popcontrol:log(dose) 1.6188 0.2547 6.355 2.09e-10 *** ## popresistant:log(dose) 1.5064 0.2253 6.686 2.29e-11 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 970.41 on 700 degrees of freedom ## Residual deviance: 181.42 on 696 degrees of freedom ## AIC: 189.42 ## ## Number of Fisher Scoring iterations: 8 Now lets try fitting the same model using a Bayesian approach. I am using the rethinking package by Richard McElreath, whose book Statistical Rethinking I highly recommend. From the plots below, the fits are nearly indistinguishable. db_stan = db %>% mutate( logdose = log(dose), pop = ifelse(pop == 'control', 1, 2)) %>% as.data.frame m2 <- rethinking::map( alist( outcome ~ dbinom( 1 , p ) , logit(p) <- a[pop] + b[pop]*logdose , a[pop] ~ dnorm(0, 10), b[pop] ~ dnorm(0, 10) ), data = db_stan, start = list(a = c(-3, -5), b = c(1.4, 1.6)) #, chains=2 , iter=2500 , warmup=500 ) precis(m2, depth = 2) ## Mean StdDev 5.5% 94.5% ## a[1] -3.73 0.61 -4.70 -2.75 ## a[2] -5.66 0.90 -7.09 -4.22 ## b[1] 1.61 0.25 1.21 2.02 ## b[2] 1.49 0.22 1.14 1.85 m2pred = expand.grid( pop = 1:2, logdose = log(10^seq(-2, 4, length = 100)) ) mu <- link( m2 , data=m2pred) ## [ 100 / 1000 ] [ 200 / 1000 ] [ 300 / 1000 ] [ 400 / 1000 ] [ 500 / 1000 ] [ 600 / 1000 ] [ 700 / 1000 ] [ 800 / 1000 ] [ 900 / 1000 ] [ 1000 / 1000 ] m2pred$mu.mean <- apply( mu , 2 , mean )
m2pred[,c('PIL', 'PIH')] <- t(apply( mu , 2 , PI , prob = 0.95))
m2pred$pop = ifelse(m2pred$pop == 1, 'control', 'resistant')

p_bayes = ggplot(m2pred, aes(x = exp(logdose))) +
geom_line(aes(y = mu.mean, colour = pop)) +
geom_ribbon(aes(ymin = PIL, ymax = PIH , fill = pop), alpha = 0.5) +
geom_point(data = d, aes(dose, dead/total, colour = pop)) +
scale_x_log10() +
xlab('omethoate (mg/L)') +
ggtitle('Bayesian') +
mydarktheme
cowplot::plot_grid(p_freq, p_bayes, ncol = 2)

The dose at which some proportion of mortality occurs (e.g. 50%) has classically been a common summary statistic as it can be convienient to talk about mortality in terms of doses rather than model coefficients. This can be computed algebraically from the regression equation for each population as $$logit(y) = a + b x_{dose}$$, which can be solved as $$x_{dose} = (log(\frac{y}{1-y}) - a)/b$$. The lethal dose required for 50% mortality occurs when $$y = 0.5$$ or $$x_{LD50} = -a/b$$.

-coef(m1)[1]/coef(m1)[2]
## popcontrol
## -0.6562276

This is a compound parameter of two model parameters that each have their own variance and co-variance and it possible to estimate the distribution of this quantity analytically, but not straightforward (at least not for me). The delta method can be used to estimate variance of $$x_{LD50}$$ by assuming the quantity is multivariate normally distributed. The dose.p function in the MASS package can do this for us using the following code.

library(MASS)
# we can simply run dose.p(m1, cf = c(1,3) or more verbosely
cf = c(1, 3) # The terms in the coefficient vector giving the intercept and coefficient of (log-)dose
p = 0.5 # Probabilities at which to predict the dose needed.
eta <- family(m1)$linkfun(p) b <- coef(m1)[cf] x.p <- (eta - b[1L])/b[2L] names(x.p) <- paste("p = ", format(p), ":", sep = "") pd <- -cbind(1, x.p)/b[2L] SE <- sqrt(((pd %*% vcov(m1)[cf, cf]) * pd) %*% c(1, 1)) # LD50 exp(c(x.p - 1.96 * SE, x.p + 1.96 * SE)) ## [1] 7.294287 13.992988 I find the solution to this problem using the Bayesian framework more intuitive. The covariance structure of the model paramaters is captured in the posterior distribution, so we can sample model parameters from the posterior distribution and then do some algebra solving for $$x_{LD50}$$ as before and plot for the control population post = extract.samples(m2, n = 1e5) logx50 = -post$a[,1]/post$b[,1] exp(mean(logx50)) ## [1] 10.13417 exp(HPDI(logx50, prob = 0.95)) ## |0.95 0.95| ## 7.202363 14.310366 In general, solving for the distribution of arbitrary compound parameters is much more straightforward in a Bayesian context. But perhaps the most compelling case for a Bayesian approach with dose responses is when it comes to estimating whether or not the LC50s of both populations are different (and then how different the responses are expressed as a resistance factor). The implications of the covariance structure between the four model paramaters hurts my poor brain. You see many papers getting around this issue by somewhat informally inspecting the overlap of confidence intervals, which is likely to be overconservative. The resistance factor is usaully provided as the ratio of the LC50 values, which usually lacks a confidence interval. # LC50 using maximum liklihood # control con = dose.p(m1, c(1, 3), p = 0.5) exp(con) ## Dose SE ## p = 0.5: 10.10291 0.16619 exp(c(con - 1.96 * attr(con, 'SE'), con + 1.96 * attr(con, 'SE'))) ## [1] 7.294287 13.992988 # resistant res = dose.p(m1, c(2, 4), p = 0.5) res ## Dose SE ## p = 0.5: 3.787473 0.1785665 exp(c(res - 1.96 * attr(con, 'SE'), res + 1.96 * attr(con, 'SE'))) ## [1] 31.87240 61.14238 # resistance factor exp(res[1])/exp(con[1]) ## p = 0.5: ## 4.369501 I personally circumvent the issue by performing a log-liklihood ratio test on the full model with simplified models that assume universal slope, or intercept parameters, but this approach is a little less direct than ideal. m1_full = glm(cbind(dead, alive) ~ pop*log(dose), family = binomial('logit'), data = d ) m1_no_pop = glm(cbind(dead, alive) ~ log(dose), family = binomial('logit'), data = d ) anova(m1_full, m1_no_pop, test = 'LRT') ## Analysis of Deviance Table ## ## Model 1: cbind(dead, alive) ~ pop * log(dose) ## Model 2: cbind(dead, alive) ~ log(dose) ## Resid. Df Resid. Dev Df Deviance Pr(>Chi) ## 1 10 28.145 ## 2 12 59.612 -2 -31.468 1.469e-07 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 But the issue of confidence bounds around the resistance factor remains. Sampling from our posterior distribution makes this all look like a walk in the park. # control LC50 exp(mean(-post$a[,1]/post$b[,1])) ## [1] 10.13417 # resistant LC50 exp(mean(-post$a[,2]/post$b[,2])) ## [1] 43.87604 # resistance factor with 95% credible intervals x50_diff = exp(-post$a[,2]/post$b[,2] - -post$a[,1]/post\$b[,1])
mean(x50_diff)
## [1] 4.472003
HPDI(x50_diff, prob = 0.95)
##    |0.95    0.95|
## 2.441506 6.813392

[1] Upon reading a bit more I found out that stan can indeed handle aggregated binomial data. So we could also do this:

d_stan = d %>%
ungroup %>%
mutate(
pop = if_else(pop == 'control', 1, 2),
logdose = log(dose)
) %>%
as.data.frame
m2.1 <- rethinking::map(
alist(
dead ~ dbinom(total, p ) ,
logit(p) <- a[pop] + b[pop]*logdose ,
a[pop]  ~ dnorm(0, 10),
b[pop]  ~ dnorm(0, 10)
),
data = d_stan, start = list(a = c(-3, -5), b = c(1.4, 1.6))
#, chains=2 , iter=2500 , warmup=500
)

1. 1