Testing differences in dose responses in two populations

Last updated: 28 November 2018

Dose response curves are used in a variety of situations where the impact of exposure, dose, or concentration on some measurement endpoint is of interest.

Dose responses in ecology

For a given species, a simple mortality response to environmental conditions can be represented with the probabilistic outcome $y \in \{0, 1\}$ (death), which occurs with probability $p$. This simple process is known as a Bernoulli random variable.
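
To make this concrete, here is a minimal sketch (the probability value is made up) simulating Bernoulli mortality outcomes in R with rbinom:

set.seed(1)
p = 0.3                                       # hypothetical mortality probability
outcome = rbinom(n = 50, size = 1, prob = p)  # one draw per individual: 1 = dead, 0 = alive
mean(outcome)                                 # observed mortality, approaches p as n grows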

A motivating example is how a pest responds to increasing doses of a pesticide. Invertebrate pests cause 10-20% of yield losses in modern food systems. While cultural practices such as crop rotation and biological control through beneficial insects increasingly form a core component of effective and sustainable management, pesticides remain a widely used tool. The following data set from Maino et al. (2018) shows the mortality response of two populations to the pesticide omethoate.

library(tidyverse)
library(rstan)
d = read_csv('data/DoseResponseMaino2018.csv', col_types = 'ccniii') %>%
  mutate(dead = dead + incapacitated) %>%
  dplyr::select(-incapacitated) %>%
  group_by(treatment, pop, dose) %>%
  summarise_all(sum) %>%
  mutate(total = dead + alive) %>%
  filter(dose != 0)
> d                                                                           
# A tibble: 14 × 6
# Groups:   treatment, pop [2]
   treatment pop           dose alive  dead total
   <chr>     <chr>        <dbl> <int> <int> <int>
 1 omethoate control      0.007    52     0    52
 2 omethoate control      0.07     49     0    49
 3 omethoate control      0.7      53     3    56
 4 omethoate control      7        36    13    49
 5 omethoate control     70         0    51    51
 6 omethoate control    700         0    48    48
 7 omethoate control   7000         0    51    51
 8 omethoate resistant    0.007    49     0    49
 9 omethoate resistant    0.07     47     0    47
10 omethoate resistant    0.7      50     0    50
11 omethoate resistant    7        51     0    51
12 omethoate resistant   70         9    37    46
13 omethoate resistant  700         4    45    49
14 omethoate resistant 7000         0    52    52

Logistic regression and the Bernoulli probability distribution

The Bernoulli mortality probability $p$ changes with the set of conditions represented by the design matrix $X$, where rows are observations and columns are covariates. It is convenient to assume $p$ follows the relationship:

$$\mathrm{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = X\beta$$

where $\beta$ is a vector of coefficients. In this equation, $p$ is constrained to values between 0 and 1. The component $X\beta$ is just a linear equation of the form $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$, which can be solved for $p$ through rearrangement:

$$p = \frac{1}{1 + e^{-X\beta}}$$

With the logit link function and the linear equation specifying the relationship of the binomial parameter $p$ with its covariates, we now have the ingredients for a generalised linear model. For $X\beta$ we allow a unique intercept and dose slope for each population, which is specified using the following formula notation.
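
As a quick numerical check of the link function (a small sketch using base R, where qlogis is the logit and plogis its inverse):

qlogis(c(0.1, 0.5, 0.9))           # logit maps (0, 1) onto the whole real line: -2.20, 0, 2.20
plogis(qlogis(c(0.1, 0.5, 0.9)))   # the inverse logit maps back: 0.1, 0.5, 0.9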

m1 = glm(cbind(dead, alive) ~ pop/log(dose) - 1, family = binomial('logit'), 
         data = d )

The model matrix can be inspected using the model.matrix function.

> model.matrix(m1)
   popcontrol popresistant popcontrol:log(dose) popresistant:log(dose)
1           1            0           -4.9618451              0.0000000
2           1            0           -2.6592600              0.0000000
3           1            0           -0.3566749              0.0000000
4           1            0            1.9459101              0.0000000
5           1            0            4.2484952              0.0000000
6           1            0            6.5510803              0.0000000
7           1            0            8.8536654              0.0000000
8           0            1            0.0000000             -4.9618451
9           0            1            0.0000000             -2.6592600
10          0            1            0.0000000             -0.3566749
11          0            1            0.0000000              1.9459101
12          0            1            0.0000000              4.2484952
13          0            1            0.0000000              6.5510803
14          0            1            0.0000000              8.8536654

The summary output tells us that the population called "resistant" has a lower intercept than the control (with similar dose slopes), shifting its mortality response towards higher doses.

> summary(m1)

Call:
glm(formula = cbind(dead, alive) ~ pop/log(dose) - 1, family = binomial("logit"), 
    data = d)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.66626  -0.37482  -0.02094   0.29396   2.08559  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
popcontrol              -3.7440     0.6145  -6.093 1.11e-09 ***
popresistant            -5.7053     0.9141  -6.242 4.33e-10 ***
popcontrol:log(dose)     1.6188     0.2548   6.354 2.09e-10 ***
popresistant:log(dose)   1.5064     0.2253   6.686 2.29e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 817.127  on 14  degrees of freedom
Residual deviance:  28.145  on 10  degrees of freedom
AIC: 50.208

Number of Fisher Scoring iterations: 7

This can be confirmed visually by plotting the model with 95% confidence intervals.

m1pred = expand.grid(
  pop = c('control','resistant'),
  dose = 10^seq(-2, 4, length = 100)
)
m1pred$logdose = log(m1pred$dose)
pred = predict(m1, newdata = m1pred, type = 'response' , se.fit = TRUE)
m1pred$p = pred$fit
m1pred$se = pred$se.fit

ggplot() + 
  geom_point(data = d, aes(dose, dead/total, colour = pop)) +
  geom_line(data = m1pred, aes(dose, p, colour = pop)) +
  geom_ribbon(data = m1pred, 
              aes(dose, ymin = p - 1.96*se, ymax = p + 1.96*se, fill = pop), 
              alpha = 0.5) +
  mydarktheme +
  xlab('omethoate (mg/L)') +
  ggtitle('Frequentist') +
  scale_x_log10()

Dose response curve using glm and logit link in R
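
A handy summary of the shift between the two populations is the dose at which predicted mortality is 50% (the LD50). A minimal sketch, obtained by setting the linear predictor to zero and rearranging, so LD50 = exp(-intercept/slope):

cf = coef(m1)
ld50_control   = exp(-cf['popcontrol'] / cf['popcontrol:log(dose)'])      # roughly 10 mg/L
ld50_resistant = exp(-cf['popresistant'] / cf['popresistant:log(dose)'])  # roughly 44 mg/L
ld50_resistant / ld50_control   # resistance ratio, about 4-5 fold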

We can also rearrange the data into a Bernoulli format (each row is an individual trial) using the following.

db = d %>%
  rowwise() %>%
  mutate(outcome = list(rep(c(0, 1), c(alive, dead)))) %>%
  dplyr::select(-alive,-dead, -total) %>%
  unnest(cols = outcome)

> head(db)
# A tibble: 6 × 4
# Groups:   treatment, pop [1]
  treatment pop      dose outcome
  <chr>     <chr>   <dbl>   <dbl>
1 omethoate control 0.007       0
2 omethoate control 0.007       0
3 omethoate control 0.007       0
4 omethoate control 0.007       0
5 omethoate control 0.007       0
6 omethoate control 0.007       0
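
As a quick sanity check, collapsing the long format back down should reproduce the original counts in d:

db %>%
  group_by(treatment, pop, dose) %>%
  summarise(dead = sum(outcome), total = n())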

The model coefficients and standard errors are the same despite our trial size changing to 1 (one individual per row, 700 rows in total).

m1 = glm(outcome ~ pop/log(dose) - 1, family = binomial('logit'), 
         data = db )
> summary(m1)

Call:
glm(formula = outcome ~ pop/log(dose) - 1, family = binomial("logit"), 
    data = db)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.89083  -0.06234  -0.00392   0.04577   2.94434  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
popcontrol              -3.7440     0.6145  -6.093 1.11e-09 ***
popresistant            -5.7053     0.9141  -6.242 4.33e-10 ***
popcontrol:log(dose)     1.6188     0.2547   6.355 2.09e-10 ***
popresistant:log(dose)   1.5064     0.2253   6.686 2.29e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 970.41  on 700  degrees of freedom
Residual deviance: 181.42  on 696  degrees of freedom
AIC: 189.42

Number of Fisher Scoring iterations: 8

Bayesian dose response model

For anyone looking to dive into the Bayesian world, I cannot overstate how good the course, and similarly titled book, Statistical Rethinking by Richard McElreath is. I initially used his rethinking package to fit these models, but have more recently moved over to rstan, which has a steeper learning curve; still, I thought it was about time I learnt how to write models in stan.

So let's refit the same model using a Bayesian approach.

The data format required by stan is a bit more fiddly than glm, as the models are very flexible but carefully typed, e.g. integers, arrays, matrices, etc. This is required so that stan knows what kind of data to expect and can compile the model for computational efficiency. Computational efficiency matters because the Markov chain Monte Carlo methods used to sample from the posterior distribution are more numerically intensive than the maximum likelihood optimisation used in frequentist statistics.

stan_data <- list(
    N = nrow(db),
    y = db$outcome,
    pop = ifelse(db$pop == "control", 1, 2),
    dose = db$dose
)

Here is the stan code saved in model.stan specifying the model.

data {

  int <lower = 0> N;                       // number of observations
  array [N] int <lower = 0, upper = 1> y;  // outcome: 1 = dead, 0 = alive
  array [N] int <lower = 1> pop;           // population index (1 = control, 2 = resistant)
  vector <lower = 0> [N] dose;             // pesticide dose (mg/L)

}

parameters {

  vector [2] a;   // intercept for each population
  vector [2] b;   // slope on log(dose) for each population
  
}

model {
     // Prior models for the unobserved parameters
    a ~ normal(0, 10);
    b ~ normal(1, 10);

    y ~ bernoulli_logit(a[pop] + (b[pop] .* log(dose)));

}

fit <- stan(
    file = "model.stan",
    data = stan_data,
    warmup = 500,
    iter = 3000,
    chains = 4,
    cores = 4,
    thin = 1,
    seed=123
)
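
Before reading too much into the estimates it is worth a quick convergence check; a minimal sketch using standard rstan helpers:

rstan::traceplot(fit, pars = c("a", "b"))   # chains should overlap and mix well
rstan::check_hmc_diagnostics(fit)           # reports divergences, tree depth, and E-BFMI issues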

The parameter estimates are quite similar to those estimated above.

> print(fit, probs=c(0.025, 0.975))
Inference for Stan model: model-202208041918-1-68bf99.
4 chains, each with iter=3000; warmup=500; thin=1; 
post-warmup draws per chain=2500, total post-warmup draws=10000.

       mean se_mean   sd   2.5%  97.5% n_eff Rhat
a[1]  -3.92    0.01 0.64  -5.32  -2.80  3377    1
a[2]  -5.93    0.02 0.94  -8.06  -4.28  3407    1
b[1]   1.70    0.00 0.27   1.24   2.32  3321    1
b[2]   1.57    0.00 0.23   1.15   2.08  3435    1
lp__ -93.02    0.03 1.44 -96.65 -91.22  3101    1

Samples were drawn using NUTS(diag_e) at Thu Aug 04 19:18:11 2022.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
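
To line the two sets of estimates up directly, a small sketch (summary(fit)$summary holds the posterior summaries; the rows are in the same order as the glm coefficients):

cbind(
  glm  = coef(m1),
  stan = summary(fit)$summary[c("a[1]", "a[2]", "b[1]", "b[2]"), "mean"]
)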

The posterior distribution

One of the cool things about Bayesian approaches is that each parameter gets a full probability distribution (rather than a point estimate with a standard error). What's more, the posterior probability distributions can take a range of forms, i.e., they are not necessarily normally distributed.

e <- rstan::extract(fit)
a <- tibble(control = e$a[, 1], resistant = e$a[, 2]) %>%
    mutate(par = "a", id = 1:n())
b <- tibble(control = e$b[, 1], resistant = e$b[, 2]) %>%
    mutate(par = "b", id = 1:n())

dl <- bind_rows(a, b) %>%
    pivot_longer(-c(par, id))

ggplot(dl, aes(x = value, color = name, fill = name)) +
    geom_density(alpha = 0.5) +
    facet_wrap(~par, scales = "free") +
    mydarktheme

Posterior distribution of dose response parameters

Here we can see how the posterior distributions of the parameters covary with one another. The density of points corresponds to the joint probability distribution of the parameters.


dl %>%
    pivot_wider(names_from = par, values_from = value) %>%
    ggplot(aes(a, b, color = name)) +
    geom_point(alpha = 0.1) +
    facet_wrap(~name, scales = "free") +
    mydarktheme +
    guides(color = "none")

The posterior distribution of the dose response parameters and their correlation structure

Retrieving model predictions from the posterior distribution can be a bit fiddly as there is no analytical expression based on parameter estimates. Rather, parameter samples from the posterior distribution must be transformed into model predictions and the quantiles must be estimated numerically. There are likely packages that can take care of this for you, but I haven't looked into it just yet.


pred <-
    expand.grid(dose = 10^seq(-2, 4, by = 0.1), pop = c(1, 2)) %>%
    as_tibble()

pred$pred <- NA
pred$predu95 <- NA
pred$predl95 <- NA
for (i in 1:nrow(pred)) {
    pop <- pred$pop[i]
    dose <- pred$dose[i]
    logit <- e$a[, pop] + e$b[, pop] * log(dose)
    p <- 1 / (1 + exp(-logit))
    qs <- quantile(p, c(0.025, 0.5, 0.975))
    pred$predl95[i] <- qs[1]
    pred$pred[i] <- qs[2]
    pred$predu95[i] <- qs[3]
}

pred %>%
    mutate(pop = c("control", "resistant")[pop]) %>%
    ggplot(aes(x = dose, y = pred, color = pop, fill = pop)) +
    geom_point(data = d, aes(dose, dead / total, color = pop)) +
    geom_line() +
    geom_ribbon(aes(ymin = predl95, ymax = predu95), alpha = 0.5) +
    mydarktheme +
    xlab("omethoate (mg/L)") +
    ggtitle("Bayesian") +
    scale_x_log10()

Comparing the two model predictions

Hover your mouse over the plot below to compare the two predictions.

Upon reading a bit more I found out that stan can indeed handle aggregated binomial data.
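
For the record, here is a rough, untested sketch of what the aggregated version might look like, passing in the dead/total counts from d rather than individual outcomes and using stan's binomial_logit distribution (the data block names total and dead are my own):

data {

  int <lower = 0> N;                  // number of dose groups (rows of d)
  array [N] int <lower = 0> total;    // individuals tested in each group
  array [N] int <lower = 0> dead;     // deaths in each group
  array [N] int <lower = 1> pop;      // population index (1 = control, 2 = resistant)
  vector <lower = 0> [N] dose;        // pesticide dose (mg/L)

}

parameters {

  vector [2] a;
  vector [2] b;

}

model {

    a ~ normal(0, 10);
    b ~ normal(1, 10);

    dead ~ binomial_logit(total, a[pop] + (b[pop] .* log(dose)));

}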

Source code available here

Part 2 of this post is available here.