Quantifying detection of eggs

# Tutorial on quantifying species detection probabilities during surveillance with stan in R

A practical question in species surveillance is “How much search effort is required for detection?”. This can be quantified under controlled conditions where the number and location of target species are known and participants are recruited to see how success rate varies.

Let’s use an example of an easter egg hunt where the adult (the researcher) wants to quantify how much effort it takes a child (the participant) to find an easter egg. Imagine that we let each child complete the easter egg course, and it is reset to the same condition before each new child.

library(tidyverse)
library(rethinking)

Here is a data summary of a such a trial 1.

participant experience search_time_h area_searched_ha search_effort_h_ha no_sets_encountered no_sets_detected percent_detected
4 yes 1.67 0.29 5.69 4 4 100
3 yes 2.17 0.25 8.84 6 5 83
1 yes 1.42 0.45 3.14 8 5 63
6 no 1.08 0.42 2.57 5 3 60
7 no 1.00 0.90 1.11 13 7 54
2 yes 1.42 0.84 1.69 15 5 33
11 yes 1.07 0.65 1.64 9 3 33
12 no 1.35 0.20 6.74 3 1 33
8 yes 0.92 1.20 0.77 23 7 30
10 no 1.00 0.76 1.32 15 4 27
9 no 1.28 0.24 5.43 5 1 20
5 yes 1.17 1.43 0.81 22 2 9

This table shows some child level characteristics that might effect detection, such as whether or not they have had any prior experience looking for easter eggs. But suppose we want to control for some property of the hidden easter egg, such as the size of the egg in grams, where we might pressume that larger eggs are easier to find. Here I separate out the data into Bernoulli form (1s for detections and 0s for misses) and assign a random egg size between 10 and 100 g.

set.seed(111) # fix random numbers
# convert summary data to long format (Bernoulli trials)
total_eggs = 23
db = d %>%
rowwise() %>%
mutate(detection = list(rep(c(0, 1), c(no_sets_encountered - no_sets_detected, no_sets_detected)))) %>%
dplyr::select(-no_sets_encountered,-no_sets_detected, -percent_detected) %>%
unnest() %>%

# assign id to eggs and then random size
group_by(participant) %>%
mutate(egg_id = sample(1:total_eggs,n(),replace = FALSE)) %>%
ungroup %>%
mutate(egg_size = runif(total_eggs, 10,100)[egg_id]) %>%
mutate(experience = ifelse(experience == 'yes', 1, 0)) %>%
mutate(log_search_effort_h_ha = log(search_effort_h_ha)) %>%

# it is good practice to use centred variable (mean = 0) for parameter interpretation and fitting
mutate(egg_size_c = egg_size - mean(egg_size)) %>%
as.data.frame

knitr::kable(head(db))
participant experience search_time_h area_searched_ha search_effort_h_ha detection egg_id egg_size log_search_effort_h_ha egg_size_c
4 1 1.67 0.29 5.69 1 14 88.35757 1.738710 27.820235
4 1 1.67 0.29 5.69 1 2 24.79464 1.738710 -35.742694
4 1 1.67 0.29 5.69 1 11 92.79748 1.738710 32.260140
4 1 1.67 0.29 5.69 1 10 56.31577 1.738710 -4.221572
3 1 2.17 0.25 8.84 0 9 65.42770 2.179287 4.890362
3 1 2.17 0.25 8.84 1 22 17.12628 2.179287 -43.411054

So now that we have some characterics of the child (e.g. search effort), and each hidden egg (e.g. size) that are expected to affect detection probabilities it is time to model it.

Each encounter ($$X_{i,j}$$) with an egg j by child i can be considered a draw from a Bernoulli distribution with parameter $$p_{i,j}$$, which is the probability of detecting the egg given it was present:

$X_{i,j} \sim \textrm{dbern}(p_{i,j})$

The are many ways to model the probability of detection. One is to assume an exponential relationship with search effort where $$p_{i,j}$$ is given by:

$p_{i,j} = 1 - exp(-\lambda_{i,j} f_j)$

where $$f_i$$ is the search effort per unit area of child i, and $$\lambda_{i,j}$$ is the rate of detection of plant set j by observer i, modelled as a log-linear function of factors identified as likely to influence detectability or:

$\ln(\lambda_{i,j}) = a + b.s_j + c.e_i + \textrm{obs}_i$ where $$a$$ is the intercept term, $$b$$ is the effect of egg size $$s_j$$, and $$c$$ is the effect child experience $$e_i$$. The reference level of child experience is ‘no experience’.

For an intuitive explanation of commonly used probability distributions, see this previous post and this one on bayesian vs. frequentist binomial responses.

On to fitting. First, get some reasonable initial values for the model.

a = 0.17
b = 0.001
c = 0.125
db %>%
mutate(pred =  1 - exp(-(a + b*egg_size_c + c*experience + log_search_effort_h_ha))) %>%
dplyr::select(pred, detection) %>%
head
##        pred detection
## 1 0.8727409         1
## 2 0.8643894         1
## 3 0.8733047         1
## 4 0.8685973         1
## 5 0.9161878         0
## 6 0.9120402         1

Then fit the model using Richard McElreath’s rethinking package and rstan. Note that we are kind of pushing the limits the syntax that the rethinking package can handle so there are a couple of errors when computing statisitics like the WAIC.

m1 <- rethinking::map2stan(
alist(
detection ~ dbinom( 1 , 1 - exp(-lambda_f) ),
log(lambda_f) <-  (a + b*egg_size_c + c*experience + log_search_effort_h_ha + obs[participant]),
a ~ dnorm(0, 10),
b ~ dnorm(0, 10),
c ~ dnorm(0, 10),
obs[participant] ~ dnorm(0, sigma) ,
sigma ~ dcauchy(0,1)
),
data = db,
start = list(a=0.17, b = 0.001, c=0.125, sigma = 0.3) ,
chains=1 , iter=10000 , warmup=1000

)
save(m1, file = 'data/detection_prob/m1.Rdata')

The precis output shows that 0.89 credible intervals excluded both the experience covariate and the randomly generated egg size covariate.

# summary output
precis(m1)
##        Mean StdDev lower 0.89 upper 0.89 n_eff Rhat
## a     -1.66   0.40      -2.30      -1.05  3198    1
## b      0.01   0.01       0.00       0.02  1824    1
## c      0.34   0.51      -0.48       1.08  3414    1
## sigma  0.56   0.31       0.07       0.94   490    1
# plot detection probability vs. effort
pars = precis(m1)@output$Mean names(pars) = rownames(precis(m1)@output) obs = precis(m1, depth = 2)@output$Mean[5:16]
pred_line =  expand.grid(search_effort_h_ha = 10^seq(-1, 1, length = 100),
egg_size = c(60.5),
experience = 0:1,
participant = 1) %>% # placeholder participant
mutate(log_search_effort_h_ha = log(search_effort_h_ha)) %>%
mutate(egg_size_c = egg_size - mean(db$egg_size)) # create matrix of zeros getting mean of random effect participant_zeros = matrix(0, 1000, max(d$participant))
mu <- link(m1, data=pred_line, replace=list(obs=participant_zeros))
## [ 100 / 1000 ]
[ 200 / 1000 ]
[ 300 / 1000 ]
[ 400 / 1000 ]
[ 500 / 1000 ]
[ 600 / 1000 ]
[ 700 / 1000 ]
[ 800 / 1000 ]
[ 900 / 1000 ]
[ 1000 / 1000 ]
pred_line$pred <- 1 - exp(-apply( mu , 2 , mean )) pred_line[,c('lower','upper')] <- t(1 - exp(-apply(mu, 2, PI, prob = 0.89))) # predictions for each participant with random effect (at mean egg size) d = d %>% mutate(log_search_effort_h_ha = log(search_effort_h_ha)) %>% mutate(pred = 1 - exp(-exp((pars['a'] + pars['b']*0 + pars['c']*(experience=='yes') + log_search_effort_h_ha + obs[participant])))) p1 = ggplot(pred_line, aes(search_effort_h_ha, pred, colour = factor(experience))) + geom_line(aes(linetype = factor(experience))) + geom_line(aes(y=lower, linetype = factor(experience))) + geom_line(aes(y=upper, linetype = factor(experience))) + geom_ribbon(aes(ymin=lower,ymax=upper,fill = factor(experience), color = NULL), alpha = 0.2) + geom_point(data = d, aes(linetype = NULL), colour = 'white', shape = 1) + geom_point(data = d, aes(y = percent_detected/100), colour = 'white') + ylab('predicted detection probability') + xlab('search effort, h/ha') + guides(fill=F, colour = F, linetype = F) + ggtitle('Model 1') + mydarktheme cowplot::plot_grid( p1, p1 + ggtitle('')+ scale_x_log10(), ncol=2) Figure 1. Plot of model 1 predicted detection probability as a function of search effort. Filled circles show observed detected percentages by each participant, while hollow circles show predicted probabilities for each observer’s measured effort level and fitted random effect (egg size is set to mean). Shaded regions shows 89% credible interval. Prediction for experienced individuals shown with blue (vs red for not-experienced). In choosing to implement this model in stan, I found myself a world of hurt, which mostly resulted in pushing the rethinking package to its limits. I have never seen so many error messages as I fumbled my way through red words trying to make a log-link function work with a binomial regression (or some other way to restrict the value of $$\lambda$$ to positive values else break the contraint $$0<p<1$$). Most of this stems from my lack of understanding around syntax in stan but this exercise has convinced me to invest more time to learn it properly and write models directly in stan. That said, rethinking is an excellent framework in which to learn and play around with Bayesian models. All this pain taught me that binomial probabilities have to be between 0 and 1 and if you have some linear function y = a + x * b, which can take on any real value, then 1 - exp(-y) is only bounded by 0 and 1 when y is positive, which is why we use the log-link function to restrict y (i.e. exp(y) > 0). More conventionally, linear functions are converted to valid probabilities using the logit-link (i.e. logit(y) = 1/(1 + y) > 0. I explored this more conventional approach by using the logit link function which resulted in our previous linear function becoming framed around the log odds ratio $$\ln(\frac{p}{1-p})$$. I ended up with a pretty similar model. $p_{i,j} = \frac{1}{1 + exp(-y_{i,j})}$ where $y_{i,j} = a + b.s_j + c.e_i + \log(f_i) + \textrm{obs}_i$ and the variables are defined as before. Expressing the equation in terms of the odds ratio helps interpretabilty. $\frac{p}{1-p} = exp(a + b.s_j + c.e_i + \textrm{obs}_i)f_i$ The odds of detection scales linearly with survey effort. m2 <- rethinking::map2stan( alist( detection ~ dbinom( 1 , p ), logit(p) <- a + b*egg_size_c + c*experience + log_search_effort_h_ha + obs[participant] , a ~ dnorm(0, 10), b ~ dnorm(0, 10), c ~ dnorm(0, 10), obs[participant] ~ dnorm(0, sigma) , sigma ~ dcauchy(0,1) ), data = db, chains=1 , iter=10000 , warmup=1000 ) save(m2, file = 'data/detection_prob/m2.Rdata') Interestingly, model coefficients are similar despite the different link function. # summarise model precis(m2) ## Mean StdDev lower 0.89 upper 0.89 n_eff Rhat ## a -1.25 0.50 -2.01 -0.47 3587 1 ## b 0.01 0.01 0.00 0.02 7950 1 ## c 0.31 0.64 -0.67 1.29 2964 1 ## sigma 0.64 0.38 0.05 1.11 718 1 # plot detection probability vs effort pars = precis(m2)@output$Mean
names(pars) = rownames(precis(m2)@output)
obs = precis(m2, depth = 2)@output$Mean[4:16] pred_line = expand.grid(search_effort_h_ha = 10^seq(-1, 1, length = 100), egg_size = c(60.5), experience = 0:1, participant = 1) %>% # participant place holder mutate(egg_size_c = egg_size - mean(db$egg_size)) %>%
mutate(log_search_effort_h_ha = log(search_effort_h_ha))

# create matrix of zeros getting mean of random effect
participant_zeros = matrix(0, 1000, max(d$participant)) mu <- link(m2, data=pred_line, replace=list(obs=participant_zeros)) ## [ 100 / 1000 ] [ 200 / 1000 ] [ 300 / 1000 ] [ 400 / 1000 ] [ 500 / 1000 ] [ 600 / 1000 ] [ 700 / 1000 ] [ 800 / 1000 ] [ 900 / 1000 ] [ 1000 / 1000 ] pred_line$pred <- (apply( mu , 2 , mean ))
pred_line[,c('lower','upper')] <- t((apply( mu , 2 , PI, prob = 0.89 )))

# predictions for each participant with random effect (at mean egg size)
d = d %>%
mutate(pred = (inv_logit(pars['a'] + pars['b']*0 + pars['c']*(experience=='yes') +log(search_effort_h_ha) + obs[participant])))
p2 = ggplot(pred_line, aes(search_effort_h_ha, pred, colour = factor(experience))) +
geom_line(aes(linetype = factor(experience))) +
geom_line(aes(y=lower, linetype = factor(experience))) +
geom_line(aes(y=upper, linetype = factor(experience))) +
geom_ribbon(aes(ymin=lower,ymax=upper,fill = factor(experience), color = NULL), alpha = 0.2) +
geom_point(data = d, aes(linetype = NULL), colour = 'white', shape = 1) +
geom_point(data = d, aes(y = percent_detected/100), colour = 'white') +
ylab('predicted detection probability') +
xlab('search effort, h/ha') +
guides(fill=F, colour = F, linetype = F) +
ggtitle('Model 2') +
mydarktheme

cowplot::plot_grid( p1 , p1 + ggtitle('') + scale_x_log10(),
p2 , p2 + ggtitle('') + scale_x_log10(), ncol=2)

Figure 2. Plot of model 2 predicted detection probability as a function of search effort. Filled circles show observed detected percentages by each participant, while hollow circles show predicted probabilities for each observer’s measured effort level and fitted random effect (egg size is set to mean). Shaded regions shows 89% credible interval. Prediction for experienced individuals shown with blue (vs red for not-experienced).

The DIC of model 2 is lower than model 1.

DIC(m1)[1]
## [1] 161.7463
DIC(m2)[1]
## [1] 160.9068

1. This is real data but instead of easter eggs, participants were asked to survey for hawkweed in the Victorian alps. Data is from Moore, J. L., Hauser, C. E., Bear, J. L., Williams, N. S. and McCarthy, M. A. (2011), Estimating detection–effort curves for plants using search experiments. Ecological Applications, 21: 601-607. doi:10.1890/10-0590.1