Proportions are a funny thing in statistics. Some people just seem to love percentages. But there is a dark side to modelling a response variable as a percentage.

For example, I might be tempted to fit a linear model to mortality data on some insects exposed to heat stress for varying lengths of time. To make the point, I will simulate some data.

`library(tidyverse)`

```
# simulate insect mortality at 10 exposure times, 10 trials each
time = rep(0:9, 10)
n = 100                          # insects per trial
a = -1                           # intercept on the logit scale
b = 0.1                          # slope on the logit scale
p = 1/(1 + exp(-(a + b*time)))   # true probability of death
d = tibble(time=time, dead = rbinom(length(p), n, p), n=n, dead_p=dead/n)
head(d)
```

```
## # A tibble: 6 x 4
##    time  dead     n dead_p
##   <int> <int> <dbl>  <dbl>
## 1     0    27   100   0.27
## 2     1    24   100   0.24
## 3     2    26   100   0.26
## 4     3    34   100   0.34
## 5     4    39   100   0.39
## 6     5    35   100   0.35
```

Now I will fit a simple linear model to the response.

```
m1 = lm(dead_p ~ time, data=d)
summary(m1)
```

```
## 
## Call:
## lm(formula = dead_p ~ time, data = d)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.093673 -0.029095  0.000664  0.025573  0.091145 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.275964   0.007662   36.02   <2e-16 ***
## time        0.020964   0.001435   14.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04122 on 98 degrees of freedom
## Multiple R-squared:  0.6853, Adjusted R-squared:  0.6821 
## F-statistic: 213.4 on 1 and 98 DF,  p-value: < 2.2e-16
```

But underpinning these proportions are Bernoulli trials, in which each insect dies with probability p, and p depends on time. Most of the time, a proportion can be rephrased as the outcome of a binomial trial in one way or another.

```
m2 = glm(cbind(dead, n - dead) ~ time, family = binomial('logit'), data=d)
summary(m2)
```

```
## 
## Call:
## glm(formula = cbind(dead, n - dead) ~ time, family = binomial("logit"), 
##     data = d)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.92690  -0.58860   0.03242   0.52819   1.92748  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.948457   0.040203  -23.59   <2e-16 ***
## time         0.090829   0.007321   12.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 228.130  on 99  degrees of freedom
## Residual deviance:  71.846  on 98  degrees of freedom
## AIC: 572.48
## 
## Number of Fisher Scoring iterations: 3
```

We can plot both models' predictions alongside the data.

```
d$pred1 = predict(m1)                    # linear predictions
d$pred2 = predict(m2, type='response')   # binomial predictions, probability scale
ggplot(d, aes(x=time, y=dead_p)) +
  geom_point(col='red') +
  geom_line(aes(y=pred1), col='red') +
  geom_line(aes(y=pred2), col='green') +
  mydarktheme
```

It doesn’t look like there is much between the two predictions, so why should it matter which model we choose? Well, one good reason is that we know beforehand that the relationship between mortality and heat stress shouldn’t be linear: mortality can’t keep increasing indefinitely - it must level off once all the insects are dead!

Indeed, if we use the fitted curves to extrapolate to longer exposure times with more simulated data, we see pretty quickly that the linear model breaks down.

```
time = rep(10:90, 10)
p = 1/(1 + exp(-(a + b*time)))
d2 = tibble(time=time, dead = rbinom(length(p), n, p), dead_p = dead/n)
d2$pred1 = predict(m1, newdata=d2)
d2$pred2 = predict(m2, newdata=d2, type='response')
ggplot(d2, aes(x=time, y=dead_p)) +
  geom_point(col='pink') +
  geom_line(aes(y=pred1), col='red') +
  geom_line(aes(y=pred2), col='blue') +
  geom_point(data=d, col='red') +
  mydarktheme
```

But perhaps a more subtle reason is that the different likelihood functions associated with each model imply different interpretations of the response variance and of the fitted coefficients. The linear model assumes the response itself is continuous, while the binomial model only assumes that the dependence of p on time is continuous. The coefficients in the binomial model act on the log-odds scale, log(p/(1-p)).
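To make the log-odds interpretation concrete, here is a small sketch using the true simulation values (a = -1, b = 0.1) from above; exponentiating a logistic coefficient turns it into an odds ratio:

```r
# On the logit scale, log(p/(1-p)) = a + b*time, so exp(b) is the
# multiplicative change in the odds of death per unit of exposure time.
a = -1
b = 0.1
odds0 = exp(a)              # odds of death at time 0: ~0.368
odds_ratio = exp(b)         # odds multiply by ~1.105 per time unit
p0 = odds0 / (1 + odds0)    # back-transform to a probability: ~0.269
c(odds0 = odds0, odds_ratio = odds_ratio, p0 = p0)
```

The same back-transformation applies to the fitted model: `exp(coef(m2))` gives the estimated baseline odds and odds ratio.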

Interestingly, in spite of a warning, we get nearly the same inference if we fit the glm to the proportion data directly, but the estimated standard errors are out by a factor of 10. This is a nice clue (think: se = sd/sqrt(n), and sqrt(100) = 10)! To correct this, we can add a weights term giving the size of each trial.

```
m2_p = glm(dead_p ~ time, family = binomial('logit'), data=d)
m2_p_weighted = glm(dead_p ~ time, family = binomial('logit'), data=d, weights=d$n)
broom::tidy(m2)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -0.948     0.0402     -23.6 4.67e-123
## 2 time          0.0908    0.00732     12.4 2.43e- 35
```

`broom::tidy(m2_p)`

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  -0.948     0.402      -2.36  0.0183
## 2 time          0.0908    0.0732      1.24  0.215
```

`broom::tidy(m2_p_weighted)`

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -0.948     0.0402     -23.6 4.67e-123
## 2 time          0.0908    0.00732     12.4 2.43e- 35
```

The cause of the discrepancy is that a simple proportion carries no information about the number of trials. For example, after 497 heads in 1,000 flips I would estimate p ≈ 0.5, just as I would after 1 head in 2 flips, but I would be far more certain of the coin’s fairness in the first scenario. This is yet another reason not to model raw proportions if you can avoid it: you are throwing away hard-earned data!
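As a quick sanity check on that intuition, the approximate standard error of an estimated proportion is sqrt(p(1-p)/n), so the same point estimate carries very different uncertainty depending on the trial size:

```r
# Standard error of a proportion estimated from n Bernoulli trials
se_prop = function(p, n) sqrt(p * (1 - p) / n)
se_prop(0.5, 1000)   # ~0.016: roughly 500 heads in 1000 flips
se_prop(0.5, 2)      # ~0.354: 1 head in 2 flips
```

With n = 1000 the 95% interval around 0.5 is about ±0.03; with n = 2 it spans nearly the whole unit interval.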