Building distributions with a coin flip

Generating probability distributions with natural examples

How many probability distributions can we generate by imagining simple natural processes? In this post I use a simple binomial random number generator to produce different random variables with a variety of distributions. Using built in probability densities functions in R, I show how the simulated data (plot bars) approach the exact probability density (plot lines) and provide an intuitive interpretation of model parameters of commonly encountered distributions.

A biological example

“Nothing in Biology Makes Sense Except in the Light of Evolution” - Theodosius Dobzhansky, 1973

Nucleotides form the molecular basis of the genetic code, the order of which is approximately random (to the human eye). In the following, we will imagine reading a strand of DNA.

The number of A’s (Adenine) in one letter of genetic code can be 1 (if A) or 0 (if G, G or T). This is a random Bernoulli variable where \(p = 1/4\) and a special case of a Binomial distribution where \(n = 1\)

samples1 = rbinom(n = 10000, size = 1, prob = 1/4)

p1 = ggplot()+ 
  geom_histogram(aes(samples1), breaks = 0:2 - 0.5, colour = 'black', closed = 'left') + 
  geom_line(aes(0:1, dbinom(0:1, 1, 0.25) * 10000), col = "red") +
  
  # aesthetics
  ggtitle('Bernoulli distribution') +
  xlab('number of A bases per nucleotide') +
  ylab('counts per 10000 nucleotides') +
  mydarktheme

p1

If we count the A’s in 10 letters of code, This random variable becomes binomial with \(n = 10\) and \(p = 1/4\).

samples = rbinom(n = 10000, size = 10, prob = 0.25)

p2 = ggplot()+ 
  geom_histogram(aes(samples), breaks = 0:10 -  0.5, colour = 'black', closed = 'left') + 
  geom_line(aes(0:10, dbinom(0:10, 10, 0.25) * 10000), col = "red") +
  
  # aesthetics
  ggtitle('Binomial distribution') +
  xlab('number of A bases in 10 nucleotides') +
  mydarktheme + 
  scale_x_continuous(breaks = 0:10)

p2

Counting the number of non-A letters until we see the first A gives us a geometric distribution with \(p = 1/4\).

samples = replicate(n = 10000, 
            which(rbinom(n = 10, size = 1, prob = 0.25) == 1)[1] - 1
          )

p3 = ggplot()+ 
  geom_histogram(aes(samples), breaks = 0:11 - 0.5, colour = 'black', closed = 'left') + 
  geom_line(aes(0:10,  dgeom(0:10, 0.25) * 10000), col = "red") +
  
  # aesthetics
  ggtitle('Geometric distribution') +
  xlab('number of nucleotides until first A base') +
  mydarktheme 

p3
## Warning: Removed 573 rows containing non-finite values (stat_bin).

If we instead count the number of non-A’s until we see 2 A’s we get a negative binomial distribution with \(n = 2\) and \(p = 1/4\).

samples = replicate(n = 10000, 
            which(cumsum(rbinom(n = 20, size = 1, prob = 0.25)) == 2)[1] - 2
          )

p4 = ggplot()+ 
  geom_histogram(aes(samples), breaks = 0:11 - 0.5, colour = 'black', closed = 'left') + 
  geom_line(aes(0:10,  dnbinom(0:10, 2, 0.25) * 10000), col = "red") +
  
  # aesthetics
  ggtitle('Negative binomial distribution') +
  xlab("number of non-A nucleotides until two A's") +
  mydarktheme 

p4

A codon consists of three nucleotides and codes for an amino acid. The amino acid tryptophan is coded by TGG, which is one of \(4^3 = 64\) codons.

Counting the number of Tryptophan codons in 1000 codons can be represented as a Poisson distribution with $= np = 1000 (1/4)^3 $

samples = replicate(n = 10000, 
            sum(rbinom(n = 1000, size = 3, prob = 0.25) == 3)
          )

x= 1:50
p5 = ggplot()+ 
  geom_histogram(aes(samples), breaks = x - 0.5, colour = 'black', closed = 'left') + 
  geom_line( aes(x , dpois(x, 1000 *(1/4)^3) * 10000), col = "red") +

  # aesthetics
  ggtitle('Poisson distribution') +
  xlab('number of tryptophan (TGG) in 1000 codons') +
  mydarktheme

p5

However, this distribution is also approximated by a binomial trial with \(n = 1000\) and \(p = (1/4)^3\) (shown as green dots).

p5 + geom_point(aes(x, dbinom(x, 1000, (1/4)^3) * 10000), col = "green") + 
  ggtitle('Poisson and Binomial distribution')

The approximation improves as \(n \to \infty\) and \(p \to 0\). For example, a Poisson distribution would be a poor approximation in the first example of counting A’s in a single random DNA letter (as shown by the green line below). We cannot have more than one A in a single letter as Poisson predicts!

p1 + geom_line(aes(0:5 , dpois(0:5, 1 * 1/2) * 10000), col = "green")

We now extend this story to continuous distributions - where the outcome is not an integer.

If it takes a gene sequencer 1 millisecond to read a codon (as above), it will take us 1 second to read 1000 codons. Remember the Poisson rate of tryptophan per 1000 codons was \(np = 1000 \times (1/4)^3 \approx 16\) or in other words 1/16 seconds until a tryptophan was read.

The time until the first tryptophan read is exponentially distributed.

samples = replicate(n = 10000, 
            which(rbinom(n = 1000, size = 3, prob = 0.25) ==3)[1] 
          )/1000

bins = 100
x = seq(0, 1, length = bins)
p6 = ggplot()+ 
  geom_histogram(aes(samples), breaks = x, colour = 'black', closed = 'left') + 
  geom_line(aes(x,  dexp(x, 16) * 10000/bins), col = "red") +
  
  # aesthetics
  ggtitle('Exponential distribution') +
  xlab('time until first trytophan (TGG), milliseconds') +
  mydarktheme 

p6

The time it takes to for three tryptophan reads is gamma distributed with \(r = 3\).

samples1 = replicate(n = 10000, 
            which(rbinom(n = 1000, size = 3, prob = 0.25) == 3)[1] 
           )/1000

samples2 = sample(samples1)
samples3 = sample(samples1)

samples  = samples1 + samples2 + samples3

bins = 100
x = seq(0, 1, length = bins)
p6 = ggplot()+ 
  geom_histogram(aes(samples), breaks = x, colour = 'black', closed = 'left') + 
  geom_line(aes(x,  dgamma(x, shape = 3, rate = 16) * 10000/bins), col = "red") +
  
  # aesthetics
  ggtitle('Gamma distribution') +
  xlab('time until three trytophans (TGG), milliseconds') +
  mydarktheme 

p6

Suppose we add more and more of these random exponential variables together. What we find is that the distribution approaches the famous normal distribution. Indeed, normal distributions are so ubiquitous in nature precisely because they arise from the sumation of random variables.

Note that below is the first example where I have have had to replace my poor little rbinom based exponential sampler with rexp as it just became too slow.

r = 1000 # number of exp distributions to add
gamma = 16

nsamples = rexp(10000 * r, 16)
samples = apply(array(nsamples, dim = c(10000, r)), 1, sum)

# normal distribution parameters
mu = r / gamma 
std = sqrt(r * (1/gamma)^2)

bins = 50
x = seq(mu - 3 * std, mu + 3 * std, length = bins)
p7 =ggplot()+ 
  geom_histogram(aes(samples), breaks = x, colour = 'black', closed = 'left') + 
  geom_line(aes(x,  dnorm(x, mean = mu, sd = std) * 10000 * (x[2] - x[1])), col = "red") +
  
  # aesthetics
  ggtitle('Normal distribution') +
  xlab(paste('time until',r ,'trytophans (TGG), seconds')) +
  mydarktheme 

p7

It’s nice to think about how simple processes can generate such a variety of probability distribution functions. A great graphic was produced by Ehsan azhdari showing the relationships between probability distributions.

Relationships among some of univariate probability distributions are illustrated with connected lines. dashed lines means approximate relationship, CC BY-SA 3.0 CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=22628518

Relationships among some of univariate probability distributions are illustrated with connected lines. dashed lines means approximate relationship, CC BY-SA 3.0 CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=22628518

Related

comments powered by Disqus