We just finished watching a YouTube tutorial on Dirichlet distributions given by Foursquare data scientist Max Sklar. According to Wikipedia, the Dirichlet distribution is the “multivariate generalization of the beta distribution.” Yikes! We confess that the last time we saw a beta distribution was around the time Bill Clinton was being impeached. In this post we brush the cobwebs from our minds, derive a basic identity between the beta and gamma functions, and illustrate our calculations with R.

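Beta distributions are defined by density functions

$$f(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}, \qquad 0 < x < 1,$$

where $\alpha > 0$ and $\beta > 0$ are shape parameters and $B$ is the beta function defined as:

$$\begin{aligned} B(\alpha, \beta) &= \int_0^1 x^{\alpha - 1} (1 - x)^{\beta - 1} \, dx \\ &= \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}. \end{aligned}$$

The definition of $B$ above reflects the fact that $f$ is a probability density function and must integrate to 1. Rather infuriatingly, Wikipedia states that the second line simply “follows from the definition of the gamma function” without any further explanation. We derive this identity below.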
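Before deriving anything, it is reassuring to check the identity $B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta) / \Gamma(\alpha + \beta)$ numerically. The snippet below is only a quick sanity check, not part of the proof. The names `a`, `b`, `integrand`, `lhs` and `rhs` are ours, and the shapes 5 and 15 are just illustrative values.

```r
# Illustrative shape parameters (an assumption; any positive values should work)
a <- 5
b <- 15

# Left-hand side: the integral that defines the beta function
integrand <- function(x) x^(a - 1) * (1 - x)^(b - 1)
lhs <- integrate(integrand, lower = 0, upper = 1, rel.tol = 1e-10)$value

# Right-hand side: the ratio of gamma functions
rhs <- gamma(a) * gamma(b) / gamma(a + b)

# Base R's beta() is printed as a third point of comparison
print(c(integral = lhs, gammaRatio = rhs, builtin = beta(a, b)))
```

The three printed numbers should agree to many decimal places. Agreement for one pair of shapes proves nothing in general, of course, which is why we still want the derivation.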

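Let $U$ and $V$ be independent random variables with gamma distributions

$$f_U(u) = \frac{1}{\Gamma(\alpha)} u^{\alpha - 1} e^{-u}, \qquad f_V(v) = \frac{1}{\Gamma(\beta)} v^{\beta - 1} e^{-v}, \qquad u, v > 0.$$

We transform $U$ and $V$ to random variables $X$ and $Y$ using the equations

$$X = \frac{U}{U + V}, \qquad Y = U + V.$$

Here comes a wonderful fact: the pdf of $X$ is the beta density with shape parameters $\alpha$ and $\beta$. Before we prove this, we visualize it with some R code: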

library(ggplot2)

# change these parameters if you want
alpha <- 5
beta <- 15
nSamples <- 1000

# generate samples for U and V according to gamma distributions with
# shape alpha and beta (the rate is left at its default of 1)
u <- rgamma(nSamples, shape = alpha)
v <- rgamma(nSamples, shape = beta)

# transform the variables to x and y
x <- u / (u + v)
y <- u + v

# data for the pdf of the beta distribution with shapes alpha and beta
xBeta <- seq(min(x), max(x), length.out = 100)
yBeta <- dbeta(xBeta, shape1 = alpha, shape2 = beta)

# plot the density of x versus the pdf of the beta distribution
myPlot <- ggplot() +
    # after_stat(density) replaces the deprecated ..density.. (needs ggplot2 >= 3.3.0)
    geom_histogram(aes(x, y = after_stat(density)),
                   binwidth = 0.05,
                   color = "white",
                   fill = "#E69F00") +
    geom_line(aes(x = xBeta, y = yBeta), color = "#56B4E9") +
    theme(axis.title.x = element_blank()) +
    theme(axis.title.y = element_blank())


# ...profit!
plot(myPlot)

This code generates the pretty picture at the top of this article. The orange bars are a histogram of 1000 random samples of $X$, and the blue line is the graph of the beta density with $\alpha = 5$ and $\beta = 15$. You can see that the blue line matches up nicely with the orange bars.
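A picture is convincing, but a couple of numbers can back it up. The sketch below reruns the simulation and compares the sample mean and variance of $X$ with the known beta values $\alpha / (\alpha + \beta)$ and $\alpha\beta / ((\alpha + \beta)^2 (\alpha + \beta + 1))$. It then runs a Kolmogorov-Smirnov test against the beta cdf. The seed and the names `shapeA`, `shapeB`, `n`, `theoMean`, `theoVar` and `ksResult` are our own choices for this illustration.

```r
# Arbitrary seed so the check is reproducible (an assumption, not in the original code)
set.seed(1)
shapeA <- 5
shapeB <- 15
n <- 1000

# Same construction as before: X = U / (U + V) with gamma-distributed U and V
u <- rgamma(n, shape = shapeA)
v <- rgamma(n, shape = shapeB)
x <- u / (u + v)

# Theoretical mean and variance of a beta distribution with these shapes
theoMean <- shapeA / (shapeA + shapeB)
theoVar <- shapeA * shapeB / ((shapeA + shapeB)^2 * (shapeA + shapeB + 1))
print(c(sampleMean = mean(x), theoMean = theoMean))
print(c(sampleVar = var(x), theoVar = theoVar))

# Kolmogorov-Smirnov test of the samples against the beta cdf
ksResult <- ks.test(x, "pbeta", shape1 = shapeA, shape2 = shapeB)
print(ksResult$p.value)
```

A p-value that is not small means the test found no evidence against the beta hypothesis. That is consistent with the plot, though it is not a proof.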

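On to the business of showing our identity. Since $U$ and $V$ are independent, the joint distribution of $(U, V)$ is the product of the two gamma densities:

$$f_{U,V}(u, v) = \frac{1}{\Gamma(\alpha) \Gamma(\beta)} u^{\alpha - 1} v^{\beta - 1} e^{-(u + v)}.$$

Inverting the transformation gives $u = xy$ and $v = (1 - x)y$, so the Jacobian of the transformation is

$$J = \begin{vmatrix} \partial u / \partial x & \partial u / \partial y \\ \partial v / \partial x & \partial v / \partial y \end{vmatrix} = \begin{vmatrix} y & x \\ -y & 1 - x \end{vmatrix} = y(1 - x) + xy = y.$$

The joint distribution of $(X, Y)$ is thus

$$f_{X,Y}(x, y) = f_{U,V}(xy, (1 - x)y) \, |J| = \frac{1}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} y^{\alpha + \beta - 1} e^{-y},$$

and the pdf of $X$ is found by integrating along the $y$ values of this equation:

$$f_X(x) = \int_0^\infty f_{X,Y}(x, y) \, dy = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{\Gamma(\alpha) \Gamma(\beta)} \int_0^\infty y^{\alpha + \beta - 1} e^{-y} \, dy = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}.$$

The last step is the definition of the gamma function, $\Gamma(z) = \int_0^\infty y^{z - 1} e^{-y} \, dy$, with $z = \alpha + \beta$.

Our identity follows from the fact that $f_X$ is a pdf and must integrate to 1:

$$\int_0^1 x^{\alpha - 1} (1 - x)^{\beta - 1} \, dx = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}.$$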
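As a final check on the algebra, we can integrate the joint density of $(X, Y)$ over $y$ numerically and compare the result with R's built-in `dbeta`. This is a rough sketch that assumes the joint density is $x^{\alpha - 1} (1 - x)^{\beta - 1} y^{\alpha + \beta - 1} e^{-y} / (\Gamma(\alpha)\Gamma(\beta))$. The names `a`, `b`, `jointXY`, `marginalX` and `xGrid` are ours, and the grid points are arbitrary.

```r
# Illustrative shape parameters (an assumption; they match the earlier plot)
a <- 5
b <- 15

# Joint density of (X, Y) obtained from the change of variables
jointXY <- function(x, y) {
  x^(a - 1) * (1 - x)^(b - 1) * y^(a + b - 1) * exp(-y) / (gamma(a) * gamma(b))
}

# Marginal density of X: integrate y out numerically for a single value of x
marginalX <- function(x) {
  integrate(function(y) jointXY(x, y), lower = 0, upper = Inf)$value
}

# Compare with the closed-form beta density at a few arbitrary points
xGrid <- c(0.1, 0.25, 0.4)
numericMarginal <- sapply(xGrid, marginalX)
closedForm <- dbeta(xGrid, shape1 = a, shape2 = b)
print(cbind(x = xGrid, numeric = numericMarginal, closedForm = closedForm))
```

If the derivation is right, the two columns should agree up to the accuracy of the numerical integration.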

Conclusion

This was a nice little exercise to help us remember some statistics.


Written by

Stephanie Yang

Published

28 March 2014
