
Fun Facts about Variance


You can’t understand statistics without understanding variance. Unfortunately, there are some important bits of information about variance that I don’t remember ever being taught as a whole. I’ve put these tidbits together in one place so you don’t have to piece them together from textbooks and wikipedia.

1. Definitions #

First, the basics. These are things you are taught as a student, but it’s worth saying them again.

1a. Variance, Standard Deviation, and Precision #

Say (x) is a sample of (n) values, where (x_i) is the (i)th value: (x = [x_1, x_2, …, x_n]). The sample mean is (\mu = \frac{\sum_i^n x_i}{n}) (that is, the sum of all the values, divided by the number of values). We can also call this the expectation or expected value of (x), (E(x)), but don’t confuse this with expected value in decision-making, which is slightly different.

The sample variance, (\sigma^2), is the average squared distance between each value and the sample mean.

$$ \sigma^2_x = \frac{\sum_i^n (x_i - \mu)^2}{n} $$

Sample standard deviation, (\sigma), is the square root of the sample variance. This is simple, but also easy to say the wrong way around, so let’s write it down.

$$
\begin{align}
\sigma =& \sqrt{\sigma^2};\\
\text{Standard deviation} =& \sqrt{\text{Variance}};
\end{align}
$$

We can also talk about the precision, (\lambda), which is just 1 divided by the variance.

$$ \lambda = \frac{1}{\sigma^2} $$

This is particularly useful in Bayesian statistics, and we’ll come back to it later.
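Here’s a minimal check in R, using a small made-up sample (the numbers are arbitrary):

x = c(2, 4, 4, 4, 5, 5, 7, 9)        # a made-up sample
mu = mean(x)                          # sample mean: 5
sigma2 = sum((x - mu)^2) / length(x)  # sample variance (divide by n): 4
sigma = sqrt(sigma2)                  # sample standard deviation: 2
lambda = 1 / sigma2                   # precision: 0.25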

1b. Sample vs Population Variance #

For reasons that I won’t fully explain here (but see wikipedia), our best estimate of the variance of the population that (x) was sampled from isn’t actually (\frac{\sum_i^n (x_i - \mu)^2}{n}) but (\frac{\sum_i^n (x_i - \mu)^2}{n-1}). The corresponding population standard deviation is (\sqrt{\frac{\sum_i^n (x_i - \mu)^2}{n-1}}).

Note that R’s built-in var() and sd() give you the population estimate, using (n-1), by default, while numpy’s np.var and np.std divide by (n) unless you pass ddof=1. The (n-1) version is usually what you want, but make sure it doesn’t catch you out!

suppressPackageStartupMessages( library(tidyverse))
suppressPackageStartupMessages( library(glue)     )
theme_set(cowplot::theme_cowplot(font_size=18))
x = c(1,2,3,4,5)
pop.var    = function(x) sum((x - mean(x))^2) / (length(x)-1) # Replicates built-in var(x)
sample.var = function(x) sum((x - mean(x))^2) /  length(x)    # Divides by n, as defined above
pop.sd     = function(x) sqrt(pop.var(x))
sample.sd  = function(x) sqrt(sample.var(x))
var(x) # Built-in function
## [1] 2.5
pop.var(x)
## [1] 2.5
sample.var(x)
## [1] 2
sd(x)
## [1] 1.581139
pop.sd(x)
## [1] 1.581139
sample.sd(x)
## [1] 1.414214

1c. Covariance and Correlation #

Covariance measures the relationship between two variables, which we’ll call (x) and (y). First, note that we can write the variance of (x) as

$$ \sigma_x^2 = \frac{\sum_i^n (x_i - E(x))^2}{n} = \frac{\sum_i^n [(x_i - E(x))(x_i - E(x))]}{n} $$

The covariance between (x) and (y) is very similar

$$ cov(x, y) = \sigma_{x,y} = \frac{\sum_i^n [(x_i - E(x))(y_i - E(y))]}{n} $$

In words: for each pair of points ([x_i, y_i]) in ([x, y]), find the distance of each point from its mean, multiply these together, and then take the average across all pairs. This works because if (x_i) and (y_i) are both higher or both lower than average, ((x_i - E(x))(y_i - E(y))) will be positive. If one is higher and one lower, it will be negative.

Since covariance depends on the variance of (x) and (y), it’s useful to standardise it by dividing by (sd(x) \times sd(y)). This is the correlation coefficient.

$$ cor(x, y) = \rho_{xy} = \frac{cov(x, y)}{sd(x)sd(y)} = \frac{\sigma_{x,y}}{\sigma_x \sigma_y} $$

Here’s an example:

n = 5
x = c(-2, -1, 0, 1, 2)
y = x
mx = mean(x)
my = mean(y)
(xy.products = (x - mx)*(y - my))
## [1] 4 1 0 1 4
(xy.covariance = mean(xy.products))
## [1] 2
# Note we use the sample SD, not the population SD estimate (see above)
(xy.correlation = (xy.covariance / (sample.sd(x) * sample.sd(y))))
## [1] 1

We can flip this around to show that

$$
\begin{align}
cov(x, y) =& cor(x, y) \times sd(x) \times sd(y);\\
\sigma_{x,y} =& \rho_{xy}\sigma_x\sigma_y;
\end{align}
$$
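We can check this with the numbers from the example above:

xy.correlation * sample.sd(x) * sample.sd(y) # recovers the covariance, 2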

Covariance Matrices #

If we have a collection of variables, (D = [x, y, z]), we can put all this together to calculate the covariance matrix, (\bf{\Sigma}). In this, the diagonal elements are just the variances of each variable, (\bf{\Sigma_{i,i}} = \sigma^2_i)

$$
\begin{bmatrix}
\sigma^2_x & - & - \\
- & \sigma^2_y & - \\
- & - & \sigma^2_z \\
\end{bmatrix}
$$

while the off-diagonal elements are the covariances between different variables, (\bf{\Sigma_{i,j}} = \sigma_{ij} = \rho_{ij}\sigma_i\sigma_j)

$$
\begin{bmatrix}
- & \sigma_{xy} & \sigma_{xz} \\
\sigma_{yx} & - & \sigma_{yz} \\
\sigma_{zx} & \sigma_{zy} & - \\
\end{bmatrix}
$$

Putting this together, you get

$$
\bf{\Sigma} = \begin{bmatrix}
\sigma^2_x & \sigma_{xy} & \sigma_{xz} \\
\sigma_{yx} & \sigma^2_y & \sigma_{yz} \\
\sigma_{zx} & \sigma_{zy} & \sigma^2_z \\
\end{bmatrix}
$$

Phew.

You can use a covariance matrix to do stuff, but that’s beyond the scope of this post.
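That said, here’s a quick illustration with some made-up data: R’s built-in cov() and cor() functions compute the covariance and correlation matrices for all columns of a data frame at once (using the (n-1) denominator).

D = data.frame(x = c(1, 2, 3, 4, 5),
               y = c(2, 4, 5, 4, 5),
               z = c(5, 4, 3, 2, 1))
cov(D) # covariance matrix: variances on the diagonal, covariances off it
cor(D) # correlation matrix: 1s on the diagonal, correlations off it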

2. Fun Facts #

2a. Variance is Additive #

This one is useful. If the variance of (x) is (\sigma^2_x) and the variance of (y) is (\sigma^2_y), and (x) and (y) are uncorrelated, the variance of (x+y) and of (x-y) is (\sigma^2_x + \sigma^2_y).

$$ \sigma^2_{x+y} = \sigma^2_{x} + \sigma^2_{y} $$

Note that (x + y = [x_1 + y_1, x_2 + y_2, …, x_n + y_n]), etc.

Here’s a demonstration.

sample.with.variance = function(v, n=100){
  # Draw n random samples with exactly the given variance
  x = MASS::mvrnorm(n=n, mu=c(0), Sigma=matrix(v), empirical=T)
  return(c(x))
}
sample.with.correlation = function(var1, var2, corr=0, n=100){
  # Draw random 2D samples with given variances and correlation
  sd1 = sqrt(var1)
  sd2 = sqrt(var2)
  cov.matrix = matrix(c(var1, corr*sd1*sd2, corr*sd2*sd1, var2), nrow = 2)
  res = MASS::mvrnorm(n=n, mu=c(0, 0), Sigma=cov.matrix, empirical = T) %>% data.frame()
  names(res) = c('x', 'y')
  res
}

do.example = function(var1, var2, corr){
  xy = sample.with.correlation(var1=var1, var2=var2, corr=corr)
  x = xy$x
  y = xy$y
  glue::glue('Var(x) = {var(x)}') %>% print()
  glue::glue('Var(y) = {var(y)}') %>% print()
  glue::glue('cor(x, y) = {cor(x, y) %>% round(2)}') %>% print()
  glue::glue('Var(x + y) = {var(x + y) %>% round(2) }') %>% print()
  glue::glue('Var(x - y) = {var(x - y) %>% round(2)}') %>% print()
  cat('\n')
}

do.example(var1=5, var2=6, corr=0)
## Var(x) = 5
## Var(y) = 6
## cor(x, y) = 0
## Var(x + y) = 11
## Var(x - y) = 11

If the two samples are correlated, the formula is a little more complicated. If (x) and (y) are positively correlated, so that large values of (x) tend to be paired with large values of (y), adding them together makes the variance even greater:

$$ \sigma^2_{x+y} = \sigma^2_{x} + \sigma^2_{y} + 2\sigma_{x, y} $$

(remember, (\sigma_{x, y} = covariance(x, y))).

do.example(var1=5, var2=6, corr=+.5)
## Var(x) = 5
## Var(y) = 6
## cor(x, y) = 0.5
## Var(x + y) = 16.48
## Var(x - y) = 5.52

If they are negatively correlated, so large values of (x) tend to be paired with small values of (y), they tend to cancel out when you add them together, and the variance shrinks:

do.example(var1=5, var2=6, corr=-.5)
## Var(x) = 5
## Var(y) = 6
## cor(x, y) = -0.5
## Var(x + y) = 5.52
## Var(x - y) = 16.48

If you’re subtracting rather than adding, the rule is the other way around

$$ \sigma^2_{x-y} = \sigma^2_{x} + \sigma^2_{y} - 2 \times cov(x, y) $$

2b. Adding Standard Deviations #

All of this means that the standard deviation of (x+y) is

$$
\begin{align}
SD(x+y) =& \sigma_{x+y}\\
=& \sqrt{\sigma^2_{x+y}}\\
=& \sqrt{\sigma^2_{x} + \sigma^2_{y} + 2 \times cov(x, y)}
\end{align}
$$

When (x) and (y) are uncorrelated, this is just

$$ \sigma_{x+y} = \sqrt{\sigma^2_{x+y}} = \sqrt{\sigma^2_{x} + \sigma^2_{y}} $$

This looks familiar. Remember Pythagoras’ theorem?

$$ c = \sqrt{a^2 + b^2} $$

It turns out the reasons for this are kind of cool, but beyond the scope of this post. See this paper and this stackexchange post for more.
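In the meantime, here’s a quick numerical check, reusing sample.with.correlation() from above. The variances (9 and 16) are chosen so the standard deviations form a 3-4-5 triangle.

xy = sample.with.correlation(var1=9, var2=16, corr=0)
sd(xy$x)                     # 3
sd(xy$y)                     # 4
sd(xy$x + xy$y)              # 5, because sqrt(3^2 + 4^2) = 5
sqrt(var(xy$x) + var(xy$y))  # also 5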

2c. Standard Deviations Multiply and Divide #

The standard deviation of (x \times k), for a positive constant (k), is (k) times the standard deviation of (x):

$$ sd(k \times x) = \sigma_{kx} = k\times \sigma_x $$

This means that the variance is

$$ var(k \times x) = \sigma^2_{kx} = (k\times \sigma_x)^2 = k^2 \times \sigma_x^2 $$

If we multiply (x) by (k), we multiply its standard deviation by (k), and its variance by (k^2).

Division works in the same way:

$$
\begin{align}
sd(\frac{x}{k}) =& \sigma_{\frac{x}{k}} = \frac{\sigma_x}{k};\\
var(\frac{x}{k}) =& (\sigma_{\frac{x}{k}})^2 = (\frac{\sigma_x}{k})^2 = \frac{\sigma^2_x}{k^2};
\end{align}
$$

Here’s a demonstration.

x = sample.with.variance(9)
k = 4
# Standard deviations
sd(x)
## [1] 3
sd(x * k)
## [1] 12
k * sd(x)
## [1] 12
sd(x / k)
## [1] 0.75
sd(x) / k
## [1] 0.75
# Variances
var(x)
## [1] 9
var(k * x)
## [1] 144
k^2 * var(x)
## [1] 144
var(x / k)
## [1] 0.5625
var(x) / k^2
## [1] 0.5625

2d. Averaging reduces Variance #

The average of (x) and (y) is (\frac{x+y}{2}). If (x) and (y) are uncorrelated, the variance of the average is

$$ \sigma^2_{Mean(x,y)} = \sigma^2_{\frac{x+y}{2}} = \dfrac{\sigma^2_x + \sigma^2_y}{2^2} $$

and the standard deviation is

$$ \sigma_{Mean(x,y)} = \sigma_{\frac{x+y}{2}} = \sqrt{\sigma^2_{\frac{x+y}{2}}} = \dfrac{\sqrt{\sigma^2_x + \sigma^2_y}}{2} $$

If (\sigma^2 = \sigma^2_x = \sigma^2_y), this simplifies to

$$
\begin{align}
\sigma^2_{Mean(x,y)} =& \dfrac{2\sigma^2_x}{2^2} = \frac{1}{2}\sigma^2_x; \\
\sigma_{Mean(x,y)} =& \sqrt{\frac{1}{2}\sigma^2_x} = \frac{1}{\sqrt{2}}\sigma_x; \\
\end{align}
$$

In other words, when you average two uncorrelated variables with equal variance, you halve the variance, or divide the standard deviation by (\sqrt{2}).

If (x) and (y) are correlated, as before (\sigma^2_{\frac{x+y}{2}} = \frac{\sigma^2_x + \sigma^2_y + 2\times \sigma_{x,y}}{2^2}) and (\sigma_{\frac{x+y}{2}} = \frac{\sqrt{\sigma^2_x + \sigma^2_y + 2\times \sigma_{x,y}}}{2}), but let’s not worry about that too much.

xy = sample.with.correlation(16, 16, 0)
x = xy$x
y = xy$y
mean.xy = (x+y)/2

var(x)
## [1] 16
var(y)
## [1] 16
var(mean.xy)
## [1] 8
sd(x)
## [1] 4
sd(y)
## [1] 4
sd(mean.xy)
## [1] 2.828427
sd(x) / sqrt(2)
## [1] 2.828427

3. Variance and Uncertainty #

Variance is also used as a measure of uncertainty. In classical statistics, we can talk about the variance of an estimate, which is the squared standard error. In Bayesian statistics, we can talk about the variance of a posterior distribution for a parameter. In both cases, the higher the variance, the more uncertain we are. It’s often easier to use the precision, (\lambda = \frac{1}{\sigma^2}): the higher the precision, the more certain we are. There is a lot we can do with these values, but I’m going to talk about how we combine estimates.

3a. Precision is Additive #

If we average two estimates, (x) and (y), with equal precision, the precision of the mean is the sum of the precisions of the two estimates:

$$ \lambda_{Mean(x, y)} = \lambda_x + \lambda_y $$

precision = function(x) 1 / var(x)
xy = sample.with.correlation(16, 16, 0)
x = xy$x
y = xy$y
mean.xy = (x+y)/2

precision(x)
## [1] 0.0625
precision(y)
## [1] 0.0625
precision(mean.xy)
## [1] 0.125
1 / var(mean.xy)
## [1] 0.125

3b. Precision-Weighting #

If one estimate is more certain (has higher precision) than the other, you can do better than just averaging them. A better idea is to calculate a weighted average, giving more weight to the more certain estimate. This precision-weighting is an important part of Bayesian inference.

If estimate (\hat\theta_1) has precision (\lambda_1), and estimate (\hat\theta_2) has precision (\lambda_2), their precision-weighted average is

$$ \hat\theta_{1\&2} = \frac{\lambda_1\hat\theta_1 + \lambda_2\hat\theta_2}{\lambda_1 + \lambda_2} $$

and the precision of this combined estimate is, as above,

$$ \lambda_{1\&2} = \lambda_1 + \lambda_2 $$

Neato.
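Here’s a small sketch with made-up numbers: a noisy estimate of 10 (variance 4) and a more precise estimate of 14 (variance 1).

theta1 = 10; var1 = 4  # first estimate and its variance
theta2 = 14; var2 = 1  # second, more precise estimate
lambda1 = 1 / var1     # precision = 0.25
lambda2 = 1 / var2     # precision = 1
(lambda1 * theta1 + lambda2 * theta2) / (lambda1 + lambda2) # 13.2, pulled towards the more precise estimate
lambda1 + lambda2      # combined precision = 1.25, i.e. combined variance = 0.8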

4. Conclusions #

So, at this point hopefully you know some useful things about variance you didn’t know before now. Otherwise, you’ve wasted about 10 minutes, and you’ll never get them back. There are a few other relevant topics I considered covering, but thought better of. I might cover these in a future post:

  • ANOVA (Analysis of Variance): (\sigma^2_{Total} = \sigma^2_{Explained} + \sigma^2_{Unexplained}).
  • Variance and uncertainty of frequentist estimates.
  • Working with samples from posterior distributions.
  • Comparing and combining regression parameters (F-Contrasts).