Variance is a standard measure of spread of data. It may be obvious
that it’s easy to keep a running average of a stream of data, but it
may not be so obvious that it’s also easy to keep a running
variance. The method generalizes to multivariate data streams, where a
*covariance matrix* keeps track of the variances and mutual
covariances. We will eventually travel through
eigenvectors and eigenvalues to orient and decompose covariance
matrices.

## Mean and Variance

Consider a column $N$-vector of numerical data:

$$\mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_N \end{bmatrix}^T$$

For instance, here are some random data:

The *mean* or average value of this vector is a number: the sum of
the vector’s elements divided by the number $N$ of elements:

$$\bar{x}_N = \frac{1}{N}\sum_{i=1}^{N} x_i$$

for instance,

(we drop the Transpose superscript since the sum of a row vector is the same as the sum of a column vector).

If you have a running stream of data, it’s easy to maintain its average incrementally, meaning you don’t have to store the entire vector to compute the mean-so-far. As $N$ increases to $N+1$, note that

$$\bar{x}_{N+1} \;=\; \frac{1}{N+1}\sum_{i=1}^{N+1} x_i \;=\; \frac{\sum_{i=1}^{N} x_i + x_{N+1}}{N+1} \;=\; \frac{N\,\bar{x}_N + x_{N+1}}{N+1}$$

for instance, letting $x_6=66$ be a new datum arriving on the scene:

We see two algorithm choices for incrementally updating the average; informally:

- keep a **running sum** of the data, and when you need the new average, just divide by the current number $N$ (second line in the example immediately above)
- keep a **running average**, and when you need the new average, multiply the old average by the old count $N$, add the new datum $x_{N+1}$, and divide the result by the new count of elements $N+1$ (third line in the example above)

The first alternative may be cheaper since you don’t divide until you need the average. The second alternative has a certain mathematical elegance, but you must multiply and divide to get the new average.
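Both updates can be sketched in a few lines of Python (the function names are illustrative, not from the text; each generator yields the mean-so-far after every arriving datum):

```python
def running_sum_mean(data):
    """Running-sum variant: accumulate; divide only when the average is needed."""
    total = 0.0
    for n, x in enumerate(data, start=1):
        total += x
        yield total / n

def running_avg_mean(data):
    """Running-average variant: rescale the old mean by the old count on each arrival."""
    mean = 0.0
    for n, x in enumerate(data, start=1):
        mean = (mean * (n - 1) + x) / n
        yield mean
```

Both produce the same sequence of averages; they differ only in which quantity is carried between updates.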

With only a small variation in the algorithm, you can keep a *sliding-window* average over some fixed width $W$. Along with the running sum or average, keep the most recent $W$ data so you always know which datum is oldest. As a new datum arrives, subtract off the oldest one, add the newest one, and divide by the window width $W$. You will need special treatment until the window fills the first time, and possibly at the end if your stream of data is finite.

For instance, away from the ends, a width-$3$ moving average over the sample data can be kept as follows, imagining that we add incoming new values from the right of the window and subtract old values from the left of the window:
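A sketch of the windowed version, using a deque to remember which datum is oldest (names are mine; the generator simply yields nothing until the window fills):

```python
from collections import deque

def sliding_window_mean(stream, width):
    """Width-`width` moving average: add the newest datum on the right,
    subtract the oldest from the left, divide by the window width."""
    window = deque()
    total = 0.0
    for x in stream:
        window.append(x)
        total += x
        if len(window) > width:
            total -= window.popleft()  # subtract the datum leaving the window
        if len(window) == width:
            yield total / width
```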

Now, about variance. Given $N$ data, the sample variance is reckoned as follows (though below we apply Bessel’s correction to estimate the population variance):

$$s_N^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}_N\right)^2$$

If we let

$$\mathbf{1} = \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix}^T$$

be an $N$-vector of all $1$’s, and let

$$\tilde{\mathbf{x}}_N = \mathbf{x} - \bar{x}_N\,\mathbf{1}$$

denote the $N$-vector of *residuals* of the data from their mean, then the variance has the following beautiful expression as a scaled inner product:

$$s_N^2 = \frac{1}{N}\,\tilde{\mathbf{x}}_N^T\,\tilde{\mathbf{x}}_N$$

For instance, with our sample data

Expanding out the inner product, a form useful for incremental updating emerges:

$$s_N^2 = \frac{1}{N}\,\mathbf{x}_N^T\mathbf{x}_N - \bar{x}_N^2$$

For instance,

When a new datum $x_{N+1}$ arrives, update the variance in a way similar to the way we update the mean:

$$s_{N+1}^2 \;=\; \frac{1}{N+1}\,\mathbf{x}_{N+1}^T\mathbf{x}_{N+1} - \bar{x}_{N+1}^2 \;=\; \frac{\mathbf{x}_N^T\mathbf{x}_N + x_{N+1}^2}{N+1} - \bar{x}_{N+1}^2$$

For instance

This is really quite nice. Notice that

$$\mathbf{x}_{N+1}^T\,\mathbf{x}_{N+1} = \mathbf{x}_N^T\,\mathbf{x}_N + x_{N+1}^2$$

for any $N$, so $\mathbf{x}_N^T\mathbf{x}_N$ is a *running sum of squares*. We see that the variance is the running sum of squares divided by the current number $N$ of data, minus the square of the current average. This is our incremental formula for variance. The only serious hazard here is *catastrophic cancellation*: if the two subtracted quantities are of comparable magnitude, the difference can lose virtually all of its precision.
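The incremental recipe (a running count, a running sum, and a running sum of squares) can be sketched as a small class; the names are illustrative, and it computes the divide-by-$N$ sample variance described above:

```python
class RunningVariance:
    """Incremental mean and sample variance from a running count,
    running sum, and running sum of squares."""

    def __init__(self):
        self.n = 0
        self.s = 0.0   # running sum of the data
        self.ss = 0.0  # running sum of squares of the data

    def push(self, x):
        self.n += 1
        self.s += x
        self.ss += x * x

    @property
    def mean(self):
        return self.s / self.n

    @property
    def variance(self):
        # Sum of squares over N, minus the squared mean.  Beware catastrophic
        # cancellation when the two terms are of comparable magnitude.
        return self.ss / self.n - self.mean ** 2
```

Where cancellation is a real concern, a compensated scheme such as Welford's algorithm is the usual remedy, at the cost of a division per update.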

## Gaussian Distribution

From the data, we might want a formula for the Gaussian distribution that best fits the data. The best-fit Gaussian should use the unbiased estimator of the population variance, related to the sample variance we have been calculating, but with $N-1$ in the denominator instead of $N$ – Bessel’s correction. Let

$$\mu = \bar{x}_N, \qquad \sigma^2 = \frac{N}{N-1}\,s_N^2.$$

The univariate Gaussian or Normal distribution depends on $\mu$ and $\sigma$ and represents the probability that $x$ lies in the infinitesimal interval between $x$ and $x+dx$:

$$p(x)\,dx = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)dx$$

We can plot the Gaussian that best fits our sample data along with a histogram of our data and see a plausible match (we scaled up the Gaussian by the area of the histogram, a somewhat arbitrary number depending on the choice of histogram bin width):
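The density being plotted is cheap to evaluate directly; a minimal sketch (the function name `gaussian_pdf` is mine):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of the Normal distribution with mean mu and std deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
```

Evaluating this at each histogram bin center, times the histogram area, gives the scaled curve described above.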

To quantitatively test the goodness-of-fit, apply something like the Chi-Squared test or the Kolmogorov-Smirnov test.

## Multivariate Distributions

The particular form $\tilde{\mathbf{x}}_N^T\tilde{\mathbf{x}}_N$ goes over easily to an elegant form for multivariate data, where we keep variances and covariances in a matrix.

Consider a pair of column $N$-vectors of numerical data:

$$\mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_N \end{bmatrix}^T, \qquad \mathbf{y} = \begin{bmatrix} y_1 & y_2 & \cdots & y_N \end{bmatrix}^T$$

Plotting $y$’s against $x$’s, we might see something like the following:

Now, calculate the mean of each vector:

$$\bar{x}_N = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \bar{y}_N = \frac{1}{N}\sum_{i=1}^{N} y_i,$$

and the residuals:

$$\tilde{\mathbf{x}}_N = \mathbf{x} - \bar{x}_N\,\mathbf{1}, \qquad \tilde{\mathbf{y}}_N = \mathbf{y} - \bar{y}_N\,\mathbf{1}.$$

Now imagine a column of rows in an outer product with a row of columns:

$$\begin{bmatrix}\tilde{\mathbf{x}}_N^T \\ \tilde{\mathbf{y}}_N^T\end{bmatrix}\begin{bmatrix}\tilde{\mathbf{x}}_N & \tilde{\mathbf{y}}_N\end{bmatrix} = \begin{bmatrix}\tilde{\mathbf{x}}_N^T\tilde{\mathbf{x}}_N & \tilde{\mathbf{x}}_N^T\tilde{\mathbf{y}}_N \\ \tilde{\mathbf{y}}_N^T\tilde{\mathbf{x}}_N & \tilde{\mathbf{y}}_N^T\tilde{\mathbf{y}}_N\end{bmatrix}$$

Divide this by $N$ and call the result $S_N^2$:

$$S_N^2 = \frac{1}{N}\begin{bmatrix}\tilde{\mathbf{x}}_N^T\tilde{\mathbf{x}}_N & \tilde{\mathbf{x}}_N^T\tilde{\mathbf{y}}_N \\ \tilde{\mathbf{y}}_N^T\tilde{\mathbf{x}}_N & \tilde{\mathbf{y}}_N^T\tilde{\mathbf{y}}_N\end{bmatrix}$$

Each sub-product in the matrix can be computed incrementally as described above, and $\tilde{\mathbf{x}}_N^T\tilde{\mathbf{y}}_N = \tilde{\mathbf{y}}_N^T\tilde{\mathbf{x}}_N$, so it is cheap to keep a running value for the whole matrix.
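As a sketch, the running bivariate case just adds a running cross sum $\sum x_i y_i$ to the univariate bookkeeping (class and method names are mine):

```python
class RunningCovariance:
    """Running 2x2 covariance matrix (divide-by-N form) kept from
    running sums, sums of squares, and the cross sum."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = 0.0            # running sums
        self.sxx = self.syy = self.sxy = 0.0  # running sums of (cross) squares

    def push(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y  # the single cross term; x'y equals y'x

    def matrix(self):
        n = self.n
        mx, my = self.sx / n, self.sy / n
        cxy = self.sxy / n - mx * my
        return [[self.sxx / n - mx * mx, cxy],
                [cxy, self.syy / n - my * my]]
```

The off-diagonal entry is stored once and mirrored, exactly because the matrix is symmetric.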

Just as with the univariate Gaussian, we convert the sample covariance matrix into an unbiased estimator of the population covariance matrix by multiplying by $N$ and dividing by $N-1$. Let

$$\Sigma_N^2 = \frac{N}{N-1}\,S_N^2.$$

The bivariate Gaussian or normal distribution has such a beautiful form that it can be remembered without looking it up:

$$p(\mathbf{x})\,d\mathbf{x} = \frac{1}{2\pi\left|\Sigma^2\right|^{1/2}}\exp\left(-\frac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^T\,\Sigma^{-2}\,(\mathbf{x}-\boldsymbol{\mu})\right)d\mathbf{x}$$

where $\left|\Sigma^2\right|^{1/2}$ is the square root of the determinant of $\Sigma^2$ and $\Sigma^{-2}$ is the inverse of $\Sigma^2$. The formula generalizes to $D$ dimensions, in which case the coefficient $2\pi$ must be written $(2\pi)^{D/2}$.
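In the 2-D case the determinant and inverse are simple enough to write out by hand; here is a minimal sketch, with an illustrative function name:

```python
import math

def bivariate_gaussian_pdf(x, y, mu, cov):
    """Density of a bivariate Gaussian at (x, y); cov is the 2x2 matrix Sigma^2."""
    (a, b), (c, d) = cov
    det = a * d - b * c                  # |Sigma^2|
    inv = ((d / det, -b / det),
           (-c / det, a / det))          # Sigma^{-2}, the inverse of Sigma^2
    dx, dy = x - mu[0], y - mu[1]
    # quadratic form (x - mu)^T Sigma^{-2} (x - mu)
    q = (dx * (inv[0][0] * dx + inv[0][1] * dy)
         + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))
```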

I created the example cloud of points above by sampling 500 points from a bivariate Gaussian with mean point $\begin{bmatrix}\bar{x} & \bar{y}\end{bmatrix} = \begin{bmatrix}5 & 5\end{bmatrix}$ and the covariance matrix:

If we calculate $\Sigma_N^2=\Sigma_{500}^2$, the unbiased population covariance estimate, from the data, we get

plausibly close to the original (though a formal test, as mentioned, would be required to say quantitatively how close).

Later, we have to figure out exactly *how* to sample a multivariate
Gaussian like this. Hint: do the sampling in a new coordinate system
in which the $x$ and $y$ variables are uncorrelated, that is, a
coordinate system in which the principal axes of the cloud are aligned
with the $x$ and $y$ axes. That entails a Principal-Component Analysis
(PCA) of the covariance matrix or its inverse and their square roots,
which can be cheaply done with the Singular-Value Decomposition (SVD).