The Light Cone

Where past and future meet at a point in spacetime

Welford's Way to Kalman City

Should you square differences or should you difference squares? Both are algebraically correct ways to compute incremental (co)variance, but their numerical properties differ, as John D. Cook points out here and here.

When you square numbers, you do two risky things: you make big numbers bigger and you make small numbers smaller. Together, these effects can be catastrophic. Suppose, for instance, that you’re working with positive numbers that look like $b+\epsilon$, where $b\geq 1$ and $0<\epsilon<1$. Squaring, you get $b^2+2b\epsilon+\epsilon^2$. The big part, $b$, gets bigger: $b^2$; and the small part, $\epsilon$, gets smaller: $\epsilon^2$. Everyone knows you shouldn’t add big numbers to little ones on computers. The big ones crowd out space for the little ones. You will eventually run out of dynamic range – the number of bits in your machine numbers – and your squares will evaluate to $b^2+2b\epsilon$ or even just to $b^2$. This is the dreaded loss of significance, often loosely called underflow.
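Here is a small single-precision illustration of that crowding out; the particular numbers, and the use of NumPy's float32, are my own choices, not anything from the articles above:

```python
import numpy as np

# float32 carries ~7 significant decimal digits. With b = 100 and eps = 1e-3,
# b + eps is representable just fine, but (b + eps)**2 = 10000.200001 needs
# more digits than float32 has: the eps**2 = 1e-6 part falls off the end of
# the mantissa, leaving roughly b**2 + 2*b*eps.
b, eps = np.float32(100.0), np.float32(1.0e-3)

x = b + eps                          # ~100.001, still fine in float32
squared_f32 = x * x                  # ~10000.2, the 1e-6 tail is gone
squared_f64 = (100.0 + 1.0e-3) ** 2  # 10000.200001 in double precision

print(squared_f32)
print(squared_f64)
print(squared_f64 - float(squared_f32))  # the information squaring threw away
```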

Remember those Mandelbrot-set-viewing programs? If you kept zooming, the picture would eventually get all soft and pillowy and you would never get to see more detail. That’s because you ran out of dynamic range. There wasn’t any room for all the heavenly beauty hiding in the $\epsilon^2$. It takes special techniques to keep going; see high-precision deep zoom and fractaljourney.

Now, when you try to subtract squares from one another, you get another opportunity to mess up. Everyone knows you shouldn’t subtract nearly equal numbers on computers and expect the little bits to contain good information (another John D. Cook article, here, elaborates). It’s another aspect of crowding out: the little bits you’re trying to recover by subtracting were never there in the first place.

But why would it make a difference to square first, then subtract, versus subtract first, then square? If you square first and then subtract, you’re subtracting extra-big numbers – bigger than you need. If you subtract first, you’re still exposed to crowding out, but less so. Squared numbers crowd out approximately twice the bits of non-squared numbers. You can avoid going to special techniques longer if you subtract first and then square.
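To watch the two orderings pull apart, here is a sketch of my own (data and setup invented for illustration) that computes a variance both ways in single precision, on data with a large mean and a small spread:

```python
import numpy as np

rng = np.random.default_rng(1)
# 10,000 samples near 10,000 with unit spread, held in float32.
x = (1.0e4 + rng.standard_normal(10_000)).astype(np.float32)

mean = x.mean(dtype=np.float32)

# Difference of squares: square first, subtract later.
var_diff_of_squares = (x * x).mean(dtype=np.float32) - mean * mean

# Squared differences: subtract first, square later.
var_squared_diffs = ((x - mean) ** 2).mean(dtype=np.float32)

print(var_diff_of_squares)           # typically far from 1.0: ~8 digits cancel
print(var_squared_diffs)             # close to the true variance of about 1.0
print(np.var(x.astype(np.float64)))  # double-precision reference
```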

This is rather a deep and important observation, especially for sequential estimation techniques like the Kalman filter and its many variations. These techniques rely on accurate, iterative propagation of covariance matrices. Losing even a little significance each round can render a final result useless, or worse: wrong without being obviously wrong.

As a result, a whole technology of avoiding squares has arisen in the filtering business. Bierman’s work is in my own lineage due to my time at the Jet Propulsion Laboratory, and searching for any of “Square-Root Information Filter,” “Square-Root Sigma-Point Information Filter,” “Square-Root Unscented Kalman Filter,” and so on will uncover a robust, contemporary literature on continuing research and applications of sophisticated methods for avoiding the squaring of machine numbers.

I can see a direct provenance of these techniques in Welford’s method, the best method cited by John above. Welford’s is a recurrence formula, giving the sum of $N+1$ squared residuals in terms of the sum of $N$ squared residuals, where the residuals are differences of data from their means. Since the means can be calculated incrementally (never requiring storage of all the data), Welford’s is also incremental, requiring only the storage of prior results. Welford’s formula is also pretty to look at and easy to remember.

Let $S_N$ be the sum of squared residuals of the first $N$ data:

$$S_N = \sum_{i=1}^{N}\left(x_i-\bar{x}_N\right)^2,$$

where $\bar{x}_N$ is the mean of the $N$ data $x_i,\,i\in [1..N]$:

$$\bar{x}_N = \frac{1}{N}\sum_{i=1}^{N}x_i.$$

Then Welford’s update formula is

$$S_{N+1} = S_N + \left(x_{N+1}-\bar{x}_N\right)\left(x_{N+1}-\bar{x}_{N+1}\right).$$

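In code, the whole recurrence fits in a few lines; here is a minimal sketch (the function and variable names are mine, not Welford's):

```python
def welford_update(n, mean, s, x_new):
    """One Welford step: fold datum number n+1 into the running (mean, S).

    n      -- count of data seen so far (N)
    mean   -- their running mean (x̄_N)
    s      -- their running sum of squared residuals (S_N)
    x_new  -- the incoming datum (x_{N+1})
    """
    n1 = n + 1
    delta = x_new - mean                    # x_{N+1} - x̄_N
    mean_new = mean + delta / n1            # incremental mean: x̄_{N+1}
    s_new = s + delta * (x_new - mean_new)  # S_{N+1} = S_N + (x_{N+1}-x̄_N)(x_{N+1}-x̄_{N+1})
    return n1, mean_new, s_new


# Usage: stream the data through; the sample variance is S_N / (N - 1).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n, mean, s = 0, 0.0, 0.0
for x in data:
    n, mean, s = welford_update(n, mean, s, x)
print(mean, s / (n - 1))   # mean ≈ 5.0, sample variance ≈ 4.571
```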
Generally, I prefer to remember the derivations of formulas along with the formulas. Doing so gives deeper, permanent understanding plus a chance to reuse parts of the derivation in other formulas. If I don’t remember the derivation, it’s a little like knowing a tune without knowing its chord progression and verse structure. You might still be able to play the tune, but not with as much conviction or creativity.

A brute-force expansion of the terms leads to a lot of messy algebra that all cancels out, proving the formula correct, but not yielding any insight into how the author may have seen the pretty and memorable form. An article at Planetmath.org gave me the key to the puzzle: write $S_{N+1}$ as a sum of the same terms as in $S_N$, but with a correction $\gamma$:

$$S_{N+1} = \sum_{i=1}^{N+1}\left(x_i-\bar{x}_{N+1}\right)^2 = \sum_{i=1}^{N+1}\left(x_i-\bar{x}_N+\gamma\right)^2,$$

where

$$\gamma = \bar{x}_N-\bar{x}_{N+1}.$$

Regroup the inner sum, then square:

$$S_{N+1} = \sum_{i=1}^{N+1}\left[\left(x_i-\bar{x}_N\right)^2+2\gamma\left(x_i-\bar{x}_N\right)+\gamma^2\right] = S_N+\left(x_{N+1}-\bar{x}_N\right)^2+2\gamma\left(x_{N+1}-\bar{x}_N\right)+\left(N+1\right)\gamma^2,$$

because

$$\sum_{i=1}^{N}\left(x_i-\bar{x}_N\right)=\sum_{i=1}^{N}x_i-N\,\bar{x}_N=0.$$

That’s enough for an algorithm, but not quite enough to get to the pretty, easy-to-remember formula (don’t get me started on the practical utility of beauty in mathematics – for another time!).
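For the record, that intermediate, not-yet-pretty form is already computable; here is my own transliteration of it, which should agree, up to roundoff, with the Welford sketch above:

```python
def update_with_gamma(n, mean, s, x_new):
    # The "enough for an algorithm" form:
    #   S_{N+1} = S_N + (x_{N+1}-x̄_N)^2 + 2γ(x_{N+1}-x̄_N) + (N+1)γ^2,
    # with γ = x̄_N - x̄_{N+1}. Algebraically identical to the pretty form,
    # just less tidy.
    n1 = n + 1
    delta = x_new - mean
    mean_new = mean + delta / n1
    gamma = mean - mean_new
    s_new = s + delta**2 + 2.0 * gamma * delta + n1 * gamma**2
    return n1, mean_new, s_new


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n, mean, s = 0, 0.0, 0.0
for x in data:
    n, mean, s = update_with_gamma(n, mean, s, x)
print(s)   # ≈ 32.0, the same sum of squared residuals as before
```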

We just need to show that

$$\left(x_{N+1}-\bar{x}_N\right)^2+2\gamma\left(x_{N+1}-\bar{x}_N\right)+\left(N+1\right)\gamma^2=\left(x_{N+1}-\bar{x}_N\right)\left(x_{N+1}-\bar{x}_{N+1}\right).$$

But

$$\bar{x}_{N+1}=\frac{N\,\bar{x}_N+x_{N+1}}{N+1}=\bar{x}_N+\frac{x_{N+1}-\bar{x}_N}{N+1},$$

so

$$\gamma=\bar{x}_N-\bar{x}_{N+1}=-\,\frac{x_{N+1}-\bar{x}_N}{N+1}.$$

Now, the term we need to analyze is

$$\left(x_{N+1}-\bar{x}_N\right)^2+2\gamma\left(x_{N+1}-\bar{x}_N\right)+\left(N+1\right)\gamma^2.$$

We already found that

$$\gamma=-\,\frac{x_{N+1}-\bar{x}_N}{N+1},$$

so we just look at

$$\left(x_{N+1}-\bar{x}_N\right)^2-\frac{2\left(x_{N+1}-\bar{x}_N\right)^2}{N+1}+\frac{\left(x_{N+1}-\bar{x}_N\right)^2}{N+1}=\left(x_{N+1}-\bar{x}_N\right)^2\left(1-\frac{1}{N+1}\right)=\left(x_{N+1}-\bar{x}_N\right)\left(x_{N+1}-\bar{x}_{N+1}\right).$$

Isn’t that fantastic?
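And, for good measure, here is a quick numerical check of my own that the recurrence really does reproduce the batch sum of squared residuals:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.standard_normal(1_000)

mean, s = 0.0, 0.0
for n, x in enumerate(xs, start=1):
    delta = x - mean
    mean += delta / n
    s += delta * (x - mean)     # Welford's update

batch = float(np.sum((xs - xs.mean()) ** 2))  # S_N computed all at once
print(abs(s - batch))                         # tiny: agreement to within roundoff
```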