The Light Cone

Where past and future meet at a point in spacetime

Category Theory Reveals Value of Autoiconicity

Suddenly it hit me: Category Theory is to Math as Lisp is to Programming. I’ve always had an instinctive preference for autoiconic programming languages over non-autoiconic, and now I know why.

An autoiconic language is one in which the format for code and the format for data are identical. Add “quote” and you have instant metaprogramming built in. That means that when you write code, you’re really writing a DSL (Domain-Specific Language), always. With non-autoiconic languages – all the C’s, Javas, Algols, and Fortrans, that is, the mainstream – DSLs are a special case and require tons of enabling machinery: expression-object-models, serializers, lexers, yaccers, compilers, decompilers, metadata managers, and on and on. All standard or near standard, and all equally unnecessary, since none of it, at bottom, does any more or less than quote, quasi-quote, and unquote.

With Haskell and its ancestors Miranda, KRC, and SASL, and with Scala, F#, and all the ML’s, DSL programming is done via user-defined types. With these languages, DSL programming is not a special case, but remoting is still a special case requiring extra machinery. This is not a small thing, but really the crux of the argument. The real gold in the era of cloud programmability is DSL + remoting. I need to send expressions written in my DSLs to servers – for data affinity – and to clients – for data privacy. Why is remoting a special case? Because I don’t have quote and eval. Why not? I don’t know; it must be a conscious decision on the part of the language designers, because all these languages conspicuously lack quote and eval. Be that as it may…

Granted one might say “the difference is just a matter of degree.” True, but that’s like saying the difference between a rock and the entire Earth is just a matter of degree. That’s why the Lisps won’t die: it’s just plain easier to write remotable DSLs in Lisp, by orders of magnitude, and you always need remotable DSLs. And remotable DSLs are enabled by distributed metaprogramming platforms.

I argue that any sufficiently rich application will become a distributed metaprogramming platform eventually, and will need all that stuff. Especially on the web, where code-remoting is a necessity – confer the massive amounts of machinery and years of work behind .NET Expressions for IQueryable and friends. Beautiful, it’s true, because the thinking and the engineering behind them were first-rate and I was privileged to participate. But, if our mainstream languages had been Mathematica or some kinds of Lisp, remoting and rewriting would have been for-free; a “duh, obviously;” not even an afterthought; but a built-in assumption. Reflecting on the whole remotable-DSL-enabling infrastructure of .NET Expressions, it’s nothing but an implementation of quote, quasi-quote, and unquote for C#. All that work just to get into the mineshaft where the gold of remotable DSLs lies.
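To make “quote for C#” concrete, here is a small sketch using the public Expression API (my own illustration, not code from the original machinery): typing a lambda as a delegate compiles it, while typing the same lambda as an Expression "quotes" it into a syntax tree that can be inspected, rewritten, remoted, and later "unquoted" with Compile.

using System;
using System.Linq.Expressions;

class QuoteSketch
{
    static void Main()
    {
        // An ordinary lambda compiles straight to executable code (a delegate).
        Func<int, int> compiled = x => x * x + 1;

        // The same lambda, typed as an Expression, is "quoted":
        // the compiler hands us its syntax tree as data.
        Expression<Func<int, int>> quoted = x => x * x + 1;

        Console.WriteLine(quoted);               // x => ((x * x) + 1)
        Console.WriteLine(quoted.Body.NodeType); // Add

        // "Unquote": turn the data back into executable code on demand.
        Func<int, int> thawed = quoted.Compile();
        Console.WriteLine(thawed(5));            // 26
    }
}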

This entire screed is just an addendum to Greenspun’s Tenth Rule, my asserting that the specific thing that these “sufficiently complicated C or Fortran” programs strive to find in their “ad-hoc, informally specified, bug-ridden, slow implementations of half of Common Lisp” – the specific half they need – is the half that enables remotable DSLs, namely distributed, remotable metaprogramming via quote, quasi-quote, and unquote.

What does this have to do with Category Theory? Category Theory is a use of autoiconicity in mathematics: it uses the language of mathematics to explain mathematics. Every branch of mathematics is just a DSL of Category Theory.

All this philosophizing came about because Fogus, of “Joy of Clojure,” recently tweeted about SICP – Structure and Interpretation of Computer Programs – which is now influencing its second or third generation of developers. That reminded me of Philip Wadler’s critique of SICP and got me thinking about why I always thought Wadler’s argument was weak outside his pedagogical context: it did not note the trade-off of losing remotability by giving up autoiconicity (granted, when Wadler wrote, remoting was not center-stage the way it is now with the cloud). Then I remembered that John D. Cook, of “The Endeavour,” sent around a link to a deeply intriguing new book by David I. Spivak on category theory, and it hit me.

Welford’s Way to Kalman City

Should you square differences or should you difference squares? Both are algebraically correct ways to compute incremental (co)variance, but their numerical properties are different, as pointed out by John D. Cook here and here.

When you square numbers, you do two risky things: you make big numbers bigger and you make small numbers smaller. Together, these effects can be catastrophic. Suppose, for instance, that you’re working with positive numbers that look like $b+\epsilon$, where $b\geq 1$ and $0<\epsilon<1$. Squaring, you get $b^2+2b\epsilon+\epsilon^2$. The big part, $b$, gets bigger: $b^2$; and the small part, $\epsilon$, gets smaller: $\epsilon^2$. Everyone knows you shouldn’t add big numbers to little ones on computers. The big ones crowd out space for the little ones. You will eventually run out of dynamic range – the number of bits in your machine numbers – and your squares will evaluate to $b^2+2b\epsilon$ or even just to $b^2$. This is the dreaded underflow.

Remember those Mandelbrot-set-viewing programs? If you kept zooming, the picture would eventually get all soft and pillowy and you would never get to see more detail. That’s because you ran out of dynamic range. There wasn’t any room for all the heavenly beauty hiding in the $\epsilon^2$. It takes special techniques to keep going; see high-precision deep zoom and fractaljourney.

Now, when you try to subtract squares from one another, you get another opportunity to mess up. Everyone knows you shouldn’t subtract nearly equal numbers on computers and expect the little bits to contain good information (another John D. Cook article, here, elaborates). It’s another aspect of crowding out: the little bits you’re trying to recover by subtracting were never there in the first place.

But why would it make a difference to square first, then subtract, versus subtract first, then square? If you square first and then subtract, you’re subtracting extra-big numbers – bigger than you need. If you subtract first, you’re still exposed to crowding out, but less so. Squared numbers crowd out approximately twice the bits of non-squared numbers. You can avoid going to special techniques longer if you subtract first and then square.
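Here is a small illustration of the hazard in C# (my own toy example, not from the original post), using single-precision floats and made-up data so the damage shows up at small scale. The "difference of squares" route computes the variance as the mean of the squares minus the square of the mean; the "squared differences" route subtracts the mean first and squares second.

using System;
using System.Linq;

class CancellationDemo
{
    static void Main()
    {
        // Data with a large mean and a tiny spread: b + epsilon territory.
        float[] xs = { 10000.0f, 10000.1f, 10000.2f };

        float mean = xs.Sum() / xs.Length;

        // Square first, subtract last: E[x^2] - mean^2.
        float meanOfSquares = xs.Select(x => x * x).Sum() / xs.Length;
        float differenceOfSquares = meanOfSquares - mean * mean;

        // Subtract first, square second: E[(x - mean)^2].
        float squaredDifferences = xs.Select(x => (x - mean) * (x - mean)).Sum() / xs.Length;

        Console.WriteLine(differenceOfSquares); // garbage: the bits we wanted were crowded out
        Console.WriteLine(squaredDifferences);  // close to the true value, about 0.0067
    }
}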

This is rather a deep and important observation, especially for sequential estimation techniques like the Kalman filter and its many variations. These techniques rely on accurate, iterative propagation of covariance matrices. Losing even a little significance each round can make a final result useless, or worse: wrong in a way that is not obvious.

As a result, a whole technology of avoiding squares has arisen in the filtering business. Bierman’s work is in my own lineage due to my time at the Jet Propulsion Laboratory, and searching for any of “Square-Root Information Filter,” “Square-Root Sigma-Point Information Filter,” “Square-Root Unscented Kalman Filter,” and so on will uncover a robust, contemporary literature on continuing research and applications of sophisticated methods for avoiding the squaring of machine numbers.

I can see a direct provenance of these techniques in Welford’s method, the best method cited by John above. Welford’s is a recurrence formula, giving the sum of $N+1$ squared residuals in terms of the sum of $N$ squared residuals, where the residuals are differences of data from their means. Since the means can be calculated incrementally (never requiring storage of all the data), Welford’s is also incremental, requiring only the storage of prior results. Welford’s formula is also pretty to look at and easy to remember.

Let $S_N$ be the sum of squared residuals of the first $N$ data:

$$S_N = \sum_{i=1}^{N}\left(x_i - \bar{x}_N\right)^2,$$

where $\bar{x}_N$ is the mean of the $N$ data $x_i,\,i\in [1..N]$:

$$\bar{x}_N = \frac{1}{N}\sum_{i=1}^{N} x_i.$$

Then Welford’s update formula is

$$S_{N+1} = S_N + \left(x_{N+1}-\bar{x}_N\right)\left(x_{N+1}-\bar{x}_{N+1}\right).$$
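In code, the whole incremental computation is just a few lines; here is a sketch in C# (my own illustration of the formulas above, not code from the original post):

// Incremental mean and sum of squared residuals, Welford-style.
// After each Add, Mean is the running mean and S is the running
// sum of squared residuals; S / (Count - 1) estimates the variance.
public sealed class Welford
{
    public long Count { get; private set; }
    public double Mean { get; private set; }
    public double S { get; private set; }

    public void Add(double x)
    {
        Count += 1;
        double oldMean = Mean;
        Mean += (x - oldMean) / Count;       // incremental mean
        S += (x - oldMean) * (x - Mean);     // Welford's update
    }

    public double SampleVariance => Count > 1 ? S / (Count - 1) : 0.0;
}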

Generally, I prefer to remember the derivations of formulas along with the formulas. Doing so gives deeper, permanent understanding plus a chance to reuse parts of the derivation in other formulas. If I don’t remember the derivation, it’s a little like knowing a tune without knowing its chord progression and verse structure. You might still be able to play the tune, but not with as much conviction or creativity.

A brute-force expansion of the terms leads to a lot of messy algebra that all cancels out, proving the formula correct, but not yielding any insight into how the author may have seen the pretty and memorable form. An article at Planetmath.org gave me the key to the puzzle: write $S_{N+1}$ as a sum of the same terms as in $S_N$, but with a correction $\gamma$:

$$S_{N+1} = \sum_{i=1}^{N+1}\left(x_i - \bar{x}_{N+1}\right)^2 = \sum_{i=1}^{N+1}\left(x_i - \bar{x}_N + \gamma\right)^2,$$

where

$$\gamma = \bar{x}_N - \bar{x}_{N+1}.$$

Regroup the inner sum, then square:

$$S_{N+1} = \sum_{i=1}^{N+1}\left[\left(x_i - \bar{x}_N\right)^2 + 2\gamma\left(x_i - \bar{x}_N\right) + \gamma^2\right] = S_N + \left(x_{N+1}-\bar{x}_N\right)^2 + 2\gamma\left(x_{N+1}-\bar{x}_N\right) + \left(N+1\right)\gamma^2,$$

because

$$\sum_{i=1}^{N}\left(x_i - \bar{x}_N\right) = 0.$$

That’s enough for an algorithm, but not quite enough to get to the pretty, easy-to-remember formula (don’t get me started on the practical utility of beauty in mathematics – for another time!).

We just need to show that

$$\left(x_{N+1}-\bar{x}_N\right)^2 + 2\gamma\left(x_{N+1}-\bar{x}_N\right) + \left(N+1\right)\gamma^2 = \left(x_{N+1}-\bar{x}_N\right)\left(x_{N+1}-\bar{x}_{N+1}\right).$$

But

$$\bar{x}_{N+1} = \frac{N\,\bar{x}_N + x_{N+1}}{N+1} = \bar{x}_N + \frac{x_{N+1}-\bar{x}_N}{N+1},$$

so

$$\gamma = \bar{x}_N - \bar{x}_{N+1} = -\,\frac{x_{N+1}-\bar{x}_N}{N+1}.$$

Now, the term we need to analyze is

$$\left(x_{N+1}-\bar{x}_N\right)^2 + 2\gamma\left(x_{N+1}-\bar{x}_N\right) + \left(N+1\right)\gamma^2.$$

We already found that

$$\gamma = -\,\frac{x_{N+1}-\bar{x}_N}{N+1},$$

so we just look at

$$\left(x_{N+1}-\bar{x}_N\right)^2\left(1 - \frac{2}{N+1} + \frac{1}{N+1}\right) = \left(x_{N+1}-\bar{x}_N\right)^2\,\frac{N}{N+1} = \left(x_{N+1}-\bar{x}_N\right)\left(x_{N+1}-\bar{x}_{N+1}\right).$$

Isn’t that fantastic?

Incremental Covariance

Variance is a standard measure of spread of data. It may be obvious that it’s easy to keep a running average of a stream of data, but it may not be so obvious that it’s also easy to keep a running variance. The method generalizes to multivariate data streams, where the covariance matrix keeps track of variances and mutual covariances. We will eventually travel through eigenvectors and eigenvalues to orient and decompose covariance matrices.

Mean and Variance

Consider a column $N$-vector of numerical data:

$$\mathbf{x}_N = \begin{bmatrix}x_1 & x_2 & \cdots & x_N\end{bmatrix}^T.$$

For instance, here are some random data:

The mean or average value of this vector is a number: the sum of the vector’s elements divided by the number $N$ of elements:

$$\bar{x}_N = \frac{1}{N}\sum_{i=1}^{N}x_i;$$

for instance,

(we drop the Transpose superscript since the sum of a row vector is the same as the sum of a column vector).

If you have a running stream of data, it’s easy to maintain its average incrementally, meaning you don’t have to store the entire vector to compute the mean-so-far. As $N$ increases to $N+1$, note that

$$\bar{x}_{N+1} = \frac{1}{N+1}\sum_{i=1}^{N+1}x_i = \frac{\left(\sum_{i=1}^{N}x_i\right) + x_{N+1}}{N+1} = \frac{N\,\bar{x}_N + x_{N+1}}{N+1};$$

for instance, letting $x_6=66$ be a new datum arriving on the scene:

We see two algorithm choices for incrementally updating the average; informally:

  1. keep a running sum of the data, and when you need the new average, just divide by the current number $N$ (second line in the example immediately above)
  2. keep a running average, and when you need the new average, multiply the old average by the old count $N$, add the new datum $x_{N+1}$, and divide the result by the new count of elements $N+1$ (third line in the example above)

The first alternative may be cheaper since you don’t divide until you need the average. The second alternative has a certain mathematical elegance, but you must multiply and divide to get the new average.
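Either alternative is a few lines of code; here is a sketch of both (my own illustration in C#, not from the original post):

using System;

// Alternative 1: keep a running sum; divide only when the average is asked for.
class RunningMeanBySum
{
    double sum = 0.0;
    long count = 0;
    public void Add(double x) { sum += x; count += 1; }
    public double Mean => sum / count;
}

// Alternative 2: keep the average itself; fold each new datum in as it arrives.
class RunningMeanByAverage
{
    double mean = 0.0;
    long count = 0;
    public void Add(double x) { mean = (mean * count + x) / (count + 1); count += 1; }
    public double Mean => mean;
}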

With only a small variation in the algorithm, you can keep a sliding-window average over some fixed width $W$. Along with the running sum or average, keep the data currently in the window (a queue of at most $W$ values) so the oldest datum is at hand. As each new datum arrives, subtract off the oldest one, add the newest one, and divide by the window width $W$. You will need special treatment until the window fills the first time, and possibly at the end if your stream of data is finite.

For instance, away from the ends, a width-$3$ moving average over the sample data can be kept by adding incoming new values at the right of the window and subtracting old values off the left of the window.
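In code, the bookkeeping might look like the following sketch (my own illustration; the window is held in a queue so the oldest datum is always at hand):

using System;
using System.Collections.Generic;

class SlidingAverage
{
    readonly int width;
    readonly Queue<double> window = new Queue<double>();
    double sum = 0.0;

    public SlidingAverage(int width) { this.width = width; }

    // Add the newest datum on the right; once the window is full,
    // subtract off the oldest datum on the left. Returns the current average.
    public double Add(double x)
    {
        window.Enqueue(x);
        sum += x;
        if (window.Count > width)
            sum -= window.Dequeue();
        return sum / window.Count;  // before the window fills, average what we have
    }
}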

Now, about variance. Given $N$ data, the sample variance is reckoned as follows (though below we apply Bessel’s Correction to estimate population variance):

$$s_N^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}_N\right)^2.$$

If we let

$$\mathbf{1}_N = \begin{bmatrix}1 & 1 & \cdots & 1\end{bmatrix}^T$$

be an $N$-vector of all $1$’s, and let

$$\tilde{\mathbf{x}}_N = \mathbf{x}_N - \bar{x}_N\,\mathbf{1}_N$$

denote the $N$-vector of residuals of the data from their mean, then the variance has the following beautiful expression as the scaled inner product:

$$s_N^2 = \frac{1}{N}\,\tilde{\mathbf{x}}_N^T\tilde{\mathbf{x}}_N.$$

For instance, with our sample data

Expanding out the inner product, a form useful for incremental updating emerges:

$$s_N^2 = \frac{1}{N}\,\mathbf{x}_N^T\mathbf{x}_N - \bar{x}_N^2.$$

For instance,

When a new datum $x_{N+1}$ arrives, update the variance in a way similar to the way we update the mean:

$$s_{N+1}^2 = \frac{\mathbf{x}_N^T\mathbf{x}_N + x_{N+1}^2}{N+1} - \bar{x}_{N+1}^2.$$

For instance

This is really quite nice. Notice that

$$\mathbf{x}_{N+1}^T\mathbf{x}_{N+1} = \mathbf{x}_N^T\mathbf{x}_N + x_{N+1}^2$$

for any $N$, so it’s the running sum of squares. We see that the variance is the running sum of squares divided by the current number $N$ of data minus the square of the current average. This is our incremental formula for variance. The only serious hazard we see here is that of catastrophic cancellation: if the two quantities are of comparable magnitude, we can lose virtually all the precision.
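For concreteness, here is that running-sum-of-squares version as a C# sketch (my own illustration; when cancellation threatens, a Welford-style update is the safer choice):

using System;

class RunningVariance
{
    double sum = 0.0;          // running sum of the data
    double sumOfSquares = 0.0; // running sum of the squared data
    long count = 0;

    public void Add(double x)
    {
        sum += x;
        sumOfSquares += x * x;
        count += 1;
    }

    public double Mean => sum / count;

    // The incremental formula from the text: the running sum of squares
    // divided by N, minus the square of the running average.
    public double Variance => sumOfSquares / count - Mean * Mean;
}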

Gaussian Distribution

From the data, we might want a formula for the Gaussian distribution that best fits the data. The best-fit Gaussian should use the unbiased estimator for the population variance, related to the sample variance we have been calculating, but with $N-1$ in the denominator instead of $N$ – Bessel’s correction. Letting

$$\mu = \bar{x}_N \quad\text{and}\quad \sigma^2 = \frac{N}{N-1}\,s_N^2 = \frac{1}{N-1}\,\tilde{\mathbf{x}}_N^T\tilde{\mathbf{x}}_N,$$

the univariate Gaussian or Normal distribution depends on $\mu$ and $\sigma$ and represents the probability that $x$ might lie in the infinitesimal interval between $x$ and $x+dx$:

$$p(x)\,dx = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\left(-\,\frac{\left(x-\mu\right)^2}{2\sigma^2}\right)dx.$$

We can plot the Gaussian that best fits our sample data along with a histogram of our data and see a plausible match (we scaled up the Gaussian by the area of the histogram, a somewhat arbitrary number depending on the choice of histogram bin width):

To quantitatively test the goodness-of-fit, apply something like the Chi-Squared test or the Kolmogorov-Smirnov test.

Multivariate Distributions

The particular form $\mathbf{x}_N^T\mathbf{x}_N$ goes over easily to an elegant form for multivariate data, where we keep variances and covariances in a matrix.

Consider a pair of column $N$-vectors of numerical data:

$$\mathbf{x}_N = \begin{bmatrix}x_1 & \cdots & x_N\end{bmatrix}^T,\qquad \mathbf{y}_N = \begin{bmatrix}y_1 & \cdots & y_N\end{bmatrix}^T.$$

Plotting $y$’s against $x$’s, we might see something like the following:

Now, calculate the mean of each vector:

$$\bar{x}_N = \frac{1}{N}\sum_{i=1}^{N}x_i,\qquad \bar{y}_N = \frac{1}{N}\sum_{i=1}^{N}y_i,$$

and the residuals

$$\tilde{\mathbf{x}}_N = \mathbf{x}_N - \bar{x}_N\,\mathbf{1}_N,\qquad \tilde{\mathbf{y}}_N = \mathbf{y}_N - \bar{y}_N\,\mathbf{1}_N.$$

Now imagine a column of rows in an outer product with a row of columns:

$$\begin{bmatrix}\tilde{\mathbf{x}}_N^T \\ \tilde{\mathbf{y}}_N^T\end{bmatrix}\begin{bmatrix}\tilde{\mathbf{x}}_N & \tilde{\mathbf{y}}_N\end{bmatrix} = \begin{bmatrix}\tilde{\mathbf{x}}_N^T\tilde{\mathbf{x}}_N & \tilde{\mathbf{x}}_N^T\tilde{\mathbf{y}}_N \\ \tilde{\mathbf{y}}_N^T\tilde{\mathbf{x}}_N & \tilde{\mathbf{y}}_N^T\tilde{\mathbf{y}}_N\end{bmatrix}.$$

Divide this by $N$ and call the result $S_N^2$:

$$S_N^2 = \frac{1}{N}\begin{bmatrix}\tilde{\mathbf{x}}_N^T\tilde{\mathbf{x}}_N & \tilde{\mathbf{x}}_N^T\tilde{\mathbf{y}}_N \\ \tilde{\mathbf{y}}_N^T\tilde{\mathbf{x}}_N & \tilde{\mathbf{y}}_N^T\tilde{\mathbf{y}}_N\end{bmatrix}.$$

Each sub-product in the matrix can be computed incrementally as described above, plus $\tilde{\mathbf{x}}_N^T\tilde{\mathbf{y}}_N = \tilde{\mathbf{y}}_N^T\tilde{\mathbf{x}}_N$, so it is cheap to keep a running value for the matrix.
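As a sketch (my own illustration, not the post’s code), the bivariate bookkeeping can be done entirely from running sums, one per entry of the matrix:

using System;

class RunningCovariance2D
{
    double sumX, sumY, sumXX, sumYY, sumXY;
    long n;

    public void Add(double x, double y)
    {
        sumX += x; sumY += y;
        sumXX += x * x; sumYY += y * y; sumXY += x * y;
        n += 1;
    }

    // Each entry of S_N^2 is a mean of products minus a product of means,
    // updated incrementally exactly as in the univariate case.
    public double VarX  => sumXX / n - (sumX / n) * (sumX / n);
    public double VarY  => sumYY / n - (sumY / n) * (sumY / n);
    public double CovXY => sumXY / n - (sumX / n) * (sumY / n);
}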

Just as with the univariate Gaussian, we convert the sample covariance matrix into an unbiased estimator of the population covariance matrix by multiplying by $N$ and dividing by $N-1$. Let

$$\Sigma_N^2 = \frac{N}{N-1}\,S_N^2.$$

The bivariate Gaussian or normal distribution has such a beautiful form that it can be remembered without looking it up:

$$p\left(\mathbf{x}\right) = \frac{1}{2\pi\,\left|\Sigma^2\right|^{1/2}}\exp\left(-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}\right)^T\Sigma^{-2}\left(\mathbf{x}-\boldsymbol{\mu}\right)\right),$$

where $\left|\Sigma^2\right|^{1/2}$ is the square root of the determinant of $\Sigma^2$ and $\Sigma^{-2}$ is the inverse of $\Sigma^2$. The formula generalizes to $D$ dimensions, in which case the coefficient $2\pi$ must be written $(2\pi)^{D/2}$.

I created the example cloud of points above by sampling 500 points from a bivariate Gaussian formula at a mean point of $\begin{bmatrix}\bar{x} & \bar{y}\end{bmatrix} = \begin{bmatrix}5 & 5\end{bmatrix} $ and the covariance matrix:

If we calculate $\Sigma_N^2=\Sigma_{500}^2$, the unbiased population covariance estimate, from the data, we get

plausibly close to the original (though a formal test, as mentioned, would be required to say quantitatively how close).

Later, we have to figure out exactly how to sample a multivariate Gaussian like this (hints: do the sampling in a new coordinate system in which the $x$ and $y$ variables are uncorrelated – that is, a coordinate system in which the principal axes of the cloud are aligned with the $x$ and $y$ axes. That entails a Principal-Component Analysis (PCA) of the covariance matrix or its inverse and their square roots, which can be cheaply done with the Singular-Value Decomposition (SVD)).

Asynchronous Lazy Lists

SYNCHRONOUS LAZY LISTS

Lazy lists make it easy to write infinite data sets without doing infinite computations. For instance, the following static method in C# will generate a lazy list that serves up all the integers (actually, all the 32-bit integers in an infinite cycle, but we quibble):

A lazy list for an infinite stream of ints
public static LazyList<int> integers(int n)
{
    return new LazyList<int> {value = n, rest = () => integers(n + 1)};
}

Notice that an instance of this LazyList type has two properties (or fields, if you prefer): a value and a rest, which is a function (a lambda expression) that, when invoked, will produce the rest of the infinite stream. That’s the essence of the technique: keep the infinite stuff in a function, recursively, and pick out values only when you need them.

Next is my implementation of the LazyList class. Of course, I could use C#’s built-in Lazy type, but I am, on purpose, doing the details myself for deeper understanding and exploration. We’re going to morph these into task-based LazyLists that produce the downstream values asynchronously:

Implementation of the LazyList class
public class LazyList<T>
{
    public T value;
    public Func<LazyList<T>> rest;

    public T nth(int n)
    {
        return
            (n <= 1) ?
            value :
            rest().nth(n - 1);
    }
}

In addition to the properties already mentioned, I’ve included an nth method that walks up the lazy list in O(n) time to get the n-th item (without the tail-recursion optimization, this code will also consume O(n) space on the stack). This is not designed to fetch the n-th item directly, as is possible with some spigot algorithms, and it might be very interesting to explore spigots using lazy techniques in C#, but perhaps another time.

Next is a little more meat on the bones: the obligatory factorial:

A LazyList that produces factorials
public static LazyList<int> factorials(int n)
{
    return new LazyList<int>
    {
        value = (n <= 1) ? 1 : n * factorials(n - 1).value,
        rest = () => factorials(n + 1)
    };
}

This one uses recursion in the production of value. There are two uses of recursion: value recursively uses prior values, rest recursively produces future values. This is pretty neat. Both uses of recursion are completely synchronous and all this code runs on one thread.

We could go on to fibonaccis and more exotic streams; we could anonymize these for remoting, or memoize them for efficiency, as explained in an earlier post on this blog; but that’s not where we’re going today.

One last thing before going async: a bit of code that exercises the above, using LINQPad:

Exercising the LazyLists
void Main()
{
    var k1 = integers(1);
    // Expect 5
    k1.nth(5).Dump();

    var k2 = factorials(1);
    // Expect 120
    k2.nth(5).Dump();

    // A hint for what's coming up next
    AsyncMain();
}

Much more with synchronous lazy streams can be found in this wonderful paper: Music of the Streams by Doug McIlroy. But on to asynchrony for us.

ASYNCHRONOUS LAZY LISTS WITH TASK

Let’s start with ‘what-we-want’ in AsyncMain:

AsyncMain
public static async void AsyncMain()
{
    DumpThreadId("Async main, k3: ");
    var k3 = asyncIntegers(1);
    (await k3.nth(5)).Dump();

    DumpThreadId("Async main, k4: ");
    var k4 = asyncFactorials(1);
    (await k4.nth(5)).Dump();

    var z0 = (await Task.Run(() =>
    {   DumpThreadId("Anonymous task: ");
        Thread.Sleep(ran.Next(200));
        return 42;
    }));

    DumpThreadId("Async main, z0: ");
}

We made a new main because we needed it to be async and we can’t mark the regular Main as async. What we want, here, is to await the computation of asyncIntegers and asyncFactorials in the body of AsyncMain. This can cause various threads to jump around in our code, so we’re careful to print thread id’s after every await and in the body of every lambda expression that runs asynchronously in a task, via the following static method:

How to dump a thread id
public static void DumpThreadId(string msg = "")
{   System.Threading.Thread.CurrentThread.ManagedThreadId.Dump(msg + "Thread Id");
}

We’ll also insert a number of artificial random delays to try to entice the thread scheduler to send different threads through our code. None of these Sleeps are part of the algorithm at all; they’re here to induce interesting behavior.

Now look at the implementation of asyncIntegers:

asyncIntegers
public static AsyncList<int> asyncIntegers(int n)
{   return new AsyncList<int>
    {   value = n,
        rest = () =>
        {   DumpThreadId("AsyncIntegers, rest: ");
            Thread.Sleep(ran.Next(200));
            return asyncIntegers(n + 1);
        }
    };
}

Looks just like the implementation of lazy integers. The variable k3 = asyncIntegers(1) in AsyncMain must be of type AsyncList<int>. The deeper differences are in nth, the method we use to walk the lazy list:

AsyncList
public class AsyncList<T>
{   public T value;
    public Func<AsyncList<T>> rest;

    public async Task<T> nth(int n)
    {   return
            (n <= 1) ?
            await (Task.Run(() =>
            {   DumpThreadId("AsyncList, nth: ");
                return value;
            })) :
            await Task.Run(() =>
            {   Thread.Sleep(ran.Next(200));
                return rest().nth(n - 1);
            });
    }
}

We use only one facet of the multifaceted Task class. The rule we use is that Task.Run, when given a Func<T>, produces a Task<T>. Awaiting a Task<T> produces a T. So, except for the routing through Task, it looks like invoking a Func<T> to produce a T. This seems like a minimal intrusion of task-ness on our original lazy-list design.

Look at the first branch of nth, when n <= 1. We await a Task.Run of a Func<T> that produces value, so the result of the await is of type T, the type of value. This T gets returned through an async method, namely nth, so it gets converted back into a Task<T> on the way out. In the body of AsyncMain, where we are calling nth, we await this Task<T>, getting back a T in the end.

T’s become Task<T> when run under Task.Run or when returned from any async method that would otherwise produce a T. Task<T>’s become T’s when awaited (there’s a monad in here somewhere, but that’s for another time, too).
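Here is that round trip in miniature (a sketch of my own, not from the original script), written as an async method so the re-wrapping on the way out is visible:

using System;
using System.Threading.Tasks;

public static class RoundTrip
{
    public static async Task<int> Demo()
    {
        Func<int> thunk = () => 42;       // a plain Func<T> ...
        Task<int> task = Task.Run(thunk); // ... lifted into a Task<T> by Task.Run ...
        int result = await task;          // ... lowered back to a T by await ...
        return result;                    // ... and re-wrapped as Task<T> by the async method
    }
}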

On the other branch of nth, the new rest will walk up the lazy list just as before, only awaiting the value of rest().nth(n-1) to get a T, and returning it through the async method to get a Task<T> on the way out (EDIT: thanks to Brian Grunkmeyer for a bug fix here).

On both branches, nth produces a Task<T>, just what we need to await in the caller to get T’s.

Here is the more meaty async factorial, which doesn’t differ in any interesting way from the lazy-list factorial:

asyncFactorial
public static AsyncList<int> asyncFactorials(int n)
{   return new AsyncList<int>
    {   value = (n <= 1) ? 1 : n * asyncFactorials(n - 1).value,
        rest = () =>
        {   DumpThreadId("AsyncFactorials, rest: ");
            Thread.Sleep(ran.Next(200));
            return asyncFactorials(n + 1);
        }
    };
}

It may take a few tries to get a lot of thread variety: the thread-pool scheduler seems to take hints from your code and tries to reuse the same thread whenever it can. But eventually, you can get a trace like the following that shows various threads moving around these tasks.

Sample output trace
5
120
Async main, k3: Thread Id
16

AsyncIntegers, rest: Thread Id
22

AsyncIntegers, rest: Thread Id
37

AsyncIntegers, rest: Thread Id
37

AsyncIntegers, rest: Thread Id
37

AsyncList, nth: Thread Id
37

5
Async main, k4: Thread Id
37

AsyncFactorials, rest: Thread Id
37

AsyncFactorials, rest: Thread Id
37

AsyncFactorials, rest: Thread Id
37

AsyncFactorials, rest: Thread Id
37

AsyncList, nth: Thread Id
26

120
Anonymous task: Thread Id
37

Async main, z0: Thread Id
37

A gist for this LINQPad script can be found here: https://gist.github.com/4580491.

Fast Arbitrary Non-Uniform Randoms

You give me an array of integers representing the distribution of outcomes of some discrete random process (translation: you give me the specification of a loaded die: “I want 1 to come up 7/39 of the time; 2 to come up 5/39 of the time; I’d better NEVER see a 3 ‘cause I’m betting the farm on that; 4 to come up 11/39 of the time; 5 to come up 3/39 of the time; and 6 to come up 13/39 of the time.”). You also give me a uniform pseudo-random number generator (PRNG) like Mathematica’s RandomInteger. I give you back a new PRNG that honors the distribution you gave me, in the sense that, statistically, it generates outcomes with the same distribution as you gave me.

The problem is a practical, real-world problem. Any time you need to generate random numbers according to some arbitrary, given distribution – in simulation, in traversing a Bayesian network or a decision tree, in scheduling packet-retry backoffs, you name it – the problem comes up. In my work, it comes up every time I need to generate random expressions for testing parsers or for Monte-Carlo search for formulas. Most of the time, you want to generate expressions non-uniformly. For instance, you’d like to generate addition expressions more frequently than division expressions, or function calls more frequently than function definitions.

There are a lot of solutions to this problem. Probability mavens will immediately see that the problem boils down to inverting the CDF. We can do this by:

  • linear search – O(N) space, O(N) time, where N is the number of different outcomes – the number of elements in the original array – the number of sides on the loaded die – 6 in our example
  • binary search – O(N) space, O(log(N)) time – this works because the CDF is monotonically increasing
  • by constructing an explicit inverse in an array – O(S) space, O(1) time, where S is the sum of the numbers in the array – the total number of trials imputed in the original distribution – 39 in our example
  • by my favorite method: Walker’s method of aliases – O(N) space and O(1) time.

At first glance, O(N) space with O(1) time seems ridiculous, but there is a way. In a nutshell: paint a dartboard with colors in the given proportions, throw darts, look up the colors (thanks to Terence Spies for this beautiful idea).

Less metaphorically: imagine that the counts are colored balls distributed in bins:

  1. 7 balls of color number 1 in bin number 1
  2. 5 balls of color number 2 in bin number 2
  3. 0 balls of color number 3 in bin number 3
  4. 11 balls of color number 4 in bin number 4
  5. 3 balls of color number 5 in bin number 5
  6. 13 balls of color number six in bin number 6

Scale up the given counts of balls until the total is divisible by N: multiply every count by D/S, where D = lcm(N, S) = lcm(6, 39) = 78, so D/S = 2.

Now we have

  1. 14 balls of color number 1 in bin number 1
  2. 10 balls of color number 2 in bin number 2
  3. 0 balls of color number 3 in bin number 3
  4. 22 balls of color number 4 in bin number 4
  5. 6 balls of color number 5 in bin number 5
  6. 26 balls of color number six in bin number 6

Redistribute the balls amongst the bins so that every bin contains the same number of balls, namely D/N = 13, and so that the following conditions obtain:

  • every new bin contains at most 2 colors
  • you never take ALL the balls out of any bin – every bin contains at least one ball of its original color if it had any

There is a very elegant algorithm for this redistribution process. It’s a one-liner whose statement is a proof of its correctness. Think a minute, if you like, to see if you can come up with it, but here it is:

Fill up the shortest bin from the tallest bin; remove the new filled bin from the game and continue.

Do this procedure on a piece of paper:

  1. Take 13 balls from bin 6 leaving 13 balls; fill up and remove bin 3
  2. Take 7 balls from bin 4 leaving 15 balls; fill up and remove bin 5
  3. Take 3 balls from bin 4 leaving 12 balls; fill up and remove bin 2
  4. Take 1 ball from bin 1 leaving 13 balls; fill up and remove bin 4
  5. Bins 1 and 6 are left with 13 balls each; we’re done (mechanically, you can imagine taking 0 balls from bin 6, filling up bin 1; removing bin 1, leaving just one bin – bin 6).

You should end up with

  1. 13 balls of color 1 in bin 1 and no balls of any foreign color
  2. 10 balls of color 2 and 3 balls of color 4 in bin 2
  3. 0 balls of color 3 and 13 balls of color 6 in bin 3
  4. 12 balls of color 4 and 1 ball of color 1 in bin 4
  5. 6 balls of color 5 and 7 balls of color 4 in bin 5
  6. 13 balls of color 6 in bin 6 and no balls of any foreign color

Now, to generate new randoms, roll once to choose a bin RandomInteger[{1,N}], and roll again to choose a height RandomInteger[{1,D/N}]. If the randomly chosen height is less than or equal to the number of balls of the home color, return the number of the home color; otherwise, return the number of the foreign color.
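Here is a C# sketch of the whole construction (my own illustrative translation of the procedure above; the author’s implementation, linked below, is a Mathematica notebook). Outcomes are returned 0-indexed, and the greedy “fill the shortest from the tallest” loop builds the bins exactly as described:

using System;
using System.Collections.Generic;
using System.Linq;

public class LoadedDie
{
    readonly int[] homeCount;  // balls of the bin's own (home) color
    readonly int[] alias;      // the one foreign color allowed in the bin
    readonly int ballsPerBin;  // D / N
    readonly Random rng = new Random();

    public LoadedDie(int[] counts)
    {
        int n = counts.Length;
        long s = counts.Sum(c => (long)c);
        long d = Lcm(n, s);               // D = lcm(N, S)
        long scale = d / s;               // multiply every count by D/S
        ballsPerBin = (int)(d / n);       // every bin ends up holding D/N balls

        long[] balls = counts.Select(c => c * scale).ToArray();
        homeCount = new int[n];
        alias = new int[n];
        var open = Enumerable.Range(0, n).ToList();

        // Fill up the shortest bin from the tallest bin; retire the filled bin; repeat.
        while (open.Count > 0)
        {
            int shortest = open.OrderBy(i => balls[i]).First();
            int tallest = open.OrderBy(i => balls[i]).Last();
            homeCount[shortest] = (int)balls[shortest];
            alias[shortest] = tallest;
            balls[tallest] -= ballsPerBin - balls[shortest];
            open.Remove(shortest);
        }

        long Gcd(long a, long b) => b == 0 ? a : Gcd(b, a % b);
        long Lcm(long a, long b) => a / Gcd(a, b) * b;
    }

    // Roll once for a bin, roll again for a height; home color if the
    // height falls within the home balls, otherwise the foreign color.
    public int Next()
    {
        int bin = rng.Next(homeCount.Length);
        int height = rng.Next(ballsPerBin) + 1;
        return height <= homeCount[bin] ? bin : alias[bin];
    }
}

With the example above, new LoadedDie(new[] { 7, 5, 0, 11, 3, 13 }) builds six bins of 13 balls each, and Next() then returns outcome 3 (the zero-indexed fourth face) about 11/39 of the time, outcome 2 never, and so on.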

Nifty, eh? Here is a Mathematica notebook with an implementation tuned for clarity: https://dl.dropbox.com/u/1997638/FastNonUniformPseudoRandoms.cdf

Anonymous Memoizing Lazy Lists

A vastly prettier version of this blog can be found in the CDF file here: https://dl.dropbox.com/u/1997638/LazyLambda003.cdf. Wolfram’s free CDF reader is found at http://www.wolfram.com/cdf-player/.

One reason to care about anonymized computations is that naming things costs memory, and it’s the kind of memory that lasts from one computation to another – session memory. Naming things means writing definitions and storing them in tables for access in later computations. Building up big memoization tables as definitions can cost a LOT of memory. We can save this cost if we can avoid naming things, storing information in function parameters that only last the lifetime of a single computation.

NON-ANONYMOUS LAZY LISTS

Lazy lists are a handy way of expressing infinite data streams. A typical lazy list might look like the following:

In[1]:= integersFrom[n_] := {n, integersFrom[n + 1] &}

Calling integersFrom with some numerical argument produces a list in curly braces, the first element of which is that numerical argument and the second element of which is a delayed list in the form of a nullary function – a function of no parameters. That’s what the & means: “the expression to my left is a function.” We’re doing Lisp-style lists, which are pairs of values and lists.

The list in the second slot of the lazy list won’t be automatically evaluated – we must manually invoke the function that produces it when we want the results. That is the essence of lazy computation. Lazy languages like Haskell can decide when you need the results and just sidestep the explicit wrapping in a nullary function and the manual invocation. With eager or strict languages like Mathematica, we must do the invocations manually. But that’s good, because we can see more clearly what’s going on.

In our lazy lists, the second item will always be a nullary function.

Start peeling integers off such a structure one at a time. integersFrom[0] will be a lazy list of integers starting from 0, and [[1]] will pick the first element from the resulting list:

In[2]:= integersFrom[0][[1]]
Out[2]= 0

If we pick the second element from integersFrom[0], that will be a nullary function that produces the next lazy list. Manually invoke the function by appending [] – invocation brackets containing no arguments – and then pick off the first element of the result to get the next integer.

In[3]:= integersFrom[0][[2]][][[1]]
Out[3]= 1

And so on:

In[4]:= integersFrom[0][[2]][][[2]][][[1]]
Out[4]= 2

A pattern emerges that we can capture in some operators. The value operator just picks the first element of a lazy list, and next picks the second element – the nullary function – and invokes it on no arguments:

In[5]:= value[stream_] := stream[[1]];
next[stream_] := stream[[2]][];

In[7]:= value@integersFrom[0]
Out[7]= 0

In[8]:= value@next@integersFrom[0]
Out[8]= 1

In[9]:= value@next@next@integersFrom[0]
Out[9]= 2

Improve the value operator so we can ask for the n-th value, recursively:

In[10]:= ClearAll[value];
value[stream_, n_: 1] :=
 If[n <= 1,
  stream[[1]],
  value[next[stream], n - 1]]

In[12]:= value[integersFrom[1], 26]
Out[12]= 26

Let’s see a few values by mapping over a list of inputs:

In[13]:= value[integersFrom[1], #] & /@ {1, 2, 3, 4, 5, 10, 15, 20}
Out[13]= {1, 2, 3, 4, 5, 10, 15, 20}

We don’t need next any more – only value used it. Inline it in the body of value:

In[14]:= ClearAll[next, value];
value[stream_, n_: 1] :=
  If[n <= 1,
   stream[[1]],
   value[stream[[2]][], n - 1]];

As an aside, an efficient implementation of value requires either the tail-recursion optimization in the interpreter or a re-expression in iterative form, which can be done semi-automatically.

Write another lazy list as follows:

In[16]:= ClearAll[fibonaccis];
fibonaccis[n_] :=
 {If[n <= 1,
   1,
   value[fibonaccis[n - 1]] + value[fibonaccis[n - 2]]],
  fibonaccis[n + 1] &
  }

In[18]:= Timing[value[fibonaccis[1], 15]]
Out[18]= {0.024130, 987}

Of course, this is exponentially inefficient, as are all non-memoizing recursive fibonaccis. Fib of 15 is already painful, and fib of 150 is unthinkable. This can be mitigated as follows:

In[19]:= ClearAll[fibonaccis];
fibonaccis[n_] :=
 fibonaccis[n] =
  {If[n <= 1,
    1,
    value[fibonaccis[n - 1]] + value[fibonaccis[n - 2]]],
   fibonaccis[n + 1] &
   }

In[21]:= fibonaccis[#] & /@ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20}
Out[21]= {{1, fibonaccis[1 + 1] &}, {2, fibonaccis[2 + 1] &}, {3, 
  fibonaccis[3 + 1] &}, {5, fibonaccis[4 + 1] &}, {8, 
  fibonaccis[5 + 1] &}, {13, fibonaccis[6 + 1] &}, {21, 
  fibonaccis[7 + 1] &}, {34, fibonaccis[8 + 1] &}, {55, 
  fibonaccis[9 + 1] &}, {89, fibonaccis[10 + 1] &}, {987, 
  fibonaccis[15 + 1] &}, {10946, fibonaccis[20 + 1] &}}

In[22]:= Timing[value[fibonaccis[1], 150]]
Out[22]= {0.001997, 16130531424904581415797907386349}

This is a well-known Mathematica idiom for building a memo table by side effect. The memo table consists of explicit, specific point-values for fibonaccis, which the general fibonaccis – the one depending on a single parameter n_ – creates on-the-fly. Subsequent references to fibonaccis at known inputs will do a quick lookup on the specific values and avoid the exponential recomputation.

This is non-anonymous programming on steroids – everything is built up inside the named object fibonaccis, but it gives us great speed at the expense of memory. But this is global memory in Mathematica’s global brain. When we’re done with the calculation, we have modified the session state of Mathematica and not left things the way we found them. In some scenarios, this would not be allowed. We must find some other way to evaluate recursive formulas without requiring session state – using only ephemeral memory such as stack frames that go away when our result is achieved.

Let’s get rid of the named functions and then even get rid of the named memo table so we can have decent performance on our recursive evaluations.

ANONYMIZE

The mother of all anonymizers is the Y combinator:

In[23]:= Y = (subjectCode \[Function] (g \[Function] g@g)
     [precursor \[Function] subjectCode
       [n \[Function] precursor[precursor][n]]]);

Without going into how it works, Y is a function that takes another function as an argument. That other function takes an argument k and produces a function of n. The function of n can call k as if k were a recursive definition of the function of n. The function of n can be applied to workhorse arguments, that is, to arguments of the subject code.

In practice, Y is easy to use: just apply it to a function of k that produces a function of n that internally calls k, then apply the result to the desired n.

Here is the lazy list of integers, this time starting at 1, without using the name of the lazy list inside the definition of the lazy list: s1 does not refer to s1:

In[24]:= ClearAll[s1];
s1 = Y[k \[Function] n \[Function]
      {n, k[n + 1] &}
    ][1];

In[26]:= value[s1]
Out[26]= 1

We feed to Y a function of k. That function of k produces a function of n. That function of n produces a list in curly braces. The first element of that list is the value n, and the second element of that list is a delayed call of k on the next value, n+1. s1 is thus a lazy list of all the integers, just defined without reference to any names but parameters. No global storage needed, no session state.

Let’s map calls of value[s1,#]& over a sequence of inputs to concisely show it off multiple times:

In[27]:= value[s1, #] & /@ {1, 2, 3, 4, 26}
Out[27]= {1, 2, 3, 4, 26}

Now, on to an anonymized lazy list of fibonaccis:

In[28]:= ClearAll[s2];
s2 = Y[k \[Function] n \[Function]
      {If[n <= 1,
        1,
        value[k[n - 1]] + value[k[n - 2]]],
       k[n + 1] &}
    ][1];

This one is still slow:

In[30]:= Timing[value[s2, #] & /@ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20}]
Out[30]= {0.883336, {1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 987, 10946}}

It takes about a second to produce fib[20], and higher values become intolerable. We would never get to an absurd input like 150.

MEMOIZE

How can we speed it up without defining point values in some named object? By defining point values in an un-named object, of course! Un-named objects are best modeled in Mathematica as lists of rules, one for each point-like input. We’ll need to pass that around in the argument list of k, and the easiest way to do that is to make n, the argument of k, a regular Mathematica list to contain both a value and an anonymous dictionary.

We’ll get to this form in a few steps. First, just change n so that it’s a list rather than an integer. This change spreads a lot of indexing around the body of k, so we want to make sure we get that right before proceeding:

In[31]:= ClearAll[s3];
s3 = Y[k \[Function] n \[Function]
      {If[n[[1]] <= 1,
        {1},
        {value[k[{n[[1]] - 1}]][[1]] +
          value[k[{n[[1]] - 2}]][[1]]}
        ],
       k[{n[[1]] + 1}] &}
    ][{1}];

In[33]:= Timing[value[s3, #] & /@ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20}]
Out[33]= {1.145393, {{1}, {2}, {3}, {5}, {8}, {13}, {21}, {34}, {55}, {89}, {987}, \
{10946}}}

Now add a second element – an empty list – to n, but ignore it in the code, just to make sure we get all the shaping correct. We must modify six places where arguments to k are constructed:

In[34]:= ClearAll[s4];
s4 = Y[k \[Function] n \[Function]
      {If[n[[1]] <= 1,
        {1, {}},(* 
        modification 1 *)
        {value[k[{n[[1]] - 1, {}}]][[1]] + (* 
          modification 2 *)
          value[k[{n[[1]] - 2, {}}]][[1]],
         (* modification 3 *)
         {}} (* modification 4 *)
        ],
       k[{n[[1]] + 1, {}}] &} (* 
    modification 5 *)
    ][{1, {}}]; (* modification 6 *)

In[36]:= Timing[value[s4, #] & /@ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20}]
Out[36]= {1.214176, {{1, {}}, {2, {}}, {3, {}}, {5, {}}, {8, {}}, {13, {}}, {21, {}}, \
{34, {}}, {55, {}}, {89, {}}, {987, {}}, {10946, {}}}}

Now replace the internal constant empty lists with references to the second slot of n, where we will eventually store the dictionary:

In[37]:= ClearAll[s5];
s5 = Y[k \[Function] n \[Function]
      {If[n[[1]] <= 1,
        {1, n[[2]]},
        {value[k[{n[[1]] - 1, n[[2]]}]][[1]] +
          value[k[{n[[1]] - 2, n[[2]]}]][[1]],
         n[[2]]}
        ],
       k[{n[[1]] + 1, n[[2]]}] &}
    ][{1, {}}];

In[39]:= Timing[value[s5, #] & /@ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20}]
Out[39]= {1.308497, {{1, {}}, {2, {}}, {3, {}}, {5, {}}, {8, {}}, {13, {}}, {21, {}}, \
{34, {}}, {55, {}}, {89, {}}, {987, {}}, {10946, {}}}}

We’re now ready to promote that second part of n into a dictionary – same as anonymous object, same as list of rules – for fast lookups. Let’s take the convention that if a key is not present in the dictionary, we produce Null, and make a few helper functions:

In[40]:= ClearAll[lookup, add];
lookup[dict_, key_] := key /. dict;
add[dict_, key_, value_] :=
  With[{trial = lookup[dict, key]},
   If[trial =!= Null,
    dict,
    If[Head[dict] === Dispatch,
     Dispatch[Prepend[dict[[1]], key -> value]],
     Dispatch[Prepend[dict, key -> value]]]]];

The function add converts the dictionary into a hash table by applying the Mathematica built-in Dispatch to it. This is similar to what relational databases do when computing joins: create a temporary hash table and use it. Before adding a new rule to the table, add must check whether the input table is already a hash table, since the original list of rules is stored in slot 1 of a hash table. add can only add a rule to a list of rules, not to a hash table, but it can regenerate a new hash table in linear time. This is a potential hidden quadratic in this code, since the hash table will be created over and over again. It does not seem to be a serious problem here, likely overwhelmed by other overheads.

Now add the code to store computed values in the anonymous object. We must modify the initial input from {1, {}} to {1, {_->Null}} so that the starting dictionary has the default rule.

In[43]:= ClearAll[s6];
s6 = Y[k \[Function] n \[Function]
      Module[{valToStore, dict = n[[2]]},
       {If[n[[1]] <= 1,
         {1, dict},
         valToStore =
          value[k[{n[[1]] - 1, dict}]][[1]] +
           value[k[{n[[1]] - 2, dict}]][[1]];
         dict = add[dict, n[[1]], valToStore];
         {valToStore, dict}
         ],
        k[{n[[1]] + 1, dict}] &}]
    ][{1, {_ -> Null}}];

In[45]:= Timing[value[s6, #] & /@ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20}]

Out[45]= {6.912630,
{{1,{_->Null}},
{2,{2->2,_->Null}},
{3,{3->3,2->2,_->Null}},
{5,{4->5,3->3,2->2,_->Null}},
{8,Dispatch[{5->8,4->5,3->3,2->2,_->Null},-DispatchTables-]},
{13,Dispatch[{6->13,5->8,4->5,3->3,2->2,_->Null},-DispatchTables-]},
{21,Dispatch[{7->21,6->13,5->8,4->5,3->3,2->2,_->Null},-DispatchTables-]},
{34,Dispatch[{8->34,7->21,6->13,5->8,4->5,3->3,2->2,_->Null},-DispatchTables-]},
{55,Dispatch[{9->55,8->34,7->21,6->13,5->8,4->5,3->3,2->2,_->Null},-DispatchTables-]},
{89,Dispatch[{10->89,9->55,8->34,7->21,6->13,5->8,4->5,3->3,2->2,_->Null},-DispatchTables-]},
{987,Dispatch[{15->987,14->610,13->377,12->233,11->144,10->89,9->55,8->34,7->21,6->13,5->8,4->5,3->3,2->2,_->Null},-DispatchTables-]},
{10946,Dispatch[{20->10946,19->6765,18->4181,17->2584,16->1597,15->987,14->610,13->377,12->233,11->144,10->89,9->55,8->34,7->21,6->13,5->8,4->5,3->3,2->2,_->Null},-DispatchTables-]}}}

Now add code to do lookups:

In[46]:= ClearAll[s7];
s7 = Y[k \[Function] n \[Function]
      Module[{valToStore, dict = n[[2]]},
       {If[n[[1]] <= 1,
         {1, dict},
         valToStore =
          With[{
            v1 = lookup[dict, n[[1]] - 1],
            v2 = lookup[dict, n[[1]] - 2]},
           If[v1 =!= Null, v1,
             value[k[{n[[1]] - 1, dict}]][[1]]] +
            If[v2 =!= Null, v2,
             value[k[{n[[1]] - 2, dict}]][[1]]]];
         dict = add[dict, n[[1]], valToStore];
         {valToStore, dict}
         ],
        k[{n[[1]] + 1, dict}] &}]
    ][{1, {_ -> Null}}];

Now our regular example is much faster and an otherwise inconceivable computation such as fib[150] can be done in a reasonable time.

In[48]:= Timing[
 value[s7, #][[1]] & /@ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20}]
Out[48]= {0.068064, {1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 987, 10946}}

In[49]:= Block[{$RecursionLimit = 2000}, Timing[value[s7, 150][[1]]]]
Out[49]= {1.308640, 16130531424904581415797907386349}

CONCLUSION

Many improvements can be made to this, such as getting rid of the mutable variables in the Module by a monadic bind and refactoring the code for readability. However, we have shown a general technique for creating lazy memoizing lists without introducing names or session state into an evaluator or interpreter.