A homoiconic language is one in which the format for code and the format for data are identical. Add "quote" and you have instant metaprogramming built in. That means that when you write code, you're really writing a DSL (Domain-Specific Language), always. With non-homoiconic languages, e.g., all the C's, Javas, Algols, and Fortrans (that is, the mainstream), DSLs are a special case and require tons of enabling machinery like expression object models, serializers, lexers, yaccers, compilers, decompilers, metadata managers, and on and on. All standard or near-standard, and all equally unnecessary, since none of it at bottom does any more nor less than quote, quasiquote, and unquote.
With Haskell and its ancestors Miranda, KRC, and SASL, and with Scala, F#, and all the MLs, DSL programming is done via user-defined types. With these languages, DSL programming is not a special case, but remoting still is, requiring extra machinery. This is not a small thing, but really the crux of the argument. The real gold in the era of cloud programmability is DSL + remoting. I need to send expressions written in my DSLs to servers – for data affinity – and to clients – for data privacy. Why is remoting a special case? Because I don't have quote and eval. Why not? This, I don't know. It must be a conscious decision on the part of language designers, because all these languages conspicuously lack quote and eval. Be that as it may…
Granted, one might say "the difference is just a matter of degree." True, but that's like saying the difference between a rock and the entire Earth is just a matter of degree. That's why the Lisps won't die: it's just plain easier to write remotable DSLs in Lisp, by orders of magnitude, and you always need remotable DSLs. And remotable DSLs are enabled by distributed metaprogramming platforms.
I argue that any sufficiently rich application will become a distributed metaprogramming platform eventually, and will need all that stuff. Especially on the web, where code remoting is a necessity – confer the massive amounts of machinery and years of work behind .NET Expressions for IQueryable and friends. Beautiful, it's true, because the thinking and the engineering behind them were first-rate, and I was privileged to participate. But if our mainstream languages had been Mathematica or some kind of Lisp, remoting and rewriting would have been for free; a "duh, obviously;" not even an afterthought, but a built-in assumption. Reflecting on the whole remotable-DSL-enabling infrastructure of .NET Expressions, it's nothing but an implementation of quote, quasiquote, and unquote for C#. All that work just to get into the mineshaft where the gold of remotable DSLs lies.
This entire screed is just an addendum to Greenspun's Tenth Rule, my asserting that the specific thing that these "sufficiently complicated C or Fortran" programs strive to find in their "ad-hoc, informally specified, bug-ridden, slow implementations of half of Common Lisp" – the specific half they need – is the half that enables remotable DSLs, namely distributed, remotable metaprogramming via quote, quasiquote, and unquote.
What does this have to do with Category Theory? Category Theory is mathematics' own use of homoiconicity: it uses the language of mathematics to explain mathematics. Every branch of mathematics is just a DSL of Category Theory.
All this philosophizing came about because Fogus, of "The Joy of Clojure," recently tweeted about SICP – Structure and Interpretation of Computer Programs – which is now influencing its second or third generation of developers. That reminded me of Philip Wadler's critique of SICP, and got me thinking about why I always thought that, outside of his pedagogical context, Wadler's argument was weak: specifically, because it did not note the tradeoff of losing remotability by giving up on homoiconicity – granting that when Wadler wrote, remoting was not center stage the way it is now with the cloud. Then I remembered that John D. Cook, of "The Endeavour," sent around a link to a deeply intriguing new book by David I. Spivak on category theory, and it hit me.
When you square numbers, you do two risky things: you make big numbers bigger and you make small numbers smaller. Together, these effects can be catastrophic. Suppose, for instance, that you're working with positive numbers that look like $b+\epsilon$, where $b\geq 1$ and $0<\epsilon<1$. Squaring, you get $b^2+2b\epsilon+\epsilon^2$. The big part, $b$, gets bigger: $b^2$; and the small part, $\epsilon$, gets smaller: $\epsilon^2$. Everyone knows you shouldn't add big numbers to little ones on computers. The big ones crowd out space for the little ones. You will eventually run out of dynamic range – the number of bits in your machine numbers – and your squares will evaluate to $b^2+2b\epsilon$ or even just to $b^2$: the small contribution is absorbed, lost off the bottom of the machine word.
Remember those Mandelbrot-set-viewing programs? If you kept zooming, the picture would eventually get all soft and pillowy and you would never get to see more detail. That's because you ran out of dynamic range. There wasn't any room for all the heavenly beauty hiding in the $\epsilon^2$. It takes special techniques to keep going; see high-precision deep zoom and Fractal Journey.
Now, when you try to subtract squares from one another, you get another opportunity to mess up. Everyone knows you shouldn’t subtract nearly equal numbers on computers and expect the little bits to contain good information (another John D. Cook article, here, elaborates). It’s another aspect of crowding out: the little bits you’re trying to recover by subtracting were never there in the first place.
But why would it make a difference to square first, then subtract, versus subtract first, then square? If you square first and then subtract, you're subtracting extra-big numbers – bigger than you need. If you subtract first, you're still exposed to crowding out, but less so. Squared numbers crowd out approximately twice the bits of non-squared numbers. You can avoid going to special techniques longer if you subtract first and then square.
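To see the bit-crowding concretely, here is a small Python check (the numbers are chosen for illustration): computing $a^2-b^2$ by squaring first loses the low-order bit that the algebraically equal factored form $(a-b)(a+b)$ preserves, because $a^2$ needs more bits than a double's 53-bit significand.

```python
# Difference of squares: square-then-subtract vs. subtract-then-multiply.
# a, b, and the true answer 2e8 + 1 are all exactly representable,
# but a*a overflows the 53-bit significand and must round.
a = 1e8 + 1
b = 1e8

squared_first = a * a - b * b      # a*a rounds before the subtraction
factored = (a - b) * (a + b)       # every intermediate here is exact

print(squared_first, factored)     # the factored form gives 200000001.0
```

The factored form subtracts first, keeping every intermediate result inside the exactly representable range; the squared-first form has already rounded away the final `1` before the subtraction happens.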
This is rather a deep and important observation, especially for sequential estimation techniques like the Kalman filter and its many variations. These techniques rely on accurate, iterative propagation of covariance matrices. Losing even a little significance each round can make a final result useless or worse, i.e., not obviously wrong.
As a result, a whole technology of avoiding squares has arisen in the filtering business. Bierman's work is in my own lineage due to my time at the Jet Propulsion Laboratory, and searching for any of "Square-Root Information Filter," "Square-Root Sigma-Point Information Filter," "Square-Root Unscented Kalman Filter," and so on will uncover a robust, contemporary literature on continuing research and applications of sophisticated methods for avoiding the squaring of machine numbers.
I can see a direct provenance of these techniques in Welford’s method, the best method cited by John above. Welford’s is a recurrence formula, giving the sum of $N+1$ squared residuals in terms of the sum of $N$ squared residuals, where the residuals are differences of data from their means. Since the means can be calculated incrementally (never requiring storage of all the data), Welford’s is also incremental, requiring only the storage of prior results. Welford’s formula is also pretty to look at and easy to remember.
Let $S_N$ be the sum of squared residuals of the first $N$ data:

$$S_N=\sum_{i=1}^N\left(x_i-\bar{x}_N\right)^2$$

where $\bar{x}_N$ is the mean of the $N$ data $x_i,\,i\in [1..N]$:

$$\bar{x}_N=\frac{1}{N}\sum_{i=1}^N x_i$$

Then Welford's update formula is

$$S_{N+1}=S_N+\left(x_{N+1}-\bar{x}_N\right)\left(x_{N+1}-\bar{x}_{N+1}\right)$$
Generally, I prefer to remember the derivations of formulas along with the formulas. Doing so gives deeper, permanent understanding plus a chance to reuse parts of the derivation in other formulas. If I don’t remember the derivation, it’s a little like knowing a tune without knowing its chord progression and verse structure. You might still be able to play the tune, but not with as much conviction or creativity.
A brute-force expansion of the terms leads to a lot of messy algebra that all cancels out, proving the formula correct, but not yielding any insight into how the author may have seen the pretty and memorable form. An article at PlanetMath.org gave me the key to the puzzle: write $S_{N+1}$ as a sum of the same terms as in $S_N$, but with a correction $\gamma$:

$$S_{N+1}=\sum_{i=1}^{N+1}\left(x_i-\bar{x}_N-\gamma\right)^2$$

where

$$\gamma=\bar{x}_{N+1}-\bar{x}_N=\frac{x_{N+1}-\bar{x}_N}{N+1}$$

Regroup the inner sum, then square:

$$S_{N+1}=\sum_{i=1}^{N+1}\left(\left(x_i-\bar{x}_N\right)-\gamma\right)^2=\sum_{i=1}^{N+1}\left(x_i-\bar{x}_N\right)^2-2\gamma\sum_{i=1}^{N+1}\left(x_i-\bar{x}_N\right)+\left(N+1\right)\gamma^2$$

because

$$\sum_{i=1}^{N}\left(x_i-\bar{x}_N\right)=0\quad\Longrightarrow\quad\sum_{i=1}^{N+1}\left(x_i-\bar{x}_N\right)=x_{N+1}-\bar{x}_N=\left(N+1\right)\gamma$$

so that

$$S_{N+1}=S_N+\left(x_{N+1}-\bar{x}_N\right)^2-\left(N+1\right)\gamma^2$$

That's enough for an algorithm, but not quite enough to get to the pretty, easy-to-remember formula (don't get me started on the practical utility of beauty in mathematics – for another time!).

We just need to show that

$$\left(x_{N+1}-\bar{x}_N\right)^2-\left(N+1\right)\gamma^2=\left(x_{N+1}-\bar{x}_N\right)\left(x_{N+1}-\bar{x}_{N+1}\right)$$

But

$$\bar{x}_{N+1}=\bar{x}_N+\gamma$$

so

$$x_{N+1}-\bar{x}_{N+1}=\left(x_{N+1}-\bar{x}_N\right)-\gamma$$

Now, the term we need to analyze is

$$\left(x_{N+1}-\bar{x}_N\right)\left(\left(x_{N+1}-\bar{x}_N\right)-\gamma\right)=\left(x_{N+1}-\bar{x}_N\right)^2-\gamma\left(x_{N+1}-\bar{x}_N\right)$$

We already found that

$$x_{N+1}-\bar{x}_N=\left(N+1\right)\gamma$$

so we just look at

$$\gamma\left(x_{N+1}-\bar{x}_N\right)=\left(N+1\right)\gamma^2$$

which is exactly the term we needed.
Isn’t that fantastic?
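The derivation above translates directly into code. Here is a minimal Python sketch (the names `welford` and `two_pass` are illustrative, not from any library) that carries the mean and $S_N$ forward one datum at a time and checks the result against a direct two-pass computation:

```python
def welford(data):
    """Incrementally maintain (mean, sum of squared residuals), one datum at a time."""
    mean, s = 0.0, 0.0
    for n, x in enumerate(data, start=1):
        old_mean = mean
        mean += (x - old_mean) / n            # xbar_{N+1} = xbar_N + (x - xbar_N)/(N+1)
        s += (x - old_mean) * (x - mean)      # S_{N+1} = S_N + (x - xbar_N)(x - xbar_{N+1})
    return mean, s

def two_pass(data):
    """Reference: compute the mean, then sum the squared residuals directly."""
    m = sum(data) / len(data)
    return m, sum((x - m) ** 2 for x in data)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(welford(data))   # mean 5.0, sum of squared residuals 32.0
```

Note that the incremental version never subtracts squared quantities, which is the whole point: it stays accurate where the sum-of-squares-minus-squared-mean formula loses bits.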
Consider a column $N$-vector of numerical data:

$$\mathbf{x}=\begin{bmatrix}x_1 & x_2 & \cdots & x_N\end{bmatrix}^T$$
For instance, here are some random data:
The mean or average value of this vector is a number: the sum of the vector's elements divided by the number $N$ of elements:

$$\bar{x}=\frac{1}{N}\sum_{i=1}^{N}x_i$$
for instance,
(we drop the Transpose superscript since the sum of a row vector is the same as the sum of a column vector).
If you have a running stream of data, it's easy to maintain its average incrementally, meaning you don't have to store the entire vector to compute the mean-so-far. As $N$ increases to $N+1$, note that

$$\bar{x}_{N+1}=\frac{N\,\bar{x}_N+x_{N+1}}{N+1}=\bar{x}_N+\frac{x_{N+1}-\bar{x}_N}{N+1}$$
for instance, letting $x_6=66$ be a new datum arriving on the scene:
We see two algorithm choices for incrementally updating the average; informally:
The first alternative may be cheaper since you don’t divide until you need the average. The second alternative has a certain mathematical elegance, but you must multiply and divide to get the new average.
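Both alternatives are a few lines of Python (the function names here are illustrative): the first keeps a running sum and divides on demand; the second keeps the mean itself and updates it as $\bar{x}_{N+1}=(N\bar{x}_N+x_{N+1})/(N+1)$.

```python
def mean_running_sum(stream):
    """Alternative 1: keep a running sum; divide only when the mean is wanted."""
    total, n = 0.0, 0
    for x in stream:
        total += x
        n += 1
    return total / n

def mean_incremental(stream):
    """Alternative 2: update the mean itself at every step."""
    mean, n = 0.0, 0
    for x in stream:
        mean = (n * mean + x) / (n + 1)   # multiply and divide each step
        n += 1
    return mean

print(mean_running_sum([1, 2, 3, 4, 5]))  # 3.0
```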
With only a small variation in the algorithm, you can keep a sliding-window average over some fixed width $W$. Along with the running sum or average, keep a copy of the oldest datum used to calculate it. As the new datum arrives, subtract off the oldest one, add the newest one, and divide by the window width $W$. You will need special treatment until the window fills the first time, and possibly at the end if your stream of data is finite.
For instance, away from the ends, a width-$3$ moving average over the sample data can be kept as follows, imagining that we add incoming new values from the right of the window and subtract old values from the left of the window:
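The add-on-the-right, subtract-on-the-left procedure can be sketched in Python with a deque (names here are illustrative); nothing is emitted until the window fills the first time, as noted above:

```python
from collections import deque

def moving_average(stream, width):
    """Width-`width` moving average: newest datum enters on the right,
    oldest leaves on the left; emits nothing until the window first fills."""
    window, total, out = deque(), 0.0, []
    for x in stream:
        window.append(x)
        total += x
        if len(window) > width:
            total -= window.popleft()   # subtract the oldest datum
        if len(window) == width:
            out.append(total / width)
    return out

print(moving_average([1, 2, 3, 4, 5], 3))  # [2.0, 3.0, 4.0]
```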
Now, about variance. Given $N$ data, the sample variance is reckoned as follows (though below we apply Bessel's Correction to estimate population variance):

$$\sigma_N^2=\frac{1}{N}\sum_{i=1}^N\left(x_i-\bar{x}_N\right)^2$$
If we let

$$\mathbf{1}_N=\begin{bmatrix}1 & 1 & \cdots & 1\end{bmatrix}^T$$

be an $N$-vector of all $1$'s, and let

$$\tilde{\mathbf{x}}_N=\mathbf{x}_N-\bar{x}_N\,\mathbf{1}_N$$

denote the $N$-vector of residuals of the data from their mean, then the variance has the following beautiful expression as the scaled inner product:

$$\sigma_N^2=\frac{\tilde{\mathbf{x}}_N^T\,\tilde{\mathbf{x}}_N}{N}$$
For instance, with our sample data
Expanding out the inner product, a form useful for incremental updating emerges:

$$\sigma_N^2=\frac{\mathbf{x}_N^T\mathbf{x}_N}{N}-\bar{x}_N^2$$
For instance,
When a new datum $x_{N+1}$ arrives, update the variance in a way similar to the way we update the mean:

$$\sigma_{N+1}^2=\frac{\mathbf{x}_{N+1}^T\mathbf{x}_{N+1}}{N+1}-\bar{x}_{N+1}^2=\frac{\mathbf{x}_N^T\mathbf{x}_N+x_{N+1}^2}{N+1}-\bar{x}_{N+1}^2$$
For instance
This is really quite nice. Notice that

$$\mathbf{x}_{N+1}^T\mathbf{x}_{N+1}=\mathbf{x}_N^T\mathbf{x}_N+x_{N+1}^2$$

for any $N$, so it's the running sum of squares. We see that the variance is the running sum of squares divided by the current number $N$ of data, minus the square of the current average. This is our incremental formula for variance. The only serious hazard we see here is that of catastrophic cancellation: if the two quantities are of comparable magnitude, we can lose virtually all the precision.
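A sketch of that incremental formula in Python (illustrative names), maintaining only the running sum and running sum of squares – with the cancellation caveat just noted very much in force when the two terms are comparable:

```python
def incremental_variance(stream):
    """Variance as running-sum-of-squares / N minus the squared mean.
    Cheap and incremental, but exposed to catastrophic cancellation
    when sumsq/N and mean*mean are of comparable magnitude."""
    total, sumsq, n = 0.0, 0.0, 0
    for x in stream:
        total += x
        sumsq += x * x
        n += 1
    mean = total / n
    return sumsq / n - mean * mean

print(incremental_variance([1, 2, 3, 4, 5]))  # 2.0
```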
From the data, we might want a formula for the Gaussian distribution that best fits the data. The best-fit Gaussian should use the unbiased estimator for the population variance, related to the sample variance we have been calculating, but with $N-1$ in the denominator instead of $N$ – Bessel's correction. Letting

$$\mu=\bar{x}_N,\qquad\sigma^2=\frac{N}{N-1}\,\sigma_N^2$$
The univariate Gaussian or Normal distribution depends on $\mu$ and $\sigma$ and represents the probability that $x$ might lie in the infinitesimal interval between $x$ and $x+dx$:

$$p(x)\,dx=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{\left(x-\mu\right)^2}{2\sigma^2}\right)dx$$
We can plot the Gaussian that best fits our sample data along with a histogram of our data and see a plausible match (we scaled up the Gaussian by the area of the histogram, a somewhat arbitrary number depending on the choice of histogram bin width):
To quantitatively test the goodness-of-fit, apply something like the Chi-Squared test or the Kolmogorov-Smirnov test.
The particular form $\mathbf{x}_N^T\mathbf{x}_N$ goes over easily to an elegant form for multivariate data, where we keep variances and correlation coefficients in a matrix.
Consider a pair of column $N$-vectors of numerical data:
Plotting $y$’s against $x$’s, we might see something like the following:
Now, calculate the mean of each vector, $\bar{x}$ and $\bar{y}$, and the residuals

$$\tilde{\mathbf{x}}_N=\mathbf{x}_N-\bar{x}\,\mathbf{1}_N,\qquad\tilde{\mathbf{y}}_N=\mathbf{y}_N-\bar{y}\,\mathbf{1}_N$$
Now imagine a column of rows in an outer product with a row of columns:

$$\begin{bmatrix}\tilde{\mathbf{x}}_N^T\\ \tilde{\mathbf{y}}_N^T\end{bmatrix}\begin{bmatrix}\tilde{\mathbf{x}}_N & \tilde{\mathbf{y}}_N\end{bmatrix}=\begin{bmatrix}\tilde{\mathbf{x}}_N^T\tilde{\mathbf{x}}_N & \tilde{\mathbf{x}}_N^T\tilde{\mathbf{y}}_N\\ \tilde{\mathbf{y}}_N^T\tilde{\mathbf{x}}_N & \tilde{\mathbf{y}}_N^T\tilde{\mathbf{y}}_N\end{bmatrix}$$

Divide this by $N$ and call the result $S_N^2$:

$$S_N^2=\frac{1}{N}\begin{bmatrix}\tilde{\mathbf{x}}_N^T\tilde{\mathbf{x}}_N & \tilde{\mathbf{x}}_N^T\tilde{\mathbf{y}}_N\\ \tilde{\mathbf{y}}_N^T\tilde{\mathbf{x}}_N & \tilde{\mathbf{y}}_N^T\tilde{\mathbf{y}}_N\end{bmatrix}$$
Each subproduct in the matrix can be computed incrementally as described above, plus $\tilde{\mathbf{x}}_N^T\tilde{\mathbf{y}}_N = \tilde{\mathbf{y}}_N^T\tilde{\mathbf{x}}_N$, so it is cheap to keep a running value for the matrix.
Just as with the univariate Gaussian, we convert the sample covariance matrix into an unbiased estimator of the population covariance matrix by multiplying by $N$ and dividing by $N-1$. Let

$$\Sigma_N^2=\frac{N}{N-1}\,S_N^2$$
The bivariate Gaussian or normal distribution has such a beautiful form that it can be remembered without looking it up:

$$p(\mathbf{x})\,d\mathbf{x}=\frac{1}{2\pi\left|\Sigma^2\right|^{1/2}}\exp\left(-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}\right)^T\Sigma^{-2}\left(\mathbf{x}-\boldsymbol{\mu}\right)\right)d\mathbf{x}$$
where $\left|\Sigma^2\right|^{1/2}$ is the square root of the determinant of $\Sigma^2$ and $\Sigma^{-2}$ is the inverse of $\Sigma^2$. The formula generalizes to $D$ dimensions, in which case the coefficient $2\pi$ must be written $(2\pi)^{D/2}$.
I created the example cloud of points above by sampling 500 points from a bivariate Gaussian formula at a mean point of $\begin{bmatrix}\bar{x} & \bar{y}\end{bmatrix} = \begin{bmatrix}5 & 5\end{bmatrix} $ and the covariance matrix:
If we calculate $\Sigma_N^2=\Sigma_{500}^2$, the unbiased population covariance estimate, from the data, we get
plausibly close to the original (though a formal test, as mentioned, would be required to say quantitatively how close).
Later, we have to figure out exactly how to sample a multivariate Gaussian like this (hints: do the sampling in a new coordinate system in which the $x$ and $y$ variables are uncorrelated – a coordinate system in which the principal axes of the cloud are aligned with the $x$ and $y$ axes. That entails a Principal-Component Analysis (PCA) of the covariance matrix or its inverse and their square roots, which can be cheaply done with the Singular-Value Decomposition (SVD)).
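As a down payment on that, here is one common decorrelating route sketched in pure Python – a Cholesky factor rather than the SVD/PCA route the hints describe, and with the covariance packed as an assumed `(sxx, sxy, syy)` triple; all names are illustrative. If $\Sigma^2=LL^T$ and $\mathbf{z}$ is a pair of independent standard normals, then $\boldsymbol{\mu}+L\mathbf{z}$ has the desired covariance:

```python
import math
import random

def cholesky_2x2(sxx, sxy, syy):
    """Lower-triangular L with L @ L.T == [[sxx, sxy], [sxy, syy]]."""
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 * l21)
    return l11, l21, l22

def sample_bivariate(mean, cov, rng=random):
    """Draw one sample from N(mean, cov); cov is the triple (sxx, sxy, syy)."""
    (mx, my), (sxx, sxy, syy) = mean, cov
    l11, l21, l22 = cholesky_2x2(sxx, sxy, syy)
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)   # independent standard normals
    return mx + l11 * z1, my + l21 * z1 + l22 * z2

print(sample_bivariate((5.0, 5.0), (4.0, 2.0, 3.0)))
```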
Lazy lists make it easy to write infinite data sets without doing infinite computations. For instance, the following static method in C# will generate a lazy list that serves up all the integers (actually, all the 32-bit integers in an infinite cycle, but we quibble):
Notice that an instance of this LazyList type has two properties (or fields, if you prefer): a value, and a rest, which is a function (lambda expression) that, when invoked, will produce the rest of the infinite stream. That's the essence of the technique: keep the infinite stuff in a function, recursively, and pick out values only when you need them.
Next is my implementation of the LazyList class. Of course, I could use C#'s built-in Lazy type, but I am, on purpose, doing the details myself for deeper understanding and exploration. We're going to morph these into task-based LazyLists that produce the downstream values asynchronously:
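The value/rest shape described above is easy to sketch in Python, too (the names are illustrative, mirroring the C# design rather than reproducing it):

```python
class LazyList:
    """A value, plus a nullary function that, when invoked, yields the rest."""
    def __init__(self, value, rest):
        self.value = value
        self.rest = rest          # () -> LazyList

    def nth(self, n):
        # O(n) walk up the list; O(n) stack without tail-call optimization
        return self.value if n <= 1 else self.rest().nth(n - 1)

def integers(start=1):
    # all the integers, lazily: the infinite part hides inside the lambda
    return LazyList(start, lambda: integers(start + 1))

print(integers(1).nth(5))  # 5
```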
In addition to the properties already mentioned, I've included an nth method that walks up the lazy list in O(n) time to get the nth item (without the tail-recursion optimization, this code will also consume O(n) space on the stack). This is not designed to fetch the nth item directly, as is possible with some spigot algorithms; it might be very interesting to explore spigots using lazy techniques in C#, but perhaps another time.
Next is a little more meat on the bones: the obligatory factorial:
This one uses recursion in the production of value. There are two uses of recursion: value recursively uses prior values, and rest recursively produces future values. This is pretty neat. Both uses of recursion are completely synchronous, and all this code runs on one thread.
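The two-recursions structure can be sketched in Python as well (illustrative names, not the author's C#): `value` carries the running product forward, while `rest` delays the production of future values.

```python
class LazyList:
    def __init__(self, value, rest):
        self.value = value        # head of the stream
        self.rest = rest          # () -> LazyList, the delayed tail

    def nth(self, n):
        return self.value if n <= 1 else self.rest().nth(n - 1)

def factorials(value=1, n=1):
    # value recursively uses the prior value; rest produces future values
    return LazyList(value, lambda: factorials(value * (n + 1), n + 1))

print(factorials().nth(5))  # 120, i.e. 5!
```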
We could go on to fibonaccis and more exotic streams; we could anonymize these for remoting, or memoize them for efficiency, as explained in an earlier post on this blog; but that's not where we're going today.
One last thing before going async, and that's a bit of code that exercises the above, using LINQPad:
Much more with synchronous lazy streams can be found in this wonderful paper: Music of the Streams by Doug McIlroy. But on to asynchrony for us.
Let's start with "what-we-want" in AsyncMain:
We made a new main because we needed it to be async and we can't mark the regular Main as async. What we want, here, is to await the computation of asyncIntegers and asyncFactorials in the body of AsyncMain. This can cause various threads to jump around in our code, so we're careful to print thread ids after every await and in the body of every lambda expression that runs asynchronously in a task, via the following static method:
We'll also insert a number of artificial random delays to try to entice the thread scheduler to send different threads through our code. None of these Sleeps are part of the algorithm at all; they're here to induce interesting behavior.
Now look at the implementation of asyncIntegers:
Looks just like the implementation of lazy integers. The variable k3 = asyncIntegers(5) in AsyncMain must be of type AsyncList. The deeper differences are in nth, the method we use to walk the lazy list:
We use only one facet of the multifaceted Task class. The rule we use is that Task.Run, when given a Func&lt;T&gt;, produces a Task&lt;T&gt;. Awaiting a Task&lt;T&gt; produces a T. So, except for the routing through Task, it looks like invoking a Func&lt;T&gt; to produce a T. This seems like a minimal intrusion of task-ness on our original lazy-list design.
Look at the first branch of nth, when n &lt;= 1. We await a Task.Run of a Func&lt;T&gt; that produces value, so the result of the await is of type T, the type of value. This T gets returned through an async method, namely nth, so it gets converted back into a Task&lt;T&gt; on the way out. In the body of AsyncMain, where we are calling nth, we await this Task&lt;T&gt;, getting back a T in the end.
T's become Task&lt;T&gt;'s when run under Task.Run or when returned from any async method that would otherwise produce a T. Task&lt;T&gt;'s become T's when awaited (there's a monad in here somewhere, but that's for another time, too).
On the other branch of nth, the new rest will walk up the lazy list just as before, only awaiting the value of rest().nth(n-1) to get a T, and returning it through the async to get a Task&lt;T&gt; on the way out (EDIT: thanks to Brian Grunkmeyer for a bug fix here).
On both branches, nth is of type Task&lt;T&gt;, just what we need to await on in the caller to get T's.
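The same wrap/unwrap dance exists in Python's asyncio, and a minimal sketch may make the shape clearer (illustrative names; `asyncio.to_thread` requires Python 3.9+, standing in loosely for Task.Run): awaiting a coroutine yields a plain value, the way awaiting a Task&lt;T&gt; yields a T.

```python
import asyncio

class AsyncList:
    """Lazy list whose nth is a coroutine: awaiting it unwraps a plain value."""
    def __init__(self, value, rest):
        self.value = value
        self.rest = rest          # () -> AsyncList

    async def nth(self, n):
        if n <= 1:
            # run the value-producing thunk off the event loop (Python 3.9+)
            return await asyncio.to_thread(lambda: self.value)
        # walk up the list as before, awaiting the rest of the walk
        return await self.rest().nth(n - 1)

def async_integers(start):
    return AsyncList(start, lambda: async_integers(start + 1))

print(asyncio.run(async_integers(5).nth(4)))  # walks 5, 6, 7 and lands on 8
```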
Here is the more meaty async factorial, which doesn't differ in any interesting way from the lazy-list factorial:
It may take a few tries to get a lot of thread variety: the thread-pool scheduler seems to take hints from your code and tries to reuse the same thread whenever it can. But eventually, you can get a trace like the following that shows various threads moving around these tasks.
A gist for this LINQPad script can be found here: https://gist.github.com/4580491.
The problem is a practical, real-world problem. Any time you need to generate random numbers according to some arbitrary, given distribution – in simulation, traversing a Bayesian network, a decision tree, a packet-retry backoff schedule, you name it – the problem comes up. In my work, it comes up every time I need to generate random expressions for testing parsers or for Monte Carlo search for formulas. Most of the time, you want to generate expressions non-uniformly. For instance, you'd like to generate addition expressions more frequently than division expressions, or function calls more frequently than function definitions.
There are a lot of solutions to this problem. Probability mavens will immediately see that the problem boils down to inverting the CDF. We can do this by:
At first glance, this seems ridiculous, but there is a way. In a nutshell: paint a dartboard with colors in the given proportions, throw darts, look up the colors (thanks to Terence Spies for this beautiful idea).
Less metaphorically: imagine that the counts are colored balls distributed in bins:
Scale up the given counts of balls until the total is divisible by N – multiply every count by D/S where D=lcm(N,S)=lcm(6,39)=78: D/S=2
Now we have
Redistribute the counts amongst the bins so that every bin contains the same number of counts, namely D/N=13, and that the following conditions obtain: each bin holds balls of at most two colors, and one of those colors is the bin's own home color.
There is a very elegant algorithm for this redistribution process. It’s a oneliner whose statement is a proof of its correctness. Think a minute, if you like, to see if you can come up with it, but here it is:
Fill up the shortest bin from the tallest bin; remove the new filled bin from the game and continue.
Do this procedure on a piece of paper:
you should end up with
Now, to generate new randoms, roll once to choose a bin, RandomInteger[{1,N}], and roll again to choose a height, RandomInteger[{1,D/N}]. If the randomly chosen height is less than or equal to the number of balls of the home color, return the number of the home color; otherwise, return the number of the foreign color.
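This bin-filling scheme is Walker's alias method, and it fits in a few lines of Python. This sketch (illustrative names) scales counts to an average of one unit per bin with floats rather than using the lcm, but the fill-the-shortest-from-the-tallest redistribution and the two-roll sampling are the same:

```python
import random

def build_alias(weights):
    """Preprocess weights into (prob, alias) tables: Walker's alias method."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]   # average of 1.0 per bin
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    prob, alias = [0.0] * n, [0] * n
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]                     # home-color share of bin s
        alias[s] = l                            # foreign color filling the rest
        scaled[l] -= 1.0 - scaled[s]            # fill the short bin from the tall one
        (small if scaled[l] < 1.0 else large).append(l)
    for leftover in small + large:              # bins left exactly full
        prob[leftover] = 1.0
    return prob, alias

def sample(prob, alias, rng=random):
    i = rng.randrange(len(prob))                # roll once to choose a bin
    return i if rng.random() < prob[i] else alias[i]  # roll again for the height

prob, alias = build_alias([6, 9, 4, 12, 3, 5])  # total 39 over 6 bins, as above
print(sample(prob, alias))
```

A nice property to check: the probability of color $j$ is $\frac{1}{N}\bigl(\text{prob}[j]+\sum_{i:\,\text{alias}[i]=j}(1-\text{prob}[i])\bigr)$, which should equal $w_j/S$ exactly.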
Nifty, eh? Here is a Mathematica notebook with an implementation tuned for clarity: https://dl.dropbox.com/u/1997638/FastNonUniformPseudoRandoms.cdf
One reason to care about anonymized computations is that naming things costs memory, and it's the kind of memory that lasts from one computation to another – session memory. Naming things means writing definitions and storing them in tables for access in later computations. Building up big memoization tables as definitions can cost a LOT of memory. We can save this cost if we can avoid naming things, storing information in function parameters that only last the lifetime of a single computation.
Lazy lists are a handy way of expressing infinite data streams. A typical lazy list might look like the following:
Calling integersFrom with some numerical argument produces a list in curly braces, the first element of which is that numerical argument and the second element of which is a delayed list in the form of a nullary function – a function of no parameters. That's what the &amp; means: "the expression to my left is a function." We're doing Lisp-style lists, which are pairs of values and lists.
The list in the second slot of the lazy list won’t be automatically evaluated – we must manually invoke the function that produces it when we want the results. That is the essence of lazy computation. Lazy languages like Haskell can decide when you need the results and just sidestep the explicit wrapping in a nullary function and the manual invocation. With eager or strict languages like Mathematica, we must do the invocations manually. But that’s good, because we can see more clearly what’s going on.
In our lazy lists, the second item will always be a nullary function.
Start peeling integers off such a structure one at a time. integersFrom[0] will be a lazy list of integers starting from 0, and [[1]] will pick the first element from the resulting list:
If we pick the second element from integersFrom[0], that will be a nullary function that produces the next lazy list. Manually invoke the function by appending [] – invocation brackets containing no arguments – and then pick off the first element of the result to get the next integer.
And so on:
A pattern emerges that we can capture in some operators. value just picks the first element of a lazy list, and next picks the second element – the nullary function – and invokes it on no arguments:
Improve the value operator so we can ask for the nth value, recursively:
Let’s see a few values by mapping over a list of inputs:
We don't need next any more – only value used it. Inline it in the body of value:
As an aside, an efficient implementation of value requires either the tail-recursion optimization in the interpreter or a re-expression in iterative form, which can be done semi-automatically.
Write another lazy list as follows:
Of course, this is exponentially inefficient, as are all non-memoizing recursive fibonaccis. Fib of 15 is already painful, and fib of 150 is unthinkable. This can be mitigated as follows:
This is a well-known Mathematica idiom for building a memo table by side effect. The memo table consists of explicit, specific point-values for fibonaccis, which the general fibonaccis – the one depending on a single parameter n_ – creates on the fly. Subsequent references to fibonaccis at known inputs will do a quick lookup on the specific values and avoid the exponential recomputation.
This is non-anonymous programming on steroids – everything is built up inside the named object fibonaccis – and it gives us great speed at the expense of memory. But this is global memory in Mathematica's global brain. When we're done with the calculation, we have modified the session state of Mathematica and not left things the way we found them. In some scenarios, this would not be allowed. We must find some other way to evaluate recursive formulas without requiring session state – using only ephemeral memory such as stack frames that go away when our result is achieved.
Let’s get rid of the named functions and then even get rid of the named memo table so we can have decent performance on our recursive evaluations.
The mother of all anonymizers is the Y combinator:
Without going into how it works, Y is a function that takes another function as an argument. That other function takes an argument k and produces a function of n. The function of n can call k as if k were a recursive definition of the function of n. The function of n can be applied to workhorse arguments, that is, to arguments of the subject code.
In practice, Y is easy to use: just apply it to a function of k that produces a function of n that internally calls k, then apply the result to the desired n.
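The usage pattern works the same way in Python (this sketch uses the applicative-order variant, often called the Z combinator, since Python is an eager language like Mathematica; the names are illustrative):

```python
# Applicative-order Y (the Z combinator): the self-application is
# guarded behind a lambda so an eager language doesn't loop forever.
Y = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# a function of k producing a function of n that internally calls k,
# as if k were a recursive definition of itself -- no name needed
fact = Y(lambda k: lambda n: 1 if n == 0 else n * k(n - 1))

print(fact(5))  # 120
```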
Here is the lazy list of integers, this time starting at 1, without using the name of the lazy list inside its own definition: s1 does not refer to s1:
We feed to Y a function of k. That function of k produces a function of n. That function of n produces a list in curly braces. The first element of that list is the value n, and the second element is a delayed call of k on the next value, n+1. s1 is thus a lazy list of all the integers, defined without reference to any names but parameters. No global storage needed, no session state.
Let's map calls of value[s1,#]&amp; over a sequence of inputs to concisely show it off multiple times:
Now, on to an anonymized lazy list of fibonaccis:
This one is still slow:
It takes about a second to produce fib[20], and higher values become intolerable. We would never get to an absurd input like 150.
How can we speed it up without defining point values in some named object? By defining point values in an unnamed object, of course! Unnamed objects are best modeled in Mathematica as lists of rules, one for each point-like input. We'll need to pass that around in the argument list of k, and the easiest way to do that is to make n, the argument of k, a regular Mathematica list containing both a value and an anonymous dictionary.
We'll get to this form in a few steps. First, just change n so that it's a list rather than an integer. This change spreads a lot of indexing around the body of k, so we want to make sure we get that right before proceeding:
Now add a second element – an empty list – to n, but ignore it in the code, just to make sure we get all the shaping correct. We must modify six places where arguments to k are constructed:
Now replace the internal constant empty lists with references to the second slot of n, where we will eventually store the dictionary:
We're now ready to promote that second part of n into a dictionary – same as an anonymous object, same as a list of rules – for fast lookups. Let's take the convention that if a key is not present in the dictionary, we produce Null, and make a few helper functions:
add converts the dictionary into a hash table by applying the Mathematica built-in Dispatch to it. This is similar to what relational databases do when computing joins – create a temporary hash table and use it. Before adding a new rule to the table, add must check whether the input table is already a hash table, since the original list of rules is stored in slot 1 of a hash table. add can only add a rule to a list of rules, not to a hash table, but it can regenerate a new hash table in linear time. This is a potential hidden quadratic in this code, since the hash table will be created over and over again. It does not seem to be a serious problem here, likely overwhelmed by other overheads.
Now add the code to store computed values in the anonymous object. We must modify the initial input from {1, {}} to {1, {_ -&gt; Null}} so that the starting dictionary has the default rule.
Now add code to do lookups:
Now our regular example is much faster, and an otherwise inconceivable computation such as fib[150] can be done in a reasonable time:
Many improvements can be made to this, such as getting rid of the mutable variables in the Module by a monadic bind, and refactoring the code for readability. However, we have shown a general technique for creating lazy memoizing lists without introducing names or session state into an evaluator or interpreter.
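The same trick – threading an anonymous memo dictionary through the argument of k instead of mutating a named table – can be sketched compactly in Python (illustrative names; this is an analogue of the technique, not a translation of the Mathematica code): k maps a pair (n, memo) to a pair (fib(n), updated memo), so all state lives in arguments and return values.

```python
def Y(f):
    # applicative-order Y (Z) combinator, as before
    return (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

def fib_step(k):
    """k maps (n, memo) to (fib(n), memo'); no names, no session state."""
    def step(arg):
        n, memo = arg
        if n in memo:
            return memo[n], memo          # fast lookup on stored point-values
        if n < 2:
            return n, {**memo, n: n}
        a, memo = k((n - 1, memo))        # thread the memo through each call
        b, memo = k((n - 2, memo))
        return a + b, {**memo, n: a + b}
    return step

def fib(n):
    value, _ = Y(fib_step)((n, {}))       # the memo dies with the computation
    return value

print(fib(20))  # 6765, instantly -- and fib(150) is no longer unthinkable
```

When `fib` returns, the dictionary is garbage; nothing outlives the call, which is exactly the no-session-state property we were after.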