Where do I learn about log_sum_exp, log1p, lccdf, and other numerical analysis tricks?

Richard McElreath inquires:

I was helping a colleague recently fix his MATLAB code by using log_sum_exp and log1m tricks. The natural question he had was, “where do you learn this stuff?” I checked Numerical Recipes, but the statistical parts are actually pretty thin (at least in my 1994 edition). Do you know of any books/papers that describe these techniques?

I’d love to hear this blog’s answers to these questions.

I replied that I learned numerical analysis “on the street” through HMM implementations. HMMs are also a good introduction to the kind of dynamic programming technique I used for that Poisson-binomial implementation we discussed (which we’ll build into Stan one of these days—it’ll be a fun project for someone). Then I picked up the rest through a hodge-podge of case-based learning.

“Numerical analysis” is name of the field and the textbooks where you’ll learn log_sum_exp and log1p and complementary cdfs and learn how 0 is so very different than 1 (smallest double-precision floating point value greater than zero is around 10^-300, whereas the largest double-precision value less than 1 is about 1 – 10^-16), which is rather relevant for statistical computation. You’ll also learn about catastrophic cancellation (which makes naive variance calculations so unstable) and things like the stable Welford algorithm for calculating variance, which also has the nice property of behaving as a streaming accumulator (i.e., it’s memoryless). I don’t know which books are good, but there are lots of web sites and course materials you can try.

The more advanced versions of this will be about matrices and how to maintain stability of iterative algorithms. Things like pivoting LL^t decompositions and how to do stable matrix division. A lot of that’s also about how to deal with caching in memory with blocking algorithms to do this efficiently. A decent matrix multiplier will be more than an order of magnitude faster than a naive approach on large matrices.

“Algorithms and data structures” is the CS class where you learn about things like dynamic programming (e.g., how to calculate HMM likelihoods, fast Fourier transforms, and matrix multiplication ordering).

Algorithms class won’t typically get into the low-level caching and branch-point prediction stuff you need to know to build something like Stan efficiently. There, you need to start diving into the compiler literature and the generated assembly and machine code. I can highly recommend Agner Fogg’s overviews on C++ optimization—they’re free and cover most of what you need to know to start thinking about writing efficient C++ (or Fortran—the game’s a bit different with statically typed functional languages like ML).

The 1D integrator in Stan (probably land in 2.19—there’s a few kinks to work out in Ben Bales’s math lib code) uses an input that provides both the integrated value and its complement (x and closest boundary of the integration minus x). Ben Goodrich helped a lot, as usual, with these complicated numerical things. The result is an integrator with enough precision to integrate the beta distribution between 0 and 1 (the trick is the asymptote at 1).

Integration in general is another advanced numerical analysis field with tons of great references on error accumulation. Leimkuhler and Reich is the usual intro reference that’s specific to Hamiltonian systems; we use the leapfrog (StÃ¶rmer-Verlet) integrator for NUTS and this book has a nice analysis. We’re looking now into some implicit algorithms to deal with “stiff” systems that cause relatively simple explicit algorithms like Runge-Kutta to require step sizes so small as to be impractical; we already offer them within Stan for dynamics modeling (the _bdf integrators). Hairer et al. the more mathematically advanced reference for integrators. There are tons of great course notes and applied mathematics books out there for implementing Euler, implicit Euler, Runge-Kutta, Adams-Moulton, implicit midpoint, etc., all of which have different error and symplecticness properties which heavily tie into implementing efficient Hamiltonian dynamics. Yi Zhang at Metrum is now working on improving our underlying algorithms and adding partial differential equation solvers. Now I have a whole new class of algorithms to learn.

So much for my getting away from Columbia after I “learned statistics”. I should at least record the half-lecture I do on this topic for Andrew’s stats communication class (the other half of the class I do focuses on wring API specs). I figure it’s communicating with the computer and communicating with users, but at least one student per year walks out in disgust at my stretching the topic so broadly to include this computer sciency stuff.