Blazing fast expression evaluation with custom assembly

Tags: expression evaluation, compiler

Author: Ben Ruijl
Published: October 3, 2024

Support Symbolica

This blog post uses the computational framework Symbolica. Symbolica is free for hobbyists and a single-core instance is free for non-commercial use. Would you like to have new features added? Your organization can purchase a license and support the Symbolica project!

Introduction

In this blog post we will go on a journey to evaluate mathematical expressions as fast as we can. We will start out with a naive evaluation and end up with custom assembly that is over 1000 times faster!

Warning

We will use fast-math optimizations and we will consider a NaN in any component (in the case of complex numbers for example) to invalidate the entire number. Loss of precision can be combated by evaluating with Symbolica’s arbitrary precision numbers and it can be measured with Symbolica’s error-propagating floating point numbers.

Let us start with a general problem: we have (nested) expressions that we want to evaluate with numbers. For example:

\[ \begin{align*} e &= x + \pi + \cos(x) + f(g(x+1),2x)\\ f(y,z) &=y^2 + z^2 y^2\\ g(y) &=5 y \end{align*} \]

which we want to evaluate at x=5. Here is the Symbolica equivalent:

from symbolica import *
x, y, z, pi, f, g, cos = S('x', 'y', 'z', 'pi', 'f', 'g', 'cos')

e = x + pi + cos(x) + f(g(x+1), x*2)

ev = e.evaluate({pi: 3.14, x: 5.}, 
                {f: lambda a: a[0]**2 + a[1]**2*a[0]**2,
                 g: lambda y: y[0] * 5})
print(ev)

The evaluate function walks through the expression tree, substitutes every variable, calls functions and accumulates the result. It has to do error checking for missing variables, and the process naturally involves a lot of branching.
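As a caricature (in plain Python, and not Symbolica's actual implementation), such a tree-walking evaluator could look like this, with the expression represented as nested tuples:

import math

def naive_evaluate(expr, variables, functions):
    if isinstance(expr, (int, float)):
        return float(expr)
    if isinstance(expr, str):                # a variable such as 'x'
        if expr not in variables:
            raise ValueError(f"missing value for {expr}")
        return variables[expr]
    head, *args = expr                       # e.g. ('+', 'x', 'pi', ...)
    values = [naive_evaluate(a, variables, functions) for a in args]
    if head == '+':
        return sum(values)
    if head == '*':
        result = 1.0
        for v in values:
            result *= v
        return result
    if head not in functions:
        raise ValueError(f"missing function {head}")
    return functions[head](values)           # user-supplied lambda

# e = x + pi + cos(x) + f(g(x + 1), 2*x)
e = ('+', 'x', 'pi', ('cos', 'x'), ('f', ('g', ('+', 'x', 1)), ('*', 2, 'x')))
print(naive_evaluate(e, {'x': 5.0, 'pi': 3.14},
                     {'cos': lambda a: math.cos(a[0]),
                      'f': lambda a: a[0]**2 + a[1]**2 * a[0]**2,
                      'g': lambda a: a[0] * 5}))

Every node incurs a type check, a dictionary lookup or a recursive call, which is exactly the overhead we want to eliminate.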

Some applications, like Monte Carlo simulations and integrations, require sampling expressions millions of times, and for that this procedure is too slow (the hashmap accesses alone are wasteful). Therefore, we want to rewrite our expression so that it uses the fewest function calls and the fewest arithmetical operations.

The steps we will undertake will be very similar to the ones a general programming language compiler performs.

Inlining and linearizing

We want to turn our program into a list of simple instructions involving additions, multiplications, and calls to built-in floating-point functions such as cos.

The first step is to remove the lambda functions, which are black boxes (they could query the weather and return the current temperature as a double for all we know). Instead, we use other Symbolica expressions to define our custom functions, and we use the evaluator method, which takes the function name and its symbolic arguments as a key and the function body as a value. We arrive at:

from symbolica import *
x, y, z, pi, f, g, cos = S('x', 'y', 'z', 'pi', 'f', 'g', 'cos')

e = x + pi + cos(x) + f(g(x+1), x*2)

ev = e.evaluator({pi: N(22)/7},
                 {(f, "f", (y, z)): y**2 + z**2*y**2,
                  (g, "g", (y, )): y * 5},
                 [x])
res = ev.evaluate([[5.]])
print(res)

We also got rid of the hashmap for the parameters; we let the user provide them in an array.

Now that everything is a symbolic expression, inside evaluator we can start unnesting (inlining) the function calls to f and g. We write

e = x + pi + cos(x) + f(g(x+1), x*2)
f(y,z) = y^2 + z^2*y^2
g(y) = y * 5

in a linear form, using helper variables Z[i] as:

Z[0] = x + 1
Z[1] = g(Z[0])
Z[2] = x * 2
Z[3] = f(Z[1], Z[2])
Z[4] = cos(x)
e = x + 22/7 + Z[3] + Z[4]

Now we substitute the functions f and g:

Z[0] = x + 1
Z[1] = Z[0] * 5
Z[2] = x * 2
Z[3] = Z[1]^2 + Z[2]^2*Z[1]^2
Z[4] = cos(x)
e = x + 22/7 + Z[3] + Z[4]

Next, we write our inputs and constants in the Z array and only use one type of operation (+ or *) per line:

Z[0] = x;
Z[1] = 1;
Z[2] = 2;
Z[3] = 5;
Z[4] = 22/7;
Z[5] = Z[0] + Z[1]
Z[6] = Z[3] * Z[5]
Z[7] = Z[0] * Z[2]
Z[8] = Z[6] * Z[6]
Z[9] = Z[6] * Z[6] * Z[7] * Z[7]
Z[10] = Z[8] + Z[9]
Z[11] = cos(Z[0])
Z[12] = Z[0] + Z[4] + Z[10] + Z[11]

Updating the parameters (x in this case) in the Z array for every call to evaluate is a fast memcpy. What we gain is that we can now represent our instructions as follows:

enum Instruction {
  Add(usize, Vec<usize>),
  Mul(usize, Vec<usize>),
  BuiltinFunction(usize, Symbol, usize),
}

where the first argument of each variant is the index in the Z array where the result should be written to. On the right-hand side, we have the indices into the same Z array. With this setup, we drastically cut down on the branching in our code. Now we can evaluate using a simple routine:

for i in &instructions {
    match i {
        Instruction::Add(r, v) => {
            let mut tmp = Z[v[0]];
            for x in &v[1..] {
                tmp += Z[*x];
            }
            std::mem::swap(&mut Z[*r], &mut tmp);
        }
        ...
    }
}

We could also unroll the loop for additions or multiplications with 2 or 3 arguments for some extra speed.
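To see the linear program in action, here is a small Python rendering (not Symbolica's Rust implementation) of the instruction list above, evaluated at x = 5:

import math

# Z[0] holds the parameter x; Z[1..4] hold the constants 1, 2, 5 and 22/7.
def evaluate_linear(x):
    Z = [x, 1.0, 2.0, 5.0, 22.0 / 7.0] + [0.0] * 8
    instructions = [
        ('add', 5, [0, 1]), ('mul', 6, [3, 5]), ('mul', 7, [0, 2]),
        ('mul', 8, [6, 6]), ('mul', 9, [6, 6, 7, 7]), ('add', 10, [8, 9]),
        ('cos', 11, [0]), ('add', 12, [0, 4, 10, 11]),
    ]
    for op, target, args in instructions:
        if op == 'add':
            Z[target] = sum(Z[i] for i in args)
        elif op == 'mul':
            acc = 1.0
            for i in args:
                acc *= Z[i]
            Z[target] = acc
        elif op == 'cos':
            Z[target] = math.cos(Z[args[0]])
    return Z[12]

print(evaluate_linear(5.))   # x + 22/7 + cos(x) + f(g(x + 1), 2*x) at x = 5

The only branch that remains is the dispatch on the operation type, which the Rust enum above handles with a single match.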

Horner’s method

Now let us turn our attention to optimizing the input before we linearize it. In the input we have expressions such as:

f(y,z) = y^2 + z^2*y^2

Had we written this as

f(y,z) = y^2*(1 + z^2)

we would have saved on two multiplications, while keeping the number of additions the same. Pulling variables out (‘bracketing’) is called Horner’s method.

For polynomials with one variable (that do not factor), the Horner scheme is one of the most efficient methods of evaluating a polynomial. For multiple variables there is an open question though: in which order do we pull the variables outside?

Consider the following expression:

\[ x^3y^2 + x^2y + x^3z \] which has 9 multiplications. We can construct the following Horner scheme [x,y,z]:

\[ x^2(y + x(y^2 + z)) \]

which has 4 multiplications and [y,x,z]:

\[ x^3z + y(x^2(1 + xy)) \]

which has 7 multiplications. Finding the optimal ordering turns out to be NP-hard!

What makes the optimal scheme determination even harder is that we only want to compute common subexpressions once. The choice of scheme may impact the number of subexpressions.

In Symbolica we use a hill climbing algorithm that tries random swaps in the list of variables and accepts the swap only if the new operation count is the same or less. This turns out to work very well for this problem (Link to scientific paper).
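The hill climber itself only takes a few lines. Below is a Python sketch of the driver loop; operation_count is a hypothetical stand-in for a routine that builds the Horner scheme for a given variable ordering and counts the resulting additions and multiplications:

import random

def hill_climb(variables, operation_count, steps=1000):
    """Randomly swap two variables in the ordering and keep the swap if the
    operation count does not increase."""
    order = list(variables)
    best = operation_count(order)
    for _ in range(steps):
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]      # try a random swap
        cost = operation_count(order)
        if cost <= best:                             # same or better: keep it
            best = cost
        else:
            order[i], order[j] = order[j], order[i]  # otherwise undo
    return order, best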

Common subexpressions

Large expressions often have subexpressions that appear more than once. For example, in our simple input we find that Z[6] * Z[6] is used twice. We employ a simple algorithm that counts how often every pair of operands occurs. Then we extract the most frequently occurring pair into its own Z entry, and we repeat the process until there are no more repeated computations.

Applying this to our running example, we start from the linearized program:

Z[0] = x;
Z[1] = 1;
Z[2] = 2;
Z[3] = 5;
Z[4] = 22/7;
Z[5] = Z[0] + Z[1]
Z[6] = Z[3] * Z[5]
Z[7] = Z[0] * Z[2]
Z[8] = Z[6] * Z[6]
Z[9] = Z[6] * Z[6] * Z[7] * Z[7]
Z[10] = Z[8] + Z[9]
Z[11] = cos(Z[0])
Z[12] = Z[0] + Z[4] + Z[10] + Z[11]

and, after reusing Z[8] for the repeated pair Z[6] * Z[6], we get:

Z[0] = x;
Z[1] = 1;
Z[2] = 2;
Z[3] = 5;
Z[4] = 22/7;
Z[5] = Z[0] + Z[1]
Z[6] = Z[3] * Z[5]
Z[7] = Z[0] * Z[2]
Z[8] = Z[6] * Z[6]
Z[9] = Z[8] * Z[7] * Z[7] // change here
Z[10] = Z[8] + Z[9]
Z[11] = cos(Z[0])
Z[12] = Z[0] + Z[4] + Z[10] + Z[11]
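The pair counting and extraction described above fits in a short Python sketch (the (op, target, args) instruction format and the helper names are invented for illustration; Symbolica's internal implementation differs in the details):

from collections import Counter
from itertools import combinations

def pair_counts(instructions):
    """Count how often each operand pair occurs per operation type.
    instructions are (op, target, args) triples with op in {'+', '*'}."""
    counts = Counter()
    for op, _, args in instructions:
        for pair in set(combinations(sorted(args), 2)):
            counts[op, pair] += 1
    return counts

def substitute(args, a, b, new_idx):
    """Replace disjoint occurrences of the operands a and b by new_idx."""
    args = list(args)
    while True:
        trial = list(args)
        try:
            trial.remove(a)
            trial.remove(b)
        except ValueError:
            return args
        args = trial + [new_idx]

def eliminate_pairs(instructions, next_free):
    """Repeatedly extract the most frequent pair into its own Z entry."""
    while True:
        counts = pair_counts(instructions)
        if not counts or counts.most_common(1)[0][1] < 2:
            return instructions
        (op, (a, b)), _ = counts.most_common(1)[0]
        # reuse an instruction that already computes exactly this pair, if any
        idx = next((t for o, t, ar in instructions
                    if o == op and sorted(ar) == [a, b]), None)
        fresh = idx is None
        if fresh:
            idx = next_free
            next_free += 1
        out, placed = [], not fresh
        for o, target, args in instructions:
            if o == op and target != idx:
                new_args = substitute(args, a, b, idx)
                if new_args != args and not placed:
                    out.append((op, idx, [a, b]))  # define the pair before its first use
                    placed = True
                args = new_args
            out.append((o, target, args))
        instructions = out

Running this sketch on the instruction list above finds the pair ('*', (6, 6)), notices it is already computed as Z[8], and rewrites Z[9] accordingly.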

Register recycling (part 1)

Our linearized code uses quite a bit of memory for the Z entries. But we can do better!

Note that Z[5] is read only once, on the very next line, in Z[6] = Z[3] * Z[5]. This means we might as well overwrite Z[5] with the result:

Z[5] = Z[3] * Z[5]

We use a simple linear-time algorithm that finds, for every index Z[i], the line on which it is read for the last time. Then we walk through the instructions from the top and assign each result to the smallest index that is free at that point. Below you can see the recycling taking place, with the original program first and the recycled version after it:

Z[0] = x;
Z[1] = 1;
Z[2] = 2;
Z[3] = 5;
Z[4] = 22/7;
Z[5] = Z[0] + Z[1]
Z[6] = Z[3] * Z[5]
Z[7] = Z[0] * Z[2]
Z[8] = Z[6] * Z[6]
Z[9] = Z[8] * Z[7] * Z[7]
Z[10] = Z[8] + Z[9]
Z[11] = cos(Z[0])
Z[12] = Z[0] + Z[4] + Z[10] + Z[11]

After recycling:

Z[0] = x;
Z[1] = 1;
Z[2] = 2;
Z[3] = 5;
Z[4] = 22/7;
Z[5] = Z[0] + Z[1]
Z[5] = Z[3] * Z[5]
Z[6] = Z[0] * Z[2]
Z[5] = Z[5] * Z[5]
Z[6] = Z[5] * Z[6] * Z[6]
Z[5] = Z[5] + Z[6]
Z[6] = cos(Z[0])
Z[6] = Z[0] + Z[4] + Z[5] + Z[6]

We go down from 13 registers to 7 (out of which 5 are read-only)!
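A minimal Python sketch of this renumbering pass looks as follows (same invented (op, target, args) format as before; the exact slot choices may differ slightly from the listing above):

def recycle(instructions, num_inputs):
    """Renumber the Z slots so that slots whose value is no longer needed are
    reused. The first `num_inputs` slots hold the inputs and constants and are
    treated as read-only."""
    # the line on which each original index is read for the last time
    last_read = {}
    for pos, (_, _, args) in enumerate(instructions):
        for i in args:
            last_read[i] = pos

    rename = {i: i for i in range(num_inputs)}  # inputs keep their slot
    free, next_slot, out = [], num_inputs, []
    for pos, (op, target, args) in enumerate(instructions):
        new_args = [rename[i] for i in args]
        # release every writable slot that is read for the last time here
        for i in set(args):
            if last_read[i] == pos and rename[i] >= num_inputs:
                free.append(rename[i])
        slot = min(free) if free else next_slot
        if free:
            free.remove(slot)
        else:
            next_slot += 1
        rename[target] = slot
        out.append((op, slot, new_args))
    return out, next_slot  # next_slot is the total number of slots needed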

Large expressions

Let’s put what we have so far to the test on the 4.6MB expression F13, which is part of a computation used for collider experiments at the Large Hadron Collider.

Here is a very small excerpt:

-48*ammu*amuq*ammu2*amuq2*x6*xcp4*e1234-48*ammu*amuq*ammu2*amuq2*x6*xcp3*e1234+48*ammu*amuq*ammu2*amuq2*x6*xcp2*e1234+48*ammu*amuq*ammu2*amuq2*x6*xcp1*e1234+48*ammu*amuq*ammu2*amuq2*x6^2*xcp3*e1234-48*ammu*amuq*ammu2*amuq2*x6^2*xcp2*e1234-144*ammu*
amuq*ammu2*amuq2*x5*xcp4*e1234-48*ammu*amuq*ammu2*amuq2*x5*xcp3*e1234+48*ammu*amuq*ammu2*amuq2*x5*xcp2*e1234+144*ammu*amuq*ammu2*amuq2*x5*xcp1*e1234+96*ammu*amuq*ammu2*amuq2*x5*x6*xcp3*e1234-96*ammu*amuq*ammu2*amuq2*x5*x6*xcp2*e1234+48*ammu*amuq*
ammu2*amuq2*x5^2*xcp3*e1234-48*ammu*amuq*ammu2*amuq2*x5^2*xcp2*e1234-96*ammu*amuq*ammu2*amuq2*x4*xcp4*e1245-48*ammu*amuq*ammu2*amuq2*x4*xcp3*e1245-96*ammu*amuq*ammu2*amuq2*x4*xcp3*e1234+48*ammu*amuq*ammu2*amuq2*x4*xcp2*e1245+96*ammu*amuq*ammu2*
amuq2*x4*xcp2*e1234+96*ammu*amuq*ammu2*amuq2*x4*xcp1*e1245+48*ammu*amuq*ammu2*amuq2*x4*x6*xcp4*e1234-48*ammu*amuq*ammu2*amuq2*x4*x6*xcp1*e1234+48*ammu*amuq*ammu2*amuq2*x4*x6^2*xcp4*e1234-48*ammu*amuq*ammu2*amuq2*x4*x6^2*xcp1*e1234+144*ammu*amuq*
ammu2*amuq2*x4*x5*xcp4*e1234-144*ammu*amuq*ammu2*amuq2*x4*x5*xcp1*e1234+48*ammu*amuq*ammu2*amuq2*x4*x5*x6*xcp4*e1234+144*ammu*amuq*ammu2*amuq2*x4*x5*x6*xcp3*e1234-144*ammu*amuq*ammu2*amuq2*x4*x5*x6*xcp2*e1234-48*ammu*amuq*ammu2*amuq2*x4*x5*x6*xcp1*
e1234+96*ammu*amuq*ammu2*amuq2*x4*x5^2*xcp3*e1234-96*ammu*amuq*ammu2*amuq2*x4*x5^2*xcp2*e1234+144*ammu*amuq*ammu2*amuq2*x4^2*xcp4*e1245-96*ammu*amuq*ammu2*amuq2*x4^2*xcp3*e1245+48*ammu*amuq*ammu2*amuq2*x4^2*xcp3*e1234+96*ammu*amuq*ammu2*amuq2*x4^2*

This expression has to be evaluated millions of times for different values of its 29 parameters.

Let’s load it in Symbolica:

from symbolica import *
f13 = E(open('f13.txt').read())
print(len(f13))

We see it has 105114 terms. Now let us evaluate it once:

vars = ["alpha", "amuq", "ammu", "xcp1", "e1245", "xcp4", "e3e2", "e1234", "e2345", "e1235",
        "e1345", "amel2", "e2e1", "e5e2", "e4e2", "e3e1", "e4e1", "e5e1", "ammu2", "amuq2", "e5e3",
        "e4e3", "x5", "x6", "x1", "x3", "x4", "xcp3", "xcp2"]

r = f13.evaluate({S(x): float(i) for i, x in enumerate(vars)}, {})

We get a timing of 26 ms ± 440 μs.

Next, we create an evaluator that constructs a Horner scheme, removes common subexpressions, and writes the computation in a linear format:

evaluator = f13.evaluator({}, {}, [S(x) for x in vars], iterations=10)

When we evaluate the optimized version, we get:

%timeit evaluator.evaluate([[float(i) for i in range(len(vars))]])

263 μs ± 1.85 μs. This is a factor 100 faster! Not bad.

C++ output generation

The linear representation is suitable to export to C++ (or Fortran, etc), which has the significant advantage that all branches can be removed and that CPU registers can be used efficiently. The Symbolica code is simple:

compiled_evaluator = evaluator.compile('f13', 'f13.cpp', 'f13.so', inline_asm=False)

The generated code is templated and looks similar to this:

#include <complex>
#include <cmath>

template<typename T>
void small(T* params,  T* out) {
    T Z9, Z10, Z11;
    Z9 = params[4]+params[5];
    Z10 = params[3]+params[6];
    Z11 = params[0]*Z10;
    Z11 = Z11+Z9;
    Z11 = params[0]*Z11;
    Z9 = Z11+Z9;
    Z9 = params[0]*Z9;
    Z10 = params[7]+Z10;
    Z9 = Z9+Z10;
    Z10 = T(5)+Z9;
    out[0] = Z9;
    out[1] = Z10;
    return;
}

extern "C" void small_double(double *params, double *out) {
  small(params, out);
  return;
}

extern "C" void small_complex(std::complex<double> *params, std::complex<double> *out) {
  small(params, out);
  return;
}

The compiled C++ library is automatically loaded back into Symbolica, so we can use the compiled version for fast double and complex<double> evaluations seamlessly:

%timeit compiled_evaluator.evaluate([[float(i) for i in range(len(vars))]])

This runs in 18.1 μs ± 633 ns.

%timeit compiled_evaluator.evaluate_complex([[float(i) for i in range(len(vars))]])

And the complex evaluation runs in 57.5 μs ± 2.51 μs.

This is a great speedup of a factor 15, but there is one downside… The C++ compilation takes 1100 seconds. For larger problems compilation can take more than a month… Ouch.

Why is g++ so slow? Well, it turns out that the compiler will try to do its own common subexpression elimination and register assignment. The compiler even undoes some of our nice common subexpression extraction when it constructs its own AST. I have tried my best to turn this behavior off using optimization flags, but it is hard, as it is intertwined with many other flags. The only ‘solution’ for quicker compilation is -O0, which is still slow and produces terrible assembly.

Speaking of which… Can’t we cut out the middle man and directly generate assembly ourselves? It turns out we can, and we can even do better than g++!

Custom assembly

Since we are using a limited number of operations, essentially only addition and multiplication, we can write decent quality assembly without too much pain. We’ll stick to normal C++ for function calls to cos. We write our inline assembly using AT&T syntax, which should work for any x86_64 machine (including Mac).

Here is a very small snippet of what the auto-generated inline assembly looks like:

static const double constants[5] = {3.142857142857143,1.,2.,5.,1};

extern "C" void test_double(const double *params, double* Z, double *out)
{
    __asm__(
        "movsd 8(%1), %%xmm0\n\t"
        "addsd 0(%2), %%xmm0\n\t"
        "movsd %%xmm0, 40(%0)\n\t"
        "movsd 24(%1), %%xmm0\n\t"
        "addsd 0(%2), %%xmm0\n\t"
        "movapd %%xmm0, %%xmm1\n\t"
        "mulsd %%xmm0, %%xmm1\n\t"
        "movsd 16(%1), %%xmm0\n\t"
        "mulsd 0(%2), %%xmm0\n\t"
        "mulsd %%xmm0, %%xmm0\n\t"
        "mulsd %%xmm1, %%xmm0\n\t"
        "movapd %%xmm1, %%xmm2\n\t"
        "addsd %%xmm0, %%xmm2\n\t"
        "movsd %%xmm2, 40(%0)\n\t"
        :
        : "r"(Z), "r"(constants), "r"(params)
        : "memory", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7", "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15");
    return;
}

We first look at the end of the assembly block __asm__, where we specify the memory addresses we want to have access to: Z, constants and params. The C++ compiler will make sure to load these addresses in appropriate assembly registers that we access with %0, %1 and %2 in our ASM block. The last entries are parts that will be modified by our assembly code: we will use all floating point registers xmm0 to xmm15 and we will also write into the Z array, which means our ASM code will trash (“clobber”) the memory. With these options set, C++ will not use optimizations that assume that the contents of the memory will remain unmodified in the assembly block.

Let’s try to decode some of the assembly in the test function. We have:

movsd 8(%1), %xmm0
addsd 0(%2), %xmm0
movsd %xmm0, 40(%0)

which reads: move the double at constants[1] into xmm0, add the double at params[0] to it, and write xmm0 to Z[5]. The offsets are in bytes and a double is 8 bytes wide, so offset 8 is index 1 and offset 40 is index 5. So we have written Z[5] = constants[1] + params[0].

Better assembly for complex arithmetic

For doubles the assembly operations for addition and multiplication are built-in (division is not and it requires a helper constant). How about complex numbers? A normal (non-SIMD) representation of a complex number is

struct Complex {
  re: f64,
  im: f64
}

which has the real and imaginary components adjacent in memory. Let’s see what assembly GCC generates for addition using Godbolt:

movupd  (%rsi), %xmm0
movupd  (%rdi), %xmm1
addpd   %xmm0, %xmm1

We notice one interesting tidbit: each xmm register is 128 bits wide, so we can simply load the entire complex number into one register. Complex addition is then a single instruction: addpd adds the two 128-bit registers component-wise.

Improving complex multiplication

Complex multiplication is more involved:

\[ (a + b i)(c + d i) = a c - b d + (ad + b c)i \] or written in components: \[ (a, b)(c,d) = (a c - b d, ad + b c) \]

GCC with -O3 -ffast-math generates the following for a multiplication (see Godbolt):

movupd   (%rdi), %xmm0
movupd   (%rsi), %xmm2
movapd   %xmm0, %xmm1
unpcklpd %xmm0, %xmm1
unpckhpd %xmm0, %xmm0
mulpd    %xmm2, %xmm1
shufpd   $1, %xmm2, %xmm2
mulpd    %xmm2, %xmm0
movapd   %xmm1, %xmm2
subpd    %xmm0, %xmm2
addpd    %xmm0, %xmm1
movsd    %xmm2, %xmm1

That’s 12 instructions and the result is stored in %xmm1. When I first saw this, I was immediately skeptical that this was the best way to write complex multiplication, as there is a lot of copying. I read up on the available instruction set and found that SSE3 has the instruction addsubpd, which performs a joint subtraction and addition. That sounds great, as we need to subtract b*d and add b*c.

After some puzzling I wrote the following code for complex multiplication, where every step is explained in the comments:

movupd   (%rdi), %xmm1    ; xmm1 = (a, b)
movupd   (%rsi), %xmm2    ; xmm2 = (c, d)
movapd   %xmm1, %xmm0     ; xmm0 = (a, b)
unpckhpd %xmm0, %xmm0     ; xmm0 = (b, b)
unpcklpd %xmm1, %xmm1     ; xmm1 = (a, a)
mulpd    %xmm2, %xmm0     ; xmm0 = (c, d) * (b, b) = (b * c, b * d)
mulpd    %xmm2, %xmm1     ; xmm1 = (c, d) * (a, a) = (a * c, a * d)
shufpd   $1, %xmm0, %xmm0 ; xmm0 = (b * d, b * c)
addsubpd %xmm0, %xmm1     ; xmm1 = (a * c, a * d) -+ (b * d, b * c) = (a * c - b * d, a * d + b * c)

These are only 9 instructions and it is about 20% faster in practice for the very common operation of multiplying two complex numbers!
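To convince yourself that the shuffles line up the right products, here is a quick sanity check in plain Python that mimics the SSE lanes, with each xmm register modelled as a (low, high) pair (purely illustrative, with arbitrary input values):

# Lane-by-lane emulation of the addsubpd-based complex multiplication.
a, b = 1.5, -2.0   # first factor  a + b*i
c, d = 0.5, 3.0    # second factor c + d*i

xmm1 = (a, b)                                     # movupd (%rdi), %xmm1
xmm2 = (c, d)                                     # movupd (%rsi), %xmm2
xmm0 = xmm1                                       # movapd %xmm1, %xmm0
xmm0 = (xmm0[1], xmm0[1])                         # unpckhpd  -> (b, b)
xmm1 = (xmm1[0], xmm1[0])                         # unpcklpd  -> (a, a)
xmm0 = (xmm2[0] * xmm0[0], xmm2[1] * xmm0[1])     # mulpd     -> (b*c, b*d)
xmm1 = (xmm2[0] * xmm1[0], xmm2[1] * xmm1[1])     # mulpd     -> (a*c, a*d)
xmm0 = (xmm0[1], xmm0[0])                         # shufpd $1 -> (b*d, b*c)
xmm1 = (xmm1[0] - xmm0[0], xmm1[1] + xmm0[1])     # addsubpd  -> (a*c - b*d, a*d + b*c)

ref = complex(a, b) * complex(c, d)
assert abs(xmm1[0] - ref.real) < 1e-12 and abs(xmm1[1] - ref.imag) < 1e-12
print(xmm1)   # (6.75, 3.5)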

Register recycling (part 2)

In the first version of my assembly output, every instruction always writes its result in the Z array. For example:

Z[3] = Z[0] + Z[1]
Z[3] = Z[3] * Z[4]

would write the result of the addition into Z[3] and then one line later it would read it again into a register. This back-and-forth reading and writing from registers to memory is slow and it would be much faster if we store Z[3] in one of the 16 xmm registers. This requires extra bookkeeping as we need some of these registers to do our computations and we don’t want to overwrite registers.

I wrote a fast algorithm that tries to keep results in registers, preferring results that will be used soon after their computation. For every Z[i] I keep track of when it is last used; this gives the range of instructions over which the value of Z[i] has to be kept alive (the last-use position minus the position where it is computed). We then sort the list of all instructions based on the length of this range. We loop over the sorted list and find out which registers are available over that entire range. If one is free, we select it and block it for the entire range.

Here is some Python pseudo-code:

# find, for every instruction, the position where its result is last used
last_use = [(i, i) for i in range(len(instructions))]
for (pos, (out, args)) in enumerate(instructions):
    for i in args:
        last_use[i] = (i, pos)

# sort instructions based on last-use distance
last_use = sorted(last_use, key=lambda x: x[1] - x[0])

# set all 16 xmm registers as available for every instruction
available = [set(range(16)) for _ in instructions]
register_to_use = [None for _ in instructions]

for (pos, last) in last_use:
    # find a register that is available for the entire live range
    av = set(available[pos])
    for x in range(pos + 1, last + 1):
        av = av & available[x]
    if len(av) > 0:
        reg = min(av)
        register_to_use[pos] = reg
        for x in range(pos, last + 1):
            available[x].remove(reg)

If the assembly operations require additional scratch registers (as complex multiplication does), the logic has to be adapted.

Putting it all together

Below is the full Symbolica program that reads the expression, optimizes it, compiles it with assembly and loads the library:

from symbolica import *
f13 = E(open('f13.txt').read())
vars = ["alpha", "amuq", "ammu", "xcp1", "e1245", "xcp4", "e3e2", "e1234", "e2345", "e1235",
        "e1345", "amel2", "e2e1", "e5e2", "e4e2", "e3e1", "e4e1", "e5e1", "ammu2", "amuq2", "e5e3",
        "e4e3", "x5", "x6", "x1", "x3", "x4", "xcp3", "xcp2"]

evaluator = f13.evaluator({}, {}, [S(x) for x in vars], iterations=0)
compiled_evaluator = evaluator.compile('f13', 'f13.cpp', 'f13.so', inline_asm=True)

The above command runs in a few seconds instead of more than 1100s. Now let’s time the evaluations:

%timeit compiled_evaluator.evaluate([[float(i) for i in range(len(vars))]])
%timeit compiled_evaluator.evaluate_complex([[float(i) for i in range(len(vars))]])

An evaluation with doubles takes 19 μs ± 547 ns (slightly slower than g++; this will be improved in the future). We started out with an evaluation time of 26 ms, so this is a factor 1300 improvement!

An evaluation with complex numbers takes 48 μs ± 563 ns, so this is indeed about 17% faster than the g++ version (and this does not use register recycling at the moment, so in the future it will be even faster).

The above machinery drastically improves the performance of, for example, Monte Carlo integration. For more advanced evaluations using arbitrary precision numbers or other number types such as Symbolica hyperduals, custom assembly will not yield much better code per se, so these evaluations can be run with the Symbolica evaluator directly.

Conclusion

We showed how large expressions can be optimized to allow for fast evaluation using custom assembly, rivaling and beating g++ in terms of output quality, at a fraction of the compilation time. We even found a better way to do complex multiplication along the way.

Want to learn more? Join the discussion on the Symbolica Discord or Zulip!

Support Symbolica

Symbolica is free for hobbyists and a single-core instance is free for non-commercial use. Would you like to use additional features or have unrestricted access? Your organization can purchase a license and support the Symbolica project!