How I learned to stop worrying and love the global state

Categories: rust, code

Author: Ben Ruijl

Published: March 25, 2024

Introduction

After the goto statement, mutable global variables may be the most frowned-upon pattern in modern programming. There is a reason for their infamy: they are a canary in a coal mine for poor structural design, and they can easily lead to race conditions.

Thankfully, Rust makes scoped goto safe and reasonable to use (see my other blog post), and it achieves the same for mutable global variables. One cannot simply write

static MY_GLOBAL: Vec<usize> = Vec::new();

and expect to call MY_GLOBAL.push(5) from anywhere in the program, as globals (indicated with static¹) are immutable. We have to use wrappers that allow for interior mutability. In the end, the Rust compiler will take you by the hand and lead you to a setup such as:

¹ For compile-time constants, use const instead of static

static MY_GLOBAL: Mutex<MyData> = Mutex::new(MyData::new());

If write access is only needed upon initialization, it is more efficient to use a OnceLock over a Mutex.
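For example, a global that is written once at startup and only read afterwards could look like this (a minimal sketch; Config is a hypothetical type, not from the original code):

use std::sync::OnceLock;

// hypothetical configuration type, for illustration only
struct Config {
    verbosity: u32,
}

static CONFIG: OnceLock<Config> = OnceLock::new();

fn main() {
    // the first call initializes the global; all later reads are lock-free
    let config = CONFIG.get_or_init(|| Config { verbosity: 2 });
    println!("verbosity = {}", config.verbosity);
}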

Even though using global data is possible in Rust, there is a general preference to pass references of ‘global’ data to functions that need it. For example:

fn main() {
    let mut my_data = MyData::new();
    some_function(&mut my_data);
}

Reference passing makes the function dependencies explicit, and it can also help avoid locking in scenarios where scoped threads only require immutable references. For example:

fn main() {
    let data = vec![5];
    std::thread::scope(|s| {
        for _ in 0..5 {
            s.spawn(|| println!("{:?}", data));
        }
    });
}

where the threads all access the same data, without any locking.

This approach is also more in line with the principle of pure functions: functions that yield the same output when given the same input. Accessing non-constant global variables is surely not pure!²

² Pure functions that access the filesystem must even have some abstract global state of the entire system as an argument!

In this blog post I will show how I embraced the global state and removed intrusive function argument passing. Along the way we will encounter append-only structures, thread local storage, and custom garbage collection.

The state and the workspace

Symbolica is a computer algebra system, a library that can do symbolic mathematics (such as taking derivatives). Commonly, users define their own variables and functions (symbols) or they are defined automatically when an expression is parsed. For example:

let x = Atom::parse("x")?;

It would be extremely memory-inefficient to store the actual text string “x” (or worse, a user-defined symbol name of hundreds of characters!) in the internal representation of a mathematical expression. Instead, these symbol names are identified by a number. A naive (and also memory-inefficient) representation of an Atom would be


type Symbol = usize;

enum Atom {
    Variable(Symbol),
    Function(Symbol, Vec<Atom>),
    Num(i64),
    Mul(Vec<Atom>),
    Add(Vec<Atom>)
}

The association of the name (a string) and a symbol (a usize) needs to be stored somewhere and we call this the state.

In early versions of Symbolica, the user created their own state and passed a mutable reference to functions that may define new symbols, such as the parser, and passed immutable references to any other function. For example:

let mut state = State::new();
let x = Atom::parse("(1+x)^2", &mut state)?;

As a consequence, any function that requires access to the name of the symbol, such as for printing the atom, needs to have the state passed as an argument. What makes matters worse is that the state also stores more information, such as whether a function is symmetric in its arguments. Due to this, the state has to be passed to virtually any function. Yikes…

As a result, straightforward operator overloading for adding and multiplying two atoms was no longer possible; atoms had to be wrapped together with a reference to the state in a new struct, which implemented the operator overloading.

The state is not the only object that had to be passed around. There is also the workspace, a global cache. Creating atoms is expensive, as it requires heap allocations. Therefore Symbolica has an automatic atom recycling system. The workspace will be described in more detail in the next section.

In the end, this is what a small Symbolica program looked like:

let mut state = State::new();
let workspace = Workspace::new();
let x = Atom::parse("(1+x)^2", &mut state, &workspace)?;
let y = x.expand(&state, &workspace); 
println!("y={}", y.printer(&state)); // y = x^2 + 2*x + 1

Is there really no other way? It turns out there is, by going global! As the state and the workspace require very different solutions, we will treat them separately.

Enemy of the State

In principle, Symbolica could follow the Rust compiler’s advice and go with

static STATE: Mutex<State> = Mutex::new(State::new());

where the State looks like:

pub struct State {
    name_to_symbol: HashMap<String, usize>,
    symbol_to_name: Vec<String>
}

impl State {
    pub fn get_symbol(&mut self, name: &str) -> usize {
        if let Some(id) = self.name_to_symbol.get(name) {
            *id
        } else {
            self.symbol_to_name.push(name.to_owned());
            let new_id = self.symbol_to_name.len() - 1;
            self.name_to_symbol.insert(name.to_owned(), new_id);
            new_id
        }
    }

    pub fn get_name(&self, id: usize) -> &str {
        &self.symbol_to_name[id]
    }
}

In this setup, the symbol for any new name is its index in symbol_to_name.

Going the Mutex route incurs a significant performance cost every time the state is accessed, even if only for reading. Since new variables and functions are defined only sporadically, and often at the start of the program, this is wasteful. An RwLock may provide better performance, but it does not remove the synchronization overhead.
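To make the read-path cost concrete, here is a sketch (not Symbolica’s actual code) of what a lookup would look like with an RwLock-wrapped symbol table; note that the guard’s lifetime also prevents returning a &'static str, forcing a clone:

use std::sync::RwLock;

static SYMBOL_TO_NAME: RwLock<Vec<String>> = RwLock::new(Vec::new());

fn get_name(id: usize) -> String {
    // every read acquires the lock and must clone the name,
    // since the reference cannot outlive the lock guard
    SYMBOL_TO_NAME.read().unwrap()[id].clone()
}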

We want reading to be extremely cheap. Writing may have locks however, as defining new variables should never occur in performance-critical parts of the execution.

The key realization to solve this conundrum is that the state is append-only: new symbols can be defined, but old symbols can never be removed, as that would invalidate previously constructed atoms. Instead of considering the name_to_symbol and symbol_to_name to be part of the State, let us try to extract the symbol_to_name vector and make it append-only.

Thus, I started my search for append-only crates and stumbled upon elsa. elsa has a sync::FrozenVec whose push function takes &self instead of &mut self. This is safe because elsa forces every element to live on the heap, so a reference to an element (pointing into the heap) remains valid even if the vector itself is moved. Sadly, there is still synchronization overhead, as internally a RwLock is used.

Then I found the excellent append-only-vec crate, which does exactly what I want: it is an array of pointers to dynamically allocated fixed-size arrays (chunks). Once a chunk is full, a new fixed-size chunk is allocated. The data structure looks like [UnsafeCell<*mut T>; MAX_FIXED_CHUNKS]. Since all chunks are fixed in size, nothing ever moves in memory when new data is added, so a reference to an existing element is never invalidated. Accessing an index is virtually free.
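A minimal sketch of the crate in use (relying on the push-takes-&self, returns-index API that the code below builds on):

use append_only_vec::AppendOnlyVec;

static V: AppendOnlyVec<String> = AppendOnlyVec::new();

fn main() {
    // push takes &self and returns the flattened index of the new element
    let i = V.push("x".to_owned());
    // element references stay valid forever, so they can be 'static
    let name: &'static str = &V[i];
    println!("{}: {}", i, name);
}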

I can now move symbol_to_name out of the State, and make it global:

static SYMBOL_TO_NAME: AppendOnlyVec<String> = AppendOnlyVec::<String>::new();

pub struct State {
    name_to_symbol: HashMap<String, Symbol>,
}

impl State {
    pub fn get_symbol(&mut self, name: &str) -> usize {
        if let Some(id) = self.name_to_symbol.get(name) {
            *id
        } else {
            let new_id = SYMBOL_TO_NAME.push(name.to_owned());
            self.name_to_symbol.insert(name.to_owned(), new_id);
            new_id
        }
    }

    pub fn get_name(id: usize) -> &'static str {
        &SYMBOL_TO_NAME[id]
    }
}

Pushing into SYMBOL_TO_NAME returns the (flattened) index at which the element was inserted.

We can now drop all dependencies on &State, since in order to get the name of a variable we can now call State::get_name(my_id) instead!

The new setup has a problem though: if the user creates two states and defines the same symbol in both, it is added twice to SYMBOL_TO_NAME, since we never check whether the symbol already exists there, and such a check would require some kind of lock. The solution is simple and achieves exactly what we wanted in the first place: we allow only one state (the constructor is private) and make it global:

static SYMBOL_TO_NAME: AppendOnlyVec<String> = AppendOnlyVec::<String>::new();
static STATE: Mutex<State> = Mutex::new(State::new());

impl State {
    const fn new() -> State {
        State {
            name_to_symbol: HashMap::new()
        }
    }

    pub fn get_symbol(name: &str) -> usize {
        let mut state = STATE.lock().unwrap();
    
        if let Some(id) = state.name_to_symbol.get(name) {
            *id
        } else {
            let new_id = SYMBOL_TO_NAME.push(name.to_owned());
            state.name_to_symbol.insert(name.to_owned(), new_id);
            new_id
        }
    }

    pub fn get_name(id: usize) -> &'static str {
        &SYMBOL_TO_NAME[id]
    }
}

Now we have a truly global state that takes a lock when defining new symbols, but requires no lock at all to access information about defined symbols! This means that we can remove all &mut State from Symbolica as well and use State::get_symbol(name) instead.
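For illustration, using the API defined above now looks like this:

fn main() {
    // defining a symbol takes the mutex; repeated calls yield the same id
    let x1 = State::get_symbol("x");
    let x2 = State::get_symbol("x");
    assert_eq!(x1, x2);

    // reading a name back is lock-free and returns a 'static reference
    assert_eq!(State::get_name(x1), "x");
}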

One problem down, one to go.

Thread-local Storage Wars

A Symbolica expression, called an Atom, is represented by a vector of bytes, a Vec<u8>. For example, f(x, 2/5) is coded in only 15 bytes as

[3, 10, 0, 0, 0, 17, 1, 2, 2, 1, 0, 1, 17, 2, 5]

which allows many expressions to fit in memory. The details of the representation are a story for another blog post, but the important part for now is that creating a new atom requires a heap allocation.

Atom recycling

During symbolic computations, such as performing an expansion, temporary atoms are needed. Allocating new vectors all the time is rather slow, so Symbolica caches them. In earlier versions this worked through garbage collection and recycling. The idea is as follows:

  • We have a cache of objects that are resettable to their initial state and they implement the ResettableBuffer trait:
/// A buffer that can be reset to its initial state.
/// The `new` function may allocate, but the `reset` function must not.
pub trait ResettableBuffer: Sized {
    /// Create a new resettable buffer. May allocate.
    fn new() -> Self;
    /// Reset the buffer to its initial state. Must not allocate.
    fn reset(&mut self);
}
  • The cache looks like this:
pub struct Cache<T: ResettableBuffer>(RefCell<Vec<T>>);

The vector is wrapped in a RefCell for interior mutability. We use RefCell instead of Mutex as it makes sense to have a memory cache per thread, to eliminate any synchronization overhead.

  • When an object is requested from the cache, we pop an element from the vector, and return it, wrapped in a handle:
/// A handle to an underlying resettable buffer. When this handle is dropped,
/// the buffer is returned to the cache it was created by.
pub struct BufferHandle<'a, T: ResettableBuffer> {
    buf: Option<T>,
    parent: &'a Cache<T>,
}

where parent is a reference to the cache the handle was created from. Here is the code to get a handle and to return a ResettableBuffer to the cache.

impl<T: ResettableBuffer> Cache<T> {
    /// Get a buffer from the cache if the cache is not empty,
    /// else create a new one.
    #[inline]
    pub fn get_handle(&self) -> BufferHandle<T> {
        let b = if let Ok(mut a) = self.0.try_borrow_mut() {
            if let Some(b) = a.pop() {
                b
            } else {
                T::new()
            }
        } else {
            T::new() // should never happen
        };

        BufferHandle {
            buf: Some(b),
            parent: self,
        }
    }

    /// Return a buffer to the cache.
    #[inline]
    fn return_arg(&self, mut b: T) {
        if let Ok(mut a) = self.0.try_borrow_mut() {
            b.reset();
            a.push(b);
        }
    }
}

Note that I am using try_borrow_mut instead of borrow_mut() for an additional few percent of performance. The case where the borrow fails should never happen, as the workspace is not shared between threads.

  • Now for the final trick: when the handle is dropped, return the object back to the cache
impl<'a, T: ResettableBuffer> Drop for BufferHandle<'a, T> {
    /// Upon dropping the handle, the buffer is returned to the cache it was created by.
    #[inline]
    fn drop(&mut self) {
        let buffer = std::mem::take(&mut self.buf).unwrap();
        self.parent.return_arg(buffer)
    }
}

This completes the fully automatic garbage collection and object recycling: one cannot accidentally forget to return an atom to the cache, Rust will do it for you when the handle goes out of scope!
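To see the recycling in action, here is a small usage sketch that implements ResettableBuffer for a plain byte vector (the private fields are accessed directly here, which assumes same-module code):

use std::cell::RefCell;

// a byte buffer as a cacheable object: reset clears the contents
// but keeps the allocation, so reuse avoids a new heap allocation
impl ResettableBuffer for Vec<u8> {
    fn new() -> Self {
        Vec::new()
    }

    fn reset(&mut self) {
        self.clear();
    }
}

fn main() {
    let cache: Cache<Vec<u8>> = Cache(RefCell::new(Vec::new()));
    {
        let mut handle = cache.get_handle();
        handle.buf.as_mut().unwrap().extend_from_slice(b"temporary");
    } // the handle is dropped here: the buffer is reset and returned
    assert_eq!(cache.0.borrow().len(), 1);
}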

Using a cache makes real-world Symbolica code about 25% faster. The price to pay is that the cache needs to be passed as an argument to virtually every function. Can we just use the same trick we did for the state? Well, not quite: contrary to the state, we want a workspace per thread, and we certainly don’t want any locking. Enter thread-local storage.

Thread-local storage

Thread-local storage (TLS) allows us to define static variables that have an independent copy per thread:

thread_local!(
    static WORKSPACE: Workspace = const { Workspace(RefCell::new(Vec::new())) }
);

type Atom = Vec<u8>;
pub struct Workspace(RefCell<Vec<Atom>>);

Because the workspace is constructed in a const block (equivalently, via a const constructor), the initialization check of the thread-local storage is compiled away.

Here is what the impl block looks like:

impl Workspace {
    /// Get a thread-local workspace.
    #[inline]
    pub const fn get_local() -> &'static LocalKey<Workspace> {
        &WORKSPACE
    }

    /// Create an atom that will be recycled.
    #[inline]
    pub fn new_atom(&self) -> RecycledAtom {
        if let Ok(mut a) = self.0.try_borrow_mut() {
            if let Some(b) = a.pop() {
                RecycledAtom(b)
            } else {
                RecycledAtom(Atom::default())
            }
        } else {
            RecycledAtom(Atom::default())
        }
    }
}

Contrary to a BufferHandle<'a> that had a reference to the cache, we now return a RecycledAtom:

pub struct RecycledAtom(Atom);

which is simply a wrapper around an Atom that, upon being dropped, returns its content to the current thread-local workspace (which may differ from the one it was created in):

impl Drop for RecycledAtom {
    #[inline]
    fn drop(&mut self) {
        let _ = WORKSPACE.try_with(
            #[inline(always)]
            |ws| {
                if let Ok(mut a) = ws.0.try_borrow_mut() {
                    a.push(std::mem::take(&mut self.0));
                }
            },
        );
    }
}

Once again I use try_with instead of the panicking with for small performance gains. I also use #[inline(always)] on the closure to make sure it actually gets inlined.

Note that I could have also implemented a custom Drop on Atom directly instead of using RecycledAtom, but I have made a newtype to have a bit more control over which objects are made recyclable.
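For a RecycledAtom to be usable wherever an Atom is expected, one would presumably also implement Deref and DerefMut; a sketch of that (my assumption, not code shown in this post):

use std::ops::{Deref, DerefMut};

// assumption: transparently expose the wrapped Atom so that all of
// its methods are available directly on a RecycledAtom
impl Deref for RecycledAtom {
    type Target = Atom;

    #[inline]
    fn deref(&self) -> &Atom {
        &self.0
    }
}

impl DerefMut for RecycledAtom {
    #[inline]
    fn deref_mut(&mut self) -> &mut Atom {
        &mut self.0
    }
}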

Whenever a new temporary atom is needed, Workspace::get_local() can now be called:

Workspace::get_local().with(|ws| {
    let atom = ws.new_atom();
    // some operations
});

Leaky optimizations

Since threads in Symbolica are long-lived (they live as long as the main process itself), we can remove the TLS overhead caused by the code that checks whether we are in the clean-up phase of the thread-local storage. This can be achieved by wrapping the workspace in a ManuallyDrop:

thread_local!(
    static WORKSPACE: ManuallyDrop<Workspace> = 
            const { ManuallyDrop::new(Workspace(RefCell::new(Vec::new()))) }
);

Buyer beware: the workspace memory is leaked when a thread finishes, and it is only reclaimed by the OS once the whole process exits. For short-lived threads, this memory leak can be avoided by clearing the workspace just before the thread finishes.
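A hypothetical helper for that last case (clear_local_workspace is an illustrative name, not part of the post’s API):

/// Hypothetical helper: empty the current thread's workspace so that
/// a short-lived thread does not leak its cached atoms.
pub fn clear_local_workspace() {
    let _ = WORKSPACE.try_with(|ws| {
        if let Ok(mut cache) = ws.0.try_borrow_mut() {
            // dropping the cached atoms frees their heap allocations
            cache.clear();
        }
    });
}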

Conclusion

We have achieved our goal of removing the intrusive passing of the state and the cache as function arguments, by making use of statics, append-only data structures, and thread-local storage, all with minimal overhead. This allows for succinct Rust code that resembles the use of Symbolica in Python:

use symbolica::{
    fun,
    atom::{Atom, FunctionBuilder},
    state::State,
};

fn main() -> Result<(), String> {
    let x = Atom::parse("x")?;
    let y = Atom::parse("y")?;
    let f_id = State::get_symbol("f");

    let f = fun!(f_id, x, y, Atom::new_num(2));

    let r = (-(&y + &x + 2) * &y * 6).npow(5) / &y * &f / 4;

    println!("{}", r); // (-6*y*(x+y+2))^5*f(x,y,2)/y/4
    Ok(())
}

I hope that the design techniques described in this blog are also useful outside of Symbolica. Please let me know in the comments!