
Three years, two states, and one pandemic ago, I wrote about a very weird idea: what if the type system could track which data existed before a pure call, to eliminate its memory safety overhead? 0

And a couple months later, another weird idea struck: what if we use generational indices as the foundation for an entire language?

These ideas evolved in weird ways. The first one evolved into a full region-based borrowing system. The second one became generational references. Together, they looked like they could form an entirely new approach to memory safety, one that doesn't use reference counting, tracing garbage collection, or borrow checking.

Basically, the coder writes their program in a normal C or C++ish way, and Vale's generational references keep everything memory safe. Then, the coder can use pure and region borrowing to eliminate most generation check overhead. 1 Add in some linear style, 2 and we can get generation checks down to zero for any Vale code.

In other words, this could make our Vale code very, very fast.

The exciting part is that region borrowing is completely opt-in. We could write our code in a normal, comfortable way, and later add region borrowing for the parts that we want to optimize, almost like a more flexible, opt-in borrow checker. 3 We could choose which parts of our program should be as flexible as Java, or as fast as Rust, or anywhere in-between.

But alas, it was all a theory. We couldn't play with it, because it wasn't real yet!

Side Notes
(interesting tangential thoughts)
0

See the original article here, but keep in mind that was written before I came up with generational references!

1

A "generation check" happens whenever we dereference a generational reference. It checks to make sure the target object is still alive.
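As a rough mental model (a Python sketch of my own, not how Vale actually implements it), a generational reference remembers the generation of its target, and each dereference compares the two:

```python
class Alloc:
    """A heap allocation; in this model, a generation number sits at its top."""
    def __init__(self, value, generation):
        self.generation = generation  # bumped when the slot is released or reused
        self.value = value

class GenRef:
    """A generational reference: a target plus the generation it remembers."""
    def __init__(self, target):
        self.target = target
        self.remembered = target.generation

    def deref(self):
        # The generation check: fail fast on use-after-free instead of
        # silently reading reused memory.
        if self.target.generation != self.remembered:
            raise RuntimeError("use-after-free detected")
        return self.target.value

obj = Alloc(value=7, generation=42)
ref = GenRef(obj)
assert ref.deref() == 7   # generations match, so the check passes

obj.generation += 1       # simulate releasing the allocation
try:
    ref.deref()           # now the generation check trips
    assert False
except RuntimeError:
    pass
```

In the real language this comparison is a couple of instructions, which is exactly the overhead that regions and pure calls can then eliminate.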

2

Linear style is where we never make a reference to an object unless we're handing it into a pure function.

3

This is possible because region borrowing composes much better with shared mutability; you can do as much aliasing as you want and then turn the entire world immutable via pure.
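A loose sketch of that composition (Python, purely illustrative; the names here are my own): aliases can be created freely, and a pure call conceptually freezes everything that existed before it for the duration of the call:

```python
class Node:
    """Toy mutable object with a 'frozen' switch standing in for region immutability."""
    def __init__(self, value):
        self.__dict__["frozen"] = False
        self.value = value  # goes through __setattr__ while unfrozen

    def __setattr__(self, name, val):
        if self.__dict__.get("frozen"):
            raise RuntimeError("cannot mutate pre-existing data during a pure call")
        self.__dict__[name] = val

def pure_call(fn, *objs):
    """Freeze everything that existed before the call, run fn, then unfreeze.
    While frozen, reads need no safety checks, since nothing can be freed."""
    for o in objs:
        o.__dict__["frozen"] = True
    try:
        return fn(*objs)
    finally:
        for o in objs:
            o.__dict__["frozen"] = False

a = Node(10)
b = a                     # alias as much as you want beforehand
total = pure_call(lambda n: n.value * 2, a)
assert total == 20
b.value = 11              # mutable again after the pure call returns
```

The point of the sketch: no per-alias bookkeeping is needed, because immutability is imposed wholesale at the pure boundary rather than tracked per reference.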

Building the Theory

Most of my free time for the past few years has gone into building out the compiler's foundations so that it could support this unexplored approach.

It was hard. Any kind of borrowing system is already pretty complex, and regions also require full generics, which are notoriously difficult. 4 5

On top of that, an entire new compiler stage was needed to get regions and generational references to work seamlessly together. 6

Finally, a few months ago, the regions prototype was finished. It's rough around the edges, 7 but it successfully compiled something for the first time.

With that, I made the first ever zero-check Vale program! 8

It was a program that uses Cellular Automata to generate a level for a roguelike game.
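The Vale source isn't shown here, but the classic cellular-automata technique it uses can be sketched like this (a Python illustration of the usual neighbor-counting smoothing rule, not the actual program):

```python
import random

def generate_level(width, height, fill=0.45, steps=4, seed=7):
    """Generate a cave-like roguelike level with cellular automata.
    True = wall, False = floor."""
    rng = random.Random(seed)
    grid = [[rng.random() < fill for _ in range(width)] for _ in range(height)]

    def walls_around(g, x, y):
        count = 0
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dx == 0 and dy == 0:
                    continue
                nx, ny = x + dx, y + dy
                # Out-of-bounds neighbors count as walls, sealing the border.
                if not (0 <= nx < width and 0 <= ny < height) or g[ny][nx]:
                    count += 1
        return count

    for _ in range(steps):
        # A floor cell with 5+ wall neighbors becomes a wall; a wall cell
        # with 4+ wall neighbors stays a wall. This smooths noise into caves.
        grid = [[walls_around(grid, x, y) >= 5
                 or (grid[y][x] and walls_around(grid, x, y) >= 4)
                 for x in range(width)]
                for y in range(height)]
    return grid

for row in generate_level(40, 20):
    print("".join("#" if cell else "." for cell in row))
```

Each smoothing step turns stray open cells into walls and erodes jagged edges, which is what gives these levels their cave-like look.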


Of course, it didn't work perfectly at first. Compilers are tricky. The slightest misstep in the compiler code will add extra instructions to the resulting assembly, causing artificial overhead in the final program. And sometimes, there are extra little bits of information you need to pass to the optimizer (or the CPU itself!) to trick it into the most optimal behavior.

To help me track down the problems, I kept comparing its assembly to the assembly generated by Vale's "unsafe" modes:

  • unsafe_no_bounds is similar to C; all memory-safety protections are turned off, and it uses raw pointers for everything, rather than generational references.
  • unsafe_with_bounds then adds bounds checking for array accesses, similar to how Rust does it.

After a couple months of tracking down differences, the resulting assembly looked nearly identical to Vale's unsafe_with_bounds mode! Every difference was expected 9 and everything looked pretty reasonable.

4

Previously, Vale had templates (like C++), not full generics.

5

It took the Golang team a decade to figure out their generics, and I don't blame them at all for that, after my own struggle with generics!

6

Under the hood, it reduces regions to "pure height" integers: negative for region generic parameters, zero for the default region, and increasing positive for every pure block.
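A toy model of that reduction (my own Python sketch of the mapping this note describes, not the compiler's code):

```python
def pure_heights(region_generic_params, num_nested_pure_blocks):
    """Map regions to 'pure height' integers: negative for region generic
    parameters, zero for the default region, and increasing positive
    for each nested pure block. Names here are illustrative."""
    heights = {}
    # Region generic parameters count downward from -1.
    for i, name in enumerate(region_generic_params):
        heights[name] = -(i + 1)
    heights["default"] = 0
    # Each nested pure block gets the next positive height.
    for depth in range(1, num_nested_pure_blocks + 1):
        heights[f"pure_{depth}"] = depth
    return heights

heights = pure_heights(["r", "s"], 2)
assert heights == {"r": -1, "s": -2, "default": 0, "pure_1": 1, "pure_2": 2}
```

With regions reduced to plain integers like this, "is this data immutable here?" becomes a cheap comparison of heights rather than a structural question.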

7

For example, its compile errors are very, very verbose, and a lot of things still just trigger assertions in the compiler. I'll be fixing all of these before merging it into the main branch.

8

You can count how many generation checks happen in a program via the --print_mem_overhead true compiler flag.

9

The only difference, which was expected, was that it put a pseudo-random generation number at the top of every allocation, though it never needed to read it for any generation checks. This number is really just a monotonically increasing register under the hood, to keep things fast. We'll be able to remove this once we add isolates or unique references.

The Benchmarks

Finally, I benchmarked the program again:

Summary
  './build_unsafe_no_bounds/main' ran
    1.18 ± 0.01 times faster than './build_unsafe_with_bounds/main'
    1.18 ± 0.01 times faster than './build_safe_fastest/main'

Success! Vale's normal mode (safe_fastest here) showed no slowdown compared to bounds checking alone.

In other words, this approach has no observable overhead. 10

Finally seeing this was a shock, a relief, and almost surreal. No overhead! We knew it was possible in theory, but seeing it happen for real still felt very surprising.

Feel free to play with it! Just build from the regions branch, check out the benchmarking scripts, and ask any questions in the Discord server.

And before we get too excited, let's keep these important details in mind:

  • This is not benchmarking against languages like C and Rust directly. Those compilers have years of unrelated optimizations that would just confound the experiment, so I compare with unsafe_no_bounds and unsafe_with_bounds to isolate those variables and get a more accurate comparison of the memory safety approaches.
  • This was benchmarked on a Razer Blade 15" 2018 (512GB SSD) running Ubuntu 22.04, using hyperfine inside a cset shield.
  • When I made larger programs, I observed quite a bit of optimizer noise, 11 where a minor change in one area would swing the measurements one way or another. 12 Benchmark results for the larger programs seemed rather fragile. We'll need a large set of benchmark programs to isolate away this optimizer noise.
  • In a larger program (a tiny roguelike game), I also observed that the optimizer didn't merge two identical branches of an if-statement, and missed a couple of other obvious optimizations. I'm not sure how the presence of an extra integer (especially one that's never read!) would affect this. It could even be a bug in LLVM; those are pretty common.

That last one hints that we might want our own Vale-specific pre-optimizer, similar to Rust's MIR, 13 since LLVM was designed more with C in mind. 14

Still, even with these details, these results are quite promising!

10

There could be overhead in theory, in the form of a nonatomic monotonically incrementing integer used for filling generations. It doesn't seem to affect the performance, likely because registers and simple arithmetic operations are so cheap on modern CPUs compared to the real bottleneck which is memory latency. The optimizer also often optimizes it out, since it sees nobody using these generations.
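In other words, the only candidate overhead is a single counter bump per allocation, something like this Python sketch:

```python
import itertools

# The one potential cost: a nonatomic counter, bumped once per allocation
# to fill in a fresh generation number.
_next_generation = itertools.count(1)

def fresh_generation():
    # In compiled code this is a single register increment; the optimizer
    # often removes it entirely when nothing ever reads the result.
    return next(_next_generation)

g1 = fresh_generation()
g2 = fresh_generation()
assert g2 == g1 + 1   # monotonically increasing
```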

11

This is not the same thing as benchmark noise. This benchmarking setup reported very consistent run times (hence the ± 0.01 in the output).

12

In fact, when I switched the size of the generation numbers, it consistently had negative overhead (1.13 ± 0.01), which is a bit weird considering that there weren't that many generation numbers in the program anyway. It did change the register allocations, so I suspect that's dwarfing any performance differences from anything actually semantically different in the programs.

13

Thanks bluejekyll for this correction!

14

We might want this anyway, as I'm pretty sure LLVM would treat generational references as undefined behavior, if it could figure out that we're intentionally accessing released memory.

What does this mean?

This means that generational references and regions combine to form a memory safety approach that is very, very fast.

It also means that this approach is actually viable, and could be desirable for quite a few software domains. A domain might desire this approach if:

  • It wants more predictable latency than tracing garbage collection.
  • It wants better performance and cache friendliness than reference counting.
  • It wants to prototype and iterate more easily than with borrow checking.

Where does Vale go from here?

The above benchmarks compared Vale's safe mode to Vale's unsafe modes, for a more accurate comparison of the memory safety approaches.

However, there are still a few things to do before Vale can really go toe-to-toe with languages like C and C++.

  • I'll need to start a Vale-specific pre-optimizer, since LLVM's optimizer seems to have some problems reasoning about generations and immutability.
  • Vale still needs to support inline data, instead of the temporary solution of putting all structs on the heap. (Note that this wouldn't affect the above benchmarks, which didn't use any structs.)
  • Regions are still just in the prototype phase. I'll need to smooth out the rough edges, pay down a bit of tech debt, and merge this code in before doing anything else.

After this is merged in, I'll be making the standard library use regions so that every user will benefit from them, even if their main program code doesn't use regions directly.

Conclusion

It's been an epic and exciting journey to get to this point! And now, we finally have some measurements to show that zero-check programs are possible, and that they're as fast as we hoped.

I want to give a massive thanks to everyone who has helped with this endeavor, especially our contributors and sponsors! I definitely would not have made it to this point without your support.

Cheers!

- Evan Ovadia