
In this article, I'll explain how we can add multi-threading to our code with a single keyword, with no risk of data races!

TL;DR:

  • Structured concurrency makes multi-threading much easier.
  • Fearless concurrency prevents data races between threads.
  • Seamless structured concurrency lets us access any data from the containing scope.
  • It's possible to combine all three!
  • We're adding a parallel keyword to Vale to try it out!

Note that this is a theoretical feature, and not implemented yet; this article is just to give a hint at the direction Vale is going.

Structured Concurrency

If you've never used it, believe me when I say that structured concurrency is a lot of fun. 0

With structured concurrency, you can launch multiple threads at once, to run your calculations in parallel, for a big speedup. And often, it can be done with a tiny adjustment.

Imagine we had this simple program that calculated x^3 for some values of x:

c
#include <math.h>
#include <stdio.h>

int main() {
  int exponent = 3;
  int results[5];

  // Calculate some powers
  for (int i = 0; i < 5; i++) {
    results[i] = pow(i, exponent);
  }

  // Print out the results
  for (int i = 0; i < 5; i++) {
    printf("%d to the %d'th power is %d!\n", i, exponent, results[i]);
  }
  return 0;
}
stdout
0 to the 3'th power is 0!
1 to the 3'th power is 1!
2 to the 3'th power is 8!
3 to the 3'th power is 27!
4 to the 3'th power is 64!

If pow were much more expensive, we might want to run those calculations in parallel. 1

It can be pretty tedious to add threading to C code. We'd have to wrap our results[i] = pow(i, exponent); line in an entire new function, and use pthread_create, like so:

c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

typedef struct {
  int exponent;
  int i;
  int* results;
} MyThreadArgs;

void *my_thread_main(void* args_raw) {
  MyThreadArgs* args = (MyThreadArgs*)args_raw;
  int exponent = args->exponent;
  int i = args->i;
  int* results = args->results;

  results[i] = pow(i, exponent); // Some expensive calculation

  free(args_raw);
  return NULL;
}

int main() {
  int exponent = 3;

  int results[5];

  // Don't run each iteration one after the other...
  // run them in parallel, rather than serially.
  pthread_t threads[5];
  for (int i = 0; i < 5; i++) {
    MyThreadArgs* args = (MyThreadArgs*)malloc(sizeof(MyThreadArgs));
    args->exponent = exponent;
    args->i = i;
    args->results = results;
    pthread_create(&threads[i], NULL, my_thread_main, args);
  }

  // Join the threads
  for (int i = 0; i < 5; i++) {
    void* returnval = NULL;
    pthread_join(threads[i], &returnval);
  }

  for (int i = 0; i < 5; i++) {
    printf("%d to the %d'th power is %d!\n", i, exponent, results[i]);
  }
  return 0;
}

That's a lot of code to just run results[i] = pow(i, exponent); in parallel!

With structured concurrency, we can do that with just one line, using OpenMP. Let's add a #pragma omp parallel for to our original program:

c
#include <math.h>
#include <stdio.h>
#include <omp.h>

int main() {
  int exponent = 3;

  int results[5];
  
  // Launch some threads and run in parallel!
  #pragma omp parallel for
  for (int i = 0; i < 5; i++) {
    results[i] = pow(i, exponent); // some expensive calculation
  }

  for (int i = 0; i < 5; i++) {
    printf("%d to the %d'th power is %d!\n", i, exponent, results[i]);
  }
  return 0;
}

This is structured concurrency. It runs our iterations in parallel, and makes sure the parallel iterations are complete before continuing on.

Nathaniel Smith makes a great case that we should use structured concurrency rather than directly using thread APIs such as pthread_create, go, asyncio.create_task, etc. 2 Structured concurrency isn't a silver bullet, of course, but it's definitely a big step forward.

Side Notes
(interesting tangential thoughts)
0

My mind was blown the first time I used it with OpenMP. I had no idea that parallelism could be so easy!

And years later, I used CUDA to make a raytracer, which was similarly mind-boggling. They even appear to have lambda support now, which means we basically have structured concurrency on the GPU. Incredible!

1

We only use 5 ints and the pow function as a simple example. In practice, we use threads for much larger data sets.

In fact, this toy example will probably be a lot slower than the serial version, because threads have their own performance overhead, and because of false sharing: multiple threads writing to the same cache line, forcing their cores to repeatedly synchronize it. Threads tend to pay off for much larger data sets.
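
For instance, all five ints in results share a cache line. A common mitigation (a sketch, assuming a 64-byte cache line) is to pad each thread's slot so it gets its own line:

c
// Sketch: pad each slot out to its own cache line (64 bytes is common
// on x86), so parallel writes to neighboring slots don't contend.
typedef struct {
  int value;
  char padding[64 - sizeof(int)];
} PaddedInt;

PaddedInt results[5]; // each results[i].value now lives on its own line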

"Seamless" Structured Concurrency

OpenMP is a really amazing tool for C structured concurrency because it's seamless:

  • The threads can access any data from the surrounding scope. (Notice how all of our threads are accessing results and exponent.)
  • It's easy; if we want to write a parallel loop, we don't have to refactor our callers or rearchitect our program to enable it. We just start building!

In other words, seamless concurrency is the ability to read existing data concurrently without refactoring existing code.

This seamlessness is important because it saves us time, and we can more easily experiment with adding concurrency to more places in our program.

Some other implementations of concurrency aren't seamless, but they have other benefits. For example, fearless concurrency is immune to data race bugs. Read on to find out what that means, and how we can prevent them!

Data Races

Concurrency is "fearless" if data races are impossible. So what's a data race, and why do we want to avoid them?

A data race is when: 3

  • Two or more threads [or tasks] 4 concurrently access the same location in memory.
  • At least one of those accesses is a write.
  • At least one of them is unsynchronized.

Let's add a "progress" counter to our above snippet, and see a data race bug in action:

c
#include <math.h>
#include <stdio.h>
#include <omp.h>

int main() {
  int exponent = 3;

  int results[5];

  int numIterationsCompleted = 0;
  
  // Launch some threads and run in parallel!
  #pragma omp parallel for
  for (int i = 0; i < 5; i++) {
    results[i] = pow(i, exponent); // some expensive calculation

    // Read, add, and store
    numIterationsCompleted = numIterationsCompleted + 1;
    printf("Completed %d iterations!\n", numIterationsCompleted);
  }

  for (int i = 0; i < 5; i++) {
    printf("%d to the %d'th power is %d!\n", i, exponent, results[i]);
  }
  return 0;
}
stdout
Completed 1 iterations!
Completed 2 iterations!
Completed 3 iterations!
Completed 4 iterations!
Completed 4 iterations!
0 to the 3'th power is 0!
1 to the 3'th power is 1!
2 to the 3'th power is 8!
3 to the 3'th power is 27!
4 to the 3'th power is 64!

Look at that! Completed 4 iterations! was printed twice. 5 Why is that?

It's because the line numIterationsCompleted = numIterationsCompleted + 1; has three steps:

  • Load numIterationsCompleted into register A.
  • Add 1 to register A.
  • Store register A into numIterationsCompleted.

If two threads are running in parallel, we might see this happen:

  • 4th thread loads 3 from numIterationsCompleted into register A.
  • 5th thread loads 3 from numIterationsCompleted into register B.
  • 4th thread adds 1 to register A, it now contains 4.
  • 5th thread adds 1 to register B, it now contains 4.
  • 4th thread stores register A into numIterationsCompleted, it now contains 4.
  • 5th thread stores register B into numIterationsCompleted, it now contains 4.

The problem is that they didn't see each other's add operations; they're both adding 1 to the old value 3, to get 4.

If instead the 4th thread's store happened before the 5th thread's load, we'd have gotten the correct answer 5.

This is a data race bug. Data races are very difficult to detect, because they depend on the scheduling of the threads, which is effectively random.

Luckily, we've figured out how to avoid these problems, with some "concurrency ground rules":

  1. Multiple threads can read the same data, if nobody can modify it.
  2. If a thread can read data that another thread can modify, the data must be wrapped in a mutex or atomically updated.
  3. If data is visible to only one thread, that thread can access it freely.

We usually need proper fear, respect, and discipline to stick to these rules. But when a language enforces these rules, we have fearless concurrency.
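
To make rule #2 concrete, here's one way to repair the racy counter above, using OpenMP's atomic capture clause (a sketch; wrapping the update in an omp critical section or a pthread mutex would also work):

c
#include <math.h>
#include <stdio.h>
#include <omp.h>

int main() {
  int exponent = 3;
  int results[5];
  int numIterationsCompleted = 0;

  #pragma omp parallel for
  for (int i = 0; i < 5; i++) {
    results[i] = pow(i, exponent); // some expensive calculation

    int progress;
    // The read, add, and store now happen as one indivisible step,
    // so threads can no longer overwrite each other's increments.
    #pragma omp atomic capture
    progress = ++numIterationsCompleted;

    printf("Completed %d iterations!\n", progress);
  }

  for (int i = 0; i < 5; i++) {
    printf("%d to the %d'th power is %d!\n", i, exponent, results[i]);
  }
  return 0;
}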

4

I include this "[or tasks]" because non-parallel concurrency can have data races too. Python uses concurrency (not parallelism) and can still have data races, such as in this example.

5

If this doesn't happen for you, try running it a few more times. It's rather fickle.

Fearless Concurrency

Plenty of languages 6 offer fearless concurrency, but we'll look at Pony and Rust here.

Message Passing with Pony

Pony doesn't let us access data from the surrounding scope the way our C examples above did. Instead, we "send" data between "actors", which are similar to threads.

We can send data if either of these is true:

  • We know we have the only reference to the data (the reference has the iso permission). 7
  • We know it is deeply immutable (the reference has the val permission).

If we have a val or iso reference to an object, we can send it to another actor.

Pony has fearless concurrency because in this system, data races are impossible; we can never have a modifiable chunk of data that's visible to multiple threads.

Key takeaways from Pony:

  • A type system can track whether something's immutable.
  • Immutable objects can be shared with multiple threads without risk of data races. 8

We'll use these techniques below!
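
C can't enforce these permissions, but we can sketch the discipline that iso encodes: hand off our only pointer to the data, and never touch it again. (Our earlier pthread example already followed this discipline with its malloc'd args.)

c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

// "iso"-style transfer: the receiver ends up with the only reference.
void* consumer(void* arg) {
  int* data = (int*)arg;   // now the only reference to the data
  printf("Received %d\n", *data);
  free(data);              // so we're also responsible for freeing it
  return NULL;
}

int main() {
  int* data = malloc(sizeof(int));
  *data = 42;
  pthread_t thread;
  pthread_create(&thread, NULL, consumer, data);
  data = NULL; // by discipline, the sender gives up its reference;
               // Pony's type system is what makes this rule unbreakable
  pthread_join(thread, NULL);
  return 0;
}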

7

A "permission" is something that travels along with a pointer at compile time, which affects how you may use that pointer. For example, C++'s const or the mut in Rust's &mut

8

This is also true of Clojure. It's sometimes true with Rust; only some immutable borrows can be shared with multiple threads.

Structured Fearless Concurrency

Rust has fearless concurrency that feels a lot like Pony's:

  • Similar to iso, Rust's borrow checker enforces that we have the only reference to an object when we want to send it to another thread.
  • Similar to val, Rust can share things that are immutable.
    • For example, we can share an Arc<Vec<i32>>, which is an atomically reference counted, immutable vector of ints.

This is only part of Rust's multi-threading benefits; in the next article, we'll explain how its borrow checking lets us fearlessly use mutexes, and how we can combine that technique with seamless structured concurrency.

And because Rust has async lambdas, it can also sometimes 9 do structured concurrency!

rust
use async_scoped::*;

#[async_std::main]
async fn main() {
    Scope::scope_and_block(|s| {
        for _ in 0..5 {
            s.spawn((|| async {
                println!("Running task!");
            })());
        }
    });
}
stdout
Running task!
Running task!
Running task!
Running task!
Running task!

But alas, Rust doesn't have seamless structured concurrency, because it often can't access variables defined outside the task's scope. 10

9

There are some outstanding issues which prevent this from being generally usable, like sometimes requiring 'static lifetimes, blocking the running executor, or running unsafe code. Some very smart folks are working on these though; see tokio/1879 and tokio/2596.

10

Specifically, tasks can only capture things in the parent scope if they are Sync; see this example. Making data Sync to achieve concurrency is often not possible without potentially extensive rearchitecting of the containing program, so we can't quite say it's seamless. It's still pretty cool though!

How do we combine them?

We can make fearless and seamless structured concurrency by:

  1. Starting with C's seamless structured concurrency.
  2. Only allowing reading of values created outside the parallel block.
  3. Relaxing #2's restrictions in data-race-free ways (we'll explore this in Part 2).

Our C program almost follows these rules, except it violates rule #2; remember how we modified the results array inside the parallel block:

c
int exponent = 3;

int results[5];

// Launch some threads and run in parallel!
#pragma omp parallel for
for (int i = 0; i < 5; i++) {
  results[i] = pow(i, exponent);
}

But Vale has a foreach loop that can accumulate each iteration's result, and produce an array: 11

vale
exported func main() {
  exponent = 3;

  results =
    foreach i in range(0, 5) {
      pow(i, exponent) 12
    };

  println(results);
}

There! The loop doesn't modify anything created outside the block.

We can now add a theoretical parallel keyword. 13

vale
exported func main() {
  exponent = 3;

  results =
    parallel foreach i in range(0, 5) {
      pow(i, exponent)
    };

  println(results);
}

The parallel keyword will:

  • Make the loop body see all existing data as immutable. For example, the compiler won't let the loop body modify exponent.
  • Launch multiple threads, and divide the iterations among them. 14

Since no thread modifies the data, the data is truly, if temporarily, immutable.

Since the data is immutable, the threads can safely share it, as we saw in Pony.

We are now thread-safe!

And just because we can, let's modify something created inside the block:

vale
exported func main() {
  exponent = 3;

  results =
    parallel foreach i in range(0, 5) {
      a = pow(i, exponent);
      set a = a + 4; 15 // We can modify!
      a
    };

  println(results);
}

You can see how the compiler can enforce that we only modify things that came from inside the block.

If we call a function, it needs to know what it can modify and what it can't. For that, we use r'. Note how in this snippet, blur's first parameter's type is &r'[][]int. 16

vale
exported func main() {
  // Makes a 2D array of ints
  input = [](10, x => [](10, y => x * 10 + y));

  results =
    parallel foreach x in range(0, 10) {
      parallel foreach y in range(0, 10) {
        blur(input, x, y)
      }
    };

  println(results);
}

func blur(input &r'[][]int, x int, y int) int {
  ... // Loop over the 3x3 ints around x,y in input and return the average
}

That r' means we're pointing into a read-only region. More on that below!

Key takeaways:

  • We can make any existing data temporarily immutable, without refactoring!
  • We can launch threads that can safely read that data, in parallel!

11

Vale is a language we're making to explore techniques for easy, fast, and safe code.

We actually discovered this "seamless, fearless, structured concurrency" while playing around with Vale's region borrow checker, which blends mutable aliasing with Rust's borrow checker.

12

The lack of a ; here means this line produces the block's result, similar to Scala and Rust.

13

To reiterate, this feature is theoretical, and we're still adding it to Vale. Stay tuned!

14

The parallel block has a default executor, but it can be overridden with parallel(num_threads) or parallel(an_executor).

15

set a = a + 4; is like a = a + 4; in C.

16

This is similar to Rust's lifetimes, more on that in the afterword.

Read-only Regions

As stated above, to enable fearless structured concurrency, we need to express that a reference points into a read-only region. 17 We can think of a region as a scope.

When we have a reference to an object in a read-only region, we:

  • Can't modify it.
  • Can load a member (or element) from it. The type system will remember the element is from a read-only region.
  • Can pass it to a function, if the function sees the region as read-only.

We're using r' to specify that a reference is in a read-only region. For example, an &r'[]int is a non-owning reference to an array of integers inside a read-only region.

We can use these for our concurrency, with the following rule:

When inside a parallel statement, we see everything outside as a read-only region.

Elaborating on the r' a little more:

  • The r is arbitrary, we can name it anything we want.
  • By default, any named region is immutable. To make it mutable, we can say r'! or just leave it unnamed.

This may sound familiar from C++ 18 or Rust 19 20, but they don't have anything quite like it.

17

In Vale, regions can be mutable or immutable, but we "see" them as read-write or read-only. We can see both immutable and mutable regions as read-only. (This is similar to * and const * in C++.)

The user doesn't need to know about this distinction; if they ignore regions altogether the program's behavior will still make sense and be predictable.

18

A const * is similar, but we might not get a const * when we load a member/element from it.
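
For example, here's a sketch (the IntArray struct is hypothetical) of how C's const stops at the first pointer:

c
typedef struct {
  int* elements;
} IntArray;

void f(const IntArray* arr) {
  // arr itself is read-only, but the loaded member is a plain int*;
  // C's const doesn't propagate to the data that members point at.
  int* elements = arr->elements;
  elements[0] = 42; // compiles fine; the "read-only" view was shallow
}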

19

An immutable borrow (&) is similar, except those can't always be shared with multiple threads. Only objects with Sync can be shared with multiple threads.

20

The afterword goes more into the difference between Vale's regions and Rust's lifetimes.

Is there a catch?

There are a couple drawbacks:

  • We can only read the data outside the parallel block. 21
    • We relax this restriction in Part 2, using mutexes, channels, splitting, atomics, isolated sub-regions, and something called "region shifting".
  • All globals must be channels, mutexes, atomics, or immutable. 22

Next Steps

As we saw, we can combine...

  • Seamless structured concurrency, which allows us to access data from the parent scope, without refactoring our callers or surrounding program.
  • Fearless concurrency, which protects us from data races.

...into a new and interesting fearless, seamless, structured concurrency feature.

And in Part 2, we'll enable fearlessly sharing mutable data too, using channels, isolated subgraphs, "region-shifting", and mutexes. Stay tuned!

Thanks for reading! If you want to discuss more, come by the r/Vale subreddit or the Vale discord, and if you enjoy these articles or want to support our work on Vale, you're welcome to sponsor us on GitHub!

21

This means no "immutability escape hatches" such as RefCell; immutable must really mean immutable.

22

Though, this might be good practice anyway.

Afterword: Vale Regions and Rust Lifetimes

The r' seems to hint that they're similar! Let's take another look:

vale
func blur(input &r'[][]int, x int, y int) int {
  ... // Loop over the 3x3 ints around x,y in input and return the average
}

There are some similarities:

  • We can't mix up objects' regions; we can't use an object in r' where we expect an object in another region (e.g. b'); the compiler keeps them separate. 23
  • A reference to an object in region r' can't outlive the scope of region r'.
  • Vale names this a "region borrow checker", similar to Rust's "borrow checker".

The biggest difference is that Vale lifts mutability to the region level:

  • In Rust, references are either immutable (&) or mutable (&mut) borrows. In Vale, there is only &.
  • Instead, Vale's regions are inherently mutable or immutable. We see them as read-only (r') or read-write (r'! or unnamed).

The compiler enforces that nobody changes objects in immutable regions by:

  • Only giving those references to receivers who see the region as read-only.
  • Not allowing any changes to any object in a region we see as read-only.

The other big difference is that Vale has no aliasing restrictions.

  • In Rust, we can't have a mutable reference and an immutable reference to the same object.
  • In Vale, there's only one kind of reference (&), and a read-only region and a read-write region can refer to the same underlying region. 24

Because of these design decisions, Vale's regions are largely "opt-in":

  • One can write an entire Vale program without ever adding a region annotation or knowing about regions at all. 25
  • A new user can look at an existing program, ignore all region markers, parallel, and pure, and the program will still make sense. All of these features are designed to improve a program's performance without affecting its semantics.

We hope that these things will give Vale a gradual learning curve, and make life easy for newcomers.

This is a pretty high-level overview of the differences; feel free to swing by our discord and we'd be happy to explain it!

23

Though, we can have structs in one temporary region pointing into structs in another region, like r'MyStruct<b'>.

24

A function that accepts a read-only region will actually be generated twice: once with the assumption that the read-only region is immutable (which lets it take advantage of immutability optimizations), and once without.

25

This is because Vale gets its memory safety from Generational References, and soon possibly Hybrid-Generational Memory.