At Materialize, we recently encountered, investigated, and diagnosed a concurrency bug in the unbounded channels of crossbeam and in the corresponding unbounded channel implementation in the Rust standard library. Under rare but realizable interleavings, the bug could lead to a double free and consequently trigger undefined behavior (UB). Even though channels are ubiquitous in multithreaded Rust programs, this issue remained undetected for over a year. This serves as yet another reminder that concurrent code is notoriously difficult to get right. The fix for this bug is included in Rust 1.87.0, which was released today. In this blog post we will walk through our debugging journey, a precise description of the race condition, and the internal invariant that was violated.

How we got here

On February 26th, our CI runs began to intermittently fail with errors that indicated memory corruption. These errors surfaced as segmentation faults and panics, typically in jobs that ran under high concurrency and non-deterministic scheduling. We made several attempts at reproducing these errors by running the affected jobs in various combinations and configurations but the issue remained very rare.

A great tool for discovering memory errors is AddressSanitizer (ASan), which Google developed for C/C++ but which can also be used with Rust projects. Running our CI jobs under ASan had been broken for a while, but as the number of recorded failures kept rising, more people started paying attention and helping out with the issue. On March 17th one of our engineers got ASan working again and we started trying to reproduce the error in that environment. We immediately started 50 runs of the one-hour-long job that we deemed most likely to encounter the error, and in one of those we managed to capture an ASan trace.

==401==ERROR: AddressSanitizer: attempting double-free on 0x515000795200 in thread T55:
#1 0xaaaaec133894 in core::ptr::drop_in_place::>
#2 0xaaaaec0e8650 in >::release
[...]
freed by thread T56 here:
#1 0xaaaaee1c9440 in >::disconnect_receivers
[...]

This finding immediately turned our attention to crossbeam-channel, which had been updated from version 0.5.8 to version 0.5.14 on February 7th, shortly before we started observing the issues. This looked like a promising theory, so we reverted to 0.5.8, expecting the CI failures to stop and thereby validate it. Reverting the upgrade did cause the frequency of crashes to decrease dramatically, but they did not entirely disappear. The residual errors after the downgrade complicated the investigation: they cast doubt on whether crossbeam-channel 0.5.14 was truly responsible for the memory errors or whether the issue was elsewhere and the upgrade merely made the race condition more likely.

After many more tests and theories, on April 9th, around 40 days after the initial CI error, we finally discovered the race condition in crossbeam! Under certain conditions the unbounded implementation of crossbeam channels could end up with a double free, exactly as reported by ASan.

Unbounded channel structure

Crossbeam offers various types of channels, called flavors, which are made available to the user through a common facade of Sender<T> and Receiver<T> types. This facade is responsible for reference counting the number of active senders and receivers, similarly to how Arc<T> reference counts the number of active instances. On initialization the channel looks like this:

Maintaining separate reference counts for senders and receivers allows the channel to notify the receivers when all senders have disconnected and to eagerly clean up any unreceived messages the moment the last receiver is dropped.
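The two reference counts can be sketched roughly as follows. This is a minimal illustration rather than crossbeam's actual code: the `Counter` and `Released` names and the method names are hypothetical, chosen only to show how separate counts let each side detect when its last handle goes away.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical, simplified stand-in for the shared state behind the facade.
pub struct Counter {
    senders: AtomicUsize,
    receivers: AtomicUsize,
}

/// What the caller learns when it releases a handle.
#[derive(Debug, PartialEq)]
pub enum Released {
    /// Other handles of the same kind are still alive.
    StillConnected,
    /// This was the last handle of its kind, so that side must disconnect.
    LastOfKind,
}

impl Counter {
    /// A freshly created channel starts with one sender and one receiver.
    pub fn new() -> Self {
        Counter {
            senders: AtomicUsize::new(1),
            receivers: AtomicUsize::new(1),
        }
    }

    /// Cloning a sender only bumps the sender count.
    pub fn clone_sender(&self) {
        self.senders.fetch_add(1, Ordering::Relaxed);
    }

    /// Dropping a sender decrements the count; a previous value of 1 means
    /// this was the last sender and receivers should be notified.
    pub fn release_sender(&self) -> Released {
        if self.senders.fetch_sub(1, Ordering::AcqRel) == 1 {
            Released::LastOfKind
        } else {
            Released::StillConnected
        }
    }

    /// Dropping the last receiver is the point at which unreceived
    /// messages can be eagerly cleaned up.
    pub fn release_receiver(&self) -> Released {
        if self.receivers.fetch_sub(1, Ordering::AcqRel) == 1 {
            Released::LastOfKind
        } else {
            Released::StillConnected
        }
    }
}
```

In this sketch, a `LastOfKind` result from `release_receiver` corresponds to the moment the real channel would disconnect its receivers.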

The inner structure of the channel field depends on the specific channel flavor. The unbounded channel, referred to internally as the list flavor, is backed by a linked list of heap-allocated Block instances. Each block contains an array of 31 Slot values, and each slot contains a message of type T and a field to indicate the state of the slot.

The overall Channel struct holds on to a head and a tail position that point to the corresponding block in the linked list and also the corresponding slot in the block.

When a sender sends a message to a channel the tail pointer is advanced by one and a message is written to the slot. When a receiver receives a message from the channel the head pointer is advanced by one and a message is read from the slot. Whenever a sender uses the final slot of a block it additionally allocates the next block and sets the next pointer. Whenever a receiver uses the final slot of a block, it deallocates the block.
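As a rough picture of the layout described above, the types might look something like this. The field names, the `ListChannel` name, and the `has_allocated` helper are assumptions made for illustration; only the 31-slot block size and the head/tail positions come from the description.

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

/// Number of message slots per block, as described in the text.
pub const BLOCK_CAP: usize = 31;

/// One message plus a word describing the state of the slot.
pub struct Slot<T> {
    pub msg: UnsafeCell<MaybeUninit<T>>,
    pub state: AtomicUsize,
}

/// A heap-allocated node of the linked list.
pub struct Block<T> {
    pub next: AtomicPtr<Block<T>>,
    pub slots: [Slot<T>; BLOCK_CAP],
}

/// A cursor into the list: which block, and which slot within it.
pub struct Position<T> {
    pub index: AtomicUsize,
    pub block: AtomicPtr<Block<T>>,
}

/// Hypothetical name for the list-flavor channel state.
pub struct ListChannel<T> {
    /// Advanced only by receivers.
    pub head: Position<T>,
    /// Advanced only by senders.
    pub tail: Position<T>,
}

impl<T> Position<T> {
    fn empty() -> Self {
        Position {
            index: AtomicUsize::new(0),
            block: AtomicPtr::new(ptr::null_mut()),
        }
    }
}

impl<T> ListChannel<T> {
    /// Both positions start with a null block pointer.
    pub fn new() -> Self {
        ListChannel {
            head: Position::empty(),
            tail: Position::empty(),
        }
    }

    /// Illustrative helper: true once a sender has published a block via tail.
    pub fn has_allocated(&self) -> bool {
        !self.tail.block.load(Ordering::Acquire).is_null()
    }
}
```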

These steady-state operations are the most likely to be exercised in high-concurrency situations, since channel handles are usually cloned or moved to other threads. The code behind these operations is relatively simple to understand and verify because senders and receivers write to disjoint locations. The head field is only ever written to by receivers and the tail field is only ever written to by senders.

The exception to this is channel initialization. Like many data structures in Rust, the channel defers allocating the first block of the linked list until the first message is sent. When the channel is constructed both the head and the tail point to a null pointer and the first call to send a message will attempt to allocate the first block, set the tail pointer, and also set the head pointer.

Because channel initialization is two separate steps there is a moment in between where the channel is in a half-initialized state. This third possibility, the other two being uninitialized and initialized, must be taken into account in all other methods. This turned out to be a key ingredient in reproducing the race condition.
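The three states can be made concrete with a small model. This is a sketch under simplifying assumptions: `LazyList`, `InitState`, `publish_tail`, and `publish_head` are hypothetical names, and the block is represented by an opaque `*mut u8` rather than a real `Block<T>`.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// The three states another method may observe.
#[derive(Debug, PartialEq)]
pub enum InitState {
    Uninitialized,
    HalfInitialized,
    Initialized,
}

/// Hypothetical model holding only the two block pointers.
pub struct LazyList {
    head_block: AtomicPtr<u8>,
    tail_block: AtomicPtr<u8>,
}

impl LazyList {
    pub fn new() -> Self {
        LazyList {
            head_block: AtomicPtr::new(ptr::null_mut()),
            tail_block: AtomicPtr::new(ptr::null_mut()),
        }
    }

    /// Step 1: the first sender publishes the block through tail.
    /// Returns true if this caller won the race to initialize.
    pub fn publish_tail(&self, block: *mut u8) -> bool {
        self.tail_block
            .compare_exchange(ptr::null_mut(), block, Ordering::Release, Ordering::Relaxed)
            .is_ok()
    }

    /// Step 2: the same sender then publishes the block through head.
    pub fn publish_head(&self, block: *mut u8) {
        self.head_block.store(block, Ordering::Release);
    }

    /// Snapshot of the state. A thread preempted between the two steps
    /// above leaves the list observably half-initialized.
    pub fn state(&self) -> InitState {
        let tail = self.tail_block.load(Ordering::Acquire);
        let head = self.head_block.load(Ordering::Acquire);
        match (tail.is_null(), head.is_null()) {
            (true, _) => InitState::Uninitialized,
            (false, true) => InitState::HalfInitialized,
            (false, false) => InitState::Initialized,
        }
    }
}
```

Any method that reads head while another thread sits between the two steps sees a null head even though tail already points to a block.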

Race condition analysis

Armed with a good understanding of the channel structure we can now analyze the ASan trace and work backwards to find the conditions that trigger the bug. The ASan trace mentions two functions, the drop implementation for the channel and disconnect_receivers. We can see that the drop implementation for the channel attempted to free a pointer that had already been freed by disconnect_receivers.

The disconnect_receivers function is called when the last receiver is dropped. If the last receiver is dropped before the last sender, the function also calls discard_all_messages, which traverses the linked list from head to tail, deallocating blocks and invoking destructors on any enqueued messages. In the production binary that function got inlined, which is why ASan reported it as disconnect_receivers.

Similarly, when the last reference (sender or receiver) is dropped, Channel::drop runs and performs equivalent logic of deallocating the linked list of blocks.

The correctness of this design relies on the following invariant: if the head pointer is set, then it must point to valid memory. Consequently, when discard_all_messages wants to deallocate a block it must first atomically swap the head pointer to null, which effectively transfers full ownership of the linked list to that thread and gives it permission to deallocate it.
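The ownership-transfer idiom can be illustrated in isolation. The following is a minimal sketch, not channel code: `take` and `race_to_take` are hypothetical helpers showing that when several threads swap the same pointer to null, exactly one of them observes the non-null value and is therefore the only one allowed to free it.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};
use std::sync::Arc;
use std::thread;

/// Takes ownership of whatever `slot` points to, leaving null behind.
/// Only the caller whose swap returns a non-null pointer may free it.
pub fn take<T>(slot: &AtomicPtr<T>) -> Option<Box<T>> {
    let raw = slot.swap(ptr::null_mut(), Ordering::AcqRel);
    if raw.is_null() {
        None
    } else {
        // Safety: in this sketch the pointer always comes from
        // Box::into_raw, and the swap guarantees no other thread
        // received the same non-null value.
        Some(unsafe { Box::from_raw(raw) })
    }
}

/// Races `threads` threads to take one allocation and returns how many
/// of them obtained it.
pub fn race_to_take(threads: usize) -> usize {
    let slot = Arc::new(AtomicPtr::new(Box::into_raw(Box::new(42u32))));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let slot = Arc::clone(&slot);
            // The winning thread drops (frees) the Box it received.
            thread::spawn(move || take(&slot).is_some())
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|won| *won)
        .count()
}
```

A plain load in place of the swap would hand the same pointer to every caller, which is precisely what the invariant is meant to rule out.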

After reviewing the faulty implementation we observed that the block pointer is indeed swapped with a null pointer in the beginning but there is an additional code path where the block pointer is simply loaded without setting it to null. If that code path was ever taken, the following code would deallocate the block pointed to by head, violating the invariant.

rust
let mut block = self.head.block.swap(ptr::null_mut(), Ordering::AcqRel);
// Atomic swap ------------------^

if head >> SHIFT != tail >> SHIFT {
    while block.is_null() {
        backoff.snooze();
        block = self.head.block.load(Ordering::Acquire);
        // Plain load -----------^
    }
}
// ..code that deallocates the linked list pointed to by `block`

We can see that in order to take that path we must have head != tail and block == null. In other words we must have messages sent in the channel, meaning that a block has been allocated, but the head pointer is null. This might seem like an impossible situation but this is where the half-initialized state of the channel comes into play.

As mentioned in the previous section the channel initially has both tail and head point to null. When the first message is sent the channel goes through lazy initialization which first allocates a block, then sets tail to point to that block, and finally sets head to point to the same block.

rust
if block.is_null() {
    let new = Box::into_raw(Block::<T>::new());
    if self.tail.block.compare_exchange(...).is_ok() {
        self.head.block.store(new, Ordering::Release);
    }
}

We now have all the pieces of the puzzle:

  1. A channel with two senders and one receiver is created in thread A. One of the senders is sent to thread B.
  2. Thread B starts sending a message. Since it’s the first message it begins initialization and sets tail to the first slot of the allocated block.
  3. Before setting head, it is descheduled or preempted.
  4. Thread A sends a message. It observes tail is valid, successfully sends a message, and updates tail to point to the second slot of the block.
  5. Thread A drops its receiver. Since it’s the last receiver it calls discard_all_messages.
  6. discard_all_messages observes head != tail && head == null, and enters the spin loop.
  7. Thread B resumes and sets head to point to the allocated block.
  8. Thread A exits the spin loop and discard_all_messages deallocates the block.
  9. Then, one of the threads drops the last sender and Channel::drop is called. Since head is not null it is assumed to point to a valid block and a second deallocation is attempted leading to a double free.
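The interleaving above can be replayed deterministically in a single-threaded model. This sketch makes several simplifying assumptions: the `replay` function and its `reread_with_swap` flag are hypothetical, the block is an opaque byte, and frees are counted rather than performed so that the model itself never triggers undefined behavior. Passing `true` models a spin loop that swaps instead of loading, which is one way to restore the invariant.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Replays steps 2 through 9 in order and returns how many times the
/// block would have been freed.
pub fn replay(reread_with_swap: bool) -> usize {
    let block = Box::into_raw(Box::new(0u8));
    let head: AtomicPtr<u8> = AtomicPtr::new(ptr::null_mut());
    let tail: AtomicPtr<u8> = AtomicPtr::new(ptr::null_mut());
    let head_index = 0usize;
    let mut tail_index = 0usize;
    let mut frees = 0usize;

    // Steps 2-3: thread B publishes the block via tail, then is
    // preempted before it sets head.
    tail.store(block, Ordering::Release);

    // Step 4: thread A sees a valid tail and sends a message.
    assert!(!tail.load(Ordering::Acquire).is_null());
    tail_index += 1;

    // Steps 5-6: thread A drops the last receiver. The discard logic
    // swaps head to null, but head was never set, so it gets null.
    let mut taken = head.swap(ptr::null_mut(), Ordering::AcqRel);
    assert!(taken.is_null());

    // Step 7: thread B resumes and finally sets head.
    head.store(block, Ordering::Release);

    // Step 8: messages are pending, so thread A re-reads head until it
    // is non-null and then frees the block.
    if head_index != tail_index {
        while taken.is_null() {
            taken = if reread_with_swap {
                head.swap(ptr::null_mut(), Ordering::AcqRel)
            } else {
                head.load(Ordering::Acquire) // head keeps pointing at the block
            };
        }
    }
    if !taken.is_null() {
        frees += 1;
    }

    // Step 9: the final drop frees whatever head still points to.
    if !head.load(Ordering::Acquire).is_null() {
        frees += 1;
    }

    // Release the real allocation exactly once.
    unsafe { drop(Box::from_raw(block)) };
    frees
}
```

In this model `replay(false)` counts two frees of the same block, while `replay(true)` counts one.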

Impact and historical analysis

Having the full explanation at hand we then evaluated the impact of this bug and how it came to be. Specifically we wanted to know which versions are affected and, more importantly, whether the std channels, which are heavily based on crossbeam’s implementation, were affected by the same issue.

Incidentally, the piece of code that only triggers when the channel is in a half-initialized state was contributed by Materialize engineers in April of 2023 to fix another UB issue. In that version of the code the head pointer was being set to null at the end of the function, making sure that it doesn’t become a dangling pointer. Then, in February of 2024 a PR that fixed a memory leak changed only one of the loads to a swap operation, which introduced the possibility for a dangling pointer. The same change was contributed to Rust shortly afterwards.

From the commit history we were able to piece together a list of affected versions for crossbeam and the Rust std channels. The affected versions for crossbeam are 0.5.12, 0.5.13, and 0.5.14. The affected versions of Rust are all versions between 1.78.0 and 1.86.0 inclusive.

Contributing the fix

After confirming that Rust std channels had the same behavior, we prepared PRs (crossbeam#1187 and rust#139553) to fix the issue. Both communities were extremely responsive and merged the fix promptly, which made contributing a great experience. The maintainers of crossbeam-channel released a new version and yanked all the affected versions from crates.io. On the Rust side, the fix was merged and additionally nominated for backporting into the upcoming 1.87.0 release, which was great to see. Finally, the Tor community noticed the changes in crossbeam and, after checking in with the maintainers, issued a RUSTSEC security advisory.

Afterthoughts

This experience reaffirms that even in a language like Rust, where memory safety is a cornerstone, the presence of unsafe code and relaxed atomics introduces the potential for subtle and severe errors. Our findings underscore the critical importance of exhaustive CI, robust diagnostic tooling (e.g., ASan, Valgrind, Miri), and adversarial stress testing.

Moreover, we saw that the right conditions for the error can be rare enough that, even though the bug existed in such foundational libraries, it took over a year to find and fix. One of the difficulties is capturing the precise invariants that make a certain concurrent piece of code correct and ensuring that they continue to hold over time and as different people work on the same code. This is where formal methods shine, and we are excited to see efforts like AWS’ recent project on verifying the Rust std lib.

We hope this write-up serves as a valuable resource for Rust developers working on concurrent systems. Our ongoing work in this area continues to inform our engineering culture and shape our contributions to the open source ecosystem.
