
Fix counter underflow race between semaphore fast-path CAS and slow-path fetch_sub#769

Open
Mattbusel wants to merge 1 commit into taskflow:master from Mattbusel:fix/semaphore-cas-loop

Conversation

@Mattbusel
Contributor

While stress-testing the semaphore under high contention, I identified a race between the lock-free fast path and the mutex-protected slow path in _try_acquire_or_wait that can silently corrupt the token counter.

The race

When _cur_value is 1, two threads can both observe count > 0 before either has decremented it:

Thread A (fast path): loads cur = 1, enters CAS loop
Thread B (slow path): fast-path loop exits (cur refreshed to 0 via spurious CAS),
                      acquires _mtx, reloads cur with relaxed ordering,
                      sees cur = 1, calls fetch_sub(1, relaxed)

If Thread A's CAS and Thread B's fetch_sub both succeed on the same count-1 slot, the counter wraps to SIZE_MAX (unsigned underflow). Both threads return true. Both tasks run past the semaphore simultaneously. The concurrency limit is violated. In the test suite this shows up as counter > N in the critical section test under 2+ workers.

A secondary issue: the slow-path reload used memory_order_relaxed, so it could read a value that did not reflect a concurrent fast-path decrement that had already committed. This widened the race window unnecessarily.

The fix

Replace fetch_sub(1, relaxed) in the slow path with a compare_exchange_weak loop, matching the fast path. The CAS atomically validates that the value has not changed since the load. If a concurrent fast-path CAS already claimed the slot, the slow-path CAS fails, reloads, and either retries (count still > 0) or parks in _waiters (count now 0). Underflow is structurally impossible with this approach.

The slow-path reload is changed to memory_order_acquire to synchronize with any fast-path decrement or _release increment that completed before the mutex was taken.

The fast-path CAS success ordering is changed from memory_order_acquire to memory_order_acq_rel. The release half is necessary so that the decrement store is visible to acquire-loads on other threads. Without it a releaser reading the counter with an acquire-load could observe a stale value.

The _release increment is moved inside the mutex. Previously the counter was incremented with a lock-free CAS before _mtx was acquired to drain _waiters. This opened a window where a fast-path acquire stole the new token, then _release also woke all parked waiters, giving both sides a concurrent hold on the semaphore. Incrementing inside the lock closes that window: no fast-path acquire can see the new count until the drain completes.

The acquire fast path remains lock-free. The mutex is only taken when the count is exhausted.

Testing

All 34 existing semaphore tests pass (80388 assertions), including the high-thread-count critical section tests and the linear chain tests that accumulate large waiter lists and expose the contention patterns most likely to trigger this race.

@tsung-wei-huang
Member

@Mattbusel this is an interesting contribution. Since we are introducing the atomic variable, I wonder if there is a chance to completely remove the mutex and just leverage this atomic variable. Any thoughts?

@Mattbusel
Contributor Author

@tsung-wei-huang I actually took a full run at this in #771 and tried three separate lock-free approaches before concluding that the mutex (or at minimum a spinlock) is structurally required on the write path:

  1. CAS on _bottom with a re-read of _array in steal: resize installs a new array after the CAS but before the write, copies null for the claimed slot, and the item is invisible to stealers forever.
  2. Resize spinning on null slots before copying: the pusher re-reads _array after resize publishes the new one and writes to the new array, while resize spins on the old array waiting for a write that went somewhere else. Deadlock.
  3. Stable-write loop (write, check whether _array changed, rewrite): a second resize between the check and the rewrite creates a third array that copied null from the second. Same invisible-item problem.

The root cause is that resize and multi-producer push share the array pointer. Any read-then-write through that pointer can be invalidated by a concurrent resize. The steal path doesn't have this problem and stays fully lock-free.

In #771 I landed on replacing the std::mutex with an atomic_flag spinlock, held for one array write plus one atomic store: nanoseconds on an uncontended cache line, which gets the kernel overhead out of the hot path while keeping correctness. But I don't see a path to fully removing serialization on the push side.


If you see an angle in there I missed, happy to revisit.
