
Fix counter underflow race between semaphore fast-path CAS and slow-path fetch_sub#769

Open
Mattbusel wants to merge 1 commit into taskflow:master from Mattbusel:fix/semaphore-cas-loop

Conversation

@Mattbusel
Contributor

While stress-testing the semaphore under high contention, I identified a race between the lock-free fast path and the mutex-protected slow path in _try_acquire_or_wait that can silently corrupt the token counter.

The race

When _cur_value is 1, two threads can both observe count > 0 before either has decremented it:

Thread A (fast path): loads cur = 1, enters CAS loop
Thread B (slow path): fast-path loop exits (cur refreshed to 0 via spurious CAS),
                      acquires _mtx, reloads cur with relaxed ordering,
                      sees cur = 1, calls fetch_sub(1, relaxed)

If Thread A's CAS and Thread B's fetch_sub both succeed on the same count-1 slot, the counter wraps to SIZE_MAX (unsigned underflow). Both threads return true. Both tasks run past the semaphore simultaneously. The concurrency limit is violated. In the test suite this shows up as counter > N in the critical section test under 2+ workers.

A secondary issue: the slow-path reload used memory_order_relaxed, so it could read a value that did not reflect a concurrent fast-path decrement that had already committed. This widened the race window unnecessarily.

The fix

Replace fetch_sub(1, relaxed) in the slow path with a compare_exchange_weak loop, matching the fast path. The CAS atomically validates that the value has not changed since the load. If a concurrent fast-path CAS already claimed the slot, the slow-path CAS fails, reloads, and either retries (count still > 0) or parks in _waiters (count now 0). Underflow is structurally impossible with this approach.

The slow-path reload is changed to memory_order_acquire to synchronize with any fast-path decrement or _release increment that completed before the mutex was taken.

The fast-path CAS success ordering is changed from memory_order_acquire to memory_order_acq_rel. The release half is necessary so that the decrement store is visible to acquire-loads on other threads. Without it a releaser reading the counter with an acquire-load could observe a stale value.

The _release increment is moved inside the mutex. Previously the counter was incremented with a lock-free CAS before _mtx was acquired to drain _waiters. This opened a window where a fast-path acquire stole the new token, then _release also woke all parked waiters, giving both sides a concurrent hold on the semaphore. Incrementing inside the lock closes that window: no fast-path acquire can see the new count until the drain completes.

The acquire fast path remains lock-free. The mutex is only taken when the count is exhausted.

Testing

All 34 existing semaphore tests pass (80388 assertions), including the high-thread-count critical section tests and the linear chain tests that accumulate large waiter lists and expose the contention patterns most likely to trigger this race.

@tsung-wei-huang
Member

@Mattbusel this is an interesting contribution. Since we are introducing the atomic variable, I wonder if there is a chance to completely remove the mutex and just leverage this atomic variable. Any thoughts?

@Mattbusel
Contributor Author

@tsung-wei-huang I actually took a full run at this in #771 and tried three separate lock-free approaches before concluding that the mutex (or at minimum a spinlock) is structurally required on the write path:

  1. CAS on _bottom with a re-read of _array in steal: resize installs a new array after the CAS but before the write, copies null for the claimed slot, and the item is invisible to stealers forever.
  2. Resize spinning on null slots before copying: the pusher re-reads _array after resize publishes the new one and writes to the new array, while resize spins on the old array waiting for a write that went somewhere else. Deadlock.
  3. Stable-write loop (write, check whether _array changed, rewrite): a second resize between the check and the rewrite creates a third array that copied null from the second. Same invisible-item problem.

The root cause is that resize and multi-producer push share the array pointer. Any read-then-write through that pointer can be invalidated by a concurrent resize. The steal path doesn't have this problem and stays fully lock-free.

In #771 I landed on replacing the std::mutex with an atomic_flag spinlock, held for one array write plus one atomic store: nanoseconds on an uncontended cache line, which gets the kernel overhead out of the hot path while keeping correctness. But I don't see a path to fully removing serialization on the push side.


If you see an angle in there I missed, happy to revisit.
