Fix counter underflow race between semaphore fast-path CAS and slow-path fetch_sub #769
Mattbusel wants to merge 1 commit into taskflow:master
Conversation
While stress-testing the semaphore under high contention, I identified a
race between the lock-free fast path and the mutex-protected slow path in
_try_acquire_or_wait that can silently corrupt the token counter.
The race
When _cur_value is 1, two threads can both observe count > 0 before
either has decremented it:
Thread A (fast path): loads cur = 1, begins its CAS loop
Thread B (slow path): its fast-path loop exits on a spurious read of cur = 0, acquires _mtx, reloads cur = 1 with relaxed ordering, and calls fetch_sub(1, relaxed)
If Thread A's CAS and Thread B's fetch_sub both succeed on the same
count-1 slot, the counter wraps to SIZE_MAX (unsigned underflow). Both
threads return true from _try_acquire_or_wait, and both tasks proceed past
the semaphore simultaneously: the concurrency limit is violated. In the
test suite this shows up as counter > N in the critical section test under
2+ workers.
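The interleaving can be replayed deterministically by running the two threads' steps in sequence. This is a hypothetical reproduction sketch, not taskflow's actual code; the counter name follows the PR text:

```cpp
#include <atomic>
#include <cstddef>

// Sequential replay of the race described above: Thread B has already
// observed count == 1, Thread A's CAS then claims the last token, and
// Thread B's unconditional fetch_sub wraps the unsigned counter.
inline std::size_t simulate_underflow() {
  std::atomic<std::size_t> cur_value{1};

  // Thread B (slow path): holds the mutex, reloads the count, observes 1.
  std::size_t b_observed = cur_value.load(std::memory_order_relaxed);

  // Thread A (fast path): CAS claims the last token, 1 -> 0.
  std::size_t expected = 1;
  cur_value.compare_exchange_strong(expected, 0, std::memory_order_acquire);

  // Thread B: still trusting its earlier load, decrements unconditionally.
  // fetch_sub on a counter that is already 0 wraps to SIZE_MAX.
  if (b_observed > 0) {
    cur_value.fetch_sub(1, std::memory_order_relaxed);
  }
  return cur_value.load();
}
```

Both "threads" believe they took the last token, and the counter ends at SIZE_MAX.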
A secondary issue: the slow-path reload used memory_order_relaxed, so it
could read a stale value that did not reflect a concurrent fast-path
decrement that had already happened. This made the window for the above
race larger than it needed to be.
The fix
Replace fetch_sub(1, relaxed) in the slow path with a compare_exchange_weak
loop, matching the structure of the fast path. The CAS atomically validates
that the value has not changed since the load. If a concurrent fast-path CAS
already claimed the slot, the slow-path CAS fails, reloads the current count,
and either retries (if still > 0) or parks in _waiters (if now 0). Underflow
is structurally impossible.
Change the slow-path reload to memory_order_acquire so it synchronizes with
any fast-path decrement or _release increment that completed before the mutex
was taken.
Change the fast-path CAS success ordering from memory_order_acquire to
memory_order_acq_rel. The release half is required so that the decrement
store is visible to acquire-loads on other threads. Without it, a releaser
that does an acquire-load on the counter could read a stale value and
incorrectly conclude no token was taken.
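Under those assumptions, the repaired decrement looks roughly like the following. This is a minimal sketch using a free function and a plain atomic rather than taskflow's actual members; the parking step is elided:

```cpp
#include <atomic>
#include <cstddef>

// Sketch of the fixed decrement path. The CAS loop validates the count
// at the moment of the write, so the counter can never go below zero.
inline bool try_acquire_token(std::atomic<std::size_t>& cur_value) {
  // Acquire load: synchronizes with a completed fast-path decrement or
  // a _release increment, so we never act on a stale count.
  std::size_t cur = cur_value.load(std::memory_order_acquire);
  while (cur > 0) {
    // acq_rel on success: the acquire half pairs with the releaser's
    // store, the release half publishes this decrement to other threads.
    if (cur_value.compare_exchange_weak(cur, cur - 1,
                                        std::memory_order_acq_rel,
                                        std::memory_order_acquire)) {
      return true;  // token claimed; underflow is impossible
    }
    // CAS failed: cur now holds the freshly reloaded value; retry if it
    // is still > 0, otherwise fall through.
  }
  return false;  // count exhausted; the caller would park in _waiters here
}
```

With the counter at 1, exactly one call succeeds and a second call reports exhaustion instead of wrapping.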
Move the _release increment inside the mutex. The previous design incremented
the counter with a lock-free CAS before acquiring _mtx to drain _waiters.
This creates a window where a fast-path acquire steals the new token, and
_release then also wakes all parked waiters. Both the fast-path acquirer and
the rescheduled waiters hold the semaphore simultaneously. Incrementing inside
the lock closes this window: a fast-path acquire cannot see the new count
until _release has finished draining _waiters, so the token goes to exactly
one side.
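The reordered release can be sketched as follows. The struct and member names here are illustrative stand-ins, not taskflow's real types; the point is only that the increment and the drain happen under the same lock:

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <vector>

// Sketch of the reordered _release: the token is published and the
// waiter list drained inside one mutex acquisition.
struct SemaphoreSketch {
  std::atomic<std::size_t> cur_value{0};
  std::mutex mtx;
  std::vector<int> waiters;  // stand-in for the parked-task list

  void release() {
    std::vector<int> woken;
    {
      std::lock_guard<std::mutex> lock(mtx);
      // Increment under the lock; release ordering pairs with the
      // acquire loads on the fast path.
      cur_value.fetch_add(1, std::memory_order_release);
      woken.swap(waiters);  // drain under the same lock
    }
    // ... reschedule the tasks in `woken` outside the lock ...
  }
};
```

One release with a parked waiter leaves the counter at 1 and the waiter list empty, handed off in a single critical section.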
The acquire fast path remains fully lock-free. The mutex is taken only when
the count is exhausted, preserving the intent of the original design. The
_release path now matches the semantics: one mutex acquisition handles both
the increment and the drain atomically.
Testing
All 34 existing semaphore tests pass (80388 assertions), including the
high-thread-count critical section and linear chain tests that expose the
contention patterns most likely to trigger this race.
@Mattbusel this is an interesting contribution. Since we are introducing the atomic variable, I wonder if there is a chance to completely remove the mutex and just leverage this atomic variable. Any thoughts?
@tsung-wei-huang I actually took a full run at this in #771, and tried three separate lock-free approaches before concluding the mutex (or at minimum a spinlock) is structurally required on the write path.
The root cause is that resize and multi-producer push share the array pointer. Any read-then-write through that pointer can be invalidated by a concurrent resize. The steal path doesn't have this problem and stays fully lock-free. In #771 I landed on replacing the … If you see an angle in there I missed, happy to revisit.