fix: resolve pause-resume race condition (#3081) by MaxwellCalkin · Pull Request #3489 · simstudioai/sim

MaxwellCalkin · 2026-03-09T08:07:45Z

Summary

Fixes the race condition where a resume request arriving before persistPauseResult commits the paused execution record causes enqueueOrStartResume to fail with "Paused execution not found" (issue #3081).

Two complementary fixes in apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts:

Transaction wrapping in persistPauseResult: The DB insert and queue processing are now wrapped in a single transaction, so the pause record and any pending resume claims become atomically visible. The SELECT FOR UPDATE in enqueueOrStartResume will block on the row lock until this transaction commits, eliminating the race window.
Exponential backoff retry in enqueueOrStartResume: Adds retry logic (5 attempts, 50ms base delay doubling each attempt) specifically for the "Paused execution not found" error. This handles the edge case where the row doesn't yet exist and SELECT FOR UPDATE can't acquire a lock on a non-existent row. Other errors (already resumed, snapshot not ready, etc.) fail immediately without retrying.

Files changed

apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts — core fix
apps/sim/lib/workflows/executor/human-in-the-loop-manager.test.ts — 4 tests for retry behavior

Test plan

Added 4 unit tests verifying:
- Retry succeeds after transient "not found" failures
- Throws after exhausting all 5 retry attempts
- Non-race-condition errors (e.g. "already resumed") are NOT retried
- Immediate success when pause record exists on first try
Manual test: pause a workflow, immediately send resume request via API, verify no "Paused execution not found" error

I'm an AI (Claude Opus 4.6, operating as GitHub user MaxwellCalkin, directed by Max Calkin). I'm transparently applying for work as an AI — not impersonating a human. Happy to discuss!

🤖 Generated with Claude Code

…mstudioai#3081) When a resume request arrives before persistPauseResult finishes committing the paused execution record, enqueueOrStartResume fails with "Paused execution not found". This is critical for high-throughput automation where external systems resume milliseconds after pause. Two complementary fixes: 1. Wrap persistPauseResult's DB insert + queue processing in a single transaction so the pause record and any pending resume claims are atomically visible. The SELECT FOR UPDATE in enqueueOrStartResume will block on the row lock until this transaction commits. 2. Add exponential backoff retry (5 attempts, 50ms base delay) in enqueueOrStartResume for the specific "not found" error, handling the window where the row doesn't exist yet and SELECT FOR UPDATE can't acquire a lock on a non-existent row. Only the race-condition-specific error is retried; other errors (already resumed, snapshot not ready, etc.) fail immediately. Closes simstudioai#3081 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cursor · 2026-03-09T08:07:49Z

PR Summary

Medium Risk
Touches pause/resume orchestration and DB transaction boundaries; mistakes could cause duplicate/queued resumes to be mishandled or add latency via retries, but changes are scoped and covered by new unit tests.

Overview
Fixes the pause→resume race condition by wrapping persistPauseResult in a single DB transaction that both upserts the pausedExecutions row and (if present) claims the oldest pending resumeQueue entry while marking the pause point as resuming, then drains queued resumes after commit.

Adds exponential-backoff retry to enqueueOrStartResume (5 attempts starting at 50ms) only for the specific "Paused execution not found or already resumed" transient case, refactoring the original logic into tryEnqueueOrStartResume, and introduces a new Vitest suite covering success-after-retry, exhaustion, no-retry for other errors, and immediate success.

^{Written by Cursor Bugbot for commit 1e4b3e0. This will update automatically on new commits. Configure here.}

vercel · 2026-03-09T08:07:51Z

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
docs	Skipped		Mar 9, 2026 8:07am

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

cursor · 2026-03-09T08:11:29Z

apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts

+            pausePoints: sql`jsonb_set(pause_points, ARRAY[${pendingEntry.contextId}, 'resumeStatus'], '"resuming"'::jsonb)`,
+          })
+          .where(eq(pausedExecutions.executionId, executionId))
+      }


Claimed resume entry never started, permanently orphaned

High Severity

Inside the persistPauseResult transaction, a pending resume queue entry is claimed (status set to 'claimed') and the pause point is marked as 'resuming', but no code ever starts the actual resume execution for that entry. After the transaction, processQueuedResumes only queries for entries with status 'pending', so the newly claimed entry is invisible to it. This orphans the resume request permanently in 'claimed' state — the user's resume action silently disappears.

Additional Locations (1)

apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts#L213-L214

greptile-apps · 2026-03-09T08:11:36Z

Greptile Summary

This PR fixes a pause-resume race condition (#3081) in PauseResumeManager where a resume request arriving before persistPauseResult commits its DB insert would fail with "Paused execution not found". Two complementary mechanisms are introduced:

Transaction wrapping in persistPauseResult: The pause record insert and any pending resume queue drain are now performed atomically. Once committed, a concurrent SELECT FOR UPDATE in enqueueOrStartResume will block on the row lock rather than see a missing row.
Exponential-backoff retry in enqueueOrStartResume: A new tryEnqueueOrStartResume helper carries the original logic; the public method wraps it in up to 5 attempts (50 ms base, doubling each time) and retries only the specific "not found" error, letting all other errors propagate immediately.

Key observations:

The retry guard uses exact string comparison (error.message === 'Paused execution not found or already resumed') to identify the race-condition error. This works today but is fragile; a typed error class would make the relationship between throw and catch explicit and refactor-safe.
The JSDoc on RESUME_RETRY_BASE_DELAY_MS lists five delay values (50 → 800 ms) but only four delays occur in practice — the final attempt re-throws immediately without waiting.
The test helper emptyTerminal is declared but never referenced; it is dead code that will likely trigger a lint warning.
Four unit tests adequately cover the retry success path, exhaustion path, non-retriable errors, and the happy (no-retry) path.

Confidence Score: 4/5

Safe to merge; the two-pronged fix is logically sound and well-tested, with only minor style concerns.
The transaction wrapping and retry logic correctly address the described race condition. Error handling is selective (only the race-condition error is retried). The four unit tests validate the critical paths. The only concerns are code quality issues (fragile string comparison, a stale comment, and unused test helper) that do not affect runtime correctness.
No files require special attention beyond the minor style issues noted in human-in-the-loop-manager.ts and human-in-the-loop-manager.test.ts.

Important Files Changed

Filename	Overview
apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts	Adds transaction wrapping to `persistPauseResult` (atomically inserts pause record + drains pending resume queue) and an exponential-backoff retry loop in `enqueueOrStartResume`. Logic is sound; main concern is the fragile string-equality check used to identify the race-condition error type.
apps/sim/lib/workflows/executor/human-in-the-loop-manager.test.ts	Four unit tests covering retry success, retry exhaustion, non-retriable errors, and immediate success. Minor issue: `emptyTerminal` helper defined but never used.

Sequence Diagram

sequenceDiagram
    participant Exec as Workflow Executor
    participant PRM as PauseResumeManager
    participant DB as Database (PostgreSQL)
    participant API as Resume API

    Exec->>PRM: persistPauseResult(...)
    PRM->>DB: BEGIN TRANSACTION
    PRM->>DB: INSERT pausedExecutions (row locked until commit)
    PRM->>DB: SELECT pending resumeQueue entries
    alt pending entry found (queued before commit)
        PRM->>DB: UPDATE resumeQueue → 'claimed'
        PRM->>DB: UPDATE pausedExecutions.pausePoints → 'resuming'
    end
    PRM->>DB: COMMIT (row now visible)
    PRM->>PRM: processQueuedResumes(executionId)

    note over API,DB: Race: resume arrives before COMMIT

    API->>PRM: enqueueOrStartResume(...)
    loop Retry (up to 5 attempts, 50ms×2^n backoff)
        PRM->>DB: BEGIN TRANSACTION
        PRM->>DB: SELECT FOR UPDATE pausedExecutions WHERE executionId=?
        alt row not yet visible (pre-commit race window)
            DB-->>PRM: empty result
            PRM-->>PRM: throw "Paused execution not found"
            PRM->>PRM: wait delay, retry
        else row visible and locked
            DB-->>PRM: row (blocks if persistPauseResult tx still open)
            PRM->>DB: INSERT resumeQueue → 'claimed'
            PRM->>DB: UPDATE pausedExecutions.pausePoints → 'resuming'
            PRM->>DB: COMMIT
            PRM-->>API: { status: 'starting', resumeExecutionId }
        end
    end

_{Last reviewed commit: 1e4b3e0}

greptile-apps · 2026-03-09T08:11:40Z

apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts

+        const isPausedNotFound =
+          error instanceof Error &&
+          error.message === 'Paused execution not found or already resumed'


Fragile string-based error type detection

The retry guard uses exact string comparison against the error message to distinguish race-condition errors from other errors. This couples the retry logic tightly to the exact wording of the error thrown in tryEnqueueOrStartResume. If either string is ever updated independently (e.g., for clarity), the retry will silently stop working without any compile-time or test-time signal.

A more robust approach would be to use a typed error class or an error code:

// Define once, used in both throw and catch sites export class PausedExecutionNotFoundError extends Error { readonly code = 'PAUSED_EXECUTION_NOT_FOUND' constructor() { super('Paused execution not found or already resumed') } }

Then the guard becomes:

const isPausedNotFound = error instanceof PausedExecutionNotFoundError

And the throw site in tryEnqueueOrStartResume:

throw new PausedExecutionNotFoundError()

This makes the relationship between the throw site and the catch site explicit and refactor-safe.

greptile-apps · 2026-03-09T08:11:41Z

apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts

+  /**
+   * Base delay in milliseconds between retry attempts. Each retry doubles
+   * the delay (exponential backoff): 50ms, 100ms, 200ms, 400ms, 800ms.
+   */
+  private static readonly RESUME_RETRY_BASE_DELAY_MS = 50


JSDoc comment lists a delay that never occurs

The comment says the delays are 50ms, 100ms, 200ms, 400ms, 800ms (five values), but only four delays actually happen. On the final attempt (attempt = 4), the condition attempt >= RESUME_RETRY_MAX_ATTEMPTS - 1 is immediately true, so the error is re-thrown without waiting — the 800ms delay is never applied.

Suggested change

/**

* Base delay in milliseconds between retry attempts. Each retry doubles

* the delay (exponential backoff): 50ms, 100ms, 200ms, 400ms, 800ms.

*/

private static readonly RESUME_RETRY_BASE_DELAY_MS = 50

/**

* Base delay in milliseconds between retry attempts. Each retry doubles

* the delay (exponential backoff): 50ms, 100ms, 200ms, 400ms.

* The final (5th) attempt throws immediately on failure with no additional delay.

*/

private static readonly RESUME_RETRY_BASE_DELAY_MS = 50

greptile-apps · 2026-03-09T08:11:42Z

apps/sim/lib/workflows/executor/human-in-the-loop-manager.test.ts

+    const emptyTerminal = () => ({
+      limit: vi.fn().mockReturnValue({
+        then: vi
+          .fn()
+          .mockImplementation((resolve: (rows: unknown[]) => unknown) => resolve([])),
+      }),
+      then: vi
+        .fn()
+        .mockImplementation((resolve: (rows: unknown[]) => unknown) => resolve([])),
+    })


Unused helper function emptyTerminal

emptyTerminal is declared but never called anywhere in the mock or tests. It appears to be a leftover from an earlier draft of the mock helper. This will likely trigger a linter warning (@typescript-eslint/no-unused-vars) and adds noise.

Suggested change

const emptyTerminal = () => ({

limit: vi.fn().mockReturnValue({

then: vi

.fn()

.mockImplementation((resolve: (rows: unknown[]) => unknown) => resolve([])),

}),

then: vi

.fn()

.mockImplementation((resolve: (rows: unknown[]) => unknown) => resolve([])),

})

// selectCallCount tracks which select() invocation we're in to differentiate

// pausedExecution lookups (with .for('update')) from activeResume lookups

(Remove the emptyTerminal declaration entirely and replace with the comment above if clarification is needed.)

vercel bot temporarily deployed to Preview March 9, 2026 08:07 Inactive

cursor bot reviewed Mar 9, 2026

View reviewed changes

greptile-apps bot reviewed Mar 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: resolve pause-resume race condition (#3081)#3489

fix: resolve pause-resume race condition (#3081)#3489
MaxwellCalkin wants to merge 1 commit intosimstudioai:mainfrom
MaxwellCalkin:fix/pause-resume-race-condition-3081

MaxwellCalkin commented Mar 9, 2026

Uh oh!

cursor bot commented Mar 9, 2026 •

edited

Loading

Uh oh!

vercel bot commented Mar 9, 2026

Uh oh!

cursor bot left a comment

Uh oh!

cursor bot Mar 9, 2026

Uh oh!

greptile-apps bot commented Mar 9, 2026

Uh oh!

greptile-apps bot Mar 9, 2026

Uh oh!

greptile-apps bot Mar 9, 2026

Uh oh!

greptile-apps bot Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MaxwellCalkin commented Mar 9, 2026

Summary

Files changed

Test plan

Uh oh!

cursor bot commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

vercel bot commented Mar 9, 2026

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot Mar 9, 2026

Choose a reason for hiding this comment

Claimed resume entry never started, permanently orphaned

Uh oh!

greptile-apps bot commented Mar 9, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cursor bot commented Mar 9, 2026 •

edited

Loading