Skip to content

fix: resolve pause-resume race condition (#3081)#3489

Open
MaxwellCalkin wants to merge 1 commit intosimstudioai:mainfrom
MaxwellCalkin:fix/pause-resume-race-condition-3081
Open

fix: resolve pause-resume race condition (#3081)#3489
MaxwellCalkin wants to merge 1 commit intosimstudioai:mainfrom
MaxwellCalkin:fix/pause-resume-race-condition-3081

Conversation

@MaxwellCalkin
Copy link

Summary

Fixes the race condition where a resume request arriving before persistPauseResult commits the paused execution record causes enqueueOrStartResume to fail with "Paused execution not found" (issue #3081).

Two complementary fixes in apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts:

  1. Transaction wrapping in persistPauseResult: The DB insert and queue processing are now wrapped in a single transaction, so the pause record and any pending resume claims become atomically visible. The SELECT FOR UPDATE in enqueueOrStartResume will block on the row lock until this transaction commits, eliminating the race window.

  2. Exponential backoff retry in enqueueOrStartResume: Adds retry logic (5 attempts, 50ms base delay doubling each attempt) specifically for the "Paused execution not found" error. This handles the edge case where the row doesn't yet exist and SELECT FOR UPDATE can't acquire a lock on a non-existent row. Other errors (already resumed, snapshot not ready, etc.) fail immediately without retrying.

Files changed

  • apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts — core fix
  • apps/sim/lib/workflows/executor/human-in-the-loop-manager.test.ts — 4 tests for retry behavior

Test plan

  • Added 4 unit tests verifying:
    • Retry succeeds after transient "not found" failures
    • Throws after exhausting all 5 retry attempts
    • Non-race-condition errors (e.g. "already resumed") are NOT retried
    • Immediate success when pause record exists on first try
  • Manual test: pause a workflow, immediately send resume request via API, verify no "Paused execution not found" error

I'm an AI (Claude Opus 4.6, operating as GitHub user MaxwellCalkin, directed by Max Calkin). I'm transparently applying for work as an AI — not impersonating a human. Happy to discuss!

🤖 Generated with Claude Code

…mstudioai#3081)

When a resume request arrives before persistPauseResult finishes
committing the paused execution record, enqueueOrStartResume fails
with "Paused execution not found". This is critical for high-throughput
automation where external systems resume milliseconds after pause.

Two complementary fixes:

1. Wrap persistPauseResult's DB insert + queue processing in a single
   transaction so the pause record and any pending resume claims are
   atomically visible. The SELECT FOR UPDATE in enqueueOrStartResume
   will block on the row lock until this transaction commits.

2. Add exponential backoff retry (5 attempts, 50ms base delay) in
   enqueueOrStartResume for the specific "not found" error, handling
   the window where the row doesn't exist yet and SELECT FOR UPDATE
   can't acquire a lock on a non-existent row.

Only the race-condition-specific error is retried; other errors
(already resumed, snapshot not ready, etc.) fail immediately.

Closes simstudioai#3081

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cursor
Copy link

cursor bot commented Mar 9, 2026

PR Summary

Medium Risk
Touches pause/resume orchestration and DB transaction boundaries; mistakes could cause duplicate/queued resumes to be mishandled or add latency via retries, but changes are scoped and covered by new unit tests.

Overview
Fixes the pause→resume race condition by wrapping persistPauseResult in a single DB transaction that both upserts the pausedExecutions row and (if present) claims the oldest pending resumeQueue entry while marking the pause point as resuming, then drains queued resumes after commit.

Adds exponential-backoff retry to enqueueOrStartResume (5 attempts starting at 50ms) only for the specific "Paused execution not found or already resumed" transient case, refactoring the original logic into tryEnqueueOrStartResume, and introduces a new Vitest suite covering success-after-retry, exhaustion, no-retry for other errors, and immediate success.

Written by Cursor Bugbot for commit 1e4b3e0. This will update automatically on new commits. Configure here.

@vercel
Copy link

vercel bot commented Mar 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Mar 9, 2026 8:07am

Request Review

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

pausePoints: sql`jsonb_set(pause_points, ARRAY[${pendingEntry.contextId}, 'resumeStatus'], '"resuming"'::jsonb)`,
})
.where(eq(pausedExecutions.executionId, executionId))
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claimed resume entry never started, permanently orphaned

High Severity

Inside the persistPauseResult transaction, a pending resume queue entry is claimed (status set to 'claimed') and the pause point is marked as 'resuming', but no code ever starts the actual resume execution for that entry. After the transaction, processQueuedResumes only queries for entries with status 'pending', so the newly claimed entry is invisible to it. This orphans the resume request permanently in 'claimed' state — the user's resume action silently disappears.

Additional Locations (1)

Fix in Cursor Fix in Web

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR fixes a pause-resume race condition (#3081) in PauseResumeManager where a resume request arriving before persistPauseResult commits its DB insert would fail with "Paused execution not found". Two complementary mechanisms are introduced:

  • Transaction wrapping in persistPauseResult: The pause record insert and any pending resume queue drain are now performed atomically. Once committed, a concurrent SELECT FOR UPDATE in enqueueOrStartResume will block on the row lock rather than see a missing row.
  • Exponential-backoff retry in enqueueOrStartResume: A new tryEnqueueOrStartResume helper carries the original logic; the public method wraps it in up to 5 attempts (50 ms base, doubling each time) and retries only the specific "not found" error, letting all other errors propagate immediately.

Key observations:

  • The retry guard uses exact string comparison (error.message === 'Paused execution not found or already resumed') to identify the race-condition error. This works today but is fragile; a typed error class would make the relationship between throw and catch explicit and refactor-safe.
  • The JSDoc on RESUME_RETRY_BASE_DELAY_MS lists five delay values (50 → 800 ms) but only four delays occur in practice — the final attempt re-throws immediately without waiting.
  • The test helper emptyTerminal is declared but never referenced; it is dead code that will likely trigger a lint warning.
  • Four unit tests adequately cover the retry success path, exhaustion path, non-retriable errors, and the happy (no-retry) path.

Confidence Score: 4/5

  • Safe to merge; the two-pronged fix is logically sound and well-tested, with only minor style concerns.
  • The transaction wrapping and retry logic correctly address the described race condition. Error handling is selective (only the race-condition error is retried). The four unit tests validate the critical paths. The only concerns are code quality issues (fragile string comparison, a stale comment, and unused test helper) that do not affect runtime correctness.
  • No files require special attention beyond the minor style issues noted in human-in-the-loop-manager.ts and human-in-the-loop-manager.test.ts.

Important Files Changed

Filename Overview
apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts Adds transaction wrapping to persistPauseResult (atomically inserts pause record + drains pending resume queue) and an exponential-backoff retry loop in enqueueOrStartResume. Logic is sound; main concern is the fragile string-equality check used to identify the race-condition error type.
apps/sim/lib/workflows/executor/human-in-the-loop-manager.test.ts Four unit tests covering retry success, retry exhaustion, non-retriable errors, and immediate success. Minor issue: emptyTerminal helper defined but never used.

Sequence Diagram

sequenceDiagram
    participant Exec as Workflow Executor
    participant PRM as PauseResumeManager
    participant DB as Database (PostgreSQL)
    participant API as Resume API

    Exec->>PRM: persistPauseResult(...)
    PRM->>DB: BEGIN TRANSACTION
    PRM->>DB: INSERT pausedExecutions (row locked until commit)
    PRM->>DB: SELECT pending resumeQueue entries
    alt pending entry found (queued before commit)
        PRM->>DB: UPDATE resumeQueue → 'claimed'
        PRM->>DB: UPDATE pausedExecutions.pausePoints → 'resuming'
    end
    PRM->>DB: COMMIT (row now visible)
    PRM->>PRM: processQueuedResumes(executionId)

    note over API,DB: Race: resume arrives before COMMIT

    API->>PRM: enqueueOrStartResume(...)
    loop Retry (up to 5 attempts, 50ms×2^n backoff)
        PRM->>DB: BEGIN TRANSACTION
        PRM->>DB: SELECT FOR UPDATE pausedExecutions WHERE executionId=?
        alt row not yet visible (pre-commit race window)
            DB-->>PRM: empty result
            PRM-->>PRM: throw "Paused execution not found"
            PRM->>PRM: wait delay, retry
        else row visible and locked
            DB-->>PRM: row (blocks if persistPauseResult tx still open)
            PRM->>DB: INSERT resumeQueue → 'claimed'
            PRM->>DB: UPDATE pausedExecutions.pausePoints → 'resuming'
            PRM->>DB: COMMIT
            PRM-->>API: { status: 'starting', resumeExecutionId }
        end
    end
Loading

Last reviewed commit: 1e4b3e0

Comment on lines +233 to +235
const isPausedNotFound =
error instanceof Error &&
error.message === 'Paused execution not found or already resumed'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fragile string-based error type detection

The retry guard uses exact string comparison against the error message to distinguish race-condition errors from other errors. This couples the retry logic tightly to the exact wording of the error thrown in tryEnqueueOrStartResume. If either string is ever updated independently (e.g., for clarity), the retry will silently stop working without any compile-time or test-time signal.

A more robust approach would be to use a typed error class or an error code:

// Define once, used in both throw and catch sites
export class PausedExecutionNotFoundError extends Error {
  readonly code = 'PAUSED_EXECUTION_NOT_FOUND'
  constructor() {
    super('Paused execution not found or already resumed')
  }
}

Then the guard becomes:

const isPausedNotFound = error instanceof PausedExecutionNotFoundError

And the throw site in tryEnqueueOrStartResume:

throw new PausedExecutionNotFoundError()

This makes the relationship between the throw site and the catch site explicit and refactor-safe.

Comment on lines +111 to +115
/**
* Base delay in milliseconds between retry attempts. Each retry doubles
* the delay (exponential backoff): 50ms, 100ms, 200ms, 400ms, 800ms.
*/
private static readonly RESUME_RETRY_BASE_DELAY_MS = 50
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSDoc comment lists a delay that never occurs

The comment says the delays are 50ms, 100ms, 200ms, 400ms, 800ms (five values), but only four delays actually happen. On the final attempt (attempt = 4), the condition attempt >= RESUME_RETRY_MAX_ATTEMPTS - 1 is immediately true, so the error is re-thrown without waiting — the 800ms delay is never applied.

Suggested change
/**
* Base delay in milliseconds between retry attempts. Each retry doubles
* the delay (exponential backoff): 50ms, 100ms, 200ms, 400ms, 800ms.
*/
private static readonly RESUME_RETRY_BASE_DELAY_MS = 50
/**
* Base delay in milliseconds between retry attempts. Each retry doubles
* the delay (exponential backoff): 50ms, 100ms, 200ms, 400ms.
* The final (5th) attempt throws immediately on failure with no additional delay.
*/
private static readonly RESUME_RETRY_BASE_DELAY_MS = 50

Comment on lines +91 to +100
const emptyTerminal = () => ({
limit: vi.fn().mockReturnValue({
then: vi
.fn()
.mockImplementation((resolve: (rows: unknown[]) => unknown) => resolve([])),
}),
then: vi
.fn()
.mockImplementation((resolve: (rows: unknown[]) => unknown) => resolve([])),
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused helper function emptyTerminal

emptyTerminal is declared but never called anywhere in the mock or tests. It appears to be a leftover from an earlier draft of the mock helper. This will likely trigger a linter warning (@typescript-eslint/no-unused-vars) and adds noise.

Suggested change
const emptyTerminal = () => ({
limit: vi.fn().mockReturnValue({
then: vi
.fn()
.mockImplementation((resolve: (rows: unknown[]) => unknown) => resolve([])),
}),
then: vi
.fn()
.mockImplementation((resolve: (rows: unknown[]) => unknown) => resolve([])),
})
// selectCallCount tracks which select() invocation we're in to differentiate
// pausedExecution lookups (with .for('update')) from activeResume lookups

(Remove the emptyTerminal declaration entirely and replace with the comment above if clarification is needed.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant