Skip to content

Harden the dual-era stream loop's era-lock and rejection semantics#3040

Merged
maxisbey merged 4 commits into
mainfrom
dual-era-loop-hardening
Jul 1, 2026
Merged

Harden the dual-era stream loop's era-lock and rejection semantics#3040
maxisbey merged 4 commits into
mainfrom
dual-era-loop-hardening

Conversation

@maxisbey

@maxisbey maxisbey commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Post-merge review follow-up to #3038. An adversarial review pass over the dual-era driver found a family of cases where a request that failed could still lock the connection's era — and on stdio a connection is a whole subprocess, so a locked-then-stranded connection has no recovery short of a restart.

The era-lock fixes

The invariant the driver documents is: released auto-negotiating clients fall back on any error code except -32022, so a failed probe must leave the legacy handshake available. Three openers violated it:

  1. Well-formed envelope, malformed content (e.g. clientInfo: 42): classification checks key presence only, so the era locked modern before pydantic rejected the content. The request failed with a fallback-eligible -32602, but the fallback then hit -32022 forever. The lock now commits only after the request succeeds, mirroring the legacy side's existing "lock only on success" rule.
  2. initialize stamped with the envelope triple: routed modern (the triple won), failed METHOD_NOT_FOUND (initialize doesn't exist at 2026-07-28), and locked the era anyway. initialize is legacy-distinctive by definition, so it now always takes the handshake path regardless of decoration.
  3. subscriptions/listen as the opener: the not-served-on-this-transport rejection used to land after the lock; now, like every failed request, it locks nothing.

No conforming client — released SDK or otherwise — produces any of these frames; the fixes are robustness against buggy third-party modern clients.

Compatibility restoration

A bare server/discover on a legacy-locked connection now falls through to the loop runner's per-version validation, restoring the byte-identical METHOD_NOT_FOUND (-32601, same message and data) a handshake-only server produced — #3038 had it rejecting with -32600. Only envelope-bearing frames get the cross-era INVALID_REQUEST; a conforming legacy client can never send the reserved triple, so legacy traffic is byte-identical again in full.

Note for anyone bisecting client behavior: v2.0.0b1 ships the -32600 variant for this one frame shape (it carries #3038 but not this fix), so a legacy-locked bare server/discover answers -32600 on b1 and -32601 before #3038 and from this PR onward.

Smaller fixes

  • classify_inbound_request rejects a present-but-non-string protocol version with a descriptive INVALID_PARAMS instead of raising ValidationError out of its own -32022 payload construction (requested is a str field). The dispatcher's exception ladder happened to mask that as a generic -32602 — same code, now deliberate, with a message that names the defect. Unreachable over HTTP (the header rung fires first), so the wire delta is stdio-only.
  • The modern dispatch wrapper masks transport.can_send_request so per-message transport metadata agrees with its denial, matching the modern HTTP entry's context — closing a latent inconsistency before anything consumes the field.
  • NotifyOnlyOutbound inherits the no-channel refusal instead of duplicating it; the has_standalone_channel/Raises contracts now document that a modern connection refuses requests on a real channel.
  • Docs: migration guide + protocol-versions page note the probe is transport-independent; the subscriptions page documents the streamable-HTTP-only limitation of subscriptions/listen; a stale serve_loop docstring claim is corrected. Two pyright suppressions in the stdio test doubles are replaced with typeshed-compatible Buffer signatures.

Validation

Every fix has a pinning test (rejection codes, lock neutrality of each failing opener, the byte-identical legacy fall-through, the classifier guard, the transport mask). Beyond the suite (100% coverage), each formerly-stranding opener was driven against a live stdio subprocess and shown to leave the legacy handshake working on the same pipe, with the full original probe matrix (auto/legacy/pinned negotiation, pipelined probe ordering, -32022 round-trip) passing unchanged.

Round 2 — review follow-ups

Review found that the round-1 lock-on-success rule still approximated "client-visible success" with "the handler returned". Two windows violated the invariant, plus two cross-transport divergences on the modern path:

  1. Era overwrite (confirmed live): the era commits ran unconditionally after their own request, so a pipelined envelope-bearing request completing after an inline initialize overwrote the committed legacy lock — bricking the freshly-handshaked client — and the mirror direction (a legacy commit clobbering a modern lock) was possible too. The era cell is now monotone: one era_settles settlement predicate with site-local atomic commits, first success wins, and a straggler from the other era can never move a committed lock. Its response still stands — it was already earned.
  2. Cancelled-away success: with the dispatcher's interrupt cancel mode, a peer cancel landing during the handler's shielded teardown is delivered at the response-write checkpoint — the client sees "Request cancelled", but the era had already locked. The settlement predicate now skips the lock when the request's cancel event is set, tying "success" to what the client actually observes. (A cancel racing a genuinely blocked response write can still slip through; that residue belongs to the dispatcher's existing respond-after-cancel divergence and dissolves with it.)
  3. Exception-boundary parity: an unmapped handler exception on a modern stream request fell through to the dispatcher's code-0 catch-all — str(e) on the wire — while the identical request over modern HTTP produced the generic -32603 Internal server error. The code-0 pin exists for v1 compat, which the modern era doesn't have. Both modern entries now share one modern_error_data boundary; debug raise_exceptions still re-raises.
  4. Envelope-coercion parity: the HTTP entry degraded mis-shaped clientInfo/clientCapabilities to not-supplied while the stream path rejected them at Connection.from_envelope construction. For spec methods both transports already rejected identically (the kernel's per-version params surface types the reserved _meta keys strictly); custom methods diverged. from_envelope now takes the classifier's raw values and owns the coercion, so every modern entry — HTTP, stream loop, and modern_on_request — builds connection identity identically.
  5. Classifier guard placement: the non-string protocol-version guard moved below the header rung, and (review follow-up) the rung now checks the version header's presence explicitly — a null body version would otherwise slip the None == None equality and make the absent-header outcome depend on the body value. An absent or disagreeing version header is uniformly HEADER_MISMATCH, the HTTP wire is untouched wherever it was well-defined, and the string guard is reachable only on header-less transports, making the round-1 "stdio-only delta" claim above exactly true. (Through the shipped HTTP manager the era routing is header-only, so these cells harden the modern entry's own contract.) The public Server.run docstring also picked up the first-success wording.

Each behavioral fix carries a regression test that failed on the round-1 code (the overwrite and cancelled-away tests reproduce the races deterministically), and the scenarios were re-driven against a live stdio subprocess: the pipelined-race brick, the sanitized internal error, the failed-opener fallback, and the cancel-never-locks path all behave as documented, with auto/legacy negotiation unchanged.

Round 3 — the timeout brick and the third modern entry

Further review surfaced the last member of the brick family, plus a pre-existing gap in the parity story:

  1. Timed-out probe stranding mode='auto' (reproduced live): a server/discover probe that outlives the client's 10s timeout can still succeed on a slow-starting server — the response was written, the server cannot know the client stopped waiting, and discover is inline so even a courtesy cancel could not land before the commit. The connection locks modern, the pipelined fallback initialize gets -32022, and negotiate_auto treated that as terminal: the default mode='auto' could not connect to any stdio server that takes >10s to its first response. No server-side change can close this; the client-side fix is that -32022 from the fallback handshake is itself positive modern evidence (it names the server's versions), so the policy now re-probes once at a mutual version — bounded by the existing two-attempt loop, so a pathologically slow server still terminates with a genuine error. Covered by policy unit tests and a deterministic end-to-end test (a relay holds the probe frame until the post-timeout initialize is sent — no sleeps).
  2. The in-process modern entry now denies server-initiated requests like the other two: previously a handler's request-scoped elicitation over in-process Client(server) in modern mode delivered the protocol-forbidden elicitation/create frame to the client, which answered a misleading Method not found. The entry's dispatch context is now wrapped in the same server-requests denial, so the call refuses server-side with the no-back-channel contract — which is also the exact wording the client-callbacks docs page already used to describe this path.

AI Disclaimer

Post-merge review follow-up to #3038, fixing the cases where a failed
request could still lock the connection's era:

- The modern era now locks only when a request SUCCEEDS, mirroring the
  legacy side's lock-on-success rule. Previously the lock committed at
  classification time, so a first request with a well-formed envelope
  but malformed content - or subscriptions/listen, or an unknown
  method - locked the connection modern after failing; the client's
  initialize fallback then got -32022, the one code auto-negotiating
  clients do not fall back from, stranding the connection for its
  lifetime (a whole subprocess on stdio).
- initialize is legacy-distinctive by definition (the method does not
  exist at modern versions), so it always takes the handshake path now,
  even when a confused client stamps the envelope triple on it.
  Previously it routed modern, failed METHOD_NOT_FOUND, and locked the
  era modern - bricking the handshake the same way.
- A bare server/discover on a legacy-locked connection falls through to
  the loop runner's per-version surface validation again, restoring the
  byte-identical METHOD_NOT_FOUND a handshake-only server produced;
  only envelope-bearing frames get the INVALID_REQUEST rejection (a
  conforming legacy client can never send the reserved triple).
- classify_inbound_request rejects a present-but-non-string protocol
  version as a descriptive INVALID_PARAMS instead of raising
  ValidationError out of its own -32022 payload construction
  (`requested` is a str field); the dispatcher's exception ladder
  happened to mask that as a generic -32602.
- _NoServerRequestsDispatchContext masks transport.can_send_request so
  the per-message transport metadata agrees with the wrapper's denial,
  matching the modern HTTP entry's context.
- NotifyOnlyOutbound inherits _NoChannelOutbound's refusal instead of
  duplicating it, and the has_standalone_channel / Raises contracts now
  document that a modern connection refuses requests on a real channel.
- Docs: the migration guide and protocol-versions page note the
  discover probe is transport-independent; the subscriptions page
  documents the streamable-HTTP-only limitation of subscriptions/listen;
  serve_loop's stale shared-recipe claim is corrected.

Also replaces two pyright suppressions in the stdio test doubles with
typeshed-compatible Buffer signatures.
@github-actions

github-actions Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

📚 Documentation preview

Preview https://pr-3040.mcp-python-docs.pages.dev
Deployment https://1274a706.mcp-python-docs.pages.dev
Commit 290e068
Triggered by @maxisbey
Updated 2026-07-01 14:55:35 UTC

@maxisbey maxisbey marked this pull request as ready for review July 1, 2026 09:08

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 9 files

Reply with feedback, questions, or to request a fix.

Fix all with cubic | Re-trigger cubic

Comment thread src/mcp/server/runner.py Outdated
Comment thread src/mcp/shared/inbound.py Outdated
Comment thread src/mcp/server/runner.py Outdated
The dual-era loop's lock-on-success rule approximated "client-visible
success" with "the handler returned", leaving windows where a request the
client saw fail still moved the era: a pipelined envelope-bearing request
completing after an inline initialize overwrote the committed legacy lock
(stranding the freshly-handshaked client), and a peer cancel landing
during the handler's shielded teardown locked the era while the client
received "Request cancelled". Era commits now run through a single
era_settles predicate: monotone (the first success wins, a straggler from
the other era can never move a committed lock) and skipped when a pending
cancel is about to replace the response.

Two modern-path divergences between the stream loop and the HTTP entry
close by sharing one pipeline: Connection.from_envelope takes the
classifier's raw envelope values and owns the tolerant coercion (custom
methods now degrade mis-shaped clientInfo identically everywhere), and a
shared modern_error_data boundary maps unmapped handler exceptions to a
generic INTERNAL_ERROR instead of the legacy code-0 catch-all putting
str(e) on the wire.

The classifier's non-string protocol-version guard moves below the header
rung, so the HTTP wire is untouched whenever the version header is
present, while the guard still backstops the header-absent null-version
corner that previously escaped as a ValidationError.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 6 files (changes from recent commits).

Reply with feedback, questions, or to request a fix.

Fix all with cubic | Re-trigger cubic

Comment thread src/mcp/shared/inbound.py
Comment thread src/mcp/server/runner.py
A null body protocolVersion slipped the header equality check (None ==
None), so the absent-header outcome depended on the body value: an int
gave HEADER_MISMATCH while null fell through to the string guard's
INVALID_PARAMS. The rung now checks presence explicitly, making an
absent version header uniformly HEADER_MISMATCH and the string guard
reachable only on header-less transports. Also updates the public
Server.run docstring to the first-success era-lock wording.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No new issues found on this pass — the round-2 commits (era_settles monotone lock, explicit version-header presence check, shared modern_error_data boundary, Server.run docstring) resolve the concerns I raised earlier, each with a pinning regression test. That said, this changes the era-lock and rejection semantics of the dual-era negotiation path, so it warrants a human maintainer's review rather than auto-approval.

Extended reasoning...

Overview

This PR hardens the dual-era stream loop introduced in #3038: the era lock now commits only on a client-visible success (monotone, first-success-wins, cancel-aware via era_settles), initialize always takes the handshake path even when envelope-stamped, a bare server/discover on a legacy-locked connection restores the byte-identical METHOD_NOT_FOUND, unmapped modern handler exceptions are sanitized identically across transports via a shared modern_error_data boundary, envelope coercion is centralized in Connection.from_envelope, and the classifier gains an explicit non-string protocol-version guard placed after a now presence-checking header rung. Touched files span runner.py, inbound.py, connection.py, the modern HTTP entry, docs, and substantial new tests.

Security risks

No auth, crypto, or permission code is touched. The changes affect protocol negotiation and error-mapping semantics; the main risk class is compatibility/availability (a stranded stdio subprocess or a wrong wire error code for buggy clients) rather than data exposure. The exception-boundary change actually improves information hygiene by ensuring handler internals never reach the wire on the modern stream path.

Level of scrutiny

High. This is core connection-negotiation logic with subtle concurrency (pipelined requests, cancellation races) and wire-compatibility guarantees that released auto-negotiating clients depend on. Even with no findings on this pass, design-level decisions here (lock-on-success semantics, error-code choices, the classifier rung ordering) deserve maintainer sign-off.

Other factors

The prior review findings — the era-overwrite race, the missing version-header presence check, and the stale Server.run docstring — are all addressed in the follow-up commits, each with a regression test that reproduces the original scenario. The suite maintains 100% coverage and the PR description reports live stdio-subprocess validation of the fallback and race scenarios. Existing reviewer threads (cubic and my earlier comments) are marked resolved.

…ack-channel

A server/discover probe that outlives the client's 10s timeout can still
succeed on a slow-starting server, locking the connection modern before
the pipelined fallback initialize arrives; that initialize then fails
with -32022, which negotiate_auto treated as terminal - the default
mode='auto' could not connect at all. The -32022 is itself positive
modern evidence (it names the server's versions), so the policy now
re-probes once at a mutual version, bounded by the existing two-attempt
loop.

modern_on_request now wraps its dispatch context in the server-requests
denial like the other modern entries: a handler's request-scoped
elicitation no longer delivers the protocol-forbidden frame to the
in-process client (which answered a misleading "Method not found") and
instead refuses server-side with the no-back-channel contract.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 5 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/mcp/client/_probe.py">

<violation number="1" location="src/mcp/client/_probe.py:85">
P2: The handshake recovery rejects supported-only `-32022` data, even though this codebase emits that shape for initialize-after-modern-lock errors. Parse `supported` from the handshake error without requiring `requested`, otherwise auto mode can still fail instead of re-probing.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Fix all with cubic | Re-trigger cubic

Comment thread src/mcp/client/_probe.py
# the connection modern before this initialize arrived. Re-probe
# once at a version the server names; the era is already
# settled, so the second probe answers without the slow start.
supported = _parse_supported(handshake_exc.error.data)

@cubic-dev-ai cubic-dev-ai Bot Jul 1, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: The handshake recovery rejects supported-only -32022 data, even though this codebase emits that shape for initialize-after-modern-lock errors. Parse supported from the handshake error without requiring requested, otherwise auto mode can still fail instead of re-probing.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/mcp/client/_probe.py, line 85:

<comment>The handshake recovery rejects supported-only `-32022` data, even though this codebase emits that shape for initialize-after-modern-lock errors. Parse `supported` from the handshake error without requiring `requested`, otherwise auto mode can still fail instead of re-probing.</comment>

<file context>
@@ -65,7 +72,22 @@ async def negotiate_auto(session: ClientSession) -> None:
+                # the connection modern before this initialize arrived. Re-probe
+                # once at a version the server names; the era is already
+                # settled, so the second probe answers without the slow start.
+                supported = _parse_supported(handshake_exc.error.data)
+                mutual = [v for v in MODERN_PROTOCOL_VERSIONS if v in (supported or ())]
+                if not mutual:
</file context>
Suggested change
supported = _parse_supported(handshake_exc.error.data)
data = handshake_exc.error.data
supported = _parse_supported(data)
if supported is None and isinstance(data, dict) and isinstance(data.get("supported"), list):
supported = [v for v in data["supported"] if isinstance(v, str)]
Fix with cubic

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The follow-ups to my earlier comments (the era-lock overwrite guard, the header-presence check, and the Server.run docstring) all look addressed in the latest commits, and I found no new issues in this pass — but this PR reworks core era-negotiation and client auto-negotiation semantics across three rounds of changes, so it warrants a human maintainer's review before merge.

Extended reasoning...

Overview

This PR (rounds 1–3) hardens the dual-era stream loop's era-lock semantics in src/mcp/server/runner.py (lock-only-on-client-visible-success via the era_settles predicate, monotone first-success commits, shared modern_error_data exception boundary), moves envelope coercion into Connection.from_envelope, adds a non-string protocol-version guard and explicit header-presence check to the classifier ladder in src/mcp/shared/inbound.py, changes the client-side negotiate_auto policy in src/mcp/client/_probe.py to treat a -32022 fallback-handshake error as modern evidence and re-probe once, and wraps the in-process modern entry in the server-requests denial. Docs and extensive pinning/regression tests accompany each change.

Security risks

No direct security-sensitive surfaces (auth, crypto, permissions) are touched. The changes do affect error-message hygiene positively: unmapped handler exceptions on the modern stream path are now sanitized to a generic INTERNAL_ERROR instead of leaking str(e) to the wire, matching the HTTP entry. The classifier changes alter which error code a malformed client receives but do not weaken validation.

Level of scrutiny

High. This is production-critical protocol-negotiation logic: the era-lock state machine determines whether real clients can connect at all over stdio (where a stranded connection means restarting the subprocess), and the client-side auto-negotiation policy affects every default-mode connection. The changes involve subtle concurrency reasoning (pipelined inline vs. spawned dispatch, cancellation at the response-write checkpoint) and wire-compatibility guarantees (byte-identical legacy fall-through, -32022 semantics). Even with strong test coverage, the design decisions here (what counts as "success", the re-probe policy on -32022) are ones a maintainer should sign off on.

Other factors

The prior review comments — cubic's atomicity finding and my comments on the era-overwrite race, the header-rung placement, and the stale Server.run docstring — are all addressed in the round-2/3 commits, each with a regression test that reproduces the original race deterministically. The bug hunting system found no new issues on the current revision, and the repository requires 100% coverage which the PR claims to maintain. The remaining reason to defer is scope and criticality, not any specific defect.

@maxisbey maxisbey enabled auto-merge (squash) July 1, 2026 16:05
@maxisbey maxisbey merged commit 080f2a8 into main Jul 1, 2026
41 of 42 checks passed
@maxisbey maxisbey deleted the dual-era-loop-hardening branch July 1, 2026 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants