Skip to content

egress: add integration test for connection migration#359

Merged
myleshorton merged 2 commits into
mainfrom
fisk/migration-integration-test
May 4, 2026
Merged

egress: add integration test for connection migration#359
myleshorton merged 2 commits into
mainfrom
fisk/migration-integration-test

Conversation

@myleshorton

Copy link
Copy Markdown
Contributor

Summary

Adds two integration tests in egress/migration_test.go covering the connection-migration happy path and the probe-timeout failure mode. Plus a 1-line refactor to createOrMigrate's parameter type to make the test feasible without a real WebSocket.

Background

Nelson reported intermittent connection-migration failures on Slack. The us-linode-nj egress's logs over the last 24h confirm: 13 successful migrations / 15 attempts, with 2 failures all of the form path probe error: context deadline exceeded. PR #301 (where migration was first implemented) had no tests — so we have zero regression coverage on the load-bearing quic-go AddPath / Probe / Switch contract.

Two suspect dep bumps landed mid-April:

  • 3f2b8cb: quic-go v0.51 → v0.59 (with rebased fork +50 connection-IDs patch)
  • ca7cd88: pion/webrtc v3 → v4.2.11

Either could in principle have broken the migration API contract. Both tests in this PR pass under the current quic-go v0.59, so we now know the API contract still holds — the prod failures are likely consumer-side state GC during the WebRTC re-pair window, not a quic-go regression.

What's tested

TestConnectionManager_Migration_HappyPath

Real UDP loopback for both paths. Establishes a QUIC connection over path A, then calls createOrMigrate(csid, pathB) and asserts:

  1. Migration succeeded (no error from AddPath / Probe / Switch)
  2. The returned *quic.Conn is the same pointer as path A's (i.e. migrate branch ran, not the dial branch)
  3. A fresh stream opens over the migrated connection (proves the connection state is alive)

TestConnectionManager_Migration_ProbeTimeout

Same setup but path B's WriteTo is wired to silently drop all writes — PATH_CHALLENGE never reaches the consumer. Asserts:

  1. createOrMigrate returns an error rather than hanging
  2. The error returns within ~probeTimeout (not later)
  3. The connection record stays in cm.connections after probe failure (so the next migration attempt can retry — production's deleteIfNotMigratedSince is the only thing that should remove it)

Refactor

createOrMigrate(csid, pconn *errorlessWebSocketPacketConn)createOrMigrate(csid, pconn net.PacketConn). The function only read pconn.addr for logging; now uses pconn.LocalAddr(). errorlessWebSocketPacketConn already implements net.PacketConn, so no production caller needs to change. This is the only way I found to test migration without standing up a real WebSocket handshake — production's path-A pconn can be anything that talks UDP-like.

Test plan

  • go vet ./egress/... clean (the two pre-existing wsCancel warnings on egresslib.go are unchanged)
  • go test -run TestConnectionManager_Migration -v both pass
  • go test -count=5 stable across runs
  • go test -run TestNewListener (the existing egress test) still passes

🤖 Generated with Claude Code

Triaged a Slack thread from nelson reporting intermittent connection-
migration failures and got concrete data from the prod us-linode-nj
egress: 13/15 successful migrations over 24h, with 2 failures all
matching "createOrMigrate error: path probe error: context deadline
exceeded". PR #301 (where migration was first implemented) didn't add
any tests, so we have no regression coverage on the load-bearing
quic-go AddPath/Probe/Switch contract.

Adds two tests in egress/migration_test.go:

- TestConnectionManager_Migration_HappyPath establishes a QUIC
  connection over a UDP loopback path A, then calls createOrMigrate
  again with the same CSID over a different UDP loopback path B and
  asserts the migration branch returns the SAME *quic.Conn (not a new
  one) and that a fresh stream opens cleanly over the migrated
  connection. End-to-end exercise of the AddPath → Probe → Switch
  triple via real packets.

- TestConnectionManager_Migration_ProbeTimeout pins down the failure
  contract: when the new path's WriteTo silently drops everything,
  PATH_CHALLENGE never reaches the consumer and createOrMigrate must
  surface the timeout as an error within ~probeTimeout (not hang),
  and must leave the connection record in cm.connections so the next
  attempt can retry. This is the exact failure mode seen in prod, but
  here we know the cause (we're forcing it).

Both pass in 0.01s and 1.5s respectively, and `-count=5` is stable.

To make the test feasible without standing up a full WebSocket, this
also relaxes createOrMigrate's parameter type from
*errorlessWebSocketPacketConn to net.PacketConn. The function only
read pconn.addr for logging, now uses pconn.LocalAddr() instead.
errorlessWebSocketPacketConn already implements net.PacketConn so no
production caller needs to change.

What this rules out: if a future quic-go bump regresses the AddPath /
Probe / Switch API contract, these tests fail at the failing step (the
error string names which sub-API broke). Currently both pass under
quic-go v0.59 + the rebased fork patch, so the prod failures Nelson
saw are NOT coming from the quic-go API layer — most likely candidate
is consumer-side QUIC connection-state GC during the WebRTC re-pair
window (MaxIdleTimeout=60s, KeepAlivePeriod=15s; if the re-pair takes
long enough the consumer's listener side drops the connection state
before the new producer's WS reaches the egress).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 4, 2026 21:18

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds end-to-end integration coverage for the egress QUIC connection-migration path (AddPath → Probe → Switch) by enabling tests to inject non-WebSocket net.PacketConn transports.

Changes:

  • Relax connectionManager.createOrMigrate to accept a generic net.PacketConn and update logging to use LocalAddr().
  • Add integration tests covering the migration happy path and the probe-timeout failure mode using UDP loopback transports.
  • Add minimal test-only TLS helpers and small net.PacketConn adapters to simulate fixed-destination and write-dropping behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
egress/quic.go Broadens createOrMigrate’s transport parameter to net.PacketConn and updates log address rendering.
egress/migration_test.go Adds integration tests validating QUIC connection migration success and probe-timeout behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread egress/quic.go Outdated
Comment thread egress/migration_test.go
Comment thread egress/migration_test.go Outdated
Three findings, all valid:

- gofmt the imports in egress/quic.go. The earlier addition of `net`
  put it after `sync` instead of in stdlib alphabetical order, which
  gofmt now corrects.

- Fix t.Cleanup ordering in both migration tests. closeAllRecords was
  registered too early, which combined with t.Cleanup's LIFO
  semantics meant QUIC connections were being closed AFTER their
  underlying PacketConns. quic-go's read goroutines saw vanished
  transports before they saw the connection close, producing noisy
  error logs and an occasional race against the test assertion path.
  Move the closeAllRecords registration to AFTER the PacketConn
  registrations so it runs first in the cleanup order; comment the
  intent so future edits don't accidentally re-break it.

- Replace the hard-coded 4s upper bound and exact >=probeTimeout
  lower bound on the probe-timeout test with slack windows
  (probeTimeout - 100ms ≤ elapsed ≤ probeTimeout + 2s). This both
  avoids CI-scheduler-jitter flakes and keeps the test correct when
  probeTimeout is bumped — the explicit constant references the
  same probeTimeout the test sets on cm, so changing one updates
  both bounds.

3x test runs all stable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@myleshorton myleshorton merged commit eaba02d into main May 4, 2026
myleshorton pushed a commit to getlantern/radiance that referenced this pull request May 4, 2026
Picks up getlantern/lantern-box#253 which bumped broflake to land
getlantern/unbounded#360 (clientcore: classify QUIC connection
errors). After this ships in a Lantern build, lantern.log will carry
err_class structured-log fields on every "QUIC connection ended"
record, distinguishing idle_timeout / handshake_timeout /
application_close_remote / application_close_local / transport_error
/ stateless_reset / context_canceled / etc.

The transitive broflake bump also picked up getlantern/unbounded#359
(integration test for connection migration), which is test-only.

`go mod tidy` clean; pre-existing `cmd/lantern/lantern.go:78`
ipc.NewClient build error reproduces on main and is unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
myleshorton added a commit to getlantern/radiance that referenced this pull request May 6, 2026
Picks up getlantern/lantern-box#253 which bumped broflake to land
getlantern/unbounded#360 (clientcore: classify QUIC connection
errors). After this ships in a Lantern build, lantern.log will carry
err_class structured-log fields on every "QUIC connection ended"
record, distinguishing idle_timeout / handshake_timeout /
application_close_remote / application_close_local / transport_error
/ stateless_reset / context_canceled / etc.

The transitive broflake bump also picked up getlantern/unbounded#359
(integration test for connection migration), which is test-only.

`go mod tidy` clean; pre-existing `cmd/lantern/lantern.go:78`
ipc.NewClient build error reproduces on main and is unrelated.

Co-authored-by: Adam Fisk <afisk@mini.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants