egress: add integration test for connection migration#359
Merged
Conversation
Triaged a Slack thread from nelson reporting intermittent connection- migration failures and got concrete data from the prod us-linode-nj egress: 13/15 successful migrations over 24h, with 2 failures all matching "createOrMigrate error: path probe error: context deadline exceeded". PR #301 (where migration was first implemented) didn't add any tests, so we have no regression coverage on the load-bearing quic-go AddPath/Probe/Switch contract. Adds two tests in egress/migration_test.go: - TestConnectionManager_Migration_HappyPath establishes a QUIC connection over a UDP loopback path A, then calls createOrMigrate again with the same CSID over a different UDP loopback path B and asserts the migration branch returns the SAME *quic.Conn (not a new one) and that a fresh stream opens cleanly over the migrated connection. End-to-end exercise of the AddPath → Probe → Switch triple via real packets. - TestConnectionManager_Migration_ProbeTimeout pins down the failure contract: when the new path's WriteTo silently drops everything, PATH_CHALLENGE never reaches the consumer and createOrMigrate must surface the timeout as an error within ~probeTimeout (not hang), and must leave the connection record in cm.connections so the next attempt can retry. This is the exact failure mode seen in prod, but here we know the cause (we're forcing it). Both pass in 0.01s and 1.5s respectively, and `-count=5` is stable. To make the test feasible without standing up a full WebSocket, this also relaxes createOrMigrate's parameter type from *errorlessWebSocketPacketConn to net.PacketConn. The function only read pconn.addr for logging, now uses pconn.LocalAddr() instead. errorlessWebSocketPacketConn already implements net.PacketConn so no production caller needs to change. What this rules out: if a future quic-go bump regresses the AddPath / Probe / Switch API contract, these tests fail at the failing step (the error string names which sub-API broke). Currently both pass under quic-go v0.59 + the rebased fork patch, so the prod failures Nelson saw are NOT coming from the quic-go API layer — most likely candidate is consumer-side QUIC connection-state GC during the WebRTC re-pair window (MaxIdleTimeout=60s, KeepAlivePeriod=15s; if the re-pair takes long enough the consumer's listener side drops the connection state before the new producer's WS reaches the egress). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
Contributor
There was a problem hiding this comment.
Pull request overview
Adds end-to-end integration coverage for the egress QUIC connection-migration path (AddPath → Probe → Switch) by enabling tests to inject non-WebSocket net.PacketConn transports.
Changes:
- Relax
connectionManager.createOrMigrateto accept a genericnet.PacketConnand update logging to useLocalAddr(). - Add integration tests covering the migration happy path and the probe-timeout failure mode using UDP loopback transports.
- Add minimal test-only TLS helpers and small
net.PacketConnadapters to simulate fixed-destination and write-dropping behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| egress/quic.go | Broadens createOrMigrate’s transport parameter to net.PacketConn and updates log address rendering. |
| egress/migration_test.go | Adds integration tests validating QUIC connection migration success and probe-timeout behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Three findings, all valid: - gofmt the imports in egress/quic.go. The earlier addition of `net` put it after `sync` instead of in stdlib alphabetical order, which gofmt now corrects. - Fix t.Cleanup ordering in both migration tests. closeAllRecords was registered too early, which combined with t.Cleanup's LIFO semantics meant QUIC connections were being closed AFTER their underlying PacketConns. quic-go's read goroutines saw vanished transports before they saw the connection close, producing noisy error logs and an occasional race against the test assertion path. Move the closeAllRecords registration to AFTER the PacketConn registrations so it runs first in the cleanup order; comment the intent so future edits don't accidentally re-break it. - Replace the hard-coded 4s upper bound and exact >=probeTimeout lower bound on the probe-timeout test with slack windows (probeTimeout - 100ms ≤ elapsed ≤ probeTimeout + 2s). This both avoids CI-scheduler-jitter flakes and keeps the test correct when probeTimeout is bumped — the explicit constant references the same probeTimeout the test sets on cm, so changing one updates both bounds. 3x test runs all stable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
myleshorton
pushed a commit
to getlantern/radiance
that referenced
this pull request
May 4, 2026
Picks up getlantern/lantern-box#253 which bumped broflake to land getlantern/unbounded#360 (clientcore: classify QUIC connection errors). After this ships in a Lantern build, lantern.log will carry err_class structured-log fields on every "QUIC connection ended" record, distinguishing idle_timeout / handshake_timeout / application_close_remote / application_close_local / transport_error / stateless_reset / context_canceled / etc. The transitive broflake bump also picked up getlantern/unbounded#359 (integration test for connection migration), which is test-only. `go mod tidy` clean; pre-existing `cmd/lantern/lantern.go:78` ipc.NewClient build error reproduces on main and is unrelated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
myleshorton
added a commit
to getlantern/radiance
that referenced
this pull request
May 6, 2026
Picks up getlantern/lantern-box#253 which bumped broflake to land getlantern/unbounded#360 (clientcore: classify QUIC connection errors). After this ships in a Lantern build, lantern.log will carry err_class structured-log fields on every "QUIC connection ended" record, distinguishing idle_timeout / handshake_timeout / application_close_remote / application_close_local / transport_error / stateless_reset / context_canceled / etc. The transitive broflake bump also picked up getlantern/unbounded#359 (integration test for connection migration), which is test-only. `go mod tidy` clean; pre-existing `cmd/lantern/lantern.go:78` ipc.NewClient build error reproduces on main and is unrelated. Co-authored-by: Adam Fisk <afisk@mini.local> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds two integration tests in
egress/migration_test.gocovering the connection-migration happy path and the probe-timeout failure mode. Plus a 1-line refactor tocreateOrMigrate's parameter type to make the test feasible without a real WebSocket.Background
Nelson reported intermittent connection-migration failures on Slack. The us-linode-nj egress's logs over the last 24h confirm: 13 successful migrations / 15 attempts, with 2 failures all of the form
path probe error: context deadline exceeded. PR #301 (where migration was first implemented) had no tests — so we have zero regression coverage on the load-bearing quic-goAddPath/Probe/Switchcontract.Two suspect dep bumps landed mid-April:
Either could in principle have broken the migration API contract. Both tests in this PR pass under the current quic-go v0.59, so we now know the API contract still holds — the prod failures are likely consumer-side state GC during the WebRTC re-pair window, not a quic-go regression.
What's tested
TestConnectionManager_Migration_HappyPathReal UDP loopback for both paths. Establishes a QUIC connection over path A, then calls
createOrMigrate(csid, pathB)and asserts:*quic.Connis the same pointer as path A's (i.e. migrate branch ran, not the dial branch)TestConnectionManager_Migration_ProbeTimeoutSame setup but path B's
WriteTois wired to silently drop all writes —PATH_CHALLENGEnever reaches the consumer. Asserts:createOrMigratereturns an error rather than hangingprobeTimeout(not later)cm.connectionsafter probe failure (so the next migration attempt can retry — production'sdeleteIfNotMigratedSinceis the only thing that should remove it)Refactor
createOrMigrate(csid, pconn *errorlessWebSocketPacketConn)→createOrMigrate(csid, pconn net.PacketConn). The function only readpconn.addrfor logging; now usespconn.LocalAddr().errorlessWebSocketPacketConnalready implementsnet.PacketConn, so no production caller needs to change. This is the only way I found to test migration without standing up a real WebSocket handshake — production's path-A pconn can be anything that talks UDP-like.Test plan
go vet ./egress/...clean (the two pre-existing wsCancel warnings on egresslib.go are unchanged)go test -run TestConnectionManager_Migration -vboth passgo test -count=5stable across runsgo test -run TestNewListener(the existing egress test) still passes🤖 Generated with Claude Code