clientcore: classify QUIC connection errors for triage#360
Merged
Conversation
Following the migration integration test (#359), the most likely remaining cause of the prod "path probe error: context deadline exceeded" failures we've been seeing on the egress is consumer-side QUIC connection-state GC during the WebRTC re-pair gap. We can't distinguish that from other QUIC failure modes today because the consumer-side log of every connection death is the same: "QUIC connection error (<err>), closing!" This change replaces the freeform error log with a structured slog record that adds err_class and lifetime_s fields: "QUIC connection ended" err=... err_class=idle_timeout|application_close|transport_error|... lifetime_s=42.3 Doing it this way means we don't have to read individual error strings to count failure modes — grep on err_class and we're done. Specifically: - err_class=idle_timeout combined with lifetime_s ≈ MaxIdleTimeout is the smoking-gun signature for "consumer GC'd during re-pair gap". If we see this dominate the failure mix, bumping MaxIdleTimeout (currently 60s) and/or shortening KeepAlivePeriod (currently 15s) is a cheap structural fix. - err_class=application_close is normal teardown when the egress's migrationWindow expires. - err_class=context_canceled is local Close(). - err_class=transport_error / stateless_reset / version_negotiation point at quic-go-internal failures. Also adds the same err_class field to the existing listener.Accept error log, since that path can also fail with the same set of error types. Tests cover all eight known classifications via type-asserted constructors plus wrapped/nil/plain-error baselines. quic-go's error types are exported as of v0.59 so we don't need any reflection gymnastics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Adds consumer-side QUIC connection error classification and structured logging in clientcore to make QUIC migration failures easier to triage (e.g., distinguishing idle timeouts vs peer closes vs local cancellation).
Changes:
- Emit structured slog records for QUIC listener/connection termination including
err_class(andlifetime_sfor established connections). - Add
classifyQUICError(err error) stringhelper that maps quic-go/context error types to stable short tags. - Add unit test
TestClassifyQUICErrorto pin the tag mapping contract.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
clientcore/quic.go |
Adds structured termination logging + classifyQUICError helper for error triage. |
clientcore/quic_test.go |
Adds table-driven tests to lock down err_class mappings. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Address Copilot review on #360: quic.ApplicationError carries a Remote flag indicating whether the close was peer-initiated or local-initiated. Lumping both into the same "application_close" tag hid the distinction between "egress finished its migrationWindow and flushed the record" (remote, expected normal teardown) and "the local QUICLayer closed its own connection mid-flight" (local, suspicious enough to want flagged separately). Split into application_close_remote and application_close_local. Doc comment + the inline classifier comment now describe both. Tests construct the ApplicationError both with Remote=true and Remote=false (plus the wrapped-error variants) to lock in the routing. In production the consumer's QUICLayer doesn't currently call CloseWithError on the connection except as cleanup AFTER AcceptStream errors, so we expect application_close_local to be near-zero. If it shows up, that's a signal worth investigating separately rather than ignoring under the "egress closed us" assumption. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
myleshorton
pushed a commit
to getlantern/radiance
that referenced
this pull request
May 4, 2026
Picks up getlantern/lantern-box#253 which bumped broflake to land getlantern/unbounded#360 (clientcore: classify QUIC connection errors). After this ships in a Lantern build, lantern.log will carry err_class structured-log fields on every "QUIC connection ended" record, distinguishing idle_timeout / handshake_timeout / application_close_remote / application_close_local / transport_error / stateless_reset / context_canceled / etc. The transitive broflake bump also picked up getlantern/unbounded#359 (integration test for connection migration), which is test-only. `go mod tidy` clean; pre-existing `cmd/lantern/lantern.go:78` ipc.NewClient build error reproduces on main and is unrelated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
myleshorton
added a commit
to getlantern/radiance
that referenced
this pull request
May 6, 2026
Picks up getlantern/lantern-box#253 which bumped broflake to land getlantern/unbounded#360 (clientcore: classify QUIC connection errors). After this ships in a Lantern build, lantern.log will carry err_class structured-log fields on every "QUIC connection ended" record, distinguishing idle_timeout / handshake_timeout / application_close_remote / application_close_local / transport_error / stateless_reset / context_canceled / etc. The transitive broflake bump also picked up getlantern/unbounded#359 (integration test for connection migration), which is test-only. `go mod tidy` clean; pre-existing `cmd/lantern/lantern.go:78` ipc.NewClient build error reproduces on main and is unrelated. Co-authored-by: Adam Fisk <afisk@mini.local> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds structured error classification to the consumer-side
QUICLayer.ListenAndMaintainQUICConnectionso we can distinguish why a QUIC connection died when triaging migration failures.Why
The integration test in #359 confirmed quic-go's
AddPath/Probe/SwitchAPI still works correctly under v0.59 — so the prod "path probe error: context deadline exceeded" failures we see on the egress (~13% of migration attempts on us-linode-nj) are not a quic-go API regression.The most plausible remaining cause is consumer-side QUIC state GC during the WebRTC re-pair gap:
MaxIdleTimeout=60s+KeepAlivePeriod=15sincommon/QUICCfg. If a producer-replacement gap exceeds the consumer's keep-alive expectations, the consumer's listener idle-times-out the connection and discards state. The egress later sendsPATH_CHALLENGEover the new producer's WS, gets nothing back, and times out at 35s.Today's consumer-side log just says
"QUIC connection error (<err>), closing!"— same line for every failure mode. Can't tell idle timeout from app close from transport error from local cancel without reading individual error strings.What changes
AcceptStreamerror path now emits a structured slog record:err_classis computed by a newclassifyQUICError(err error) stringhelper that useserrors.Asagainst quic-go's exported error types (*quic.IdleTimeoutError,*quic.ApplicationError, etc.) pluserrors.Is(err, context.Canceled / DeadlineExceeded). Thelistener.Accepterror path gets the same field for consistency.Diagnostic value
err_class=idle_timeoutcombined withlifetime_s ≈ 60s(=MaxIdleTimeout) is the smoking-gun signature for "consumer GC'd during re-pair gap." Once we have data on whether this dominates the failure mix, the fix is one of:MaxIdleTimeoutto 120s (one-line change incommon/QUICCfg)KeepAlivePeriodto ~5s so keepalives have a chance during shorter gapsBut we shouldn't pick before we have data, hence the instrumentation.
Test plan
go vet ./clientcore/...clean (the existinguser.go:112,114andegress_consumer.go:52warnings are unchanged pre-existing)go build ./...cleanTestClassifyQUICErrorcovers all 8 mapped error classes plus nil / wrapped / plain-error baselines (12 subtests, all pass)lantern.logforQUIC connection endedand break down byerr_classover a week. Reasonable mix gives the green light to bumpMaxIdleTimeout; iftransport_errororstateless_resetdominates instead, it's a different problem.🤖 Generated with Claude Code