Skip to content

clientcore: classify QUIC connection errors for triage#360

Merged
myleshorton merged 2 commits into
mainfrom
fisk/quic-error-classification
May 4, 2026
Merged

clientcore: classify QUIC connection errors for triage#360
myleshorton merged 2 commits into
mainfrom
fisk/quic-error-classification

Conversation

@myleshorton

Copy link
Copy Markdown
Contributor

Summary

Adds structured error classification to the consumer-side QUICLayer.ListenAndMaintainQUICConnection so we can distinguish why a QUIC connection died when triaging migration failures.

Why

The integration test in #359 confirmed quic-go's AddPath / Probe / Switch API still works correctly under v0.59 — so the prod "path probe error: context deadline exceeded" failures we see on the egress (~13% of migration attempts on us-linode-nj) are not a quic-go API regression.

The most plausible remaining cause is consumer-side QUIC state GC during the WebRTC re-pair gap: MaxIdleTimeout=60s + KeepAlivePeriod=15s in common/QUICCfg. If a producer-replacement gap exceeds the consumer's keep-alive expectations, the consumer's listener idle-times-out the connection and discards state. The egress later sends PATH_CHALLENGE over the new producer's WS, gets nothing back, and times out at 35s.

Today's consumer-side log just says "QUIC connection error (<err>), closing!" — same line for every failure mode. Can't tell idle timeout from app close from transport error from local cancel without reading individual error strings.

What changes

AcceptStream error path now emits a structured slog record:

"QUIC connection ended"
  err=...
  err_class=idle_timeout|application_close|transport_error|context_canceled|stateless_reset|version_negotiation|handshake_timeout|deadline_exceeded|other
  lifetime_s=42.3

err_class is computed by a new classifyQUICError(err error) string helper that uses errors.As against quic-go's exported error types (*quic.IdleTimeoutError, *quic.ApplicationError, etc.) plus errors.Is(err, context.Canceled / DeadlineExceeded). The listener.Accept error path gets the same field for consistency.

Diagnostic value

err_class=idle_timeout combined with lifetime_s ≈ 60s (= MaxIdleTimeout) is the smoking-gun signature for "consumer GC'd during re-pair gap." Once we have data on whether this dominates the failure mix, the fix is one of:

  • Bump MaxIdleTimeout to 120s (one-line change in common/QUICCfg)
  • Shorten KeepAlivePeriod to ~5s so keepalives have a chance during shorter gaps
  • Or something more structural if those don't move the needle

But we shouldn't pick before we have data, hence the instrumentation.

Test plan

  • go vet ./clientcore/... clean (the existing user.go:112,114 and egress_consumer.go:52 warnings are unchanged pre-existing)
  • go build ./... clean
  • TestClassifyQUICError covers all 8 mapped error classes plus nil / wrapped / plain-error baselines (12 subtests, all pass)
  • After this lands and is bumped through radiance + lantern, grep the laptop's lantern.log for QUIC connection ended and break down by err_class over a week. Reasonable mix gives the green light to bump MaxIdleTimeout; if transport_error or stateless_reset dominates instead, it's a different problem.

🤖 Generated with Claude Code

Following the migration integration test (#359), the most likely
remaining cause of the prod "path probe error: context deadline
exceeded" failures we've been seeing on the egress is consumer-side
QUIC connection-state GC during the WebRTC re-pair gap. We can't
distinguish that from other QUIC failure modes today because the
consumer-side log of every connection death is the same:

  "QUIC connection error (<err>), closing!"

This change replaces the freeform error log with a structured slog
record that adds err_class and lifetime_s fields:

  "QUIC connection ended"
    err=...
    err_class=idle_timeout|application_close|transport_error|...
    lifetime_s=42.3

Doing it this way means we don't have to read individual error
strings to count failure modes — grep on err_class and we're done.

Specifically:
- err_class=idle_timeout combined with lifetime_s ≈ MaxIdleTimeout is
  the smoking-gun signature for "consumer GC'd during re-pair gap".
  If we see this dominate the failure mix, bumping MaxIdleTimeout
  (currently 60s) and/or shortening KeepAlivePeriod (currently 15s)
  is a cheap structural fix.
- err_class=application_close is normal teardown when the egress's
  migrationWindow expires.
- err_class=context_canceled is local Close().
- err_class=transport_error / stateless_reset / version_negotiation
  point at quic-go-internal failures.

Also adds the same err_class field to the existing listener.Accept
error log, since that path can also fail with the same set of error
types.

Tests cover all eight known classifications via type-asserted
constructors plus wrapped/nil/plain-error baselines. quic-go's error
types are exported as of v0.59 so we don't need any reflection
gymnastics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 4, 2026 21:22

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds consumer-side QUIC connection error classification and structured logging in clientcore to make QUIC migration failures easier to triage (e.g., distinguishing idle timeouts vs peer closes vs local cancellation).

Changes:

  • Emit structured slog records for QUIC listener/connection termination including err_class (and lifetime_s for established connections).
  • Add classifyQUICError(err error) string helper that maps quic-go/context error types to stable short tags.
  • Add unit test TestClassifyQUICError to pin the tag mapping contract.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
clientcore/quic.go Adds structured termination logging + classifyQUICError helper for error triage.
clientcore/quic_test.go Adds table-driven tests to lock down err_class mappings.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread clientcore/quic.go
Address Copilot review on #360: quic.ApplicationError carries a
Remote flag indicating whether the close was peer-initiated or
local-initiated. Lumping both into the same "application_close" tag
hid the distinction between "egress finished its migrationWindow and
flushed the record" (remote, expected normal teardown) and "the
local QUICLayer closed its own connection mid-flight" (local,
suspicious enough to want flagged separately).

Split into application_close_remote and application_close_local. Doc
comment + the inline classifier comment now describe both. Tests
construct the ApplicationError both with Remote=true and Remote=false
(plus the wrapped-error variants) to lock in the routing.

In production the consumer's QUICLayer doesn't currently call
CloseWithError on the connection except as cleanup AFTER AcceptStream
errors, so we expect application_close_local to be near-zero. If it
shows up, that's a signal worth investigating separately rather than
ignoring under the "egress closed us" assumption.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@myleshorton myleshorton merged commit ed3cf75 into main May 4, 2026
myleshorton pushed a commit to getlantern/radiance that referenced this pull request May 4, 2026
Picks up getlantern/lantern-box#253 which bumped broflake to land
getlantern/unbounded#360 (clientcore: classify QUIC connection
errors). After this ships in a Lantern build, lantern.log will carry
err_class structured-log fields on every "QUIC connection ended"
record, distinguishing idle_timeout / handshake_timeout /
application_close_remote / application_close_local / transport_error
/ stateless_reset / context_canceled / etc.

The transitive broflake bump also picked up getlantern/unbounded#359
(integration test for connection migration), which is test-only.

`go mod tidy` clean; pre-existing `cmd/lantern/lantern.go:78`
ipc.NewClient build error reproduces on main and is unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
myleshorton added a commit to getlantern/radiance that referenced this pull request May 6, 2026
Picks up getlantern/lantern-box#253 which bumped broflake to land
getlantern/unbounded#360 (clientcore: classify QUIC connection
errors). After this ships in a Lantern build, lantern.log will carry
err_class structured-log fields on every "QUIC connection ended"
record, distinguishing idle_timeout / handshake_timeout /
application_close_remote / application_close_local / transport_error
/ stateless_reset / context_canceled / etc.

The transitive broflake bump also picked up getlantern/unbounded#359
(integration test for connection migration), which is test-only.

`go mod tidy` clean; pre-existing `cmd/lantern/lantern.go:78`
ipc.NewClient build error reproduces on main and is unrelated.

Co-authored-by: Adam Fisk <afisk@mini.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants