Skip to content

fix(macos): improve tailscale gateway discovery#40167

Merged
ngutman merged 1 commit intomainfrom
fix/macos-tailscale-gateway-discovery
Mar 8, 2026
Merged

fix(macos): improve tailscale gateway discovery#40167
ngutman merged 1 commit intomainfrom
fix/macos-tailscale-gateway-discovery

Conversation

@ngutman
Copy link
Contributor

@ngutman ngutman commented Mar 8, 2026

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: macOS remote gateway discovery could stop Tailscale Serve probing as soon as another remote DNS-SD gateway appeared, and GUI-launched app sessions could fail to read tailscale status --json when TERM was missing.
  • Why it matters: Serve-only gateways could disappear from onboarding/settings, and selecting tailnet-hosted gateways was more likely to land on the wrong transport.
  • What changed: keep Tailscale Serve discovery running until it finds a Serve result, prefer direct transport for Serve and resolved .ts.net gateways, and set TERM=dumb for the Tailscale subprocess when needed.
  • What did NOT change (scope boundary): no protocol/config schema changes, no auth/token handling changes, and no non-macOS runtime behavior changes.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #

User-visible / Behavior Changes

  • Tailnet-hosted gateways discovered via Tailscale Serve or resolved .ts.net endpoints now default to direct transport when selected.
  • Tailscale Serve discovery is no longer suppressed just because another remote gateway was already discovered.
  • GUI-launched macOS app sessions can populate Tailscale Serve candidates more reliably.

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (Yes)
  • Command/tool execution surface changed? (Yes)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

This changes when existing Tailscale Serve probes run and how the existing tailscale status --json subprocess is launched from the macOS app. Routing still uses resolved service endpoints rather than TXT hints, and the subprocess change is limited to supplying TERM=dumb when the GUI environment omits it.

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: SwiftPM package tests
  • Model/provider: N/A
  • Integration/channel (if any): Tailscale Serve discovery
  • Relevant config (redacted): none

Steps

  1. Build and run the targeted macOS discovery tests.
  2. Exercise gateway selection and Tailscale Serve discovery logic in the touched suites.
  3. Verify direct transport preference and subprocess environment handling in assertions.

Expected

  • Tailscale Serve discovery continues until a Serve gateway is found.
  • Selecting a tailnet-hosted gateway prefers direct transport.
  • GUI-launched environments without TERM can still query Tailscale status.

Actual

  • Verified by targeted macOS discovery tests on this branch.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Targeted test command run locally:

swift test --package-path /tmp/openclaw-review-P1bgki/apps/macos --filter 'GatewayDiscoveryModelTests|GatewayDiscoverySelectionSupportTests|TailscaleServeGatewayDiscoveryTests'

Result: 22 tests passed across 3 suites.

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: ran the targeted macOS discovery test suites covering discovery continuation, transport selection, and Tailscale subprocess environment behavior.
  • Edge cases checked: existing remote DNS-SD results no longer suppress Serve probing; Serve and merged tailnet gateways switch to direct transport; existing TERM is preserved.
  • What you did not verify: live onboarding/settings smoke on a real Mac with Tailscale running.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: switch remote transport back manually in Settings/onboarding, or revert this branch.
  • Files/config to restore: macOS gateway discovery selection and Tailscale Serve discovery files only.
  • Known bad symptoms reviewers should watch for: tailnet gateways preferring the wrong transport, or Serve candidates failing to appear in macOS discovery UI.

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: direct transport preference could surprise users who expected SSH for .ts.net gateways.
    • Mitigation: the selection logic only flips when a direct endpoint is actually available, and tests cover nearby LAN gateways staying on SSH.
  • Risk: continuing Serve discovery could increase background discovery churn briefly.
    • Mitigation: the loop still stops once a Serve result is found and retains bounded retry/backoff behavior.

@openclaw-barnacle openclaw-barnacle bot added app: macos App: macos size: M maintainer Maintainer-authored PR labels Mar 8, 2026
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 8, 2026

Greptile Summary

This PR fixes three independent macOS-only bugs in Tailscale gateway discovery: (1) Tailscale Serve probing was being suppressed as soon as any DNS-SD remote gateway appeared; (2) selecting a Tailscale Serve or resolved .ts.net gateway would keep the user on SSH transport instead of switching to the more appropriate direct transport; and (3) GUI-launched tailscale status --json subprocesses could fail with CLIError 3 when TERM was absent from the app's environment. All three fixes are narrowly scoped to macOS discovery code, are accompanied by new unit tests, and do not touch protocol, auth, or config schemas.

Key changes:

  • GatewayDiscoveryModel.shouldContinueTailscaleServeDiscovery now ignores DNS-SD results and stops only once a Tailscale Serve result is found, preventing premature discovery termination.
  • GatewayDiscoverySelectionSupport.shouldPreferDirectTransport promotes gateways with a tailscale-serve| stable ID or a resolved .ts.net service host to direct transport when a directUrl is available.
  • applyRemoteSelection now always writes both remoteUrl and remoteTarget on every selection, keeping both state fields in sync regardless of the current transport mode.
  • TailscaleServeGatewayDiscovery.commandEnvironment injects TERM=dumb when the variable is missing from the subprocess environment, preserving existing values when present.
  • The currentGateways parameter of shouldContinueTailscaleServeDiscovery is intentionally unused (_), which may cause confusion for future maintainers — consider removing it or adding a comment explaining why it is reserved.
  • The dual-write of remoteUrl and remoteTarget in applyRemoteSelection means a half-resolved gateway (no serviceHost) would now blank both fields rather than only the active one; callers should ensure only fully-resolved gateways are passed.

Confidence Score: 4/5

  • Safe to merge with no blocking issues; two minor style observations noted but neither affects correctness or security.
  • All three bug fixes are well-scoped to macOS-only discovery code with no auth, protocol, or config schema changes. The logic in each fix is sound: the discovery loop continuation condition is correct, the transport-preference guard on directUrl availability prevents false promotions, and the TERM=dumb injection is limited to the subprocess environment and respects existing values. The 22 new/updated test cases cover the primary code paths. Two minor style issues were flagged — an unused currentGateways parameter that could mislead future maintainers, and the new dual-write of both state fields creating a subtle edge when a half-resolved gateway is passed — but neither represents a correctness or security risk in the current implementation.
  • No files require special attention; GatewayDiscoverySelectionSupport.swift is worth a second read to confirm that only fully-resolved gateways are ever passed to applyRemoteSelection.

Last reviewed commit: f4db070

Comment on lines +365 to +372
static func shouldContinueTailscaleServeDiscovery(
currentGateways _: [DiscoveredGateway],
tailscaleServeGateways: [DiscoveredGateway]) -> Bool
{
// Tailscale Serve is a parallel discovery source. DNS-SD results should not suppress the
// probe, otherwise Serve-only gateways disappear as soon as any other remote gateway is found.
tailscaleServeGateways.isEmpty
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused parameter signals future API ambiguity

currentGateways is labeled _ and entirely ignored in the implementation — only tailscaleServeGateways.isEmpty is evaluated. While the in-line comment explains the intent clearly, keeping a labeled-but-discarded parameter in a static helper makes the call-site misleading: callers pass self.gateways expecting it to influence the result, but it never does.

If there are no near-term plans to use currentGateways, consider removing it from the signature and updating the call-site accordingly:

Suggested change
static func shouldContinueTailscaleServeDiscovery(
currentGateways _: [DiscoveredGateway],
tailscaleServeGateways: [DiscoveredGateway]) -> Bool
{
// Tailscale Serve is a parallel discovery source. DNS-SD results should not suppress the
// probe, otherwise Serve-only gateways disappear as soon as any other remote gateway is found.
tailscaleServeGateways.isEmpty
}
static func shouldContinueTailscaleServeDiscovery(
tailscaleServeGateways: [DiscoveredGateway]) -> Bool
{
// Tailscale Serve is a parallel discovery source. DNS-SD results should not suppress the
// probe, otherwise Serve-only gateways disappear as soon as any other remote gateway is found.
tailscaleServeGateways.isEmpty
}

And update the call-site in scheduleTailscaleServeFallback to drop the unused argument. If the parameter is intentionally reserved for a future condition, a comment noting that intent would prevent it from being pruned by the next contributor.

Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/macos/Sources/OpenClawDiscovery/GatewayDiscoveryModel.swift
Line: 365-372

Comment:
**Unused parameter signals future API ambiguity**

`currentGateways` is labeled `_` and entirely ignored in the implementation — only `tailscaleServeGateways.isEmpty` is evaluated. While the in-line comment explains the intent clearly, keeping a labeled-but-discarded parameter in a `static` helper makes the call-site misleading: callers pass `self.gateways` expecting it to influence the result, but it never does.

If there are no near-term plans to use `currentGateways`, consider removing it from the signature and updating the call-site accordingly:

```suggestion
    static func shouldContinueTailscaleServeDiscovery(
        tailscaleServeGateways: [DiscoveredGateway]) -> Bool
    {
        // Tailscale Serve is a parallel discovery source. DNS-SD results should not suppress the
        // probe, otherwise Serve-only gateways disappear as soon as any other remote gateway is found.
        tailscaleServeGateways.isEmpty
    }
```

And update the call-site in `scheduleTailscaleServeFallback` to drop the unused argument. If the parameter is intentionally reserved for a future condition, a comment noting that intent would prevent it from being pruned by the next contributor.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +16 to +17
state.remoteUrl = GatewayDiscoveryHelpers.directUrl(for: gateway) ?? ""
state.remoteTarget = GatewayDiscoveryHelpers.sshTarget(for: gateway) ?? ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both state fields are now always written on every gateway selection

Previously remoteUrl was only written in .direct mode and remoteTarget only in .ssh mode, so whichever field wasn't active was left holding its previous value. The new code always sets both, which is logically cleaner (a later transport switch will find both values populated).

One edge to be aware of: if serviceHost is nil or empty (e.g. a Bonjour service that hasn't resolved yet), both directUrl(for:) and sshTarget(for:) return nil, so both fields are overwritten with "". Under the old code only the active field would have been blanked. This is unlikely to surface in practice — the gateway list typically only shows entries with resolved hosts — but if a half-resolved gateway could ever be passed to applyRemoteSelection, the caller's previous direct URL would now also be lost. Worth a quick sanity-check in the calling code that only fully-resolved gateways are passed here.

Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/macos/Sources/OpenClaw/GatewayDiscoverySelectionSupport.swift
Line: 16-17

Comment:
**Both state fields are now always written on every gateway selection**

Previously `remoteUrl` was only written in `.direct` mode and `remoteTarget` only in `.ssh` mode, so whichever field wasn't active was left holding its previous value. The new code always sets both, which is logically cleaner (a later transport switch will find both values populated).

One edge to be aware of: if `serviceHost` is `nil` or empty (e.g. a Bonjour service that hasn't resolved yet), both `directUrl(for:)` and `sshTarget(for:)` return `nil`, so both fields are overwritten with `""`. Under the old code only the active field would have been blanked. This is unlikely to surface in practice — the gateway list typically only shows entries with resolved hosts — but if a half-resolved gateway could ever be passed to `applyRemoteSelection`, the caller's previous direct URL would now also be lost. Worth a quick sanity-check in the calling code that only fully-resolved gateways are passed here.

How can I resolve this? If you propose a fix, please make it concise.

@ngutman ngutman force-pushed the fix/macos-tailscale-gateway-discovery branch 2 times, most recently from 9b793f0 to bb923c6 Compare March 8, 2026 19:40
@ngutman ngutman force-pushed the fix/macos-tailscale-gateway-discovery branch from bb923c6 to 5d14656 Compare March 8, 2026 19:48
@ngutman ngutman merged commit a613143 into main Mar 8, 2026
19 of 20 checks passed
@ngutman ngutman deleted the fix/macos-tailscale-gateway-discovery branch March 8, 2026 19:49
@ngutman
Copy link
Contributor Author

ngutman commented Mar 8, 2026

Landed via squash after sanitizing the macOS test tailnet hostname.

Thanks @ngutman!

GordonSH-oss pushed a commit to GordonSH-oss/openclaw that referenced this pull request Mar 9, 2026
Sanitized test tailnet hostnames and re-ran the targeted macOS gateway discovery test suite before merge.
jenawant pushed a commit to jenawant/openclaw that referenced this pull request Mar 10, 2026
Sanitized test tailnet hostnames and re-ran the targeted macOS gateway discovery test suite before merge.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

app: macos App: macos maintainer Maintainer-authored PR size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant