Skip to content

feat: distributed locking for platform-database writes#12062

Open
premtsd-code wants to merge 25 commits into1.9.xfrom
distributed-lock
Open

feat: distributed locking for platform-database writes#12062
premtsd-code wants to merge 25 commits into1.9.xfrom
distributed-lock

Conversation

@premtsd-code
Copy link
Copy Markdown
Contributor

@premtsd-code premtsd-code commented Apr 27, 2026

Summary

Adds distributed locking on the four platform-DB write paths that were causing thundering-herd slow queries on the central platform DB when many regional pods coalesced on the same throttled accessedAt / cache-purge writes.

The lock primitive lives in a single class Appwrite\Locking\Lock (built on utopia-php/lock), exposed in DI as $lock. It owns the full acquire/release/telemetry/error-reporting/fail-open/kill-switch flow.

API

$lock->set($collection, $id, $attribute, $value);    // throttled single-attr write under per-attribute skip-lock + auth bypass
$lock->run($collection, $id, $fn);                    // skip-on-contention generic for multi-statement writes
$lock->runOrFail($collection, $id, $fn);              // block-then-409 for read-modify-write that must not silently drop
$lock->withKey($key, $fn, $ttl, $orFail, $waitTimeout);  // escape hatch for non-platform key shapes

Lock keys are scoped to the project's internal id ($project->getSequence()) so projects can't compete on the same key:

  • set: lock:platform:{projectInternalId}:{collection}:{id}:{attribute} — per-attribute, so accessedAt and mcpAccessedAt writes on the same project don't compete
  • run / runOrFail: lock:platform:{projectInternalId}:{collection}:{id} — per-document, what lost-update endpoints will need

Wired sites (4)

File Line Method Purpose
app/controllers/general.php 140 set('projects', $id, 'accessedAt', DateTime::now()) Router accessedAt bump
app/controllers/shared/api.php 254 run('keys', $id, fn) Key update + parent-doc cache purge
app/controllers/shared/api.php 395 set('projects', $id, 'accessedAt', DateTime::now()) API middleware accessedAt bump
app/controllers/shared/api.php 412 set('users', $id, 'accessedAt', $userAccessedAt) Admin path accessedAt bump

All four sites previously had every regional pod racing to write the same value; with skip-on-contention semantics, N concurrent identical writes coalesce into 1 actual platform-DB write per throttle window.

Behavior

Mode Outcome Telemetry outcome When
Acquired callback runs, lock released acquired normal path
Skipped callback does not run skipped set / run lost the race; another pod is doing the same work
Contended throws GENERAL_RESOURCE_LOCKED (HTTP 409) contended runOrFail only — for read-modify-write where dropping is wrong
Backend error callback runs unlocked (fail-open) backend_error Redis unreachable; better stale-write than total outage
Release error logged + telemetry release_error TTL expired before release
Disabled callback runs unlocked (not counted) _APP_LOCKING_ENABLED=disabled kill switch

Telemetry counter: lock.attempts{outcome,target,project}target extracted from key (e.g., projects, users, keys); project is the project internal id.

Sentry / Logger pushes are rate-limited to one per 60s per (action, target) so a sustained backend outage doesn't flood the error log across the pod fleet.

Config

New env var _APP_LOCKING_ENABLED (default enabled):

Value Effect
enabled (default) Lock factory operates normally
disabled Both Lock::run and Lock::runOrFail short-circuit; callbacks run unlocked, return values preserved

Documented in app/config/variables.php. Wired through docker-compose.yml.

New error code Exception::GENERAL_RESOURCE_LOCKED → HTTP 409, used only when runOrFail times out on contention. Reserved for the deferred lost-update follow-up; no current call site triggers it.

Pattern guide for future lock sites

"If two requests race and one wins, is losing the other one's work a bug?"

  • Noset (single-attribute throttled bumps) or run (multi-statement idempotent writes)
  • YesrunOrFail (the user/SDK retries on 409; the second write applies on top of the first)

Existing accessedAt writes are idempotent fan-out (same value, same effect) → skip semantics. The deferred lost-update endpoints (services map, OAuth providers, SMTP arrays, etc.) are read-modify-write on shared state → 409 semantics.

Tests

Unit tests at tests/unit/Locking/LockTest.php (11 tests, 27 assertions, ~3s):

  • set per-attribute key shape + auth-skipped sparse Document write
  • set skip-on-contention (held key → no DB write)
  • set different attributes don't compete (accessedAt and mcpAccessedAt independent)
  • run per-document key + callback invocation
  • run skip-on-contention (held key → callback doesn't run)
  • runOrFail throws GENERAL_RESOURCE_LOCKED on contention
  • runOrFail callback return value passes through when uncontended
  • withKey raw-key passthrough for non-platform shapes
  • withKey orFail flag throws on contention
  • _APP_LOCKING_ENABLED=disabled runs callback unlocked
  • Empty project document → unknown projectInternalId bucket

Tests fail loudly (no markTestSkipped) if Redis isn't reachable — the appwrite test container always has it.

Test plan

  • composer lint clean on all modified files
  • composer analyze (PHPStan level 3) clean on all modified files
  • composer validate clean
  • App boots with default config (_APP_LOCKING_ENABLED unset → enabled)
  • Unit tests pass (11/11)
  • App boots with _APP_LOCKING_ENABLED=disabledLock no-ops; no errors
  • Telemetry: lock.attempts{outcome=acquired,target=projects} increments on acquire
  • Telemetry: lock.attempts{outcome=skipped,target=*} increments on contention for the accessedAt paths under load
  • Manual: regional burst on the same project shows 1 accessedAt write per throttle window instead of N

Pre-merge cleanup

  • Remove repositories[utopia-php/lock] VCS pin from composer.json once utopia-php/lock is published as a tagged release (currently pinned to 0.2.* from VCS)

Notes

  • The two earlier inline factories $distributedLock / $distributedLockOrFail were consolidated into Appwrite\Locking\Lock so all lock logic lives in one file. The class uses \Redis directly (no cluster mode); cloud's local docker-compose passes _APP_REDIS_HOST=dragonfly so the same single-instance backend powers both pool and direct paths. Cluster support is staged behind a separate upstream PR to utopia-php/lock (utopia-php/lock#1) — not depended on here.
  • The deferred lost-update follow-up (18 endpoints across Project/* settings + Projects/Update/Team) was scoped out of this PR; tracked locally in required_lock.md with the recovery commit SHAs. Plan for that PR is middleware + ->label('lock', 'projects:{projectId}') driving Lock::runOrFail, matching how Appwrite already wires scope / event / audits.event.
  • Cloud companion PR: appwrite-labs/cloud#3907 — adds mcpAccessedAt (project) and teams.accessedAt lock sites, consumes the same Lock class via DI.

Adds two DI factories and wires them where coordination is needed:

  - distributedLock — skip on contention, void return. For idempotent
    fan-out where N pods doing the same write is wasteful but losing
    the race is correct.
  - distributedLockOrFail — blocking acquire (3s default) then throws
    GENERAL_RESOURCE_LOCKED (HTTP 409) on contention. For
    read-modify-write on shared mutable state where a silent skip
    would drop a user's change.

Both factories: _APP_LOCKING_ENABLED kill switch (set 'disabled' for
fail-open), fail-open on Redis-unreachable, and a lock.attempts
telemetry counter sliced by outcome and target collection.

Wired sites:
  - shared/api.php × 3 (distributedLock): keys.accessedAt + sdks,
    projects.accessedAt, users.accessedAt. Reduces redundant writes
    and cache-purge fan-out under request bursts on the same project.
  - Project/Services/Update.php × 1 (distributedLockOrFail): the
    services map toggle. Re-reads inside the lock so the baseline
    reflects concurrent updates. Two simultaneous toggles to
    different services no longer lose one of them.

Lock key namespace: lock:platform:{collection}:{id}.

Dep: premtsd-code/lock pinned to a specific commit as a development
preview. Migration to utopia-php/lock is a follow-up once that
package is published.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 27, 2026

Greptile Summary

This PR introduces distributed locking for four platform-DB write paths (accessedAt bumps for projects/users and API-key update+cache-purge) to reduce thundering-herd writes when many regional pods race on the same document. The lock primitive lives in Appwrite\Locking\Lock, registered as a per-request DI resource backed by utopia-php/lock over Redis.

Several P1 issues were flagged in prior review rounds and remain unresolved: reportError mutates the shared per-request $log object (corrupting the real request context on any lock backend error), the execute() acquire path catches only \RedisException (breaking fail-open for any other throwable from the library), the sdks array inside the key-update run() block is computed outside the lock so a concurrent request using a different SDK will have its registration silently dropped, and the blocking acquire() in runOrFail runs synchronously which will stall the Swoole event loop for up to 3 s per contended request.

Confidence Score: 3/5

Not safe to merge until the P1 issues from prior rounds (log mutation, narrow catch, SDK data loss, Swoole blocking) are addressed.

Four independently confirmed P1 defects in previous review rounds remain open: shared Log mutation corrupts error reporting, narrow catch breaks fail-open semantics, sdks append is lossy under skip-on-contention, and blocking acquire stalls the Swoole event loop. Multiple P1s pull the score below the 4/5 ceiling.

src/Appwrite/Locking/Lock.php (log mutation, narrow catch), app/controllers/shared/api.php (sdk data loss)

Important Files Changed

Filename Overview
src/Appwrite/Locking/Lock.php New Lock abstraction wrapping utopia-php/lock; has reported issues with shared $log mutation in reportError, narrow \RedisException catch breaking fail-open guarantee, and Swoole-blocking acquire() in runOrFail
app/controllers/shared/api.php Key-update path wrapped in lock.run() with skip-on-contention, but the non-idempotent sdks array accumulation (computed outside the lock) can silently lose a registration on contention
app/init/resources/request.php Single per-request lock DI factory registered cleanly; Lock constructor reads _APP_LOCKING_ENABLED at construction time so kill-switch requires worker restart to take effect
app/controllers/general.php Router accessedAt bump migrated to lock.set(); Lock dependency injected and propagated through all router call sites cleanly
composer.json Adds utopia-php/lock 0.2.* via a VCS repository entry (not yet on Packagist); PR notes the pin should be removed post-publication
tests/unit/Locking/LockTest.php 11 focused unit tests covering all public Lock methods including contention, disabled mode, and unknown-project fallback; test setup correctly cleans Redis keys per test
app/config/errors.php Adds GENERAL_RESOURCE_LOCKED → HTTP 409 error entry; no issues
app/config/variables.php Registers _APP_LOCKING_ENABLED variable with introduction version 1.9.3; no issues
src/Appwrite/Extend/Exception.php Adds GENERAL_RESOURCE_LOCKED constant; no issues
docker-compose.yml Passes _APP_LOCKING_ENABLED env var to the appwrite service; no issues

Reviews (23): Last reviewed commit: "revert: drop $log clone and sdks in-lock..." | Re-trigger Greptile

Comment thread src/Appwrite/Platform/Modules/Project/Http/Project/Services/Update.php Outdated
Comment thread app/config/variables.php
Comment thread app/init/resources.php Outdated
Comment thread composer.json
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 27, 2026

🔄 PHP-Retry Summary

Flaky tests detected across commits:

Commit 542aac7 - 4 flaky tests
Test Retries Total Time Details
UsageTest::testFunctionsStats 1 10.15s Logs
UsageTest::testPrepareSitesStats 1 8ms Logs
UsageTest::testEmbeddingsTextUsageDoesNotBreakProjectUsage 1 6ms Logs
ProxyCustomServerTest::testCreateSiteRule 1 120.60s Logs
Commit 7b2f6ac - 8 flaky tests
Test Retries Total Time Details
UsageTest::testFunctionsStats 1 10.20s Logs
UsageTest::testPrepareSitesStats 1 7ms Logs
UsageTest::testEmbeddingsTextUsageDoesNotBreakProjectUsage 1 5ms Logs
LegacyCustomServerTest::testCreatedAfter 1 240.52s Logs
LegacyPermissionsGuestTest::testReadDocuments with data set #3 1 240.14s Logs
LegacyTransactionsConsoleClientTest::testBulkDeleteMatchingCreatedDocuments 1 240.47s Logs
VectorsDBConsoleClientTest::testGetCollectionLogs 1 4ms Logs
DatabasesConsoleClientTest::testGetCollectionLogs 1 3ms Logs
Commit 03575e6 - 4 flaky tests
Test Retries Total Time Details
UsageTest::testFunctionsStats 1 10.27s Logs
UsageTest::testPrepareSitesStats 1 8ms Logs
UsageTest::testEmbeddingsTextUsageDoesNotBreakProjectUsage 1 6ms Logs
TablesDBACIDTest::testAtomicity 1 240.92s Logs
Commit 18f3280 - 4 flaky tests
Test Retries Total Time Details
UsageTest::testFunctionsStats 1 10.28s Logs
UsageTest::testPrepareSitesStats 1 7ms Logs
UsageTest::testEmbeddingsTextUsageDoesNotBreakProjectUsage 1 5ms Logs
FunctionsScheduleTest::testCreateScheduledAtExecution 1 127.73s Logs
Commit e0ec28f - 28 flaky tests
Test Retries Total Time Details
UsageTest::testFunctionsStats 1 10.21s Logs
UsageTest::testPrepareSitesStats 1 8ms Logs
UsageTest::testEmbeddingsTextUsageDoesNotBreakProjectUsage 1 6ms Logs
WebhooksCustomServerTest::testDeleteDeployment 1 18ms Logs
WebhooksCustomServerTest::testDeleteFunction 1 11ms Logs
WebhooksCustomServerTest::testCreateCollection 1 11ms Logs
WebhooksCustomServerTest::testCreateAttributes 1 12ms Logs
WebhooksCustomServerTest::testCreateDocument 1 75ms Logs
WebhooksCustomServerTest::testUpdateDocument 1 9ms Logs
WebhooksCustomServerTest::testDeleteDocument 1 13ms Logs
WebhooksCustomServerTest::testCreateTable 1 14ms Logs
WebhooksCustomServerTest::testCreateColumns 1 11ms Logs
WebhooksCustomServerTest::testCreateRow 1 9ms Logs
WebhooksCustomServerTest::testUpdateRow 1 14ms Logs
WebhooksCustomServerTest::testDeleteRow 1 14ms Logs
WebhooksCustomServerTest::testCreateStorageBucket 1 5ms Logs
WebhooksCustomServerTest::testUpdateStorageBucket 1 9ms Logs
WebhooksCustomServerTest::testCreateBucketFile 1 14ms Logs
WebhooksCustomServerTest::testUpdateBucketFile 1 7ms Logs
WebhooksCustomServerTest::testDeleteBucketFile 1 5ms Logs
WebhooksCustomServerTest::testDeleteStorageBucket 1 13ms Logs
WebhooksCustomServerTest::testCreateTeam 1 6ms Logs
WebhooksCustomServerTest::testUpdateTeam 1 11ms Logs
WebhooksCustomServerTest::testUpdateTeamPrefs 1 14ms Logs
WebhooksCustomServerTest::testDeleteTeam 1 5ms Logs
WebhooksCustomServerTest::testCreateTeamMembership 1 13ms Logs
WebhooksCustomServerTest::testDeleteTeamMembership 1 9ms Logs
WebhooksCustomServerTest::testWebhookAutoDisable 1 35ms Logs

Note: Flaky test results are tracked for the last 5 commits

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 27, 2026

✨ Benchmark results

Comparing 1.9.x (before) to distributed-lock (after).

Before

Scenario P50 (ms) P95 (ms) Requests RPS
API total 14.37 130.4 185 34.04
Account 24.68 205.79 35 7.04
TablesDB 13.15 19.79 35 8.57
Storage 12.06 52.59 75 18.01
Functions 21.23 35.35 40 9.75

After

Scenario P50 (ms) P95 (ms) Requests RPS
API total 15.1 140.29 185 33.81
Account 25.38 182.46 35 6.99
TablesDB 13.47 24.23 35 8.4
Storage 11.92 51.46 75 17.68
Functions 21.8 34.01 40 9.53

Delta

Scenario P95 delta (ms)
API total +9.89
Account -23.33
TablesDB +4.44
Storage -1.13
Functions -1.33
Top API waits
API request Max wait (ms)
account.sessions.email.create 687.39
account.create 182.35
account.password.update 157.15

premtsd-code added a commit that referenced this pull request Apr 27, 2026
Three P1 issues flagged on the initial commit:

1. Lock key in updateProjectService used "platform:project:{id}" —
   missing the "lock:" namespace prefix and using singular "project"
   instead of the conventional plural collection name. The factory's
   `lockTargetOf` extracts segment [2] as the telemetry target, so
   the broken key was emitting the project ID itself as the target
   attribute (cardinality blowup, broken dashboards). Fixed to
   "lock:platform:projects:{id}" matching the convention used in
   shared/api.php.

2. The 409 contention exception embedded the raw Redis lock key in
   its user-facing message, leaking internal collection names and
   the locking namespace to API clients. Removed the custom message
   so the catalog default ("The requested resource is currently
   being modified...") is used. Telemetry already carries the
   target collection for operator-side observability.

3. _APP_LOCKING_ENABLED variable doc had `introduction: '1.10.0'`
   on a 1.9.x-targeted PR. Corrected to '1.9.3' (next 1.9.x patch).
Comment thread app/init/resources.php Outdated
Three P1 issues flagged on the initial commit:

1. Lock key in updateProjectService used "platform:project:{id}" —
   missing the "lock:" namespace prefix and using singular "project"
   instead of the conventional plural collection name. The factory's
   `lockTargetOf` extracts segment [2] as the telemetry target, so
   the broken key was emitting the project ID itself as the target
   attribute (cardinality blowup, broken dashboards). Fixed to
   "lock:platform:projects:{id}" matching the convention used in
   shared/api.php.

2. The 409 contention exception embedded the raw Redis lock key in
   its user-facing message, leaking internal collection names and
   the locking namespace to API clients. Removed the custom message
   so the catalog default ("The requested resource is currently
   being modified...") is used. Telemetry already carries the
   target collection for operator-side observability.

3. _APP_LOCKING_ENABLED variable doc had `introduction: '1.10.0'`
   on a 1.9.x-targeted PR. Corrected to '1.9.3' (next 1.9.x patch).
Two complementary tests for the lost-update bug the lock fixes, plus
the test-Client patches needed to make Swoole-coroutine HTTP work.

- tests/e2e/Services/Project/ServicesBase.php :: testConcurrentTogglesAllPersist
  Fires N parallel PATCHes via Swoole\Coroutine\run + SWOOLE_HOOK_CURL,
  then refetches the project and asserts (successCount == enabledCount).
  With the lock disabled (_APP_LOCKING_ENABLED=disabled) sparse
  updateDocument() calls overwrite each other and the assertion fails —
  proving the test detects the bug.

- dev/test-distributed-lock.sh
  Same proof via curl + bash background jobs. Runnable outside the
  PHPUnit suite for manual verification or load-style sweeps. Reads
  APPWRITE_ENDPOINT / APPWRITE_PROJECT_ID / APPWRITE_API_KEY.

- tests/e2e/Client.php (test util):
    * Skip CURLOPT_PATH_AS_IS (option 234) when SWOOLE_HOOK_CURL is
      active. Swoole's emulated cURL doesn't support it; setting it
      fatal-errors as soon as any test enables the cURL hook.
    * Don't redundantly set CURLOPT_NOBODY=false on non-HEAD requests
      (false is cURL's default). Under Swoole's hooked cURL this
      strips the body of PATCH/PUT requests, hitting the framework's
      404 fallback instead of the intended route.
  Both changes preserve native (non-hooked) cURL behavior unchanged.
  They unblock any future test that wants real parallel HTTP via
  Swoole\Coroutine\run + the cURL hook.

Both follow the same proof-of-bug pattern: run with locking enabled
(must pass) AND with it disabled (must fail). Verified locally against
the running stack:
  _APP_LOCKING_ENABLED=enabled  -> PASS  (15 assertions)
  _APP_LOCKING_ENABLED=disabled -> FAIL  (successCount=5 enabledCount=4)
Lock backend errors (Redis/Dragonfly unreachable) and release errors
(TTL expired or backend dropped while held) were previously visible only
in the lock.attempts counter and Console::warning lines. They now also
push a structured Log entry through the configured logger adapter, so
operators using Sentry/Raygun/AppSignal/LogOwl get first-class events
for these specific failure modes.

Pattern matches Embeddings/Text/Create.php exactly:

  - Action injects 'log' (per-request Log object) and 'logger'
    (?Logger, nullable when _APP_LOGGING_CONFIG unset).
  - Helper mutates the per-request $log instead of constructing a
    fresh one — preserves the per-request context Embeddings expects.
  - Same field set: namespace='http', server, version, type,
    setMessage, setAction, setEnvironment, addTag('code', ...),
    addExtra('file' / 'line' / 'trace').
  - Defensive try/catch around addLog() so logging failures don't
    break fail-open.

Lock-specific tags added for slicing in Sentry:

  - lock.target — collection name (projects, keys, users, ...).
    Bounded set, safe for high-cardinality stores.
  - lock.key_pattern — full key with the trailing document ID
    stripped (lock:platform:projects:* not lock:platform:projects:abc).
    Prevents unbounded log cardinality from per-document IDs.

Rate limiting via per-pod static buckets, 60s window per
(action, target) combo. During a 5-minute Dragonfly outage, a fleet
of N pods produces at most N events/min, well within Sentry's dedup
tolerance. Static state is per-Swoole-worker; coroutines may race
on the bucket boundary but the worst case is one duplicate report.

Type level set to Log::TYPE_WARNING (not ERROR): fail-open means the
request still succeeds, so this is degraded operation, not a failed
request.

Deliberately NOT reported to Sentry:

  - 409 GENERAL_RESOURCE_LOCKED (normal user-facing concurrency)
  - skip-on-contention events (idempotent fan-out by design)
  - acquire retry conflicts (internal loop)
  - destructor cleanups (have an expected baseline rate; the
    lock.attempts counter aggregates them better than Sentry would)

Factory signature change: distributedLock and distributedLockOrFail
now accept ?Log and ?Logger as optional named args at call time
(rather than capturing Logger at factory-build time). The factory
closure runs once at boot but the per-request Log resource is
fresh per request — capturing at boot would have given stale state.
Existing call sites threaded log: $log, logger: $logger. Sites that
don't (workers, CLI tasks) get null and just log to Console as
before.
Comment thread app/init/resources.php Outdated
`utopia-php/lock` v0.2.0 was published this week and provides the same
Redis SET-NX-EX + Lua-compare-and-delete primitive we built locally as
`premtsd-code/lock`. Drop the dev-preview package in favor of the
official Utopia PHP library.

- composer: replace `premtsd-code/lock` with `utopia-php/lock` 0.2.*
  (still via VCS — not on Packagist yet)
- resources.php: rewire both factory variants
  - `Lock + Adapter\Redis` → `Distributed`
  - `acquire()` → `tryAcquire()` for skip variant
  - `acquire(blocking: true, waitTimeout)` → `acquire($waitTimeout)` for
    OrFail variant
  - `LockAcquireException` → `\RedisException`
  - `(int) $ttl` cast — utopia-php/lock takes seconds as int
- docker-compose: thread `_APP_LOCKING_ENABLED` into the appwrite
  service environment so the kill switch documented in
  `app/config/variables.php` is actually usable from `.env`

Verified end-to-end on local stack:
- positive case (locking enabled): 5/5 testConcurrentTogglesAllPersist
  pass, lock keys observed in `redis-cli MONITOR` with concurrent SET
  NX contention
- negative case (locking disabled): 1/3 detect lost updates as before
The pilot on `Project/Services/Update.php` (commit fb0d43d) fixed the
silent lost-update on the `services` JSON map. The same read-modify-write
pattern exists on 17 other endpoints — all writing to the SAME `projects`
document, mostly via sparse `updateDocument`. Concurrent writes silently
overwrite each other, both within an endpoint (two toggles to different
auth methods) and across endpoints (parallel SMTP update + service toggle).

The worst offender is `Projects/Update.php` (`PATCH /v1/projects/:projectId`):
it calls `setAttribute` 11 times then writes the WHOLE document back, so a
concurrent write to *any* attribute on the project doc — services, auths,
smtp, templates, mockNumbers — gets clobbered.

Wrap every read-modify-write window with `distributedLockOrFail` keyed on
`lock:platform:projects:{$projectId}`. The single shared key serializes
both same-endpoint and cross-endpoint races on the same project.

Endpoints now under the lock:

  Projects module (console-managed, plural):
    - Projects/Update                  (full doc setAttribute chain)
    - Projects/Team/Update             (sparse + computed permissions; cascade
                                        to installations/repositories/vcsComments
                                        runs after release — separate collections)

  Project module (current-project, singular):
    - AuthMethods/Update               (auths[authKey])
    - Protocols/Update                 (apis[protocolId])
    - SMTP/Update                      (smtp map; PHPMailer probe inside lock —
                                        10s default TTL covers Timeout=5)
    - Templates/Email/Update           (templates['email.X-locale'])
    - MockPhone/{Create,Delete,Update} (auths['mockNumbers'][])
    - Policies/MembershipPrivacy       (auths['memberships*'])
    - Policies/PasswordDictionary      (auths['passwordDictionary'])
    - Policies/PasswordHistory         (auths['passwordHistory'])
    - Policies/PasswordPersonalData    (auths['personalDataCheck'])
    - Policies/SessionAlert            (auths['sessionAlerts'])
    - Policies/SessionDuration         (auths['duration'])
    - Policies/SessionInvalidation     (auths['invalidateSessions'])
    - Policies/SessionLimit            (auths['maxSessions'])
    - Policies/UserLimit               (auths['limit'])

Each endpoint:
  1. Injects `distributedLockOrFail`, `log`, `logger`
  2. Re-reads the project document inside the lock so the baseline reflects
     any update that landed between request init and lock acquisition
  3. Throws `GENERAL_RESOURCE_LOCKED` (HTTP 409) on contention timeout

SMTP/Update and Templates/Email/Update additionally refactor variable-
variables (`${$key}`) to an explicit input array — PHPStan can't trace
dynamic references through closure `use` clauses.
Every request that arrives via a custom-domain rule (router path) reads
the project's `accessedAt` timestamp and, if the throttle window
(`APP_PROJECT_ACCESS`) has elapsed, writes a fresh value. With concurrent
traffic across multiple pods, this is a per-row hot RMW that loses
updates silently — the surviving timestamp depends on which pod's write
landed last.

Wrap the read-modify-write in `distributedLock('lock:platform:projects:{id}')`
(skip-on-contention variant). Every concurrent pod would write the same
throttled value, so losing the race is correct: the winner's update covers
ours.

Wires `distributedLock` and `?Logger` through:
  - `router()` function signature (app/controllers/general.php:70)
  - the three Http::init / Http::get callbacks that invoke router():
    `*` catch-all (init), `/robots.txt`, `/humans.txt`

Two related cloud-only RMW sites (`teams.accessedAt`,
`projects.mcpAccessedAt`) live in `appwrite-labs/cloud` and need a
follow-up PR there. They depend on this branch reaching 1.9.x so the
`distributedLock` DI resource is available downstream.
The previous shape required every caller to thread `log: $log, logger: $logger`
as named args into each `distributedLock(...)` invocation, plus inject `log`
and `logger` into the surrounding action just to forward them to the lock.
Across 21 call sites this added ~100 LOC of pure plumbing.

The cause: the lock factory was registered on the global container in
`app/init/resources.php`, where per-request resources like `log` aren't
visible. That forced the factory to expose its inner closure with optional
`?Log $log = null, ?Logger $logger = null` params, which every caller had
to satisfy.

Move the lock factory + its `lockErrorReporter`/`lockTargetOf` helpers from
the global container to the per-request container (`resources/request.php`),
and add `'log'` + `'logger'` to the factory's dep list. The factory closure
now runs per-request and closes over the per-request `Log`/`Logger`. Inner
closure returned to callers no longer needs the optional params, and call
sites drop the named args entirely.

Knock-on cleanup:
- Drop `->inject('log')`, `->inject('logger')`, the corresponding action
  params, and `use Utopia\Logger\{Log,Logger}` imports from 19 endpoint
  files where they were only there for the lock
- Drop the same plumbing from `app/controllers/shared/api.php` (3 lock call
  sites)
- Drop just the Logger plumbing from `app/controllers/general.php` (router
  function + 3 callbacks); `Log` is kept because it's used elsewhere in
  that file
- Net 120 LOC removed across 23 files

No behavior change: the lock factories still produce the same closures
(skip-on-contention `distributedLock`, blocking-with-409 `distributedLockOrFail`).
The static lockErrorReporter rate limiter (1 push per 60s per
`(action, target)` bucket) continues to work — it lives on a closure-static
in the helper, which is independent of where the helper is constructed.

Verified end-to-end: testConcurrentTogglesAllPersist passes 4/5 (the cold-
start race flake is the same one we've consistently seen and is orthogonal
to lock changes).
Comment thread tests/e2e/Services/Project/ServicesBase.php Outdated
Drop the 18 lost-update endpoint locks (Project/* settings + Projects
team/update). Those address a different bug class (read-modify-write
races) than the manager-flagged production problem (regions slow queries
to platform from thundering herd on accessedAt writes).

Kept:
- distributedLock + distributedLockOrFail factories on the per-request
  container, GENERAL_RESOURCE_LOCKED exception, 409 mapping
- 4 thundering-herd sites: cache-invalidation in shared/api.php (3) +
  router projects.accessedAt in general.php (1)

Dropped:
- 18 endpoint OrFail wires
- testConcurrentTogglesAllPersist + Swoole-cURL test client patches
- dev/test-distributed-lock.sh smoke script
Comment thread app/init/resources/request.php Outdated
Comment thread app/init/resources/request.php Outdated
@blacksmith-sh

This comment has been minimized.

Extracts the lock-key format and the lock+auth-skip+sparse-update pattern
into Appwrite\Locking\Lock with three methods:
  - set(collection, id, attribute=accessedAt, value=null) — throttled
    single-attribute write
  - run(collection, id, fn) — generic skip-on-contention
  - runOrFail(collection, id, fn) — block-then-409 for the deferred
    lost-update follow-up

Migrates the 4 call sites (router projects accessedAt + 3 in shared/api)
off the raw $distributedLock callable. Raw factories stay as escape
hatches for non-platform key shapes.
Comment thread app/init/resources/request.php Outdated
Lock now uses Utopia\Lock\Distributed directly and owns the full
acquire/release/telemetry/error-reporting/fail-open/kill-switch logic
that previously lived in two inline DI factory closures.

Adds withKey($key, $fn, $ttl, $orFail, $waitTimeout) as a generic
escape hatch for non-platform key shapes (cache, queue, edge) and
unusual TTL/timeout requirements.

Per-attribute lock keys for set() so that an accessedAt bump and a
mcpAccessedAt bump on the same projects:{id} document don't compete.
Whole-document operations (run, runOrFail) keep document-level keys.

Removes the standalone distributedLock and distributedLockOrFail DI
factories — Lock is the single API.

request.php shrinks ~150 LOC; Lock.php grows to ~190 LOC.
…llback

The dedicated \Redis DI resource (used by timelimit and the new Lock
class) was reading _APP_REDIS_HOST/PORT/PASS exclusively. Cloud
deployments configure cache via _APP_CONNECTIONS_CACHE URI form
(e.g. cache=redis://dragonfly:6379) and don't pass the legacy
_APP_REDIS_* vars to the appwrite container locally, so timelimit and
Lock both fail to connect outside production where Helm separately
injects the legacy vars.

Now prefers _APP_CONNECTIONS_CACHE when set (matching the cache pool
backend), falls back to _APP_REDIS_* for CE-style configs. No new env
vars introduced; both timelimit and Lock work in CE, cloud-local, and
cloud-production without compose changes.
Comment thread src/Appwrite/Locking/Lock.php
Per-manager request, lock keys are now prefixed with the project's
internal id (sequence) so that:
  - Locks are partitioned by project — Redis cluster slot affinity
    if/when sharded.
  - Cross-project requests can't compete on the same key for
    collection-scoped resources.
  - Telemetry (counter + Sentry tags) carries 'project' alongside
    'target', so dashboards can filter contention by project.

Key shapes:
  set:        lock:platform:{project}:{collection}:{id}:{attribute}
  run/orFail: lock:platform:{project}:{collection}:{id}
  withKey:    raw (caller-provided)

Lock now requires a project document at construction. All existing
call sites (4 in CE + 2 in cloud) run inside Http::init()-resolved
request scope where the project document is set, so no migration
needed. Workers/CLI without project context can use withKey directly.
Cloud production runs four separate single-master+replica Dragonfly
deployments (cache, queue-dragonfly, queue-usage, pubsub-dragonfly),
not sharded Redis Cluster topology — confirmed by deploy/cloud/values
+ environments/production/*.values.yaml (Dragonfly Operator with
replicas=2 = 1 primary + 1 read replica), and by the dev DSN scheme
'redis://' (not 'redis-cluster://').

So a standard \Redis client suffices for the direct redis resource
(timelimit, Lock). Cloud just needs to pass _APP_REDIS_HOST/PORT/USER/
PASS through to the appwrite container — handled in the cloud PR's
docker-compose.yml change.

This reverts the resource to its original pre-PR shape. The
utopia-php/lock cluster-support PR (utopia-php/lock#1) stays open at
upstream as a future-ready option if cloud ever moves to actual
Redis Cluster mode.
Covers the four public methods (set, run, runOrFail, withKey) plus the
kill switch and the project-fallback behavior.

Tests against a real Redis (the appwrite container has it always-on);
no markTestSkipped fallback — the suite fails loudly if Redis is
unreachable rather than silently passing.
Comment thread app/controllers/shared/api.php
Comment thread src/Appwrite/Locking/Lock.php
Comment thread src/Appwrite/Locking/Lock.php
@blacksmith-sh

This comment has been minimized.

Two P1s from review:

- Acquire path caught only \RedisException; any other Throwable from the
  lock library would escape and skip the fail-open callback. Widened to
  \Throwable so any backend failure falls back to running unlocked.
- reportError mutated the per-request $log directly, leaving it in a
  'lock.{action}' state that the request-end handler would then submit
  as the request log. Clones $log into a dedicated $lockLog so the
  shared instance is untouched.
The sdks attribute is an append-only list, not idempotent. With the
read happening outside the lock, two sequential acquirers could each
read the same stale list and overwrite each other's appends.

Now the lock body re-reads the keys document and re-derives the sdks
array from the fresh state. Skip-on-contention still drops the update
when the lock is held, but a same-SDK retry on the next request picks
the registration up.

Bounded loss only affects the rare 'first-seen' SDK request that
happens to land while the lock is held; sequential traffic from the
same SDK (or any later request from any SDK) re-attempts and writes.

Co-authored-by: greptile-apps[bot]
Reverts the catch widening from earlier — the underlying
Utopia\Lock\Distributed::tryAcquire only calls ext-redis methods,
which throw RedisException exclusively. Catching Throwable was
over-broad and would silently swallow real bugs (TypeError, Error)
into a fail-open path.
… parse

- TTL defaults (5s skip / 10s fail), 3s wait timeout, and 60s sentry
  rate-limit window are now class constants.
- Telemetry outcome labels (acquired/skipped/contended/backend_error/
  release_error) are class constants — typo-safe across the 9 use sites.
- Fix: inferTargetFromKey was indexing into segment 2 with limit 4,
  which returned the project sequence instead of the collection name
  ever since project-id was added to the key in c2a249c. Telemetry
  was tagging 'target' with project sequences instead of 'projects' /
  'users' / etc. Now indexes segment 3 with limit 5 to correctly pick
  the collection segment from lock:platform:{project}:{target}:...
@blacksmith-sh
Copy link
Copy Markdown

blacksmith-sh Bot commented Apr 30, 2026

Found 1 test failure on Blacksmith runners:

Failure

Test View Logs
› Tests\E2E\Services\TablesDB\Transactions</code>
TablesDBTransactionPermissionsCustomClientTest/testCannotDeleteNonExistentDocument
View Logs

Fix in Cursor

Both were defensive against rare edge cases the reviewer flagged but
which don't justify the complexity:

- $log clone: protects against tag pollution on the per-request Log
  if reportError fires AND the request later errors. http.php's
  request-end handler overwrites the core fields (namespace, message,
  action, etc.) anyway; only addTag/addExtra accumulate. Aligns with
  Embeddings/Text/Create.php precedent which mutates $log directly.

- sdks in-lock re-read: closes a sequential-acquire stale-read race on
  the sdks append-list. The race exists but the impact is bounded —
  one SDK registration delayed until the next request from that SDK
  fires. Self-healing on retry. The codebase already accepts this
  exact race for auths, oAuthProviders, services, identities, sessions,
  factors, etc. Special-casing this one site is precision the analytics
  use case doesn't need.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant