feat: distributed locking for platform-database writes#12062
premtsd-code wants to merge 25 commits into 1.9.x from distributed-lock
Conversation
Adds two DI factories and wires them where coordination is needed:
- distributedLock — skip on contention, void return. For idempotent
fan-out where N pods doing the same write is wasteful but losing
the race is correct.
- distributedLockOrFail — blocking acquire (3s default) then throws
GENERAL_RESOURCE_LOCKED (HTTP 409) on contention. For
read-modify-write on shared mutable state where a silent skip
would drop a user's change.
Both factories: _APP_LOCKING_ENABLED kill switch (set 'disabled' for
fail-open), fail-open on Redis-unreachable, and a lock.attempts
telemetry counter sliced by outcome and target collection.
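The two factory semantics above can be sketched as follows. This is an illustrative Python model with an in-memory lock table; all names (`try_acquire`, `LOCKS`, `ResourceLockedError`) are hypothetical stand-ins for the PHP-on-Redis implementation.

```python
import time

# In-memory stand-in for Redis SET NX EX: key -> expiry timestamp.
LOCKS = {}

def try_acquire(key, ttl):
    now = time.monotonic()
    if key in LOCKS and LOCKS[key] > now:
        return False                # held and not yet expired
    LOCKS[key] = now + ttl
    return True

def distributed_lock(key, fn, ttl=5, enabled=True):
    """Skip on contention: losing the race is correct for idempotent fan-out."""
    if not enabled:                 # _APP_LOCKING_ENABLED=disabled kill switch
        return fn()
    try:
        acquired = try_acquire(key, ttl)
    except ConnectionError:         # fail-open when the backend is unreachable
        return fn()                 # (never raised by this in-memory stand-in)
    if not acquired:
        return                      # another pod is doing the same write; skip
    try:
        return fn()
    finally:
        LOCKS.pop(key, None)

class ResourceLockedError(Exception):
    """Stand-in for GENERAL_RESOURCE_LOCKED (HTTP 409)."""

def distributed_lock_or_fail(key, fn, ttl=10, wait=3.0):
    """Block up to `wait` seconds, then raise on contention: a silent skip
    would drop a user's change in a read-modify-write window."""
    deadline = time.monotonic() + wait
    while not try_acquire(key, ttl):
        if time.monotonic() >= deadline:
            raise ResourceLockedError(key)
        time.sleep(0.01)
    try:
        return fn()
    finally:
        LOCKS.pop(key, None)
```

The split mirrors the two call-site classes: idempotent fan-out tolerates a skip, read-modify-write does not.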
Wired sites:
- shared/api.php × 3 (distributedLock): keys.accessedAt + sdks,
projects.accessedAt, users.accessedAt. Reduces redundant writes
and cache-purge fan-out under request bursts on the same project.
- Project/Services/Update.php × 1 (distributedLockOrFail): the
services map toggle. Re-reads inside the lock so the baseline
reflects concurrent updates. Two simultaneous toggles to
different services no longer lose one of them.
Lock key namespace: lock:platform:{collection}:{id}.
Dep: premtsd-code/lock pinned to a specific commit as a development
preview. Migration to utopia-php/lock is a follow-up once that
package is published.
Greptile Summary
This PR introduces distributed locking for four platform-DB write paths. Several P1 issues were flagged in prior review rounds and remain unresolved.
Confidence Score: 3/5. Not safe to merge until the P1 issues from prior rounds (log mutation, narrow catch, SDK data loss, Swoole blocking) are addressed. Four independently confirmed P1 defects from previous review rounds remain open: the shared Log mutation corrupts error reporting, the narrow catch breaks fail-open semantics, the sdks append is lossy under skip-on-contention, and the blocking acquire stalls the Swoole event loop. Multiple P1s pull the score below the 4/5 ceiling.
Reviews (23). Last reviewed commit: "revert: drop $log clone and sdks in-lock..."
🔄 PHP-Retry Summary

Flaky tests detected across commits:

Commit:

| Test | Retries | Total Time |
|---|---|---|
| UsageTest::testFunctionsStats | 1 | 10.15s |
| UsageTest::testPrepareSitesStats | 1 | 8ms |
| UsageTest::testEmbeddingsTextUsageDoesNotBreakProjectUsage | 1 | 6ms |
| ProxyCustomServerTest::testCreateSiteRule | 1 | 120.60s |
Commit 7b2f6ac - 8 flaky tests

| Test | Retries | Total Time |
|---|---|---|
| UsageTest::testFunctionsStats | 1 | 10.20s |
| UsageTest::testPrepareSitesStats | 1 | 7ms |
| UsageTest::testEmbeddingsTextUsageDoesNotBreakProjectUsage | 1 | 5ms |
| LegacyCustomServerTest::testCreatedAfter | 1 | 240.52s |
| LegacyPermissionsGuestTest::testReadDocuments with data set #3 | 1 | 240.14s |
| LegacyTransactionsConsoleClientTest::testBulkDeleteMatchingCreatedDocuments | 1 | 240.47s |
| VectorsDBConsoleClientTest::testGetCollectionLogs | 1 | 4ms |
| DatabasesConsoleClientTest::testGetCollectionLogs | 1 | 3ms |
Commit 03575e6 - 4 flaky tests

| Test | Retries | Total Time |
|---|---|---|
| UsageTest::testFunctionsStats | 1 | 10.27s |
| UsageTest::testPrepareSitesStats | 1 | 8ms |
| UsageTest::testEmbeddingsTextUsageDoesNotBreakProjectUsage | 1 | 6ms |
| TablesDBACIDTest::testAtomicity | 1 | 240.92s |
Commit 18f3280 - 4 flaky tests

| Test | Retries | Total Time |
|---|---|---|
| UsageTest::testFunctionsStats | 1 | 10.28s |
| UsageTest::testPrepareSitesStats | 1 | 7ms |
| UsageTest::testEmbeddingsTextUsageDoesNotBreakProjectUsage | 1 | 5ms |
| FunctionsScheduleTest::testCreateScheduledAtExecution | 1 | 127.73s |
Commit e0ec28f - 28 flaky tests

| Test | Retries | Total Time |
|---|---|---|
| UsageTest::testFunctionsStats | 1 | 10.21s |
| UsageTest::testPrepareSitesStats | 1 | 8ms |
| UsageTest::testEmbeddingsTextUsageDoesNotBreakProjectUsage | 1 | 6ms |
| WebhooksCustomServerTest::testDeleteDeployment | 1 | 18ms |
| WebhooksCustomServerTest::testDeleteFunction | 1 | 11ms |
| WebhooksCustomServerTest::testCreateCollection | 1 | 11ms |
| WebhooksCustomServerTest::testCreateAttributes | 1 | 12ms |
| WebhooksCustomServerTest::testCreateDocument | 1 | 75ms |
| WebhooksCustomServerTest::testUpdateDocument | 1 | 9ms |
| WebhooksCustomServerTest::testDeleteDocument | 1 | 13ms |
| WebhooksCustomServerTest::testCreateTable | 1 | 14ms |
| WebhooksCustomServerTest::testCreateColumns | 1 | 11ms |
| WebhooksCustomServerTest::testCreateRow | 1 | 9ms |
| WebhooksCustomServerTest::testUpdateRow | 1 | 14ms |
| WebhooksCustomServerTest::testDeleteRow | 1 | 14ms |
| WebhooksCustomServerTest::testCreateStorageBucket | 1 | 5ms |
| WebhooksCustomServerTest::testUpdateStorageBucket | 1 | 9ms |
| WebhooksCustomServerTest::testCreateBucketFile | 1 | 14ms |
| WebhooksCustomServerTest::testUpdateBucketFile | 1 | 7ms |
| WebhooksCustomServerTest::testDeleteBucketFile | 1 | 5ms |
| WebhooksCustomServerTest::testDeleteStorageBucket | 1 | 13ms |
| WebhooksCustomServerTest::testCreateTeam | 1 | 6ms |
| WebhooksCustomServerTest::testUpdateTeam | 1 | 11ms |
| WebhooksCustomServerTest::testUpdateTeamPrefs | 1 | 14ms |
| WebhooksCustomServerTest::testDeleteTeam | 1 | 5ms |
| WebhooksCustomServerTest::testCreateTeamMembership | 1 | 13ms |
| WebhooksCustomServerTest::testDeleteTeamMembership | 1 | 9ms |
| WebhooksCustomServerTest::testWebhookAutoDisable | 1 | 35ms |
Note: Flaky test results are tracked for the last 5 commits
✨ Benchmark results: comparing 1.9.x (before) to distributed-lock (after).
Three P1 issues flagged on the initial commit:
1. Lock key in updateProjectService used "platform:project:{id}" —
missing the "lock:" namespace prefix and using singular "project"
instead of the conventional plural collection name. The factory's
`lockTargetOf` extracts segment [2] as the telemetry target, so
the broken key was emitting the project ID itself as the target
attribute (cardinality blowup, broken dashboards). Fixed to
"lock:platform:projects:{id}" matching the convention used in
shared/api.php.
2. The 409 contention exception embedded the raw Redis lock key in
its user-facing message, leaking internal collection names and
the locking namespace to API clients. Removed the custom message
so the catalog default ("The requested resource is currently
being modified...") is used. Telemetry already carries the
target collection for operator-side observability.
3. _APP_LOCKING_ENABLED variable doc had `introduction: '1.10.0'`
on a 1.9.x-targeted PR. Corrected to '1.9.3' (next 1.9.x patch).
Two complementary tests for the lost-update bug the lock fixes, plus
the test-Client patches needed to make Swoole-coroutine HTTP work.
- tests/e2e/Services/Project/ServicesBase.php :: testConcurrentTogglesAllPersist
Fires N parallel PATCHes via Swoole\Coroutine\run + SWOOLE_HOOK_CURL,
then refetches the project and asserts (successCount == enabledCount).
With the lock disabled (_APP_LOCKING_ENABLED=disabled) sparse
updateDocument() calls overwrite each other and the assertion fails —
proving the test detects the bug.
- dev/test-distributed-lock.sh
Same proof via curl + bash background jobs. Runnable outside the
PHPUnit suite for manual verification or load-style sweeps. Reads
APPWRITE_ENDPOINT / APPWRITE_PROJECT_ID / APPWRITE_API_KEY.
- tests/e2e/Client.php (test util):
* Skip CURLOPT_PATH_AS_IS (option 234) when SWOOLE_HOOK_CURL is
active. Swoole's emulated cURL doesn't support it; setting it
fatal-errors as soon as any test enables the cURL hook.
* Don't redundantly set CURLOPT_NOBODY=false on non-HEAD requests
(false is cURL's default). Under Swoole's hooked cURL this
strips the body of PATCH/PUT requests, hitting the framework's
404 fallback instead of the intended route.
Both changes preserve native (non-hooked) cURL behavior unchanged.
They unblock any future test that wants real parallel HTTP via
Swoole\Coroutine\run + the cURL hook.
Both follow the same proof-of-bug pattern: run with locking enabled
(must pass) AND with it disabled (must fail). Verified locally against
the running stack:
_APP_LOCKING_ENABLED=enabled -> PASS (15 assertions)
_APP_LOCKING_ENABLED=disabled -> FAIL (successCount=5 enabledCount=4)
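The lost-update race the test targets can be modeled outside the stack. This is a hypothetical Python sketch of the read-modify-write pattern (read the services map, flip one entry, write the whole map back); all names are illustrative, not the test's actual code.

```python
import threading

# Hypothetical model: each "request" reads the services map, enables one
# service, and writes the WHOLE map back (sparse updateDocument analogue).
def toggle_services(n, lock=None):
    doc = {"services": {f"svc{i}": False for i in range(n)}}
    barrier = threading.Barrier(n)   # line the writers up to maximize overlap

    def toggle(name):
        barrier.wait()
        if lock:
            lock.acquire()
        try:
            services = dict(doc["services"])   # read baseline
            services[name] = True              # modify one entry
            doc["services"] = services         # write whole map back
        finally:
            if lock:
                lock.release()

    threads = [threading.Thread(target=toggle, args=(f"svc{i}",)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(doc["services"].values())       # toggles that actually persisted
```

With a lock, all n toggles persist (the e2e test's successCount == enabledCount assertion); with `lock=None` a run may lose updates, mirroring the `_APP_LOCKING_ENABLED=disabled` failure mode.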
Lock backend errors (Redis/Dragonfly unreachable) and release errors
(TTL expired or backend dropped while held) were previously visible only
in the lock.attempts counter and Console::warning lines. They now also
push a structured Log entry through the configured logger adapter, so
operators using Sentry/Raygun/AppSignal/LogOwl get first-class events
for these specific failure modes.
Pattern matches Embeddings/Text/Create.php exactly:
- Action injects 'log' (per-request Log object) and 'logger'
(?Logger, nullable when _APP_LOGGING_CONFIG unset).
- Helper mutates the per-request $log instead of constructing a
fresh one — preserves the per-request context Embeddings expects.
- Same field set: namespace='http', server, version, type,
setMessage, setAction, setEnvironment, addTag('code', ...),
addExtra('file' / 'line' / 'trace').
- Defensive try/catch around addLog() so logging failures don't
break fail-open.
Lock-specific tags added for slicing in Sentry:
- lock.target — collection name (projects, keys, users, ...).
Bounded set, safe for high-cardinality stores.
- lock.key_pattern — full key with the trailing document ID
stripped (lock:platform:projects:* not lock:platform:projects:abc).
Prevents unbounded log cardinality from per-document IDs.
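A minimal sketch of the `lock.key_pattern` derivation, assuming keys of the form `lock:platform:{collection}:{id}`. Python stand-in; the function name is hypothetical.

```python
# Hypothetical helper mirroring the lock.key_pattern tag: strip the trailing
# document id so per-document IDs don't blow up log cardinality.
def key_pattern(key):
    head, sep, _doc_id = key.rpartition(":")
    return head + ":*" if sep else key
```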
Rate limiting via per-pod static buckets, 60s window per
(action, target) combo. During a 5-minute Dragonfly outage, a fleet
of N pods produces at most N events/min, well within Sentry's dedup
tolerance. Static state is per-Swoole-worker; coroutines may race
on the bucket boundary but the worst case is one duplicate report.
Type level set to Log::TYPE_WARNING (not ERROR): fail-open means the
request still succeeds, so this is degraded operation, not a failed
request.
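The rate-limiting scheme above can be sketched as a per-(action, target) bucket map. Python stand-in with illustrative names; the real code keeps this state in a closure-static per Swoole worker.

```python
import time

# One report per WINDOW seconds per (action, target) bucket; state is
# per-process, so a fleet of N pods emits at most N events per window.
WINDOW = 60.0
_buckets = {}  # (action, target) -> time of last report

def should_report(action, target, now=None):
    now = time.monotonic() if now is None else now
    key = (action, target)
    last = _buckets.get(key)
    if last is not None and now - last < WINDOW:
        return False   # suppressed: already reported in this window
    _buckets[key] = now
    return True
```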
Deliberately NOT reported to Sentry:
- 409 GENERAL_RESOURCE_LOCKED (normal user-facing concurrency)
- skip-on-contention events (idempotent fan-out by design)
- acquire retry conflicts (internal loop)
- destructor cleanups (have an expected baseline rate; the
lock.attempts counter aggregates them better than Sentry would)
Factory signature change: distributedLock and distributedLockOrFail
now accept ?Log and ?Logger as optional named args at call time
(rather than capturing Logger at factory-build time). The factory
closure runs once at boot but the per-request Log resource is
fresh per request — capturing at boot would have given stale state.
Existing call sites threaded log: $log, logger: $logger. Sites that
don't (workers, CLI tasks) get null and just log to Console as
before.
Force-pushed from d66f555 to 92b5f0d.
`utopia-php/lock` v0.2.0 was published this week and provides the same
Redis SET-NX-EX + Lua-compare-and-delete primitive we built locally as
`premtsd-code/lock`. Drop the dev-preview package in favor of the
official Utopia PHP library.
- composer: replace `premtsd-code/lock` with `utopia-php/lock` 0.2.*
(still via VCS — not on Packagist yet)
- resources.php: rewire both factory variants
- `Lock + Adapter\Redis` → `Distributed`
- `acquire()` → `tryAcquire()` for skip variant
- `acquire(blocking: true, waitTimeout)` → `acquire($waitTimeout)` for
OrFail variant
- `LockAcquireException` → `\RedisException`
- `(int) $ttl` cast — utopia-php/lock takes seconds as int
- docker-compose: thread `_APP_LOCKING_ENABLED` into the appwrite
service environment so the kill switch documented in
`app/config/variables.php` is actually usable from `.env`
Verified end-to-end on local stack:
- positive case (locking enabled): 5/5 testConcurrentTogglesAllPersist
pass, lock keys observed in `redis-cli MONITOR` with concurrent SET
NX contention
- negative case (locking disabled): 1/3 detect lost updates as before
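The SET-NX-EX acquire plus Lua compare-and-delete release that both packages implement can be modeled with an in-memory store. Illustrative Python; in Redis the compare-and-delete runs atomically as a Lua script.

```python
import time
import uuid

store = {}  # key -> (token, expiry); in-memory stand-in for Redis

def acquire(key, ttl):
    """SET key token NX EX ttl: succeed only if absent or expired."""
    now = time.monotonic()
    entry = store.get(key)
    if entry is not None and entry[1] > now:
        return None                      # NX: a live key already exists
    token = uuid.uuid4().hex
    store[key] = (token, now + ttl)      # EX: expires after ttl seconds
    return token

def release(key, token):
    """Compare-and-delete: remove only if the key still holds our token."""
    entry = store.get(key)
    if entry is not None and entry[0] == token:
        del store[key]                   # atomic in the real Lua script
        return True
    return False                         # TTL expired and someone re-acquired
```

The token compare is what makes release safe: a holder whose TTL lapsed cannot delete a lock that another pod has since acquired.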
The pilot on `Project/Services/Update.php` (commit fb0d43d) fixed the silent lost-update on the `services` JSON map. The same read-modify-write pattern exists on 17 other endpoints, all writing to the SAME `projects` document, mostly via sparse `updateDocument`. Concurrent writes silently overwrite each other, both within an endpoint (two toggles to different auth methods) and across endpoints (parallel SMTP update + service toggle). The worst offender is `Projects/Update.php` (`PATCH /v1/projects/:projectId`): it calls `setAttribute` 11 times then writes the WHOLE document back, so a concurrent write to *any* attribute on the project doc (services, auths, smtp, templates, mockNumbers) gets clobbered.

Wrap every read-modify-write window with `distributedLockOrFail` keyed on `lock:platform:projects:{$projectId}`. The single shared key serializes both same-endpoint and cross-endpoint races on the same project.

Endpoints now under the lock:

Projects module (console-managed, plural):
- Projects/Update (full doc setAttribute chain)
- Projects/Team/Update (sparse + computed permissions; cascade to installations/repositories/vcsComments runs after release, in separate collections)

Project module (current-project, singular):
- AuthMethods/Update (auths[authKey])
- Protocols/Update (apis[protocolId])
- SMTP/Update (smtp map; PHPMailer probe inside lock, 10s default TTL covers Timeout=5)
- Templates/Email/Update (templates['email.X-locale'])
- MockPhone/{Create,Delete,Update} (auths['mockNumbers'][])
- Policies/MembershipPrivacy (auths['memberships*'])
- Policies/PasswordDictionary (auths['passwordDictionary'])
- Policies/PasswordHistory (auths['passwordHistory'])
- Policies/PasswordPersonalData (auths['personalDataCheck'])
- Policies/SessionAlert (auths['sessionAlerts'])
- Policies/SessionDuration (auths['duration'])
- Policies/SessionInvalidation (auths['invalidateSessions'])
- Policies/SessionLimit (auths['maxSessions'])
- Policies/UserLimit (auths['limit'])

Each endpoint:
1. Injects `distributedLockOrFail`, `log`, `logger`
2. Re-reads the project document inside the lock so the baseline reflects any update that landed between request init and lock acquisition
3. Throws `GENERAL_RESOURCE_LOCKED` (HTTP 409) on contention timeout

SMTP/Update and Templates/Email/Update additionally refactor variable-variables (`${$key}`) to an explicit input array; PHPStan can't trace dynamic references through closure `use` clauses.
Every request that arrives via a custom-domain rule (router path) reads
the project's `accessedAt` timestamp and, if the throttle window
(`APP_PROJECT_ACCESS`) has elapsed, writes a fresh value. With concurrent
traffic across multiple pods, this is a per-row hot RMW that loses
updates silently — the surviving timestamp depends on which pod's write
landed last.
Wrap the read-modify-write in `distributedLock('lock:platform:projects:{id}')`
(skip-on-contention variant). Every concurrent pod would write the same
throttled value, so losing the race is correct: the winner's update covers
ours.
Wires `distributedLock` and `?Logger` through:
- `router()` function signature (app/controllers/general.php:70)
- the three Http::init / Http::get callbacks that invoke router():
`*` catch-all (init), `/robots.txt`, `/humans.txt`
Two related cloud-only RMW sites (`teams.accessedAt`,
`projects.mcpAccessedAt`) live in `appwrite-labs/cloud` and need a
follow-up PR there. They depend on this branch reaching 1.9.x so the
`distributedLock` DI resource is available downstream.
The previous shape required every caller to thread `log: $log, logger: $logger`
as named args into each `distributedLock(...)` invocation, plus inject `log`
and `logger` into the surrounding action just to forward them to the lock.
Across 21 call sites this added ~100 LOC of pure plumbing.
The cause: the lock factory was registered on the global container in
`app/init/resources.php`, where per-request resources like `log` aren't
visible. That forced the factory to expose its inner closure with optional
`?Log $log = null, ?Logger $logger = null` params, which every caller had
to satisfy.
Move the lock factory + its `lockErrorReporter`/`lockTargetOf` helpers from
the global container to the per-request container (`resources/request.php`),
and add `'log'` + `'logger'` to the factory's dep list. The factory closure
now runs per-request and closes over the per-request `Log`/`Logger`. Inner
closure returned to callers no longer needs the optional params, and call
sites drop the named args entirely.
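The boot-time-capture problem this commit fixes reduces to a two-line closure example. Hypothetical Python sketch; `container` stands in for the DI container and `"log"` for the per-request Log resource.

```python
# `container` stands in for the DI container; "log" is the per-request resource.
container = {"log": "log-for-request-1"}

def factory_built_at_boot(log):
    return lambda: log                 # captures the boot-time value forever

def factory_built_per_request(c):
    return lambda: c["log"]            # resolves inside the request scope

boot_closure = factory_built_at_boot(container["log"])
fresh_closure = factory_built_per_request(container)
container["log"] = "log-for-request-2"  # next request swaps in a fresh Log
```

The boot-time closure keeps serving the stale first-request value; the per-request closure sees whatever the current request registered.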
Knock-on cleanup:
- Drop `->inject('log')`, `->inject('logger')`, the corresponding action
params, and `use Utopia\Logger\{Log,Logger}` imports from 19 endpoint
files where they were only there for the lock
- Drop the same plumbing from `app/controllers/shared/api.php` (3 lock call
sites)
- Drop just the Logger plumbing from `app/controllers/general.php` (router
function + 3 callbacks); `Log` is kept because it's used elsewhere in
that file
- Net 120 LOC removed across 23 files
No behavior change: the lock factories still produce the same closures
(skip-on-contention `distributedLock`, blocking-with-409 `distributedLockOrFail`).
The static lockErrorReporter rate limiter (1 push per 60s per
`(action, target)` bucket) continues to work — it lives on a closure-static
in the helper, which is independent of where the helper is constructed.
Verified end-to-end: testConcurrentTogglesAllPersist passes 4/5 (the cold-
start race flake is the same one we've consistently seen and is orthogonal
to lock changes).
Drop the 18 lost-update endpoint locks (Project/* settings + Projects team/update). Those address a different bug class (read-modify-write races) than the manager-flagged production problem (regions: slow queries to platform from a thundering herd of accessedAt writes).

Kept:
- distributedLock + distributedLockOrFail factories on the per-request container, GENERAL_RESOURCE_LOCKED exception, 409 mapping
- 4 thundering-herd sites: cache-invalidation in shared/api.php (3) + router projects.accessedAt in general.php (1)

Dropped:
- 18 endpoint OrFail wires
- testConcurrentTogglesAllPersist + Swoole-cURL test client patches
- dev/test-distributed-lock.sh smoke script
Force-pushed from 0063743 to b15457b.
Extracts the lock-key format and the lock+auth-skip+sparse-update pattern
into Appwrite\Locking\Lock with three methods:
- set(collection, id, attribute=accessedAt, value=null) — throttled
single-attribute write
- run(collection, id, fn) — generic skip-on-contention
- runOrFail(collection, id, fn) — block-then-409 for the deferred
lost-update follow-up
Migrates the 4 call sites (router projects accessedAt + 3 in shared/api)
off the raw $distributedLock callable. Raw factories stay as escape
hatches for non-platform key shapes.
Lock now uses Utopia\Lock\Distributed directly and owns the full
acquire/release/telemetry/error-reporting/fail-open/kill-switch logic
that previously lived in two inline DI factory closures.
Adds withKey($key, $fn, $ttl, $orFail, $waitTimeout) as a generic
escape hatch for non-platform key shapes (cache, queue, edge) and
unusual TTL/timeout requirements.
Per-attribute lock keys for set() so that an accessedAt bump and a
mcpAccessedAt bump on the same projects:{id} document don't compete.
Whole-document operations (run, runOrFail) keep document-level keys.
Removes the standalone distributedLock and distributedLockOrFail DI
factories — Lock is the single API.
request.php shrinks ~150 LOC; Lock.php grows to ~190 LOC.
…llback

The dedicated \Redis DI resource (used by timelimit and the new Lock class) was reading _APP_REDIS_HOST/PORT/PASS exclusively. Cloud deployments configure cache via the _APP_CONNECTIONS_CACHE URI form (e.g. cache=redis://dragonfly:6379) and don't pass the legacy _APP_REDIS_* vars to the appwrite container locally, so timelimit and Lock both fail to connect outside production, where Helm separately injects the legacy vars.

Now prefers _APP_CONNECTIONS_CACHE when set (matching the cache pool backend) and falls back to _APP_REDIS_* for CE-style configs. No new env vars introduced; both timelimit and Lock work in CE, cloud-local, and cloud-production without compose changes.
Per-manager request, lock keys are now prefixed with the project's
internal id (sequence) so that:
- Locks are partitioned by project — Redis cluster slot affinity
if/when sharded.
- Cross-project requests can't compete on the same key for
collection-scoped resources.
- Telemetry (counter + Sentry tags) carries 'project' alongside
'target', so dashboards can filter contention by project.
Key shapes:
set: lock:platform:{project}:{collection}:{id}:{attribute}
run/orFail: lock:platform:{project}:{collection}:{id}
withKey: raw (caller-provided)
Lock now requires a project document at construction. All existing
call sites (4 in CE + 2 in cloud) run inside Http::init()-resolved
request scope where the project document is set, so no migration
needed. Workers/CLI without project context can use withKey directly.
Cloud production runs four separate single-master+replica Dragonfly deployments (cache, queue-dragonfly, queue-usage, pubsub-dragonfly), not a sharded Redis Cluster topology. Confirmed by deploy/cloud/values + environments/production/*.values.yaml (Dragonfly Operator with replicas=2 = 1 primary + 1 read replica), and by the dev DSN scheme 'redis://' (not 'redis-cluster://').

So a standard \Redis client suffices for the direct redis resource (timelimit, Lock). Cloud just needs to pass _APP_REDIS_HOST/PORT/USER/PASS through to the appwrite container, handled in the cloud PR's docker-compose.yml change. This reverts the resource to its original pre-PR shape.

The utopia-php/lock cluster-support PR (utopia-php/lock#1) stays open at upstream as a future-ready option if cloud ever moves to actual Redis Cluster mode.
Covers the four public methods (set, run, runOrFail, withKey) plus the kill switch and the project-fallback behavior. Tests against a real Redis (the appwrite container has it always-on); no markTestSkipped fallback — the suite fails loudly if Redis is unreachable rather than silently passing.
# Conflicts:
#	composer.lock
Two P1s from review:
- Acquire path caught only \RedisException; any other Throwable from the
lock library would escape and skip the fail-open callback. Widened to
\Throwable so any backend failure falls back to running unlocked.
- reportError mutated the per-request $log directly, leaving it in a
'lock.{action}' state that the request-end handler would then submit
as the request log. Clones $log into a dedicated $lockLog so the
shared instance is untouched.
The sdks attribute is an append-only list, not idempotent. With the read happening outside the lock, two sequential acquirers could each read the same stale list and overwrite each other's appends. Now the lock body re-reads the keys document and re-derives the sdks array from the fresh state.

Skip-on-contention still drops the update when the lock is held, but a same-SDK retry on the next request picks the registration up. Bounded loss only affects the rare "first-seen" SDK request that happens to land while the lock is held; sequential traffic from the same SDK (or any later request from any SDK) re-attempts and writes.

Co-authored-by: greptile-apps[bot]
Reverts the catch widening from earlier — the underlying Utopia\Lock\Distributed::tryAcquire only calls ext-redis methods, which throw RedisException exclusively. Catching Throwable was over-broad and would silently swallow real bugs (TypeError, Error) into a fail-open path.
… parse

- TTL defaults (5s skip / 10s fail), the 3s wait timeout, and the 60s Sentry rate-limit window are now class constants.
- Telemetry outcome labels (acquired/skipped/contended/backend_error/release_error) are class constants, typo-safe across the 9 use sites.
- Fix: inferTargetFromKey was indexing into segment 2 with limit 4, which returned the project sequence instead of the collection name ever since project-id was added to the key in c2a249c. Telemetry was tagging 'target' with project sequences instead of 'projects' / 'users' / etc. Now indexes segment 3 with limit 5 to correctly pick the collection segment from lock:platform:{project}:{target}:...
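The fixed extraction can be sketched directly, mirroring the "segment 3 with limit 5" fix described above. Python stand-in; `split(':', 4)` gives the same five-segment cap as a limit-5 explode.

```python
# Key shapes: lock:platform:{project}:{collection}:{id}[:{attribute}]
def infer_target_from_key(key):
    parts = key.split(":", 4)   # at most 5 segments, like the limit-5 explode
    if len(parts) >= 4 and parts[:2] == ["lock", "platform"]:
        return parts[3]         # the collection segment
    return "unknown"
```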
Found 1 test failure on Blacksmith runners.
Both were defensive against rare edge cases the reviewer flagged but which don't justify the complexity:

- $log clone: protects against tag pollution on the per-request Log if reportError fires AND the request later errors. http.php's request-end handler overwrites the core fields (namespace, message, action, etc.) anyway; only addTag/addExtra accumulate. Aligns with the Embeddings/Text/Create.php precedent, which mutates $log directly.
- sdks in-lock re-read: closes a sequential-acquire stale-read race on the sdks append-list. The race exists but the impact is bounded: one SDK registration delayed until the next request from that SDK fires. Self-healing on retry. The codebase already accepts this exact race for auths, oAuthProviders, services, identities, sessions, factors, etc. Special-casing this one site is precision the analytics use case doesn't need.
Summary

Adds distributed locking on the four platform-DB write paths that were causing thundering-herd slow queries on the central platform DB when many regional pods coalesced on the same throttled `accessedAt` / cache-purge writes.

The lock primitive lives in a single class, `Appwrite\Locking\Lock` (built on `utopia-php/lock`), exposed in DI as `$lock`. It owns the full acquire/release/telemetry/error-reporting/fail-open/kill-switch flow.

API

Lock keys are scoped to the project's internal id (`$project->getSequence()`) so projects can't compete on the same key:

- `set`: `lock:platform:{projectInternalId}:{collection}:{id}:{attribute}` - per-attribute, so `accessedAt` and `mcpAccessedAt` writes on the same project don't compete
- `run`/`runOrFail`: `lock:platform:{projectInternalId}:{collection}:{id}` - per-document, what lost-update endpoints will need

Wired sites (4)

- `app/controllers/general.php`: `set('projects', $id, 'accessedAt', DateTime::now())` - router `accessedAt` bump
- `app/controllers/shared/api.php`: `run('keys', $id, fn)`
- `app/controllers/shared/api.php`: `set('projects', $id, 'accessedAt', DateTime::now())` - `accessedAt` bump
- `app/controllers/shared/api.php`: `set('users', $id, 'accessedAt', $userAccessedAt)` - `accessedAt` bump

All four sites previously had every regional pod racing to write the same value; with skip-on-contention semantics, N concurrent identical writes coalesce into 1 actual platform-DB write per throttle window.

Behavior

Telemetry outcomes: `acquired`; `skipped` (`set`/`run` lost the race; another pod is doing the same work); `contended` (`runOrFail` only: throws `GENERAL_RESOURCE_LOCKED`, HTTP 409, for read-modify-write where dropping is wrong); `backend_error`; `release_error`; plus the `_APP_LOCKING_ENABLED=disabled` kill switch.

Telemetry counter: `lock.attempts{outcome,target,project}` - `target` is extracted from the key (e.g., `projects`, `users`, `keys`); `project` is the project internal id.

Sentry / Logger pushes are rate-limited to one per 60s per `(action, target)` so a sustained backend outage doesn't flood the error log across the pod fleet.

Config

New env var `_APP_LOCKING_ENABLED` (default `enabled`). With `disabled`, `Lock::run` and `Lock::runOrFail` short-circuit: callbacks run unlocked, return values preserved. Documented in `app/config/variables.php`; wired through `docker-compose.yml`.

New error code

`Exception::GENERAL_RESOURCE_LOCKED` maps to HTTP 409 and is used only when `runOrFail` times out on contention. Reserved for the deferred lost-update follow-up; no current call site triggers it.

Pattern guide for future lock sites

- Use `set` (single-attribute throttled bumps) or `run` (multi-statement idempotent writes) when losing the race is correct.
- Use `runOrFail` when a silent skip would drop a change (the user/SDK retries on 409; the second write applies on top of the first).

Existing `accessedAt` writes are idempotent fan-out (same value, same effect), so skip semantics. The deferred lost-update endpoints (services map, OAuth providers, SMTP arrays, etc.) are read-modify-write on shared state, so 409 semantics.

Tests

Unit tests at `tests/unit/Locking/LockTest.php` (11 tests, 27 assertions, ~3s):

- `set` per-attribute key shape + auth-skipped sparse `Document` write
- `set` skip-on-contention (held key means no DB write)
- `set` different attributes don't compete (`accessedAt` and `mcpAccessedAt` independent)
- `run` per-document key + callback invocation
- `run` skip-on-contention (held key means callback doesn't run)
- `runOrFail` throws `GENERAL_RESOURCE_LOCKED` on contention
- `runOrFail` callback return value passes through when uncontended
- `withKey` raw-key passthrough for non-platform shapes
- `withKey` `orFail` flag throws on contention
- `_APP_LOCKING_ENABLED=disabled` runs callback unlocked
- `unknown` projectInternalId bucket

Tests fail loudly (no `markTestSkipped`) if Redis isn't reachable; the appwrite test container always has it.

Test plan

- `composer lint` clean on all modified files
- `composer analyze` (PHPStan level 3) clean on all modified files
- `composer validate` clean
- `_APP_LOCKING_ENABLED` unset defaults to `enabled`
- `_APP_LOCKING_ENABLED=disabled`: `Lock` no-ops; no errors
- `lock.attempts{outcome=acquired,target=projects}` increments on acquire
- `lock.attempts{outcome=skipped,target=*}` increments on contention for the `accessedAt` paths under load
- one `accessedAt` write per throttle window instead of N

Pre-merge cleanup

- Drop the `repositories[utopia-php/lock]` VCS pin from `composer.json` once `utopia-php/lock` is published as a tagged release (currently pinned to `0.2.*` from VCS).

Notes

- `$distributedLock`/`$distributedLockOrFail` were consolidated into `Appwrite\Locking\Lock` so all lock logic lives in one file. The class uses `\Redis` directly (no cluster mode); cloud's local docker-compose passes `_APP_REDIS_HOST=dragonfly` so the same single-instance backend powers both pool and direct paths. Cluster support is staged behind a separate upstream PR to `utopia-php/lock` (utopia-php/lock#1), not depended on here.
- The deferred lost-update work (`Project/*` settings + `Projects/Update`/`Team`) was scoped out of this PR; tracked locally in `required_lock.md` with the recovery commit SHAs. The plan for that PR is middleware + `->label('lock', 'projects:{projectId}')` driving `Lock::runOrFail`, matching how Appwrite already wires `scope`/`event`/`audits.event`.
- A cloud follow-up PR adds the `mcpAccessedAt` (project) and `teams.accessedAt` lock sites and consumes the same `Lock` class via DI.