feat: distributed locking for platform-database writes by premtsd-code · Pull Request #12062 · appwrite/appwrite

premtsd-code · 2026-04-27T06:57:38Z

Summary

Adds distributed locking on the four platform-DB write paths that were causing thundering-herd slow queries on the central platform DB when many regional pods coalesced on the same throttled accessedAt / cache-purge writes.

The lock primitive lives in a single class Appwrite\Locking\Lock (built on utopia-php/lock), exposed in DI as $lock. It owns the full acquire/release/telemetry/error-reporting/fail-open/kill-switch flow.

API

$lock->set($collection, $id, $attribute, $value);    // throttled single-attr write under per-attribute skip-lock + auth bypass
$lock->run($collection, $id, $fn);                    // skip-on-contention generic for multi-statement writes
$lock->runOrFail($collection, $id, $fn);              // block-then-409 for read-modify-write that must not silently drop
$lock->withKey($key, $fn, $ttl, $orFail, $waitTimeout);  // escape hatch for non-platform key shapes

Lock keys are scoped to the project's internal id ($project->getSequence()) so projects can't compete on the same key:

set: lock:platform:{projectInternalId}:{collection}:{id}:{attribute} — per-attribute, so accessedAt and mcpAccessedAt writes on the same project don't compete
run / runOrFail: lock:platform:{projectInternalId}:{collection}:{id} — per-document, what lost-update endpoints will need

Wired sites (4)

File	Line	Method	Purpose
`app/controllers/general.php`	140	`set('projects', $id, 'accessedAt', DateTime::now())`	Router `accessedAt` bump
`app/controllers/shared/api.php`	254	`run('keys', $id, fn)`	Key update + parent-doc cache purge
`app/controllers/shared/api.php`	395	`set('projects', $id, 'accessedAt', DateTime::now())`	API middleware `accessedAt` bump
`app/controllers/shared/api.php`	412	`set('users', $id, 'accessedAt', $userAccessedAt)`	Admin path `accessedAt` bump

All four sites previously had every regional pod racing to write the same value; with skip-on-contention semantics, N concurrent identical writes coalesce into 1 actual platform-DB write per throttle window.

Behavior

Mode	Outcome	Telemetry `outcome`	When
Acquired	callback runs, lock released	`acquired`	normal path
Skipped	callback does not run	`skipped`	`set` / `run` lost the race; another pod is doing the same work
Contended	throws `GENERAL_RESOURCE_LOCKED` (HTTP 409)	`contended`	`runOrFail` only — for read-modify-write where dropping is wrong
Backend error	callback runs unlocked (fail-open)	`backend_error`	Redis unreachable; better stale-write than total outage
Release error	logged + telemetry	`release_error`	TTL expired before release
Disabled	callback runs unlocked	(not counted)	`_APP_LOCKING_ENABLED=disabled` kill switch

Telemetry counter: lock.attempts{outcome,target,project} — target extracted from key (e.g., projects, users, keys); project is the project internal id.

Sentry / Logger pushes are rate-limited to one per 60s per (action, target) so a sustained backend outage doesn't flood the error log across the pod fleet.

Config

New env var _APP_LOCKING_ENABLED (default enabled):

Value	Effect
`enabled` (default)	Lock factory operates normally
`disabled`	Both `Lock::run` and `Lock::runOrFail` short-circuit; callbacks run unlocked, return values preserved

Documented in app/config/variables.php. Wired through docker-compose.yml.

New error code Exception::GENERAL_RESOURCE_LOCKED → HTTP 409, used only when runOrFail times out on contention. Reserved for the deferred lost-update follow-up; no current call site triggers it.

Pattern guide for future lock sites

"If two requests race and one wins, is losing the other one's work a bug?"

No → set (single-attribute throttled bumps) or run (multi-statement idempotent writes)
Yes → runOrFail (the user/SDK retries on 409; the second write applies on top of the first)

Existing accessedAt writes are idempotent fan-out (same value, same effect) → skip semantics. The deferred lost-update endpoints (services map, OAuth providers, SMTP arrays, etc.) are read-modify-write on shared state → 409 semantics.

Tests

Unit tests at tests/unit/Locking/LockTest.php (11 tests, 27 assertions, ~3s):

set per-attribute key shape + auth-skipped sparse Document write
set skip-on-contention (held key → no DB write)
set different attributes don't compete (accessedAt and mcpAccessedAt independent)
run per-document key + callback invocation
run skip-on-contention (held key → callback doesn't run)
runOrFail throws GENERAL_RESOURCE_LOCKED on contention
runOrFail callback return value passes through when uncontended
withKey raw-key passthrough for non-platform shapes
withKey orFail flag throws on contention
_APP_LOCKING_ENABLED=disabled runs callback unlocked
Empty project document → unknown projectInternalId bucket

Tests fail loudly (no markTestSkipped) if Redis isn't reachable — the appwrite test container always has it.

Test plan

composer lint clean on all modified files
composer analyze (PHPStan level 3) clean on all modified files
composer validate clean
App boots with default config (_APP_LOCKING_ENABLED unset → enabled)
Unit tests pass (11/11)
App boots with _APP_LOCKING_ENABLED=disabled — Lock no-ops; no errors
Telemetry: lock.attempts{outcome=acquired,target=projects} increments on acquire
Telemetry: lock.attempts{outcome=skipped,target=*} increments on contention for the accessedAt paths under load
Manual: regional burst on the same project shows 1 accessedAt write per throttle window instead of N

Pre-merge cleanup

Remove repositories[utopia-php/lock] VCS pin from composer.json once utopia-php/lock is published as a tagged release (currently pinned to 0.2.* from VCS)

Notes

The two earlier inline factories $distributedLock / $distributedLockOrFail were consolidated into Appwrite\Locking\Lock so all lock logic lives in one file. The class uses \Redis directly (no cluster mode); cloud's local docker-compose passes _APP_REDIS_HOST=dragonfly so the same single-instance backend powers both pool and direct paths. Cluster support is staged behind a separate upstream PR to utopia-php/lock (utopia-php/lock#1) — not depended on here.
The deferred lost-update follow-up (18 endpoints across Project/* settings + Projects/Update/Team) was scoped out of this PR; tracked locally in required_lock.md with the recovery commit SHAs. Plan for that PR is middleware + ->label('lock', 'projects:{projectId}') driving Lock::runOrFail, matching how Appwrite already wires scope / event / audits.event.
Cloud companion PR: appwrite-labs/cloud#3907 — adds mcpAccessedAt (project) and teams.accessedAt lock sites, consumes the same Lock class via DI.

Adds two DI factories and wires them where coordination is needed: - distributedLock — skip on contention, void return. For idempotent fan-out where N pods doing the same write is wasteful but losing the race is correct. - distributedLockOrFail — blocking acquire (3s default) then throws GENERAL_RESOURCE_LOCKED (HTTP 409) on contention. For read-modify-write on shared mutable state where a silent skip would drop a user's change. Both factories: _APP_LOCKING_ENABLED kill switch (set 'disabled' for fail-open), fail-open on Redis-unreachable, and a lock.attempts telemetry counter sliced by outcome and target collection. Wired sites: - shared/api.php × 3 (distributedLock): keys.accessedAt + sdks, projects.accessedAt, users.accessedAt. Reduces redundant writes and cache-purge fan-out under request bursts on the same project. - Project/Services/Update.php × 1 (distributedLockOrFail): the services map toggle. Re-reads inside the lock so the baseline reflects concurrent updates. Two simultaneous toggles to different services no longer lose one of them. Lock key namespace: lock:platform:{collection}:{id}. Dep: premtsd-code/lock pinned to a specific commit as a development preview. Migration to utopia-php/lock is a follow-up once that package is published.

greptile-apps · 2026-04-27T07:01:03Z

Greptile Summary

This PR introduces distributed locking for four platform-DB write paths (accessedAt bumps for projects/users and API-key update+cache-purge) to reduce thundering-herd writes when many regional pods race on the same document. The lock primitive lives in Appwrite\Locking\Lock, registered as a per-request DI resource backed by utopia-php/lock over Redis.

Several P1 issues were flagged in prior review rounds and remain unresolved: reportError mutates the shared per-request $log object (corrupting the real request context on any lock backend error), the execute() acquire path catches only \RedisException (breaking fail-open for any other throwable from the library), the sdks array inside the key-update run() block is computed outside the lock so a concurrent request using a different SDK will have its registration silently dropped, and the blocking acquire() in runOrFail runs synchronously which will stall the Swoole event loop for up to 3 s per contended request.

Confidence Score: 3/5

Not safe to merge until the P1 issues from prior rounds (log mutation, narrow catch, SDK data loss, Swoole blocking) are addressed.

Four independently confirmed P1 defects in previous review rounds remain open: shared Log mutation corrupts error reporting, narrow catch breaks fail-open semantics, sdks append is lossy under skip-on-contention, and blocking acquire stalls the Swoole event loop. Multiple P1s pull the score below the 4/5 ceiling.

src/Appwrite/Locking/Lock.php (log mutation, narrow catch), app/controllers/shared/api.php (sdk data loss)

Important Files Changed

Filename	Overview
src/Appwrite/Locking/Lock.php	New Lock abstraction wrapping utopia-php/lock; has reported issues with shared `$log` mutation in `reportError`, narrow `\RedisException` catch breaking fail-open guarantee, and Swoole-blocking `acquire()` in `runOrFail`
app/controllers/shared/api.php	Key-update path wrapped in `lock.run()` with skip-on-contention, but the non-idempotent `sdks` array accumulation (computed outside the lock) can silently lose a registration on contention
app/init/resources/request.php	Single per-request `lock` DI factory registered cleanly; `Lock` constructor reads `_APP_LOCKING_ENABLED` at construction time so kill-switch requires worker restart to take effect
app/controllers/general.php	Router `accessedAt` bump migrated to `lock.set()`; `Lock` dependency injected and propagated through all router call sites cleanly
composer.json	Adds `utopia-php/lock 0.2.*` via a VCS repository entry (not yet on Packagist); PR notes the pin should be removed post-publication
tests/unit/Locking/LockTest.php	11 focused unit tests covering all public Lock methods including contention, disabled mode, and unknown-project fallback; test setup correctly cleans Redis keys per test
app/config/errors.php	Adds `GENERAL_RESOURCE_LOCKED` → HTTP 409 error entry; no issues
app/config/variables.php	Registers `_APP_LOCKING_ENABLED` variable with introduction version `1.9.3`; no issues
src/Appwrite/Extend/Exception.php	Adds `GENERAL_RESOURCE_LOCKED` constant; no issues
docker-compose.yml	Passes `_APP_LOCKING_ENABLED` env var to the appwrite service; no issues

_{Reviews (23): Last reviewed commit: "revert: drop $log clone and sdks in-lock..." | Re-trigger Greptile}

github-actions · 2026-04-27T07:04:06Z

🔄 PHP-Retry Summary

Flaky tests detected across commits:

Commit 542aac7 - 4 flaky tests

Test	Retries	Total Time	Details
`UsageTest::testFunctionsStats`	1	10.15s	Logs
`UsageTest::testPrepareSitesStats`	1	8ms	Logs
`UsageTest::testEmbeddingsTextUsageDoesNotBreakProjectUsage`	1	6ms	Logs
`ProxyCustomServerTest::testCreateSiteRule`	1	120.60s	Logs

Commit 7b2f6ac - 8 flaky tests

Test	Retries	Total Time	Details
`UsageTest::testFunctionsStats`	1	10.20s	Logs
`UsageTest::testPrepareSitesStats`	1	7ms	Logs
`UsageTest::testEmbeddingsTextUsageDoesNotBreakProjectUsage`	1	5ms	Logs
`LegacyCustomServerTest::testCreatedAfter`	1	240.52s	Logs
`LegacyPermissionsGuestTest::testReadDocuments with data set #3`	1	240.14s	Logs
`LegacyTransactionsConsoleClientTest::testBulkDeleteMatchingCreatedDocuments`	1	240.47s	Logs
`VectorsDBConsoleClientTest::testGetCollectionLogs`	1	4ms	Logs
`DatabasesConsoleClientTest::testGetCollectionLogs`	1	3ms	Logs

Commit 03575e6 - 4 flaky tests

Test	Retries	Total Time	Details
`UsageTest::testFunctionsStats`	1	10.27s	Logs
`UsageTest::testPrepareSitesStats`	1	8ms	Logs
`UsageTest::testEmbeddingsTextUsageDoesNotBreakProjectUsage`	1	6ms	Logs
`TablesDBACIDTest::testAtomicity`	1	240.92s	Logs

Commit 18f3280 - 4 flaky tests

Test	Retries	Total Time	Details
`UsageTest::testFunctionsStats`	1	10.28s	Logs
`UsageTest::testPrepareSitesStats`	1	7ms	Logs
`UsageTest::testEmbeddingsTextUsageDoesNotBreakProjectUsage`	1	5ms	Logs
`FunctionsScheduleTest::testCreateScheduledAtExecution`	1	127.73s	Logs

Commit e0ec28f - 28 flaky tests

Test	Retries	Total Time	Details
`UsageTest::testFunctionsStats`	1	10.21s	Logs
`UsageTest::testPrepareSitesStats`	1	8ms	Logs
`UsageTest::testEmbeddingsTextUsageDoesNotBreakProjectUsage`	1	6ms	Logs
`WebhooksCustomServerTest::testDeleteDeployment`	1	18ms	Logs
`WebhooksCustomServerTest::testDeleteFunction`	1	11ms	Logs
`WebhooksCustomServerTest::testCreateCollection`	1	11ms	Logs
`WebhooksCustomServerTest::testCreateAttributes`	1	12ms	Logs
`WebhooksCustomServerTest::testCreateDocument`	1	75ms	Logs
`WebhooksCustomServerTest::testUpdateDocument`	1	9ms	Logs
`WebhooksCustomServerTest::testDeleteDocument`	1	13ms	Logs
`WebhooksCustomServerTest::testCreateTable`	1	14ms	Logs
`WebhooksCustomServerTest::testCreateColumns`	1	11ms	Logs
`WebhooksCustomServerTest::testCreateRow`	1	9ms	Logs
`WebhooksCustomServerTest::testUpdateRow`	1	14ms	Logs
`WebhooksCustomServerTest::testDeleteRow`	1	14ms	Logs
`WebhooksCustomServerTest::testCreateStorageBucket`	1	5ms	Logs
`WebhooksCustomServerTest::testUpdateStorageBucket`	1	9ms	Logs
`WebhooksCustomServerTest::testCreateBucketFile`	1	14ms	Logs
`WebhooksCustomServerTest::testUpdateBucketFile`	1	7ms	Logs
`WebhooksCustomServerTest::testDeleteBucketFile`	1	5ms	Logs
`WebhooksCustomServerTest::testDeleteStorageBucket`	1	13ms	Logs
`WebhooksCustomServerTest::testCreateTeam`	1	6ms	Logs
`WebhooksCustomServerTest::testUpdateTeam`	1	11ms	Logs
`WebhooksCustomServerTest::testUpdateTeamPrefs`	1	14ms	Logs
`WebhooksCustomServerTest::testDeleteTeam`	1	5ms	Logs
`WebhooksCustomServerTest::testCreateTeamMembership`	1	13ms	Logs
`WebhooksCustomServerTest::testDeleteTeamMembership`	1	9ms	Logs
`WebhooksCustomServerTest::testWebhookAutoDisable`	1	35ms	Logs

Note: Flaky test results are tracked for the last 5 commits

github-actions · 2026-04-27T07:04:27Z

✨ Benchmark results

Comparing 1.9.x (before) to distributed-lock (after).

Before

Scenario	P50 (ms)	P95 (ms)	Requests	RPS
API total	14.37	130.4	185	34.04
Account	24.68	205.79	35	7.04
TablesDB	13.15	19.79	35	8.57
Storage	12.06	52.59	75	18.01
Functions	21.23	35.35	40	9.75

After

Scenario	P50 (ms)	P95 (ms)	Requests	RPS
API total	15.1	140.29	185	33.81
Account	25.38	182.46	35	6.99
TablesDB	13.47	24.23	35	8.4
Storage	11.92	51.46	75	17.68
Functions	21.8	34.01	40	9.53

Delta

Scenario	P95 delta (ms)
API total	+9.89
Account	-23.33
TablesDB	+4.44
Storage	-1.13
Functions	-1.33

Top API waits

API request	Max wait (ms)
account.sessions.email.create	687.39
account.create	182.35
account.password.update	157.15

Three P1 issues flagged on the initial commit: 1. Lock key in updateProjectService used "platform:project:{id}" — missing the "lock:" namespace prefix and using singular "project" instead of the conventional plural collection name. The factory's `lockTargetOf` extracts segment [2] as the telemetry target, so the broken key was emitting the project ID itself as the target attribute (cardinality blowup, broken dashboards). Fixed to "lock:platform:projects:{id}" matching the convention used in shared/api.php. 2. The 409 contention exception embedded the raw Redis lock key in its user-facing message, leaking internal collection names and the locking namespace to API clients. Removed the custom message so the catalog default ("The requested resource is currently being modified...") is used. Telemetry already carries the target collection for operator-side observability. 3. _APP_LOCKING_ENABLED variable doc had `introduction: '1.10.0'` on a 1.9.x-targeted PR. Corrected to '1.9.3' (next 1.9.x patch).

Two complementary tests for the lost-update bug the lock fixes, plus the test-Client patches needed to make Swoole-coroutine HTTP work. - tests/e2e/Services/Project/ServicesBase.php :: testConcurrentTogglesAllPersist Fires N parallel PATCHes via Swoole\Coroutine\run + SWOOLE_HOOK_CURL, then refetches the project and asserts (successCount == enabledCount). With the lock disabled (_APP_LOCKING_ENABLED=disabled) sparse updateDocument() calls overwrite each other and the assertion fails — proving the test detects the bug. - dev/test-distributed-lock.sh Same proof via curl + bash background jobs. Runnable outside the PHPUnit suite for manual verification or load-style sweeps. Reads APPWRITE_ENDPOINT / APPWRITE_PROJECT_ID / APPWRITE_API_KEY. - tests/e2e/Client.php (test util): * Skip CURLOPT_PATH_AS_IS (option 234) when SWOOLE_HOOK_CURL is active. Swoole's emulated cURL doesn't support it; setting it fatal-errors as soon as any test enables the cURL hook. * Don't redundantly set CURLOPT_NOBODY=false on non-HEAD requests (false is cURL's default). Under Swoole's hooked cURL this strips the body of PATCH/PUT requests, hitting the framework's 404 fallback instead of the intended route. Both changes preserve native (non-hooked) cURL behavior unchanged. They unblock any future test that wants real parallel HTTP via Swoole\Coroutine\run + the cURL hook. Both follow the same proof-of-bug pattern: run with locking enabled (must pass) AND with it disabled (must fail). Verified locally against the running stack: _APP_LOCKING_ENABLED=enabled -> PASS (15 assertions) _APP_LOCKING_ENABLED=disabled -> FAIL (successCount=5 enabledCount=4)

Lock backend errors (Redis/Dragonfly unreachable) and release errors (TTL expired or backend dropped while held) were previously visible only in the lock.attempts counter and Console::warning lines. They now also push a structured Log entry through the configured logger adapter, so operators using Sentry/Raygun/AppSignal/LogOwl get first-class events for these specific failure modes. Pattern matches Embeddings/Text/Create.php exactly: - Action injects 'log' (per-request Log object) and 'logger' (?Logger, nullable when _APP_LOGGING_CONFIG unset). - Helper mutates the per-request $log instead of constructing a fresh one — preserves the per-request context Embeddings expects. - Same field set: namespace='http', server, version, type, setMessage, setAction, setEnvironment, addTag('code', ...), addExtra('file' / 'line' / 'trace'). - Defensive try/catch around addLog() so logging failures don't break fail-open. Lock-specific tags added for slicing in Sentry: - lock.target — collection name (projects, keys, users, ...). Bounded set, safe for high-cardinality stores. - lock.key_pattern — full key with the trailing document ID stripped (lock:platform:projects:* not lock:platform:projects:abc). Prevents unbounded log cardinality from per-document IDs. Rate limiting via per-pod static buckets, 60s window per (action, target) combo. During a 5-minute Dragonfly outage, a fleet of N pods produces at most N events/min, well within Sentry's dedup tolerance. Static state is per-Swoole-worker; coroutines may race on the bucket boundary but the worst case is one duplicate report. Type level set to Log::TYPE_WARNING (not ERROR): fail-open means the request still succeeds, so this is degraded operation, not a failed request. Deliberately NOT reported to Sentry: - 409 GENERAL_RESOURCE_LOCKED (normal user-facing concurrency) - skip-on-contention events (idempotent fan-out by design) - acquire retry conflicts (internal loop) - destructor cleanups (have an expected baseline rate; the lock.attempts counter aggregates them better than Sentry would) Factory signature change: distributedLock and distributedLockOrFail now accept ?Log and ?Logger as optional named args at call time (rather than capturing Logger at factory-build time). The factory closure runs once at boot but the per-request Log resource is fresh per request — capturing at boot would have given stale state. Existing call sites threaded log: $log, logger: $logger. Sites that don't (workers, CLI tasks) get null and just log to Console as before.

`utopia-php/lock` v0.2.0 was published this week and provides the same Redis SET-NX-EX + Lua-compare-and-delete primitive we built locally as `premtsd-code/lock`. Drop the dev-preview package in favor of the official Utopia PHP library. - composer: replace `premtsd-code/lock` with `utopia-php/lock` 0.2.* (still via VCS — not on Packagist yet) - resources.php: rewire both factory variants - `Lock + Adapter\Redis` → `Distributed` - `acquire()` → `tryAcquire()` for skip variant - `acquire(blocking: true, waitTimeout)` → `acquire($waitTimeout)` for OrFail variant - `LockAcquireException` → `\RedisException` - `(int) $ttl` cast — utopia-php/lock takes seconds as int - docker-compose: thread `_APP_LOCKING_ENABLED` into the appwrite service environment so the kill switch documented in `app/config/variables.php` is actually usable from `.env` Verified end-to-end on local stack: - positive case (locking enabled): 5/5 testConcurrentTogglesAllPersist pass, lock keys observed in `redis-cli MONITOR` with concurrent SET NX contention - negative case (locking disabled): 1/3 detect lost updates as before

The pilot on `Project/Services/Update.php` (commit fb0d43d) fixed the silent lost-update on the `services` JSON map. The same read-modify-write pattern exists on 17 other endpoints — all writing to the SAME `projects` document, mostly via sparse `updateDocument`. Concurrent writes silently overwrite each other, both within an endpoint (two toggles to different auth methods) and across endpoints (parallel SMTP update + service toggle). The worst offender is `Projects/Update.php` (`PATCH /v1/projects/:projectId`): it calls `setAttribute` 11 times then writes the WHOLE document back, so a concurrent write to *any* attribute on the project doc — services, auths, smtp, templates, mockNumbers — gets clobbered. Wrap every read-modify-write window with `distributedLockOrFail` keyed on `lock:platform:projects:{$projectId}`. The single shared key serializes both same-endpoint and cross-endpoint races on the same project. Endpoints now under the lock: Projects module (console-managed, plural): - Projects/Update (full doc setAttribute chain) - Projects/Team/Update (sparse + computed permissions; cascade to installations/repositories/vcsComments runs after release — separate collections) Project module (current-project, singular): - AuthMethods/Update (auths[authKey]) - Protocols/Update (apis[protocolId]) - SMTP/Update (smtp map; PHPMailer probe inside lock — 10s default TTL covers Timeout=5) - Templates/Email/Update (templates['email.X-locale']) - MockPhone/{Create,Delete,Update} (auths['mockNumbers'][]) - Policies/MembershipPrivacy (auths['memberships*']) - Policies/PasswordDictionary (auths['passwordDictionary']) - Policies/PasswordHistory (auths['passwordHistory']) - Policies/PasswordPersonalData (auths['personalDataCheck']) - Policies/SessionAlert (auths['sessionAlerts']) - Policies/SessionDuration (auths['duration']) - Policies/SessionInvalidation (auths['invalidateSessions']) - Policies/SessionLimit (auths['maxSessions']) - Policies/UserLimit (auths['limit']) Each endpoint: 1. Injects `distributedLockOrFail`, `log`, `logger` 2. Re-reads the project document inside the lock so the baseline reflects any update that landed between request init and lock acquisition 3. Throws `GENERAL_RESOURCE_LOCKED` (HTTP 409) on contention timeout SMTP/Update and Templates/Email/Update additionally refactor variable- variables (`${$key}`) to an explicit input array — PHPStan can't trace dynamic references through closure `use` clauses.

Every request that arrives via a custom-domain rule (router path) reads the project's `accessedAt` timestamp and, if the throttle window (`APP_PROJECT_ACCESS`) has elapsed, writes a fresh value. With concurrent traffic across multiple pods, this is a per-row hot RMW that loses updates silently — the surviving timestamp depends on which pod's write landed last. Wrap the read-modify-write in `distributedLock('lock:platform:projects:{id}')` (skip-on-contention variant). Every concurrent pod would write the same throttled value, so losing the race is correct: the winner's update covers ours. Wires `distributedLock` and `?Logger` through: - `router()` function signature (app/controllers/general.php:70) - the three Http::init / Http::get callbacks that invoke router(): `*` catch-all (init), `/robots.txt`, `/humans.txt` Two related cloud-only RMW sites (`teams.accessedAt`, `projects.mcpAccessedAt`) live in `appwrite-labs/cloud` and need a follow-up PR there. They depend on this branch reaching 1.9.x so the `distributedLock` DI resource is available downstream.

The previous shape required every caller to thread `log: $log, logger: $logger` as named args into each `distributedLock(...)` invocation, plus inject `log` and `logger` into the surrounding action just to forward them to the lock. Across 21 call sites this added ~100 LOC of pure plumbing. The cause: the lock factory was registered on the global container in `app/init/resources.php`, where per-request resources like `log` aren't visible. That forced the factory to expose its inner closure with optional `?Log $log = null, ?Logger $logger = null` params, which every caller had to satisfy. Move the lock factory + its `lockErrorReporter`/`lockTargetOf` helpers from the global container to the per-request container (`resources/request.php`), and add `'log'` + `'logger'` to the factory's dep list. The factory closure now runs per-request and closes over the per-request `Log`/`Logger`. Inner closure returned to callers no longer needs the optional params, and call sites drop the named args entirely. Knock-on cleanup: - Drop `->inject('log')`, `->inject('logger')`, the corresponding action params, and `use Utopia\Logger\{Log,Logger}` imports from 19 endpoint files where they were only there for the lock - Drop the same plumbing from `app/controllers/shared/api.php` (3 lock call sites) - Drop just the Logger plumbing from `app/controllers/general.php` (router function + 3 callbacks); `Log` is kept because it's used elsewhere in that file - Net 120 LOC removed across 23 files No behavior change: the lock factories still produce the same closures (skip-on-contention `distributedLock`, blocking-with-409 `distributedLockOrFail`). The static lockErrorReporter rate limiter (1 push per 60s per `(action, target)` bucket) continues to work — it lives on a closure-static in the helper, which is independent of where the helper is constructed. Verified end-to-end: testConcurrentTogglesAllPersist passes 4/5 (the cold- start race flake is the same one we've consistently seen and is orthogonal to lock changes).

Drop the 18 lost-update endpoint locks (Project/* settings + Projects team/update). Those address a different bug class (read-modify-write races) than the manager-flagged production problem (regions slow queries to platform from thundering herd on accessedAt writes). Kept: - distributedLock + distributedLockOrFail factories on the per-request container, GENERAL_RESOURCE_LOCKED exception, 409 mapping - 4 thundering-herd sites: cache-invalidation in shared/api.php (3) + router projects.accessedAt in general.php (1) Dropped: - 18 endpoint OrFail wires - testConcurrentTogglesAllPersist + Swoole-cURL test client patches - dev/test-distributed-lock.sh smoke script

Extracts the lock-key format and the lock+auth-skip+sparse-update pattern into Appwrite\Locking\Lock with three methods: - set(collection, id, attribute=accessedAt, value=null) — throttled single-attribute write - run(collection, id, fn) — generic skip-on-contention - runOrFail(collection, id, fn) — block-then-409 for the deferred lost-update follow-up Migrates the 4 call sites (router projects accessedAt + 3 in shared/api) off the raw $distributedLock callable. Raw factories stay as escape hatches for non-platform key shapes.

Lock now uses Utopia\Lock\Distributed directly and owns the full acquire/release/telemetry/error-reporting/fail-open/kill-switch logic that previously lived in two inline DI factory closures. Adds withKey($key, $fn, $ttl, $orFail, $waitTimeout) as a generic escape hatch for non-platform key shapes (cache, queue, edge) and unusual TTL/timeout requirements. Per-attribute lock keys for set() so that an accessedAt bump and a mcpAccessedAt bump on the same projects:{id} document don't compete. Whole-document operations (run, runOrFail) keep document-level keys. Removes the standalone distributedLock and distributedLockOrFail DI factories — Lock is the single API. request.php shrinks ~150 LOC; Lock.php grows to ~190 LOC.

…llback The dedicated \Redis DI resource (used by timelimit and the new Lock class) was reading _APP_REDIS_HOST/PORT/PASS exclusively. Cloud deployments configure cache via _APP_CONNECTIONS_CACHE URI form (e.g. cache=redis://dragonfly:6379) and don't pass the legacy _APP_REDIS_* vars to the appwrite container locally, so timelimit and Lock both fail to connect outside production where Helm separately injects the legacy vars. Now prefers _APP_CONNECTIONS_CACHE when set (matching the cache pool backend), falls back to _APP_REDIS_* for CE-style configs. No new env vars introduced; both timelimit and Lock work in CE, cloud-local, and cloud-production without compose changes.

Per-manager request, lock keys are now prefixed with the project's internal id (sequence) so that: - Locks are partitioned by project — Redis cluster slot affinity if/when sharded. - Cross-project requests can't compete on the same key for collection-scoped resources. - Telemetry (counter + Sentry tags) carries 'project' alongside 'target', so dashboards can filter contention by project. Key shapes: set: lock:platform:{project}:{collection}:{id}:{attribute} run/orFail: lock:platform:{project}:{collection}:{id} withKey: raw (caller-provided) Lock now requires a project document at construction. All existing call sites (4 in CE + 2 in cloud) run inside Http::init()-resolved request scope where the project document is set, so no migration needed. Workers/CLI without project context can use withKey directly.

Cloud production runs four separate single-master+replica Dragonfly deployments (cache, queue-dragonfly, queue-usage, pubsub-dragonfly), not sharded Redis Cluster topology — confirmed by deploy/cloud/values + environments/production/*.values.yaml (Dragonfly Operator with replicas=2 = 1 primary + 1 read replica), and by the dev DSN scheme 'redis://' (not 'redis-cluster://'). So a standard \Redis client suffices for the direct redis resource (timelimit, Lock). Cloud just needs to pass _APP_REDIS_HOST/PORT/USER/ PASS through to the appwrite container — handled in the cloud PR's docker-compose.yml change. This reverts the resource to its original pre-PR shape. The utopia-php/lock cluster-support PR (utopia-php/lock#1) stays open at upstream as a future-ready option if cloud ever moves to actual Redis Cluster mode.

Covers the four public methods (set, run, runOrFail, withKey) plus the kill switch and the project-fallback behavior. Tests against a real Redis (the appwrite container has it always-on); no markTestSkipped fallback — the suite fails loudly if Redis is unreachable rather than silently passing.

# Conflicts: # composer.lock

Two P1s from review: - Acquire path caught only \RedisException; any other Throwable from the lock library would escape and skip the fail-open callback. Widened to \Throwable so any backend failure falls back to running unlocked. - reportError mutated the per-request $log directly, leaving it in a 'lock.{action}' state that the request-end handler would then submit as the request log. Clones $log into a dedicated $lockLog so the shared instance is untouched.

The sdks attribute is an append-only list, not idempotent. With the read happening outside the lock, two sequential acquirers could each read the same stale list and overwrite each other's appends. Now the lock body re-reads the keys document and re-derives the sdks array from the fresh state. Skip-on-contention still drops the update when the lock is held, but a same-SDK retry on the next request picks the registration up. Bounded loss only affects the rare 'first-seen' SDK request that happens to land while the lock is held; sequential traffic from the same SDK (or any later request from any SDK) re-attempts and writes. Co-authored-by: greptile-apps[bot]

Reverts the catch widening from earlier — the underlying Utopia\Lock\Distributed::tryAcquire only calls ext-redis methods, which throw RedisException exclusively. Catching Throwable was over-broad and would silently swallow real bugs (TypeError, Error) into a fail-open path.

… parse - TTL defaults (5s skip / 10s fail), 3s wait timeout, and 60s sentry rate-limit window are now class constants. - Telemetry outcome labels (acquired/skipped/contended/backend_error/ release_error) are class constants — typo-safe across the 9 use sites. - Fix: inferTargetFromKey was indexing into segment 2 with limit 4, which returned the project sequence instead of the collection name ever since project-id was added to the key in c2a249c. Telemetry was tagging 'target' with project sequences instead of 'projects' / 'users' / etc. Now indexes segment 3 with limit 5 to correctly pick the collection segment from lock:platform:{project}:{target}:...

blacksmith-sh · 2026-04-30T07:05:05Z

Found 1 test failure on Blacksmith runners:

Failure

Test	View Logs
`› Tests\E2E\Services\TablesDB\Transactions</code> TablesDBTransactionPermissionsCustomClientTest/testCannotDeleteNonExistentDocument`	View Logs

Both were defensive against rare edge cases the reviewer flagged but which don't justify the complexity: - $log clone: protects against tag pollution on the per-request Log if reportError fires AND the request later errors. http.php's request-end handler overwrites the core fields (namespace, message, action, etc.) anyway; only addTag/addExtra accumulate. Aligns with Embeddings/Text/Create.php precedent which mutates $log directly. - sdks in-lock re-read: closes a sequential-acquire stale-read race on the sdks append-list. The race exists but the impact is bounded — one SDK registration delayed until the next request from that SDK fires. Self-healing on retry. The codebase already accepts this exact race for auths, oAuthProviders, services, identities, sessions, factors, etc. Special-casing this one site is precision the analytics use case doesn't need.

greptile-apps Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread src/Appwrite/Platform/Modules/Project/Http/Project/Services/Update.php Outdated

Comment thread app/config/variables.php

Comment thread app/init/resources.php Outdated

Comment thread composer.json

greptile-apps Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread app/init/resources.php Outdated

premtsd-code added 3 commits April 27, 2026 17:24

premtsd-code force-pushed the distributed-lock branch from d66f555 to 92b5f0d Compare April 27, 2026 16:34

greptile-apps Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread app/init/resources.php Outdated

premtsd-code added 5 commits April 28, 2026 09:38

Merge branch '1.9.x' into distributed-lock

da5382d

greptile-apps Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread tests/e2e/Services/Project/ServicesBase.php Outdated

greptile-apps Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread app/init/resources/request.php Outdated

Comment thread app/init/resources/request.php Outdated

style: trim verbose comments on lock factories and call sites

b15457b

premtsd-code force-pushed the distributed-lock branch from 0063743 to b15457b Compare April 29, 2026 04:50

This comment has been minimized.

Sign in to view

greptile-apps Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread app/init/resources/request.php Outdated

premtsd-code added 3 commits April 29, 2026 07:41

fix: refresh composer.lock content-hash

18a67e0

greptile-apps Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread src/Appwrite/Locking/Lock.php

premtsd-code added 3 commits April 29, 2026 16:39

Merge remote-tracking branch 'origin/1.9.x' into distributed-lock

542aac7

# Conflicts: # composer.lock

greptile-apps Bot reviewed Apr 30, 2026

View reviewed changes

Comment thread app/controllers/shared/api.php

Comment thread src/Appwrite/Locking/Lock.php

Comment thread src/Appwrite/Locking/Lock.php

This comment has been minimized.

Sign in to view

premtsd-code added 5 commits April 30, 2026 07:13

fix: add explicit parens to new ClassName calls in LockTest (PSR-12)

7b2f6ac

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: distributed locking for platform-database writes#12062

feat: distributed locking for platform-database writes#12062
premtsd-code wants to merge 25 commits into1.9.xfrom
distributed-lock

premtsd-code commented Apr 27, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Apr 27, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Apr 27, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 27, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

blacksmith-sh Bot commented Apr 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

premtsd-code commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

API

Wired sites (4)

Behavior

Config

Pattern guide for future lock sites

Tests

Test plan

Pre-merge cleanup

Notes

Uh oh!

greptile-apps Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔄 PHP-Retry Summary

Uh oh!

github-actions Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✨ Benchmark results

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

blacksmith-sh Bot commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Failure

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

premtsd-code commented Apr 27, 2026 •

edited

Loading

greptile-apps Bot commented Apr 27, 2026 •

edited

Loading

github-actions Bot commented Apr 27, 2026 •

edited

Loading

github-actions Bot commented Apr 27, 2026 •

edited

Loading

blacksmith-sh Bot commented Apr 30, 2026 •

edited

Loading