Comparing changes

…ling Add Morpher.MaxDegreeOfParallelism (1 = fully single-threaded), replacing the dead compile-time SINGLE_THREADED flag with a runtime knob across all three within-word parallel sites (synthesis, Unordered analysis cascade, affix-template unapplication). This lets a caller (FieldWorks "Parse All Words") parallelize across words without nested oversubscription. Add MorpherStatistics (opt-in, zero overhead when disabled): Word.Clone count, analysis/synthesis phase timing, parallel-section counter (proves the sequential path runs under degree-1), and a corpus benchmark (Explicit) that reports GC.GetTotalAllocatedBytes + Gen0/1/2 against a real FLEx-exported grammar. Profiling the real Sena grammar showed ~8,793 Word.Clone and ~371 MB allocated per word (the combinatorial unapplication search). First allocation win: Shape.CopyTo builds the src->dest node map inline instead of .Zip().ToDictionary() + double re-enumeration (-2.3% alloc/word, fewer Gen0). Tests: 62 HermitCrab + 790 SIL.Machine pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> HC: memoize multiApp cascade re-expansion; measure GC under parallel load CombinationRuleCascade: in multiApp mode a word's expansion depends only on the word, so memoize already-expanded words and skip re-descending them (collapses the combinatorial re-exploration to a DAG; output set unchanged). Output-identical: 62 HC + 790 core tests pass. Measured ~0% on short Sena words (their clones come from the phonological/synthesis layers, not morphological re-expansion) but it bounds pathological re-expansion blow-up at no correctness cost. Benchmark: measure GC (allocated bytes + Gen0/Gen2) under the parallel-ACROSS-words load and report Server vs Workstation GC — this is where alloc/GC contention actually bites, unlike a single-threaded run. 
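The GC figures this benchmark is described as reporting (allocated bytes plus Gen0/Gen2 collection counts, under Server or Workstation GC) can be gathered with standard .NET APIs. A minimal sketch, assuming .NET Core 3.0 or later; `GcProbe`, `Snapshot` and `Measure` are hypothetical names for illustration, not the benchmark's actual code:

```csharp
using System;
using System.Runtime;

public static class GcProbe
{
    // Hypothetical value type holding the three counters of interest.
    public readonly struct Snapshot
    {
        public Snapshot(long bytes, int gen0, int gen2) { Bytes = bytes; Gen0 = gen0; Gen2 = gen2; }
        public long Bytes { get; }
        public int Gen0 { get; }
        public int Gen2 { get; }
    }

    // Reads process-wide allocated bytes and the Gen0/Gen2 collection counts.
    public static Snapshot Take() =>
        new Snapshot(GC.GetTotalAllocatedBytes(precise: true), GC.CollectionCount(0), GC.CollectionCount(2));

    // Runs the workload and returns the deltas across it.
    public static Snapshot Measure(Action workload)
    {
        Snapshot before = Take();
        workload();
        Snapshot after = Take();
        return new Snapshot(after.Bytes - before.Bytes, after.Gen0 - before.Gen0, after.Gen2 - before.Gen2);
    }

    public static void Main()
    {
        Snapshot delta = Measure(() =>
        {
            for (int i = 0; i < 1000; i++)
                GC.KeepAlive(new byte[1024]);
        });
        // GCSettings.IsServerGC tells which GC mode the process started with.
        Console.WriteLine($"ServerGC={GCSettings.IsServerGC} bytes={delta.Bytes} gen0={delta.Gen0} gen2={delta.Gen2}");
    }
}
```

Because these counters are process-wide, the deltas only mean something when nothing else is allocating concurrently outside the measured workload.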
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> HC perf plan: record Sena optimization results + Server-GC dominance finding Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> HC: COW design study — 3 scoped plans to cut the FeatureStruct clone firehose Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> HC: add copy-on-write safety-net tests before the COW refactor - FeatureStruct: clone-of-frozen + mutate-clone leaves source unchanged, for every mutator incl. nested-child recursion (PriorityUnion/Union/Subtract/AddValue/RemoveValue/ Clear), plus clone-is-mutable, never-mutated-clone equality, re-entrancy sharing, and ReplaceVariables isolation. Asserts the SOURCE is unchanged (not just "no throw"). - Shape: clone + mutate a cloned node's FeatureStruct leaves the source shape unchanged. - Morpher: concurrent repeated parsing is deterministic (guards COW under parallel load). All pin CURRENT behavior (801 core + 63 HC pass) so the COW refactor can't silently regress. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Plan A: copy-on-write FeatureStruct.Clone for frozen structs Clone() of a FROZEN feature struct now returns a shell that borrows the source's immutable backing dictionary; the first mutation (EnsureWritable, replacing CheckFrozen) inflates a private deep copy via the existing CloneImpl, so neither the mutation nor any recursion into children can touch shared frozen data. Clone() of an unfrozen FS still deep-copies. Single-file change; no public API change. Most cloned feature structs are never mutated, so they stay O(1) shells. Measured on the real Sena grammar: -11% managed allocation/word and ~-29% wall on the 16-way parallel pass (less GC contention). 801 core + 63 HC tests pass, including the new COW safety-net tests. 
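The copy-on-write idea described for Plan A can be sketched on a much simpler type. This is an illustration under stated assumptions, not the SIL.Machine `FeatureStruct`: `CowBag` is a hypothetical class with a flat string-to-int dictionary (no nested children), and only the name `EnsureWritable` is taken from the commit message.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a freezable structure with copy-on-write Clone.
public sealed class CowBag
{
    private Dictionary<string, int> _values;
    private bool _borrowed; // true while _values still belongs to a frozen source

    public CowBag() : this(new Dictionary<string, int>(), false) { }

    private CowBag(Dictionary<string, int> values, bool borrowed)
    {
        _values = values;
        _borrowed = borrowed;
    }

    public bool IsFrozen { get; private set; }

    public void Freeze() => IsFrozen = true;

    // Clone of a frozen bag is an O(1) shell sharing the immutable backing;
    // clone of an unfrozen bag copies eagerly, because the source may still change.
    public CowBag Clone() =>
        IsFrozen
            ? new CowBag(_values, true)
            : new CowBag(new Dictionary<string, int>(_values), false);

    // Called by every mutator: rejects writes to frozen bags and inflates
    // a private copy the first time a borrowing shell is written to.
    private void EnsureWritable()
    {
        if (IsFrozen)
            throw new InvalidOperationException("The bag is frozen.");
        if (_borrowed)
        {
            _values = new Dictionary<string, int>(_values);
            _borrowed = false;
        }
    }

    public void Set(string key, int value)
    {
        EnsureWritable();
        _values[key] = value;
    }

    public bool TryGet(string key, out int value) => _values.TryGetValue(key, out value);

    // Exposed only so the sharing behaviour can be observed in a test.
    public bool SharesBackingWith(CowBag other) => ReferenceEquals(_values, other._values);
}
```

The inflate step only reads the shared backing, which is why a frozen source can be cloned and mutated from several threads at once without locking, provided each clone is used by one thread.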
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> HC COW doc: record Plan A result (-11% alloc) and Plan B subsumed/blocked finding Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> HC: out-of-process Server-GC parser (worker host + reusable client) New SIL.Machine.Morphology.HermitCrab.Server project lets a host application get Server-GC parsing throughput WITHOUT changing its own GC mode, by running the morpher in a child process: - HermitCrabServerHost: loads a compiled HC config, serves analyze requests over stdin/stdout (newline-delimited JSON), parses each word single-threaded with parallelism across the batch. Launched with DOTNET_gcServer=1. - HermitCrabServerClient: reusable IMorphologicalAnalyzer that launches/manages the worker, drives the batch protocol, and returns WordAnalysis. Morphemes cross the boundary as DTOs that implement IMorpheme, so the client needs no grammar load. - Shared protocol DTOs guarantee the two ends agree. Unlike XAmple (native, in-process, no managed GC), HC is managed, and GC mode is fixed at process startup — so a worker subprocess is the only way to scope Server GC to the parser. Grammar-config-driven, so any Machine HC consumer can use it; FieldWorks adds a thin IParser adapter mapping morph Properties -> LCM. End-to-end on the real Sena grammar: out-of-process results match in-process; worker runs Server GC while the host runs Workstation GC (verified). 63 HC tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> HC: apply CSharpier formatting + braces to satisfy CI (formatting + code-style) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> HC: address Copilot review comments - CombinationRuleCascade: seed the memoization set with the initial input so a cycle back to it (A->B->A) doesn't re-expand it. - Morpher.ParseWord: drop the redundant origAnalyses copy (analyses is already materialized and Synthesize no longer drains it). 
- Server host/client: handle null JsonSerializer.Deserialize results with a clear protocol error instead of an NRE. - MorpherBenchmark: clamp across-word degree-of-parallelism to >= 1 so it doesn't throw on single-core (ProcessorCount-1 == 0) or when HC_ACROSS_DOP=0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Remove copy-on-write FeatureStruct; keep deep-clone Revert FeatureStruct.Clone and Shape.CopyTo to the upstream deep-clone behavior. The copy-on-write FeatureStruct (clone-of-frozen shares backing, inflate on first write) measured ~-11% allocation but is being held back from this performance PR to keep it scoped to the single-threaded option + instrumentation + out-of-process Server-GC parser. COW can return as its own focused change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> HC: address Copilot review comments (round 2) - Honor Morpher.MaxDegreeOfParallelism cap in the two within-word parallel sites that previously ran at the default scheduler degree: ParallelCombinationRuleCascade (new MaxDegreeOfParallelism property, wired from AnalysisStratumRule) and AnalysisAffixTemplateRule.ParallelApplySlots. - Server host: catch JsonException on a malformed request line and reply with an empty response instead of terminating the worker. - Server client: kill+dispose the worker process if it fails to report READY (no leaked process on the startup-failure path). - Server client: validate the worker returns exactly one result per requested word; fail fast with a clear error instead of misaligning/indexing out. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: plan for data-oriented C# perf work on HermitCrab Capture Rust's memory-architecture wins (pooling, struct-of-arrays, Span, indices-not-pointers) in C# to attack the measured allocation/GC bottleneck, piece by piece with a measurement after each change. One engine, no native lib. 
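The degree-of-parallelism clamp from the first review round matters because `ParallelOptions.MaxDegreeOfParallelism` rejects 0, which is what `ProcessorCount-1` yields on a single-core machine. A minimal sketch of across-word parallelism with the clamp applied; `AcrossWords`, `ClampDegree`, `ParseAll` and the `parseWord` delegate are hypothetical names, not the benchmark's code:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class AcrossWords
{
    // ParallelOptions throws ArgumentOutOfRangeException for 0, so clamp to >= 1.
    public static int ClampDegree(int requested) => Math.Max(1, requested);

    // Parses each word on its own; parseWord is assumed to be single-threaded
    // internally, so the only parallelism is across the batch.
    public static IDictionary<string, int> ParseAll(
        IEnumerable<string> words, Func<string, int> parseWord, int requestedDegree)
    {
        var results = new ConcurrentDictionary<string, int>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = ClampDegree(requestedDegree) };
        Parallel.ForEach(words, options, word => results[word] = parseWord(word));
        return results;
    }

    public static void Main()
    {
        // Stand-in "parser" that returns the word length.
        IDictionary<string, int> r = ParseAll(new[] { "a", "bb", "ccc" }, w => w.Length, Environment.ProcessorCount - 1);
        Console.WriteLine(r["ccc"]); // 3
    }
}
```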
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY Phase 2: copy-on-write FeatureStruct (-20% bytes/word on en-hc) Re-apply the COW FeatureStruct (reverts 892816f2): Clone() of a frozen feature struct borrows the immutable backing and inflates (deep-copies) only on first mutation. Inflate only reads the shared frozen backing, so it is thread-safe; guarded by AnalyzeWord_ConcurrentRepeatedParsing_IsDeterministic. Measured (en-hc toy grammar, 439 forms): managed allocated 106.5 -> 84.8 KB/word (-20%), Gen0 3 -> 2, single-thread 91 -> 79 ms. 63 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: record Sena-too-slow measurement finding + harness strategy Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: fast single-pass Sena allocation probe (SenaQuick) Budget-bounded, Console-flushed, single-pass probe usable on the real Sena grammar (2789 words/20s) where the multi-pass MorpherBenchmark is too slow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: COW confirmed on real Sena grammar (-14% bytes/word, +9.5% throughput) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY Phase 1/4: pool the per-clone Shape.CopyTo mapping dictionary Reuse a [ThreadStatic] src->dest node map across CopyTo calls instead of allocating one per Word.Clone. The map is fully consumed before CopyTo returns and CopyTo is not reentrant, so per-thread reuse is safe. Measured (Sena, SenaQuick): 11,997 -> 11,943 KB/word, Gen0 2621 -> 2561. Small (the per-clone ShapeNode/Annotation objects, not the map, are the bulk); kept as a safe step. 63 HC tests green. 
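The per-thread reuse of the src->dest map can be sketched as follows. This is an illustration only: `LinkNode` and `ClonePool` are hypothetical types, and the real `Shape.CopyTo` remaps annotations rather than a single `Link` reference.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node that may point at another node in the same list.
public sealed class LinkNode
{
    public LinkNode(string label) { Label = label; }
    public string Label { get; }
    public LinkNode Link { get; set; }
}

public static class ClonePool
{
    // One map per thread, reused across calls instead of allocated per clone.
    [ThreadStatic]
    private static Dictionary<LinkNode, LinkNode> t_mapping;

    public static List<LinkNode> CloneAll(IReadOnlyList<LinkNode> source)
    {
        Dictionary<LinkNode, LinkNode> map = t_mapping ?? (t_mapping = new Dictionary<LinkNode, LinkNode>());
        map.Clear();

        // Pass 1: copy every node and record source -> copy.
        var dest = new List<LinkNode>(source.Count);
        foreach (LinkNode node in source)
        {
            var copy = new LinkNode(node.Label);
            map[node] = copy;
            dest.Add(copy);
        }

        // Pass 2: rewrite links so they point at copies, not at source nodes.
        foreach (LinkNode node in source)
        {
            if (node.Link != null)
                map[node].Link = map[node.Link];
        }

        // The map is fully consumed here; clearing it means only an empty
        // buffer outlives the call, so no clone data is retained.
        map.Clear();
        return dest;
    }
}
```

The reuse is safe only because the method never calls back into itself on the same thread while the map is in use, which matches the "not reentrant" condition stated in the commit.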
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: finish the plan — Phase 6 decision gate + status Record the mapping-pool result, mark phase statuses (Phase 2 done; Phase 1 partial/scoped; Phase 5 deferred to FW integration), and write the Phase 6 decision: continue capturing Rust's memory architecture in C# (COW shipped at -14% Sena / -20% en-hc; next chunk = per-thread pooling of Word/ShapeNode/FST buffers, now measurable via SenaQuick) rather than adopt Rust's runtime. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: measure 16-thread throughput (SenaParallel) — the parallel answer Add SenaParallel: one shared serial-within-word morpher, same word set at dop=1/4/8/16, wall-clock words/sec + scaling. Measured (800 Sena words, 20-core box): Workstation GC: 3.4x @4, 3.55x @8 (peak), 3.14x @16 -> REGRESSES (GC ceiling, gen0 ~580 regardless of threads). Allocation is the parallel ceiling. Server GC: 5.7x @4, 8.1x @8, 10.3x @16 (gen0 ~88) -> ~11x vs 1-thread WS. Confirms: the out-of-process Server-GC worker (PR #438) already delivers the 16-thread win; RUSTIFY pooling is what lifts the in-process Workstation curve. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: pinpoint allocation split — 20% Word.Clone, 80% FST traversal Add an opt-in per-thread AllocationProbe hook (set from the net10 test via GC.GetAllocatedBytesForCurrentThread) to attribute Word.Clone's allocation. Measured (Sena, SenaQuick): of ~11.8 MB/word, ~20% is Word.Clone (Shape deep copy) and ~80% is the FST traversal/cascade (a fresh TraversalMethod + List + Queue + register snapshots + FstResults per rule application). Redirects the plan: the FST traversal (esp. reusing the instance cache across Transduce calls) is the high-ROI lever, Word/Shape pooling the secondary one. Probe is zero-overhead when disabled and behavior-identical when no probe is set. 63 HC tests green. 
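An opt-in allocation probe of the kind described can be sketched as a nullable delegate that a test sets to `GC.GetAllocatedBytesForCurrentThread`. The class shape below (`AllocationProbe.CurrentThreadBytes`, `Measure`) is an assumption for illustration, not the hook's real signature:

```csharp
using System;

public static class AllocationProbe
{
    // Null means the probe is disabled; a test opts in by assigning a reader.
    public static Func<long> CurrentThreadBytes;

    // Returns the bytes allocated by the current thread while running the section,
    // or 0 when the probe is disabled (the section still runs unchanged).
    public static long Measure(Action section)
    {
        Func<long> read = CurrentThreadBytes;
        if (read == null)
        {
            section();
            return 0;
        }
        long before = read();
        section();
        return read() - before;
    }

    public static void Main()
    {
        // Opt in using the runtime API (available from .NET Core 3.0).
        CurrentThreadBytes = GC.GetAllocatedBytesForCurrentThread;
        long bytes = Measure(() => GC.KeepAlive(new byte[4096]));
        Console.WriteLine($"section allocated {bytes} bytes on this thread");
    }
}
```

Injecting the reader as a delegate keeps the library free of a direct dependency on a newer runtime API, which is one plausible reason to set it from the test project.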
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY Phase 1 (FST): pool traversal method per-thread under Server GC Reuse one traversal method per thread per Fst (via a new Reset()) so its instance free-list survives across the thousands of Transduce calls per parse, instead of allocating + discarding a fresh traversal method + instance pool on every rule application (measured: ~80% of parse allocation is the FST traversal). Gated to Server GC (cached GCSettings.IsServerGC), because pooling trades transient garbage for a larger LIVE working set: under Workstation GC that triggers stop-the-world Gen2 pauses that serialize threads and REGRESS parallel scaling (16T 3.1x -> 1.5x). Under Server GC it is a clear win. Measured (Sena, 800 words, SenaParallel): Server GC 16T: 10.3x -> 11.2x; allocation -16% (7.0->5.9 GB); Gen0 88 -> 42. Workstation 16T: 3.16x unchanged (per-call path retained). 803 SIL.Machine + 63 HermitCrab tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: conclusion — GC no longer dominates at 16 threads (Server GC, 11.2x) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: back out FST traversal pooling (restore pre-pooling FST) Removing the per-thread traversal-method pool: it only paid off under Server GC and complicates the engine. Reverting to the original allocate-per-call FST before restructuring to bit-packed feature vectors (the better lever). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY Phase 3: bit-packed feature-vector unify fast path Add a flat ulong-per-feature vector to FeatureStruct and a bitwise IsUnifiable fast path in Input.Matches for the common phonological case (no defaults, no negation, fully-symbolic arc input). Gated so the arc INPUT must be fully bit-packable while the SEGMENT may carry ignorable non-symbolic features (FLEx stamps a StringFeatureValue on every segment); FlatIndex is globally unique across feature systems and assigned lazily. 
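A bitwise unifiability check over a flat ulong-per-feature vector can be sketched like this. The encoding is an assumption made for illustration (one slot per feature, one bit per symbol, at most 64 symbols per feature, all bits set for an unspecified feature); the commit does not spell out the actual layout:

```csharp
using System;

public static class FlatUnify
{
    // Assumption: an unspecified feature imposes no constraint,
    // so it is encoded with every bit set.
    public const ulong Unspecified = ulong.MaxValue;

    // Two vectors unify when every shared feature slot has at least one symbol in common.
    public static bool IsUnifiable(ulong[] arcInput, ulong[] segment)
    {
        int shared = Math.Min(arcInput.Length, segment.Length);
        for (int i = 0; i < shared; i++)
        {
            if ((arcInput[i] & segment[i]) == 0)
                return false;
        }
        // Slots beyond the shorter vector are treated as unspecified on that side.
        return true;
    }

    public static void Main()
    {
        const ulong Plus = 1UL << 0, Minus = 1UL << 1;
        ulong[] voicedArc = { Plus, Unspecified }; // [voice:+], nasal unconstrained
        ulong[] segB = { Plus, Minus };            // [voice:+, nasal:-]
        ulong[] segP = { Minus, Minus };           // [voice:-, nasal:-]
        Console.WriteLine(IsUnifiable(voicedArc, segB)); // True
        Console.WriteLine(IsUnifiable(voicedArc, segP)); // False
    }
}
```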
FeatureStruct.FlatUnifyEnabled toggles it for A/B. Correct: 63 HermitCrab + 806 SIL.Machine tests pass; parity assertion found zero divergence on en/Sena/Indonesian. Measured (single-thread, SenaQuick): Indonesian: 12,463 -> 11,268 KB/word (-9.7%), Gen0 44->40, 100% fast coverage. Sena: 9,053 -> 9,018 KB/word (neutral), 22% coverage (Bantu agreement uses variable arcs that fall back) -- no regression. Next lever for variable-heavy grammars: bit-pack variable bindings too. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: remove the out-of-process Server-GC worker/client architecture Delete SIL.Machine.Morphology.HermitCrab.Server (worker host, HermitCrabServerClient, protocol, Program) + its tests, and the .sln / test-project references. It was a workaround for the in-process Workstation-GC parallel ceiling (separate Server-GC process: ~100 MB worker, .NET 10 runtime dependency, a richer protocol + FieldWorks adapter still to build). The RUSTIFY direction supersedes it: drive allocation low enough in-process (COW + bit-packed unify + arena work) that plain .NET needs no Server GC. Server GC stays available as a runtimeconfig flag if ever wanted. 63 HermitCrab tests pass; solution builds without the project. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: per-word FST-traversal arena (off by default) + key parallel finding Add a per-thread arena that reuses traversal methods + instance free-lists across a word (FstThreadPool, reset per word from Morpher.ParseWord) via a Reset() on the traversal methods. Gated by Fst.TraversalPoolEnabled, DEFAULT OFF. Measured (Sena, A/B same load): single-thread allocation -13%, BUT 16-thread scaling collapses 2.87x -> 1.29x. Confirmed across 4 pooling variants. Cause: under Workstation GC, pooled objects live across the word -> survive Gen0 -> promote -> stop-the-world Gen2 serializes the threads. Short-lived (Gen0-only) allocation is actually BETTER for parallel. 
So object-pooling is the wrong tool for the no-Server-GC-at-16-threads goal; the right arena is struct/Span/stackalloc (no GC retention). Kept off-by-default as a single-thread/Server-GC opt-in. 63 HermitCrab + 806 SIL.Machine tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: struct/Span FST traversal is blocked on the data model (record path) Verified blocker: the FST offset type for HermitCrab is ShapeNode (a class), so Register<ShapeNode> and the traversal instances are managed -> cannot stackalloc or hold in a stack Span, and pooling them recreates the Phase 1b Gen2 regression (Advance is also an iterator, forbidding stackalloc). The struct/Span no-GC traversal therefore requires the foundational change: represent the shape as a flat array with int-index offsets so Register<int> is unmanaged -> value-type register/instance buffers, zero GC-heap allocation in the traversal, Gen0 pressure drops, parallel scales without Server GC. Large, foundational rewrite. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY Phase 3c: FstStatistics per-category allocation breakdown harness Adds FstStatistics (SIL.Machine) to decompose the "80% FST scaffolding" into four named buckets — VarBindings.Clone, Registers.Clone, per-Transduce Scaffold, and TraversalMethod creation — so the flat-buffer investment can be gated on real numbers from Sena (not theory). Key findings from en-hc + WEB-PT run (439 words, 82.9 KB/word):
  Word.Clone        21%
  Pure scaffold     1%    (Register[], HashSet, List per Transduce)
  VarBindings       1%    (negligible on English; will be larger on Sena)
  Registers         0.1%
  TraversalMethod   0.7%
  Other (cascade)   55%   (MarkMorph/Annotation/stratum-rule overhead, NOT FST)
Flat-buffer addresses ~22% on the toy grammar; Sena breakdown needed to decide whether to pursue the full int-offset Shape rewrite. See RUSTIFY.md § Phase 3c.
RustifyBenchmark now falls back to en-hc + WEB-PT when HC_GRAMMAR/HC_WORDS are not set, so the breakdown harness is immediately runnable without a FLEx grammar. 63 HC tests green. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> HC: expand cascade breakdown harness (Segment, Word.ctor, MarkMorph, analysis window) Adds four new allocation probes to fully decompose the 55% 'Other' bucket:
- MorpherStatistics.SegmentBytes: wraps Segment() (initial Shape/ShapeNode creation)
- MorpherStatistics.WordCtorBytes: wraps new Word(stratum, shape) construction
- MorpherStatistics.MarkMorphBytes: wraps Word.MarkMorph() annotation allocation
- MorpherStatistics.AnalysisCascadeBytes: wraps _analysisRule.Apply().ToList() (superset)
English toy grammar result (439 words, 35 MB total):
  Segment (initial Shape)   7.2%     Scaffold (pure FST)       ≈ 0%
  Word.ctor(new)            9.6%     Rule-chain machinery      ~40.7%
  Word.Clone                21.3%    Synthesis + other         ~18.8%
  Scaffold (incl. clones)   21.9%    → analysis window superset 64.4%
Key finding: MarkMorph ≈ 0%; pure FST scaffold ≈ 0%; dominant costs are Word.Clone (21%), Word.ctor+Segment (17%), and rule-chain LINQ/FstResult (~41%). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> RUSTIFY: move Word.ctor allocation probe into the Word constructor The Word.ctor probe lived in Morpher.AnalyzeWord, so it only measured the single initial construction per word, not the cascade-created Words. Move it into Word(Stratum, Shape) itself (gated on MorpherStatistics.Enabled, off in production) and add WordCtorCount so the breakdown reports calls as well as bytes. Harness-only; no production-path behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> RUSTIFY Phase 4a: hot-loop allocation eliminations (safe, no retention) Four pure-elimination changes in the FST traversal + analysis cascade.
Each removes an allocation outright without extending any object lifetime, so none can trigger the Phase-1b parallel regression (pooling promotes to Gen2 -> serializes threads). All validated: 803 SIL.Machine + 63 HermitCrab tests green; SenaParallel scaling unchanged. 1. ITraversalMethod.Traverse returns List<FstResult> (was IEnumerable) -> drop the redundant .ToList() in Fst.Transduce. All four concrete Traverse impls already return the curResults List; the interface was needlessly widened. 2. Remove redundant .Distinct(FreezableEqualityComparer<Word>.Default) x2 in AnalysisStratumRule.ApplyMorphologicalRules/ApplyTemplates. Both _mrulesRule and _templatesRule are built with that same comparer and return a HashSet already deduped by it, so the Distinct pass is a no-op DistinctIterator. 3. Skip the DistinctIterator for trivial result sets in Fst.Transduce: (allMatches && resultList.Count > 1) ? resultList.Distinct() : resultList. resultList is non-null Count>=1 there; Count==1 Distinct is identity. 4. TraversalMethodBase.Reset: replace the per-Transduce GetNodesDepthFirst yield iterator (heap state machine per top annotation, thousands/word) with the allocation-free PreorderTraverse(action) form; delegate cached as a field (allocated once in ctor, not per call). Measured (en-hc, SenaQuick, 439 words): Other 38.5% -> 36.2% (14,145KB -> 12,884KB), KB/word 83.6 -> 81.1. Toy-grammar deltas are small; real magnitude needs the Sena grammar. See RUSTIFY.md Phase 4a/4b (4b documents the rejected scaffold-buffer ThreadStatic pooling: re-entrant via acceptInfo.Acceptable, lifetime extension against the thesis, unmeasurable). 
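Item 4 above swaps a `yield`-based depth-first iterator for a callback-based preorder walk with a delegate cached in a field. The difference can be sketched on a toy tree; `TreeNode` and `Resetter` are hypothetical types, and only the method name `PreorderTraverse` comes from the commit message:

```csharp
using System;
using System.Collections.Generic;

public sealed class TreeNode
{
    public TreeNode(int value, params TreeNode[] children) { Value = value; Children = children; }
    public int Value { get; }
    public TreeNode[] Children { get; }

    // Iterator form: each call allocates a compiler-generated state machine
    // on the heap, once per visited node because of the recursion.
    public IEnumerable<TreeNode> DepthFirst()
    {
        yield return this;
        foreach (TreeNode child in Children)
            foreach (TreeNode node in child.DepthFirst())
                yield return node;
    }

    // Callback form: plain recursion over arrays, no enumerator objects.
    public void PreorderTraverse(Action<TreeNode> action)
    {
        action(this);
        foreach (TreeNode child in Children)
            child.PreorderTraverse(action);
    }
}

public sealed class Resetter
{
    private readonly List<int> _visited = new List<int>();
    private readonly Action<TreeNode> _visit; // allocated once, not per Reset call

    public Resetter()
    {
        _visit = node => _visited.Add(node.Value);
    }

    public IReadOnlyList<int> Reset(TreeNode root)
    {
        _visited.Clear();
        root.PreorderTraverse(_visit);
        return _visited;
    }
}
```

Caching the delegate matters because a lambda that captures `this` would otherwise be allocated on every call that creates it.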
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> RUSTIFY Phase 4c: single-hash traversed.Add in nondet FST traversal In NondeterministicFstTraversalMethod.Traverse (both epsilon and input-match branches), the dedup check hashed the expensive structural key (state + annotation index + register array + outputs array) twice: once for Contains(key), then again for Add(key). HashSet.Add already returns false when the element is present, so collapse to `if (traversed.Add(key)) Push(newInst);` — a single hash/lookup in the innermost traversal loop. Byte-identical. CPU-only cleanup (no allocation change), so KB/word is flat on the toy grammar; the structural-hash cost it removes is not resolvable there. Gated: 803 SIL.Machine + 63 HermitCrab tests green; SenaQuick no regression. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY Phase 4c: drop per-call List<int> in TraversalMethodBase.Advance Advance collected the same-offset annotation window into `var anns = new List<int>()` and then iterated it. That window is a contiguous index range [nextIndex, annsEnd) (the build loop adds every consecutive i whose start offset matches), so track the end bound and iterate the range directly, eliminating one List allocation per arc match (one of the hottest paths in the traversal). cloneOutputs/first flow is unchanged. Measured (en-hc toy, SenaQuick, 439 words): KB/word 80.3 -> 79.0, totalMB 34 -> 33. Gated: 803 SIL.Machine + 63 HermitCrab tests green; toy under-measures so treat the delta as directional, no-regression is the bar. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY Phase 4c: collapse identity-map LINQ in TraversalInstance.CopyTo Both Deterministic and Nondeterministic CopyTo built an `outputMappings` dictionary by zipping this.Output's node sequence with ITSELF — a deterministic Queue-based BFS enumeration paired element-for-element, i.e. the identity map — then projected _mappings through it.
Since outputMappings[v]==v, the entire block reduces to copying _mappings unchanged. Replace with `other.Mappings.AddRange(_mappings)`, removing a Dictionary + two SelectMany(GetNodesBreadthFirst, each allocating a Queue + yield iterator) + Zip + Select per instance copy. CopyTo runs on every branch of nondeterministic traversal, so this is allocation-heavy at scale (Sena ~276 clones/word) though the toy grammar (2 clones/word, few branches) can't resolve it. Byte-identical (provable identity-map reduction); other.Mappings is empty pre-AddRange (GetCachedInstance -> Clear). Removed now-unused usings (System.Linq, SIL.Machine.DataStructures) to satisfy IDE0005-as-error. Gated: 803 SIL.Machine + 63 HermitCrab tests green; SenaQuick no regression. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY Phase 4c: paired-walk clone mapping in InitializeStack (Det + Nondet) Both InitializeStack methods built inst.Mappings (source annotation -> clone) by zipping two BFS node sequences: Data.Annotations.SelectMany(GetNodesBreadthFirst) .Zip(inst.Output.Annotations.SelectMany(GetNodesBreadthFirst), KVP) allocating, per Transduce, a Queue per top annotation (BFS), two SelectMany state machines, a Zip state machine, and one KeyValuePair per node. Data and inst.Output are isomorphic (inst.Output = Data.Clone()), and the resulting dictionary is independent of traversal order, so replace with a new allocation-light helper DataStructuresExtensions.PairedPreorderTraverse that walks the two forests in lockstep (preorder) and writes pairs straight into the dict via a static (closure-free) callback. Debug.Asserts guard the isomorphism invariant (root/leaf/child-count) so any future violation fails loudly instead of silently truncating like Zip. Runs once per Transduce (thousands/word). Toy grammar has tiny annotation trees so the delta is below its resolution; the win compounds on Sena's long words.
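The lockstep walk can be sketched as follows. This is a simplified illustration on a single pair of trees rather than two annotation forests; `Ann`, `PairedWalk` and `BuildMapping` are hypothetical, and the signature shown is not claimed to match `DataStructuresExtensions.PairedPreorderTraverse`:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical annotation-like tree node.
public sealed class Ann
{
    public Ann(string name, params Ann[] children) { Name = name; Children = children; }
    public string Name { get; }
    public Ann[] Children { get; }

    public Ann DeepClone() => new Ann(Name, Array.ConvertAll(Children, child => child.DeepClone()));
}

public static class PairedWalk
{
    // Visits corresponding nodes of two isomorphic trees in preorder.
    // State is passed explicitly so the callback needs no closure.
    public static void PairedPreorderTraverse<TState>(Ann a, Ann b, TState state, Action<TState, Ann, Ann> visit)
    {
        // Fails loudly in debug builds if the shapes diverge, instead of
        // silently truncating the way Zip would.
        Debug.Assert(a.Children.Length == b.Children.Length, "trees must be isomorphic");
        visit(state, a, b);
        for (int i = 0; i < a.Children.Length; i++)
            PairedPreorderTraverse(a.Children[i], b.Children[i], state, visit);
    }

    // Builds source -> clone pairs directly into the dictionary.
    public static Dictionary<Ann, Ann> BuildMapping(Ann source, Ann clone)
    {
        var map = new Dictionary<Ann, Ann>();
        // The lambda captures nothing, so the compiler can cache one delegate instance.
        PairedPreorderTraverse(source, clone, map, (m, s, c) => m[s] = c);
        return map;
    }
}
```

Because the result is a dictionary, the visiting order does not affect the outcome; preorder is simply the cheapest order to walk without a queue.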
Removed now-unused usings (System.Linq, SIL.Extensions) in the deterministic method to satisfy IDE0005-as-error. Gated: 803 SIL.Machine + 63 HermitCrab tests green (incl. the concurrent-determinism test); SenaQuick no regression. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY Phase 4c: precompute initializer partition at Fst.Freeze Fst.Transduce rebuilt a List<TagMapCommand> on every call (and every outer annIndex iteration), filtering _initializers into Dest!=0 (the per-call cmds list) vs Dest==0 (which drive a per-annotation SetOffset). The partition is identical every call for a frozen FST, and cmds is read-only downstream (Initialize -> ExecuteCommands only iterate it). Partition _initializers once in Freeze() into _zeroDestInitializers / _nonZeroDestInitializers (built into locals, gating field published last so a reader never sees a half-filled list). Transduce reuses the shared read-only _nonZeroDestInitializers as cmds and walks _zeroDestInitializers for the SetOffsets, eliminating the per-call list allocation + filter loop. When the FST isn't frozen the fields are null and Transduce falls back to the exact inline build, so unfrozen callers are unaffected. The frozen FST is shared read-only across parsing threads, so concurrent reads of the shared cmds list are safe. Measured (en-hc toy, SenaQuick, 439 words): Scaffold 22.4% -> 21.0% (7897KB -> 7261KB), KB/word 79.0 -> 78.8. SenaParallel: scaling and allocation unchanged (no parallel regression from sharing the list). Gated: 803 SIL.Machine + 63 HermitCrab tests green, incl. the concurrent-determinism test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: document Phase 4c (five safe no-retention FST eliminations) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY cleanup: strip unrelated USFM work + remove dead GC/pooling machinery Audit of master..hc-rustify identified work to drop now that allocation is driven down and the branch is being squashed: 1.
Strip the unrelated USFM/versification change set (3 commits: #430, #432, normalization port) — restored src/SIL.Machine/Corpora, src/SIL.Machine/PunctuationAnalysis and their tests to master, removed the USFM-added files. None of it is perf work; it belongs in its own PR. 2. Remove the dead FST traversal pooling (measured to REGRESS parallel parsing, Phase 1b): the FstThreadPool class, Fst.TraversalPoolEnabled, the pooling branch in Fst.Transduce, FstThreadPool.Reset() in Morpher.ParseWord, and the HC_ARENA toggle in RustifyBenchmark. Transduce now always uses a fresh (die-in-Gen0) traversal method, which is the right tradeoff once allocation is low. 3. Remove the Shape.CopyTo [ThreadStatic] clone-map pool, keeping the value-added inline mapping build (no second GetNodes().Zip().ToDictionary() pass) — just a plain per-call Dictionary. 4. Revert Machine.sln to master (leftover x64/x86 platform configs) and remove three superseded planning docs (HERMITCRAB_ALLOCATION_STRATEGIES / COW_PLANS / PERF_PLAN), now consolidated in RUSTIFY.md. Kept (per request): the allocation instrumentation (MorpherStatistics, FstStatistics, probes) + both benchmarks for before/after measurement; the MaxDegreeOfParallelism API + Synthesize refactor; COW FeatureStruct; bit-packed unify; Phase 4a/4c eliminations. Gated: 801 SIL.Machine + 63 HermitCrab tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: document Phase 4d cleanup (pooling/arena removed) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY cleanup: restore the safe Shape.CopyTo [ThreadStatic] clone-map pool Re-examination showed this pool is NOT the regressive kind removed elsewhere. The Phase-1b parallel regression came from objects retained ACROSS a word (promoted to Gen2). 
Shape.CopyTo's CloneMapping is cleared and fully consumed WITHIN each call (contents die immediately; only a small empty buffer persists), so it cannot promote parse data to Gen2 — and it still buys a small allocation win (~0.45% on Sena; on the toy grammar removing it had pushed Word.Clone 22.4% -> 23.6% and KB/word 78.8 -> 80.2). Restored, with RUSTIFY.md Phase 4d note corrected to record it as KEPT (safe pool) rather than removed. Gated: 801 SIL.Machine + 63 HermitCrab tests green; SenaQuick KB/word back to 78.8, Word.Clone 22.4%. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: staged implementation plan for the flat int-index shape Records the chosen direction (go flat) + corrected feasibility findings (no TOffset constraints; int-offset engine already tested; ShapeNode contained to ~95 in-repo refs), the accepted cost (ShapeNode -> handle, value identity), and the 3-stage plan: (1) array-backed Shape + ShapeNode handle + array-copy Clone, (2) int FST offset + unmanaged Span/stackalloc traversal (the parallel unlock), (3) migrate rule sites to indices. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY: real Sena measurement obtained — reshapes flat-shape priorities Generated sena-hc.xml from the Sena 3 FieldWorks backup (GenerateHCConfig.exe) and extracted sena-words.txt (7,121 words) from the project's seh running text. SenaQuick (400 words, MaxUnapp=5) now gives the clone-heavy numbers the spike needs: clones/word=345 (estimate ~276 confirmed), KB/word=14,116 Scaffold 42.2% (per-Transduce Register[,] arrays) <- biggest bucket Word.Clone 21.9%, Other 24.8% Key finding: Scaffold (managed Register<ShapeNode>[,] per Transduce) is ~2x Word.Clone, so the flat int-index foundation's biggest payoff is Stage 2 (Register<int> -> stackalloc/Span, zero heap), bigger than the Word.Clone bucket the goal named, and unlocked by the same change. Confirms flat over COW (COW cannot touch the traversal scaffold). 
Benchmark assets untracked in samples/data. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> RUSTIFY Stage 1: array-backed Shape + ShapeNode handle (flat-shape foundation) Re-represents the shape as a flat, int-indexed backing — the data-model foundation the whole flat-shape plan (Phase 3b-impl) stands on, and the prerequisite for Stage 2's Fst<Word,int> register-scaffold win (the measured 42% bucket) and Stage 3's Word.Clone cut (22%). - Shape no longer inherits OrderedBidirList<ShapeNode>; it owns its nodes in flat arrays (_next/_prev int links = in-array doubly-linked list, per-node frozen flag, canonical handle) addressed by a stable ShapeNode.Index, and reimplements IOrderedBidirList/IOrderedBidirListNode over them. - ShapeNode becomes a handle (Owner + Index); links/frozen delegate to the owner arrays. The added node IS retained as the canonical one-per-slot handle, so reference identity (==, dict keys, Range<ShapeNode> endpoints) is unchanged. - Tag deliberately stays on the node so it survives a node moving between shapes (AddAfter sets the new tag before detaching from the old owner). The tag-relabel order maintenance, Freeze/Clone/CopyTo and annotation interactions are preserved. Gate: 803 SIL.Machine + 63 HermitCrab tests green (incl. concurrent-determinism). Measured neutral on en-hc toy (SenaQuick): KB/word 80.5 -> 80.5, clones/word 2, gen0 2 — exactly the plan's "Stage 1 ~= 0" prediction; payoff is unlocked by, not realized in, this increment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> RUSTIFY Stage 2 substrate: freeze-time NodeAt int<->node bridge + design blueprint Adds Shape.NodeAt(int) backed by a dense _byPos[] table built in Freeze (content nodes already get dense Tag 0..N-1 there), the int-offset -> ShapeNode bridge the Fst<Word,int> binding will resolve against. Additive and behavior-preserving; 803 SIL.Machine tests green. 
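The Stage 1 idea of an in-array doubly-linked list addressed by stable int indices, with Clone reduced to array copies, can be sketched in isolation. `FlatList` is a hypothetical toy holding strings; it omits tags, freezing, handles and annotations, and it never reuses removed slots:

```csharp
using System;
using System.Collections.Generic;

public sealed class FlatList
{
    private const int None = -1;
    private int[] _next = new int[4];
    private int[] _prev = new int[4];
    private string[] _value = new string[4];
    private int _count;   // number of slots handed out so far
    private int _head = None;
    private int _tail = None;

    // Appends a value and returns its stable slot index.
    public int AddLast(string value)
    {
        if (_count == _value.Length) Grow();
        int i = _count++;
        _value[i] = value;
        _next[i] = None;
        _prev[i] = _tail;
        if (_tail != None) _next[_tail] = i; else _head = i;
        _tail = i;
        return i;
    }

    // Inserts a value after a live slot and returns the new slot index.
    public int AddAfter(int after, string value)
    {
        if (after == _tail) return AddLast(value);
        if (_count == _value.Length) Grow();
        int i = _count++;
        int n = _next[after];
        _value[i] = value;
        _prev[i] = after;
        _next[i] = n;
        _next[after] = i;
        _prev[n] = i;
        return i;
    }

    // Unlinks a slot; other indices stay valid because nothing moves.
    public void Remove(int i)
    {
        int p = _prev[i], n = _next[i];
        if (p != None) _next[p] = n; else _head = n;
        if (n != None) _prev[n] = p; else _tail = p;
        _next[i] = None;
        _prev[i] = None;
    }

    public IEnumerable<string> InOrder()
    {
        for (int i = _head; i != None; i = _next[i])
            yield return _value[i];
    }

    // Clone is a handful of array copies; slot indices mean the same thing in the copy.
    public FlatList Clone()
    {
        var copy = new FlatList();
        copy._next = (int[])_next.Clone();
        copy._prev = (int[])_prev.Clone();
        copy._value = (string[])_value.Clone();
        copy._count = _count;
        copy._head = _head;
        copy._tail = _tail;
        return copy;
    }

    private void Grow()
    {
        int size = _value.Length * 2;
        Array.Resize(ref _next, size);
        Array.Resize(ref _prev, size);
        Array.Resize(ref _value, size);
    }
}
```

An index obtained from the source list addresses the corresponding element in a clone, so no source-to-clone node map is needed for anything stored as an index.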
Records the resolved Stage 2 blueprint in RUSTIFY.md: offset = dense frozen tag (HC always freezes before traversal), half-open [t,t+1) ranges reusing IntegerRangeFactory (provably identical ordering/Overlaps/Contains to the inclusive ShapeNode form for a one-unit-per-node model), Word/Shape become IAnnotatedData<int> via a freeze-time AnnotationList<int> projection, rules resolve int->node via NodeAt, and Register<int> goes unmanaged (the 42% Scaffold payoff).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 2 blueprint: correct it after reading the rule-application flow

Reading IterativePhonologicalPatternRule.Apply + the rewrite SubruleSpecs + the semantic-site catalog overturned two blueprint assumptions:

- Rewrite rules MUTATE match.Input.Shape in place while UNFROZEN and re-match repeatedly, so the traversed shape's tags are SPARSE, not dense 0..N-1.
  => offset must be the raw ordered Tag (the [Tag,Tag+1) half-open mapping is still provably correct for sparse tags: Tag+1 always lands at/<= the next tag).
- NodeAt must therefore work on unfrozen shapes
  => a Tag->node map maintained incrementally (AddAfter/Remove/Relabel), not a freeze-only dense array.

Records the real hazards to design for:
- End.Tag==int.MaxValue overflows [t,t+1) (anchors are in the annotation list and ordered on add);
- the int annotation projection must stay in sync with the live mutated ShapeNode list (not build-once at freeze);
- ~30 offset-navigation sites (match.Range.End.Next etc.) must route through shape.NodeAt(tag).Next?.Tag preserving null-at-boundary.

Net: the flip is larger/subtler than a mechanical generic swap -- a multi-session spike, each sub-piece behind the byte-identical gate.
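The half-open mapping claim above can be checked in isolation with a small sketch. `InclusiveTagRange`, `HalfOpenRange` and `RangeMapping` are hypothetical stand-ins written for this note, not SIL.Machine's `Range<T>` or `IntegerRangeFactory`. The sketch assumes endpoints are integer tags, and it only shows why [s, e] and [s, e+1) agree on Overlaps/Contains for sparse tags and why a tag of int.MaxValue is a hazard.

```csharp
using System;

// Hypothetical stand-in for an inclusive range over node tags.
public readonly struct InclusiveTagRange
{
    public InclusiveTagRange(int start, int end) { Start = start; End = end; }
    public int Start { get; }
    public int End { get; }
    public bool Overlaps(InclusiveTagRange o) => Start <= o.End && o.Start <= End;
    public bool Contains(InclusiveTagRange o) => Start <= o.Start && o.End <= End;
}

// Hypothetical stand-in for the half-open int range [Start, End).
public readonly struct HalfOpenRange
{
    public HalfOpenRange(int start, int end) { Start = start; End = end; }
    public int Start { get; }
    public int End { get; }
    public bool Overlaps(HalfOpenRange o) => Start < o.End && o.Start < End;
    public bool Contains(HalfOpenRange o) => Start <= o.Start && o.End <= End;
}

public static class RangeMapping
{
    // Maps inclusive [s, e] to half-open [s, e + 1).
    // checked() makes the int.MaxValue hazard throw instead of wrapping silently.
    public static HalfOpenRange ToHalfOpen(InclusiveTagRange r) =>
        new HalfOpenRange(r.Start, checked(r.End + 1));

    // Compares Overlaps/Contains pairwise for every range whose endpoints
    // are drawn from the given ascending (possibly sparse) tags.
    public static bool ParityHolds(int[] sortedTags)
    {
        int n = sortedTags.Length;
        for (int i = 0; i < n; i++)
        for (int j = i; j < n; j++)
        for (int k = 0; k < n; k++)
        for (int l = k; l < n; l++)
        {
            var a = new InclusiveTagRange(sortedTags[i], sortedTags[j]);
            var b = new InclusiveTagRange(sortedTags[k], sortedTags[l]);
            var ha = ToHalfOpen(a);
            var hb = ToHalfOpen(b);
            if (a.Overlaps(b) != ha.Overlaps(hb)) return false;
            if (a.Contains(b) != ha.Contains(hb)) return false;
        }
        return true;
    }
}
```

Calling `RangeMapping.ParityHolds(new[] { 0, 3, 10, 11, 500 })` exercises the sparse case. Passing a range that ends at int.MaxValue to `ToHalfOpen` throws, which is the End-anchor overflow listed among the hazards.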
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 2: empirically validate the int-offset range-mapping thesis

Adds a parity test proving the assumption the whole TOffset=ShapeNode -> int flip rests on: mapping each annotation [startNode,endNode] to the half-open int range [startNode.Tag, endNode.Tag+1) preserves the range relationships the FST traversal depends on -- CompareTo ordering, Overlaps, Contains -- for SPARSE tags (appended unfrozen shape, as rewrite rules see it) and dense tags (frozen). Pairwise over all annotations of a shape with a spanning (start!=end) annotation. Both cases green.

This de-risks the riskiest design point before any code is built on it: now 805 SIL.Machine (+2) + 63 HC tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 1: measure the flat Word on the REAL Sena grammar (the clone-heavy case)

The toy-grammar SenaQuick (2 clones/word) was too clone-light to judge an API-breaking rewrite. Ran SenaQuick against the real Sena grammar (sena-hc.xml, 400 words, HC_MAX_UNAPP=5) where the ~345 clones/word payoff lives, vs the pre-Stage-1 baseline recorded in 2fd1a2d3:

  clones/word  345   -> 345    (byte-identical AND behavior-identical at scale)
  KB/word      14116 -> 14583  (+3.3%)
  gen0         442   -> 457
  Scaffold     42.2% -> 42.7%
  Word.Clone   21.9% -> 22.3%  (split reproduced)

Findings:
(1) the flat data model produces an identical clone count on the pathological grammar, confirming correctness beyond the toy tests;
(2) Stage 1 in isolation costs +3.3% allocation -- the four per-Shape backing arrays (_nodes/_next/_prev/_frozen) x 138k clones -- exactly the plan's "Stage 1 ~= 0 or slightly negative" prediction, hidden by the toy grammar's 2 clones/word.

The cost is the investment: the 42.7% Scaffold (Stage 2 Register<int>) and 22.3% Word.Clone (Stage 3) are what the same flat foundation unlocks.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 2: int-offset annotation projection on Shape (the Fst<Word,int> bridge)

Adds the linchpin infrastructure for the FST flip, additively (Shape still IAnnotatedData<ShapeNode>; nothing flipped yet, full suite green):

- AnnotationList<T> gains an internal Version counter (bumped on Add/Remove/Clear) so derived views can detect staleness cheaply.
- Shape builds a lazy, version-gated int-offset projection: AnnotationList<int> IntAnnotations (each [s,e] -> half-open [s.Tag, e.Tag+1], End margin clamped to avoid +1 overflow), Range<int> IntRange, and Dictionary<int,ShapeNode> for NodeAt(offset) (now works frozen AND unfrozen, via Tag). FeatureStruct is shared by reference so in-place rule edits stay visible.

The projection is rebuilt only when the annotation Version or frozen-state changes, so a stable/frozen shape builds it once and reuses it across thousands of Transduce calls per word.

Tests: IntAnnotationProjection_MirrorsShapeNodeAnnotations verifies the projection mirrors the ShapeNode tree (ranges, FeatureStruct identity, optional, children), NodeAt round-trips every node by Tag, and the cache invalidates on mutation. 805 (+2) SIL.Machine green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

WIP RUSTIFY Stage 2: flip HC FST to Fst<Word,int> (57/63 HC green)

The full TOffset ShapeNode->int flip across HermitCrab (~71 files): Word is now IAnnotatedData<int>; Shape exposes a lazy, version-gated int-offset annotation projection with DENSE node positions (0..N+1) as offsets -- dense (not sparse Tag) to avoid the Range<int>.Null=-1 collision, +1 overflow at the End margin, and empty anchors. NodeAt/OffsetOf/MatchStartOffset bridge int<->node; rule RHS code resolves match/group int ranges back to nodes (half-open [off, off+1), so leftmost = NodeAt(Start), rightmost = NodeAt(End-1)).
MatchStartOffset(node,dir) handles the inclusive->half-open asymmetry for right-to-left match-start offsets.

Down from 23 failures to 6 (all now logic, not crashes): metathesis SimpleRule/ComplexRule, DeletionRules/MultipleDeletionRules, EpenthesisRules, ReduplicationRules -- node insertion/deletion/movement + group-capture rules still need correctness work. Register<int> stackalloc (the payoff) comes after these are byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RUSTIFY Stage 2: fix analysis under-generation (63/63 HC green)

Two byte-identical fixes to the int-offset projection, restoring the 4 analysis-direction rewrite tests (Epenthesis/Deletion/MultipleDeletion/Reduplication) that the Fst<Word,int> flip broke:

1. Annotation.Optional must invalidate the projection. The Shape int projection copies Optional by value and caches against the annotation list Version, but the Optional setter is a non-structural change that never bumped Version. So once analysis flipped Optional=true on existing nodes, the matcher kept reading the stale Optional=false projection and never forked the optional-skip instances. The setter now bumps the root list's version (new AnnotationList.IncrementVersion). Fixes Epenthesis.

2. IntRange must be the half-open image of the inclusive [Begin, End], i.e. [off(Begin), off(End)+1) -- not [off(Begin), off(End)]. The only consumer is Matcher.GetStartAnnotation via Range.GetStart(dir); a RtL match starts at GetStart(RtL)==End. The End anchor's dense range is [off(End), off(End)+1), whose RtL start coordinate is off(End)+1, so without the +1 a RtL match began at the last content node and skipped any edit adjacent to End (e.g. inserting a deleted segment after the final vowel). Fixes the deletion/reduplication cases.

Adds two regression tests guarding both invariants.
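The Optional-invalidation fix is an instance of a general rule for version-gated caches: any field the cached view copies by value must bump the version when it changes, structural or not. Here is a minimal sketch of that pattern, assuming a much simpler model than the real one. `Item`, `VersionedList` and `RebuildCount` are hypothetical names invented here; only `IncrementVersion` and the Version counter echo names from the commit message, and their real signatures are not shown in this log.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical annotation-like item owned by a versioned list.
public sealed class Item
{
    private readonly VersionedList _owner;
    private bool _optional;

    public Item(VersionedList owner, int start, int end)
    {
        _owner = owner;
        Start = start;
        End = end;
    }

    public int Start { get; }
    public int End { get; }

    public bool Optional
    {
        get => _optional;
        set
        {
            if (_optional == value) return;
            _optional = value;
            // Non-structural change, but the projection copied this flag
            // by value, so the cache must be told it is stale.
            _owner.IncrementVersion();
        }
    }
}

public sealed class VersionedList
{
    private readonly List<Item> _items = new List<Item>();
    private (int Start, int End, bool Optional)[] _projection;
    private int _projectionVersion = -1;

    public int Version { get; private set; }
    public int RebuildCount { get; private set; } // demo-only counter

    public Item Add(int start, int end)
    {
        var item = new Item(this, start, end);
        _items.Add(item);
        Version++;
        return item;
    }

    internal void IncrementVersion() => Version++;

    // Half-open by-value snapshot, rebuilt only when Version has moved.
    public IReadOnlyList<(int Start, int End, bool Optional)> Projection
    {
        get
        {
            if (_projectionVersion != Version)
            {
                _projection = _items
                    .ConvertAll(i => (Start: i.Start, End: i.End + 1, Optional: i.Optional))
                    .ToArray();
                _projectionVersion = Version;
                RebuildCount++;
            }
            return _projection;
        }
    }
}
```

Without the `IncrementVersion` call in the setter, a second read of `Projection` after `Optional = true` would return the stale snapshot, which is the failure mode described in fix 1.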
Also keeps the prior working-tree Stage-2 fixes in IterativePhonologicalPatternRule and SynthesisMetathesisRuleSpec (resolve int offsets to ShapeNode refs before mutating the shape).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 2: record generic-flip-green milestone + post-flip measurement

The <Word,ShapeNode> -> <Word,int> generic flip is byte-identical green (63/63 HC + 808 SIL.Machine, full Release solution builds clean). Document the two int-model correctness bugs found bringing it from 57/63 to green (Optional cache invalidation + IntRange half-open End-anchor mapping), why 59 tests masked them, and the post-flip en-hc baseline (KB/word 78.8 -> 86.1, the projection "investment" before the Register<int> payoff). Records the remaining Stage 2 payoff target.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 2: refute the register-payoff hypothesis with real-grammar data

Wire Indonesian (the classic HC nasalization demo, variable-light, ~150 clones/word) alongside Sena (~345 clones/word) as the two measurement grammars, and use them to investigate the Stage-2 thesis that Register<int> being unmanaged unlocks a stackalloc cut of the 42% Scaffold bucket. Measurement refutes it:

- Registers.Clone (the escaping accept snapshots the redesign targets) = 0.2% on Sena. Not where the bytes are.
- Converting the per-push dedup-key Tuple<State,int,Register[,][,Output[]]> to an inline `readonly struct TraversalKey` in both nondeterministic traversal methods moved allocation ~0% (Sena KB/word 14588->14579; Indonesian flat). Kept anyway: zero-risk, byte-identical, removes a real per-push heap object, CPU-positive (single Add vs Contains+Add), consistent with Phase-4c micro-eliminations.
- The Scaffold 38.5% IS the clone explosion: it contains Word.Clone (22.4%, via the per-instance Output=Data.Clone() in InitializeStack) + the per-instance Mappings dictionary + Output graph.
The int flip's allocation payoff is therefore Stage 3 (flat-shape clone), not a register trick. Full suite green (808 SIL.Machine + 63 HC, incl. concurrent-determinism). RUSTIFY.md records the finding + how to regenerate both grammars.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3: localize the clone cost -- inherent per-node materialization

Two-phase allocation probe on Shape.CopyTo (Sena): Word.Clone's 22% splits into CopyTo node-phase 11.4% (node.Clone + per-node dest.Add) + annotation-phase 4.1% + ~6.9% Word/Shape ctor. The node-phase prize is inherent per-node object materialization (ShapeNode + Annotation + COW FS + AnnotationList skip-list entry per node), not intermediate churn.

Two incremental attacks measured ~0/negative and were reverted:
- pre-size the backing arrays vs AddAfter doubling: 666->688 MB (worse; source Count over-sizes partial-range CopyTo, doubling was never the cost).
- the per-push dedup Tuple->struct (prior commit): ~0.

Conclusion: the flat-clone payoff requires the deep redesign (lazy ShapeNode handles + bulk AnnotationList clone + index-addressed annotations), the Stage-1-deferred "Clone = Array.Copy" end-state -- a multi-session foundational rewrite needing a go/no-go. No incremental win exists short of it. Recorded in RUSTIFY.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3: design + sequencing doc for the flat-shape clone spike

Per the plan-then-proceed go/no-go: RUSTIFY-stage3-design.md lays out the foundational Word.Clone rewrite before any code goes red.

- Goal: kill the inherent per-node materialization (node-phase 11.4% + anns-phase 4.1% of Word.Clone) by making Shape.Clone an Array.Copy.
- Entanglement: ShapeNode reference-identity + annotations-hold-handles + skip-list-tower-per-annotation must be undone together.
- Key resolution: materialize-on-touch two-state shape.
  A clone is a flat snapshot (no handles/Annotation objects); the int projection (Stage 2) reads it for the hot frozen-traverse path so nothing materializes; any ShapeNode/Annotation request or in-place mutation materializes lazily, one-per-slot, restoring exact reference identity. This resolves the dense-index-vs-mutation tension: frozen-read pays nothing, unfrozen-mutate pays the old price (far colder).
- Byte-identical risk register, I->V sub-increment order (I FeatureStruct flat array + II flat AnnotationList = gateable green and keepable alone; III lazy materialization + Array.Copy clone = the red phase; IV triage the 189 HC ShapeNode refs frozen-read vs mutate; V re-validate + measure), and rollback to dbef327a.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3: advisor review + measure II first -- towers are 7.4% (resequence)

Advisor review of the design front-loaded a read-only verification gate before the risky III: the linchpin (int projection must rebuild from flat records, handle-free), the make-or-break premise (UpdateOutput touches O(few) per FST-transduction clone), result-consumer audit, and the I detached-FS caveat. Folded into the design doc.

Then executed the advisor's "measure II before III" with a temporary tower-allocation probe on Sena: annotation skip-list towers = 7.4% of total alloc (~432 MB, 6.31M arrays) -- a THIRD of Word.Clone (22.4%), two-thirds of node-phase(11.4%)+anns-phase(4.1%).

Resequences the spike: increment II (flatten the BidirList tower arrays into list-owned flat backing) is now the headline -- ~7.4%, byte-identical, gateable GREEN, zero laziness risk, independently keepable. Increment III's lazy-handle materialization is downgraded to optional/gated: it buys only the residual ~8% (the ShapeNode/Annotation objects) and carries the reference-identity risk. The towers were the cheap two-thirds hiding behind the "inherent objects" framing.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3 II-a: grow skip-list margins on demand (Word.Clone -123MB Sena)

First positive allocation increment of the flat-clone spike, byte-identical green.

Every BidirList ctor Init'd both Begin/End margin nodes at the 33-level skip-list maximum (new TNode[33] x2) regardless of actual list height. Since lists almost always stay shallow, that eager margin tower was a large slice of the per-AnnotationList tower allocation that dominates Word.Clone. Now:
- margins start at level 0 (Init(1) + link level 0);
- GrowMargins ensures capacity + links Begin<->End at a new level only when a node first reaches it (EnsureLevelCapacity right-sizes; geometric growth was measured slightly worse - it over-allocates the shallow majority);
- Clear resets to level 0, higher levels relink lazily on regrowth.

Measured (SenaQuick, Release): Sena Word.Clone 1,306,476 -> 1,182,940 KB (-123 MB, -9.5% of Word.Clone, stable across runs; total KB/word -~2% under GC noise); Indonesian Word.Clone -~0.5pt similarly. Full SIL.Machine (808) + HC (63, incl. concurrent-determinism) green.

Contained to BidirList/BidirListNode (used by AnnotationList x2, SkipList, TreeBidirList); does not touch ShapeNode/Annotation reference identity, so it is independently keepable regardless of the later II-b / III increments.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3 II-b: inline skip-list level 0 (Word.Clone -54MB more, Sena)

Second positive flat-clone increment, byte-identical green. Level 0 (the only level ~50% of skip-list nodes have) moves from the per-node _next[0]/_prev[0] arrays into inline _next0/_prev0 fields, so level-0 nodes allocate NO tower array at all and every taller node's array is one slot shorter (levels 1.. in _nextHigh/_prevHigh, null when Levels<=1).
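The inline level-0 layout can be pictured with a small sketch. `TowerNode` and `HasTowerArray` are hypothetical names for this note, and only the forward links are shown; the field names `_next0`/`_nextHigh` are borrowed from the commit message, but the real BidirListNode code is not reproduced in this log and may differ.

```csharp
using System;

// Hypothetical skip-list node: level 0 is an inline field, levels 1.. live
// in an array that is only allocated for nodes taller than one level.
public sealed class TowerNode
{
    private TowerNode _next0;      // level 0, no array needed
    private TowerNode[] _nextHigh; // levels 1..Levels-1, null when Levels <= 1

    public TowerNode(int levels)
    {
        if (levels < 1) throw new ArgumentOutOfRangeException(nameof(levels));
        Levels = levels;
        if (levels > 1) _nextHigh = new TowerNode[levels - 1];
    }

    public int Levels { get; private set; }
    public bool HasTowerArray => _nextHigh != null;

    public TowerNode GetNext(int level)
    {
        if ((uint)level >= (uint)Levels) throw new ArgumentOutOfRangeException(nameof(level));
        return level == 0 ? _next0 : _nextHigh[level - 1];
    }

    public void SetNext(int level, TowerNode node)
    {
        if ((uint)level >= (uint)Levels) throw new ArgumentOutOfRangeException(nameof(level));
        if (level == 0) _next0 = node;
        else _nextHigh[level - 1] = node;
    }

    // Grow on demand, right-sized to exactly the requested height
    // (no geometric over-allocation). Existing links are preserved.
    public void EnsureLevelCapacity(int levels)
    {
        if (levels <= Levels) return;
        Array.Resize(ref _nextHigh, levels - 1); // allocates if currently null
        Levels = levels;
    }
}
```

A one-level node built this way owns no array at all, and a margin node can start at one level and call `EnsureLevelCapacity` only when a taller node first appears, which is the shape of both II-a and II-b.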
Touches the hottest skip-list accessors (GetNext/SetNext/GetPrev/SetPrev/Next/Prev/Init/Clear/EnsureLevelCapacity); gated on the full SIL.Machine (808) + HC (63, incl. concurrent-determinism) suites - green, so the level<->field-or-array dispatch is byte-identical.

Measured (SenaQuick, Release): Sena Word.Clone 1,182,940 -> 1,128,660 KB (-54 MB on top of II-a); Indonesian 222,491 -> 212,911 KB (-9.6 MB). Cumulative II-a + II-b vs pre-Stage-3: Sena Word.Clone -177 MB (-13.6%), total allocation -4.2% (KB/word 14,556 -> 13,942); Indonesian total -4.1%. Pure allocation reduction, no retention, independently keepable regardless of III.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: index the Stage 3 II-a/II-b green increments (-4.2% allocation)

Record in the main plan that two byte-identical flat-clone increments landed (margin grow-on-demand + inline level 0), banking the cheap skip-list tower wins: Sena Word.Clone -177 MB (-13.6%), total -4.2%; Indonesian -4.1%. Points to RUSTIFY-stage3-design.md for the residual III go/no-go.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3: III feasibility measured (41% Sena clones never mutated) + choose copy-on-write Shape mechanism

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3 III: copy-on-write Shape -- Word.Clone -59.6% (Sena), byte-identical

The flat-clone payoff. A clone of a *frozen* shape now stores _cowSource and copies nothing. The asymmetry that makes this cheap + safe: the FST matcher (the hot read path) consumes a clone only through the int-offset projection (IntAnnotations/IntRange), which is served from the frozen source; while every path that could mutate first hands out a ShapeNode/Annotation handle.
So:
- serve IntAnnotations/IntRange/Count/GetFrozenHashCode/Freeze from the source while copy-on-write;
- gate EnsureInflated() (= the real CopyTo, then re-freeze if frozen-by-sharing) on the flat-backing link accessors, First/Last/enumeration, NodeAt/OffsetOf/MatchStartOffset/Annotations/GetNodes/CopyTo/ValueEquals, and every mutator.

A clone that is only traversed (matcher carrier) never inflates -> costs a shell instead of N nodes + N annotations + their skip-list towers.

Thread-safety (the doc's non-negotiable): a frozen shape's int projection is now built eagerly at Freeze() (single-threaded), so the new pattern of several parse threads' COW clones delegating to one shared frozen grammar shape always hits a complete cache rather than racing a lazy first build.

Measured (SenaQuick, Release): Sena Word.Clone 1,128,660 -> 528,071 KB (-53% on top of II; 20.2% -> 9.9% of total); Indonesian 212,911 -> 85,566 KB (-60%). Cumulative Stage 3 (II-a+II-b+III) vs pre-Stage-3: Sena Word.Clone -778 MB (-59.6%), share 22.4% -> 9.9%; Indonesian -62%. Word.Clone is no longer a top bucket.

Full SIL.Machine (808) + HC (63, incl. concurrent-determinism) green; full Release solution builds clean; SenaParallel scaling unregressed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY Stage 3 III: verify byte-identical on real grammars + COW invariant tests

Validation the toy suite can't give (en-hc is ~2 clones/word; COW's never-inflated path runs ~170x hotter on Sena at 345 clones/word):

- Added RustifyBenchmark.Signature ([Explicit], not CI): emits a deterministic per-word analysis signature (sorted set of Category|root|glosses per WordAnalysis) to HC_SIG_OUT. Diffed HEAD vs the pre-Stage-3 baseline (dbef327a, isolating II+III) on BOTH grammars via a worktree: Sena (400 words) and Indonesian (121 words, 100 non-empty) signatures are IDENTICAL. The COW change is byte-identical where it actually runs hot, not just on the toy grammar.
- Added 3 CI-running COW-invariant regression tests (AnnotationTests): never-inflated clone serves the source's projection/range/count; mutating a clone inflates it and leaves the frozen source uncorrupted; frozen-by-sharing hash equals the source and stays stable across forced inflation.

Full SIL.Machine (811) + HC (63) green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY lever 2: lazily allocate Word's morphological-rule bookkeeping maps

_mrulesUnapplied / _mrulesApplied / _disjunctiveAllomorphIndices stay empty through the phonological-analysis cascade (where ~345 clones/word happen) but were cloned eagerly per candidate. Now null = empty, created on first write, copied only when the source is non-empty. Byte-identical (63 HC green).

Measured (SenaQuick): Word.Clone 527,987 -> 499,121 KB (-29 MB), Word.ctor 184,858 -> 177,387 KB; total 5,267 -> 5,216 MB (~-1%).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY lever 1: hoist the initial-register scaffold out of the Transduce loop

Fst.Transduce allocated a fresh Register<TOffset>[regCount,2] per outer (start-position) iteration. Traverse only Array.Copy's it into the initial instances and never retains it, so it can be allocated once and Array.Clear'd per start position - byte-identical, and AllMatches (analysis) runs one iteration per start, so this removes (starts-1) register-array allocations per matcher call.

Measured (SenaQuick): Scaffold 2,264,486 -> 2,241,441 KB (-23 MB). Full suite (811 SIL.Machine + 63 HC) green.
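The hoisting argument (allocate once, Array.Clear per start, safe because the consumer copies and never retains) can be shown with a toy comparison. Everything here is a hypothetical stand-in: `int[,]` replaces `Register<TOffset>[regCount,2]`, `Seed` replaces Traverse, and the write into the scaffold inside `Seed` is an assumption added only so that the reset is observable. `regCount` must be at least 1.

```csharp
using System;
using System.Collections.Generic;

public static class ScaffoldHoisting
{
    // Baseline: a fresh scaffold array for every start position.
    public static List<int[,]> PerStart(int starts, int regCount, out int scaffoldAllocations)
    {
        scaffoldAllocations = 0;
        var instances = new List<int[,]>();
        for (int start = 0; start < starts; start++)
        {
            var scaffold = new int[regCount, 2];
            scaffoldAllocations++;
            instances.Add(Seed(scaffold, start));
        }
        return instances;
    }

    // Hoisted: one scaffold array, reset before each start position.
    public static List<int[,]> Hoisted(int starts, int regCount, out int scaffoldAllocations)
    {
        var scaffold = new int[regCount, 2];
        scaffoldAllocations = 1;
        var instances = new List<int[,]>();
        for (int start = 0; start < starts; start++)
        {
            // Array.Clear treats the 2-D array as one flat run of elements.
            Array.Clear(scaffold, 0, scaffold.Length);
            instances.Add(Seed(scaffold, start));
        }
        return instances;
    }

    // Stand-in consumer: writes into the scaffold, then copies it into an
    // array the instance owns. The scaffold reference is never retained.
    private static int[,] Seed(int[,] scaffold, int start)
    {
        scaffold[start % scaffold.GetLength(0), 0] = start + 1;
        var owned = new int[scaffold.GetLength(0), 2];
        Array.Copy(scaffold, owned, scaffold.Length);
        return owned;
    }
}
```

Both methods return element-wise equal instances; the hoisted one performs a single scaffold allocation instead of one per start. Dropping the `Array.Clear` call would let earlier writes leak into later instances, which is why the reset is part of the change.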
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: record levers 1+2 (lean Word + hoisted register scaffold, ~-1%, byte-identical) and why the 42% Scaffold prize stays blocked

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY lever 1: replace per-instance Visited HashSet with an inline value bitset

Profiling showed the 42% Scaffold is instance churn: ~2,927 traversal instances created per Sena word (only ~20% reused -- the pool is per-Transduce, thrown away each call, and pooling across calls re-triggers the Phase-1b Gen2 parallel regression). So the fix is leaner instances, not pooling.

Each nondeterministic instance carried a HashSet<State> to avoid epsilon loops. States have a dense Index, so this is now a value-type VisitedStates bitset: states 0-63 in an inline ulong field (zero heap -- HC rule FSTs are tiny), a lazy ulong[] overflow only for 64+ state FSTs. The set is now part of the instance object, not a separate ~1.17M/word heap allocation. Byte-identical (same dedup semantics over state identity == Index).

Measured (SenaQuick): Scaffold 2,269,759 -> 2,169,001 KB (-100 MB), total 5,242 -> 5,145 MB (~-2%). Full suite (811 SIL.Machine + 63 HC) green.

The remaining per-instance allocation (the Register[,] array, ~1.17M/word) is the bigger prize but is blocked here: the `traversed` dedup key holds each instance's register array BY REFERENCE, so a shared register arena (slices reused across instances) would corrupt dedup. Cutting it needs the deep de-iterator + snapshot-dedup rewrite, not a drop-in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY lever 1 (deep): de-iterator Advance/Initialize into a reusable buffer

The core of the scaffold rewrite. Advance was a yield-based iterator and Initialize allocated a fresh List per call (both recursive), so each of the ~2,482 Transduce/word -> millions of Advance calls minted an iterator state machine / List. Both now fill ONE reusable per-method result buffer instead.
Safety: the buffer is a per-method (per-Transduce) field, so it carries no cross-word retention (the Phase-1b Gen2 parallel regression) and cannot be a thread-static (CheckAccepting's Acceptable predicate can re-enter Transduce on the same thread). Initialize fills it once at the start of Traverse and the caller fully consumes it building the work stack before the main loop's first Advance reuses it, so the two never overlap (one buffer serves both). Advance is not re-entrant within a method. Byte-identical: same results, same order.

Measured (SenaQuick): total 5,145 -> 5,029 MB (-116 MB, ~-2.3%); the per-call iterator state machines (Scaffold -147 MB) replaced by one buffer List/method (+~39 MB in TraversalMethod after merging the two buffers into one). Full suite (811 SIL.Machine + 63 HC, incl. concurrent-determinism) green.

NOTE on the register stackalloc premise: it does NOT apply to the nondeterministic matcher (the hot path). The `traversed` dedup retains a per-config register snapshot during each Transduce, so the registers are not transient stack values - they're the evolving, snapshotted match state. The achievable scaffold wins are therefore the iterator garbage (this commit) + the Visited HashSet (prior), not stackalloc'd registers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RUSTIFY: record lever-1 deep rewrite (Visited bitset + de-iterator, ~-6% Sena, byte-identical) + the register-stackalloc constraint

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
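For readers who want to picture the Visited bitset recorded above, here is a minimal sketch. It assumes states carry a dense, non-negative integer index. The type name `VisitedStates` is taken from the commit message, but the members below (`Add`, `Contains`, `UsesHeap`) are guesses at a plausible shape, not the actual SIL.Machine implementation.

```csharp
using System;

// Hypothetical value-type visited set keyed by a dense state index.
public struct VisitedStates
{
    private ulong _low;        // indices 0..63, held inline (no heap)
    private ulong[] _overflow; // indices 64+, allocated only when needed

    public bool UsesHeap => _overflow != null;

    // Marks the index as visited; returns true if it was not visited before.
    public bool Add(int index)
    {
        if (index < 0) throw new ArgumentOutOfRangeException(nameof(index));
        if (index < 64)
        {
            ulong bit = 1UL << index;
            if ((_low & bit) != 0) return false;
            _low |= bit;
            return true;
        }
        int word = (index - 64) >> 6;
        if (_overflow == null || word >= _overflow.Length)
            Array.Resize(ref _overflow, word + 1); // keeps existing bits
        ulong mask = 1UL << (index & 63);
        if ((_overflow[word] & mask) != 0) return false;
        _overflow[word] |= mask;
        return true;
    }

    public bool Contains(int index)
    {
        if (index < 0) return false;
        if (index < 64) return (_low & (1UL << index)) != 0;
        int word = (index - 64) >> 6;
        return _overflow != null
            && word < _overflow.Length
            && (_overflow[word] & (1UL << (index & 63))) != 0;
    }
}
```

One design caveat of this sketch: copying the struct copies `_low` by value but shares the `_overflow` array, so any code that forks an instance for a 64+ state FST would need to clone the overflow array explicitly. For FSTs under 64 states a plain struct copy is a complete, heap-free snapshot.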

Commits on Jun 30, 2026
