
Comparing changes

base repository: sillsdev/machine
base: master
head repository: sillsdev/machine
compare: hc-rustify
  • 1 commit
  • 129 files changed
  • 2 contributors

Commits on Jun 30, 2026

  1. HC: single-threaded runtime option + allocation instrumentation/profiling
    
    Add Morpher.MaxDegreeOfParallelism (1 = fully single-threaded), replacing the
    dead compile-time SINGLE_THREADED flag with a runtime knob across all three
    within-word parallel sites (synthesis, Unordered analysis cascade, affix-template
    unapplication). This lets a caller (FieldWorks "Parse All Words") parallelize
    across words without nested oversubscription.
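
A minimal sketch of such a runtime knob, assuming a hypothetical `ApplyAll` helper (the name and signature are illustrative, not the actual Morpher code). Degree 1 takes a plain loop and never touches `Parallel`; any other value is passed to `ParallelOptions.MaxDegreeOfParallelism`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class DegreeSketch
{
    // Hypothetical helper: applies `rule` to every input, honoring the cap.
    public static List<TOut> ApplyAll<TIn, TOut>(
        IReadOnlyList<TIn> inputs, Func<TIn, TOut> rule, int maxDegreeOfParallelism)
    {
        if (maxDegreeOfParallelism == 1)
        {
            // Fully single-threaded path: no Parallel.* call, no thread-pool work.
            var sequential = new List<TOut>(inputs.Count);
            foreach (TIn input in inputs)
                sequential.Add(rule(input));
            return sequential;
        }

        // -1 means "no limit"; any positive value caps the concurrent workers.
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism };
        var results = new ConcurrentBag<TOut>();
        Parallel.ForEach(inputs, options, input => results.Add(rule(input)));
        return results.ToList(); // order is not preserved on the parallel path
    }

    public static void Main()
    {
        var words = new[] { "a", "bb", "ccc" };
        List<int> lengths = ApplyAll<string, int>(words, w => w.Length, 1);
        Console.WriteLine(string.Join(",", lengths));
    }
}
```

A caller that already parallelizes across words would pass 1 here, so that each word stays on the thread that picked it up.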
    
    Add MorpherStatistics (opt-in, zero overhead when disabled): Word.Clone count,
    analysis/synthesis phase timing, parallel-section counter (proves the sequential
    path runs under degree-1), and a corpus benchmark (Explicit) that reports
    GC.GetTotalAllocatedBytes + Gen0/1/2 against a real FLEx-exported grammar.
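
The measurement idea can be sketched as below. `AllocationReport.Measure` is a hypothetical helper, not the benchmark in this branch; it relies only on `GC.GetTotalAllocatedBytes` and `GC.CollectionCount`, and the workload is a stand-in for parsing a word list.

```csharp
using System;

public static class AllocationReport
{
    // Runs `work` once and reports what the GC counters say it cost.
    // The byte counter is process-wide, so other threads' allocations are included.
    public static (long Bytes, int Gen0, int Gen1, int Gen2) Measure(Action work)
    {
        int gen0 = GC.CollectionCount(0);
        int gen1 = GC.CollectionCount(1);
        int gen2 = GC.CollectionCount(2);
        long before = GC.GetTotalAllocatedBytes(precise: true);

        work();

        long bytes = GC.GetTotalAllocatedBytes(precise: true) - before;
        return (bytes,
            GC.CollectionCount(0) - gen0,
            GC.CollectionCount(1) - gen1,
            GC.CollectionCount(2) - gen2);
    }

    public static void Main()
    {
        var report = Measure(() =>
        {
            // Stand-in workload; a real run would analyze each word here.
            for (int i = 0; i < 10_000; i++)
                GC.KeepAlive(new int[64]);
        });
        Console.WriteLine(
            $"{report.Bytes / 1024} KB, Gen0={report.Gen0}, Gen1={report.Gen1}, Gen2={report.Gen2}");
    }
}
```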
    
    Profiling the real Sena grammar showed ~8,793 Word.Clone calls and ~371 MB
    allocated per word (the combinatorial unapplication search). First allocation win:
    Shape.CopyTo builds the src->dest node map inline instead of
    .Zip().ToDictionary() + double re-enumeration (-2.3% alloc/word, fewer Gen0).
    
    Tests: 62 HermitCrab + 790 SIL.Machine pass.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    HC: memoize multiApp cascade re-expansion; measure GC under parallel load
    
    CombinationRuleCascade: in multiApp mode a word's expansion depends only on the
    word, so memoize already-expanded words and skip re-descending them (collapses the
    combinatorial re-exploration to a DAG; output set unchanged). Output-identical:
    62 HC + 790 core tests pass. Measured ~0% on short Sena words (their clones come
    from the phonological/synthesis layers, not morphological re-expansion) but it
    bounds pathological re-expansion blow-up at no correctness cost.
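
A sketch of the memoization, under the assumption that expanding a word is a pure function of the word. `ExpandAll` and the string-keyed words are illustrative only; the real cascade works on Word objects with a freezable comparer. The seen-set is seeded with the input so a cycle back to it is not expanded again.

```csharp
using System;
using System.Collections.Generic;

public static class CascadeSketch
{
    // `expand` stands in for "apply every rule once to this word".
    public static HashSet<string> ExpandAll(string input, Func<string, IEnumerable<string>> expand)
    {
        var seen = new HashSet<string> { input }; // seeded: A -> B -> A stops at A
        var output = new HashSet<string>();
        var pending = new Stack<string>();
        pending.Push(input);

        while (pending.Count > 0)
        {
            string word = pending.Pop();
            foreach (string next in expand(word))
            {
                output.Add(next);
                if (seen.Add(next)) // first sighting: descend; otherwise skip
                    pending.Push(next);
            }
        }
        return output;
    }

    public static void Main()
    {
        var rules = new Dictionary<string, string[]>
        {
            ["A"] = new[] { "B" },
            ["B"] = new[] { "A", "C" },
            ["C"] = new string[0],
        };
        int calls = 0;
        HashSet<string> result = ExpandAll("A", w => { calls++; return rules[w]; });
        Console.WriteLine($"{result.Count} outputs from {calls} expansions");
    }
}
```

Each distinct word is expanded at most once, so the work is bounded by the number of distinct words rather than by the number of paths that reach them.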
    
    Benchmark: measure GC (allocated bytes + Gen0/Gen2) under the parallel-ACROSS-words
    load and report Server vs Workstation GC — this is where alloc/GC contention
    actually bites, unlike a single-threaded run.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    HC perf plan: record Sena optimization results + Server-GC dominance finding
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    HC: COW design study — 3 scoped plans to cut the FeatureStruct clone firehose
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    HC: add copy-on-write safety-net tests before the COW refactor
    
    - FeatureStruct: clone-of-frozen + mutate-clone leaves source unchanged, for every
      mutator incl. nested-child recursion (PriorityUnion/Union/Subtract/AddValue/RemoveValue/
      Clear), plus clone-is-mutable, never-mutated-clone equality, re-entrancy sharing, and
      ReplaceVariables isolation. Asserts the SOURCE is unchanged (not just "no throw").
    - Shape: clone + mutate a cloned node's FeatureStruct leaves the source shape unchanged.
    - Morpher: concurrent repeated parsing is deterministic (guards COW under parallel load).
    
    All pin CURRENT behavior (801 core + 63 HC pass) so the COW refactor can't silently regress.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    Plan A: copy-on-write FeatureStruct.Clone for frozen structs
    
    Clone() of a FROZEN feature struct now returns a shell that borrows the source's
    immutable backing dictionary; the first mutation (EnsureWritable, replacing
    CheckFrozen) inflates a private deep copy via the existing CloneImpl, so neither the
    mutation nor any recursion into children can touch shared frozen data. Clone() of an
    unfrozen FS still deep-copies. Single-file change; no public API change.
    
    Most cloned feature structs are never mutated, so they stay O(1) shells. Measured on
    the real Sena grammar: -11% managed allocation/word and ~-29% wall on the 16-way
    parallel pass (less GC contention). 801 core + 63 HC tests pass, including the new
    COW safety-net tests.
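
The borrow-then-inflate idea, reduced to a flat dictionary. `CowBag` is a hypothetical stand-in for FeatureStruct: the real type has nested children and inflates through its existing deep-clone path, which this sketch does not model.

```csharp
using System;
using System.Collections.Generic;

public sealed class CowBag
{
    private Dictionary<string, int> _values;
    private bool _borrowed; // true while _values still belongs to a frozen source

    public bool IsFrozen { get; private set; }

    public CowBag() : this(new Dictionary<string, int>(), false) { }

    private CowBag(Dictionary<string, int> values, bool borrowed)
    {
        _values = values;
        _borrowed = borrowed;
    }

    public void Freeze() => IsFrozen = true;

    // Frozen source: O(1) shell that shares the backing. Unfrozen source: eager copy.
    public CowBag Clone() =>
        IsFrozen
            ? new CowBag(_values, borrowed: true)
            : new CowBag(new Dictionary<string, int>(_values), borrowed: false);

    public int Get(string key) => _values[key];

    public void Set(string key, int value)
    {
        EnsureWritable();
        _values[key] = value;
    }

    public bool SharesBackingWith(CowBag other) => ReferenceEquals(_values, other._values);

    private void EnsureWritable()
    {
        if (IsFrozen)
            throw new InvalidOperationException("The object is frozen.");
        if (_borrowed)
        {
            // First write: inflate a private copy. This only reads the shared backing.
            _values = new Dictionary<string, int>(_values);
            _borrowed = false;
        }
    }

    public static void Main()
    {
        var source = new CowBag();
        source.Set("voice", 1);
        source.Freeze();

        CowBag clone = source.Clone();
        Console.WriteLine(clone.SharesBackingWith(source)); // shell still borrows
        clone.Set("voice", 2);
        Console.WriteLine($"{source.Get("voice")} {clone.Get("voice")}");
    }
}
```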
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    HC COW doc: record Plan A result (-11% alloc) and Plan B subsumed/blocked finding
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    HC: out-of-process Server-GC parser (worker host + reusable client)
    
    New SIL.Machine.Morphology.HermitCrab.Server project lets a host application get
    Server-GC parsing throughput WITHOUT changing its own GC mode, by running the morpher
    in a child process:
    
    - HermitCrabServerHost: loads a compiled HC config, serves analyze requests over
      stdin/stdout (newline-delimited JSON), parses each word single-threaded with
      parallelism across the batch. Launched with DOTNET_gcServer=1.
    - HermitCrabServerClient: reusable IMorphologicalAnalyzer that launches/manages the
      worker, drives the batch protocol, and returns WordAnalysis. Morphemes cross the
      boundary as DTOs that implement IMorpheme, so the client needs no grammar load.
    - Shared protocol DTOs guarantee the two ends agree.
    
    Unlike XAmple (native, in-process, no managed GC), HC is managed, and GC mode is fixed
    at process startup — so a worker subprocess is the only way to scope Server GC to the
    parser. Grammar-config-driven, so any Machine HC consumer can use it; FieldWorks adds a
    thin IParser adapter mapping morph Properties -> LCM.
    
    End-to-end on the real Sena grammar: out-of-process results match in-process; worker
    runs Server GC while the host runs Workstation GC (verified). 63 HC tests pass.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    HC: apply CSharpier formatting + braces to satisfy CI (formatting + code-style)
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    HC: address Copilot review comments
    
    - CombinationRuleCascade: seed the memoization set with the initial input so a cycle
      back to it (A->B->A) doesn't re-expand it.
    - Morpher.ParseWord: drop the redundant origAnalyses copy (analyses is already
      materialized and Synthesize no longer drains it).
    - Server host/client: handle null JsonSerializer.Deserialize results with a clear
      protocol error instead of an NRE.
    - MorpherBenchmark: clamp across-word degree-of-parallelism to >= 1 so it doesn't
      throw on single-core (ProcessorCount-1 == 0) or when HC_ACROSS_DOP=0.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    Remove copy-on-write FeatureStruct; keep deep-clone
    
    Revert FeatureStruct.Clone and Shape.CopyTo to the upstream deep-clone behavior.
    The copy-on-write FeatureStruct (clone-of-frozen shares backing, inflate on first
    write) measured ~-11% allocation but is being held back from this performance PR
    to keep it scoped to the single-threaded option + instrumentation + out-of-process
    Server-GC parser. COW can return as its own focused change.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    HC: address Copilot review comments (round 2)
    
    - Honor Morpher.MaxDegreeOfParallelism cap in the two within-word parallel
      sites that previously ran at the default scheduler degree:
      ParallelCombinationRuleCascade (new MaxDegreeOfParallelism property, wired
      from AnalysisStratumRule) and AnalysisAffixTemplateRule.ParallelApplySlots.
    - Server host: catch JsonException on a malformed request line and reply with
      an empty response instead of terminating the worker.
    - Server client: kill+dispose the worker process if it fails to report READY
      (no leaked process on the startup-failure path).
    - Server client: validate the worker returns exactly one result per requested
      word; fail fast with a clear error instead of misaligning/indexing out.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: plan for data-oriented C# perf work on HermitCrab
    
    Capture Rust's memory-architecture wins (pooling, struct-of-arrays, Span,
    indices-not-pointers) in C# to attack the measured allocation/GC bottleneck,
    piece by piece with a measurement after each change. One engine, no native lib.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Phase 2: copy-on-write FeatureStruct (-20% bytes/word on en-hc)
    
    Re-apply the COW FeatureStruct (reverts 892816f2): Clone() of a frozen feature
    struct borrows the immutable backing and inflates (deep-copies) only on first
    mutation. Inflate only reads the shared frozen backing, so it is thread-safe;
    guarded by AnalyzeWord_ConcurrentRepeatedParsing_IsDeterministic.
    
    Measured (en-hc toy grammar, 439 forms): managed allocated 106.5 -> 84.8 KB/word
    (-20%), Gen0 3 -> 2, single-thread 91 -> 79 ms. 63 HC tests green.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: record Sena-too-slow measurement finding + harness strategy
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: fast single-pass Sena allocation probe (SenaQuick)
    
    Budget-bounded, Console-flushed, single-pass probe usable on the real Sena
    grammar (2789 words/20s) where the multi-pass MorpherBenchmark is too slow.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: COW confirmed on real Sena grammar (-14% bytes/word, +9.5% throughput)
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Phase 1/4: pool the per-clone Shape.CopyTo mapping dictionary
    
    Reuse a [ThreadStatic] src->dest node map across CopyTo calls instead of
    allocating one per Word.Clone. The map is fully consumed before CopyTo returns
    and CopyTo is not reentrant, so per-thread reuse is safe.
    
    Measured (Sena, SenaQuick): 11,997 -> 11,943 KB/word, Gen0 2621 -> 2561.
    Small (the per-clone ShapeNode/Annotation objects, not the map, are the bulk);
    kept as a safe step. 63 HC tests green.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: finish the plan — Phase 6 decision gate + status
    
    Record the mapping-pool result, mark phase statuses (Phase 2 done; Phase 1
    partial/scoped; Phase 5 deferred to FW integration), and write the Phase 6
    decision: continue capturing Rust's memory architecture in C# (COW shipped at
    -14% Sena / -20% en-hc; next chunk = per-thread pooling of Word/ShapeNode/FST
    buffers, now measurable via SenaQuick) rather than adopt Rust's runtime.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: measure 16-thread throughput (SenaParallel) — the parallel answer
    
    Add SenaParallel: one shared serial-within-word morpher, same word set at
    dop=1/4/8/16, wall-clock words/sec + scaling.
    
    Measured (800 Sena words, 20-core box):
      Workstation GC: 3.4x @4, 3.55x @8 (peak), 3.14x @16 -> REGRESSES (GC ceiling,
        gen0 ~580 regardless of threads). Allocation is the parallel ceiling.
      Server GC:      5.7x @4, 8.1x @8, 10.3x @16 (gen0 ~88) -> ~11x vs 1-thread WS.
    
    Confirms: the out-of-process Server-GC worker (PR #438) already delivers the
    16-thread win; RUSTIFY pooling is what lifts the in-process Workstation curve.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: pinpoint allocation split — 20% Word.Clone, 80% FST traversal
    
    Add an opt-in per-thread AllocationProbe hook (set from the net10 test via
    GC.GetAllocatedBytesForCurrentThread) to attribute Word.Clone's allocation.
    
    Measured (Sena, SenaQuick): of ~11.8 MB/word, ~20% is Word.Clone (Shape deep
    copy) and ~80% is the FST traversal/cascade (a fresh TraversalMethod + List +
    Queue + register snapshots + FstResults per rule application). Redirects the
    plan: the FST traversal (esp. reusing the instance cache across Transduce
    calls) is the high-ROI lever, Word/Shape pooling the secondary one.
    
    Probe is zero-overhead when disabled and behavior-identical when no probe is
    set. 63 HC tests green.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Phase 1 (FST): pool traversal method per-thread under Server GC
    
    Reuse one traversal method per thread per Fst (via a new Reset()) so its
    instance free-list survives across the thousands of Transduce calls per parse,
    instead of allocating + discarding a fresh traversal method + instance pool on
    every rule application (measured: ~80% of parse allocation is the FST traversal).
    
    Gated to Server GC (cached GCSettings.IsServerGC), because pooling trades
    transient garbage for a larger LIVE working set: under Workstation GC that
    triggers stop-the-world Gen2 pauses that serialize threads and REGRESS parallel
    scaling (16T 3.1x -> 1.5x). Under Server GC it is a clear win.
    
    Measured (Sena, 800 words, SenaParallel):
      Server GC 16T: 10.3x -> 11.2x; allocation -16% (7.0->5.9 GB); Gen0 88 -> 42.
      Workstation 16T: 3.16x unchanged (per-call path retained).
    803 SIL.Machine + 63 HermitCrab tests green.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: conclusion — GC no longer dominates at 16 threads (Server GC, 11.2x)
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: back out FST traversal pooling (restore pre-pooling FST)
    
    Removing the per-thread traversal-method pool: it only paid off under Server GC
    and complicates the engine. Reverting to the original allocate-per-call FST
    before restructuring to bit-packed feature vectors (the better lever).
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Phase 3: bit-packed feature-vector unify fast path
    
    Add a flat ulong-per-feature vector to FeatureStruct and a bitwise IsUnifiable
    fast path in Input.Matches for the common phonological case (no defaults, no
    negation, fully-symbolic arc input). Gated so the arc INPUT must be fully
    bit-packable while the SEGMENT may carry ignorable non-symbolic features (FLEx
    stamps a StringFeatureValue on every segment); FlatIndex is globally unique
    across feature systems and assigned lazily. FeatureStruct.FlatUnifyEnabled
    toggles it for A/B.
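
A sketch of the bitwise check, under assumptions that are not confirmed by the commit: one ulong slot per feature, bit k set when symbol k is permitted, and an unspecified feature encoded as all ones. With that layout two vectors unify when every slot has a non-empty intersection.

```csharp
using System;

public static class FlatUnify
{
    // Assumed encoding for "feature not mentioned": every symbol is allowed.
    public const ulong Unspecified = ulong.MaxValue;

    public static bool IsUnifiable(ReadOnlySpan<ulong> left, ReadOnlySpan<ulong> right)
    {
        if (left.Length != right.Length)
            throw new ArgumentException("Vectors must cover the same features.");

        for (int i = 0; i < left.Length; i++)
        {
            // No shared symbol for this feature means unification fails.
            if ((left[i] & right[i]) == 0)
                return false;
        }
        return true;
    }

    public static void Main()
    {
        // Slot 0 = voice (bit 0 '+', bit 1 '-'); slot 1 = place (bit 0 labial, bit 1 alveolar).
        ulong[] segment = { 0b01, 0b01 };              // [+voice, labial]
        ulong[] voicedArc = { 0b01, Unspecified };     // [+voice]
        ulong[] voicelessArc = { 0b10, Unspecified };  // [-voice]

        Console.WriteLine(IsUnifiable(segment, voicedArc));
        Console.WriteLine(IsUnifiable(segment, voicelessArc));
    }
}
```

Variables, defaults and negation do not fit this encoding, which is consistent with the commit falling back to the general path for those arcs.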
    
    Correct: 63 HermitCrab + 806 SIL.Machine tests pass; parity assertion found zero
    divergence on en/Sena/Indonesian.
    
    Measured (single-thread, SenaQuick):
      Indonesian: 12,463 -> 11,268 KB/word (-9.7%), Gen0 44->40, 100% fast coverage.
      Sena:        9,053 ->  9,018 KB/word (neutral), 22% coverage (Bantu agreement
                   uses variable arcs that fall back) -- no regression.
    
    Next lever for variable-heavy grammars: bit-pack variable bindings too.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: remove the out-of-process Server-GC worker/client architecture
    
    Delete SIL.Machine.Morphology.HermitCrab.Server (worker host, HermitCrabServerClient,
    protocol, Program) + its tests, and the .sln / test-project references. It was a
    workaround for the in-process Workstation-GC parallel ceiling (separate Server-GC
    process: ~100 MB worker, .NET 10 runtime dependency, a richer protocol + FieldWorks
    adapter still to build). The RUSTIFY direction supersedes it: drive allocation low
    enough in-process (COW + bit-packed unify + arena work) that plain .NET needs no
    Server GC. Server GC stays available as a runtimeconfig flag if ever wanted.
    
    63 HermitCrab tests pass; solution builds without the project.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: per-word FST-traversal arena (off by default) + key parallel finding
    
    Add a per-thread arena that reuses traversal methods + instance free-lists across
    a word (FstThreadPool, reset per word from Morpher.ParseWord) via a Reset() on the
    traversal methods. Gated by Fst.TraversalPoolEnabled, DEFAULT OFF.
    
    Measured (Sena, A/B same load): single-thread allocation -13%, BUT 16-thread
    scaling collapses 2.87x -> 1.29x. Confirmed across 4 pooling variants. Cause:
    under Workstation GC, pooled objects live across the word -> survive Gen0 ->
    promote -> stop-the-world Gen2 serializes the threads. Short-lived (Gen0-only)
    allocation is actually BETTER for parallel. So object-pooling is the wrong tool
    for the no-Server-GC-at-16-threads goal; the right arena is struct/Span/stackalloc
    (no GC retention). Kept off-by-default as a single-thread/Server-GC opt-in.
    
    63 HermitCrab + 806 SIL.Machine tests pass.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: struct/Span FST traversal is blocked on the data model (record path)
    
    Verified blocker: the FST offset type for HermitCrab is ShapeNode (a class), so
    Register<ShapeNode> and the traversal instances are managed -> cannot stackalloc
    or hold in a stack Span, and pooling them recreates the Phase 1b Gen2 regression
    (Advance is also an iterator, forbidding stackalloc). The struct/Span no-GC
    traversal therefore requires the foundational change: represent the shape as a
    flat array with int-index offsets so Register<int> is unmanaged -> value-type
    register/instance buffers, zero GC-heap allocation in the traversal, Gen0
    pressure drops, parallel scales without Server GC. Large, foundational rewrite.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Phase 3c: FstStatistics per-category allocation breakdown harness
    
    Adds FstStatistics (SIL.Machine) to decompose the "80% FST scaffolding" into
    four named buckets — VarBindings.Clone, Registers.Clone, per-Transduce Scaffold,
    and TraversalMethod creation — so the flat-buffer investment can be gated on real
    numbers from Sena (not theory).
    
    Key findings from en-hc + WEB-PT run (439 words, 82.9 KB/word):
      Word.Clone         21%
      Pure scaffold       1%  (Register[], HashSet, List per Transduce)
      VarBindings         1%  (negligible on English; will be larger on Sena)
      Registers           0.1%
      TraversalMethod     0.7%
      Other (cascade)    55%  (MarkMorph/Annotation/stratum-rule overhead, NOT FST)
    
    Flat-buffer addresses ~22% on the toy grammar; Sena breakdown needed to decide
    whether to pursue the full int-offset Shape rewrite. See RUSTIFY.md § Phase 3c.
    
    RustifyBenchmark now falls back to en-hc + WEB-PT when HC_GRAMMAR/HC_WORDS are
    not set, so the breakdown harness is immediately runnable without a FLEx grammar.
    63 HC tests green.
    
    Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    
    HC: expand cascade breakdown harness (Segment, Word.ctor, MarkMorph, analysis window)
    
    Adds four new allocation probes to fully decompose the 55% 'Other' bucket:
    - MorpherStatistics.SegmentBytes: wraps Segment() (initial Shape/ShapeNode creation)
    - MorpherStatistics.WordCtorBytes: wraps new Word(stratum, shape) construction
    - MorpherStatistics.MarkMorphBytes: wraps Word.MarkMorph() annotation allocation
    - MorpherStatistics.AnalysisCascadeBytes: wraps _analysisRule.Apply().ToList() (superset)
    
    English toy grammar result (439 words, 35 MB total):
      Segment (initial Shape)    7.2%   Scaffold (pure FST) ≈ 0%
      Word.ctor(new)             9.6%   Rule-chain machinery ~40.7%
      Word.Clone                21.3%   Synthesis + other  ~18.8%
      Scaffold (incl. clones)   21.9%
    → analysis window superset  64.4%
    
    Key finding: MarkMorph ≈ 0%; pure FST scaffold ≈ 0%; dominant costs are
    Word.Clone (21%), Word.ctor+Segment (17%), and rule-chain LINQ/FstResult (~41%).
    
    Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    
    RUSTIFY: move Word.ctor allocation probe into the Word constructor
    
    The Word.ctor probe lived in Morpher.AnalyzeWord, so it only measured the
    single initial construction per word, not the cascade-created Words. Move it
    into Word(Stratum, Shape) itself (gated on MorpherStatistics.Enabled, off in
    production) and add WordCtorCount so the breakdown reports calls as well as
    bytes. Harness-only; no production-path behavior change.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    RUSTIFY Phase 4a: hot-loop allocation eliminations (safe, no retention)
    
    Four pure-elimination changes in the FST traversal + analysis cascade. Each
    removes an allocation outright without extending any object lifetime, so none
    can trigger the Phase-1b parallel regression (pooling promotes to Gen2 ->
    serializes threads). All validated: 803 SIL.Machine + 63 HermitCrab tests
    green; SenaParallel scaling unchanged.
    
    1. ITraversalMethod.Traverse returns List<FstResult> (was IEnumerable) -> drop
       the redundant .ToList() in Fst.Transduce. All four concrete Traverse impls
       already return the curResults List; the interface was needlessly widened.
    2. Remove redundant .Distinct(FreezableEqualityComparer<Word>.Default) x2 in
       AnalysisStratumRule.ApplyMorphologicalRules/ApplyTemplates. Both _mrulesRule
       and _templatesRule are built with that same comparer and return a HashSet
       already deduped by it, so the Distinct pass is a no-op DistinctIterator.
    3. Skip the DistinctIterator for trivial result sets in Fst.Transduce:
       (allMatches && resultList.Count > 1) ? resultList.Distinct() : resultList.
       resultList is non-null with Count >= 1 at that point, and Distinct over a
       single element is the identity.
    4. TraversalMethodBase.Reset: replace the per-Transduce GetNodesDepthFirst
       yield iterator (heap state machine per top annotation, thousands/word) with
       the allocation-free PreorderTraverse(action) form; delegate cached as a
       field (allocated once in ctor, not per call).
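
Item 3 in isolation, as a hedged sketch: `Dedup` is a hypothetical helper that mirrors the quoted conditional, returning the list itself when a Distinct pass could not change it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistinctSketch
{
    // Only pay for a DistinctIterator when duplicates are actually possible.
    public static IEnumerable<T> Dedup<T>(List<T> results, bool allMatches) =>
        allMatches && results.Count > 1 ? results.Distinct() : results;

    public static void Main()
    {
        var single = new List<int> { 7 };
        // Same instance comes back: no iterator object was allocated.
        Console.WriteLine(ReferenceEquals(Dedup(single, true), single));

        var many = new List<int> { 1, 1, 2 };
        Console.WriteLine(Dedup(many, true).Count());
    }
}
```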
    
    Measured (en-hc, SenaQuick, 439 words): Other 38.5% -> 36.2%
    (14,145KB -> 12,884KB), KB/word 83.6 -> 81.1. Toy-grammar deltas are small;
    real magnitude needs the Sena grammar. See RUSTIFY.md Phase 4a/4b (4b documents
    the rejected scaffold-buffer ThreadStatic pooling: re-entrant via
    acceptInfo.Acceptable, lifetime extension against the thesis, unmeasurable).
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    RUSTIFY Phase 4c: single-hash traversed.Add in nondet FST traversal
    
    In NondeterministicFstTraversalMethod.Traverse (both epsilon and input-match
    branches), the dedup check hashed the expensive structural key (state +
    annotation index + register array + outputs array) twice: once for
    Contains(key), then again for Add(key). HashSet.Add already returns false when
    the element is present, so collapse to `if (traversed.Add(key)) Push(newInst);`
    — a single hash/lookup in the innermost traversal loop. Byte-identical.
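
The difference can be made visible with a comparer that counts hash computations. The names below (`CountingComparer`, `ContainsThenAdd`, `AddOnly`) are illustrative; the only library behavior relied on is that `HashSet<T>.Add` returns false when the element is already present.

```csharp
using System;
using System.Collections.Generic;

public sealed class CountingComparer : IEqualityComparer<string>
{
    public int HashCalls { get; private set; }

    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);

    public int GetHashCode(string key)
    {
        HashCalls++; // stands in for the expensive structural hash
        return StringComparer.Ordinal.GetHashCode(key);
    }
}

public static class SingleHash
{
    // Old shape: the key is hashed for Contains, then again for Add.
    public static bool ContainsThenAdd(HashSet<string> traversed, string key)
    {
        if (!traversed.Contains(key))
        {
            traversed.Add(key);
            return true;
        }
        return false;
    }

    // New shape: Add reports whether the key was new, so one hash is enough.
    public static bool AddOnly(HashSet<string> traversed, string key) => traversed.Add(key);

    public static void Main()
    {
        var before = new CountingComparer();
        ContainsThenAdd(new HashSet<string>(before), "state|3|regs");

        var after = new CountingComparer();
        AddOnly(new HashSet<string>(after), "state|3|regs");

        Console.WriteLine($"hashes: {before.HashCalls} vs {after.HashCalls}");
    }
}
```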
    
    CPU-only cleanup (no allocation change), so KB/word is flat on the toy grammar;
    the structural-hash cost it removes is not resolvable there. Gated:
    803 SIL.Machine + 63 HermitCrab tests green; SenaQuick no regression.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Phase 4c: drop per-call List<int> in TraversalMethodBase.Advance
    
    Advance collected the same-offset annotation window into `var anns = new
    List<int>()` and then iterated it. That window is a contiguous index range
    [nextIndex, annsEnd) (the build loop adds every consecutive i whose start
    offset matches), so track the end bound and iterate the range directly,
    eliminating one List allocation per arc match (one of the hottest paths in
    the traversal). cloneOutputs/first flow is unchanged.
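
A sketch of the range form, assuming the annotations are ordered by start offset so the same-offset window is contiguous. `SameOffsetEnd` and the plain int offsets are hypothetical simplifications of the real Advance code.

```csharp
using System;
using System.Collections.Generic;

public static class WindowSketch
{
    // Returns the exclusive end of the run that shares startOffsets[nextIndex].
    public static int SameOffsetEnd(IReadOnlyList<int> startOffsets, int nextIndex)
    {
        int offset = startOffsets[nextIndex];
        int end = nextIndex + 1;
        while (end < startOffsets.Count && startOffsets[end] == offset)
            end++;
        return end;
    }

    public static void Main()
    {
        int[] startOffsets = { 0, 0, 3, 3, 3, 7 };
        int nextIndex = 2;
        int annsEnd = SameOffsetEnd(startOffsets, nextIndex);

        // Iterate [nextIndex, annsEnd) directly; no intermediate List<int> is built.
        for (int i = nextIndex; i < annsEnd; i++)
            Console.WriteLine($"annotation {i} starts at {startOffsets[i]}");
    }
}
```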
    
    Measured (en-hc toy, SenaQuick, 439 words): KB/word 80.3 -> 79.0, totalMB
    34 -> 33. Gated: 803 SIL.Machine + 63 HermitCrab tests green; toy under-
    measures so treat the delta as directional, no-regression is the bar.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Phase 4c: collapse identity-map LINQ in TraversalInstance.CopyTo
    
    Both Deterministic and Nondeterministic CopyTo built an `outputMappings`
    dictionary by zipping this.Output's node sequence with ITSELF — a
    deterministic Queue-based BFS enumeration paired element-for-element, i.e. the
    identity map — then projected _mappings through it. Since outputMappings[v]==v,
    the entire block reduces to copying _mappings unchanged. Replace with
    `other.Mappings.AddRange(_mappings)`, removing a Dictionary + two
    SelectMany(GetNodesBreadthFirst, each allocating a Queue + yield iterator) +
    Zip + Select per instance copy. CopyTo runs on every branch of nondeterministic
    traversal, so this is allocation-heavy at scale (Sena ~276 clones/word) though
    the toy grammar (2 clones/word, few branches) can't resolve it.
    
    Byte-identical (provable identity-map reduction); other.Mappings is empty pre-
    AddRange (GetCachedInstance -> Clear). Removed now-unused usings (System.Linq,
    SIL.Machine.DataStructures) to satisfy IDE0005-as-error.
    
    Gated: 803 SIL.Machine + 63 HermitCrab tests green; SenaQuick no regression.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Phase 4c: paired-walk clone mapping in InitializeStack (Det + Nondet)
    
    Both InitializeStack methods built inst.Mappings (source annotation -> clone)
    by zipping two BFS node sequences:
      Data.Annotations.SelectMany(GetNodesBreadthFirst)
        .Zip(inst.Output.Annotations.SelectMany(GetNodesBreadthFirst), KVP)
    allocating, per Transduce, a Queue per top annotation (BFS), two SelectMany
    state machines, a Zip state machine, and one KeyValuePair per node.
    
    Data and inst.Output are isomorphic (inst.Output = Data.Clone()), and the
    resulting dictionary is independent of traversal order, so replace with a new
    allocation-light helper DataStructuresExtensions.PairedPreorderTraverse that
    walks the two forests in lockstep (preorder) and writes pairs straight into the
    dict via a static (closure-free) callback. Debug.Asserts guard the isomorphism
    invariant (root/leaf/child-count) so any future violation fails loudly instead
    of silently truncating like Zip.
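
A lockstep walk of this kind might look like the sketch below. `TreeNode` and `PairedPreorder` are hypothetical and much simpler than the Annotation types; the state parameter is what lets the callback be a static lambda with no closure. The sketch throws on a shape mismatch, where the commit describes Debug.Asserts.

```csharp
using System;
using System.Collections.Generic;

public sealed class TreeNode
{
    public string Label { get; }
    public List<TreeNode> Children { get; } = new List<TreeNode>();

    public TreeNode(string label, params TreeNode[] children)
    {
        Label = label;
        Children.AddRange(children);
    }

    public TreeNode Clone()
    {
        var copy = new TreeNode(Label);
        foreach (TreeNode child in Children)
            copy.Children.Add(child.Clone());
        return copy;
    }
}

public static class PairedWalk
{
    // Visits corresponding nodes of two isomorphic trees in preorder.
    public static void PairedPreorder<TState>(
        TreeNode source, TreeNode clone, TState state, Action<TreeNode, TreeNode, TState> visit)
    {
        if (source.Children.Count != clone.Children.Count)
            throw new InvalidOperationException("Trees are not isomorphic.");

        visit(source, clone, state);
        for (int i = 0; i < source.Children.Count; i++)
            PairedPreorder(source.Children[i], clone.Children[i], state, visit);
    }

    public static void Main()
    {
        var root = new TreeNode("word",
            new TreeNode("stem"),
            new TreeNode("suffix", new TreeNode("seg")));
        TreeNode copy = root.Clone();

        var map = new Dictionary<TreeNode, TreeNode>();
        // Pairs are written straight into the dictionary; no Zip, no KeyValuePair list.
        PairedPreorder(root, copy, map, static (s, c, m) => m[s] = c);
        Console.WriteLine(map.Count);
    }
}
```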
    
    Runs once per Transduce (thousands/word). Toy grammar has tiny annotation trees
    so the delta is below its resolution; the win compounds on Sena's long words.
    Removed now-unused usings (System.Linq, SIL.Extensions) in the deterministic
    method to satisfy IDE0005-as-error.
    
    Gated: 803 SIL.Machine + 63 HermitCrab tests green (incl. the concurrent-
    determinism test); SenaQuick no regression.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Phase 4c: precompute initializer partition at Fst.Freeze
    
    Fst.Transduce rebuilt a List<TagMapCommand> on every call (and every outer
    annIndex iteration), filtering _initializers into Dest!=0 (the per-call cmds
    list) vs Dest==0 (which drive a per-annotation SetOffset). The partition is
    identical every call for a frozen FST, and cmds is read-only downstream
    (Initialize -> ExecuteCommands only iterate it).
    
    Partition _initializers once in Freeze() into _zeroDestInitializers /
    _nonZeroDestInitializers (built into locals, gating field published last so a
    reader never sees a half-filled list). Transduce reuses the shared read-only
    _nonZeroDestInitializers as cmds and walks _zeroDestInitializers for the
    SetOffsets, eliminating the per-call list allocation + filter loop. When the FST
    isn't frozen the fields are null and Transduce falls back to the exact inline
    build, so unfrozen callers are unaffected. The frozen FST is shared read-only
    across parsing threads, so concurrent reads of the shared cmds list are safe.
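
The freeze-time partition with a publish-last gate can be sketched as follows. `InitializerTable` and its tuple commands are hypothetical; the volatile gate field is one way to get the "never sees a half-filled list" property and is an assumption about the mechanism, not a description of the actual Fst code.

```csharp
using System;
using System.Collections.Generic;

public sealed class InitializerTable
{
    private readonly List<(int Dest, string Name)> _initializers = new List<(int, string)>();
    private List<(int Dest, string Name)> _zeroDest;
    private volatile List<(int Dest, string Name)> _nonZeroDest; // gate: written last

    public void Add(int dest, string name)
    {
        if (_nonZeroDest != null)
            throw new InvalidOperationException("The table is frozen.");
        _initializers.Add((dest, name));
    }

    public void Freeze()
    {
        // Build into locals, then publish; the volatile write goes last.
        List<(int Dest, string Name)> zero = Filter(wantZero: true);
        List<(int Dest, string Name)> nonZero = Filter(wantZero: false);
        _zeroDest = zero;
        _nonZeroDest = nonZero;
    }

    // Frozen: the shared read-only list. Unfrozen: the original per-call build.
    public IReadOnlyList<(int Dest, string Name)> NonZeroDestCommands() =>
        _nonZeroDest ?? Filter(wantZero: false);

    public IReadOnlyList<(int Dest, string Name)> ZeroDestCommands() =>
        _nonZeroDest != null ? _zeroDest : Filter(wantZero: true); // gate is read first

    private List<(int Dest, string Name)> Filter(bool wantZero)
    {
        var result = new List<(int Dest, string Name)>();
        foreach ((int Dest, string Name) cmd in _initializers)
        {
            if ((cmd.Dest == 0) == wantZero)
                result.Add(cmd);
        }
        return result;
    }

    public static void Main()
    {
        var table = new InitializerTable();
        table.Add(0, "start");
        table.Add(2, "tag");
        table.Freeze();
        Console.WriteLine(ReferenceEquals(table.NonZeroDestCommands(), table.NonZeroDestCommands()));
    }
}
```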
    
    Measured (en-hc toy, SenaQuick, 439 words): Scaffold 22.4% -> 21.0%
    (7897KB -> 7261KB), KB/word 79.0 -> 78.8. SenaParallel: scaling and allocation
    unchanged (no parallel regression from sharing the list). Gated: 803
    SIL.Machine + 63 HermitCrab tests green, incl. the concurrent-determinism test.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: document Phase 4c (five safe no-retention FST eliminations)
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY cleanup: strip unrelated USFM work + remove dead GC/pooling machinery
    
    Audit of master..hc-rustify identified work to drop now that allocation is
    driven down and the branch is being squashed:
    
    1. Strip the unrelated USFM/versification change set (3 commits: #430, #432,
       normalization port) — restored src/SIL.Machine/Corpora,
       src/SIL.Machine/PunctuationAnalysis and their tests to master, removed the
       USFM-added files. None of it is perf work; it belongs in its own PR.
    
    2. Remove the dead FST traversal pooling (measured to REGRESS parallel parsing,
       Phase 1b): the FstThreadPool class, Fst.TraversalPoolEnabled, the pooling
       branch in Fst.Transduce, FstThreadPool.Reset() in Morpher.ParseWord, and the
       HC_ARENA toggle in RustifyBenchmark. Transduce now always uses a fresh
       (die-in-Gen0) traversal method, which is the right tradeoff once allocation
       is low.
    
    3. Remove the Shape.CopyTo [ThreadStatic] clone-map pool, keeping the
       value-added inline mapping build (no second GetNodes().Zip().ToDictionary()
       pass) — just a plain per-call Dictionary.
    
    4. Revert Machine.sln to master (leftover x64/x86 platform configs) and remove
       three superseded planning docs (HERMITCRAB_ALLOCATION_STRATEGIES /
       COW_PLANS / PERF_PLAN), now consolidated in RUSTIFY.md.
    
    Kept (per request): the allocation instrumentation (MorpherStatistics,
    FstStatistics, probes) + both benchmarks for before/after measurement; the
    MaxDegreeOfParallelism API + Synthesize refactor; COW FeatureStruct; bit-packed
    unify; Phase 4a/4c eliminations.
    
    Gated: 801 SIL.Machine + 63 HermitCrab tests green.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: document Phase 4d cleanup (pooling/arena removed)
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY cleanup: restore the safe Shape.CopyTo [ThreadStatic] clone-map pool
    
    Re-examination showed this pool is NOT the regressive kind removed elsewhere.
    The Phase-1b parallel regression came from objects retained ACROSS a word
    (promoted to Gen2). Shape.CopyTo's CloneMapping is cleared and fully consumed
    WITHIN each call (contents die immediately; only a small empty buffer persists),
    so it cannot promote parse data to Gen2 — and it still buys a small allocation
    win (~0.45% on Sena; on the toy grammar removing it had pushed Word.Clone
    22.4% -> 23.6% and KB/word 78.8 -> 80.2). Restored, with RUSTIFY.md Phase 4d
    note corrected to record it as KEPT (safe pool) rather than removed.
    
    Gated: 801 SIL.Machine + 63 HermitCrab tests green; SenaQuick KB/word back to
    78.8, Word.Clone 22.4%.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: staged implementation plan for the flat int-index shape
    
    Records the chosen direction (go flat) + corrected feasibility findings (no
    TOffset constraints; int-offset engine already tested; ShapeNode contained to
    ~95 in-repo refs), the accepted cost (ShapeNode -> handle, value identity), and
    the 3-stage plan: (1) array-backed Shape + ShapeNode handle + array-copy Clone,
    (2) int FST offset + unmanaged Span/stackalloc traversal (the parallel unlock),
    (3) migrate rule sites to indices.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: real Sena measurement obtained — reshapes flat-shape priorities
    
    Generated sena-hc.xml from the Sena 3 FieldWorks backup (GenerateHCConfig.exe)
    and extracted sena-words.txt (7,121 words) from the project's seh running text.
    SenaQuick (400 words, MaxUnapp=5) now gives the clone-heavy numbers the spike
    needs:
    
      clones/word=345 (estimate ~276 confirmed), KB/word=14,116
      Scaffold 42.2% (per-Transduce Register[,] arrays)  <- biggest bucket
      Word.Clone 21.9%, Other 24.8%
    
    Key finding: Scaffold (managed Register<ShapeNode>[,] per Transduce) is ~2x
    Word.Clone, so the flat int-index foundation's biggest payoff is Stage 2
    (Register<int> -> stackalloc/Span, zero heap), bigger than the Word.Clone bucket
    the goal named, and unlocked by the same change. Confirms flat over COW (COW
    cannot touch the traversal scaffold). Benchmark assets untracked in samples/data.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Stage 1: array-backed Shape + ShapeNode handle (flat-shape foundation)
    
    Re-represents the shape as a flat, int-indexed backing — the data-model
    foundation the whole flat-shape plan (Phase 3b-impl) stands on, and the
    prerequisite for Stage 2's Fst<Word,int> register-scaffold win (the measured
    42% bucket) and Stage 3's Word.Clone cut (22%).
    
    - Shape no longer inherits OrderedBidirList<ShapeNode>; it owns its nodes in
      flat arrays (_next/_prev int links = in-array doubly-linked list, per-node
      frozen flag, canonical handle) addressed by a stable ShapeNode.Index, and
      reimplements IOrderedBidirList/IOrderedBidirListNode over them.
    - ShapeNode becomes a handle (Owner + Index); links/frozen delegate to the
      owner arrays. The added node IS retained as the canonical one-per-slot handle,
      so reference identity (==, dict keys, Range<ShapeNode> endpoints) is unchanged.
    - Tag deliberately stays on the node so it survives a node moving between shapes
      (AddAfter sets the new tag before detaching from the old owner). The tag-relabel
      order maintenance, Freeze/Clone/CopyTo and annotation interactions are preserved.
    
    Gate: 803 SIL.Machine + 63 HermitCrab tests green (incl. concurrent-determinism).
    Measured neutral on en-hc toy (SenaQuick): KB/word 80.5 -> 80.5, clones/word 2,
    gen0 2 — exactly the plan's "Stage 1 ~= 0" prediction; payoff is unlocked by, not
    realized in, this increment.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    RUSTIFY Stage 2 substrate: freeze-time NodeAt int<->node bridge + design blueprint
    
    Adds Shape.NodeAt(int) backed by a dense _byPos[] table built in Freeze (content
    nodes already get dense Tag 0..N-1 there), the int-offset -> ShapeNode bridge the
    Fst<Word,int> binding will resolve against. Additive and behavior-preserving;
    803 SIL.Machine tests green.
    
    Records the resolved Stage 2 blueprint in RUSTIFY.md: offset = dense frozen tag
    (HC always freezes before traversal), half-open [t,t+1) ranges reusing
    IntegerRangeFactory (provably identical ordering/Overlaps/Contains to the
    inclusive ShapeNode form for a one-unit-per-node model), Word/Shape become
    IAnnotatedData<int> via a freeze-time AnnotationList<int> projection, rules resolve
    int->node via NodeAt, and Register<int> goes unmanaged (the 42% Scaffold payoff).
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    RUSTIFY Stage 2 blueprint: correct it after reading the rule-application flow
    
    Reading IterativePhonologicalPatternRule.Apply + the rewrite SubruleSpecs +
    the semantic-site catalog overturned two blueprint assumptions:
    
    - Rewrite rules MUTATE match.Input.Shape in place while UNFROZEN and re-match
      repeatedly, so the traversed shape's tags are SPARSE, not dense 0..N-1.
      => offset must be the raw ordered Tag (the [Tag,Tag+1) half-open mapping is
      still provably correct for sparse tags: Tag+1 never exceeds the next tag).
    - NodeAt must therefore work on unfrozen shapes => a Tag->node map maintained
      incrementally (AddAfter/Remove/Relabel), not a freeze-only dense array.
    
    Records the real hazards to design for: End.Tag==int.MaxValue overflows [t,t+1)
    (anchors are in the annotation list and ordered on add); the int annotation
    projection must stay in sync with the live mutated ShapeNode list (not build-once
    at freeze); and ~30 offset-navigation sites (match.Range.End.Next etc.) must route
    through shape.NodeAt(tag).Next?.Tag preserving null-at-boundary. Net: the flip is
    larger/subtler than a mechanical generic swap — a multi-session spike, each
    sub-piece behind the byte-identical gate.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    RUSTIFY Stage 2: empirically validate the int-offset range-mapping thesis
    
    Adds a parity test proving the assumption the whole TOffset=ShapeNode -> int flip
    rests on: mapping each annotation [startNode,endNode] to the half-open int range
    [startNode.Tag, endNode.Tag+1) preserves the range relationships the FST traversal
    depends on — CompareTo ordering, Overlaps, Contains — for SPARSE tags (appended
    unfrozen shape, as rewrite rules see it) and dense tags (frozen). Pairwise over
    all annotations of a shape with a spanning (start!=end) annotation. Both cases green.
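
The parity argument above can be checked in a few lines. This is a minimal sketch in Python (the repository itself is C#), assuming ranges are plain `(start, end)` tag pairs and that ordering is lexicographic on `(start, end)`; the helper names are hypothetical and the real `CompareTo` may differ.

```python
from itertools import combinations

def to_half_open(r):
    """Map an inclusive tag range [start, end] to the half-open [start, end + 1)."""
    return (r[0], r[1] + 1)

def overlaps_inclusive(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def overlaps_half_open(a, b):
    return a[0] < b[1] and b[0] < a[1]

def contains(a, b):
    # Same formula for both forms: both ends shift by the same amount.
    return a[0] <= b[0] and b[1] <= a[1]

def relations_preserved(ranges):
    """True if ordering, Overlaps and Contains agree pairwise in both forms."""
    for a, b in combinations(ranges, 2):
        ha, hb = to_half_open(a), to_half_open(b)
        if overlaps_inclusive(a, b) != overlaps_half_open(ha, hb):
            return False
        if contains(a, b) != contains(ha, hb) or contains(b, a) != contains(hb, ha):
            return False
        if (a < b) != (ha < hb):  # assumed lexicographic (start, end) ordering
            return False
    return True

# Sparse tags (unfrozen shape) and dense tags (frozen shape), each with a spanning range.
SPARSE = [(10, 10), (10, 40), (25, 25), (40, 40), (25, 40)]
DENSE = [(0, 0), (0, 2), (1, 1), (2, 2), (1, 2)]
```

Because tags are integers, `a.start <= b.end` and `a.start < b.end + 1` are the same condition, which is why sparseness does not matter.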
    
    This de-risks the riskiest design point before any code is built on it: now 805
    SIL.Machine (+2) + 63 HC tests green.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    RUSTIFY Stage 1: measure the flat Word on the REAL Sena grammar (the clone-heavy case)
    
    The toy-grammar SenaQuick (2 clones/word) was too clone-light to judge an
    API-breaking rewrite. Ran SenaQuick against the real Sena grammar (sena-hc.xml,
    400 words, HC_MAX_UNAPP=5) where the ~345 clones/word payoff lives, vs the
    pre-Stage-1 baseline recorded in 2fd1a2d3:
    
      clones/word 345 -> 345  (byte-identical AND behavior-identical at scale)
      KB/word     14116 -> 14583  (+3.3%)
      gen0        442 -> 457
      Scaffold    42.2% -> 42.7%   Word.Clone 21.9% -> 22.3%  (split reproduced)
    
    Findings: (1) the flat data model produces an identical clone count on the
    pathological grammar, confirming correctness beyond the toy tests; (2) Stage 1
    in isolation costs +3.3% allocation -- the four per-Shape backing arrays
    (_nodes/_next/_prev/_frozen) x 138k clones -- exactly the plan's "Stage 1 ~= 0 or
    slightly negative" prediction, hidden by the toy grammar's 2 clones/word. The
    cost is the investment: the 42.7% Scaffold (Stage 2 Register<int>) and 22.3%
    Word.Clone (Stage 3) are what the same flat foundation unlocks.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    RUSTIFY Stage 2: int-offset annotation projection on Shape (the Fst<Word,int> bridge)
    
    Adds the linchpin infrastructure for the FST flip, additively (Shape still
    IAnnotatedData<ShapeNode>; nothing flipped yet, full suite green):
    
    - AnnotationList<T> gains an internal Version counter (bumped on Add/Remove/Clear)
      so derived views can detect staleness cheaply.
    - Shape builds a lazy, version-gated int-offset projection: AnnotationList<int>
      IntAnnotations (each [s,e] -> half-open [s.Tag, e.Tag+1), End margin clamped to
      avoid +1 overflow), Range<int> IntRange, and Dictionary<int,ShapeNode> for
      NodeAt(offset) (now works frozen AND unfrozen, via Tag). FeatureStruct is shared
      by reference so in-place rule edits stay visible. The projection is rebuilt only
      when the annotation Version or frozen-state changes, so a stable/frozen shape
      builds it once and reuses it across thousands of Transduce calls per word.
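
A minimal sketch of the version-gated cache described above, in Python for brevity (the real code is C#). The class and member names are simplified stand-ins, and the `rebuilds` counter exists only to make the caching observable; the End-margin clamp is omitted.

```python
class AnnotationList:
    """Stand-in list that bumps a version counter on every structural change."""
    def __init__(self):
        self.items = []
        self.version = 0

    def add(self, start_tag, end_tag):
        self.items.append((start_tag, end_tag))
        self.version += 1

    def clear(self):
        self.items.clear()
        self.version += 1


class Shape:
    def __init__(self):
        self.annotations = AnnotationList()
        self.frozen = False
        self._cache_key = None
        self._int_annotations = None
        self.rebuilds = 0  # illustration only

    @property
    def int_annotations(self):
        # Rebuild only when the annotation version or the frozen state changed.
        key = (self.annotations.version, self.frozen)
        if key != self._cache_key:
            self._int_annotations = [(s, e + 1) for s, e in self.annotations.items]
            self._cache_key = key
            self.rebuilds += 1
        return self._int_annotations
```

A shape that stays unchanged pays for one build and then serves every later read from the cache.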
    
    Tests: IntAnnotationProjection_MirrorsShapeNodeAnnotations verifies the projection
    mirrors the ShapeNode tree (ranges, FeatureStruct identity, optional, children),
    NodeAt round-trips every node by Tag, and the cache invalidates on mutation.
    805 (+2) SIL.Machine green.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    WIP RUSTIFY Stage 2: flip HC FST to Fst<Word,int> (57/63 HC green)
    
    The full TOffset ShapeNode->int flip across HermitCrab (~71 files): Word is now
    IAnnotatedData<int>; Shape exposes a lazy, version-gated int-offset annotation
    projection with DENSE node positions (0..N+1) as offsets — dense (not sparse Tag)
    to avoid the Range<int>.Null=-1 collision, +1 overflow at the End margin, and
    empty anchors. NodeAt/OffsetOf/MatchStartOffset bridge int<->node; rule RHS code
    resolves match/group int ranges back to nodes (half-open [off, off+1), so leftmost
    = NodeAt(Start), rightmost = NodeAt(End-1)). MatchStartOffset(node,dir) handles the
    inclusive->half-open asymmetry for right-to-left match-start offsets.
    
    Down from 23 failures to 6 (all now logic, not crashes): metathesis SimpleRule/
    ComplexRule, DeletionRules/MultipleDeletionRules, EpenthesisRules, ReduplicationRules
    — node insertion/deletion/movement + group-capture rules still need correctness work.
    Register<int> stackalloc (the payoff) comes after these are byte-identical.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    RUSTIFY Stage 2: fix analysis under-generation (63/63 HC green)
    
    Two byte-identical fixes to the int-offset projection, restoring the 4
    analysis-direction rewrite tests (Epenthesis/Deletion/MultipleDeletion/
    Reduplication) that the Fst<Word,int> flip broke:
    
    1. Annotation.Optional must invalidate the projection. The Shape int
       projection copies Optional by value and caches against the annotation
       list Version, but the Optional setter is a non-structural change that
       never bumped Version. So once analysis flipped Optional=true on existing
       nodes, the matcher kept reading the stale Optional=false projection and
       never forked the optional-skip instances. The setter now bumps the root
       list's version (new AnnotationList.IncrementVersion). Fixes Epenthesis.
    
    2. IntRange must be the half-open image of the inclusive [Begin, End], i.e.
       [off(Begin), off(End)+1) — not [off(Begin), off(End)]. The only consumer
       is Matcher.GetStartAnnotation via Range.GetStart(dir); a RtL match starts
       at GetStart(RtL)==End. The End anchor's dense range is [off(End),
       off(End)+1), whose RtL start coordinate is off(End)+1, so without the +1
       a RtL match began at the last content node and skipped any edit adjacent
       to End (e.g. inserting a deleted segment after the final vowel). Fixes
       the deletion/reduplication cases.
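
The first fix can be illustrated with a small Python sketch (the repository is C#). It assumes a projection that copies `optional` by value and caches against the list version; `increment_version` mirrors the new `AnnotationList.IncrementVersion`, while the other names are hypothetical.

```python
class AnnotationList:
    def __init__(self):
        self.items = []
        self.version = 0

    def add(self, annotation):
        annotation.owner = self
        self.items.append(annotation)
        self.version += 1

    def increment_version(self):
        self.version += 1


class Annotation:
    def __init__(self, tag, optional=False):
        self.tag = tag
        self.owner = None
        self._optional = optional

    @property
    def optional(self):
        return self._optional

    @optional.setter
    def optional(self, value):
        if value != self._optional:
            self._optional = value
            # Non-structural change: still has to invalidate cached projections.
            if self.owner is not None:
                self.owner.increment_version()


class Projection:
    """Caches (tag, optional) rows by value, keyed on the list version."""
    def __init__(self, annotations):
        self._annotations = annotations
        self._version = -1
        self._rows = []

    def rows(self):
        if self._version != self._annotations.version:
            self._rows = [(a.tag, a.optional) for a in self._annotations.items]
            self._version = self._annotations.version
        return self._rows
```

Without the `increment_version()` call in the setter, `rows()` would keep returning the stale `optional=False` copy.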
    
    Adds two regression tests guarding both invariants. Also keeps the prior
    working-tree Stage-2 fixes in IterativePhonologicalPatternRule and
    SynthesisMetathesisRuleSpec (resolve int offsets to ShapeNode refs before
    mutating the shape).
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Stage 2: record generic-flip-green milestone + post-flip measurement
    
    The <Word,ShapeNode> -> <Word,int> generic flip is byte-identical green (63/63
    HC + 808 SIL.Machine, full Release solution builds clean). Document the two
    int-model correctness bugs found bringing it from 57/63 to green (Optional
    cache invalidation + IntRange half-open End-anchor mapping), why 59 tests
    masked them, and the post-flip en-hc baseline (KB/word 78.8 -> 86.1, the
    projection "investment" before the Register<int> payoff). Records the
    remaining Stage 2 payoff target.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Stage 2: refute the register-payoff hypothesis with real-grammar data
    
    Wire Indonesian (the classic HC nasalization demo, variable-light, ~150
    clones/word) alongside Sena (~345 clones/word) as the two measurement
    grammars, and use them to investigate the Stage-2 thesis that Register<int>
    being unmanaged unlocks a stackalloc cut of the 42% Scaffold bucket.
    
    Measurement refutes it:
    - Registers.Clone (the escaping accept snapshots the redesign targets) = 0.2%
      on Sena. Not where the bytes are.
    - Converting the per-push dedup-key Tuple<State,int,Register[,][,Output[]]> to
      an inline `readonly struct TraversalKey` in both nondeterministic traversal
      methods moved allocation ~0% (Sena KB/word 14588->14579; Indonesian flat).
      Kept anyway: zero-risk, byte-identical, removes a real per-push heap object,
      CPU-positive (single Add vs Contains+Add), consistent with Phase-4c micro-
      eliminations.
    - The Scaffold 38.5% IS the clone explosion: it contains Word.Clone (22.4%, via
      the per-instance Output=Data.Clone() in InitializeStack) + the per-instance
      Mappings dictionary + Output graph. The int flip's allocation payoff is
      therefore Stage 3 (flat-shape clone), not a register trick.
    
    Full suite green (808 SIL.Machine + 63 HC, incl. concurrent-determinism).
    RUSTIFY.md records the finding + how to regenerate both grammars.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Stage 3: localize the clone cost — inherent per-node materialization
    
    Two-phase allocation probe on Shape.CopyTo (Sena): Word.Clone's 22% splits
    into CopyTo node-phase 11.4% (node.Clone + per-node dest.Add) + annotation-
    phase 4.1% + ~6.9% Word/Shape ctor. The node-phase prize is inherent per-node
    object materialization (ShapeNode + Annotation + COW FS + AnnotationList skip-
    list entry per node), not intermediate churn.
    
    Two incremental attacks measured ~0/negative and were reverted:
    - pre-size the backing arrays vs AddAfter doubling: 666->688 MB (worse;
      source Count over-sizes partial-range CopyTo, doubling was never the cost).
    - the per-push dedup Tuple->struct (prior commit): ~0.
    
    Conclusion: the flat-clone payoff requires the deep redesign (lazy ShapeNode
    handles + bulk AnnotationList clone + index-addressed annotations), the Stage-1-
    deferred "Clone = Array.Copy" end-state — a multi-session foundational rewrite
    needing a go/no-go. No incremental win exists short of it. Recorded in RUSTIFY.md.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Stage 3: design + sequencing doc for the flat-shape clone spike
    
    Per the plan-then-proceed go/no-go: RUSTIFY-stage3-design.md lays out the
    foundational Word.Clone rewrite before any code goes red.
    
    - Goal: kill the inherent per-node materialization (node-phase 11.4% +
      anns-phase 4.1% of Word.Clone) by making Shape.Clone an Array.Copy.
    - Entanglement: ShapeNode reference-identity + annotations-hold-handles +
      skip-list-tower-per-annotation must be undone together.
    - Key resolution: materialize-on-touch two-state shape. A clone is a flat
      snapshot (no handles/Annotation objects); the int projection (Stage 2)
      reads it for the hot frozen-traverse path so nothing materializes; any
      ShapeNode/Annotation request or in-place mutation materializes lazily,
      one-per-slot, restoring exact reference identity. This resolves the
      dense-index-vs-mutation tension: frozen-read pays nothing, unfrozen-mutate
      pays the old price (far colder).
    - Byte-identical risk register, I->V sub-increment order (I FeatureStruct
      flat array + II flat AnnotationList = gateable green and keepable alone;
      III lazy materialization + Array.Copy clone = the red phase; IV triage the
      189 HC ShapeNode refs frozen-read vs mutate; V re-validate + measure),
      and rollback to dbef327a.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Stage 3: advisor review + measure II first — towers are 7.4% (resequence)
    
    Advisor review of the design front-loaded a read-only verification gate before
    the risky III: the linchpin (int projection must rebuild from flat records,
    handle-free), the make-or-break premise (UpdateOutput touches O(few) per FST-
    transduction clone), result-consumer audit, and the I detached-FS caveat. Folded
    into the design doc.
    
    Then executed the advisor's "measure II before III" with a temporary tower-
    allocation probe on Sena:
    
      annotation skip-list towers = 7.4% of total alloc (~432 MB, 6.31M arrays) —
      a THIRD of Word.Clone (22.4%), two-thirds of node-phase(11.4%)+anns-phase(4.1%).
    
    Resequences the spike: increment II (flatten the BidirList tower arrays into
    list-owned flat backing) is now the headline — ~7.4%, byte-identical, gateable
    GREEN, zero laziness risk, independently keepable. Increment III's lazy-handle
    materialization is downgraded to optional/gated: it buys only the residual ~8%
    (the ShapeNode/Annotation objects) and carries the reference-identity risk. The
    towers were the cheap two-thirds hiding behind the "inherent objects" framing.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Stage 3 II-a: grow skip-list margins on demand (Word.Clone -123MB Sena)
    
    First positive allocation increment of the flat-clone spike, byte-identical green.
    
    Every BidirList ctor Init'd both Begin/End margin nodes at the 33-level skip-list
    maximum (new TNode[33] x2) regardless of actual list height. Since lists almost
    always stay shallow, that eager margin tower was a large slice of the per-
    AnnotationList tower allocation that dominates Word.Clone. Now:
    - margins start at level 0 (Init(1) + link level 0);
    - GrowMargins ensures capacity + links Begin<->End at a new level only when a
      node first reaches it (EnsureLevelCapacity right-sizes; geometric growth was
      measured slightly worse - it over-allocates the shallow majority);
    - Clear resets to level 0, higher levels relink lazily on regrowth.
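
A rough Python sketch of the grow-on-demand idea (the real BidirList is C#). The classes are simplified stand-ins: margins start with a single level and gain a Begin<->End link only when some node first needs that level.

```python
class Node:
    def __init__(self, levels):
        self.next = [None] * levels
        self.prev = [None] * levels


class LazyMarginList:
    def __init__(self):
        # Margins start with level 0 only, instead of the 33-level maximum.
        self.begin = Node(1)
        self.end = Node(1)
        self.begin.next[0] = self.end
        self.end.prev[0] = self.begin

    def grow_margins(self, levels):
        """Ensure both margins have `levels` levels, linking each new level."""
        while len(self.begin.next) < levels:
            self.begin.next.append(self.end)   # Begin -> End at the new level
            self.begin.prev.append(None)
            self.end.prev.append(self.begin)   # End -> Begin at the new level
            self.end.next.append(None)
```

Growth here is exactly to the requested height, matching the note that right-sizing measured better than geometric growth for mostly shallow lists.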
    
    Measured (SenaQuick, Release): Sena Word.Clone 1,306,476 -> 1,182,940 KB
    (-123 MB, -9.5% of Word.Clone, stable across runs; total KB/word -~2% under GC
    noise); Indonesian Word.Clone -~0.5pt similarly. Full SIL.Machine (808) + HC (63,
    incl. concurrent-determinism) green.
    
    Contained to BidirList/BidirListNode (used by AnnotationList x2, SkipList,
    TreeBidirList); does not touch ShapeNode/Annotation reference identity, so it is
    independently keepable regardless of the later II-b / III increments.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Stage 3 II-b: inline skip-list level 0 (Word.Clone -54MB more, Sena)
    
    Second positive flat-clone increment, byte-identical green. Level 0 (the only
    level ~50% of skip-list nodes have) moves from the per-node _next[0]/_prev[0]
    arrays into inline _next0/_prev0 fields, so level-0 nodes allocate NO tower array
    at all and every taller node's array is one slot shorter (levels 1.. in
    _nextHigh/_prevHigh, null when Levels<=1).
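
The level dispatch can be sketched as follows, in Python for brevity (the real node is C#). The field names follow the message (`_next0`, `_nextHigh`), but the class itself is a hypothetical simplification.

```python
class SkipNode:
    __slots__ = ("levels", "next0", "prev0", "next_high", "prev_high")

    def __init__(self, levels):
        self.levels = levels
        self.next0 = None  # level 0 lives in inline fields
        self.prev0 = None
        # Levels 1.. live in arrays one slot shorter; absent for level-0-only nodes.
        self.next_high = [None] * (levels - 1) if levels > 1 else None
        self.prev_high = [None] * (levels - 1) if levels > 1 else None

    def get_next(self, level):
        return self.next0 if level == 0 else self.next_high[level - 1]

    def set_next(self, level, node):
        if level == 0:
            self.next0 = node
        else:
            self.next_high[level - 1] = node

    def get_prev(self, level):
        return self.prev0 if level == 0 else self.prev_high[level - 1]

    def set_prev(self, level, node):
        if level == 0:
            self.prev0 = node
        else:
            self.prev_high[level - 1] = node
```

A node with `levels == 1` allocates no array at all, which is the common case the increment targets.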
    
    Touches the hottest skip-list accessors (GetNext/SetNext/GetPrev/SetPrev/Next/
    Prev/Init/Clear/EnsureLevelCapacity); gated on the full SIL.Machine (808) + HC
    (63, incl. concurrent-determinism) suites - green, so the level<->field-or-array
    dispatch is byte-identical.
    
    Measured (SenaQuick, Release): Sena Word.Clone 1,182,940 -> 1,128,660 KB
    (-54 MB on top of II-a); Indonesian 222,491 -> 212,911 KB (-9.6 MB).
    
    Cumulative II-a + II-b vs pre-Stage-3: Sena Word.Clone -177 MB (-13.6%), total
    allocation -4.2% (KB/word 14,556 -> 13,942); Indonesian total -4.1%. Pure
    allocation reduction, no retention, independently keepable regardless of III.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: index the Stage 3 II-a/II-b green increments (-4.2% allocation)
    
    Record in the main plan that two byte-identical flat-clone increments landed
    (margin grow-on-demand + inline level 0), banking the cheap skip-list tower
    wins: Sena Word.Clone -177 MB (-13.6%), total -4.2%; Indonesian -4.1%. Points
    to RUSTIFY-stage3-design.md for the residual III go/no-go.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Stage 3: III feasibility measured (41% Sena clones never mutated) + choose copy-on-write Shape mechanism
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Stage 3 III: copy-on-write Shape — Word.Clone -59.6% (Sena), byte-identical
    
    The flat-clone payoff. A clone of a *frozen* shape now stores _cowSource and
    copies nothing. The asymmetry that makes this cheap + safe: the FST matcher (the
    hot read path) consumes a clone only through the int-offset projection
    (IntAnnotations/IntRange), which is served from the frozen source; while every
    path that could mutate first hands out a ShapeNode/Annotation handle. So:
    - serve IntAnnotations/IntRange/Count/GetFrozenHashCode/Freeze from the source
      while copy-on-write;
    - gate EnsureInflated() (= the real CopyTo, then re-freeze if frozen-by-sharing)
      on the flat-backing link accessors, First/Last/enumeration, NodeAt/OffsetOf/
      MatchStartOffset/Annotations/GetNodes/CopyTo/ValueEquals, and every mutator.
    A clone that is only traversed (matcher carrier) never inflates -> costs a shell
    instead of N nodes + N annotations + their skip-list towers.
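
A minimal Python sketch of the copy-on-write mechanism (the real Shape is C#). It assumes a much simpler shape: nodes are list items and the int projection is one `(i, i + 1)` range per node. Only `_cow_source` and `EnsureInflated` come from the message; everything else is illustrative.

```python
class Shape:
    def __init__(self, nodes=()):
        self._nodes = list(nodes)
        self._frozen = False
        self._cow_source = None
        self._projection = None

    def freeze(self):
        self._frozen = True
        # Built eagerly so shared readers never race a lazy first build.
        self._projection = self._build_projection()

    def _build_projection(self):
        return tuple((i, i + 1) for i in range(len(self._nodes)))

    def clone(self):
        source = self._cow_source if self._cow_source is not None else self
        copy = Shape()
        if source._frozen:
            copy._cow_source = source          # share; copy nothing
        else:
            copy._nodes = list(source._nodes)  # ordinary deep-ish copy
        return copy

    @property
    def is_inflated(self):
        return self._cow_source is None

    @property
    def int_annotations(self):
        # Read path: served from the frozen source while copy-on-write.
        if self._cow_source is not None:
            return self._cow_source.int_annotations
        if self._projection is None:
            self._projection = self._build_projection()
        return self._projection

    def _ensure_inflated(self):
        if self._cow_source is not None:
            self._nodes = list(self._cow_source._nodes)
            self._cow_source = None
            self._projection = None

    def append(self, node):
        # Every mutator inflates first, so the source is never touched.
        if self._frozen:
            raise ValueError("shape is frozen")
        self._ensure_inflated()
        self._nodes.append(node)
        self._projection = None

    @property
    def nodes(self):
        self._ensure_inflated()
        return list(self._nodes)
```

A clone that is only read through `int_annotations` stays a shell; the first mutation or node access pays for the real copy.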
    
    Thread-safety (the doc's non-negotiable): a frozen shape's int projection is now
    built eagerly at Freeze() (single-threaded), so the new pattern of several parse
    threads' COW clones delegating to one shared frozen grammar shape always hits a
    complete cache rather than racing a lazy first build.
    
    Measured (SenaQuick, Release): Sena Word.Clone 1,128,660 -> 528,071 KB (-53% on
    top of II; 20.2% -> 9.9% of total); Indonesian 212,911 -> 85,566 KB (-60%).
    Cumulative Stage 3 (II-a+II-b+III) vs pre-Stage-3: Sena Word.Clone -778 MB
    (-59.6%), share 22.4% -> 9.9%; Indonesian -62%. Word.Clone is no longer a top
    bucket. Full SIL.Machine (808) + HC (63, incl. concurrent-determinism) green;
    full Release solution builds clean; SenaParallel scaling unregressed.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY Stage 3 III: verify byte-identical on real grammars + COW invariant tests
    
    Validation the toy suite can't give (en-hc is ~2 clones/word; COW's never-
    inflated path runs ~170x hotter on Sena at 345 clones/word):
    
    - Added RustifyBenchmark.Signature ([Explicit], not CI): emits a deterministic
      per-word analysis signature (sorted set of Category|root|glosses per
      WordAnalysis) to HC_SIG_OUT. Diffed HEAD vs the pre-Stage-3 baseline (dbef327a,
      isolating II+III) on BOTH grammars via a worktree: Sena (400 words) and
      Indonesian (121 words, 100 non-empty) signatures are IDENTICAL. The COW change
      is byte-identical where it actually runs hot, not just on the toy grammar.
    
    - Added 3 CI-running COW-invariant regression tests (AnnotationTests):
      never-inflated clone serves the source's projection/range/count; mutating a
      clone inflates it and leaves the frozen source uncorrupted; frozen-by-sharing
      hash equals the source and stays stable across forced inflation.
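
The signature idea in the first bullet can be sketched in Python (the benchmark itself is C#). This assumes each analysis is a `(category, root, glosses)` tuple and that glosses are joined with commas; both the tuple shape and the separator are assumptions, not the benchmark's actual format.

```python
def word_signature(analyses):
    """Order-independent signature: sorted set of Category|root|glosses strings."""
    return sorted(
        {f"{category}|{root}|{','.join(glosses)}" for category, root, glosses in analyses}
    )


def signatures_match(run_a, run_b):
    """Compare two runs, each a dict mapping word -> list of analyses."""
    if run_a.keys() != run_b.keys():
        return False
    return all(word_signature(run_a[w]) == word_signature(run_b[w]) for w in run_a)
```

Sorting a set makes the signature insensitive to the order and multiplicity in which analyses are produced, so two builds can be compared with a plain diff.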
    
    Full SIL.Machine (811) + HC (63) green.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY lever 2: lazily allocate Word's morphological-rule bookkeeping maps
    
    _mrulesUnapplied / _mrulesApplied / _disjunctiveAllomorphIndices stay empty
    through the phonological-analysis cascade (where ~345 clones/word happen) but
    were cloned eagerly per candidate. Now null = empty, created on first write,
    copied only when the source is non-empty. Byte-identical (63 HC green).
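
A small Python sketch of the null-means-empty pattern (the real Word is C#). The field name follows `_mrulesApplied`; the method names are hypothetical.

```python
class Word:
    def __init__(self):
        self._mrules_applied = None  # None stands for "empty"; no dict allocated yet

    def mrule_applied_count(self, rule):
        if self._mrules_applied is None:
            return 0
        return self._mrules_applied.get(rule, 0)

    def mark_mrule_applied(self, rule):
        if self._mrules_applied is None:
            self._mrules_applied = {}  # created on first write
        self._mrules_applied[rule] = self._mrules_applied.get(rule, 0) + 1

    def clone(self):
        copy = Word()
        if self._mrules_applied:  # copy only when the source is non-empty
            copy._mrules_applied = dict(self._mrules_applied)
        return copy
```

Clones made while the map is still empty allocate nothing for it, which is the case that dominates the phonological-analysis cascade.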
    
    Measured (SenaQuick): Word.Clone 527,987 -> 499,121 KB (-29 MB), Word.ctor
    184,858 -> 177,387 KB; total 5,267 -> 5,216 MB (~-1%).
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY lever 1: hoist the initial-register scaffold out of the Transduce loop
    
    Fst.Transduce allocated a fresh Register<TOffset>[regCount,2] per outer (start-
    position) iteration. Traverse only Array.Copy's it into the initial instances
    and never retains it, so it can be allocated once and Array.Clear'd per start
    position - byte-identical, and AllMatches (analysis) runs one iteration per
    start, so this removes (starts-1) register-array allocations per matcher call.
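
The hoist can be pictured with this Python sketch (the real Fst.Transduce is C#). The function shapes are hypothetical; the point is that the scaffold is allocated once, cleared per start position, and only ever copied by the traversal.

```python
def traverse(start, initial_registers):
    """Stand-in traversal: copies the scaffold and never retains it."""
    registers = [row[:] for row in initial_registers]
    registers[0][0] = start
    return registers


def transduce(starts, reg_count, traverse_fn=traverse):
    # Allocated once per call instead of once per start position.
    scaffold = [[None, None] for _ in range(reg_count)]
    results = []
    for start in starts:
        for row in scaffold:  # the Array.Clear equivalent
            row[0] = row[1] = None
        results.append(traverse_fn(start, scaffold))
    return results
```

The change is only safe because the traversal copies the scaffold; if it kept a reference, clearing for the next start would corrupt earlier results.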
    
    Measured (SenaQuick): Scaffold 2,264,486 -> 2,241,441 KB (-23 MB). Full suite
    (811 SIL.Machine + 63 HC) green.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: record levers 1+2 (lean Word + hoisted register scaffold, ~-1%, byte-identical) and why the 42% Scaffold prize stays blocked
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY lever 1: replace per-instance Visited HashSet with an inline value bitset
    
    Profiling showed the 42% Scaffold is instance churn: ~2,927 traversal instances
    created per Sena word (only ~20% reused — the pool is per-Transduce, thrown away
    each call, and pooling across calls re-triggers the Phase-1b Gen2 parallel
    regression). So the fix is leaner instances, not pooling.
    
    Each nondeterministic instance carried a HashSet<State> to avoid epsilon loops.
    States have a dense Index, so this is now a value-type VisitedStates bitset:
    states 0-63 in an inline ulong field (zero heap — HC rule FSTs are tiny), a lazy
    ulong[] overflow only for 64+ state FSTs. The set is now part of the instance
    object, not a separate ~1.17M/word heap allocation. Byte-identical (same dedup
    semantics over state identity == Index).
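
A Python sketch of the bitset layout (the real VisitedStates is a C# value type). It assumes states are identified by a dense non-negative index, and models the inline `ulong` as an int and the overflow array as a lazily created list.

```python
class VisitedStates:
    def __init__(self):
        self._low = 0          # states 0..63, the inline word
        self._overflow = None  # created only for FSTs with 64+ states

    def add(self, index):
        """Mark a state visited; return True if it was not visited before."""
        if index < 64:
            mask = 1 << index
            if self._low & mask:
                return False
            self._low |= mask
            return True
        word, bit = divmod(index - 64, 64)
        if self._overflow is None:
            self._overflow = []
        while len(self._overflow) <= word:
            self._overflow.append(0)
        mask = 1 << bit
        if self._overflow[word] & mask:
            return False
        self._overflow[word] |= mask
        return True

    def __contains__(self, index):
        if index < 64:
            return bool((self._low >> index) & 1)
        word, bit = divmod(index - 64, 64)
        if self._overflow is None or word >= len(self._overflow):
            return False
        return bool((self._overflow[word] >> bit) & 1)
```

For the small rule FSTs the message describes, only the inline word is ever used and the overflow list is never created.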
    
    Measured (SenaQuick): Scaffold 2,269,759 -> 2,169,001 KB (-100 MB), total
    5,242 -> 5,145 MB (~-2%). Full suite (811 SIL.Machine + 63 HC) green.
    
    The remaining per-instance allocation (the Register[,] array, ~1.17M/word) is the
    bigger prize but is blocked here: the `traversed` dedup key holds each instance's
    register array BY REFERENCE, so a shared register arena (slices reused across
    instances) would corrupt dedup. Cutting it needs the deep de-iterator + snapshot-
    dedup rewrite, not a drop-in.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY lever 1 (deep): de-iterator Advance/Initialize into a reusable buffer
    
    The core of the scaffold rewrite. Advance was a yield-based iterator and
    Initialize allocated a fresh List per call (both recursive), so each of the
    ~2,482 Transduce/word -> millions of Advance calls minted an iterator state
    machine / List. Both now fill ONE reusable per-method result buffer instead.
    
    Safety: the buffer is a per-method (per-Transduce) field, so it carries no
    cross-word retention (the Phase-1b Gen2 parallel regression) and cannot be a
    thread-static (CheckAccepting's Acceptable predicate can re-enter Transduce on
    the same thread). Initialize fills it once at the start of Traverse and the
    caller fully consumes it building the work stack before the main loop's first
    Advance reuses it, so the two never overlap (one buffer serves both). Advance is
    not re-entrant within a method. Byte-identical: same results, same order.
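
The buffer discipline can be sketched in Python (the real traversal methods are C#). The class and the arc table are hypothetical, and the example graph is acyclic so that no visited set is needed; what matters is that `advance` refills one per-method list and the caller drains it before calling `advance` again.

```python
class TraversalMethod:
    def __init__(self, arcs):
        self._arcs = arcs   # state -> list of successor states
        self._buffer = []   # one reusable result buffer per method instance

    def advance(self, state):
        """Fill the shared buffer with successors instead of yielding them."""
        self._buffer.clear()
        for target in self._arcs.get(state, ()):
            self._buffer.append(target)
        return self._buffer

    def traverse(self, start):
        stack = [start]
        order = []
        while stack:
            state = stack.pop()
            order.append(state)
            # extend() fully consumes the buffer before the next advance() reuses it.
            stack.extend(self.advance(state))
        return order
```

Because the buffer is a field of the method object rather than a thread-static, a nested traversal on the same thread gets its own buffer and nothing outlives the call.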
    
    Measured (SenaQuick): total 5,145 -> 5,029 MB (-116 MB, ~-2.3%); the per-call
    iterator state machines (Scaffold -147 MB) replaced by one buffer List/method
    (+~39 MB in TraversalMethod after merging the two buffers into one). Full suite
    (811 SIL.Machine + 63 HC, incl. concurrent-determinism) green.
    
    NOTE on the register stackalloc premise: it does NOT apply to the nondeterministic
    matcher (the hot path). The `traversed` dedup retains a per-config register
    snapshot during each Transduce, so the registers are not transient stack values -
    they're the evolving, snapshotted match state. The achievable scaffold wins are
    therefore the iterator garbage (this commit) + the Visited HashSet (prior), not
    stackalloc'd registers.
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    
    RUSTIFY: record lever-1 deep rewrite (Visited bitset + de-iterator, ~-6% Sena, byte-identical) + the register-stackalloc constraint
    
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    johnml1135 and claude committed Jun 30, 2026
    ea4538c