Skip to content

RUSTIFY: allocation-optimized HermitCrab parsing (flat/COW Shape + FST)#446

Open
johnml1135 wants to merge 1 commit into
masterfrom
hc-rustify
Open

RUSTIFY: allocation-optimized HermitCrab parsing (flat/COW Shape + FST)#446
johnml1135 wants to merge 1 commit into
masterfrom
hc-rustify

Conversation

@johnml1135

@johnml1135 johnml1135 commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

What this is

Allocation optimization of HermitCrab morphological parsing so bulk "Parse All Words" scales further. Parse results are unchanged (byte-identical) — guarded by the existing HermitCrab regression suite plus new unit tests. Rebased onto latest master; 1 commit, 100 files (+3180/-1405), purely the optimization (no unrelated work, no profiling scaffolding).

Review by change group

Reviewers can approve group-by-group; each maps to a coherent slice of the diff.

# Group Files What & why
A Flat Shape/ShapeNode backing Annotations/Shape.cs, ShapeNode.cs, Annotation.cs, DataStructures/DataStructuresExtensions.cs Shape owns nodes in parallel int-linked backing arrays instead of an OrderedBidirList; ShapeNode becomes an (Owner, Index) handle. Foundation for array-copy clone + int FST offset.
B Copy-on-write Shape + BidirList flattening Annotations/Shape.cs (COW), AnnotationList.cs, DataStructures/BidirList.cs, BidirListNode.cs Cloning a frozen shape shares the source via the int projection, inflating lazily only on handle/mutation (cuts Word.Clone ~60%). Skip-list towers grow on demand + inline level 0.
C FeatureStruct ulong flat-unify FeatureModel/FeatureStruct.cs, SymbolicFeature.cs, SymbolicFeatureValue.cs, UlongSymbolicFeatureValueFlags.cs, FiniteState/Input.cs, VariableBindings.cs Frozen structs lazily cache a bit-packed ulong[] vector; Input.Matches takes a bitwise fast path for the simple/no-variable case, falling back to the original engine otherwise (→ byte-identical).
D int-offset FST traversal FiniteState/Fst.cs, TraversalMethodBase.cs, the 4 traversal method/instance files, ITraversalMethod.cs, VisitedStates.cs (new) FST binds as Fst<Word,int> (dense offset via the lazy AnnotationList<int> projection), with a reusable per-Transduce register scaffold and an inline-ulong VisitedStates epsilon-loop set.
E HermitCrab rule-spec adaptations ~46 files under Morphology.HermitCrab/ (MorphologicalRules/*, PhonologicalRules/*, Word.cs, HermitCrabExtensions.cs, loaders…) Flip every Pattern/Match/Matcher from ShapeNode to int offsets; resolve int → ShapeNode at rule-RHS navigation sites. Lands atomically (can't be green mid-flip).
G Single-thread option Morpher.cs, AnalysisStratumRule.cs, AnalysisAffixTemplateRule.cs, Rules/CombinationRuleCascade.cs, ParallelCombinationRuleCascade.cs MaxDegreeOfParallelism == 1 selects the serial cascade; the parallel cascade honors the DOP cap.
I Tests Annotations/AnnotationTests.cs (new), FeatureModel/FeatureStructTests.cs, and rule-test updates New COW/flat-structure + 64-symbol-mask regression + single-thread-vs-parallel equivalence + concurrent-determinism tests; existing rule tests migrated to the int-offset API.

Tests — adequate to merge

The two highest-risk defects have targeted, fix-sensitive regression tests: the 64-symbol ulong-mask boundary (fails without the fix) and COW source-corruption (asserts the frozen source is byte-for-byte unchanged after clone mutation). Plus a 50×250 concurrent-determinism stress test. The 63 migrated rule tests assert exact analyses/generations through the new int-offset API (in-CI output-equivalence). 825 SIL.Machine + 63 HermitCrab tests pass; CI green (ubuntu + windows).

Fast-follow test gaps (not blockers): direct integrity assertions for flat-array growth past capacity, skip-list margin growth, and a committed golden-signature equivalence file.

⚠️ Breaking API change

The morphological-rule pattern type changed Pattern<Word, ShapeNode>Pattern<Word, int> (affects AffixProcessAllomorph.Lhs, AnalysisMorphologicalTransform, MorphologicalTransform.GenerateShape, Match<Word,ShapeNode>). This breaks only code that constructs HC rules programmatically (FieldWorks' HCLoader.cs — a mechanical ShapeNode→int migration when it bumps the package). The Morpher/Language/ParseWord/AnalyzeWord and XML grammar-load APIs are unchanged, so load-and-parse consumers are unaffected. Shape/ShapeNode also moved from deriving the concrete OrderedBidirList(Node) to implementing the pre-existing interface (all members preserved; nothing depends on the concrete base) — practically non-breaking. Suggest releasing as a new major version.

Pared back for this PR

Removed the profiling apparatus and scaffolding that the optimization was developed with (preserved on a local branch): allocation-instrumentation (MorpherStatistics/FstStatistics + hot-path call sites), the [Explicit] benchmarks/harnesses (RustifyBenchmark/MorpherBenchmark/CompareBench/WorkerSlice), and the RUSTIFY dev-journal docs. FeatureStruct.FlatUnifyEnabled privatized to internal. The out-of-process HermitCrab worker that was prototyped here moved to the FieldWorks repo.

🤖 Generated with Claude Code


This change is Reviewable

@codecov-commenter

codecov-commenter commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 83.92089% with 187 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.74%. Comparing base (7835067) to head (e910a55).

Files with missing lines Patch % Lines
src/SIL.Machine/Annotations/Shape.cs 79.05% 51 Missing and 11 partials ⚠️
src/SIL.Machine/FiniteState/VisitedStates.cs 19.35% 23 Missing and 2 partials ⚠️
...Machine/DataStructures/DataStructuresExtensions.cs 57.50% 16 Missing and 1 partial ⚠️
.../FiniteState/NondeterministicFstTraversalMethod.cs 67.39% 15 Missing ⚠️
src/SIL.Machine/Annotations/ShapeNode.cs 63.63% 8 Missing and 4 partials ⚠️
src/SIL.Machine/FiniteState/Fst.cs 83.01% 8 Missing and 1 partial ⚠️
...hine.Morphology.HermitCrab/HermitCrabExtensions.cs 80.55% 5 Missing and 2 partials ⚠️
src/SIL.Machine/FeatureModel/FeatureStruct.cs 94.79% 4 Missing and 1 partial ⚠️
.../FiniteState/NondeterministicFsaTraversalMethod.cs 80.76% 4 Missing and 1 partial ⚠️
...initeState/NondeterministicFstTraversalInstance.cs 44.44% 5 Missing ⚠️
... and 12 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #446      +/-   ##
==========================================
+ Coverage   73.30%   73.74%   +0.43%     
==========================================
  Files         443      444       +1     
  Lines       37203    37776     +573     
  Branches     5110     5234     +124     
==========================================
+ Hits        27272    27858     +586     
+ Misses       8805     8785      -20     
- Partials     1126     1133       +7     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@johnml1135 johnml1135 force-pushed the hc-rustify branch 2 times, most recently from ee43e52 to 6c91b71 Compare July 1, 2026 16:31
…t-offset FST)

Cuts per-parse allocation in HermitCrab morphological parsing so bulk "Parse All
Words" scales further, without changing parse results (byte-identical - guarded by the
existing HermitCrab regression suite plus new unit tests).

Core changes:
- Flat, array-backed Shape/ShapeNode: Shape owns nodes in parallel int-linked backing
  arrays instead of an OrderedBidirList; ShapeNode becomes an (Owner, Index) handle.
  Shape/ShapeNode now implement IOrderedBidirList(Node)<ShapeNode> rather than deriving
  the concrete OrderedBidirList(Node) class - all members/constructors preserved.
- Copy-on-write Shape: cloning a frozen shape shares the source via the int projection
  and inflates lazily only when a handle/mutator is touched - cuts Word.Clone ~60%.
- BidirList skip-list towers grow on demand and inline level 0, so shallow nodes
  allocate no tower array.
- FeatureStruct ulong flat-unify: frozen structs lazily cache a bit-packed ulong[]
  vector and Input.Matches takes a bitwise fast path for the simple/no-variable case,
  falling back to the original engine otherwise.
- FST traversal binds as Fst<Word, int> (dense offset via the lazy AnnotationList<int>
  projection), with a reusable per-Transduce register scaffold and an inline-ulong
  VisitedStates epsilon-loop set.
- HermitCrab morphological/phonological rule specs adapted from <Word, ShapeNode> to
  <Word, int>, resolving int -> ShapeNode at the rule-RHS navigation sites.

Tests: new copy-on-write and flat-structure unit tests (AnnotationTests; FeatureStruct
COW tests + a 64-symbol ulong-mask regression), a single-threaded-vs-parallel analysis
equivalence test, and a concurrent-determinism stress test; existing HermitCrab rule
tests migrated to the int-offset API.

Note: the morphological-rule pattern type changed from Pattern<Word, ShapeNode> to
Pattern<Word, int> - a breaking change for code that constructs HC rules programmatically
(the Morpher/Language/ParseWord/AnalyzeWord and grammar-load APIs are unchanged).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@johnml1135

Copy link
Copy Markdown
Collaborator Author

Added tests to cover the largest patch-coverage gaps in the rewritten FST code (no coverage regression):

  • FstTests.TransduceNondeterministic_MatchesDeterminized — transduces a raw (non-determinized) FST directly and asserts it accepts the same outputs as its Determinize()d form. This exercises NondeterministicFstTraversalMethod (was 0% → ~71%) and NondeterministicFstTraversalInstance (0% → ~86%), which no existing test hit because every other transduce path determinizes first.
  • AnnotationTests.LargeShape_GrowsBackingArrays_PreservesLinkAndCloneIntegrity — builds a 50-node shape (crossing several backing-array doublings), then asserts forward/backward link integrity, NodeAt/OffsetOf handle round-trips, mid-list insert/remove, and clone value-equality.

Project coverage was already non-regressing (+0.07% vs base); these target the new/rewritten lines specifically. Note: raw nondeterministic transduce of an epsilon-input FST diverges from the determinized result (production always determinizes first), so that path is intentionally left out of the equivalence oracle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants