RUSTIFY: allocation-optimized HermitCrab parsing (flat/COW Shape + FST)#446
Open
johnml1135 wants to merge 1 commit into
Open
RUSTIFY: allocation-optimized HermitCrab parsing (flat/COW Shape + FST)#446johnml1135 wants to merge 1 commit into
johnml1135 wants to merge 1 commit into
Conversation
ee43e52 to
6c91b71
Compare
…t-offset FST) Cuts per-parse allocation in HermitCrab morphological parsing so bulk "Parse All Words" scales further, without changing parse results (byte-identical - guarded by the existing HermitCrab regression suite plus new unit tests). Core changes: - Flat, array-backed Shape/ShapeNode: Shape owns nodes in parallel int-linked backing arrays instead of an OrderedBidirList; ShapeNode becomes an (Owner, Index) handle. Shape/ShapeNode now implement IOrderedBidirList(Node)<ShapeNode> rather than deriving the concrete OrderedBidirList(Node) class - all members/constructors preserved. - Copy-on-write Shape: cloning a frozen shape shares the source via the int projection and inflates lazily only when a handle/mutator is touched - cuts Word.Clone ~60%. - BidirList skip-list towers grow on demand and inline level 0, so shallow nodes allocate no tower array. - FeatureStruct ulong flat-unify: frozen structs lazily cache a bit-packed ulong[] vector and Input.Matches takes a bitwise fast path for the simple/no-variable case, falling back to the original engine otherwise. - FST traversal binds as Fst<Word, int> (dense offset via the lazy AnnotationList<int> projection), with a reusable per-Transduce register scaffold and an inline-ulong VisitedStates epsilon-loop set. - HermitCrab morphological/phonological rule specs adapted from <Word, ShapeNode> to <Word, int>, resolving int -> ShapeNode at the rule-RHS navigation sites. Tests: new copy-on-write and flat-structure unit tests (AnnotationTests; FeatureStruct COW tests + a 64-symbol ulong-mask regression), a single-threaded-vs-parallel analysis equivalence test, and a concurrent-determinism stress test; existing HermitCrab rule tests migrated to the int-offset API. Note: the morphological-rule pattern type changed from Pattern<Word, ShapeNode> to Pattern<Word, int> - a breaking change for code that constructs HC rules programmatically (the Morpher/Language/ParseWord/AnalyzeWord and grammar-load APIs are unchanged). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collaborator
Author
|
Added tests to cover the largest patch-coverage gaps in the rewritten FST code (no coverage regression):
Project coverage was already non-regressing (+0.07% vs base); these target the new/rewritten lines specifically. Note: raw nondeterministic transduce of an epsilon-input FST diverges from the determinized result (production always determinizes first), so that path is intentionally left out of the equivalence oracle. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this is
Allocation optimization of HermitCrab morphological parsing so bulk "Parse All Words" scales further. Parse results are unchanged (byte-identical) — guarded by the existing HermitCrab regression suite plus new unit tests. Rebased onto latest
master; 1 commit, 100 files (+3180/-1405), purely the optimization (no unrelated work, no profiling scaffolding).Review by change group
Reviewers can approve group-by-group; each maps to a coherent slice of the diff.
Annotations/Shape.cs,ShapeNode.cs,Annotation.cs,DataStructures/DataStructuresExtensions.csShapeowns nodes in parallel int-linked backing arrays instead of anOrderedBidirList;ShapeNodebecomes an(Owner, Index)handle. Foundation for array-copy clone + int FST offset.Annotations/Shape.cs(COW),AnnotationList.cs,DataStructures/BidirList.cs,BidirListNode.csWord.Clone~60%). Skip-list towers grow on demand + inline level 0.FeatureModel/FeatureStruct.cs,SymbolicFeature.cs,SymbolicFeatureValue.cs,UlongSymbolicFeatureValueFlags.cs,FiniteState/Input.cs,VariableBindings.csulong[]vector;Input.Matchestakes a bitwise fast path for the simple/no-variable case, falling back to the original engine otherwise (→ byte-identical).FiniteState/Fst.cs,TraversalMethodBase.cs, the 4 traversal method/instance files,ITraversalMethod.cs,VisitedStates.cs(new)Fst<Word,int>(dense offset via the lazyAnnotationList<int>projection), with a reusable per-Transduceregister scaffold and an inline-ulongVisitedStatesepsilon-loop set.Morphology.HermitCrab/(MorphologicalRules/*,PhonologicalRules/*,Word.cs,HermitCrabExtensions.cs, loaders…)Pattern/Match/MatcherfromShapeNodetointoffsets; resolveint → ShapeNodeat rule-RHS navigation sites. Lands atomically (can't be green mid-flip).Morpher.cs,AnalysisStratumRule.cs,AnalysisAffixTemplateRule.cs,Rules/CombinationRuleCascade.cs,ParallelCombinationRuleCascade.csMaxDegreeOfParallelism == 1selects the serial cascade; the parallel cascade honors the DOP cap.Annotations/AnnotationTests.cs(new),FeatureModel/FeatureStructTests.cs, and rule-test updatesTests — adequate to merge
The two highest-risk defects have targeted, fix-sensitive regression tests: the 64-symbol
ulong-mask boundary (fails without the fix) and COW source-corruption (asserts the frozen source is byte-for-byte unchanged after clone mutation). Plus a 50×250 concurrent-determinism stress test. The 63 migrated rule tests assert exact analyses/generations through the new int-offset API (in-CI output-equivalence). 825 SIL.Machine + 63 HermitCrab tests pass; CI green (ubuntu + windows).Fast-follow test gaps (not blockers): direct integrity assertions for flat-array growth past capacity, skip-list margin growth, and a committed golden-signature equivalence file.
The morphological-rule pattern type changed
Pattern<Word, ShapeNode>→Pattern<Word, int>(affectsAffixProcessAllomorph.Lhs,AnalysisMorphologicalTransform,MorphologicalTransform.GenerateShape,Match<Word,ShapeNode>). This breaks only code that constructs HC rules programmatically (FieldWorks'HCLoader.cs— a mechanicalShapeNode→intmigration when it bumps the package). TheMorpher/Language/ParseWord/AnalyzeWordand XML grammar-load APIs are unchanged, so load-and-parse consumers are unaffected.Shape/ShapeNodealso moved from deriving the concreteOrderedBidirList(Node)to implementing the pre-existing interface (all members preserved; nothing depends on the concrete base) — practically non-breaking. Suggest releasing as a new major version.Pared back for this PR
Removed the profiling apparatus and scaffolding that the optimization was developed with (preserved on a local branch): allocation-instrumentation (
MorpherStatistics/FstStatistics+ hot-path call sites), the[Explicit]benchmarks/harnesses (RustifyBenchmark/MorpherBenchmark/CompareBench/WorkerSlice), and the RUSTIFY dev-journal docs.FeatureStruct.FlatUnifyEnabledprivatized tointernal. The out-of-process HermitCrab worker that was prototyped here moved to the FieldWorks repo.🤖 Generated with Claude Code
This change is