Port 'Adjust placement of paragraph markers' from machine.py #435
@copilot The build is failing.
Fixed in the latest commit — the two new files ( |
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #435      +/-   ##
==========================================
+ Coverage   73.20%   73.31%   +0.10%
==========================================
  Files         442      443       +1
  Lines       37017    37190     +173
  Branches     5091     5110      +19
==========================================
+ Hits        27100    27264     +164
- Misses       8797     8800       +3
- Partials     1120     1126       +6
Enkidu93 left a comment
@Enkidu93 reviewed 4 files and all commit messages, and made 1 comment.
Reviewable status: complete! All files reviewed, all discussions resolved (waiting on ddaspit).
@copilot Rebase the PR on the latest changes.
Done — merged the latest master into the branch (commit |
2749287 to bfa9987
ddaspit left a comment
@ddaspit reviewed 4 files and all commit messages, and made 1 comment.
Reviewable status: complete! All files reviewed, all discussions resolved (waiting on copilot[bot]).
Ports machine.py#298 — after alignment-based placement of paragraph markers, apply small boundary adjustments to produce more natural splits (e.g. keeping a trailing comma with its sentence rather than letting it open the next paragraph).
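To make the comma example concrete, here is a minimal Python sketch of the kind of token-level boundary adjustment described above. It is an illustration only, not the code in this PR or in machine.py: the function name `adjust_boundary`, the convention that a boundary is the index of the first token of the next segment, and the exact punctuation set are all assumptions.

```python
from typing import Sequence

# Assumed punctuation set, loosely based on the PR description
# (, ; . ? ! plus closing quotes/brackets). Straight quotes are left
# out because they can be opening or closing.
TRAILING_PUNCT = frozenset(",;.?!)]}”’»")


def adjust_boundary(boundary: int, tokens: Sequence[str]) -> int:
    """Move a boundary right past punctuation that would open the next segment.

    `boundary` is the index of the first token of the next segment
    (hypothetical convention for this sketch).
    """
    # A boundary at the very start or end has no neighbouring segment to adjust.
    if boundary <= 0 or boundary >= len(tokens):
        return boundary
    # Each punctuation token at the head of the next segment joins the current one.
    while boundary < len(tokens) and tokens[boundary] in TRAILING_PUNCT:
        boundary += 1
    return boundary


tokens = ["Este", "texto", "está", "en", "inglés", ",", "y", "esta", "prueba"]
print(adjust_boundary(5, tokens))  # 6: the comma now closes the first segment
```

Working on token indices rather than string offsets keeps the adjustment independent of how the tokens are later joined back into text.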
New:
SegmentBoundaryAdjuster
Two new classes in SegmentBoundaryAdjuster.cs:
- TokenRejoiner: reconstructs token lists into strings with correct punctuation spacing (no space before a comma, period, or closing quote; no space after an opening bracket or quote).
- SegmentBoundaryAdjuster: adjusts a segment boundary by moving punctuation (, ; . ? ! and closing quotes/brackets) from the head of the next segment to the tail of the current one.
  - AdjustTokenizedSegmentPairBoundaries(int boundary, IReadOnlyList<string> tokens): token-index variant used by the handler
Change:
PlaceMarkersUsfmUpdateBlockHandler
After PredictMarkerLocation, paragraph markers now go through AdjustTokenizedSegmentPairBoundaries before their string index is resolved:
Before: alignment places \p before the comma → paragraph opens with ", y esta prueba…"
After: comma stays in the preceding paragraph → Este texto está en inglés,/\p y esta prueba…
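The step of resolving a string index from a token boundary can be sketched as follows. This is a hedged Python illustration under assumed names (`rejoin`, `boundary_to_string_index`) and assumed punctuation sets; the actual TokenRejoiner and handler logic in the PR may differ.

```python
from typing import List, Sequence

# Assumed character sets for this sketch only.
NO_SPACE_BEFORE = frozenset(",;.?!)]}”’»")
NO_SPACE_AFTER = frozenset("([{“‘«")


def rejoin(tokens: Sequence[str]) -> str:
    """Join tokens with single spaces, except around attaching punctuation."""
    parts: List[str] = []
    for i, tok in enumerate(tokens):
        # Insert a space unless this token attaches to the previous one
        # or the previous token attaches to this one.
        if i > 0 and tok not in NO_SPACE_BEFORE and tokens[i - 1] not in NO_SPACE_AFTER:
            parts.append(" ")
        parts.append(tok)
    return "".join(parts)


def boundary_to_string_index(boundary: int, tokens: Sequence[str]) -> int:
    """Character offset in rejoin(tokens) where the segment before `boundary` ends."""
    return len(rejoin(tokens[:boundary]))


tokens = ["Este", "texto", "está", "en", "inglés", ",", "y", "esta", "prueba"]
text = rejoin(tokens)
idx = boundary_to_string_index(6, tokens)  # boundary already moved past the comma
print(text[:idx])  # Este texto está en inglés,
print(text[idx:])  #  y esta prueba
```

Because the spacing before each token depends only on that token and its predecessor, the rejoined prefix is always a prefix of the full rejoined string, so the offset computed from the prefix is valid in the full text.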