GH-343: Fix BaseVariableWidthVector and BaseLargeVariableWidthVector offset buffer serialization#989
Conversation
This comment has been minimized.
This comment has been minimized.
jbonofre
left a comment
There was a problem hiding this comment.
It's the similar fix as for valueBuffer afair. It sounds good to me.
|
@Yicong-Huang sorry to be a pain, can you please rebase ? Thanks ! |
Thanks @jbonofre . I did rebase but I think the root error in this change is |
|
@Yicong-Huang can you please rebase again ? Sorry about that. Else I can do the rebase for you. |
Thanks. Just merged the latest master back in. |
|
@Yicong-Huang thanks a lot ! I would like to include this fix on Arrow Java 19.0.0 release 😄 |
|
Closes #343 |
Thanks that'd be nice! I can work with you closely on fixing this. However I see it is still failing CI, I think the original issue is still persist
Do you have any suggestions? |
|
I did a new pass on the PR and I don't think it's correct.
I think the problem is that the PR changes |
jbonofre
left a comment
There was a problem hiding this comment.
See my previous comment about the tests.
Keep constructor using allocator.getEmpty() instead of allocateOffsetBuffer(OFFSET_WIDTH) to avoid memory leaks in tests where vectors are created without being closed. The setReaderAndWriterIndex() reallocation logic handles the offset buffer sizing on demand. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…idth vectors Per Arrow spec, offset buffers must contain at least (valueCount + 1) entries. For empty vectors (valueCount=0), the offset buffer must still have one entry [0]. Changes: - BaseVariableWidthVector/BaseLargeVariableWidthVector: allocate offset buffer with OFFSET_WIDTH in constructor instead of using empty buffer - BaseLargeVariableWidthVector: set offset writerIndex in setReaderAndWriterIndex - NonNullableStructVector: close replaced vectors in initializeChildrenFromFields to prevent memory leaks when CONFLICT_REPLACE policy removes old vectors - Update test assertions to account for the additional 4 bytes from offset buffer - Fix memory leaks in tests by adding proper try-with-resources Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
What's Changed
Fix
BaseVariableWidthVector/BaseLargeVariableWidthVectorIPC serialization whenvalueCountis 0.Problem
When
valueCount == 0,setReaderAndWriterIndex()was settingoffsetBuffer.writerIndex(0), which meansreadableBytes() == 0. IPC serializer usesreadableBytes()to determine buffer size, so 0 bytes were written to the IPC stream. This crashes IPC readers in other libraries because Arrow spec requires offset buffer to have at least one entry[0].This is a follow-up to #967 which fixed the same issue in
ListVector/LargeListVector.Fix
Modify
setReaderAndWriterIndex()to always use(valueCount + 1) * OFFSET_WIDTHfor the offset buffer'swriterIndex, moved outside the if/else branch. When the offset buffer capacity is insufficient (e.g., empty buffer from constructor or loaded vialoadFieldBuffers()), it reallocates a properly sized buffer on demand.Testing
Added tests for empty
VarCharVectorandLargeVarCharVectorverifying offset buffer has correctreadableBytes()aftersetValueCount(0).Closes #343