#712 Read Variant metadata offset-size from the correct header bits #716
Open
gunnarmorling wants to merge 4 commits into
Conversation
The metadata header's offset_size_minus_one field lives in bits 6-7, but it was read from bits 5-6. Every existing fixture uses offset_size=1, where both readings yield 0, so the bug stayed invisible across the whole suite; any Variant whose dictionary string section exceeds 255 bytes (needing offset_size >= 2) was misparsed into garbage.

A 4-byte dictionary_size with the high bit set also reads back as a negative int; guard against it so a corrupt size fails with a clear message instead of driving later arithmetic.

The regression fixture is a real file generated via simple-datagen with a 320-byte dictionary (offset_size=2), read end-to-end through the VARIANT row API; both new tests fail against the pre-fix shift.
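To see why the wrong shift went unnoticed, here is a minimal sketch of the header decoding. It is not the project's VariantBinary code: the class and method names are hypothetical, and the placement of the version in bits 0-3 is taken from the Variant encoding spec rather than from this PR.

```java
// Hypothetical illustration of the metadata header bit layout; not the project's code.
final class MetadataHeaderSketch {
    static final int OFFSET_SIZE_MASK = 0x3;

    /**
     * Builds a header byte: version in bits 0-3 (assumed from the spec),
     * sorted_strings in bit 4, bit 5 unused, offset_size_minus_one in bits 6-7.
     */
    static int header(int version, boolean sortedStrings, int offsetSize) {
        return (version & 0x0F) | (sortedStrings ? 1 << 4 : 0) | ((offsetSize - 1) << 6);
    }

    /** Decodes offset_size using the given shift (5 = pre-fix, 6 = fixed). */
    static int offsetSize(int header, int shift) {
        return ((header >>> shift) & OFFSET_SIZE_MASK) + 1;
    }

    public static void main(String[] args) {
        int sizeOne = header(1, false, 1); // 0x01
        int sizeTwo = header(1, false, 2); // 0x41
        // offset_size = 1: both shifts agree, so the bug is invisible.
        System.out.println(offsetSize(sizeOne, 5) + " " + offsetSize(sizeOne, 6)); // prints "1 1"
        // offset_size = 2: the pre-fix shift decodes 3 instead of 2.
        System.out.println(offsetSize(sizeTwo, 5) + " " + offsetSize(sizeTwo, 6)); // prints "3 2"
    }
}
```

With the pre-fix shift, a header written for 2-byte offsets is decoded as 3-byte offsets, so every subsequent dictionary offset is read from the wrong position.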
iifawzi
reviewed
Jun 27, 2026
// bit 4: sorted_strings flag
// bit 5-6: offset_size_minus_one (0..3 → 1..4 bytes)
// bit 7: unused
// bit 5: unused
Contributor
noice noice, i was confused as it's not specified in the docs, but found the correction here apache/parquet-format#574
iifawzi
approved these changes
Jun 27, 2026
iifawzi
left a comment
Contributor
Looks good to me, nice catch, seems like it was gunnar vs variant today, seeing a couple of bugs!
Collaborator
Author
e3c96b5 to
ca13218
Compare
Point both spec references on VariantBinary at the rendered spec page: the bit-layout comment links the "Metadata encoding" section (offset_size in bits 6-7, bit 5 reserved), and the class @see now uses the same rendered page. Also fix a stale "bit 5-6" reference in the VariantMetadata layout javadoc.
The class javadoc claimed field lookup uses a binary search over numerically sorted ids, but object field ids are ordered by field name, not numerically; the code correctly does a linear scan (as indexOf's own javadoc explains). Align the class javadoc so the misleading claim can't invite a broken "optimization".
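A small sketch of why a binary search over the ids would be wrong. The class, the sample dictionary, and this indexOf are hypothetical stand-ins, not the project's implementation; the only assumption carried over from the commit message is that an object's field ids are ordered by field name.

```java
import java.util.Arrays;

// Hypothetical illustration; the real lookup lives in the project's Variant classes.
final class FieldIdLookupSketch {
    /** Linear scan: returns the position of fieldId within the object's id list, or -1. */
    static int indexOf(int[] fieldIds, int fieldId) {
        for (int i = 0; i < fieldIds.length; i++) {
            if (fieldIds[i] == fieldId) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumed dictionary: id 0 = "c", id 1 = "a", id 2 = "b".
        // Ordered by field name (a, b, c) the object's ids are 1, 2, 0: not numerically sorted.
        int[] fieldIds = {1, 2, 0};
        System.out.println(indexOf(fieldIds, 0));             // prints 2
        System.out.println(Arrays.binarySearch(fieldIds, 0)); // prints -1: the id is present but missed
    }
}
```

Arrays.binarySearch requires its input to be sorted by the search key, which is exactly the property the id list lacks.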
ca13218 to
ef422dc
Compare
Pin the exact exception text for the negative dictionary_size guard instead of a substring, and drop the mechanical bit-level commentary on the guard and the offset-size fixture test in favour of what they actually convey.
Fixes #712.
Problem
The Variant metadata header's offset_size_minus_one field lives in bits 6-7, but VariantBinary read it from bits 5-6 (METADATA_OFFSET_SIZE_SHIFT = 5). Every existing fixture uses offset_size = 1, where both readings yield 0, so the bug was invisible across the whole suite (125 byte-exact shredded-variant cases + the comparison sweep). Any Variant whose dictionary string section exceeds 255 bytes, needing offset_size >= 2, was misparsed into garbage.

It surfaced via the negative_dictionary_size fixture from apache/parquet-testing#113, which uses offset_size = 4.

Fix
METADATA_OFFSET_SIZE_SHIFT 5 → 6 (+ corrected the layout comment).
dictionary_size that reads back as a negative int: reject with a clear "not a valid unsigned int" message rather than letting the bogus size drive later arithmetic. (The (dictSize+1)*offsetSize overflow widening is intentionally left to Unguarded 32-bit overflow in Variant object/array offset arithmetic #713 to avoid overlap.)

Tests
tools/simple-datagen.py: variant_metadata_offset_size2.parquet, a VARIANT object whose metadata dictionary is 320 bytes, forcing offset_size = 2.
VariantLogicalTypeTest.readsVariantWhoseMetadataUsesOffsetSizeTwo reads it through the VARIANT row API and resolves both long-named fields ('a'*160 → INT8 5, 'b'*160 → BOOLEAN_TRUE).
VariantMetadataTest.negativeDictionarySizeRejected for the guard.

Notes
internal packages: no public API or config change, so no usage-docs update.
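The negative dictionary_size guard described under Fix could look roughly like the sketch below. Everything here is an assumption made for illustration: the class and method names, the exception type, and the exact message wording are hypothetical, and the little-endian read follows the Variant encoding spec rather than this PR's code.

```java
// Hypothetical sketch of the guard; names, exception type, and message text are assumptions.
final class DictionarySizeGuardSketch {
    /** Reads a little-endian unsigned value of offsetSize bytes (1..4) starting at pos. */
    static int readDictionarySize(byte[] metadata, int pos, int offsetSize) {
        long value = 0;
        for (int i = 0; i < offsetSize; i++) {
            value |= (metadata[pos + i] & 0xFFL) << (8 * i);
        }
        int size = (int) value;
        // A 4-byte size with the high bit set wraps to a negative int; fail early
        // instead of letting it feed later offset arithmetic.
        if (size < 0) {
            throw new IllegalArgumentException(
                    "dictionary_size " + value + " is not a valid unsigned int");
        }
        return size;
    }

    public static void main(String[] args) {
        byte[] ok = {0x40, 0x01}; // 0x0140 little-endian
        System.out.println(readDictionarySize(ok, 0, 2)); // prints 320

        byte[] corrupt = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
        try {
            readDictionarySize(corrupt, 0, 4);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Accumulating into a long first keeps the original unsigned value available for the error message, which makes a corrupt file easier to diagnose than a bare negative number.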