Skip to content

#712 Read Variant metadata offset-size from the correct header bits#716

Open
gunnarmorling wants to merge 4 commits into
mainfrom
712-variant-metadata-offset-size
Open

#712 Read Variant metadata offset-size from the correct header bits#716
gunnarmorling wants to merge 4 commits into
mainfrom
712-variant-metadata-offset-size

Conversation

@gunnarmorling

Copy link
Copy Markdown
Collaborator

Fixes #712.

Problem

The Variant metadata header's offset_size_minus_one field lives in bits 6-7, but VariantBinary read it from bits 5-6 (METADATA_OFFSET_SIZE_SHIFT = 5). Every existing fixture uses offset_size = 1, where both readings yield 0, so the bug was invisible across the whole suite (125 byte-exact shredded-variant cases + the comparison sweep). Any Variant whose dictionary string section exceeds 255 bytes — needing offset_size >= 2 — was misparsed into garbage.

It surfaced via the negative_dictionary_size fixture from apache/parquet-testing#113, which uses offset_size = 4.

Fix

  • METADATA_OFFSET_SIZE_SHIFT 5 → 6 (+ corrected the layout comment).
  • Guard against a 4-byte dictionary_size that reads back as a negative int: reject with a clear "not a valid unsigned int" message rather than letting the bogus size drive later arithmetic. (The (dictSize+1)*offsetSize overflow widening is intentionally left to Unguarded 32-bit overflow in Variant object/array offset arithmetic #713 to avoid overlap.)

Tests

  • Real end-to-end fixture via tools/simple-datagen.py: variant_metadata_offset_size2.parquet — a VARIANT object whose metadata dictionary is 320 bytes, forcing offset_size = 2. VariantLogicalTypeTest.readsVariantWhoseMetadataUsesOffsetSizeTwo reads it through the VARIANT row API and resolves both long-named fields ('a'*160 → INT8 5, 'b'*160 → BOOLEAN_TRUE).
  • VariantMetadataTest.negativeDictionarySizeRejected for the guard.
  • Both new tests fail against the pre-fix shift (verified) and pass with the fix.
  • No regressions: the 125 shredded-variant byte-exact tests plus the variant decoder/encoder, row-API, and conversion tests are green.

Notes

The metadata header's offset_size_minus_one field lives in bits 6-7, but it was
read from bits 5-6. Every existing fixture uses offset_size=1, where both
readings yield 0, so the bug stayed invisible across the whole suite; any
Variant whose dictionary string section exceeds 255 bytes (needing
offset_size >= 2) was misparsed into garbage.

A 4-byte dictionary_size with the high bit set also reads back as a negative
int; guard against it so a corrupt size fails with a clear message instead of
driving later arithmetic.

The regression fixture is a real file generated via simple-datagen with a
320-byte dictionary (offset_size=2), read end-to-end through the VARIANT row
API; both new tests fail against the pre-fix shift.
// bit 4: sorted_strings flag
// bit 5-6: offset_size_minus_one (0..3 → 1..4 bytes)
// bit 7: unused
// bit 5: unused

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noice noice, i was confused as it's not specified in the docs, but found the correction here apache/parquet-format#574

@iifawzi iifawzi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, nice catch, seems like it was gunnar vs variant today, seeing a couple of bugs!

@gunnarmorling

Copy link
Copy Markdown
Collaborator Author

Hehe, yeah; #711 has the overview. This was triggered by a mail on parquet-dev where I learned about a pending addition to parquet-testing for a range of malformed VARIANT files (context). One of the fixtures revealed that parsing bug here.

@gunnarmorling gunnarmorling force-pushed the 712-variant-metadata-offset-size branch 2 times, most recently from e3c96b5 to ca13218 Compare June 27, 2026 19:00
Point both spec references on VariantBinary at the rendered spec page: the
bit-layout comment links the "Metadata encoding" section (offset_size in bits
6-7, bit 5 reserved), and the class @see now uses the same rendered page. Also
fix a stale "bit 5-6" reference in the VariantMetadata layout javadoc.
The class javadoc claimed field lookup uses a binary search over numerically
sorted ids, but object field ids are ordered by field name, not numerically;
the code correctly does a linear scan (as indexOf's own javadoc explains).
Align the class javadoc so the misleading claim can't invite a broken
"optimization".
@gunnarmorling gunnarmorling force-pushed the 712-variant-metadata-offset-size branch from ca13218 to ef422dc Compare June 27, 2026 19:04
Pin the exact exception text for the negative dictionary_size guard instead of
a substring, and drop the mechanical bit-level commentary on the guard and the
offset-size fixture test in favour of what they actually convey.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Variant metadata offset-size read off by one bit (bits 6-7, not 5-6)

2 participants