GH-45948: [C++][Parquet] Variant shredding #50232
Open
qzyu999 wants to merge 3 commits into
This was referenced Jun 23, 2026
c92cb11 to 034ff49
Author
Force-pushed the shredding implementation built on top of the refactored decoding/encoding layers. This uses the idiomatic C++ view classes and RAII builder from the parent PRs. Highlights:
335 tests pass with |
Rationale for this change
Implements variant shredding/unshredding for C++ (GH-45948), part of the GH-45937 umbrella. Enables decomposing variant binary columns into native typed Arrow columns for Parquet statistics-based predicate pushdown. Depends on #50121 (decoding) and #50122 (encoding).
Note: This PR depends on #50121 and #50122. Please review/merge those first.
What changes are included in this PR?
Adds `variant_shredding.h`/`variant_shredding.cc` implementing:
- `VariantShreddingSchema`: tree structure defining shredding targets (Primitive, Object, Array). C++ equivalent of Rust's `ShreddedSchemaBuilder`.
- `IsVariantCompatibleWithType()`: strict type compatibility with safe int widening, Float→Double widening, timestamp unit+timezone matching, and decimal scale matching.
- `ShredVariantColumn()`: column-level shredding producing `{metadata, value, typed_value}`. Template-refactored loops (`ShredPrimitiveLoop<>`, `ShredBinaryLoop<>`) for all 15+ supported Arrow target types.
- `ReconstructVariantColumn()`: column-level reconstruction reassembling shredded columns back to variant binary. Supports all list-like typed_value types (List, LargeList, FixedSizeList, ListView, LargeListView).
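To make the widening rules concrete for reviewers, here is a minimal standalone sketch of what "safe int widening" and "Float→Double widening" could look like. It is not the code in this PR: the `Kind` enum, `IntWidth()`, and `IsCompatibleSketch()` are hypothetical names, and the sketch assumes that "strict" means integers never widen to floating point and that nothing ever narrows. Timestamp and decimal matching are left out.

```cpp
#include <cassert>

// Hypothetical physical kinds for illustration only; not the PR's type enum.
enum class Kind { Int8, Int16, Int32, Int64, Float, Double, String };

// Bit width of an integer kind, or 0 for non-integer kinds.
inline int IntWidth(Kind k) {
  switch (k) {
    case Kind::Int8: return 8;
    case Kind::Int16: return 16;
    case Kind::Int32: return 32;
    case Kind::Int64: return 64;
    default: return 0;
  }
}

// True when a variant value of kind `source` can be stored losslessly in a
// typed column of kind `target` under the assumptions stated above.
inline bool IsCompatibleSketch(Kind source, Kind target) {
  if (source == target) return true;
  const int source_width = IntWidth(source);
  const int target_width = IntWidth(target);
  if (source_width != 0 && target_width != 0) {
    return source_width <= target_width;  // widening only, never narrowing
  }
  // The only non-integer widening modeled here is Float -> Double.
  return source == Kind::Float && target == Kind::Double;
}

// A few example checks under the same assumptions.
inline void CheckWideningExamples() {
  assert(IsCompatibleSketch(Kind::Int16, Kind::Int64));
  assert(!IsCompatibleSketch(Kind::Int64, Kind::Int32));
  assert(IsCompatibleSketch(Kind::Float, Kind::Double));
  assert(!IsCompatibleSketch(Kind::Double, Kind::Float));
}
```

In this sketch a value that fails the check would simply stay in the `value` column as variant binary rather than being coerced into `typed_value`.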
Extends `VariantBuilder` with 3 methods for shredding support:
- `BuildWithoutMeta()`: produce value bytes without metadata (for primitives)
- `UnsafeAppendEncoded()`: zero-copy raw byte append
- `SetAllowDuplicates(true)`: last-value-wins dedup for reconstruction safety

Supported shredding targets (Rust parity):
Bool, Int8, Int16, Int32, Int64, Float, Double, String, LargeString, StringView,
Binary, LargeBinary, BinaryView, Date32, Timestamp(Micro/Nano, TZ/NTZ), Time64(Micro),
FixedSizeBinary(16) (UUID), Decimal128 (scale-matched)
Variant::Null semantics (Rust parity): Variant::Null (0x00) is stored in the
value column, NOT the typed_value column. Distinguishes variant-null from SQL NULL.
NullBuffer output (Rust parity): Optional `out_null_bitmap` parameter on `ReconstructVariantColumn` for SQL NULL disambiguation (bit=0 where both value and typed_value are null).
Known gaps (documented TODOs for follow-up PRs):
`{metadata, typed_value}` without `value`)
Are these changes tested?
335 total tests (114 new shredding + 221 prior) pass with `BUILD_WARNING_LEVEL=CHECKIN`, covering: schema definition, type compatibility, primitive round-trip for all supported
types, object shredding (full/partial/fallback), array shredding (recursive elements),
typed round-trip (Decimal128, UUID, all timestamps, Float→Double, Int8/Int16, LargeString,
LargeBinary, StringView, BinaryView), all list-like reconstruction, error cases, and
NullBitmap semantics.
Are there any user-facing changes?
New public API in `arrow/extension/variant_shredding.h`: `VariantShreddingSchema`, `IsVariantCompatibleWithType()`, `ShredVariantColumn()`, `ReconstructVariantColumn()`.
New methods on `VariantBuilder`: `BuildWithoutMeta()`, `UnsafeAppendEncoded()`, `SetAllowDuplicates()`.
AI Disclosure: AI coding assistants were used during development for scaffolding,
test generation, and review iteration. All code has been reviewed, debugged, and
verified by the author who owns and understands the changes.