GH-45948: [C++][Parquet] Variant shredding#50232

Open
qzyu999 wants to merge 3 commits into apache:main from qzyu999:variant-shredding-impl

@qzyu999 qzyu999 commented Jun 20, 2026

PR Stack (merge in order):

  1. GH-45946: [C++][Parquet] Variant decoding (#50121)
  2. GH-45947: [C++][Parquet] Variant encoding (#50122)
  3. GH-45948: [C++][Parquet] Variant shredding (#50232) ← YOU ARE HERE (this PR; depends on #50121 and #50122)

All three PRs are part of the GH-45937 umbrella (Add variant support to C++ Parquet).

Rationale for this change

Implements variant shredding/unshredding for C++ (GH-45948), part of the GH-45937 umbrella. Enables decomposing variant binary columns into native typed Arrow columns for Parquet statistics-based predicate pushdown. Depends on #50121 (decoding) and #50122 (encoding).

Note: This PR depends on #50121 and #50122. Please review/merge those first.

What changes are included in this PR?

Adds variant_shredding.h / variant_shredding.cc implementing:

  • VariantShreddingSchema — tree structure defining shredding targets (Primitive,
    Object, Array). C++ equivalent of Rust's ShreddedSchemaBuilder.
  • IsVariantCompatibleWithType() — strict type compatibility with safe int widening,
    Float→Double widening, timestamp unit+timezone matching, and decimal scale matching.
  • ShredVariantColumn() — column-level shredding producing {metadata, value, typed_value}.
    Template-refactored loops (ShredPrimitiveLoop<>, ShredBinaryLoop<>) for all
    15+ supported Arrow target types.
  • ReconstructVariantColumn() — column-level reconstruction reassembling shredded
    columns back to variant binary. Supports all list-like typed_value types (List,
    LargeList, FixedSizeList, ListView, LargeListView).
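
To illustrate the widening rules listed for IsVariantCompatibleWithType(), here is a minimal, self-contained sketch. It does not use the Arrow API: the `Kind` enum and the `IntWidth` and `IsCompatible` helpers are hypothetical names invented for this example, and the rules implemented in the PR (timestamps, decimals, and so on) are richer than what is shown.

```cpp
// Hypothetical, simplified stand-ins for variant and target types.
enum class Kind { kInt8, kInt16, kInt32, kInt64, kFloat, kDouble, kString };

// Bit width for the integer kinds, 0 for everything else.
inline int IntWidth(Kind k) {
  switch (k) {
    case Kind::kInt8: return 8;
    case Kind::kInt16: return 16;
    case Kind::kInt32: return 32;
    case Kind::kInt64: return 64;
    default: return 0;
  }
}

// Returns true when a variant value of kind `source` could be stored in a
// typed_value column of kind `target` without losing information.
inline bool IsCompatible(Kind source, Kind target) {
  if (source == target) return true;
  const int sw = IntWidth(source);
  const int tw = IntWidth(target);
  if (sw != 0 && tw != 0) return sw <= tw;  // safe integer widening only
  // Float -> Double is the only other widening modeled in this sketch.
  return source == Kind::kFloat && target == Kind::kDouble;
}
```

Under these assumptions an Int8 value fits an Int32 target, while Int64 into Int32 or Double into Float is rejected, so such rows would stay in the `value` column.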

Extends VariantBuilder with 3 methods for shredding support:

  • BuildWithoutMeta() — produce value bytes without metadata (for primitives)
  • UnsafeAppendEncoded() — zero-copy raw byte append
  • SetAllowDuplicates(true) — last-value-wins dedup for reconstruction safety
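
The last-value-wins behaviour described for SetAllowDuplicates(true) can be sketched independently of VariantBuilder. The `Field` alias and `DedupLastWins` function below are hypothetical and only model the dedup rule; how the builder stores and orders fields internally is not shown in this PR description.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using Field = std::pair<std::string, std::string>;  // (key, encoded value)

// Collapses duplicate keys, keeping the value that was appended last.
// The output is ordered by key because std::map iterates in sorted order.
inline std::vector<Field> DedupLastWins(const std::vector<Field>& fields) {
  std::map<std::string, std::string> latest;
  for (const Field& f : fields) {
    latest[f.first] = f.second;  // a later duplicate overwrites the earlier one
  }
  return std::vector<Field>(latest.begin(), latest.end());
}
```

With this rule, appending `b`, `a`, then `b` again yields two fields, and `b` carries the value from the second append.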

Supported shredding targets (Rust parity):
Bool, Int8, Int16, Int32, Int64, Float, Double, String, LargeString, StringView,
Binary, LargeBinary, BinaryView, Date32, Timestamp(Micro/Nano, TZ/NTZ), Time64(Micro),
FixedSizeBinary(16) (UUID), Decimal128 (scale-matched)

Variant::Null semantics (Rust parity): Variant::Null (0x00) is stored in the
value column, NOT the typed_value column. Distinguishes variant-null from SQL NULL.
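
A rough sketch of the routing this implies for a single Int64 target follows. The `Cell` and `ShreddedInt64` types and the `ShredInt64` function are hypothetical and do not reflect the signature of ShredVariantColumn(); the only detail taken from the description is that variant null is the single byte 0x00 and lives in the value column.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical row model: what a single variant cell may hold.
struct Cell {
  enum class Tag { kSqlNull, kVariantNull, kInt64, kOther } tag;
  std::int64_t int_value = 0;  // used when tag == kInt64
  std::string encoded;         // variant bytes, used when tag == kOther
};

struct ShreddedInt64 {
  std::vector<std::optional<std::string>> value;         // residual variant bytes
  std::vector<std::optional<std::int64_t>> typed_value;  // natively typed column
};

inline ShreddedInt64 ShredInt64(const std::vector<Cell>& cells) {
  ShreddedInt64 out;
  for (const Cell& c : cells) {
    switch (c.tag) {
      case Cell::Tag::kSqlNull:  // both columns null
        out.value.push_back(std::nullopt);
        out.typed_value.push_back(std::nullopt);
        break;
      case Cell::Tag::kVariantNull:  // 0x00 goes to value, not typed_value
        out.value.push_back(std::string(1, '\0'));
        out.typed_value.push_back(std::nullopt);
        break;
      case Cell::Tag::kInt64:  // matches the target type
        out.value.push_back(std::nullopt);
        out.typed_value.push_back(c.int_value);
        break;
      case Cell::Tag::kOther:  // incompatible type falls back to value
        out.value.push_back(c.encoded);
        out.typed_value.push_back(std::nullopt);
        break;
    }
  }
  return out;
}
```

Only the SQL NULL row leaves both columns null, which is what lets a reader tell it apart from variant null after shredding.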

NullBuffer output (Rust parity): Optional out_null_bitmap parameter on
ReconstructVariantColumn for SQL NULL disambiguation (bit=0 where both value and
typed_value are null).
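
The bitmap rule stated above (bit=0 only where both columns are null) can be sketched as follows. `BuildNullBitmap` is a hypothetical helper, not the PR's code; it takes per-row validity flags rather than Arrow arrays and packs bits least-significant first, which is the layout Arrow uses for validity bitmaps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs one validity bit per row, least-significant bit first.
// A bit is 0 (SQL NULL) only when both value and typed_value are null.
inline std::vector<std::uint8_t> BuildNullBitmap(
    const std::vector<bool>& value_valid,
    const std::vector<bool>& typed_valid) {
  assert(value_valid.size() == typed_valid.size());
  std::vector<std::uint8_t> bitmap((value_valid.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < value_valid.size(); ++i) {
    if (value_valid[i] || typed_valid[i]) {
      bitmap[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
    }
  }
  return bitmap;
}
```

For four rows where only rows 1 and 2 have a non-null value or typed_value, the single output byte is 0x06.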

Known gaps (documented TODOs for follow-up PRs):

  • Recursive Object/Array sub-schema shredding in object fields (primitives only currently)
  • CastOptions cross-type coercion (Uint, Float16, Decimal32/64, TimestampSecond/Milli)
  • FixedSizeList/ListView as shredding output targets (reconstruction accepts all)
  • Value-absent schemas ({metadata, typed_value} without value)
  • DECIMAL256 shredding target (compatibility check exists but shred/reconstruct not wired)

Are these changes tested?

335 total tests (114 new shredding + 221 prior) pass with BUILD_WARNING_LEVEL=CHECKIN
covering: schema definition, type compatibility, primitive round-trip for all supported
types, object shredding (full/partial/fallback), array shredding (recursive elements),
typed round-trip (Decimal128, UUID, all timestamps, Float→Double, Int8/Int16, LargeString,
LargeBinary, StringView, BinaryView), all list-like reconstruction, error cases, and
NullBitmap semantics.

Are there any user-facing changes?

New public API in arrow/extension/variant_shredding.h: VariantShreddingSchema,
IsVariantCompatibleWithType(), ShredVariantColumn(), ReconstructVariantColumn().
New methods on VariantBuilder: BuildWithoutMeta(), UnsafeAppendEncoded(),
SetAllowDuplicates().

AI Disclosure: AI coding assistants were used during development for scaffolding,
test generation, and review iteration. All code has been reviewed, debugged, and
verified by the author who owns and understands the changes.

@qzyu999 qzyu999 force-pushed the variant-shredding-impl branch from c92cb11 to 034ff49 Compare June 27, 2026 04:55

qzyu999 commented Jun 27, 2026

Force-pushed the shredding implementation built on top of the refactored decoding/encoding layers. This uses the idiomatic C++ view classes and RAII builder from the parent PRs.

Highlights:

  • Template-refactored shredding loops (ShredPrimitiveLoop<>, ShredBinaryLoop<>) eliminate per-type code duplication
  • Recursive array element shredding (Rust parity)
  • All 5 list-like types supported in reconstruction (List, LargeList, FixedSizeList, ListView, LargeListView)
  • out_null_bitmap parameter for SQL NULL disambiguation (Rust NullBuffer parity)
  • Metadata caching in reconstruction path (avoids redundant DecodeMetadata per row)
  • Object sub-field native extraction (primitives only; Object/Array sub-schemas deferred)
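
As a rough illustration of how a single templated loop can replace per-type copies, consider the sketch below. It is not the PR's ShredPrimitiveLoop<>: the `Decoded` alias, `PrimitiveLoopResult`, and `ShredPrimitiveLoopSketch` are hypothetical names, and `std::variant` stands in for a decoded variant value. It also matches types exactly, whereas the PR additionally allows safe widening.

```cpp
#include <cstdint>
#include <optional>
#include <variant>
#include <vector>

// Hypothetical decoded cell: a small subset of variant primitives.
using Decoded = std::variant<bool, std::int64_t, double>;

template <typename T>
struct PrimitiveLoopResult {
  std::vector<std::optional<T>> typed_value;  // set when the row holds a T
  std::vector<bool> needs_fallback;           // true when the row must go to value
};

// One loop body shared by every primitive target type T.
template <typename T>
PrimitiveLoopResult<T> ShredPrimitiveLoopSketch(const std::vector<Decoded>& rows) {
  PrimitiveLoopResult<T> out;
  out.typed_value.reserve(rows.size());
  out.needs_fallback.reserve(rows.size());
  for (const Decoded& row : rows) {
    if (const T* v = std::get_if<T>(&row)) {
      out.typed_value.emplace_back(*v);
      out.needs_fallback.push_back(false);
    } else {
      out.typed_value.emplace_back(std::nullopt);
      out.needs_fallback.push_back(true);
    }
  }
  return out;
}
```

Instantiating the same function with `std::int64_t` or `double` gives the per-type behaviour without duplicating the loop.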

335 tests pass with BUILD_WARNING_LEVEL=CHECKIN (114 shredding-specific).
