GH-45946: [C++][Parquet] Variant decoding#50121

Open
qzyu999 wants to merge 1 commit into apache:main from qzyu999:variant-decoding

qzyu999 commented Jun 8, 2026

PR Stack (merge in order):

  1. GH-45946: [C++][Parquet] Variant decoding #50121 (this PR)
  2. GH-45947: [C++][Parquet] Variant encoding #50122 (depends on this PR)
  3. GH-45948: [C++][Parquet] Variant shredding #50232 (depends on #50122)

All three PRs are part of the GH-45937 umbrella (Add variant support to C++ Parquet).

Rationale for this change

Implements full Variant binary decoding per the VariantEncoding spec. Part of GH-45937 (Add variant support to C++).

What changes are included in this PR?

Adds variant.h (public API) and variant.cc (implementation) providing:

  • View classes (VariantView, VariantObjectView, VariantArrayView): zero-copy,
    stack-allocated views that pre-parse headers at construction and provide type-safe
    access thereafter. Object field lookup always uses binary search, O(log n), with no size threshold.
  • SAX-style visitor (VariantVisitor): recursive traversal interface for full tree
    processing, following Arrow C++ conventions (TypeVisitor, ArrayVisitor).
  • Metadata decoding (DecodeMetadata, FindMetadataKey): string dictionary parsing
    with binary search for sorted dictionaries.
  • Numeric coercion (as_int64_coerced, as_int32_coerced, as_double_coerced):
    widening accessors matching Rust's as_i64() / as_f64() pattern.
  • Recursive validation (ValidateVariant): deep structural validation for untrusted
    input (validates all offsets, field IDs, nesting depth ≤128).
  • Shared internal utility (variant_internal_util.h): endian-safe ReadLE helpers
    used by both decoding and shredding implementations. NOT installed (internal only).
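For readers unfamiliar with the endian-safe read pattern mentioned in the last bullet, here is a minimal sketch. It is not the code in variant_internal_util.h; the signature and the name ReadLE as a free function taking a byte width are assumptions made for illustration. It assembles the value byte by byte, so the result does not depend on host endianness and no unaligned load is performed, which matters for the 3-byte offsets the tests mention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not the PR's actual ReadLE): reads an unsigned
// little-endian integer of `width` bytes (1..8) starting at `data`.
// The caller is responsible for ensuring `width` bytes are readable.
inline uint64_t ReadLE(const uint8_t* data, size_t width) {
  assert(width >= 1 && width <= 8);
  uint64_t value = 0;
  for (size_t i = 0; i < width; ++i) {
    // Byte i contributes bits [8*i, 8*i + 7] of the result.
    value |= static_cast<uint64_t>(data[i]) << (8 * i);
  }
  return value;
}
```

A bounds check against the buffer length would normally happen before the call, for example in a validated factory.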

Design decisions:

  • Parse once, query many (views pre-parse headers, subsequent access is O(1))
  • Zero-copy (string_view into source buffers, no heap allocation for reads)
  • Recursion depth limit (kMaxNestingDepth = 128) — security hardening for C++ stack
  • Binary search always — no threshold heuristic (pre-parsed header makes it optimal)
  • std::optional for not-found semantics (idiomatic C++)
  • Validated factories (Make()) ensure bounds-safe subsequent access
  • static_assert on view class sizes (≤32/80/64 bytes — cache-friendly)
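The recursion depth limit can be illustrated with a toy tree. This is a sketch only: the Node type and WithinDepthLimit are hypothetical, the real validator walks encoded bytes rather than an in-memory tree, and whether the root counts as depth 0 or 1 in the PR is an assumption here. Only the constant name and value (kMaxNestingDepth = 128) come from the description above.

```cpp
#include <vector>

constexpr int kMaxNestingDepth = 128;  // value taken from the PR description

// Hypothetical stand-in for a nested variant value.
struct Node {
  std::vector<Node> children;
};

// Returns false as soon as any node sits deeper than kMaxNestingDepth.
// The check runs before descending, so the call stack never grows past
// kMaxNestingDepth + 1 frames even for adversarial input.
inline bool WithinDepthLimit(const Node& node, int depth = 0) {
  if (depth > kMaxNestingDepth) return false;
  for (const Node& child : node.children) {
    if (!WithinDepthLimit(child, depth + 1)) return false;
  }
  return true;
}
```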

Are these changes tested?

134 variant-specific tests pass with BUILD_WARNING_LEVEL=CHECKIN covering: all 21
primitive types, short/long strings, objects (including 3-byte offsets), arrays
(including is_large), nesting, depth limits, metadata edge cases, error paths,
view API, numeric coercion, recursive validation, and visitor traversal.

Are there any user-facing changes?

New public API in arrow/extension/variant.h: VariantView, VariantObjectView,
VariantArrayView, VariantVisitor, VariantMetadata, DecodeMetadata,
FindMetadataKey, ValueSize, ValidateVariant, and associated types/enums.
All in namespace arrow::extension::variant.

AI Disclosure: AI coding assistants were used during development for scaffolding,
test generation, and review iteration. All code has been reviewed, debugged, and
verified by the author who owns and understands the changes.

misiek1984 left a comment

Some initial comments.

/// Searches the field IDs in the object, resolving each against the
/// metadata dictionary. Per spec, field IDs are in lexicographic order
/// of their corresponding key names, enabling binary search for large
/// objects (>=32 fields). For smaller objects, linear scan is used.

How was the 32 threshold determined?

Author

Thanks for raising this; it is a fair question about the original design. The threshold was inherited from the Go implementation, which uses 32 as the cutoff between linear scan and binary search. After reflecting on the review feedback, I realized this was a case where I was carrying over Go's pattern without questioning whether it made sense for C++.

In the refactored design, the threshold is eliminated entirely. The new VariantObjectView pre-parses the object header at construction time (field count, ID array start, offset array start, data start), so each subsequent field lookup is a plain binary search over a pre-computed structure: O(log n) for all n, with no per-access parsing overhead. This makes a threshold unnecessary because the cost that justified linear scan for small objects (re-parsing the header each time) no longer exists.

The pre-parsed approach is similar to how arrow-rs's VariantObject works: it validates structure upfront and provides O(1) indexed access and O(log n) name lookup thereafter.
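To make the lookup concrete, here is a simplified sketch of binary search over field IDs that are ordered by the key names they refer to. The ParsedObject type and FindField function are hypothetical stand-ins, not the PR's VariantObjectView; in particular, the real view reads IDs from the encoded buffer instead of holding vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical, simplified model of a pre-parsed object header.
struct ParsedObject {
  std::vector<std::string_view> dictionary;  // metadata keys, indexed by field ID
  std::vector<uint32_t> field_ids;           // ordered by the key each ID names
};

// Returns the index of `name` within field_ids, or std::nullopt if absent.
// Each probe resolves one field ID against the dictionary, so the whole
// lookup costs O(log n) key comparisons.
inline std::optional<size_t> FindField(const ParsedObject& obj,
                                       std::string_view name) {
  size_t lo = 0;
  size_t hi = obj.field_ids.size();
  while (lo < hi) {
    const size_t mid = lo + (hi - lo) / 2;
    const uint32_t id = obj.field_ids[mid];
    assert(id < obj.dictionary.size());  // a validated view would guarantee this
    const std::string_view key = obj.dictionary[id];
    if (key == name) return mid;
    if (key < name) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return std::nullopt;
}
```

Note that the dictionary itself does not need to be sorted for this to work; only the order of field_ids matters.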


/// \brief Basic type codes from bits 0-1 of the value header byte.
///
/// Variant Encoding Spec §3: "Value encoding"

nit: The current version of the spec does not contain sections §3 or §3.1. I would just add a link to the section with the tables: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types

Author

Fixed. All enum comments now reference the canonical spec location: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types

/// Implements parsing logic per the Variant Encoding Spec:
/// https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
///
/// The "internal" in the filename refers to the binary encoding internals

I don't have a strong opinion here. But instead of explaining in the comment what "internal" means, it might be better to rename the file, e.g. to variant_binary_encoding or variant_internal_encoding.

Author

Agreed, the naming was confusing. In the refactored layout:

  • variant.h: the main public API (views, builder, visitor, types). Clear name.
  • variant_internal_util.h: a small (~71 line) file with shared ReadLE utilities. Genuinely internal (not installed), and "internal" in the name is accurate since it's excluded from install_headers() by CMake's glob filter.
  • variant_internal_test_util.h: test-only header with RecordingVisitor. Also excluded from install (has "internal" in name).

The original variant_internal.h that contained the full public API (confusingly named) no longer exists.


class VariantIntegrationTest : public ::testing::Test {};

TEST_F(VariantIntegrationTest, FullRoundTrip) {

I would also add more tests demonstrating how to use all these new functions together. For example, let's assume we have the following Variant:

{
  "name": "Alice",
  "age": 30,
  "addresses": {
    "postal": {
      "country": "USA",
      "city": "New York"
    },
    "billing": {
      "country": "USA",
      "city": "Chicago"
    }
  }
}

If we want to find the city for the postal address, we would first need to use FindObjectField to find "addresses", then "postal", and finally "city". After that, we would read the value of the "city" field.

Author

Added. The refactored view classes support composable navigation:

auto obj = view.as_object();
auto inner = obj->get("address")->as_object();
auto city = inner->get("city")->as_string();

Tests exercise this chaining pattern with multi-level nesting (object -> object -> value, object -> array -> value, etc.).
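Since lookups report not-found through std::optional, a chained expression that dereferences each step directly assumes every key is present. The sketch below shows a null-safe walk for a path like addresses.postal.city from the example above. It uses a toy in-memory Node type and a hypothetical GetStringAt helper, not the PR's view classes, purely to illustrate the pattern.

```cpp
#include <initializer_list>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Toy model (hypothetical): a node is either a string leaf or an object.
// A node with no fields is treated as a string leaf.
struct Node {
  std::string key;
  std::string text;          // leaf payload, unused for objects
  std::vector<Node> fields;  // child fields, empty for leaves
};

// Follows `path` one key at a time and returns the string at the end.
// Any missing key, or a path that ends on an object, yields std::nullopt.
inline std::optional<std::string_view> GetStringAt(
    const Node& root, std::initializer_list<std::string_view> path) {
  const Node* cur = &root;
  for (std::string_view name : path) {
    const Node* next = nullptr;
    for (const Node& field : cur->fields) {
      if (field.key == name) {
        next = &field;
        break;
      }
    }
    if (next == nullptr) return std::nullopt;  // key not found at this level
    cur = next;
  }
  if (!cur->fields.empty()) return std::nullopt;  // ended on an object
  return std::string_view(cur->text);
}
```

The returned string_view borrows from the tree, so it is only valid while the tree is alive, which mirrors the lifetime rule for zero-copy views.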

/// the visitor. Returns the number of bytes consumed.
///
/// This is the core recursive function.
Status DecodeValueAt(const VariantMetadata& metadata, const uint8_t* data, int64_t length,

I think this function should be public. Let's assume I want to read the value of a specific nested field from a Variant using a path (e.g., field_1.field_2.field_3).

My current understanding is that I would first need to call FindObjectField to locate "field_1". If it exists, I then have to find "field_2", and finally "field_3". However, I have to implement the last step—reading the actual value—on my own because DecodeValueAt is not public, and DecodeVariantValue only allows for decoding the entire Variant.

Author

In the refactored design, this use case is covered by VariantView::Make(metadata, data + offset, size): you can construct a view at any byte offset within a buffer. There's no separate DecodeValueAt because the view factory is the decode-at-offset operation.

For object fields specifically, VariantObjectView::locate(name) returns an optional<FieldLocation> with offset + size without constructing the inner view, which is useful for zero-copy byte transfer (used by the shredding path).
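As an illustration of what a caller can do with an offset + size pair, here is a bounds-checked, zero-copy slice. The FieldLocation layout shown (two size_t members) and the Slice helper are assumptions for this sketch; the PR's actual struct may differ.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical layout; the source only says the location carries offset + size.
struct FieldLocation {
  size_t offset;
  size_t size;
};

// Returns the bytes of the located field without copying, or std::nullopt
// if the location does not fit inside `buffer`. The second comparison is
// written as a subtraction so that offset + size cannot overflow.
inline std::optional<std::string_view> Slice(std::string_view buffer,
                                             FieldLocation loc) {
  if (loc.offset > buffer.size()) return std::nullopt;
  if (loc.size > buffer.size() - loc.offset) return std::nullopt;
  return buffer.substr(loc.offset, loc.size);
}
```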

/// \return Status::OK on success, Status::Invalid on malformed input
///
/// \note The data buffer must remain valid for the duration of the call.
ARROW_EXPORT Status DecodeVariantValue(const VariantMetadata& metadata,

Do you have a plan to also support reading/decoding shredded variants?

qzyu999 Jun 27, 2026

Author

Implemented in #50232 (the shredding PR in this stack). ReconstructVariantColumn() handles the "unshredding" path, reassembling typed Parquet columns back into variant binary.

qzyu999 force-pushed the variant-decoding branch from b0c2298 to 162d503 on June 27, 2026 04:53

qzyu999 commented Jun 27, 2026

Author

I've force-pushed a refactored version that addresses all review feedback. After carefully reviewing the comments, it became clear that several design choices in the earlier iteration stemmed from initially trying to follow Go's implementation patterns (free functions, manual buffer management, linear/binary threshold). I've since reworked the implementation from scratch with C++ ergonomics and idiom as the guiding principle.

Key changes in this force-push:

  • View classes (VariantView, VariantObjectView, VariantArrayView) replace the previous free-function API: they parse headers once at construction, always use O(log n) binary search (no threshold), and return std::optional for not-found semantics
  • Numeric coercion accessors (as_int64_coerced, as_double_coerced) for Rust parity
  • Recursive validation (ValidateVariant) for untrusted input
  • Shared internal utility (variant_internal_util.h) consolidates ReadLE helpers
  • Previous variant_internal.h naming confusion resolved; the main API is now variant.h
  • Test utility renamed to variant_internal_test_util.h (ensures it's not installed)

All 335 tests pass end-to-end with BUILD_WARNING_LEVEL=CHECKIN (134 decoder-specific).
