GH-45946: [C++][Parquet] Variant decoding#50121
Conversation
| /// Searches the field IDs in the object, resolving each against the | ||
| /// metadata dictionary. Per spec, field IDs are in lexicographic order | ||
| /// of their corresponding key names, enabling binary search for large | ||
| /// objects (>=32 fields). For smaller objects, linear scan is used. |
There was a problem hiding this comment.
How was the 32 threshold determined?
There was a problem hiding this comment.
Thanks for raising this, it is a fair question about the original design. The threshold was inherited from the Go implementation which uses 32 as the cutoff between linear scan and binary search. After reflecting on the review feedback, I realized this was a case where I was carrying over Go's pattern without questioning whether it made sense for C++.
In the refactored design, the threshold is eliminated entirely. The new VariantObjectView pre-parses the object header at construction time (field count, ID array start, offset array start, data start), so subsequent field lookups are just binary search through a pre-computed structure, O(log n) for all n, with no per-access parsing overhead. This makes a threshold unnecessary because the cost that justified linear scan for small objects (re-parsing the header each time) no longer exists.
The pre-parsed approach is similar to how arrow-rs's VariantObject works, it validates structure upfront and provides O(1) indexed access and O(log n) name lookup thereafter.
|
|
||
| /// \brief Basic type codes from bits 0-1 of the value header byte. | ||
| /// | ||
| /// Variant Encoding Spec §3: "Value encoding" |
There was a problem hiding this comment.
nit: The current version of the spec does not contain paragraph §3 and §3.1. I would just add a link to the section with tables: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types
There was a problem hiding this comment.
Fixed. All enum comments now reference the canonical spec location: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types
| /// Implements parsing logic per the Variant Encoding Spec: | ||
| /// https://github.com/apache/parquet-format/blob/master/VariantEncoding.md | ||
| /// | ||
| /// The "internal" in the filename refers to the binary encoding internals |
There was a problem hiding this comment.
I don't have a strong opinion here. But maybe instead of explaining in the comment what "internal" means it would be better to rename a file e.g. to variant_binary_encoding, variant_internal_encoding etc.
There was a problem hiding this comment.
Agreed, the naming was confusing. In the refactored layout:
variant.h: the main public API (views, builder, visitor, types). Clear name.variant_internal_util.h: a small (~71 line) file with sharedReadLEutilities. Genuinely internal (not installed), and "internal" in the name is accurate since it's excluded frominstall_headers()by CMake's glob filter.variant_internal_test_util.h: test-only header withRecordingVisitor. Also excluded from install (has "internal" in name).
The original variant_internal.h that contained the full public API (confusingly named) no longer exists.
|
|
||
| class VariantIntegrationTest : public ::testing::Test {}; | ||
|
|
||
| TEST_F(VariantIntegrationTest, FullRoundTrip) { |
There was a problem hiding this comment.
I would also add more tests demonstrating how to use all these new functions together. For example, let's assume we have the following Variant:
{
"name": "Alice",
"age": 30,
"addresses": {
"postal": {
"country": "USA",
"city": "New York"
},
"billing": {
"country": "USA",
"city": "Chicago"
}
}
}
If we want to find the city for the postal address, we would first need to use FindObjectField to find "addresses", then "postal", and finally "city". After that, we would read the value of the "city" field.
There was a problem hiding this comment.
Added. The refactored view classes support composable navigation:
auto obj = view.as_object();
auto inner = obj->get("address")->as_object();
auto city = inner->get("city")->as_string();Tests exercise this chaining pattern with multi-level nesting (object -> object -> value, object -> array -> value, etc.).
| /// the visitor. Returns the number of bytes consumed. | ||
| /// | ||
| /// This is the core recursive function. | ||
| Status DecodeValueAt(const VariantMetadata& metadata, const uint8_t* data, int64_t length, |
There was a problem hiding this comment.
I think this function should be public. Let's assume I want to read the value of a specific nested field from a Variant using a path (e.g., field_1.field_2.field_3).
My current understanding is that I would first need to call FindObjectField to locate "field_1". If it exists, I then have to find "field_2", and finally "field_3". However, I have to implement the last step—reading the actual value—on my own because DecodeValueAt is not public, and DecodeVariantValue only allows for decoding the entire Variant.
There was a problem hiding this comment.
In the refactored design, this use case is covered by VariantView::Make(metadata, data + offset, size), you can construct a view at any byte offset within a buffer. There's no separate DecodeValueAt because the view factory IS the decode-at-offset operation.
For object fields specifically, VariantObjectView::locate(name) returns an optional<FieldLocation> with offset + size without constructing the inner view, which is useful for zero-copy byte transfer (used by the shredding path).
| /// \return Status::OK on success, Status::Invalid on malformed input | ||
| /// | ||
| /// \note The data buffer must remain valid for the duration of the call. | ||
| ARROW_EXPORT Status DecodeVariantValue(const VariantMetadata& metadata, |
There was a problem hiding this comment.
Do you have a plan to also support reading/decoding shredded variants?
There was a problem hiding this comment.
Implemented in #50232 (the shredding PR in this stack). ReconstructVariantColumn() handles the "unshredding" path, reassembling typed Parquet columns back into variant binary.
b0c2298 to
162d503
Compare
|
I've force-pushed a refactored version that addresses all review feedback. After carefully reviewing the comments, it became clear that several design choices in the earlier iteration stemmed from initially trying to follow Go's implementation patterns (free functions, manual buffer management, linear/binary threshold). I've since reworked the implementation from scratch with C++ ergonomics and idiom as the guiding principle. Key changes in this force-push:
All 335 tests pass end-to-end with |
Rationale for this change
Implements full Variant binary decoding per the VariantEncoding spec. Part of GH-45937 (Add variant support to C++).
What changes are included in this PR?
Adds
variant.h(public API) andvariant.cc(implementation) providing:VariantView,VariantObjectView,VariantArrayView): zero-copy,stack-allocated views that pre-parse headers at construction and provide type-safe
access thereafter. O(log n) object field lookup via binary search always (no threshold).
VariantVisitor): recursive traversal interface for full treeprocessing, following Arrow C++ conventions (TypeVisitor, ArrayVisitor).
DecodeMetadata,FindMetadataKey): string dictionary parsingwith binary search for sorted dictionaries.
as_int64_coerced,as_int32_coerced,as_double_coerced):widening accessors matching Rust's
as_i64()/as_f64()pattern.ValidateVariant): deep structural validation for untrustedinput (validates all offsets, field IDs, nesting depth ≤128).
variant_internal_util.h): endian-safe ReadLE helpersused by both decoding and shredding implementations. NOT installed (internal only).
Design decisions:
string_viewinto source buffers, no heap allocation for reads)kMaxNestingDepth = 128) — security hardening for C++ stackstd::optionalfor not-found semantics (idiomatic C++)Make()) ensure bounds-safe subsequent accessstatic_asserton view class sizes (≤32/80/64 bytes — cache-friendly)Are these changes tested?
134 variant-specific tests pass with
BUILD_WARNING_LEVEL=CHECKINcovering: all 21primitive types, short/long strings, objects (including 3-byte offsets), arrays
(including is_large), nesting, depth limits, metadata edge cases, error paths,
view API, numeric coercion, recursive validation, and visitor traversal.
Are there any user-facing changes?
New public API in
arrow/extension/variant.h:VariantView,VariantObjectView,VariantArrayView,VariantVisitor,VariantMetadata,DecodeMetadata,FindMetadataKey,ValueSize,ValidateVariant, and associated types/enums.All in namespace
arrow::extension::variant.AI Disclosure: AI coding assistants were used during development for scaffolding,
test generation, and review iteration. All code has been reviewed, debugged, and
verified by the author who owns and understands the changes.