Skip to content

GH-45947 : [C++][Parquet] Variant encoding#50122

Open
qzyu999 wants to merge 2 commits into
apache:mainfrom
qzyu999:variant-encoding
Open

GH-45947 : [C++][Parquet] Variant encoding#50122
qzyu999 wants to merge 2 commits into
apache:mainfrom
qzyu999:variant-encoding

Conversation

@qzyu999

@qzyu999 qzyu999 commented Jun 8, 2026

Copy link
Copy Markdown

PR Stack (merge in order):

  1. GH-45946: [C++][Parquet] Variant decoding #50121 Variant decoding
  2. GH-45947 : [C++][Parquet] Variant encoding #50122 ← YOU ARE HERE Variant encoding (this PR, depends on GH-45946: [C++][Parquet] Variant decoding #50121)
  3. GH-45948: [C++][Parquet] Variant shredding #50232 Variant shredding (depends on this PR)

All three PRs are part of the GH-45937 umbrella (Add variant support to C++ Parquet).

Rationale for this change

Implements Variant binary encoding (the write side of decoding from GH-45946). Part of GH-45937 (Add variant support to C++). Depends on #50121.

Note: This PR depends on #50121 (Variant decoding) and is branched from it. Please review/merge #50121 first.

What changes are included in this PR?

Adds VariantBuilder class and RAII scope helpers to variant.h / variant_builder.cc:

  • VariantBuilder: move-only encoder supporting all 21 primitive types + containers.
    Int() auto-selects smallest encoding width. String() auto-selects short-string
    (≤63 bytes) vs long-string encoding. Dictionary (key interning) preserved across
    Finish() calls for column-scan workloads.
  • ObjectScope / ListScope: RAII scopes returned by StartObject() / StartList().
    Destructor auto-rolls back the buffer if Finish() is not called — safe under
    exceptions, early returns, and scope exit.
  • [[nodiscard]] on scope-returning functions prevents accidental discard.
  • Duplicate key handling: strict rejection by default (spec-compliant). Configurable
    via SetAllowDuplicates(true) for shredding reconstruction (GH-45948).
  • Sorted-check optimization: FinishObject() skips std::sort when fields are
    already in lexicographic order (common for schema-driven insertion).
  • Transparent hasher: dict_ uses is_transparent tags for forward-compatibility
    with C++20 heterogeneous lookup (eliminates the old lookup_buf_ member variable).

Design decisions:

  • Move-only (no accidental copies of builder state)
  • RAII rollback matches C++ idiom for transactional operations
  • Low-level API (Offset/NextField/FinishObject) retained for shredding internals
  • FinishObject sorts fields in-place (documented non-const-ref parameter)
  • Dictionary preserved across Finish() — amortizes key lookup for repeated schemas
  • Scale validation on Decimal (≤38, matching spec)

Are these changes tested?

221 total tests (87 new encoder + 134 decoder) pass with BUILD_WARNING_LEVEL=CHECKIN
covering: all primitives, auto-sizing, int boundaries, short/long string boundary,
special floats (NaN, ±Inf), arrays, objects, duplicate rejection, sorting, RAII scopes,
reset/reuse, builder from existing metadata, large containers (>255 elements), and
round-trip verification via decoder views.

Are there any user-facing changes?

New public API: VariantBuilder, ObjectScope, ListScope in variant.h.
VariantBuilder::EncodedVariant return type from Finish().

AI Disclosure: AI coding assistants were used during development for scaffolding,
test generation, and review iteration. All code has been reviewed, debugged, and
verified by the author who owns and understands the changes.

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@misiek1984 misiek1984 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments.

DecodeMetadata(encoded1.metadata.data(),
static_cast<int64_t>(encoded1.metadata.size())));

// Build a new variant reusing the same metadata

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a test where we reuse existing metadata but make a mistake with the data types. For example, according to the metadata, we should write a string but we write an integer instead. I think there is currently no validation for this case in VariantBuilder—is that on purpose?

Either way, the final Variant will be malformed, so the round-trip should fail.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. After working through the refactoring, I realized this test isn't possible because of how the variant format works, the metadata dictionary contains only key names (string interning for object field names), not value types. The format is self-describing: each value carries its own type tag in its header byte.

A "type mismatch between metadata and data" is architecturally impossible because metadata doesn't encode types at all. The refactored VariantMetadata struct's docstring makes this explicit: "This is NOT a schema, it contains key names only, not value types."

What you can test (and what the validation tests cover) is structural invariant violations: field IDs that exceed the dictionary size, offsets that point out of bounds, truncated values, etc. These are exercised by the ValidateVariant tests.

class ARROW_EXPORT VariantBuilder {
public:
VariantBuilder();
explicit VariantBuilder(const VariantMetadata& existing_metadata);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also be useful to pass a value buffer to VariantBuilder to initialize buffer_. This way, it will be possible to continue building an existing Variant value.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the valid point. The variant binary format is immutable by design, inserting a field into an existing object requires rewriting the header (field IDs and offsets are packed arrays, not linked structures). So "modify in place" isn't feasible at the format level.

The refactored design handles the "start from existing data" use case through:

  • VariantBuilder(const VariantMetadata& existing_metadata): constructor that pre-populates the key dictionary from an existing metadata buffer.
  • Read->rebuild pattern: VariantObjectView (read existing) + ObjectScope + UnsafeAppendEncoded (write new). This enables zero-copy field transfer between variants. It's exactly what the shredding reconstruction path uses.

This matches Rust's Variant (read-only) vs VariantBuilder (write-only) architecture, and is the standard immutable-format pattern (FlatBuffers, Cap'n Proto, etc.).

/// builder.Int(30);
/// builder.FinishObject(start, fields);
/// ARROW_ASSIGN_OR_RAISE(auto result, builder.Finish());
class ARROW_EXPORT VariantBuilder {

@misiek1984 misiek1984 Jun 19, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API is great for building new variants. Did you also consider adding an API that allows modifying existing Variant values? We would need to add a function to VariantBuilder similar to FindObjectField from the decoding PR, which would "move"| the context of VariantBuilder to a specific place/field. Once called, you would then be able to override the existing value.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refactored design makes this separation explicit and deliberate:

  • Views: zero-copy navigation of existing bytes (stack-allocated)
  • Builder: produces new bytes from scratch (dictionary preserved across Finish())

"Modify existing" = read the parts you want to keep via views, write them into a new builder with UnsafeAppendEncoded, add/change what you need, finish.

A higher-level mutable convenience API (think nlohmann::json-style) could be layered on top in a follow-up. For the first implementation, keeping the primitives clean and composable felt like the right foundation.

@qzyu999 qzyu999 force-pushed the variant-encoding branch from 8ab28f0 to f6b8e66 Compare June 27, 2026 04:55
@qzyu999

qzyu999 commented Jun 27, 2026

Copy link
Copy Markdown
Author

Force-pushed a refactored version. The initial implementation carried over some Go patterns (manual start/finish without safety guarantees, separate lookup buffer). After reviewing the feedback — particularly around initialization from existing buffers and the question about modifying existing variants, I reworked the builder with C++ idiom at the center of the design.

Key changes:

  • RAII scopes (ObjectScope/ListScope) with auto-rollback on destruction replace the previous unguarded start/finish pattern
  • [[nodiscard]] on scope-returning functions prevents silent discard
  • Transparent hasher on dict_ eliminates the old lookup_buf_ member variable
  • Sorted-check optimization in FinishObject (skip sort when fields already ordered)
  • Move-only builder (copy deleted) enforces single ownership

Regarding the earlier review questions about modifying existing variants and type mismatch testing, I've addressed these in replies above with architectural context from the refactored design.

All 335 tests pass end-to-end with BUILD_WARNING_LEVEL=CHECKIN (87 encoder-specific).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants