GH-48701: [C++][Parquet] Add ALPpd encoding by prtkgaur · Pull Request #48345 · apache/arrow

prtkgaur · 2025-12-05T00:23:45Z

@Reviewer: The suggested order in which to look at the code while reviewing is outdated; I will update it shortly.

Rationale for this change

ALP significantly improves the compression ratio and decompression speed of float/double columns compared with other encoding/compression techniques.

Spec

This PR also contains a terse version of the spec in the file cpp/src/arrow/util/alp/ALP_Encoding_Specification_terse.md, which can go into Encodings.md.

Parquet Format PR

Dataset PR (parquet-testing)

apache/parquet-testing#100

What changes are included in this PR?

This PR introduces ALP (pseudo-decimal) encoding into the Arrow C++ code.
We also provide benchmarks and a dataset to demonstrate the effectiveness of the algorithm.

Adding this required the following classes.

Alp h/cc : Houses core logic for encoding and decoding.
Sampler h/cc : Houses logic to sample and select parameters for encoding.
AlpWrapper h/cc : Binds together Alp and Sampler classes.

The above code is integrated in

Encoder/Decoder cc, which exposes a wrapper to encode a buffer of data.

Are these changes tested?

We have added unit tests for the code.
We have also added benchmarks that cover a wide variety of floating-point values, from low precision to high precision.

Unit tests

alp_test.cc

Benchmark tests

encoding_benchmark.cc and encoding_alp_benchmark.cc

Are there any user-facing changes?

It's a new encoding, so the only impact is on query performance, which we claim will only get better.

DuckDB

We looked at DuckDB's ALP implementation while implementing ALP and would like to give that team due credit.

GitHub Issue: [ALP][Parquet] Add C++ implementation of ALPpd encoder/decoder #48701

github-actions · 2025-12-05T00:24:08Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


alamb · 2025-12-08T14:04:07Z

I think the more standard place to put test data is in either arrow-testing or parquet-testing so it can be used across implementations

In this case I would recommend https://github.com/apache/parquet-testing


alamb · 2025-12-08T14:07:24Z

    DELTA_BYTE_ARRAY = 7,
    RLE_DICTIONARY = 8,
    BYTE_STREAM_SPLIT = 9,
+    ALP = 10,


https://github.com/apache/arrow/blob/main/cpp/src/parquet/parquet.thrift#L631 needs to be updated here and in parquet-format.

For parquet-format we have this PR : apache/parquet-format#557

alamb · 2025-12-08T14:07:48Z

Thanks @prtkgaur -- it is super exciting to see this movement.

Unfortunately, I am not familiar enough with the C/C++ codebase to give this a realistic review.

I started the CI checks on this PR and had some comments about the testing.

prtkgaur · 2025-12-08T22:16:49Z

Makes sense. Thanks.
apache/parquet-testing#100

prtkgaur · 2025-12-08T23:31:26Z

+    std::string tarball_path = std::string(__FILE__);
+    tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
+    tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
+    tarball_path += "/arrow/cpp/submodules/parquet-testing/data/floatingpoint_data.tar.gz";


@Reviewer the data sits in the parquet-testing submodule
apache/parquet-testing#100

prtkgaur · 2025-12-08T23:41:27Z


+  // Unsafe resize without initialization - use only when you will immediately
+  // overwrite the memory (e.g., before memcpy). Only safe for POD types.
+  void UnsafeResize(size_t n) {


Using this instead of resize gave us roughly a 2-3% performance improvement.


emkornfield · 2026-06-25T06:45:12Z

+  std::vector<uint8_t> packed_integers(bit_packed_size);
+  if (bit_width > 0) {  // Only execute BP if writing data.
+    // Use Arrow's BitWriter for packing (loop-based).
+    arrow::bit_util::BitWriter writer(packed_integers.data(), bit_packed_size);


Open question: do we have SIMD bit-packing anywhere, or only unpacking?

I only remember seeing SIMD unpacking.

Never really thought packing would be important, as insert jobs tend to have overhead in other places. But if it is not present, it might be worth thinking about.
cc @pitrou

emkornfield · 2026-06-25T06:48:46Z

+    arrow::internal::unpack(packed_integers.data(), encoded_integers.data(),
+                            static_cast<int>(num_elements), for_info.bit_width());
+  } else {
+    std::memset(encoded_integers.data(), 0, num_elements * sizeof(ExactType));


Is it worth doing a memset of the ForValue here so we can skip adding it back?

memset can't fill with frame_of_reference — it's byte-wise, and frame_of_reference is 4 or 8 bytes. The replacement would have to be std::fill_n, which costs the same N stores as the current memset(0). No write savings.
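
To make the byte-wise point concrete, here is a minimal standalone sketch. The function names are hypothetical and not taken from the PR; it only contrasts what `std::memset` and `std::fill_n` produce for a 4-byte value.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: fill the output with a multi-byte frame-of-reference
// value. std::fill_n writes whole elements, one store per element.
std::vector<uint32_t> FillWithFrameOfReference(size_t n, uint32_t frame_of_reference) {
  std::vector<uint32_t> out(n);
  std::fill_n(out.begin(), n, frame_of_reference);
  return out;
}

// What memset does instead: it replicates a single byte (here the low byte
// of the value) across the whole buffer, so the 4-byte pattern is lost.
std::vector<uint32_t> FillWithMemset(size_t n, uint32_t value) {
  std::vector<uint32_t> out(n);
  std::memset(out.data(), static_cast<int>(value & 0xFF), n * sizeof(uint32_t));
  return out;
}
```

With a value of `0x01020304`, the `fill_n` version stores that value in every slot, while the `memset` version stores `0x04040404`.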

emkornfield · 2026-06-25T06:51:29Z

+  /// \param[in] num_elements number of elements in this vector
+  /// \param[in] bw bits per element
+  /// \return the size in bytes of the bitpacked data
+  static int64_t GetBitPackedSize(int32_t num_elements, uint8_t bw) {


Might have already commented, but this probably belongs in a bitutil class.
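
For illustration, a free-function version might look like the sketch below. This assumes the method simply rounds `num_elements * bw` bits up to whole bytes, which the quoted doc comment suggests but the snippet does not show; the name `BitPackedSizeBytes` is hypothetical. A later commit in this PR mentions switching to `bit_util::BytesForBits`, which would serve the same purpose.

```cpp
#include <cstdint>

// Hypothetical free function mirroring the quoted signature.
// Assumption: the packed size is the total bit count rounded up to bytes.
constexpr int64_t BitPackedSizeBytes(int32_t num_elements, uint8_t bit_width) {
  // Widen before multiplying so large element counts cannot overflow int32_t.
  const int64_t total_bits = static_cast<int64_t>(num_elements) * bit_width;
  return (total_bits + 7) / 8;
}
```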

emkornfield · 2026-06-25T06:59:17Z

+  // Phase 1: Compress all vectors and collect them
+  std::vector<AlpEncodedVector<T>> encoded_vectors;
+  const int64_t num_vectors =
+      (element_count + vector_size - 1) / vector_size;


I think this operation is repeated in a few places, does it pay to factor it out?
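
A factored-out helper could be as small as the sketch below. The names are hypothetical; the commit list in this PR indicates the author ended up using `bit_util::CeilDiv` for these computations.

```cpp
#include <cstdint>

// Hypothetical stand-in for a shared ceiling-division helper.
// Assumes value >= 0 and divisor > 0.
constexpr int64_t CeilDivSketch(int64_t value, int64_t divisor) {
  return (value + divisor - 1) / divisor;
}

// Number of vectors needed to hold element_count values when each
// vector holds at most vector_size values.
constexpr int64_t NumVectors(int64_t element_count, int64_t vector_size) {
  return CeilDivSketch(element_count, vector_size);
}
```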

emkornfield · 2026-06-25T07:05:17Z

+// Integration Tests
+// ============================================================================
+
+TEST(AlpIntegrationTest, LargeFloatDataset) {


Can we use type parameterization on these tests, to avoid the duplicate code?

Folded into TYPED_TEST(AlpIntegrationTest, RandomAndExtremes)

emkornfield · 2026-06-25T07:06:07Z

+
+TEST(AlpIntegrationTest, LargeFloatDataset) {
+  std::mt19937 rng(12345);
+  std::uniform_real_distribution<float> dist(-1000.0f, 1000.0f);


Is this a wide enough range in practice to test for any overflow scenarios?

Added numeric_limits extremes (lowest/max/min/denorm_min/±0.0) to force exception path.

emkornfield · 2026-06-25T07:10:10Z

+  std::vector<TypeParam> output(input.size());
+  compressor.DecompressVector(encoded, AlpIntegerEncoding::kForBitPack, output.data());
+
+  EXPECT_EQ(std::memcmp(output.data(), input.data(), input.size() * sizeof(TypeParam)),


nit: I think you should be able to use EXPECT_THAT(output, ElementsAreArray(input));

Did NOT use ElementsAreArray — it would silently break the NegativeZero / AllNaN tests (-0.0 == 0.0 is true, NaN != NaN). Wrote an IsBitwiseEqual<T> matcher; ~17 sites converted; 2 byte-buffer comparisons kept as memcmp.

emkornfield · 2026-06-25T07:10:44Z

+  ASSERT_OK(AlpCodec<TypeParam>::template Decode<TypeParam>(
+      static_cast<int32_t>(kBatchSize), comp2.data(), comp_size2, output2.data()));
+
+  EXPECT_EQ(std::memcmp(output1.data(), batch1.data(), kBatchSize * sizeof(TypeParam)), 0);


same general comment on testing for equality.

emkornfield · 2026-06-25T07:12:11Z

+                                    comp_size, output.data()));
+
+  // Verify successful decode
+  EXPECT_EQ(std::memcmp(output.data(), input.data(), input.size() * sizeof(double)), 0);


Where does the truncation/corruption happen, and how does this successfully decode?

Good catch. Rewrote to truncate

emkornfield · 2026-06-25T07:15:59Z

+
+using namespace arrow::util::alp;
+
+static void printHex(const std::string& name, const uint8_t* data, size_t len) {


emkornfield · 2026-06-25T07:16:55Z

Seems like this file and the next file aren't compiled? Can they be removed for now?

emkornfield · 2026-06-25T07:19:37Z

+add_parquet_benchmark(encoding_alp_benchmark)
+
+add_executable(generate-alp-parquet
+               ${PROJECT_SOURCE_DIR}/src/arrow/util/alp/generate_alp_parquet.cc)


It's a little strange to have this cross-directory target. I think I commented above that this wasn't compiled; could we move it here, maybe, if we need it?

emkornfield

OK, I think I've done a complete read-through now of all the code. I think the one other thing we should figure out is how to fall back to plain encoding at some point if ALP is completely failing on the dataset (i.e. it is consistently adding pages that take more space than the PLAIN encoding). I think this can be done in a follow-up.
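
One possible shape for such a follow-up is sketched below, purely as an illustration. Nothing here comes from the PR: the class name, the "consecutive oversized pages" rule, and the threshold are all assumptions about what "consistently" could mean.

```cpp
#include <cstdint>

// Hypothetical tracker: signals a fallback to PLAIN once a configurable
// number of consecutive pages encoded larger than their PLAIN size.
class PlainFallbackTrackerSketch {
 public:
  explicit PlainFallbackTrackerSketch(int max_consecutive_oversized_pages)
      : max_consecutive_(max_consecutive_oversized_pages) {}

  // Record one page; returns true once the fallback should be triggered.
  bool RecordPage(int64_t alp_bytes, int64_t plain_bytes) {
    if (alp_bytes > plain_bytes) {
      ++consecutive_;
    } else {
      consecutive_ = 0;  // a page that compressed well resets the streak
    }
    return consecutive_ >= max_consecutive_;
  }

 private:
  int max_consecutive_;
  int consecutive_ = 0;
};
```

A streak-based rule keeps the decision cheap and local to the column writer, but other policies (for example a running ratio over all pages) would be equally plausible.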


github-actions Bot added Component: Parquet Component: C++ awaiting review Awaiting review labels Dec 5, 2025

prtkgaur force-pushed the gh540-alp-pseudoDecimal-encoding branch 3 times, most recently from 1b78a5c to d563ce0 Compare December 7, 2025 15:46

alamb reviewed Dec 8, 2025


alamb mentioned this pull request Dec 8, 2025

[Parquet] Prototype ALP encoding apache/arrow-rs#8748

Open

prtkgaur changed the title ~~[Gh540] Add ALPpd encoding to parquet~~ [Gh539] Add ALPpd encoding to parquet Dec 8, 2025

prtkgaur commented Dec 8, 2025


prtkgaur changed the title ~~[Gh539] Add ALPpd encoding to parquet~~ [Gh539][Encoding] Add ALPpd encoding to parquet Dec 8, 2025

prtkgaur changed the title ~~[Gh539][Encoding] Add ALPpd encoding to parquet~~ [Gh-539][Encoding] Add ALPpd encoding to parquet Dec 8, 2025

sfc-gh-pgaur and others added 14 commits December 8, 2025 23:47

Add alp code

0d442fd

Co-authored-by: Dhirhan Kanesalingam <dhirhan17@gmail.com>

Integrate ALP with arrow

06d1e19

Add alp benchmark

a98c594

Add datasets for alp benchmarking

c297f97

Update cmake file

ab928e8

Move hpp files to h

6a95a59

Update flow digram and layout digram to use ASCII and not unicode characters

865e46a

Rename cpp files to cc

cb6d0b6

Update documentation to align with arrow's doxygen style

496e23b

Adapt methods and variable names to arrow style

8803b52

Also ensure that no line exceeds 90 characters

Update the tests to adhere to arrow style code

31e94ec

Update callers

46c0ecc

Fuse FOR and decode loop

a70b08f

Reduce memory allocation in the decompress call

ccbb1dd

emkornfield reviewed Jun 25, 2026


sfc-gh-pgaur added 3 commits June 25, 2026 23:35

Replace trailing return types with explicit return types

8f64e08

Cite ALP paper §3.1.2 for combination tie-break rule

363d7dd

Pin AlpMode underlying type to uint8_t for serialization safety

342d69a

prtkgaur requested a review from pitrou as a code owner June 25, 2026 23:36

sfc-gh-pgaur added 13 commits June 25, 2026 23:48

Address reviewer nits: doc tweaks, sizeof(member_), incremental-decode TODO

0e30304

Convert AlpSampler uint64_t to int64_t

ac751bb

Remove apacheGH-48701 reference from incremental-decode TODO

04e4e45

Document actual crash/abort behavior in AlpCodec encode-path docs

205e600

Use bit_util::CeilDiv for vector-count computations in AlpCodec

fe5fc46

Replace memcmp with IsBitwiseEqual helper in alp_test.cc

0944b89

Type-parameterize alp_test integration tests and fix TruncatedData

562aeaa

Type-parameterize AlpEncodingAdHoc tests in encoding_test.cc

03236af

Widen RandomData test range to exercise ALP exception fallback path

89435ce

Remove uncompiled generate_alp_parquet and generate_reference_blobs

4cb0f9a

Add incremental-encode TODO and snake_case rename in AlpEncoder

4d92a98

Rename EncodeAlp parameter combinations to preset

50d5364

Use bit_util::IsPowerOf2 and BytesForBits instead of raw bit manipulation

9e4c30a

