Skip to content

GH-48701: [C++][Parquet] Add ALPpd encoding#48345

Open
prtkgaur wants to merge 106 commits into
apache:mainfrom
prtkgaur:gh540-alp-pseudoDecimal-encoding
Open

GH-48701: [C++][Parquet] Add ALPpd encoding#48345
prtkgaur wants to merge 106 commits into
apache:mainfrom
prtkgaur:gh540-alp-pseudoDecimal-encoding

Conversation

@prtkgaur

@prtkgaur prtkgaur commented Dec 5, 2025

Copy link
Copy Markdown

Co-authored-by: dhirhan17@gmail.com

@Reviewer : Suggested order : Outdated, will update shortly in which to look at the code while reviewing.

Rationale for this change

ALP significantly improves on the compression ratio and decompression speed over of float/double columns over other encoding/compression techniques.

Spec

Spec
This PR also contains a terse version of the spec in the file cpp/src/arrow/util/alp/ALP_Encoding_Specification_terse.md which can go in the Encodings.md

Parquet Format PR

Dataset PR (parquet-testing)

apache/parquet-testing#100

What changes are included in this PR?

This PR
Introduces ALP (pseudo-decimal) encoding into c++ arrow code.
We also provide benchmarks and dataset to prove the effectiveness of the above algorithm.

Adding above needed us to add following classes.

  • Alp h/cc : Houses core logic for encoding and decoding.
  • Sampler h/cc : Houses logic to sample and select parameters for encoding.
  • AlpWrapper h/cc : Binds together Alp and Sampler classes.

Integration of the above code was done in

  • Encoder/Decoder cc which exposes wrapper to encode buffer of data.

Are these changes tested?

  • We have added unit tests to test the code.
  • Also the benchmarks have been added that cover wide variety of floating point values from low precision to high precision.

Unit tests

  • alp_test.cc

Benchmark tests

  • encoding_benchmark.cc and encoding_alp_benchmark.cc

Are there any user-facing changes?

  • It's a new encoding so the only impact is query performance which we claim will only get better.

DuckDB

  • We did look at DuckDB's ALP implementation while we were implementing ALP and would like to give that team the desired credit.

@github-actions

github-actions Bot commented Dec 5, 2025

Copy link
Copy Markdown

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@prtkgaur prtkgaur force-pushed the gh540-alp-pseudoDecimal-encoding branch 3 times, most recently from 1b78a5c to d563ce0 Compare December 7, 2025 15:46

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the more standard place to put test data is in either arrow-testing or parquet-testing so it can be used across implementations

In this case I would recommend https://github.com/apache/parquet-testing

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Thanks.
apache/parquet-testing#100

Comment thread cpp/src/parquet/types.h
DELTA_BYTE_ARRAY = 7,
RLE_DICTIONARY = 8,
BYTE_STREAM_SPLIT = 9,
ALP = 10,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For parquet-format we have this PR : apache/parquet-format#557

@alamb

alamb commented Dec 8, 2025

Copy link
Copy Markdown
Contributor

Thanks @prtkgaur -- it is super exciting to see this movement.

Unfortunately, I am not familiar with the C/C++ codebase to give this a realistic review.

I started the CI checks on this PR and had some comments about the testing.

@prtkgaur prtkgaur changed the title [Gh540] Add ALPpd encoding to parquet [Gh539] Add ALPpd encoding to parquet Dec 8, 2025

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Thanks.
apache/parquet-testing#100

std::string tarball_path = std::string(__FILE__);
tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
tarball_path += "/arrow/cpp/submodules/parquet-testing/data/floatingpoint_data.tar.gz";

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Reviewer the data sits in the parquet-testing submodule
apache/parquet-testing#100


// Unsafe resize without initialization - use only when you will immediately
// overwrite the memory (e.g., before memcpy). Only safe for POD types.
void UnsafeResize(size_t n) {

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using this over resize gave us around 2-3% performance improvement

@prtkgaur prtkgaur changed the title [Gh539] Add ALPpd encoding to parquet [Gh539][Encoding] Add ALPpd encoding to parquet Dec 8, 2025
@prtkgaur prtkgaur changed the title [Gh539][Encoding] Add ALPpd encoding to parquet [Gh-539][Encoding] Add ALPpd encoding to parquet Dec 8, 2025
std::vector<uint8_t> packed_integers(bit_packed_size);
if (bit_width > 0) { // Only execute BP if writing data.
// Use Arrow's BitWriter for packing (loop-based).
arrow::bit_util::BitWriter writer(packed_integers.data(), bit_packed_size);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open question, do we have SIMD bit-packing anyplace? Or is it only unpacking.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only remember seeing SIMD un-packing.

Never really thought packing would be important as insert jobs tend to have overhead at other places. But if not present might be worth thinking.
cc @pitrou

arrow::internal::unpack(packed_integers.data(), encoded_integers.data(),
static_cast<int>(num_elements), for_info.bit_width());
} else {
std::memset(encoded_integers.data(), 0, num_elements * sizeof(ExactType));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth doing a memset of the ForValue here so we can skip adding it back?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memset can't fill with frame_of_reference — it's byte-wise, and frame_of_reference is 4 or 8 bytes. The replacement would have to be std::fill_n, which costs the same N stores as the current memset(0). No write savings.

/// \param[in] num_elements number of elements in this vector
/// \param[in] bw bits per element
/// \return the size in bytes of the bitpacked data
static int64_t GetBitPackedSize(int32_t num_elements, uint8_t bw) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MIght have already commented but this probably belongs in a bitutil class.

Comment thread cpp/src/arrow/util/alp/alp_codec.cc Outdated
// Phase 1: Compress all vectors and collect them
std::vector<AlpEncodedVector<T>> encoded_vectors;
const int64_t num_vectors =
(element_count + vector_size - 1) / vector_size;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this operation is repeated in a few places, does it pay to factor it out?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread cpp/src/arrow/util/alp/alp_test.cc Outdated
// Integration Tests
// ============================================================================

TEST(AlpIntegrationTest, LargeFloatDataset) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we to type parameterization on these tests, to avoid the duplicate code?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Folded into TYPED_TEST(AlpIntegrationTest, RandomAndExtremes)

Comment thread cpp/src/arrow/util/alp/alp_test.cc Outdated

TEST(AlpIntegrationTest, LargeFloatDataset) {
std::mt19937 rng(12345);
std::uniform_real_distribution<float> dist(-1000.0f, 1000.0f);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a wide-enough range in practice to test of any overflow scenarios?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added numeric_limits extremes (lowest/max/min/denorm_min/±0.0) to force exception path.

Comment thread cpp/src/arrow/util/alp/alp_test.cc Outdated
std::vector<TypeParam> output(input.size());
compressor.DecompressVector(encoded, AlpIntegerEncoding::kForBitPack, output.data());

EXPECT_EQ(std::memcmp(output.data(), input.data(), input.size() * sizeof(TypeParam)),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think you should be able to use EXPECT_THAT(output, ElementsAreArray(input));

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did NOT use ElementsAreArray — would silently break NegativeZero / AllNaN tests (-0.0 == 0.0. true, NaN != NaN). Wrote IsBitwiseEqual<T> matcher; ~17 sites converted; 2 byte-buffer comparisons kept as memcmp.

Comment thread cpp/src/arrow/util/alp/alp_test.cc Outdated
ASSERT_OK(AlpCodec<TypeParam>::template Decode<TypeParam>(
static_cast<int32_t>(kBatchSize), comp2.data(), comp_size2, output2.data()));

EXPECT_EQ(std::memcmp(output1.data(), batch1.data(), kBatchSize * sizeof(TypeParam)), 0);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same general comment on testing for equality.

Comment thread cpp/src/arrow/util/alp/alp_test.cc Outdated
comp_size, output.data()));

// Verify successful decode
EXPECT_EQ(std::memcmp(output.data(), input.data(), input.size() * sizeof(double)), 0);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where does truncated/corruption happen, how does this successfully decode?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Rewrote to truncate


using namespace arrow::util::alp;

static void printHex(const std::string& name, const uint8_t* data, size_t len) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrintHex.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like this file and the next file aren't compiled? can they be removed for now?

Comment thread cpp/src/parquet/CMakeLists.txt Outdated
add_parquet_benchmark(encoding_alp_benchmark)

add_executable(generate-alp-parquet
${PROJECT_SOURCE_DIR}/src/arrow/util/alp/generate_alp_parquet.cc)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its a little strange to have this cross directory target. I think I commented that this wasn't compiled above, could we move the here maybe if we need it?

@emkornfield emkornfield left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I've done a complete read through now of all the code. I think the one other thing we should figure out is how to fall back to plain encoding at some point if ALP is completely failing on the dataset (i.e. it is consistently adding pages that take more space the PLAIN encoding). I think this can be done in a follow-up.

@prtkgaur prtkgaur requested a review from pitrou as a code owner June 25, 2026 23:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants