GH-48701: [C++][Parquet] Add ALPpd encoding#48345
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
1b78a5c to
d563ce0
Compare
There was a problem hiding this comment.
I think the more standard place to put test data is in either arrow-testing or parquet-testing so it can be used across implementations
In this case I would recommend https://github.com/apache/parquet-testing
| DELTA_BYTE_ARRAY = 7, | ||
| RLE_DICTIONARY = 8, | ||
| BYTE_STREAM_SPLIT = 9, | ||
| ALP = 10, |
There was a problem hiding this comment.
https://github.com/apache/arrow/blob/main/cpp/src/parquet/parquet.thrift#L631 needs to be updated here and in parqut-format.
There was a problem hiding this comment.
For parquet-format we have this PR : apache/parquet-format#557
|
Thanks @prtkgaur -- it is super exciting to see this movement. Unfortunately, I am not familiar with the C/C++ codebase to give this a realistic review. I started the CI checks on this PR and had some comments about the testing. |
| std::string tarball_path = std::string(__FILE__); | ||
| tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\")); | ||
| tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\")); | ||
| tarball_path += "/arrow/cpp/submodules/parquet-testing/data/floatingpoint_data.tar.gz"; |
There was a problem hiding this comment.
@Reviewer the data sits in the parquet-testing submodule
apache/parquet-testing#100
|
|
||
| // Unsafe resize without initialization - use only when you will immediately | ||
| // overwrite the memory (e.g., before memcpy). Only safe for POD types. | ||
| void UnsafeResize(size_t n) { |
There was a problem hiding this comment.
Using this over resize gave us around 2-3% performance improvement
Co-authored-by: Dhirhan Kanesalingam <dhirhan17@gmail.com>
Also ensure that no line exceeds 90 characters
| std::vector<uint8_t> packed_integers(bit_packed_size); | ||
| if (bit_width > 0) { // Only execute BP if writing data. | ||
| // Use Arrow's BitWriter for packing (loop-based). | ||
| arrow::bit_util::BitWriter writer(packed_integers.data(), bit_packed_size); |
There was a problem hiding this comment.
Open question, do we have SIMD bit-packing anyplace? Or is it only unpacking.
There was a problem hiding this comment.
I only remember seeing SIMD un-packing.
Never really thought packing would be important as insert jobs tend to have overhead at other places. But if not present might be worth thinking.
cc @pitrou
| arrow::internal::unpack(packed_integers.data(), encoded_integers.data(), | ||
| static_cast<int>(num_elements), for_info.bit_width()); | ||
| } else { | ||
| std::memset(encoded_integers.data(), 0, num_elements * sizeof(ExactType)); |
There was a problem hiding this comment.
Is it worth doing a memset of the ForValue here so we can skip adding it back?
There was a problem hiding this comment.
memset can't fill with frame_of_reference — it's byte-wise, and frame_of_reference is 4 or 8 bytes. The replacement would have to be std::fill_n, which costs the same N stores as the current memset(0). No write savings.
| /// \param[in] num_elements number of elements in this vector | ||
| /// \param[in] bw bits per element | ||
| /// \return the size in bytes of the bitpacked data | ||
| static int64_t GetBitPackedSize(int32_t num_elements, uint8_t bw) { |
There was a problem hiding this comment.
MIght have already commented but this probably belongs in a bitutil class.
| // Phase 1: Compress all vectors and collect them | ||
| std::vector<AlpEncodedVector<T>> encoded_vectors; | ||
| const int64_t num_vectors = | ||
| (element_count + vector_size - 1) / vector_size; |
There was a problem hiding this comment.
I think this operation is repeated in a few places, does it pay to factor it out?
| // Integration Tests | ||
| // ============================================================================ | ||
|
|
||
| TEST(AlpIntegrationTest, LargeFloatDataset) { |
There was a problem hiding this comment.
can we to type parameterization on these tests, to avoid the duplicate code?
There was a problem hiding this comment.
Folded into TYPED_TEST(AlpIntegrationTest, RandomAndExtremes)
|
|
||
| TEST(AlpIntegrationTest, LargeFloatDataset) { | ||
| std::mt19937 rng(12345); | ||
| std::uniform_real_distribution<float> dist(-1000.0f, 1000.0f); |
There was a problem hiding this comment.
Is this a wide-enough range in practice to test of any overflow scenarios?
There was a problem hiding this comment.
Added numeric_limits extremes (lowest/max/min/denorm_min/±0.0) to force exception path.
| std::vector<TypeParam> output(input.size()); | ||
| compressor.DecompressVector(encoded, AlpIntegerEncoding::kForBitPack, output.data()); | ||
|
|
||
| EXPECT_EQ(std::memcmp(output.data(), input.data(), input.size() * sizeof(TypeParam)), |
There was a problem hiding this comment.
nit: I think you should be able to use EXPECT_THAT(output, ElementsAreArray(input));
There was a problem hiding this comment.
Did NOT use ElementsAreArray — would silently break NegativeZero / AllNaN tests (-0.0 == 0.0. true, NaN != NaN). Wrote IsBitwiseEqual<T> matcher; ~17 sites converted; 2 byte-buffer comparisons kept as memcmp.
| ASSERT_OK(AlpCodec<TypeParam>::template Decode<TypeParam>( | ||
| static_cast<int32_t>(kBatchSize), comp2.data(), comp_size2, output2.data())); | ||
|
|
||
| EXPECT_EQ(std::memcmp(output1.data(), batch1.data(), kBatchSize * sizeof(TypeParam)), 0); |
There was a problem hiding this comment.
same general comment on testing for equality.
| comp_size, output.data())); | ||
|
|
||
| // Verify successful decode | ||
| EXPECT_EQ(std::memcmp(output.data(), input.data(), input.size() * sizeof(double)), 0); |
There was a problem hiding this comment.
where does truncated/corruption happen, how does this successfully decode?
There was a problem hiding this comment.
Good catch. Rewrote to truncate
|
|
||
| using namespace arrow::util::alp; | ||
|
|
||
| static void printHex(const std::string& name, const uint8_t* data, size_t len) { |
There was a problem hiding this comment.
seems like this file and the next file aren't compiled? can they be removed for now?
| add_parquet_benchmark(encoding_alp_benchmark) | ||
|
|
||
| add_executable(generate-alp-parquet | ||
| ${PROJECT_SOURCE_DIR}/src/arrow/util/alp/generate_alp_parquet.cc) |
There was a problem hiding this comment.
its a little strange to have this cross directory target. I think I commented that this wasn't compiled above, could we move the here maybe if we need it?
emkornfield
left a comment
There was a problem hiding this comment.
OK, I think I've done a complete read through now of all the code. I think the one other thing we should figure out is how to fall back to plain encoding at some point if ALP is completely failing on the dataset (i.e. it is consistently adding pages that take more space the PLAIN encoding). I think this can be done in a follow-up.
Co-authored-by: dhirhan17@gmail.com
@Reviewer : Suggested order : Outdated, will update shortly in which to look at the code while reviewing.
Rationale for this change
ALP significantly improves on the compression ratio and decompression speed over of float/double columns over other encoding/compression techniques.
Spec
Spec
This PR also contains a terse version of the spec in the file cpp/src/arrow/util/alp/ALP_Encoding_Specification_terse.md which can go in the Encodings.md
Parquet Format PR
Dataset PR (parquet-testing)
apache/parquet-testing#100
What changes are included in this PR?
This PR
Introduces ALP (pseudo-decimal) encoding into c++ arrow code.
We also provide benchmarks and dataset to prove the effectiveness of the above algorithm.
Adding above needed us to add following classes.
Integration of the above code was done in
Are these changes tested?
Unit tests
Benchmark tests
Are there any user-facing changes?
DuckDB