feat: Add NaN-counts support for parquet metadata#8158
Conversation
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
|
Thanks for taking this on @Xuanwo, I sure didn't feel like doing this again 😅. Given that this doesn't yet address the changes to the page indexes nor yet add the new |
alamb
left a comment
There was a problem hiding this comment.
Thank you so much for helping this move along @Xuanwo
I think the code looks pretty good but is missing a few things as @etseidl describes
- The new IEEE ordering
- Testing statistics in the page index
For testing, I recomend a "end to end" style test in https://github.com/apache/arrow-rs/tree/main/parquet/tests somewhere
- Write floating point data to a real parquet file
- REad the metadata back from the file and assert the statistics
That would ensure we hooked everything up right and the end user API was working as expected
| ); | ||
| assert_eq!(stats.max_opt().unwrap(), &ByteArray::from(f16::INFINITY)); | ||
| assert_eq!(stats.nan_count_opt(), Some(2)); | ||
| } |
There was a problem hiding this comment.
I think the spec also talks about the case for all nans, so that would also be a good case to check
There was a problem hiding this comment.
Thank you for the review, I will polish the tests side.
|
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look |
Signed-off-by: Xuanwo <github@xuanwo.io>
|
@Xuanwo do you have any time to continue working on this? If not, I'm finally willing to take a stab at merging this with my PoC for the other NaN proposal (#7408) so that we're ready to go if apache/parquet-format#514 is merged. Both versions need to be updated to work with the new thrift parsing anyway. |
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Rationale for this change
Add NaN-count parquet for parquet
Are these changes tested?
Yes, well tested. All cases mentioned in #8156 should have been tested. Let me know if there are something missed.
Are there any user-facing changes?
nan-count is added in a compatible way without any API breaking changes.
One exception is
Statistics::newin theformat.rs, but I think it's not part of the API surface.This PR was primarily authored with Claude Code using Opus 4.1 and then hand-reviewed by me. I aimed to keep it aligned with our goals, though I may have missed minor issues. I am responsible for every change made in this PR. Please flag anything that feels off, I’ll fix it quickly.