Skip to content

feat: Add NaN count support to Parquet Statistics with proper Thrift serialization#17599

Closed
mohsaka wants to merge 1 commit into
facebookincubator:mainfrom
mohsaka:iceberg_nan
Closed

feat: Add NaN count support to Parquet Statistics with proper Thrift serialization#17599
mohsaka wants to merge 1 commit into
facebookincubator:mainfrom
mohsaka:iceberg_nan

Conversation

@mohsaka

@mohsaka mohsaka commented May 22, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds comprehensive support for nan_count in Parquet Statistics, including proper Thrift serialization, deserialization, and debugging support. Also fixes parameter order inconsistencies in the Statistics API.

Testing

  • All existing tests updated and passing
  • Compilation successful across all affected files
  • Thrift serialization/deserialization properly handles the new field with field ID 7

Derived from:
IBM#1836

Co-authored By: @PingLiuPing

@netlify

netlify Bot commented May 22, 2026

Copy link
Copy Markdown

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 5e0c1ee
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/6a10d9e30c02dc0008497f1c

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 22, 2026
@github-actions

github-actions Bot commented May 22, 2026

Copy link
Copy Markdown

Build Impact Analysis

Selective Build Targets (building these covers all 63 affected)

cmake --build _build/release --target spark_aggregation_fuzzer_test velox_dwio_arrow_parquet_writer_test velox_dwio_iceberg_reader_benchmark velox_dwio_parquet_common_test velox_dwio_parquet_page_reader_test velox_dwio_parquet_reader_benchmark velox_dwio_parquet_reader_test velox_dwio_parquet_rlebp_decoder_test velox_dwio_parquet_structure_decoder_benchmark velox_dwio_parquet_structure_decoder_test velox_dwio_parquet_table_scan_test velox_dwio_parquet_tpch_test velox_exec_SpatialJoinTest velox_exec_test_group0 velox_exec_test_group1 velox_exec_test_group2 velox_exec_test_group3 velox_exec_test_group4 velox_exec_test_group5 velox_exec_test_group6 velox_exec_test_group7 velox_exec_test_group8 velox_gcs_insert_test velox_gcs_multiendpoints_test velox_hdfs_insert_test velox_hive_connector_test velox_hive_iceberg_deletion_vector_test velox_hive_iceberg_deletion_vector_writer_test velox_hive_iceberg_dwrf_insert_test velox_hive_iceberg_equality_delete_test velox_hive_iceberg_insert_test velox_hive_iceberg_test velox_hive_writer_options_adapter_test velox_parquet_e2e_filter_test velox_parquet_writer_sink_test velox_parquet_writer_test velox_query_replayer velox_s3insert_test velox_s3multiendpoints_test velox_s3read_test velox_s3registration_test velox_sort_benchmark velox_spark_query_runner_test velox_tool_trace_test velox_tpcds_benchmark velox_tpch_benchmark velox_wave_benchmark velox_wave_exec_test

Total affected: 63/580 targets

Warning: 1 file(s) could not be mapped to any target. A full build may be needed.

  • velox/dwio/parquet/thrift/parquet.thrift
Affected targets (63)

Directly changed (20)

Target Changed Files
velox_dwio_arrow_parquet_writer ParquetThriftTypes.h
velox_dwio_arrow_parquet_writer_lib Metadata.cpp, ParquetThriftTypes.h, Statistics.cpp, Statistics.h, ThriftInternal.h
velox_dwio_arrow_parquet_writer_test ParquetThriftTypes.h, Statistics.h, StatisticsTest.cpp, ThriftInternal.h
velox_dwio_arrow_parquet_writer_test_lib ParquetThriftTypes.h, Statistics.h, ThriftInternal.h
velox_dwio_native_parquet_reader ParquetThriftTypes.h
velox_dwio_parquet_common ParquetThriftTypes.h
velox_dwio_parquet_page_reader_test ParquetThriftTypes.h
velox_dwio_parquet_reader_benchmark ParquetThriftTypes.h
velox_dwio_parquet_reader_benchmark_lib ParquetThriftTypes.h
velox_dwio_parquet_reader_test ParquetThriftTypes.h
velox_dwio_parquet_table_scan_test ParquetThriftTypes.h
velox_dwio_parquet_thrift ParquetThriftTypes.cpp, ParquetThriftTypes.h
velox_dwio_parquet_writer ParquetThriftTypes.h
velox_hive_connector_test ParquetThriftTypes.h
velox_hive_iceberg_splitreader ParquetThriftTypes.h, Statistics.h
velox_parquet_e2e_filter_test ParquetThriftTypes.h
velox_parquet_writer_sink_test ParquetThriftTypes.h
velox_parquet_writer_test ParquetThriftTypes.h, Statistics.h
velox_spark_query_runner ParquetThriftTypes.h
velox_wave_benchmark ParquetThriftTypes.h

Transitively affected (43)

  • spark_aggregation_fuzzer_test
  • velox_dwio_iceberg_reader_benchmark
  • velox_dwio_parquet_common_test
  • velox_dwio_parquet_reader
  • velox_dwio_parquet_rlebp_decoder_test
  • velox_dwio_parquet_structure_decoder_benchmark
  • velox_dwio_parquet_structure_decoder_test
  • velox_dwio_parquet_tpch_test
  • velox_exec_SpatialJoinTest
  • velox_exec_test_group0
  • velox_exec_test_group1
  • velox_exec_test_group2
  • velox_exec_test_group3
  • velox_exec_test_group4
  • velox_exec_test_group5
  • velox_exec_test_group6
  • velox_exec_test_group7
  • velox_exec_test_group8
  • velox_gcs_insert_test
  • velox_gcs_multiendpoints_test
  • velox_hdfs_insert_test
  • velox_hive_iceberg_deletion_vector_test
  • velox_hive_iceberg_deletion_vector_writer_test
  • velox_hive_iceberg_dwrf_insert_test
  • velox_hive_iceberg_equality_delete_test
  • velox_hive_iceberg_insert_test
  • velox_hive_iceberg_test
  • velox_hive_writer_options_adapter_test
  • velox_query_benchmark
  • velox_query_replayer
  • velox_query_trace_replayer_base
  • velox_s3insert_test
  • velox_s3multiendpoints_test
  • velox_s3read_test
  • velox_s3registration_test
  • velox_sort_benchmark
  • velox_spark_query_runner_test
  • velox_tool_trace_test
  • velox_tpcds_benchmark
  • velox_tpcds_benchmark_lib
  • velox_tpch_benchmark
  • velox_tpch_benchmark_lib
  • velox_wave_exec_test

Fast path • Graph from main@00f1eb68ce794567ef3aea9e322082d8fddda4cc

@mohsaka mohsaka marked this pull request as ready for review May 22, 2026 21:50
@mohsaka mohsaka requested a review from majetideepak as a code owner May 22, 2026 21:50
@aditi-pandit aditi-pandit requested a review from Copilot May 22, 2026 21:53
@mohsaka mohsaka requested a review from aditi-pandit May 22, 2026 21:55

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Standardizes the placement of the nan_count argument across the Parquet writer Statistics APIs and begins wiring nan_count through the Thrift Statistics struct to support round-tripping NaN counts in Parquet metadata.

Changes:

  • Reordered nanCount to be grouped with the other count parameters in Statistics::make() and makeStatistics() (and updated call sites/tests accordingly).
  • Updated EncodedStatistics to treat nanCount as a settable/statistically significant field (isSet() + renamed setter to setNanCount()).
  • Extended Parquet Thrift Statistics struct with nan_count + __isset support and related copy/swap/operator logic.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
velox/dwio/parquet/writer/arrow/tests/StatisticsTest.cpp Updates test construction calls to match the new nanCount parameter ordering.
velox/dwio/parquet/writer/arrow/Statistics.h Reorders API parameters and updates EncodedStatistics to track/set NaN counts consistently.
velox/dwio/parquet/writer/arrow/Statistics.cpp Adjusts implementations, constructors, and encoding paths for the new nanCount ordering and setter rename.
velox/dwio/parquet/writer/arrow/Metadata.cpp Updates stats materialization from Thrift metadata to pass nanCount in the new position.
velox/dwio/parquet/thrift/ParquetThriftTypes.h Adds nan_count field + __isset flag and documents it as a Velox extension.
velox/dwio/parquet/thrift/ParquetThriftTypes.cpp Adds __set_nan_count and updates swap/copy/assignment for the new field (but see review comments re: serialization/deserialization + regeneration).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +685 to +688
void Statistics::__set_nan_count(const int64_t val) {
this->nan_count = val;
__isset.nan_count = true;
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parquet.thrift updated with new field.
Read updated

     case 7:
        if (ftype == ::apache::thrift::protocol::T_I64) {
          xfer += iprot->readI64(this->nan_count);
          this->__isset.nan_count = true;
        } else {
          xfer += iprot->skip(ftype);
        }
        break;

Comment on lines 825 to 831
swap(a.min, b.min);
swap(a.null_count, b.null_count);
swap(a.distinct_count, b.distinct_count);
swap(a.nan_count, b.nan_count);
swap(a.max_value, b.max_value);
swap(a.min_value, b.min_value);
swap(a.__isset, b.__isset);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Write updated.

  if (this->__isset.nan_count) {
    xfer += oprot->writeFieldBegin(
        "nan_count", ::apache::thrift::protocol::T_I64, 7);
    xfer += oprot->writeI64(this->nan_count);
    xfer += oprot->writeFieldEnd();
  }

Comment on lines 846 to 850
min = other1.min;
null_count = other1.null_count;
distinct_count = other1.distinct_count;
nan_count = other1.nan_count;
max_value = other1.max_value;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

void Statistics::printTo(std::ostream& out) const {
  using ::apache::thrift::to_string;
  out << "Statistics(";
  out << "max=";
  (__isset.max ? (out << to_string(max)) : (out << "<null>"));
  out << ", " << "min=";
  (__isset.min ? (out << to_string(min)) : (out << "<null>"));
  out << ", " << "null_count=";
  (__isset.null_count ? (out << to_string(null_count)) : (out << "<null>"));
  out << ", " << "distinct_count=";
  (__isset.distinct_count ? (out << to_string(distinct_count))
                          : (out << "<null>"));
  out << ", " << "max_value=";
  (__isset.max_value ? (out << to_string(max_value)) : (out << "<null>"));
  out << ", " << "min_value=";
  (__isset.min_value ? (out << to_string(min_value)) : (out << "<null>"));
  out << ", " << "nan_count=";
  (__isset.nan_count ? (out << to_string(nan_count)) : (out << "<null>"));
  out << ")";
}

Comment thread velox/dwio/parquet/thrift/ParquetThriftTypes.h
Co-authored-by: Ping Liu <ping.liu.ping@gmail.com>
@mohsaka mohsaka changed the title feat: Standardize nan_count parameter position in Statistics API feat: Add NaN count support to Parquet Statistics with proper Thrift serialization May 22, 2026
@mohsaka

mohsaka commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Closing as these have been removed via
#16019

@mohsaka mohsaka closed this Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants