feat: Add NaN count support to Parquet Statistics with proper Thrift serialization#17599
mohsaka wants to merge 1 commit into
Build Impact Analysis
Selective Build Targets (building these covers all 63 affected)
Total affected: 63/580 targets
Affected targets (63)
- Directly changed (20)
- Transitively affected (43)
Fast path • Graph from main@00f1eb68ce794567ef3aea9e322082d8fddda4cc
Pull request overview
Standardizes the placement of the nan_count argument across the Parquet writer Statistics APIs and begins wiring nan_count through the Thrift Statistics struct to support round-tripping NaN counts in Parquet metadata.
Changes:
- Reordered `nanCount` to be grouped with the other count parameters in `Statistics::make()` and `makeStatistics()` (and updated call sites/tests accordingly).
- Updated `EncodedStatistics` to treat `nanCount` as a settable/statistically significant field (`isSet()` + renamed setter to `setNanCount()`).
- Extended Parquet Thrift `Statistics` struct with `nan_count` + `__isset` support and related copy/swap/operator logic.
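As a rough illustration of the second bullet, the sketch below models how a settable NaN count might take part in an `isSet()` check. `EncodedStatsSketch` and all of its members are hypothetical simplifications for illustration only; they are not the actual Velox `EncodedStatistics` definition.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for EncodedStatistics. Only the count
// fields are modeled; the real class in Velox has more members.
struct EncodedStatsSketch {
  int64_t nullCount = 0;
  int64_t distinctCount = 0;
  int64_t nanCount = 0;
  bool hasNullCount = false;
  bool hasDistinctCount = false;
  bool hasNanCount = false;

  // Each setter records the value and marks the field as present.
  // Returning *this allows calls to be chained.
  EncodedStatsSketch& setNullCount(int64_t value) {
    nullCount = value;
    hasNullCount = true;
    return *this;
  }

  EncodedStatsSketch& setNanCount(int64_t value) {
    nanCount = value;
    hasNanCount = true;
    return *this;
  }

  // The object counts as "set" if any field is present, including the
  // NaN count.
  bool isSet() const {
    return hasNullCount || hasDistinctCount || hasNanCount;
  }
};
```

Under these assumptions, a statistics object that carries only a NaN count is still reported as set, so it would not be dropped before encoding.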
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| velox/dwio/parquet/writer/arrow/tests/StatisticsTest.cpp | Updates test construction calls to match the new nanCount parameter ordering. |
| velox/dwio/parquet/writer/arrow/Statistics.h | Reorders API parameters and updates EncodedStatistics to track/set NaN counts consistently. |
| velox/dwio/parquet/writer/arrow/Statistics.cpp | Adjusts implementations, constructors, and encoding paths for the new nanCount ordering and setter rename. |
| velox/dwio/parquet/writer/arrow/Metadata.cpp | Updates stats materialization from Thrift metadata to pass nanCount in the new position. |
| velox/dwio/parquet/thrift/ParquetThriftTypes.h | Adds nan_count field + __isset flag and documents it as a Velox extension. |
| velox/dwio/parquet/thrift/ParquetThriftTypes.cpp | Adds __set_nan_count and updates swap/copy/assignment for the new field (but see review comments re: serialization/deserialization + regeneration). |
void Statistics::__set_nan_count(const int64_t val) {
  this->nan_count = val;
  __isset.nan_count = true;
}
parquet.thrift updated with new field.
Read updated
case 7:
  if (ftype == ::apache::thrift::protocol::T_I64) {
    xfer += iprot->readI64(this->nan_count);
    this->__isset.nan_count = true;
  } else {
    xfer += iprot->skip(ftype);
  }
  break;
swap(a.min, b.min);
swap(a.null_count, b.null_count);
swap(a.distinct_count, b.distinct_count);
swap(a.nan_count, b.nan_count);
swap(a.max_value, b.max_value);
swap(a.min_value, b.min_value);
swap(a.__isset, b.__isset);
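The swap above exchanges both the `nan_count` value and the `__isset` flags. A minimal sketch of why the two have to move together, using hypothetical `SwapStatsSketch` and `IssetSketch` types rather than the generated Thrift classes:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical presence flags, loosely modeled on the generated __isset struct.
struct IssetSketch {
  bool nanCount = false;
};

// Hypothetical stand-in for the Thrift Statistics struct (one field only).
struct SwapStatsSketch {
  int64_t nanCount = 0;
  IssetSketch isset;
};

// Swaps the value and its presence flag together. If only the value were
// swapped, the receiving object would hold a count that it still reports
// as absent.
void swapStats(SwapStatsSketch& a, SwapStatsSketch& b) {
  using std::swap;
  swap(a.nanCount, b.nanCount);
  swap(a.isset, b.isset);
}
```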
Write updated.
if (this->__isset.nan_count) {
  xfer += oprot->writeFieldBegin(
      "nan_count", ::apache::thrift::protocol::T_I64, 7);
  xfer += oprot->writeI64(this->nan_count);
  xfer += oprot->writeFieldEnd();
}
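The read and write snippets together are meant to round-trip the optional field with id 7. The sketch below mimics that behavior with a toy in-memory field list. `Field`, `FieldType`, `RoundTripStatsSketch`, `writeStats`, and `readStats` are hypothetical names and do not use the real Thrift protocol API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-ins for Thrift wire types; not the real Thrift protocol classes.
enum class FieldType { I64, String };

struct Field {
  int16_t id;
  FieldType type;
  int64_t i64Value;
};

// Hypothetical stand-in for the Thrift Statistics struct (one field only).
struct RoundTripStatsSketch {
  int64_t nanCount = 0;
  bool hasNanCount = false;
};

// Field id taken from the snippets in this PR.
constexpr int16_t kNanCountFieldId = 7;

// Emits the field only when it is marked as present, like the write path.
std::vector<Field> writeStats(const RoundTripStatsSketch& stats) {
  std::vector<Field> out;
  if (stats.hasNanCount) {
    out.push_back({kNanCountFieldId, FieldType::I64, stats.nanCount});
  }
  return out;
}

// Accepts field 7 only when it has the expected type, like the read path;
// any other id or a mismatched type is skipped.
RoundTripStatsSketch readStats(const std::vector<Field>& fields) {
  RoundTripStatsSketch stats;
  for (const Field& field : fields) {
    if (field.id == kNanCountFieldId && field.type == FieldType::I64) {
      stats.nanCount = field.i64Value;
      stats.hasNanCount = true;
    }
  }
  return stats;
}
```

In this model an unset count produces no field at all, so a reader sees it as absent rather than as zero.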
min = other1.min;
null_count = other1.null_count;
distinct_count = other1.distinct_count;
nan_count = other1.nan_count;
max_value = other1.max_value;
Updated
void Statistics::printTo(std::ostream& out) const {
using ::apache::thrift::to_string;
out << "Statistics(";
out << "max=";
(__isset.max ? (out << to_string(max)) : (out << "<null>"));
out << ", " << "min=";
(__isset.min ? (out << to_string(min)) : (out << "<null>"));
out << ", " << "null_count=";
(__isset.null_count ? (out << to_string(null_count)) : (out << "<null>"));
out << ", " << "distinct_count=";
(__isset.distinct_count ? (out << to_string(distinct_count))
: (out << "<null>"));
out << ", " << "max_value=";
(__isset.max_value ? (out << to_string(max_value)) : (out << "<null>"));
out << ", " << "min_value=";
(__isset.min_value ? (out << to_string(min_value)) : (out << "<null>"));
out << ", " << "nan_count=";
(__isset.nan_count ? (out << to_string(nan_count)) : (out << "<null>"));
out << ")";
}
Co-authored-by: Ping Liu <ping.liu.ping@gmail.com>
Closing as these have been removed via
Summary
Adds comprehensive support for `nan_count` in Parquet Statistics, including proper Thrift serialization, deserialization, and debugging support. Also fixes parameter order inconsistencies in the Statistics API.
Testing
Derived from:
IBM#1836
Co-authored By: @PingLiuPing