GH-583: Introduce chronological ordering for INT96 timestamps#584
GH-583: Introduce chronological ordering for INT96 timestamps#584divjotarora wants to merge 6 commits into
Conversation
877629a to
de95300
Compare
de95300 to
98d8747
Compare
|
I have some general questions with regard to this proposal. A new column order requires change on the writer side. If we need to change the writer code anyway, isn't it simpler to just let writers emit |
|
I don't think the intent here is to support old readers. Instead, this is to aid new readers that want to be able to trust INT96 statistics generated by new writers. We've already established that most old readers should be able to tolerate an unknown column order, with arrow-rs pre-57.0 being a notable exception. |
What is the point of this? If both readers and writers are new, why not using INT64-typed timestamps instead? INT96 are marked as deprecated for years. |
@wgtmac This is a fair point, but some large engines like Spark still default to writing INT96 timestamps and not having a way to signal that stats are definitively correct is useful there. I'm also following up with Spark folks to see how we can make progress and move towards INT64 timestamps by default, but in the short term I still feel this change is targeted and useful.
This change is no more dangerous / breaking for old readers than the recent change to add the IEEE floating point order. IIRC there was a message in the mailing list about some older versions of arrow-rs failing on unknown sort orders, but I believe that's been fixed as well. |
etseidl
left a comment
There was a problem hiding this comment.
Thanks, changes look good 👍
BTW I have a rust PoC nearly ready to go.
|
Still needs to add tests, but Rust PoC is up apache/arrow-rs#10106 |
There is also no current alternative for nano timestamps that need to span the SQL time RAnge (years 0001-9999) |
As someone not really in the Spark world, doesn't the persistence of this type imply there's a need that Parquet isn't filling without it? Is it worth reconsidering its deprecation? I agree with @wgtmac that it's a bit weird to say "don't use this type! but if you do, then change your writers to emit this new sort order." |
|
For some (sometimes contentious) historical context, see
Together, they read like the stages of grief. I think this PR is "acceptance". |
|
parquet-java reference implementation: apache/parquet-java#3610 |
rdblue
left a comment
There was a problem hiding this comment.
Seems fine to me, but I would prefer not to mix serialization in the order definition.
|
@divjotarora I've verified the Rust PoC can read the file from apache/parquet-testing#115 correctly and generate identical statistics on write. I think it may be time to bring this up for a vote. |
| * - compare the last 4 bytes (days) as a little-endian 32-bit signed integer | ||
| * - if equal last 4 bytes, compare the first 8 bytes as a little-endian | ||
| * 64-bit signed integer (nanos) | ||
| * If TYPE_ORDER is used for an INT96 column, readers should ignore statistics |
There was a problem hiding this comment.
nit: do we need to make it explicit that statistics are min/max fields in the column statistics and min_values/max_values fields in the ColumnIndex?
There was a problem hiding this comment.
I don't think it's necessary because the section below for TYPE_ORDER of floating point values also says:
If TYPE_ORDER is used for floating point types, then the following
compatibility rules should be applied when reading statistics:
But I pushed a small change to clarify, let me know if it's OK
Rationale for this change
When writing INT96 timestamp columns, writers either omit stats altogether or emit stats using the
TYPE_ORDERordering. However, some writers were incorrectly emitting stats via bytewise comparisons, which does not result in chronological INT96 ordering. These stats are incorrect and must be ignored by readers. The goal of this change is to introduce a mechanism for readers to correctly determine the validity of an INT96 column's statistics and ignore them if they are potentially incorrect.What changes are included in this PR?
This PR specifies a new
INT96_TIMESTAMP_ORDERsort order specifically used for INT96 timestamp statistics. Additionally, it suggests writers use this ordering when emitting INT96 stats and that readers ignore stats for INT96TYPE_ORDER'd columns.Do these changes have PoC implementations?
In-progress
Closes #583