1 change: 1 addition & 0 deletions data/README.md
@@ -61,6 +61,7 @@
| datapage_v2_empty_datapage.snappy.parquet | A compressed FLOAT column with DataPageV2 and a single row whose value is null. The file uses Snappy compression, but the data page contains no bytes to decompress (see [related issue](https://github.com/apache/arrow-rs/issues/7388)). A reader must not attempt to decompress the zero-length buffer, because an empty buffer is not a valid Snappy stream. |
| unknown-logical-type.parquet | A file containing a column annotated with a LogicalType whose identifier has been set to an arbitrarily high value, to check the behaviour of an old reader reading a file written by a new writer containing an unsupported type (see [related issue](https://github.com/apache/arrow/issues/41764)). |
| int96_from_spark.parquet | Single column of (deprecated) int96 values that originated as Apache Spark microsecond-resolution timestamps. Some values are outside the range typically representable by 64-bit nanosecond-resolution timestamps. See [int96_from_spark.md](int96_from_spark.md) for details. |
| int96_timestamp_order.parquet | Single `required int96` column written with the `INT96_TIMESTAMP_ORDER` column order ([parquet-format #584](https://github.com/apache/parquet-format/pull/584)). Values are chosen so a byte-wise comparison disagrees with the chronological order, so the min/max statistics (and column index) are only correct for a reader that honors the new order. See [int96_timestamp_order.md](int96_timestamp_order.md) for details. |
| binary_truncated_min_max.parquet | A file containing six columns with exact, fully-truncated and partially-truncated max and min statistics and with the expected is_{min/max}_value_exact. (see [note](Binary-truncated-min-and-max-statistics)).|

TODO: Document what each file is in the table above.
77 changes: 77 additions & 0 deletions data/int96_timestamp_order.md
@@ -0,0 +1,77 @@
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->

# `int96_timestamp_order.parquet`

A single `required int96` column written with the `INT96_TIMESTAMP_ORDER` column order added in
[parquet-format #584](https://github.com/apache/parquet-format/pull/584). It exercises a reader's
ability to honor the new order: the column carries min/max statistics and a column index, and the
footer's `column_orders[0]` is set to `INT96_TIMESTAMP_ORDER` (union field 3) rather than
`TYPE_ORDER`.

An INT96 timestamp occupies 12 bytes: an 8-byte little-endian nanoseconds-of-day value followed by a
4-byte little-endian Julian day. The defined order compares the Julian day (as a signed int32)
first, then the nanoseconds (as a signed int64), which yields chronological order.
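
As a minimal sketch of that comparison (Python, standard library only), the 12 bytes can be unpacked and ordered by a `(julian_day, nanos)` tuple. The helper names `decode_int96` and `int96_sort_key` are illustrative assumptions and do not come from parquet-java or any other Parquet library.

```python
import struct


def decode_int96(raw: bytes) -> tuple[int, int]:
    """Split a 12-byte INT96 value into (nanos_of_day, julian_day)."""
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    # "<qi": little-endian signed int64 followed by little-endian signed int32,
    # with no padding between the two fields.
    nanos, julian_day = struct.unpack("<qi", raw)
    return nanos, julian_day


def int96_sort_key(raw: bytes) -> tuple[int, int]:
    """Key that sorts INT96 values chronologically: day first, then nanos."""
    nanos, julian_day = decode_int96(raw)
    return (julian_day, nanos)
```

Passing `int96_sort_key` as the `key` argument of `min`, `max`, or `sorted` gives the chronological result, whereas comparing the raw `bytes` objects directly compares the low-order nanosecond byte first.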

## Why this file is non-trivial

The values are deliberately chosen so that a **byte-wise (lexicographic) comparison disagrees with
the chronological order**. Because the low-order nanosecond bytes come first in the little-endian
layout, a reader that compares the raw 12 bytes (or that ignores the new order) computes the wrong
min/max. A reader must implement the chronological comparison to pass.

| Value          | Julian day | Nanos of day      | Timestamp                          | First byte (low-order nanos byte) |
|----------------|------------|-------------------|------------------------------------|------------|
| EARLY | 2440000 | 123 | 1968-05-23 00:00:00.000000123 | `0x7B` |
| SAME_DAY_EARLY | 2440588 | 1000 | 1970-01-01 00:00:00.000001000 | `0xE8` |
| LATE_IN_DAY | 2440588 | 86399999999999 | 1970-01-01 23:59:59.999999999 | `0xFF` |
| NEXT_DAY | 2440589 | 0 | 1970-01-02 00:00:00.000000000 | `0x00` |
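
The Timestamp column can be reproduced from the other two with a small conversion. The sketch below is illustrative only: `int96_to_text` is a hypothetical helper, and it relies on the mapping shown in the table, where Julian day 2440588 corresponds to 1970-01-01. The time of day is formatted by hand because Python's `datetime` has only microsecond resolution.

```python
from datetime import date, timedelta

# Taken from the table above: Julian day 2440588 is 1970-01-01.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588


def int96_to_text(julian_day: int, nanos_of_day: int) -> str:
    """Render (julian_day, nanos_of_day) as 'YYYY-MM-DD hh:mm:ss.nnnnnnnnn'."""
    day = date(1970, 1, 1) + timedelta(days=julian_day - JULIAN_DAY_OF_UNIX_EPOCH)
    seconds, frac = divmod(nanos_of_day, 1_000_000_000)
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{day.isoformat()} {hours:02d}:{minutes:02d}:{secs:02d}.{frac:09d}"


for name, jd, nanos in [
    ("EARLY", 2440000, 123),
    ("SAME_DAY_EARLY", 2440588, 1000),
    ("LATE_IN_DAY", 2440588, 86399999999999),
    ("NEXT_DAY", 2440589, 0),
]:
    print(f"{name:15s} {int96_to_text(jd, nanos)}")
```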

The values are written to the file out of chronological order, as
`LATE_IN_DAY, NEXT_DAY, EARLY, SAME_DAY_EARLY`, so that the correct min and max are neither the
first nor the last value in the file.

- Correct (`INT96_TIMESTAMP_ORDER`) min/max: **EARLY / NEXT_DAY**
- Byte-wise (incorrect) min/max would be: **NEXT_DAY / LATE_IN_DAY** (ordered by the leading
  nanosecond byte, compared as unsigned: `0x00 < 0x7B < 0xE8 < 0xFF`)

The min/max written to the statistics (and the column index) are therefore:

```
min = 0x 7B 00 00 00 00 00 00 00 40 3B 25 00 (EARLY: nanos 123, Julian day 2440000)
max = 0x 00 00 00 00 00 00 00 00 8D 3D 25 00 (NEXT_DAY: nanos 0, Julian day 2440589)
```
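
The difference between the two orderings can be checked without a Parquet reader. The following sketch (Python, standard library only) re-encodes the four values from the table and computes the min/max both ways. It is not the code that generated the file, and `encode_int96` and `chronological_key` are hypothetical helper names.

```python
import struct


def encode_int96(nanos_of_day: int, julian_day: int) -> bytes:
    """Pack an INT96 timestamp: little-endian int64 nanos, then int32 Julian day."""
    return struct.pack("<qi", nanos_of_day, julian_day)


values = {
    "EARLY": encode_int96(123, 2440000),
    "SAME_DAY_EARLY": encode_int96(1000, 2440588),
    "LATE_IN_DAY": encode_int96(86399999999999, 2440588),
    "NEXT_DAY": encode_int96(0, 2440589),
}
write_order = ["LATE_IN_DAY", "NEXT_DAY", "EARLY", "SAME_DAY_EARLY"]


def chronological_key(name: str) -> tuple[int, int]:
    nanos, julian_day = struct.unpack("<qi", values[name])
    return (julian_day, nanos)


# Chronological order: Julian day first, then nanos of day.
correct = (min(write_order, key=chronological_key),
           max(write_order, key=chronological_key))
# Python compares bytes objects as unsigned, lexicographically.
bytewise = (min(write_order, key=values.get),
            max(write_order, key=values.get))

print("correct :", correct)    # ('EARLY', 'NEXT_DAY')
print("bytewise:", bytewise)   # ('NEXT_DAY', 'LATE_IN_DAY')
print("min =", values[correct[0]].hex(" ").upper())
print("max =", values[correct[1]].hex(" ").upper())
```

The two printed byte strings should match the `min` and `max` listed above, which is a quick way to confirm what a reader should find in the statistics.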

## How it was generated

Written by parquet-java (parquet-mr 1.18.0-SNAPSHOT) via the
`TestInt96TimestampStatistics#writeInt96TimestampOrderInteropFile` test:

```
mvn -pl parquet-hadoop test \
  -Dtest='TestInt96TimestampStatistics#writeInt96TimestampOrderInteropFile' \
  -Dparquet.testing.data.dir=<parquet-testing>/data
```

## Schema

```
message int96_timestamp_order {
  required int96 ts;
}
```
Binary file added data/int96_timestamp_order.parquet
Binary file not shown.