Collation prototype: field annotation, schema round-trip, comparator #16974

Draft: laskoviymishka wants to merge 1 commit into apache:main from laskoviymishka:prototype/collation-support

@laskoviymishka

Contributor

Proof of concept for collation support in Java, mirroring the iceberg-go prototype and the spec change in #16972.

  • Types.NestedField carries an optional collation (String). StringType is a stateless singleton, so the collation lives on the field, not the type; it is only valid on string fields and is threaded through the builder, the asOptional/asRequired/withFieldId copies, and equals/hashCode/toString.
  • SchemaParser (de)serializes the field's "collation" attribute.
  • Comparators.charSequences(collation) returns a locale-aware comparator backed by java.text.Collator; a null or "utf8" collation yields the default UTF-8 byte-order comparator. A full implementation would use ICU for the complete modifier set, matching the spec's collation provider.
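
As a rough illustration of the comparator selection described in the last bullet, here is a minimal standalone sketch. It is not the code in this PR: `CollationSketch` and `compareCodePoints` are hypothetical names, the collation string is assumed to be a BCP 47 language tag such as "en" (the real naming scheme is defined by the spec change), and the default branch compares by code point, which agrees with UTF-8 byte order for well-formed text.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the Comparators.charSequences(collation) idea.
final class CollationSketch {
  private CollationSketch() {}

  static Comparator<CharSequence> charSequences(String collation) {
    // null or "utf8" selects the default, non-locale-aware ordering.
    if (collation == null || "utf8".equals(collation)) {
      return CollationSketch::compareCodePoints;
    }
    // Assumption: the collation name is a BCP 47 language tag, e.g. "en" or "sv".
    Locale locale = Locale.forLanguageTag(collation);
    Collator collator = Collator.getInstance(locale);
    // Collator compares Strings, so CharSequence inputs are converted first.
    return (left, right) -> collator.compare(left.toString(), right.toString());
  }

  // Code-point order; for well-formed text this matches UTF-8 byte order.
  static int compareCodePoints(CharSequence left, CharSequence right) {
    int i = 0;
    int j = 0;
    while (i < left.length() && j < right.length()) {
      int cpLeft = Character.codePointAt(left, i);
      int cpRight = Character.codePointAt(right, j);
      if (cpLeft != cpRight) {
        return Integer.compare(cpLeft, cpRight);
      }
      i += Character.charCount(cpLeft);
      j += Character.charCount(cpRight);
    }
    // Shared prefix: the sequence with content left over sorts last.
    return Integer.compare(left.length() - i, right.length() - j);
  }

  public static void main(String[] args) {
    List<CharSequence> values = new ArrayList<>(List.of("banana", "Cherry", "apple"));
    values.sort(charSequences("utf8"));
    System.out.println(values);
    values.sort(charSequences("en"));
    System.out.println(values);
  }
}
```

In the default ordering, uppercase ASCII letters sort before lowercase ones, so "Cherry" comes first; with the "en" collator the same values sort alphabetically regardless of case. An ICU-backed version would replace only the `Collator.getInstance` branch.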

Out of scope for this POC (documented in the spec change): collation-aware bounds in data_file (collation_bounds) and metrics-evaluator pruning.
