Add support for reading Theta sketch Puffin file #3574

Draft
ebyhr wants to merge 1 commit into apache:main from ebyhr:ebi/theta-sketches
Conversation

@ebyhr ebyhr commented Jun 27, 2026

Member

Rationale for this change

apache-datasketches-theta-v1 was already listed as a valid blob type in PuffinBlobMetadata and BlobMetadata, but pyiceberg had no way to read the actual sketch data. This PR implements deserialization so callers can access the NDV (number of distinct values) estimates stored in statistics Puffin files written by engines like Spark.
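
For readers unfamiliar with the layout, the following is a minimal, stdlib-only sketch of how the raw bytes of apache-datasketches-theta-v1 blobs could be located in a Puffin file. It is not the code in this PR: the helper names build_puffin and read_theta_blobs are hypothetical, the footer layout (magic, JSON payload, 4-byte little-endian payload size, 4-byte flags, magic) follows my reading of the Puffin spec, and compressed footers or blobs are deliberately left unsupported.

```python
import json
import struct

PUFFIN_MAGIC = b"PFA1"
THETA_BLOB_TYPE = "apache-datasketches-theta-v1"


def build_puffin(blobs):
    """Hypothetical demo helper: build an uncompressed Puffin file in memory.

    `blobs` is a list of (metadata_dict, payload_bytes) pairs.
    """
    body = bytearray(PUFFIN_MAGIC)
    metas = []
    for meta, payload in blobs:
        # Record where each blob starts and how long it is.
        metas.append({**meta, "offset": len(body), "length": len(payload)})
        body += payload
    footer_payload = json.dumps({"blobs": metas}).encode("utf-8")
    footer = (
        PUFFIN_MAGIC
        + footer_payload
        + struct.pack("<i", len(footer_payload))  # payload size, little-endian
        + bytes(4)                                # flags: nothing compressed
        + PUFFIN_MAGIC
    )
    return bytes(body) + footer


def read_theta_blobs(data):
    """Return {field_id: raw_sketch_bytes} for every theta sketch blob."""
    if data[:4] != PUFFIN_MAGIC or data[-4:] != PUFFIN_MAGIC:
        raise ValueError("not a Puffin file: bad magic")
    flags = data[-8:-4]
    if flags[0] & 1:
        # Assumption: bit 0 of the first flag byte marks a compressed footer.
        raise NotImplementedError("compressed footer payload is not handled")
    (payload_size,) = struct.unpack("<i", data[-12:-8])
    payload_end = len(data) - 12
    payload_start = payload_end - payload_size
    if data[payload_start - 4:payload_start] != PUFFIN_MAGIC:
        raise ValueError("bad magic at footer start")
    footer = json.loads(data[payload_start:payload_end].decode("utf-8"))

    sketches = {}
    for blob in footer["blobs"]:
        if blob["type"] != THETA_BLOB_TYPE:
            continue
        if blob.get("compression-codec"):
            raise NotImplementedError("compressed blobs are not handled")
        raw = data[blob["offset"]:blob["offset"] + blob["length"]]
        for field_id in blob["fields"]:
            sketches[field_id] = raw
    return sketches


# Placeholder payloads stand in for real serialized sketches.
demo_file = build_puffin([
    ({"type": THETA_BLOB_TYPE, "fields": [1],
      "snapshot-id": 1, "sequence-number": 1}, b"sketch-bytes"),
    ({"type": "deletion-vector-v1", "fields": [2],
      "snapshot-id": 1, "sequence-number": 1}, b"other"),
])
theta_blobs = read_theta_blobs(demo_file)
```

In a real reader, each value in theta_blobs would then be handed to a theta sketch deserializer (for example from the optional datasketches package) to obtain the distinct-value estimate; that step is omitted here so the sketch runs without extra dependencies.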

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes. New pyiceberg/table/theta_sketch.py with ThetaSketch class and theta_sketches_from_puffin_file function. New optional dependency datasketches under the datasketches extra.

@ebyhr ebyhr force-pushed the ebi/theta-sketches branch from d393e58 to 58333f2 Compare June 28, 2026 03:47
@ebyhr ebyhr force-pushed the ebi/theta-sketches branch from 58333f2 to 9e86827 Compare June 28, 2026 07:57