fix(io): default https scheme for scheme-less S3 endpoint_url #7111

Open
plusplusjiajia wants to merge 3 commits into Eventual-Inc:main from plusplusjiajia:fix-iceberg-endpoint-scheme

Conversation

@plusplusjiajia plusplusjiajia commented Jun 11, 2026

Changes Made

Defaults the scheme of a user/catalog-provided S3 endpoint_url to https when none is present, in the S3 endpoint normalization in src/daft-io/src/s3_like.rs (the same block that adds trailing slashes, #5575).

Some Iceberg REST catalogs vend s3.endpoint as a bare host, optionally with a trailing slash (e.g. s3.example.com/). A scheme-less endpoint fails url::Url::parse, falls through to the verbatim string, and every S3 read fails:

DispatchFailure(... ResolveEndpointError { message: "Custom endpoint
`s3.example.com/` was not a valid URI",
source: Some(InvalidUri(InvalidFormat)) })

(Writes can appear to succeed -- PyArrow already treats a bare endpoint as https -- while reads fail, which makes this confusing to debug.)
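
As a rough illustration of why a bare host is rejected, the snippet below uses Python's standard `urllib.parse`. This is only an analogy: Daft parses the endpoint in Rust, and the two parsers differ in strictness. The point it shows is that without a scheme the whole string is read as a path, so no host is found.

```python
from urllib.parse import urlsplit

# Analogy only: Daft itself parses the endpoint in Rust, not with urllib.
bare = urlsplit("s3.example.com/")
schemed = urlsplit("https://s3.example.com/")

# Without "://", nothing is recognised as a scheme or a host;
# the entire string ends up in the path component.
print(repr(bare.scheme), repr(bare.netloc), repr(bare.path))

# With a scheme, the host is split out as expected.
print(repr(schemed.scheme), repr(schemed.netloc), repr(schemed.path))
```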

The fix lives in build_s3_conf's endpoint normalization rather than in the Iceberg property conversion, so it covers every S3Config consumer (direct users and any catalog), not just Iceberg-REST-vended endpoints. The trailing-slash handling from #5575 is now reached for bare hosts too. Extracted the normalization into a pure normalize_endpoint_url helper and added unit tests covering the scheme-default and the (previously untested) #5575 trailing-slash behavior.

Notes:

  • Defaulting to https matches PyArrow's behavior. Users who need http keep passing a schemed endpoint -- those are untouched.
  • This cannot break working setups: a scheme-less endpoint always failed before.
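
The normalization described above can be sketched in Python as follows. This is a hypothetical rendering for illustration, not the Rust helper in src/daft-io/src/s3_like.rs: the exact trailing-slash rule from #5575 and the value returned on a parse failure are assumptions here, and the host check is done by hand because `urlsplit` is more lenient than a strict URL parser.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_endpoint_url(endpoint: str) -> str:
    """Illustrative sketch only; the real helper is Rust code in Daft."""
    # Default to https when no scheme separator is present.
    candidate = endpoint if "://" in endpoint else f"https://{endpoint}"
    try:
        parts = urlsplit(candidate)
    except ValueError:
        return endpoint  # assumption: fall back to the input unchanged
    # urlsplit is lenient, so reject empty or whitespace-containing hosts by hand.
    host = parts.hostname
    if not host or any(ch.isspace() for ch in host):
        return endpoint  # assumption: fall back to the input unchanged
    # Assumed trailing-slash rule: make sure the path ends with "/".
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


for raw in ["s3.example.com", "s3.example.com/", "http://localhost:9000"]:
    print(raw, "->", normalize_endpoint_url(raw))
```

Schemed endpoints keep their scheme in this sketch, which mirrors the note that users who need http are unaffected.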

Related Issues

Extends the endpoint normalization from #5575. Follow-up of #6993 (the bare-host endpoint was first observed via Iceberg-REST auto-config).

@github-actions github-actions Bot added the fix label Jun 11, 2026
@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Greptile Summary

This PR fixes a bug where _convert_iceberg_file_io_properties_to_io_config fails when an Iceberg REST catalog vends s3.endpoint without a URI scheme (e.g. oss-<region>-internal.aliyuncs.com/). Daft's S3 client requires a fully-qualified URI, so the fix defaults the scheme to https — matching PyArrow/PyIceberg's own behavior — when no :// is present.

  • daft/io/iceberg/_iceberg.py: Extracts s3.endpoint before constructing IOConfig; prepends https:// when the value contains no ://. Schemed endpoints (http/https/custom) pass through untouched, and the any_props_set side-effect flag is preserved correctly.
  • tests/io/iceberg/test_iceberg_io_config.py: Adds four targeted unit tests covering bare-host, bare-host+trailing-slash, already-schemed, and non-OSS contexts.

Confidence Score: 5/5

Safe to merge; the change only adds a well-bounded https:// prefix to bare-host endpoints that previously always caused a hard failure on reads, so no working configuration can regress.

The fix is a minimal, single-site change in the Iceberg IO property converter. The any_props_set side-effect tracking is unaffected because get_first_property_value is still called for s3.endpoint before the scheme check. Endpoints that already have a scheme bypass the branch entirely. The four new tests cover the critical paths — bare host, bare host with trailing slash, pre-schemed, and non-OSS — and all match the expected https:// prepend behavior.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| daft/io/iceberg/_iceberg.py | Adds scheme defaulting for bare-host s3.endpoint values; logic is correct and any_props_set tracking is preserved. |
| tests/io/iceberg/test_iceberg_io_config.py | Four new unit tests cover all key edge cases for the scheme-defaulting fix; assertions are correct and complete. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["_convert_iceberg_file_io_properties_to_io_config(props, location)"] --> B["get_first_property_value('s3.endpoint')"]
    B --> C{endpoint_url is not None?}
    C -- No --> E["endpoint_url = None"]
    C -- Yes --> D{"'://' in endpoint_url?"}
    D -- Yes --> F["endpoint_url unchanged\n(e.g. 'https://...', 'http://...')"]
    D -- No --> G["endpoint_url = 'https://' + endpoint_url\n(bare host defaulted to https)"]
    E --> H["Build IOConfig(s3=S3Config(endpoint_url=endpoint_url, ...))"]
    F --> H
    G --> H
    H --> I{is_oss?}
    I -- Yes --> J["Return io_config\n(force_virtual_addressing + oss->s3 alias)"]
    I -- No --> K{any_props_set?}
    K -- Yes --> L["Return io_config"]
    K -- No --> M["Return None"]

Reviews (1): Last reviewed commit: "fix(iceberg): default https scheme for s..." | Re-trigger Greptile
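
The decision in the flowchart (the `'://' in endpoint_url` branch) can be sketched as a small standalone function. The name `default_endpoint_scheme` is hypothetical and is not taken from the PR; this only mirrors the Iceberg-side check that the summary describes, which the thread later says was reverted in favor of the Rust-side normalization.

```python
from typing import Optional


def default_endpoint_scheme(endpoint_url: Optional[str]) -> Optional[str]:
    """Hypothetical helper mirroring the flowchart's scheme check."""
    # A missing endpoint stays missing.
    if endpoint_url is None:
        return None
    # Bare hosts are defaulted to https; schemed endpoints pass through.
    if "://" not in endpoint_url:
        return "https://" + endpoint_url
    return endpoint_url


for value in [None, "s3.example.com", "s3.example.com/", "http://minio.local:9000"]:
    print(repr(value), "->", repr(default_endpoint_scheme(value)))
```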

@plusplusjiajia plusplusjiajia force-pushed the fix-iceberg-endpoint-scheme branch 5 times, most recently from d397674 to c9f42fb Compare June 11, 2026 07:49
@plusplusjiajia plusplusjiajia force-pushed the fix-iceberg-endpoint-scheme branch from c9f42fb to 2f2a6e1 Compare June 11, 2026 08:38
@rohitkulshreshtha

Thanks for the clear writeup — the root cause is spot on (PyArrow defaults a bare endpoint to https, Daft's S3 client passes it straight to the AWS SDK, so writes pass via PyArrow while reads fail on InvalidUri).

On placement: you flagged normalizing scheme-less endpoints in the Rust endpoint loader as a possible follow-up. I'd actually prefer that as the fix here rather than scoping it to the Iceberg conversion. The same InvalidUri failure from a bare endpoint_url hits any S3Config consumer (direct users, other catalogs), not just Iceberg-REST-vended endpoints. And src/daft-io/src/s3_like.rs (~L620) already normalizes the endpoint in place; it's where the trailing-slash handling from #5575 lives. Defaulting the scheme to https in that same block fixes it once for everyone and keeps endpoint normalization in one place.

Could you move the scheme-default into the s3_like endpoint normalization (with a test alongside #5575's)? Happy to re-review once it's relocated.

@rohitkulshreshtha rohitkulshreshtha self-requested a review June 12, 2026 20:05
@plusplusjiajia plusplusjiajia requested a review from a team as a code owner June 14, 2026 15:31
@plusplusjiajia plusplusjiajia changed the title fix(iceberg): default https scheme for scheme-less s3.endpoint in IOConfig conversion fix(io): default https scheme for scheme-less S3 endpoint_url Jun 14, 2026
@plusplusjiajia

@rohitkulshreshtha Thanks — good call on the placement. Relocated to normalize_endpoint_url in s3_like.rs so it covers all S3Config consumers, with unit tests for the scheme-default and #5575's trailing slash. Iceberg change reverted.

@srilman srilman left a comment

Just had one quick question @plusplusjiajia

} else {
format!("https://{url_str}")
};
match url::Url::parse(&url_str) {

Now that a prefix is appended beforehand, should we expect URL parsing to always succeed?

@srilman Thanks for catching that — not always: https:// + an empty or invalid host (e.g. "", "a b.com") still fails Url::parse, so the fallback stays. Added a test for exactly those cases.
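
A minimal Python sketch of the point in this reply: prepending `https://` does not by itself produce a usable host. The helper name `has_usable_host` is hypothetical, and Python's `urlsplit` is more lenient than a strict URL parser, so the empty-host and whitespace checks are written out explicitly here as a stand-in for the parse failure described above.

```python
from urllib.parse import urlsplit


def has_usable_host(url: str) -> bool:
    """Hypothetical check standing in for a strict URL parse."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return False
    # An empty host or one containing whitespace is treated as invalid.
    return bool(host) and not any(ch.isspace() for ch in host)


for raw in ["", "a b.com", "s3.example.com/"]:
    prefixed = "https://" + raw
    print(repr(prefixed), has_usable_host(prefixed))
```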

@rohitkulshreshtha rohitkulshreshtha removed their request for review June 16, 2026 22:02