fix(io): default https scheme for scheme-less S3 endpoint_url#7111
fix(io): default https scheme for scheme-less S3 endpoint_url#7111plusplusjiajia wants to merge 3 commits into
Conversation
Greptile SummaryThis PR fixes a bug where
Confidence Score: 5/5Safe to merge; the change only adds a well-bounded The fix is a minimal, single-site change in the Iceberg IO property converter. The No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["_convert_iceberg_file_io_properties_to_io_config(props, location)"] --> B["get_first_property_value('s3.endpoint')"]
B --> C{endpoint_url is not None?}
C -- No --> E["endpoint_url = None"]
C -- Yes --> D{"'://' in endpoint_url?"}
D -- Yes --> F["endpoint_url unchanged\n(e.g. 'https://...', 'http://...')"]
D -- No --> G["endpoint_url = 'https://' + endpoint_url\n(bare host defaulted to https)"]
E --> H["Build IOConfig(s3=S3Config(endpoint_url=endpoint_url, ...))"]
F --> H
G --> H
H --> I{is_oss?}
I -- Yes --> J["Return io_config\n(force_virtual_addressing + oss->s3 alias)"]
I -- No --> K{any_props_set?}
K -- Yes --> L["Return io_config"]
K -- No --> M["Return None"]
Reviews (1): Last reviewed commit: "fix(iceberg): default https scheme for s..." | Re-trigger Greptile |
d397674 to
c9f42fb
Compare
c9f42fb to
2f2a6e1
Compare
|
Thanks for the clear writeup — the root cause is spot on (PyArrow defaults a bare endpoint to On placement: you flagged normalizing scheme-less endpoints in the Rust endpoint loader as a possible follow-up — I'd actually prefer that as the fix here rather than scoping it to the Iceberg conversion. The same bare- Could you move the scheme-default into the s3_like endpoint normalization (with a test alongside #5575's)? Happy to re-review once it's relocated. |
@rohitkulshreshtha Thanks — good call on the placement. Relocated to normalize_endpoint_url in s3_like.rs so it covers all S3Config consumers, with unit tests for the scheme-default and #5575's trailing slash. Iceberg change reverted. |
srilman
left a comment
There was a problem hiding this comment.
Just had one quick question @plusplusjiajia
| } else { | ||
| format!("https://{url_str}") | ||
| }; | ||
| match url::Url::parse(&url_str) { |
There was a problem hiding this comment.
Now that a prefix is appended beforehand, should we expect URL parsing to always succeed?
There was a problem hiding this comment.
@srilman Thanks for catching that — not always: https:// + an empty or invalid host (e.g. "", "a b.com") still fails Url::parse, so the fallback stays. Added a test for exactly those cases.
Changes Made
Defaults the scheme of a user/catalog-provided S3
endpoint_urltohttpswhen none is present, in the S3 endpoint normalization insrc/daft-io/src/s3_like.rs(the same block that adds trailing slashes, #5575).Some Iceberg REST catalogs vend
s3.endpointas a bare host, optionally with a trailing slash (e.g.s3.example.com/). A scheme-less endpoint failsurl::Url::parse, falls through to the verbatim string, and every S3 read fails:(Writes can appear to succeed -- PyArrow already treats a bare endpoint as
https-- while reads fail, which makes this confusing to debug.)The fix lives in
build_s3_conf's endpoint normalization rather than in the Iceberg property conversion, so it covers everyS3Configconsumer (direct users and any catalog), not just Iceberg-REST-vended endpoints. The trailing-slash handling from #5575 is now reached for bare hosts too. Extracted the normalization into a purenormalize_endpoint_urlhelper and added unit tests covering the scheme-default and the (previously untested) #5575 trailing-slash behavior.Notes:
httpsmatches PyArrow's behavior. Users who needhttpkeep passing a schemed endpoint -- those are untouched.Related Issues
Extends the endpoint normalization from #5575. Follow-up of #6993 (the bare-host endpoint was first observed via Iceberg-REST auto-config).