Skip to content

feat(sql): support read_parquet ignore_corrupt_files#7133

Open
jackylee-ch wants to merge 3 commits into
Eventual-Inc:mainfrom
jackylee-ch:codex-sql-read-parquet-ignore-corrupt-files
Open

feat(sql): support read_parquet ignore_corrupt_files#7133
jackylee-ch wants to merge 3 commits into
Eventual-Inc:mainfrom
jackylee-ch:codex-sql-read-parquet-ignore-corrupt-files

Conversation

@jackylee-ch

Copy link
Copy Markdown
Contributor

Changes Made

  • Adds ignore_corrupt_files support to SQL read_parquet, forwarding the named argument into the Parquet scan config.
  • Adds SQL coverage for skipping corrupt Parquet files and reporting skipped files.

Related Issues

Closes #7132

@jackylee-ch jackylee-ch requested a review from a team as a code owner June 16, 2026 03:08
@github-actions github-actions Bot added the feat label Jun 16, 2026
@greptile-apps

greptile-apps Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR wires ignore_corrupt_files through the SQL read_parquet table function, covering the path from SQL argument parsing all the way to ParquetSourceConfig. Previously the field was hardcoded to false in ParquetScanBuilder::finish() regardless of any caller-supplied value.

  • scan_builder.rs: Adds ignore_corrupt_files as a public field on ParquetScanBuilder (default false), a matching builder method, and passes it through to ParquetSourceConfig instead of the previous hardcoded false.
  • read_parquet.rs: Parses ignore_corrupt_files from named SQL arguments, populates the struct field, and registers the name in the allowed-args list so the planner accepts it.
  • test_sql_table_functions.py: Adds an end-to-end test that writes one valid and one corrupt Parquet file, verifies that reading without the flag raises, then verifies the flag skips the corrupt file and surfaces it in skipped_corrupt_files.

Confidence Score: 4/5

Safe to merge — the Rust changes are minimal and correct, only wiring a pre-existing field through a previously unused path.

The Rust changes are mechanical and low-risk. The one concern is in the test: df.collect() is called but its return value is discarded, then df.to_pydict() and df.skipped_corrupt_files are accessed on the original reference. If collect() doesn't mutate df in place, the assertions on skipped files could silently pass against stale or empty state. The fix is a one-line change (df = df.collect()), and the test otherwise covers the feature well.

tests/sql/test_sql_table_functions.py — the df.collect() call discards its return value; the subsequent assertions on df.skipped_corrupt_files depend on collect() mutating df in place.

Important Files Changed

Filename Overview
src/daft-logical-plan/src/scan_builder.rs Adds ignore_corrupt_files field to ParquetScanBuilder (struct, default, builder method) and wires it through to ParquetSourceConfig::ignore_corrupt_files, replacing the previous hardcoded false.
src/daft-sql/src/table_provider/read_parquet.rs Parses ignore_corrupt_files from SQL function arguments with a default of false, adds it to the struct literal construction, and registers it in the allowed named-args list.
tests/sql/test_sql_table_functions.py New test creates a valid and a corrupt Parquet file, verifies that reading without ignore_corrupt_files raises, and verifies that the flag allows reading with skipped-file metadata. The df.collect() result is discarded without reassignment, which could be fragile.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant User as SQL caller
    participant Planner as SQLPlanner
    participant TF as ReadParquetFunction
    participant Builder as ParquetScanBuilder
    participant Config as ParquetSourceConfig
    participant Scan as GlobScanOperator

    User->>Planner: "SELECT * FROM read_parquet(..., ignore_corrupt_files => true)"
    Planner->>TF: plan(args)
    TF->>Builder: "TryFrom SQLFunctionArguments: ignore_corrupt_files=true"
    Builder->>Builder: finish()
    Builder->>Config: "ParquetSourceConfig { ignore_corrupt_files: true }"
    Config->>Scan: "GlobScanOperator::try_new(FileFormatConfig::Parquet(cfg))"
    Scan-->>Builder: ScanOperatorRef
    Builder-->>TF: LogicalPlanBuilder
    TF-->>User: "DataFrame: .collect() skips corrupt files"
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant User as SQL caller
    participant Planner as SQLPlanner
    participant TF as ReadParquetFunction
    participant Builder as ParquetScanBuilder
    participant Config as ParquetSourceConfig
    participant Scan as GlobScanOperator

    User->>Planner: "SELECT * FROM read_parquet(..., ignore_corrupt_files => true)"
    Planner->>TF: plan(args)
    TF->>Builder: "TryFrom SQLFunctionArguments: ignore_corrupt_files=true"
    Builder->>Builder: finish()
    Builder->>Config: "ParquetSourceConfig { ignore_corrupt_files: true }"
    Config->>Scan: "GlobScanOperator::try_new(FileFormatConfig::Parquet(cfg))"
    Scan-->>Builder: ScanOperatorRef
    Builder-->>TF: LogicalPlanBuilder
    TF-->>User: "DataFrame: .collect() skips corrupt files"
Loading

Reviews (1): Last reviewed commit: "feat(sql): support read_parquet ignore_c..." | Re-trigger Greptile

Comment thread tests/sql/test_sql_table_functions.py Outdated
@jackylee-ch jackylee-ch force-pushed the codex-sql-read-parquet-ignore-corrupt-files branch 24 times, most recently from 1a9b7d5 to 75becad Compare June 17, 2026 15:56
@jackylee-ch jackylee-ch force-pushed the codex-sql-read-parquet-ignore-corrupt-files branch from 75becad to e9a2c6c Compare June 18, 2026 02:30
@jackylee-ch jackylee-ch force-pushed the codex-sql-read-parquet-ignore-corrupt-files branch 6 times, most recently from 33e51e3 to 56aaf14 Compare June 22, 2026 02:20
@jackylee-ch jackylee-ch force-pushed the codex-sql-read-parquet-ignore-corrupt-files branch 11 times, most recently from 166bf18 to 647d58c Compare June 23, 2026 20:55
@jackylee-ch jackylee-ch force-pushed the codex-sql-read-parquet-ignore-corrupt-files branch from 647d58c to 20d9590 Compare June 23, 2026 22:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: support ignore_corrupt_files in SQL read_parquet

1 participant