feat(arrow_csv): add header validation option #10144

Merged
Jefffrey merged 2 commits into apache:main from XiNiHa:feat/csv-header-schema on Jun 25, 2026

@XiNiHa XiNiHa commented Jun 15, 2026

Which issue does this PR close?

Rationale for this change

Explained in the issue.

What changes are included in this PR?

  • Adds a new method, .with_header_validation(bool), to both ReaderBuilder and Format to enable CSV header validation.
  • Implements header row validation: each value in the first CSV row is compared with the schema field name at the same position, and a mismatch is reported as an error.
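
The positional check described in the list above can be sketched without the arrow-csv crate. This is a minimal illustration, not the PR's implementation: `check_header` is a hypothetical helper, the header cells are assumed to be already split into strings, and a header shorter than the schema is assumed to read as empty cells. The error text is modelled on the message shown in the review output further down.

```rust
/// Hypothetical helper (not part of arrow-csv): compares header cells with
/// schema field names position by position.
fn check_header(field_names: &[&str], header: &[&str]) -> Result<(), String> {
    for (idx, expected) in field_names.iter().enumerate() {
        // Assumption: a missing header cell is treated as an empty string.
        let found = header.get(idx).copied().unwrap_or("");
        if found != *expected {
            return Err(format!(
                "CSV header does not match schema at column {idx}: \
                 expected {expected:?} but found {found:?}"
            ));
        }
    }
    Ok(())
}
```

Calling `check_header(&["a", "b"], &["b", "a"])` fails at column 0, because names are matched by position rather than looked up by name. Header columns beyond the schema length are not examined in this sketch.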

Are these changes tested?

Corresponding tests added.

Are there any user-facing changes?

ReaderBuilder and Format each gain a new method, .with_header_validation(bool).

This is not a breaking change, since header validation is disabled by default.

github-actions bot added the arrow label (Changes to the arrow crate) on Jun 15, 2026

@Jefffrey Jefffrey left a comment

makes sense to me; looks similar to the enforceSchema option on Spark

If it is set to true, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to false, the schema will be validated against all headers in CSV files in the case when the header option is set to true. Field names in the schema and column names in CSV headers are checked by their positions taking into account spark.sql.caseSensitive. Though the default value is true, it is recommended to disable the enforceSchema option to avoid incorrect results. CSV built-in functions ignore this option.
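
The position-based, case-aware comparison that the quoted Spark documentation describes can be sketched as follows. This only illustrates the quoted semantics: `names_match` and `header_matches` are hypothetical names, requiring equal lengths is an assumption of the sketch, and none of it is taken from Spark's source or from this PR, which does not mention a case-sensitivity setting.

```rust
/// Hypothetical: compares one schema field name with one header cell.
fn names_match(expected: &str, found: &str, case_sensitive: bool) -> bool {
    if case_sensitive {
        expected == found
    } else {
        expected.to_lowercase() == found.to_lowercase()
    }
}

/// Hypothetical: checks names by position, as in the quoted description.
/// Assumption: the header must have exactly as many columns as the schema.
fn header_matches(schema: &[&str], header: &[&str], case_sensitive: bool) -> bool {
    schema.len() == header.len()
        && schema
            .iter()
            .zip(header)
            .all(|(s, h)| names_match(s, h, case_sensitive))
}
```

With `case_sensitive` set to `false`, a header of `ID,Name` matches a schema of `id, name`; swapping the two columns fails either way because the comparison is positional.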

}
}

#[test]

might also be a good idea to add a test with with_truncated_rows (which allows rows to have fewer columns than specified)

e.g.

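    // Assumes the imports of the arrow-csv reader test module
    // (Arc, Schema, Field, DataType, ReaderBuilder, Cursor).
    #[test]
    fn test123() {
        // The schema declares two columns, `a` and `b`.
        let schema = Arc::new(Schema::new(vec![
            Field::new("a", DataType::Int32, true),
            Field::new("b", DataType::Int32, true),
        ]));

        // The header row contains only `a`, so column `b` is missing.
        let csv = "a\n1\n";
        let a = ReaderBuilder::new(schema.clone())
            .with_header(true)
            .with_header_validation(true)
            .with_truncated_rows(true)
            .build_buffered(Cursor::new(csv.as_bytes()))
            .unwrap()
            .next();
        // Print the first item yielded by the reader.
        dbg!(a);
    }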

output

running 1 test
[arrow-csv/src/reader/mod.rs:2457:9] a = Some(
    Err(
        CsvError(
            "CSV header does not match schema at column 1: expected \"b\" but found \"\"",
        ),
    ),
)
test reader::tests::test123 ... ok

XiNiHa (Contributor, Author) replied:

Would it be desirable to pass the validation in this case? Or would it make more sense to keep the current behavior?

Jefffrey (Contributor) replied:

in my opinion if we're validating it would make sense to error out as it currently does
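
The two behaviours discussed in this thread for a header that is shorter than the schema can be contrasted in a small sketch. `ShortHeader` and `validate` are hypothetical names, not arrow-csv APIs: `Reject` corresponds to the current behaviour that the reviewer prefers to keep, and `AllowPrefix` to the alternative that was asked about.

```rust
/// Hypothetical policy for a header with fewer columns than the schema.
#[derive(Clone, Copy)]
enum ShortHeader {
    /// Fail validation (the behaviour kept in this thread).
    Reject,
    /// Accept the header if the columns that are present match.
    AllowPrefix,
}

/// Hypothetical check, independent of arrow-csv.
fn validate(schema: &[&str], header: &[&str], policy: ShortHeader) -> bool {
    // The columns that are present must match by position in either case.
    let prefix_ok = header.len() <= schema.len()
        && schema.iter().zip(header).all(|(s, h)| s == h);
    match policy {
        ShortHeader::Reject => prefix_ok && header.len() == schema.len(),
        ShortHeader::AllowPrefix => prefix_ok,
    }
}
```

For the schema `a, b` and the header `a`, `Reject` fails validation while `AllowPrefix` would accept it. Keeping the strict variant means that enabling validation always guarantees a full header, regardless of whether truncated data rows are allowed.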

@XiNiHa XiNiHa Jun 23, 2026


XiNiHa force-pushed the feat/csv-header-schema branch from ee42f79 to 850443f on June 23, 2026 04:55
XiNiHa requested a review from Jefffrey on June 23, 2026 04:56
XiNiHa force-pushed the feat/csv-header-schema branch from 850443f to 6c0773e on June 23, 2026 05:04
Jefffrey merged commit 9f37683 into apache:main on Jun 25, 2026
27 checks passed
Jefffrey (Contributor) commented:

thanks @XiNiHa

@XiNiHa XiNiHa deleted the feat/csv-header-schema branch June 25, 2026 04:26

Development

Successfully merging this pull request may close these issues.

Support validating CSV headers against Schema

2 participants