feat(arrow_csv): add header validation option #10144
Jefffrey
left a comment
makes sense to me; looks similar to the enforceSchema option on Spark
If it is set to true, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to false, the schema will be validated against all headers in CSV files in the case when the header option is set to true. Field names in the schema and column names in CSV headers are checked by their positions taking into account spark.sql.caseSensitive. Though the default value is true, it is recommended to disable the enforceSchema option to avoid incorrect results. CSV built-in functions ignore this option.
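To make the quoted semantics concrete, here is a minimal std-only sketch of a positional header check with an enforce-schema switch and a case-sensitivity flag. The function name `check_header` and its parameters are hypothetical and are not Spark's or arrow-csv's API. Rejecting a header whose length differs from the schema is an assumption of this sketch, not something the quoted text states.

```rust
/// Hypothetical helper modelling the option described above.
/// `schema_names` and `header` are compared by position, not by name lookup.
fn check_header(
    schema_names: &[&str],
    header: &[&str],
    enforce_schema: bool,
    case_sensitive: bool,
) -> Result<(), String> {
    // With the schema enforced, the header row is ignored entirely.
    if enforce_schema {
        return Ok(());
    }
    // Assumption of this sketch: a differing column count is a mismatch.
    if schema_names.len() != header.len() {
        return Err(format!(
            "expected {} columns but header has {}",
            schema_names.len(),
            header.len()
        ));
    }
    for (i, (expected, found)) in schema_names.iter().zip(header).enumerate() {
        let matches = if case_sensitive {
            expected == found
        } else {
            expected.to_lowercase() == found.to_lowercase()
        };
        if !matches {
            return Err(format!(
                "column {}: expected {:?} but found {:?}",
                i, expected, found
            ));
        }
    }
    Ok(())
}

fn main() {
    let schema = ["id", "name"];
    // Enforced schema: any header is accepted.
    assert!(check_header(&schema, &["x", "y"], true, true).is_ok());
    // Validation on, case-insensitive: positions match, so this passes.
    assert!(check_header(&schema, &["ID", "Name"], false, false).is_ok());
    // Same names in the wrong positions fail, since the check is positional.
    assert!(check_header(&schema, &["name", "id"], false, true).is_err());
}
```

The point of the positional comparison is that a file whose columns are reordered is flagged rather than silently read into the wrong fields.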
might also be a good idea to add a test with with_truncated_rows (which allows having fewer columns than specified)
e.g.
#[test]
fn test123() {
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int32, true),
        Field::new("b", DataType::Int32, true),
    ]));
    let csv = "a\n1\n";
    let a = ReaderBuilder::new(schema.clone())
        .with_header(true)
        .with_header_validation(true)
        .with_truncated_rows(true)
        .build_buffered(Cursor::new(csv.as_bytes()))
        .unwrap()
        .next();
    dbg!(a);
}
output
running 1 test
[arrow-csv/src/reader/mod.rs:2457:9] a = Some(
    Err(
        CsvError(
            "CSV header does not match schema at column 1: expected \"b\" but found \"\"",
        ),
    ),
)
test reader::tests::test123 ... ok
Would it be desirable to pass the validation in this case? Or would it make more sense to keep the current behavior?
in my opinion if we're validating it would make sense to error out as it currently does
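For illustration, the behavior being kept can be modelled with a small std-only sketch. The helper `validate_header` is hypothetical and is not the code in this PR. It assumes that a column missing from the header is compared as an empty name, which is what the error message in the test output above suggests, and it only checks the columns the schema declares.

```rust
/// Hypothetical model of header validation: each schema field name is
/// compared against the header cell at the same position.
fn validate_header(expected: &[&str], header: &[&str]) -> Result<(), String> {
    for (i, want) in expected.iter().enumerate() {
        // Assumption: a header that is too short yields "" for the missing cell.
        let got = header.get(i).copied().unwrap_or("");
        if *want != got {
            return Err(format!(
                "CSV header does not match schema at column {}: expected {:?} but found {:?}",
                i, want, got
            ));
        }
    }
    Ok(())
}

fn main() {
    // Schema declares "a" and "b", but the file header only has "a".
    let err = validate_header(&["a", "b"], &["a"]).unwrap_err();
    println!("{err}");

    // A header that matches the schema position by position passes.
    assert!(validate_header(&["a", "b"], &["a", "b"]).is_ok());
}
```

Under this model, allowing truncated data rows does not relax the header check: the header itself still has to name every schema column.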
thanks @XiNiHa
Which issue does this PR close?
Rationale for this change
Explained in the issue.
What changes are included in this PR?
Adds a .with_header_validation(bool) method to ReaderBuilder and Format to enable CSV header validation.
Are these changes tested?
Corresponding tests added.
Are there any user-facing changes?
The ReaderBuilder struct gets a new method, .with_header_validation(bool).
There is no breaking change, since the validation behavior is disabled by default.