[AURON #2366] fix: Handle Paimon metadata columns in V2 native scan by lyne7-sc · Pull Request #2367 · apache/auron

lyne7-sc · 2026-06-26T09:44:48Z

Which issue does this PR close?

Closes #2366.

Rationale for this change

Paimon metadata columns are produced by the Paimon scan layer rather than stored as physical columns in data files. The Paimon V2 native scan was passing these columns to the native Parquet/ORC reader as file columns, which can return incorrect values.

For example:

create table paimon.db.t_metadata (id int, v string) using paimon;
insert into paimon.db.t_metadata values (1, 'a');
select id, __paimon_file_path from paimon.db.t_metadata;

The native path returned null for __paimon_file_path, while Spark/Paimon's scan path returns the actual file path.

What changes are included in this PR?

Recognize Paimon metadata columns using PaimonMetadataColumn.
Materialize supported file-level metadata columns (__paimon_file_path, __paimon_bucket) as per-file constants.
Keep unsupported Paimon metadata columns on Spark/Paimon's scan path instead of reading them from Parquet/ORC files.
Cover metadata columns both with and without table partition columns.
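
The routing described in the list above can be sketched as follows. This is a minimal model, not the PR's code: the object name, the `ColumnSource` types, the `__paimon_` prefix constant, and the lower-casing of names are assumptions made for illustration; only the two supported column names come from the description.

```scala
import java.util.Locale

object PaimonMetadataSketch {
  // The two supported names come from the PR description; the prefix is an assumption.
  val SupportedMetadataColumns: Set[String] = Set("__paimon_file_path", "__paimon_bucket")
  val MetadataPrefix: String = "__paimon_"

  sealed trait ColumnSource
  case object FileColumn extends ColumnSource      // read from the Parquet/ORC file
  case object PerFileConstant extends ColumnSource // filled in once per data file
  case object Unsupported extends ColumnSource     // keeps the query on Spark/Paimon's scan path

  def classify(name: String): ColumnSource = {
    val lower = name.toLowerCase(Locale.ROOT)
    if (SupportedMetadataColumns.contains(lower)) PerFileConstant
    else if (lower.startsWith(MetadataPrefix)) Unsupported
    else FileColumn
  }

  // The native scan is usable only when no projected column is Unsupported.
  def canUseNativeScan(projected: Seq[String]): Boolean =
    projected.forall(name => classify(name) != Unsupported)
}
```

Under these assumptions, a projection of `id, __paimon_bucket` stays on the native scan, while a projection that includes some other `__paimon_`-prefixed column falls back.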

Are there any user-facing changes?

No API changes. This is a correctness fix for Paimon V2 native scan.

How was this patch tested?

Adds Paimon V2 integration tests for metadata columns, covering tables with and without partition columns.

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

SteNicholas

@lyne7-sc, thanks for the fix! The overall approach is sound: materialize __paimon_file_path/__paimon_bucket as per-file constants via partitionSchema, and fall back to Spark for unsupported metadata columns. The functional Test Paimon 1.2 CI job (which runs the new integration tests) is green.

SteNicholas · 2026-06-28T08:34:57Z

+
+  private def isPaimonMetadataColumn(name: String): Boolean = {
+    containsName(PaimonMetadataColumns, name) ||
+      name.toLowerCase(Locale.ROOT).startsWith(PaimonMetadataColumnPrefix)


Style / CI blocker. spotless scalafmt rejects this line — the || continuation should be indented 4 spaces, not 6. This is one of the two violations turning the Style job red (-······name.toLowerCase → +····name.toLowerCase). mvn spotless:apply fixes it:

  private def isPaimonMetadataColumn(name: String): Boolean = {
    containsName(PaimonMetadataColumns, name) ||
    name.toLowerCase(Locale.ROOT).startsWith(PaimonMetadataColumnPrefix)
  }

SteNicholas · 2026-06-28T08:34:58Z

+      assert(df.collect().length === 1)
+    }
+  }
+


Style / CI blocker. This blank line has trailing whitespace (8 spaces), which spotless rejects (-········ → +). It is the second cause of the failing Style job. mvn spotless:apply removes it.

SteNicholas · 2026-06-28T08:34:58Z

-      }
      split.dataFiles().asScala.map { dataFile =>
        val filePath = s"${split.bucketPath()}/${dataFile.fileName()}"
+        val partitionValues = if (partitionSchema.isEmpty) {


Efficiency: partitionValues is now computed inside split.dataFiles().map, so partitionConverter.convert(split.partition()), indexByName, and the per-field DataConverter.fromPaimon conversions all run once per data file — even though everything except __paimon_file_path is constant across the files of a split (split.partition() and split.bucket() are split-level). For a split with N data files this rebuilds the whole partition row N times. Consider computing the split-invariant portion once per split and only filling the per-file file_path slot inside the loop.
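
One way to picture the suggested hoisting is the sketch below. `Split` and `DataFile` are hypothetical stand-ins rather than Paimon's classes, and the output column order (partition values, then bucket, then file path) is assumed for illustration only.

```scala
object SplitConstantsSketch {
  // Hypothetical stand-ins for a Paimon split and its data files.
  final case class DataFile(fileName: String)
  final case class Split(
      bucketPath: String,
      bucket: Int,
      partition: Seq[Any],
      dataFiles: Seq[DataFile])

  // Assumed column order: partition values, bucket, file path.
  def partitionRows(split: Split): Seq[Vector[Any]] = {
    // Split-invariant part, built once per split instead of once per file.
    val invariant: Vector[Any] = split.partition.toVector :+ split.bucket
    split.dataFiles.map { dataFile =>
      // Only the file path differs between the data files of one split.
      val filePath = s"${split.bucketPath}/${dataFile.fileName}"
      invariant :+ filePath
    }
  }
}
```

With this shape, a split with N data files converts its partition values once and appends only N file paths.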

SteNicholas · 2026-06-28T08:34:58Z

+    def isPartitionValueField(name: String): Boolean =
+      containsName(partitionKeys, name) || isSupportedMetadataColumn(name)
+    val partitionFields = readSchema.fields.filter(f => isPartitionValueField(f.name))
+    val fileFields = readSchema.fields.filterNot(f => isPartitionValueField(f.name))


Coverage gap worth a test: when only metadata columns are projected from a non-partitioned table (e.g. select __paimon_file_path from t), every field is classified as a partition/metadata constant, so fileFields is empty and fileSchema is empty. The native Parquet/ORC scan is then asked to read zero data columns but must still emit one row per record so the constant columns get the right cardinality. All three new tests also select id, so the empty-fileSchema path is never exercised (and there's no existing partition-only/count(*) test either). Please add a metadata-only projection test on a multi-row non-partitioned table to confirm the row count is correct on this path.
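
A small model of the schema split shows how the empty file schema arises. `Field` and `splitSchema` are hypothetical names, and plain set membership stands in for the resolver-based `containsName` in the diff.

```scala
object SchemaSplitSketch {
  // Hypothetical stand-in for a Spark StructField; only the name matters here.
  final case class Field(name: String)

  val SupportedMetadataColumns: Set[String] = Set("__paimon_file_path", "__paimon_bucket")

  // Returns (partitionFields, fileFields), mirroring the filter/filterNot pair in the diff.
  def splitSchema(
      readSchema: Seq[Field],
      partitionKeys: Set[String]): (Seq[Field], Seq[Field]) =
    readSchema.partition { f =>
      partitionKeys.contains(f.name) || SupportedMetadataColumns.contains(f.name)
    }
}
```

For a non-partitioned table, projecting only `__paimon_file_path` leaves the second element of the pair empty, which is the case the requested test would exercise.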

SteNicholas · 2026-06-28T08:34:58Z

-    val partitionFields = readSchema.fields.filter(f => containsName(partitionKeys, f.name))
-    val fileFields = readSchema.fields.filterNot(f => containsName(partitionKeys, f.name))
+    def isPartitionValueField(name: String): Boolean =
+      containsName(partitionKeys, name) || isSupportedMetadataColumn(name)


Minor / edge case: classification here is purely by name (resolver against __paimon_file_path/__paimon_bucket). Paimon's schema validation reserves only the _KEY_ prefix and the core system field names — not the __paimon_ prefix — so a user could in principle define a real physical column named __paimon_bucket. It would then be treated as a per-file constant and return split.bucket() instead of the stored value (a silent wrong result rather than a fallback). Very unlikely in practice, but flagging it.
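
If this edge case were ever worth guarding against, one possible shape is sketched below. It assumes the table's physical column names are available at classification time, which the PR does not show; the extra `physicalColumns` parameter is hypothetical and name matching is simplified to exact equality.

```scala
object MetadataGuardSketch {
  val SupportedMetadataColumns: Set[String] = Set("__paimon_file_path", "__paimon_bucket")

  // physicalColumns: names of the columns actually stored in the data files
  // (hypothetical input, not part of the PR).
  def isSupportedMetadataColumn(name: String, physicalColumns: Set[String]): Boolean =
    SupportedMetadataColumns.contains(name) && !physicalColumns.contains(name)
}
```

A user-defined physical column named `__paimon_bucket` would then be read from the file rather than replaced by a per-file constant.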

SteNicholas · 2026-06-28T08:34:58Z

      s"plan should use native paimon scan:\n$plan")
  }

+  private def checkSparkAnswerAndNativePaimonScan(sqlText: String): DataFrame = {


Minor test cleanup: the DataFrame return value is ignored by both callers (and the third metadata test doesn't use this helper), so Unit would be clearer. Also var expected: Seq[Row] = Nil reassigned inside withSQLConf can be a val, since withSQLConf returns its block value:

val expected = withSQLConf("spark.auron.enable.paimon.scan" -> "false") {
  sql(sqlText).collect().toSeq
}
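
The `val` form relies on the helper returning its block's value, as the comment states. The sketch below models that behaviour with a hypothetical `withConf` over an in-memory map; it is not the test suite's `withSQLConf`.

```scala
import scala.collection.mutable

object ConfSketch {
  // Hypothetical in-memory stand-in for the session configuration.
  val conf: mutable.Map[String, String] = mutable.Map.empty

  // Sets the given keys while `body` runs, restores the previous state,
  // and returns whatever `body` evaluated to.
  def withConf[T](pairs: (String, String)*)(body: => T): T = {
    val previous = pairs.map { case (k, _) => k -> conf.get(k) }
    pairs.foreach { case (k, v) => conf(k) = v }
    try body
    finally previous.foreach {
      case (k, Some(v)) => conf(k) = v
      case (k, None)    => conf.remove(k)
    }
  }
}
```

Because the block's result is returned, the caller can bind it directly to a `val` instead of assigning to a `var` declared outside the block.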

lyne7-sc added 2 commits June 26, 2026 17:23

test: add paimon metadata columns suite

3995864

support paimon file-level metadata

b05f5a6

github-actions Bot added the thirdparty-paimon label Jun 26, 2026

SteNicholas requested a review from Copilot June 28, 2026 06:27

Copilot AI reviewed Jun 28, 2026

SteNicholas reviewed Jun 28, 2026

SteNicholas self-assigned this Jun 28, 2026

lyne7-sc wants to merge 2 commits into apache:master from lyne7-sc:fix/paimon_meta
