feat: add column-level lineage to the OpenLineage by mshahid6 · Pull Request #19643 · apache/druid

mshahid6 · 2026-06-30T19:08:41Z

Fixes #19314

Description

Builds on #19107 (the OpenLineage request-logger extension) to add column-level lineage for native queries. Each input dataset now carries which columns the query referenced and how they were used.

Two facets are attached per input dataset:

- `schema`: the standard OpenLineage SchemaDatasetFacet listing the referenced input column names (names only, sorted).
- `druid_columnUsage`: a Druid-specific dataset facet mapping each referenced column to the role(s) it was used in: PROJECTION, GROUP_BY, AGGREGATION, FILTER, JOIN.
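
As a rough illustration of how the two facets relate, the sketch below accumulates roles per column and derives both views from the same map. The class name `ColumnUsageFacetSketch` and its methods are hypothetical and are not taken from the PR; only the five role names and the "names only, sorted" behavior come from the description above.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in, not the PR's class: shows one way to hold column roles.
final class ColumnUsageFacetSketch
{
  // Role names as listed in the PR description.
  enum ColumnUsage { PROJECTION, GROUP_BY, AGGREGATION, FILTER, JOIN }

  // TreeMap keeps column names sorted, matching the sorted "schema" facet.
  private final Map<String, EnumSet<ColumnUsage>> roles = new TreeMap<>();

  // Record that a column was used in a given role; repeated roles collapse.
  void record(String column, ColumnUsage usage)
  {
    roles.computeIfAbsent(column, c -> EnumSet.noneOf(ColumnUsage.class)).add(usage);
  }

  // Content of the "schema" facet: referenced column names only, sorted.
  List<String> schemaFieldNames()
  {
    return new ArrayList<>(roles.keySet());
  }

  // Content of the "druid_columnUsage" facet: column -> role names.
  // EnumSet iterates in enum declaration order, so output is deterministic.
  Map<String, List<String>> columnUsage()
  {
    Map<String, List<String>> out = new LinkedHashMap<>();
    for (Map.Entry<String, EnumSet<ColumnUsage>> e : roles.entrySet()) {
      List<String> names = new ArrayList<>();
      for (ColumnUsage u : e.getValue()) {
        names.add(u.name());
      }
      out.put(e.getKey(), names);
    }
    return out;
  }
}
```

Keeping a single role map and deriving both facets from it means the `schema` field list and the `druid_columnUsage` keys cannot drift apart.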

Validated in docker:

| SQL | Native type | Emitted column usage |
| --- | --- | --- |
| `SELECT page, "user" FROM wikipedia WHERE countryName = '…' LIMIT 10` | `scan` | `page=PROJECTION, user=PROJECTION, countryName=FILTER` |
| `SELECT countryName, SUM(added) FROM wikipedia WHERE channel = '…' GROUP BY countryName` | `groupBy` | `countryName=GROUP_BY, added=AGGREGATION, channel=FILTER` |
| `SELECT page, SUM(added) s FROM wikipedia GROUP BY page ORDER BY s DESC LIMIT 5` | `topN` | `page=GROUP_BY, added=AGGREGATION` |
| `SELECT SUM(added) FROM wikipedia WHERE isRobot = 'false'` | `timeseries` | `added=AGGREGATION, isRobot=FILTER` |
| `SELECT w1.page, w2.channel FROM wikipedia w1 JOIN wikipedia w2 ON w1.page = w2.page WHERE w1.countryName = '…'` | `scan` (join) | `page=[PROJECTION, JOIN], channel=PROJECTION, countryName=FILTER` (right-side `w2.channel` correctly un-prefixed) |
| `SELECT countryName FROM (SELECT countryName, SUM(added) s FROM wikipedia GROUP BY countryName) GROUP BY countryName` | `groupBy` | `countryName=GROUP_BY` (no fabricated sub-query-output columns) |

Example emitted facets for the join query's input dataset:

"facets": {
  "schema": {
    "_schemaURL": "https://openlineage.io/spec/facets/1-1-1/SchemaDatasetFacet.json",
    "fields": [{"name": "channel"}, {"name": "countryName"}, {"name": "page"}]
  },
  "druid_columnUsage": {
    "_schemaURL": ".../DruidColumnUsageDatasetFacet.json",
    "fields": {
      "channel":     {"usages": ["PROJECTION"]},
      "countryName": {"usages": ["FILTER"]},
      "page":        {"usages": ["PROJECTION", "JOIN"]}
    }
  }
}

Release note

The OpenLineage emitter now emits column-level lineage for native queries as schema and druid_columnUsage facets on input datasets. This can be disabled with druid.request.logging.columnLineageEnabled=false.
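
A minimal sketch of how such a flag could be read, assuming it defaults to enabled when unset (the release note only says it can be turned off, so the default is an inference). The class `ColumnLineageFlagSketch` is hypothetical; only the property name comes from the release note.

```java
import java.util.Properties;

// Hypothetical helper, not part of the PR.
final class ColumnLineageFlagSketch
{
  // Property name as given in the release note.
  static final String KEY = "druid.request.logging.columnLineageEnabled";

  // Assumption: column lineage is on unless the property is explicitly "false".
  static boolean isEnabled(Properties props)
  {
    return Boolean.parseBoolean(props.getProperty(KEY, "true"));
  }
}
```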

Key changed/added classes in this PR

OpenLineageRequestLogger
OpenLineageRequestLoggerProvider
DruidColumnUsageDatasetFacet.json

FrankChen021

| Severity | Findings |
| --- | --- |
| P0 | 0 |
| P1 | 0 |
| P2 | 3 |
| P3 | 0 |
| Total | 3 |

Found 3 issues.

Reviewed 5 of 5 changed files.

This is an automated review by Codex GPT-5.5

jtuglu1 · 2026-07-01T17:51:19Z

Drive-by comment: if we can, I'd like to expose this column extraction logic within Druid itself. A few reasons for this:

Other extensions/use-cases can benefit from extracting the queried columns from the plan (this can be useful in column statistics plan optimization, other emitters, etc.).
Having one implementation for extracting this query information ensures that our OL emitter is never out of sync with OSS parser/planner. With the advent of column statistics being available to the planner/optimizer, knowing which columns are queried will be important and similar logic will be needed in the core files. We don't need to overly complicate the initial implementation of this, but it would be great to have a unified way of extracting this information.

+      Map<String, EnumSet<ColumnUsage>> roles
+  )
+  {
+    if (dataSource instanceof TableDataSource) {


jtuglu1 · 2026-07-02T18:46:03Z

@gianm @clintropolis would like to get your thoughts here on the column extraction logic – do we want to make use of any Calcite functionality here?

FrankChen021

| Severity | Findings |
| --- | --- |
| P0 | 0 |
| P1 | 0 |
| P2 | 2 |
| P3 | 0 |
| Total | 2 |

Reviewed 8 of 8 changed files.

Found 2 issues in the updated lineage implementation. The prior datasource-filter and filtered-aggregator concerns look addressed.

This is an automated review by Codex GPT-5.5

FrankChen021 · 2026-07-03T13:36:48Z

+      Map<String, EnumSet<ColumnUsage>> baseRoles = copyRoles(roles);
+      // The unnest output column is synthetic (not a base column); drop it and instead record the
+      // underlying column(s) being unnested as projected from the base.
+      baseRoles.remove(unnestColumn.getOutputName());


[P2] Preserve unnest output roles

For UnnestDataSource, roles collected on the synthetic unnest output are discarded here and the source columns are re-added only as PROJECTION; getUnnestFilter() is also ignored. Queries such as SELECT d3, COUNT(*) ... WHERE d3 = 'a' GROUP BY d3 will report the underlying array column as projected instead of GROUP_BY and FILTER. Preserve the removed role set and merge unnest-filter required columns as FILTER when mapping to unnestColumn.requiredColumns().
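
One possible reading of this suggestion, sketched with plain collections rather than Druid types. The class `UnnestRoleMappingSketch`, the method `mapToBase`, and its parameters are hypothetical; in particular, treating a filter on the unnest output as a FILTER on the underlying columns, and falling back to PROJECTION when no role was recorded, are assumptions rather than the PR's confirmed behavior.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the reviewer's suggestion; not the PR's code.
final class UnnestRoleMappingSketch
{
  enum ColumnUsage { PROJECTION, GROUP_BY, AGGREGATION, FILTER, JOIN }

  static Map<String, EnumSet<ColumnUsage>> mapToBase(
      Map<String, EnumSet<ColumnUsage>> roles,
      String unnestOutputName,
      Set<String> unnestRequiredColumns,
      Set<String> unnestFilterRequiredColumns
  )
  {
    // Copy so the caller's role map is left untouched.
    Map<String, EnumSet<ColumnUsage>> baseRoles = new HashMap<>();
    roles.forEach((col, usages) -> baseRoles.put(col, EnumSet.copyOf(usages)));

    // Keep the roles recorded on the synthetic output instead of discarding them.
    EnumSet<ColumnUsage> outputRoles = baseRoles.remove(unnestOutputName);
    if (outputRoles == null) {
      outputRoles = EnumSet.noneOf(ColumnUsage.class);
    }

    for (String filterColumn : unnestFilterRequiredColumns) {
      if (filterColumn.equals(unnestOutputName)) {
        // Filter on the unnested value: attribute it to the underlying columns.
        outputRoles.add(ColumnUsage.FILTER);
      } else {
        // Filter on an ordinary base column.
        baseRoles.computeIfAbsent(filterColumn, c -> EnumSet.noneOf(ColumnUsage.class))
                 .add(ColumnUsage.FILTER);
      }
    }

    // Assumed fallback: with no recorded role, treat the source as projected.
    if (outputRoles.isEmpty()) {
      outputRoles.add(ColumnUsage.PROJECTION);
    }

    // Transfer the preserved roles to every column the unnest expression reads.
    for (String required : unnestRequiredColumns) {
      baseRoles.computeIfAbsent(required, c -> EnumSet.noneOf(ColumnUsage.class))
               .addAll(outputRoles);
    }
    return baseRoles;
  }
}
```

With roles `{d3=[GROUP_BY, FILTER]}`, output name `d3`, and a hypothetical underlying column `dim3`, this yields `dim3` carrying GROUP_BY and FILTER rather than PROJECTION.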

FrankChen021 · 2026-07-03T13:36:48Z


-    emit(buildRunEvent(queryId, queryType, requestLogLine, inputs, null));
+    Map<String, Map<String, EnumSet<QueryColumnUsageAnalyzer.ColumnUsage>>> columnsByTable =
+        columnLineageEnabled ? extractColumnsByTable(requestLogLine.getQuery()) : null;


[P2] Handle top-level UnionQuery inputs

A native top-level UnionQuery has getDataSource() deliberately throwing, but logNativeQuery always calls getDataSource().getTableNames() before the new QueryColumnUsageAnalyzer can handle UnionQuery. SQL plans with queryType union will fail request logging and emit no OpenLineage event. Special-case UnionQuery input table extraction, or derive inputs from the analyzer or branch datasources, and add a logger-level UnionQuery test.
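
The special-casing suggested here could look roughly like the following. This sketch uses stand-in types (`QueryLike`, `TableQuery`, `UnionLikeQuery`) rather than Druid's actual `Query` and `UnionQuery` classes, whose accessors are not shown in this thread; the only behavior borrowed from the comment is that asking a union for its single datasource throws.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-ins for Druid query types; not the real API.
final class UnionInputSketch
{
  interface QueryLike
  {
    Set<String> tableNames();
  }

  static final class TableQuery implements QueryLike
  {
    private final Set<String> tables;

    TableQuery(Set<String> tables)
    {
      this.tables = tables;
    }

    @Override
    public Set<String> tableNames()
    {
      return tables;
    }
  }

  static final class UnionLikeQuery implements QueryLike
  {
    final List<QueryLike> branches;

    UnionLikeQuery(List<QueryLike> branches)
    {
      this.branches = branches;
    }

    // Mirrors the behavior described in the review: a union has no single datasource.
    @Override
    public Set<String> tableNames()
    {
      throw new UnsupportedOperationException("union query has no single datasource");
    }
  }

  // Derive input tables without ever asking a union for its datasource.
  static Set<String> inputTables(QueryLike query)
  {
    if (query instanceof UnionLikeQuery) {
      Set<String> out = new TreeSet<>();
      for (QueryLike branch : ((UnionLikeQuery) query).branches) {
        out.addAll(inputTables(branch)); // recurse so nested unions also work
      }
      return out;
    }
    return new TreeSet<>(query.tableNames());
  }
}
```

Checking for the union case before touching the datasource keeps request logging from failing, so an event is still emitted for union plans.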

Add column-level lineage

8a7f9ea

github-actions Bot added Area - Documentation Area - Metrics/Event Emitting labels Jun 30, 2026

github-advanced-security AI found potential problems Jun 30, 2026

Comment thread ...-emitter/src/main/java/org/apache/druid/extensions/openlineage/OpenLineageRequestLogger.java Fixed

FrankChen021 reviewed Jul 1, 2026

updates from comments

d8592d1

jtuglu1 requested review from clintropolis and gianm July 2, 2026 18:23

github-advanced-security AI found potential problems Jul 2, 2026

Comment thread processing/src/main/java/org/apache/druid/query/QueryColumnUsageAnalyzer.java

FrankChen021 reviewed Jul 3, 2026

mshahid6 wants to merge 2 commits into apache:master from mshahid6:add-column-lineage
