feat: add column-level lineage to the OpenLineage #19643
FrankChen021
left a comment
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 0 |
| P2 | 3 |
| P3 | 0 |
| Total | 3 |
Found 3 issues.
Reviewed 5 of 5 changed files.
This is an automated review by Codex GPT-5.5
Drive-by comment: if we can, I'd like to expose this column extraction logic within Druid itself. A few reasons for this:
@gianm @clintropolis would like to get your thoughts here on the column extraction logic – do we want to make use of any Calcite functionality here?
FrankChen021
left a comment
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 0 |
| P2 | 2 |
| P3 | 0 |
| Total | 2 |
Reviewed 8 of 8 changed files.
Found 2 issues in the updated lineage implementation. The prior datasource-filter and filtered-aggregator concerns look addressed.
This is an automated review by Codex GPT-5.5
    Map<String, EnumSet<ColumnUsage>> baseRoles = copyRoles(roles);
    // The unnest output column is synthetic (not a base column); drop it and instead record the
    // underlying column(s) being unnested as projected from the base.
    baseRoles.remove(unnestColumn.getOutputName());
[P2] Preserve unnest output roles
For UnnestDataSource, roles collected on the synthetic unnest output are discarded here and the source columns are re-added only as PROJECTION; getUnnestFilter() is also ignored. Queries such as SELECT d3, COUNT(*) ... WHERE d3 = 'a' GROUP BY d3 will report the underlying array column as projected instead of GROUP_BY and FILTER. Preserve the removed role set and merge unnest-filter required columns as FILTER when mapping to unnestColumn.requiredColumns().
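One way to read the suggestion above, as a minimal standalone sketch rather than the PR's code: keep the role set that was recorded on the synthetic unnest output and merge it onto the underlying columns, adding `FILTER` when the unnest filter references the output. The class `UnnestRoleMapping`, the method `mapToBase`, and its parameters are hypothetical; the real `QueryColumnUsageAnalyzer` internals are assumed, not known. The fallback to `PROJECTION` when no role was recorded mirrors the behavior described in the quoted diff.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class UnnestRoleMapping {
    // Roles as listed in the PR description.
    enum ColumnUsage { PROJECTION, GROUP_BY, AGGREGATION, FILTER, JOIN }

    // Hypothetical helper: moves roles from the synthetic unnest output onto its source columns.
    static Map<String, EnumSet<ColumnUsage>> mapToBase(
            Map<String, EnumSet<ColumnUsage>> roles,
            String unnestOutputName,
            Set<String> unnestSourceColumns,
            Set<String> unnestFilterColumns) {
        // Defensive copy so the caller's role map is left untouched.
        Map<String, EnumSet<ColumnUsage>> baseRoles = new HashMap<>();
        roles.forEach((column, usages) -> baseRoles.put(column, EnumSet.copyOf(usages)));

        // Preserve the roles of the synthetic column instead of discarding them.
        EnumSet<ColumnUsage> outputRoles = baseRoles.remove(unnestOutputName);
        if (outputRoles == null) {
            outputRoles = EnumSet.noneOf(ColumnUsage.class);
        }
        // A filter on the unnest output is a filter on the underlying column(s).
        if (unnestFilterColumns.contains(unnestOutputName)) {
            outputRoles.add(ColumnUsage.FILTER);
        }
        // Assumption: with no recorded role, fall back to PROJECTION as before.
        if (outputRoles.isEmpty()) {
            outputRoles.add(ColumnUsage.PROJECTION);
        }
        for (String source : unnestSourceColumns) {
            baseRoles.computeIfAbsent(source, k -> EnumSet.noneOf(ColumnUsage.class))
                     .addAll(outputRoles);
        }
        // Any other column the unnest filter needs is a plain FILTER usage.
        for (String column : unnestFilterColumns) {
            if (!column.equals(unnestOutputName)) {
                baseRoles.computeIfAbsent(column, k -> EnumSet.noneOf(ColumnUsage.class))
                         .add(ColumnUsage.FILTER);
            }
        }
        return baseRoles;
    }
}
```

With roles `{d3=[GROUP_BY, FILTER]}` and a single source column, the source column ends up with `GROUP_BY` and `FILTER` rather than `PROJECTION`, which is the outcome the comment asks for.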
    emit(buildRunEvent(queryId, queryType, requestLogLine, inputs, null));
    Map<String, Map<String, EnumSet<QueryColumnUsageAnalyzer.ColumnUsage>>> columnsByTable =
        columnLineageEnabled ? extractColumnsByTable(requestLogLine.getQuery()) : null;
[P2] Handle top-level UnionQuery inputs
A native top-level UnionQuery has getDataSource() deliberately throwing, but logNativeQuery always calls getDataSource().getTableNames() before the new QueryColumnUsageAnalyzer can handle UnionQuery. SQL plans with queryType union will fail request logging and emit no OpenLineage event. Special-case UnionQuery input table extraction, or derive inputs from the analyzer or branch datasources, and add a logger-level UnionQuery test.
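A rough sketch of the special-casing suggested above, using simplified stand-in types instead of Druid's real classes. `Query`, `TableQuery`, `UnionQuery`, and `inputTables` here are all hypothetical illustrations; in particular, the stand-in `tableNames()` collapses the `getDataSource().getTableNames()` call chain mentioned in the comment, and how the real `UnionQuery` exposes its branches is assumed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class UnionInputTables {
    // Simplified stand-in for a native query (hypothetical, not Druid's interface).
    interface Query {
        Set<String> tableNames();
    }

    static final class TableQuery implements Query {
        private final Set<String> tables;

        TableQuery(String... tables) {
            this.tables = new TreeSet<>(Arrays.asList(tables));
        }

        @Override
        public Set<String> tableNames() {
            return tables;
        }
    }

    // Stand-in for a union: it has no single datasource, so the accessor throws.
    static final class UnionQuery implements Query {
        final List<Query> branches;

        UnionQuery(Query... branches) {
            this.branches = Arrays.asList(branches);
        }

        @Override
        public Set<String> tableNames() {
            throw new UnsupportedOperationException("union has no single datasource");
        }
    }

    // Derive input tables from the branches instead of calling the throwing accessor.
    static Set<String> inputTables(Query query) {
        if (query instanceof UnionQuery) {
            Set<String> tables = new TreeSet<>();
            for (Query branch : ((UnionQuery) query).branches) {
                tables.addAll(inputTables(branch));
            }
            return tables;
        }
        return new TreeSet<>(query.tableNames());
    }
}
```

The recursion also covers a union nested inside another union, so the logger never touches the throwing accessor on any union node.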
Fixes #19314
Description
Builds on #19107 (the OpenLineage request-logger extension) to add column-level lineage for native queries. Each input dataset now carries which columns the query referenced and how they were used.
Two facets are attached per input dataset:
- `schema`: the standard OpenLineage `SchemaDatasetFacet` listing the referenced input column names (names only, sorted).
- `druid_columnUsage`: a Druid-specific dataset facet mapping each referenced column to the role(s) it was used in: `PROJECTION`, `GROUP_BY`, `AGGREGATION`, `FILTER`, `JOIN`.

Validated in docker:

| SQL | Native query type | Column usage |
|---|---|---|
| `SELECT page, "user" FROM wikipedia WHERE countryName = '…' LIMIT 10` | scan | page=PROJECTION, user=PROJECTION, countryName=FILTER |
| `SELECT countryName, SUM(added) FROM wikipedia WHERE channel = '…' GROUP BY countryName` | groupBy | countryName=GROUP_BY, added=AGGREGATION, channel=FILTER |
| `SELECT page, SUM(added) s FROM wikipedia GROUP BY page ORDER BY s DESC LIMIT 5` | topN | page=GROUP_BY, added=AGGREGATION |
| `SELECT SUM(added) FROM wikipedia WHERE isRobot = 'false'` | timeseries | added=AGGREGATION, isRobot=FILTER |
| `SELECT w1.page, w2.channel FROM wikipedia w1 JOIN wikipedia w2 ON w1.page = w2.page WHERE w1.countryName = '…'` | scan (join) | page=[PROJECTION, JOIN], channel=PROJECTION, countryName=FILTER (right-side `w2.channel` correctly un-prefixed) |
| `SELECT countryName FROM (SELECT countryName, SUM(added) s FROM wikipedia GROUP BY countryName) GROUP BY countryName` | groupBy | countryName=GROUP_BY (no fabricated sub-query-output columns) |

Example emitted facets for the join query's input dataset:
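For illustration only (this is neither the emitted payload nor the PR's code), a minimal Java sketch of how the content of the two facets could be derived from a per-column role map for the join query: sorted column names for `schema`, and a column-to-roles mapping for `druid_columnUsage`. The class `ColumnFacetSketch` and both method names are hypothetical, and the actual JSON layout of the facets is not shown in this description, so none is assumed here.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class ColumnFacetSketch {
    // Roles as listed in the description above.
    enum ColumnUsage { PROJECTION, GROUP_BY, AGGREGATION, FILTER, JOIN }

    // Content for the schema facet: referenced column names only, sorted.
    static List<String> schemaFieldNames(Map<String, EnumSet<ColumnUsage>> roles) {
        return new ArrayList<>(new TreeSet<>(roles.keySet()));
    }

    // Content for the usage facet: each column mapped to the names of its roles.
    static Map<String, List<String>> columnUsage(Map<String, EnumSet<ColumnUsage>> roles) {
        Map<String, List<String>> usage = new TreeMap<>();
        for (Map.Entry<String, EnumSet<ColumnUsage>> entry : roles.entrySet()) {
            List<String> names = new ArrayList<>();
            // EnumSet iterates in declaration order, so role order is stable.
            for (ColumnUsage role : entry.getValue()) {
                names.add(role.name());
            }
            usage.put(entry.getKey(), names);
        }
        return usage;
    }
}
```

Fed the join row from the table (`page=[PROJECTION, JOIN]`, `channel=PROJECTION`, `countryName=FILTER`), `schemaFieldNames` yields `channel`, `countryName`, `page` in that order.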
Release note
The OpenLineage emitter now emits column-level lineage for native queries as `schema` and `druid_columnUsage` facets on input datasets. This can be disabled with `druid.request.logging.columnLineageEnabled=false`.
Key changed/added classes in this PR
- `OpenLineageRequestLogger`
- `OpenLineageRequestLoggerProvider`
- `DruidColumnUsageDatasetFacet.json`

This PR has: