Skip to content

refactor(parquet): Fix comment formatting and order of files in cmake#17586

Closed
mohsaka wants to merge 1 commit into
facebookincubator:mainfrom
mohsaka:iceberg_refactor
Closed

refactor(parquet): Fix comment formatting and order of files in cmake#17586
mohsaka wants to merge 1 commit into
facebookincubator:mainfrom
mohsaka:iceberg_refactor

Conversation

@mohsaka

@mohsaka mohsaka commented May 21, 2026

Copy link
Copy Markdown
Contributor

Summary

Refactor and comment cleanup of velox/dwio Parquet writer files to improve code readability, consistency, and add input validation.

Impact

  • Improved code maintainability and documentation quality
  • Better adherence to coding standards

Derived from:
IBM/velox#1836

Co-authored-by: @PingLiuPing

@netlify

netlify Bot commented May 21, 2026

Copy link
Copy Markdown

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 512fb8d
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/6a179977960bf700079e5c7b

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 21, 2026
@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown

Build Impact Analysis

Selective Build Targets (building these covers all 49 affected)

cmake --build _build/release --target spark_aggregation_fuzzer_test velox_dwio_arrow_parquet_writer_test velox_dwio_iceberg_reader_benchmark velox_dwio_parquet_page_reader_test velox_dwio_parquet_reader_benchmark velox_dwio_parquet_reader_test velox_dwio_parquet_rlebp_decoder_test velox_dwio_parquet_structure_decoder_test velox_dwio_parquet_table_scan_test velox_dwio_parquet_tpch_test velox_exec_SpatialJoinTest velox_exec_test_group0 velox_exec_test_group1 velox_exec_test_group2 velox_exec_test_group3 velox_exec_test_group4 velox_exec_test_group5 velox_exec_test_group6 velox_exec_test_group7 velox_exec_test_group8 velox_gcs_insert_test velox_gcs_multiendpoints_test velox_hdfs_insert_test velox_hive_connector_test velox_hive_iceberg_deletion_vector_test velox_hive_iceberg_deletion_vector_writer_test velox_hive_iceberg_dwrf_insert_test velox_hive_iceberg_equality_delete_test velox_hive_iceberg_insert_test velox_hive_iceberg_test velox_hive_writer_options_adapter_test velox_parquet_e2e_filter_test velox_parquet_writer_sink_test velox_parquet_writer_test velox_query_replayer velox_s3insert_test velox_s3multiendpoints_test velox_spark_query_runner_test velox_tool_trace_test velox_wave_benchmark velox_wave_exec_test

Total affected: 49/581 targets

Warning: 1 file(s) could not be mapped to any target. A full build may be needed.

  • velox/dwio/parquet/common/CMakeLists.txt
Affected targets (49)

Directly changed (17)

Target Changed Files
velox_dwio_arrow_parquet_writer Properties.h
velox_dwio_arrow_parquet_writer_lib ColumnWriter.h, FileWriter.h, Properties.h
velox_dwio_arrow_parquet_writer_test ColumnWriter.h, FileWriter.h, Properties.h
velox_dwio_arrow_parquet_writer_test_lib ColumnWriter.h, FileWriter.h, Properties.h
velox_dwio_parquet_page_reader_test Properties.h
velox_dwio_parquet_reader_benchmark Properties.h
velox_dwio_parquet_reader_benchmark_lib Properties.h
velox_dwio_parquet_reader_test Properties.h
velox_dwio_parquet_table_scan_test Properties.h
velox_dwio_parquet_writer Properties.h
velox_hive_connector_test Properties.h
velox_hive_iceberg_splitreader Properties.h
velox_parquet_e2e_filter_test Properties.h
velox_parquet_writer_sink_test Properties.h
velox_parquet_writer_test ColumnWriter.h, Properties.h
velox_spark_query_runner Properties.h
velox_wave_benchmark Properties.h

Transitively affected (32)

  • spark_aggregation_fuzzer_test
  • velox_dwio_iceberg_reader_benchmark
  • velox_dwio_parquet_rlebp_decoder_test
  • velox_dwio_parquet_structure_decoder_test
  • velox_dwio_parquet_tpch_test
  • velox_exec_SpatialJoinTest
  • velox_exec_test_group0
  • velox_exec_test_group1
  • velox_exec_test_group2
  • velox_exec_test_group3
  • velox_exec_test_group4
  • velox_exec_test_group5
  • velox_exec_test_group6
  • velox_exec_test_group7
  • velox_exec_test_group8
  • velox_gcs_insert_test
  • velox_gcs_multiendpoints_test
  • velox_hdfs_insert_test
  • velox_hive_iceberg_deletion_vector_test
  • velox_hive_iceberg_deletion_vector_writer_test
  • velox_hive_iceberg_dwrf_insert_test
  • velox_hive_iceberg_equality_delete_test
  • velox_hive_iceberg_insert_test
  • velox_hive_iceberg_test
  • velox_hive_writer_options_adapter_test
  • velox_query_replayer
  • velox_query_trace_replayer_base
  • velox_s3insert_test
  • velox_s3multiendpoints_test
  • velox_spark_query_runner_test
  • velox_tool_trace_test
  • velox_wave_exec_test

Slow path • Graph generated from PR branch

@mohsaka mohsaka force-pushed the iceberg_refactor branch 2 times, most recently from de8fccc to d0bda26 Compare May 22, 2026 00:58
@mohsaka mohsaka marked this pull request as ready for review May 22, 2026 02:55
@mohsaka mohsaka requested a review from majetideepak as a code owner May 22, 2026 02:55
@mohsaka mohsaka requested a review from aditi-pandit May 22, 2026 05:11
@aditi-pandit aditi-pandit requested a review from Copilot May 22, 2026 06:27

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors Parquet writer-related code under velox/dwio/parquet with an emphasis on reorganizing the ParquetFieldId header location, tightening build wiring, and doing comment/naming cleanups in Arrow/Parquet writer glue code.

Changes:

  • Moved ParquetFieldId.h under velox/dwio/parquet/writer/ and updated includes accordingly.
  • Updated Parquet writer CMake targets/wiring and minor ordering tweaks in Parquet common CMake.
  • Comment formatting fixes and a small writer-properties validation change (plus a local variable rename).

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
velox/dwio/parquet/writer/Writer.h Updates include to new ParquetFieldId.h location.
velox/dwio/parquet/writer/ParquetFieldId.h Adds missing <cstdint> include after move.
velox/dwio/parquet/writer/CMakeLists.txt Defines velox_dwio_parquet_field_id in writer subdir and links it.
velox/dwio/parquet/writer/arrow/ThriftInternal.h Renames local variable for consistency.
velox/dwio/parquet/writer/arrow/Properties.h Comment reflow + adds maxRowGroupLength validation (and logging include).
velox/dwio/parquet/writer/arrow/FileWriter.h Comment punctuation fix.
velox/dwio/parquet/writer/arrow/ColumnWriter.h Comment punctuation fix.
velox/dwio/parquet/common/CMakeLists.txt Reorders sources for readability.
velox/dwio/parquet/CMakeLists.txt Removes prior top-level definition of velox_dwio_parquet_field_id.
velox/connectors/hive/iceberg/IcebergParquetStatsCollector.h Updates include to new ParquetFieldId.h location.
velox/connectors/hive/iceberg/IcebergColumnHandle.h Updates include to new ParquetFieldId.h location.
velox/connectors/hive/iceberg/IcebergColumnHandle.cpp Updates include to new ParquetFieldId.h location.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread velox/dwio/parquet/CMakeLists.txt
Comment thread velox/dwio/parquet/writer/arrow/Properties.h
@aditi-pandit aditi-pandit requested a review from Copilot May 22, 2026 06:57

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Comment thread velox/dwio/parquet/writer/arrow/Properties.h
Comment thread velox/dwio/parquet/writer/Writer.h Outdated
Comment thread velox/connectors/hive/iceberg/IcebergColumnHandle.h Outdated
@mohsaka mohsaka force-pushed the iceberg_refactor branch 2 times, most recently from e54392e to c0ea649 Compare May 22, 2026 07:13
@mohsaka mohsaka changed the title refactor: Refactor and comment cleanup of parquet writer code refactor: Refactor and comment cleanup to improve code readability, consistency, and add input validation. May 22, 2026
@mohsaka mohsaka changed the title refactor: Refactor and comment cleanup to improve code readability, consistency, and add input validation. refactor: Refactor and comment cleanup to improve code readability and add input validation. May 22, 2026
@mohsaka mohsaka changed the title refactor: Refactor and comment cleanup to improve code readability and add input validation. feat: Refactor and comment cleanup to improve code readability and add input validation May 22, 2026
@mohsaka mohsaka force-pushed the iceberg_refactor branch from c0ea649 to a4cf5dc Compare May 22, 2026 07:26

@aditi-pandit aditi-pandit left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mohsaka

@aditi-pandit

Copy link
Copy Markdown
Collaborator

@rui-mo @jinchengchenghh : Please can you help with review and merge.

@mbasmanova mbasmanova left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mohsaka, the two functional changes (maxRowGroupLength validation and Statisticsstatistics rename) are fine. But they're buried in 100+ lines of comment line rewrapping that doesn't change content — just moves line breaks. This creates git blame noise for every touched line with no improvement in meaning.

  1. PR title. feat: is wrong — there's no new feature. Use refactor: or misc:.

  2. Separate the functional changes from the reformatting. The validation and rename deserve their own small PR with a test for the maxRowGroupLength validation (verify that a non-positive value throws). Comment reformatting that doesn't change content should either be dropped or submitted separately.

  3. PR description. Remove the "Files Modified" section — GitHub already shows that. Move "Impact" to replace "Summary" — lead with what matters.

  4. Some rewrapped comments read worse. For example, "the values array of.\n/// List types" breaks mid-phrase. If reformatting, fix the content too — this should be "the values array of list types" (the original has a misplaced period).

@rui-mo rui-mo left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

/// Default 1Mi rows.
Builder* maxRowGroupLength(int64_t maxRowGroupLength) {
if (maxRowGroupLength <= 0) {
throw ParquetException("maxRowGroupLength must be positive");

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to put this change into a separate PR with corresponding tests, and keep this PR focused on refactoring only.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! Thanks!. Removed all functional changes. Refactoring only PR now.

@PingLiuPing

Copy link
Copy Markdown
Collaborator

@mohsaka Please stop copying my code. My original PR still open.

This is an open-source community, not IBM proprietary code. Although I have left IBM, I have not left the Velox community.

@mohsaka

mohsaka commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

@PingLiuPing Please link the specific PR you are referring to.

We have asked multiple times whether you were still actively working on the Iceberg-related items, but we did not receive responses on those threads:

The intent is not to take ownership of your work. The goal is to help move functionality forward for the benefit of the Velox project and downstream users internally at IBM.

As you mentioned, this is an open-source community project. Velox is licensed under Apache 2.0, which permits contributors to “reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work.”

That said, if you are actively working on these PRs, please continue pushing them forward and getting them reviewed. We would appreciate the collaboration and would prefer to avoid duplicated effort.

@aditi-pandit

Copy link
Copy Markdown
Collaborator

Totally support @mohsaka

@PingLiuPing : Our intent is not to copy your IBM code submission. On the contrary, we prefer to submit it publicly so that we can move on with our future projects. We have been linking your IBM PR, soliciting your review and addressing your comments. Having said that, there is urgency as there are many active Iceberg requests for us and you have moved on from IBM as well.. so we don't want to keep this pending for very long.

We would be very happy to collaborate and co-operate with you on this as we too want to resolve it and move on with our new work.

Please let us know some concrete PRs that you have identified for your work and we will help you review and push it along. Else the best we can do is to keep breaking your original PR in a way that is manageable for us to resolve conflicts with it internally.

@libianoss

@mbasmanova, @pedroerp, @tdcmeehan : Appreciate your advice on this.

@mbasmanova mbasmanova left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PingLiuPing has made and continues to make significant contributions to the Iceberg work in Velox and we appreciate that.

When a PR builds on or derives from another contributor's work, please add Co-authored-by in the commit message and CC the original author for review. This ensures proper attribution and gives them the opportunity to stay involved.

CC @aditi-pandit

@ethanyzhang

Copy link
Copy Markdown

@mbasmanova Agree with you. I think Michael did have @PingLiuPing on the Co-author line from the beginning.

@PingLiuPing

PingLiuPing commented May 27, 2026

Copy link
Copy Markdown
Collaborator

@mohsaka @aditi-pandit

I want to be very clear about my concern here. I do not accept the framing that #17586 is simply “moving functionality forward” independently from my work. This PR contains code derived from my still-open PR #16407. Renaming variables and changing the PR description does not make it independent work.

Apache 2.0 grants broad rights to reproduce and prepare derivative works, but citing that license as the complete answer misses the point. This is not only a licensing question. It is also a question of attribution, authorship, and basic open-source collaboration norms. If a PR is based on another contributor’s active, open PR, then that should be stated clearly up front, the original PR should be linked prominently, and the original author should at least be notified and acknowledged before the work is re-submitted under another PR.

The “urgency” argument is also not convincing in this case. I have not tried to block urgent Iceberg functionality. For example, when @aditi-pandit and @nmahadevuni reverted a large PR (#16999) related to Iceberg stats and the work was later re-submitted, I did not object because that work was core Iceberg functionality, even though I still had serious concerns about the process, e.g no domain maintainer's approval, it is driven from another community. #17586 is different. This PR is primarily refactoring/cleanup and does not provide new Iceberg functionality that would justify bypassing the original author’s active work.

Regarding “please continue pushing them forward and getting them reviewed”: I am continuing to contribute to Velox, but I also have my own work and priorities. Your internal priorities do not create an obligation for me to work on your schedule, and they do not justify re-submitting my active PR work as a separate PR.

I also do not think “we will keep breaking your original PR in a manageable way” is an acceptable description of what is happening. This code was not inherently part of that larger reverted PR, and this is not just a normal merge conflict caused by ongoing development. This is code taken from my open PR, lightly modified, and submitted separately.

If the normal process had been followed for the larger stats PR, the issue should have been identified and fixed in Velox instead of reverting the entire PR and then re-submitting pieces of related work (the original issue was found in Preso not in velox, all tests in velox was passed). If that had happened, the work you mentioned would not have been needed in the first place.

If the project’s position is that Apache 2.0 means contributors may freely copy each other’s active PRs, lightly modify them, omit clear attribution in the PR description, and submit them as new PRs, then under that interpretation any contributor could copy another contributor’s new PR, change the description or small details, and submit it as their own. I do not believe that is a healthy or respectful standard for this community.

@aditi-pandit

aditi-pandit commented May 27, 2026

Copy link
Copy Markdown
Collaborator

@PingLiuPing : Thanks for your comments. We highly value your work, though I do want to offer some clarifications on my statements

i) Thanks for #16407. I see that Deepak has already reviewed it and we would like for you to merge it.
ii) @mohsaka derived #17586 from your IBM github PR IBM#1836. You are officially attributed as the author of the code as you are a co-author on the commit. We (as in IBM) want to submit your code in IBM#1836 in Velox OSS and it will all be attributed to you.
iii) Our new work is around Iceberg v3 functionality, schema evolution ALTER TABLE with different options, rewrite_data_files which is newly done by us. The schema evolution code does depend on your work in IBM#1836 and so there is an urgency on our side to move it along.
iv) The Iceberg stats PR had broken Iceberg tests on specific platforms in Presto OSS and given that none of the engineers understood the code we preferred to revert and unblock Presto releases until the fix was found. Michael since fixed the code and resubmit the Iceberg stats work officially attributing you as the original author.

@mohsaka and I are both working on our own features without taking any of yours. We both strongly prefer that you submit your code in IBM#1836 as its your contribution and we prefer to take responsibility for our own.

I will defer to the IBM managers/Velox PLC now as I wasn't really part of Iceberg until very recently and I'm mostly coming from all the new Iceberg work we want to do.

@ethanyzhang @mbasmanova @libianoss

@mohsaka mohsaka changed the title feat: Refactor and comment cleanup to improve code readability and add input validation refactor: comment cleanup to improve code readability and add input validation May 28, 2026
@mohsaka mohsaka changed the title refactor: comment cleanup to improve code readability and add input validation refactor: comment cleanup to improve code readability May 28, 2026
@mohsaka mohsaka changed the title refactor: comment cleanup to improve code readability refactor: Comment cleanup and CMake ordering changes May 28, 2026
@mohsaka mohsaka force-pushed the iceberg_refactor branch from a4cf5dc to d2f9af4 Compare May 28, 2026 01:23
…files.

Co-authored-by: Ping Liu <ping.liu.ping@gmail.com>
@mohsaka mohsaka force-pushed the iceberg_refactor branch from d2f9af4 to 512fb8d Compare May 28, 2026 01:25
@mbasmanova mbasmanova changed the title refactor: Comment cleanup and CMake ordering changes refactor: Cleanup comments and fix file order in cmake May 28, 2026
@mbasmanova mbasmanova changed the title refactor: Cleanup comments and fix file order in cmake refactor(parquet): Fix comment formatting and order of files in cmake May 28, 2026

@mbasmanova mbasmanova left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the updates!

@mbasmanova mbasmanova added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label May 28, 2026
@PingLiuPing

Copy link
Copy Markdown
Collaborator

@aditi-pandit Thanks for the clarification.

@mohsaka and I are both working on our own features without taking any of yours.

Could you please clarify what you mean by this? For example:

So I’m having trouble reconciling those PRs with the statement that this is independent work.

The Iceberg stats PR had broken Iceberg tests on specific platforms in Presto OSS and given that none of the engineers understood the code we preferred to revert and unblock Presto releases until the fix was found.

Could you also clarify the reasoning and process here? If the issue was in Presto OSS or a downstream integration, what was the basis for reverting the Velox OSS PR rather than filing a bug, asking the original author to help, or reverting only in the downstream project until the fix was available?

as I wasn't really part of Iceberg until very recently and I'm mostly coming from all the new Iceberg work we want to do.

Good to know, thanks.

@aditi-pandit

aditi-pandit commented May 28, 2026

Copy link
Copy Markdown
Collaborator

@PingLiuPing :

Both #17582 and #17599 are derived from IBM#1836 and you are credited as author. They are not independent work.

@mohsaka has done several Iceberg PRs of his own https://github.com/prestodb/presto/pulls?q=is%3Apr+author%3Amohsaka+is%3Aclosed and like I said we are mostly concerned about our new work. Our schema evolution work depends on your previous contributions in IBM#1836.

About your Iceberg stats PR : Presto OSS advances Velox as a submodule to pick up new changes. As Velox submodule is not versioned, we don't have any way to disable individual commits and need to revert problematic changes before advancing Velox the next time. Presto OSS issue prestodb/presto#27450 was created for this. We hadn't advanced Velox in Presto OSS for over a week at this point and didn't want to continue without a release.

@libianoss

@meta-codesync

meta-codesync Bot commented May 28, 2026

Copy link
Copy Markdown

@kKPulla has imported this pull request. If you are a Meta employee, you can view this in D106649890.

@meta-codesync

meta-codesync Bot commented Jun 1, 2026

Copy link
Copy Markdown

@kKPulla merged this pull request in a647051.

@mohsaka mohsaka deleted the iceberg_refactor branch June 1, 2026 13:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants