refactor(parquet): Fix comment formatting and order of files in cmake #17586
mohsaka wants to merge 1 commit
Build Impact Analysis
Selective Build Targets (building these covers all 49 affected)
Total affected: 49/581 targets
Affected targets (49): directly changed (17), transitively affected (32)
Slow path • Graph generated from PR branch
de8fccc to
d0bda26
Pull request overview
Refactors Parquet writer-related code under velox/dwio/parquet with an emphasis on reorganizing the ParquetFieldId header location, tightening build wiring, and doing comment/naming cleanups in Arrow/Parquet writer glue code.
Changes:
- Moved ParquetFieldId.h under velox/dwio/parquet/writer/ and updated includes accordingly.
- Updated Parquet writer CMake targets/wiring and minor ordering tweaks in Parquet common CMake.
- Comment formatting fixes and a small writer-properties validation change (plus a local variable rename).
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| velox/dwio/parquet/writer/Writer.h | Updates include to new ParquetFieldId.h location. |
| velox/dwio/parquet/writer/ParquetFieldId.h | Adds missing <cstdint> include after move. |
| velox/dwio/parquet/writer/CMakeLists.txt | Defines velox_dwio_parquet_field_id in writer subdir and links it. |
| velox/dwio/parquet/writer/arrow/ThriftInternal.h | Renames local variable for consistency. |
| velox/dwio/parquet/writer/arrow/Properties.h | Comment reflow + adds maxRowGroupLength validation (and logging include). |
| velox/dwio/parquet/writer/arrow/FileWriter.h | Comment punctuation fix. |
| velox/dwio/parquet/writer/arrow/ColumnWriter.h | Comment punctuation fix. |
| velox/dwio/parquet/common/CMakeLists.txt | Reorders sources for readability. |
| velox/dwio/parquet/CMakeLists.txt | Removes prior top-level definition of velox_dwio_parquet_field_id. |
| velox/connectors/hive/iceberg/IcebergParquetStatsCollector.h | Updates include to new ParquetFieldId.h location. |
| velox/connectors/hive/iceberg/IcebergColumnHandle.h | Updates include to new ParquetFieldId.h location. |
| velox/connectors/hive/iceberg/IcebergColumnHandle.cpp | Updates include to new ParquetFieldId.h location. |
e54392e to
c0ea649
|
@rui-mo @jinchengchenghh: Could you please help with review and merge? |
mbasmanova left a comment
@mohsaka, the two functional changes (maxRowGroupLength validation and Statistics → statistics rename) are fine. But they're buried in 100+ lines of comment line rewrapping that doesn't change content — just moves line breaks. This creates git blame noise for every touched line with no improvement in meaning.
- PR title. feat: is wrong; there's no new feature. Use refactor: or misc:.
- Separate the functional changes from the reformatting. The validation and rename deserve their own small PR with a test for the maxRowGroupLength validation (verify that a non-positive value throws). Comment reformatting that doesn't change content should either be dropped or submitted separately.
- PR description. Remove the "Files Modified" section; GitHub already shows that. Move "Impact" to replace "Summary" and lead with what matters.
- Some rewrapped comments read worse. For example, "the values array of.\n/// List types" breaks mid-phrase. If reformatting, fix the content too; this should be "the values array of list types" (the original has a misplaced period).
    /// Default 1Mi rows.
    Builder* maxRowGroupLength(int64_t maxRowGroupLength) {
      if (maxRowGroupLength <= 0) {
        throw ParquetException("maxRowGroupLength must be positive");
It might be better to put this change into a separate PR with corresponding tests, and keep this PR focused on refactoring only.
Agreed, thanks! Removed all functional changes. This is now a refactoring-only PR.
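For reference, the test requested in this thread (verify that a non-positive maxRowGroupLength throws) could be sketched roughly as follows. This is a minimal, self-contained illustration and not the Velox code: the ParquetException and Builder classes below are hypothetical stand-ins that only mirror the snippet quoted in the review, and the helper names rejects and checkMaxRowGroupLengthValidation are invented for this sketch. The default of 1024 * 1024 is taken from the "Default 1Mi rows" comment.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in so the sketch compiles on its own; the real
// ParquetException lives in the Velox Parquet writer code.
class ParquetException : public std::runtime_error {
 public:
  explicit ParquetException(const std::string& message)
      : std::runtime_error(message) {}
};

// Hypothetical stand-in for the writer-properties builder.
class Builder {
 public:
  // Mirrors the check quoted in the review: reject non-positive lengths.
  Builder* maxRowGroupLength(int64_t maxRowGroupLength) {
    if (maxRowGroupLength <= 0) {
      throw ParquetException("maxRowGroupLength must be positive");
    }
    maxRowGroupLength_ = maxRowGroupLength;
    return this;
  }

  int64_t maxRowGroupLength() const {
    return maxRowGroupLength_;
  }

 private:
  int64_t maxRowGroupLength_{1024 * 1024}; // "Default 1Mi rows."
};

// Returns true if setting `value` throws ParquetException.
bool rejects(int64_t value) {
  Builder builder;
  try {
    builder.maxRowGroupLength(value);
  } catch (const ParquetException&) {
    return true;
  }
  return false;
}

// The checks a unit test for the validation would make.
void checkMaxRowGroupLengthValidation() {
  assert(rejects(0));
  assert(rejects(-1));
  assert(!rejects(1));

  // A valid value is stored and the builder is returned for chaining.
  Builder builder;
  assert(builder.maxRowGroupLength(4096)->maxRowGroupLength() == 4096);
}
```

In the actual follow-up PR these checks would presumably be written as a unit test against the real builder in the project's test framework; here they are plain asserts, so calling checkMaxRowGroupLengthValidation() from any main is enough to exercise them.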
|
@mohsaka Please stop copying my code. My original PR is still open. This is an open-source community, not IBM proprietary code. Although I have left IBM, I have not left the Velox community. |
|
@PingLiuPing Please link the specific PR you are referring to. We have asked multiple times whether you were still actively working on the Iceberg-related items, but we did not receive responses on those threads:
The intent is not to take ownership of your work. The goal is to help move functionality forward for the benefit of the Velox project and downstream users internally at IBM. As you mentioned, this is an open-source community project. Velox is licensed under Apache 2.0, which permits contributors to “reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work.” That said, if you are actively working on these PRs, please continue pushing them forward and getting them reviewed. We would appreciate the collaboration and would prefer to avoid duplicated effort. |
|
Totally support @mohsaka. @PingLiuPing: Our intent is not to copy your IBM code submission. On the contrary, we prefer to submit it publicly so that we can move on with our future projects. We have been linking your IBM PR, soliciting your review and addressing your comments. Having said that, there is urgency, as there are many active Iceberg requests for us and you have moved on from IBM as well, so we don't want to keep this pending for very long. We would be very happy to collaborate and cooperate with you on this, as we too want to resolve it and move on with our new work. Please let us know some concrete PRs that you have identified for your work and we will help you review and push them along. Otherwise, the best we can do is to keep breaking your original PR in a way that is manageable for us to resolve conflicts with it internally. @mbasmanova, @pedroerp, @tdcmeehan: We would appreciate your advice on this. |
mbasmanova left a comment
@PingLiuPing has made and continues to make significant contributions to the Iceberg work in Velox and we appreciate that.
When a PR builds on or derives from another contributor's work, please add Co-authored-by in the commit message and CC the original author for review. This ensures proper attribution and gives them the opportunity to stay involved.
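As an illustration of the attribution convention described above, a commit message carrying a co-author trailer might look like the following. The subject line is this PR's title, and the trailer and the IBM/velox#1836 reference both appear elsewhere on this page; the wording and placement of the "Derived from" sentence are assumptions made for this example.

```text
refactor(parquet): Fix comment formatting and order of files in cmake

Derived from IBM/velox#1836.

Co-authored-by: Ping Liu <ping.liu.ping@gmail.com>
```

Trailers of this kind are normally placed at the very end of the message, one per line, after a blank line, which is the form GitHub documents for crediting co-authors.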
|
@mbasmanova Agree with you. I think Michael did have @PingLiuPing on the Co-author line from the beginning. |
|
I want to be very clear about my concern here. I do not accept the framing that #17586 is simply “moving functionality forward” independently from my work. This PR contains code derived from my still-open PR #16407. Renaming variables and changing the PR description does not make it independent work.
Apache 2.0 grants broad rights to reproduce and prepare derivative works, but citing that license as the complete answer misses the point. This is not only a licensing question. It is also a question of attribution, authorship, and basic open-source collaboration norms.
If a PR is based on another contributor’s active, open PR, then that should be stated clearly up front, the original PR should be linked prominently, and the original author should at least be notified and acknowledged before the work is re-submitted under another PR.
The “urgency” argument is also not convincing in this case. I have not tried to block urgent Iceberg functionality. For example, when @aditi-pandit and @nmahadevuni reverted a large PR (#16999) related to Iceberg stats and the work was later re-submitted, I did not object because that work was core Iceberg functionality, even though I still had serious concerns about the process, e.g., no domain maintainer's approval, and it was driven from another community.
#17586 is different. This PR is primarily refactoring/cleanup and does not provide new Iceberg functionality that would justify bypassing the original author’s active work.
Regarding “please continue pushing them forward and getting them reviewed”: I am continuing to contribute to Velox, but I also have my own work and priorities. Your internal priorities do not create an obligation for me to work on your schedule, and they do not justify re-submitting my active PR work as a separate PR.
I also do not think “we will keep breaking your original PR in a manageable way” is an acceptable description of what is happening. This code was not inherently part of that larger reverted PR, and this is not just a normal merge conflict caused by ongoing development. This is code taken from my open PR, lightly modified, and submitted separately.
If the normal process had been followed for the larger stats PR, the issue should have been identified and fixed in Velox instead of reverting the entire PR and then re-submitting pieces of related work (the original issue was found in Presto, not in Velox; all tests in Velox passed). If that had happened, the work you mentioned would not have been needed in the first place.
If the project’s position is that Apache 2.0 means contributors may freely copy each other’s active PRs, lightly modify them, omit clear attribution in the PR description, and submit them as new PRs, then under that interpretation any contributor could copy another contributor’s new PR, change the description or small details, and submit it as their own. I do not believe that is a healthy or respectful standard for this community. |
|
@PingLiuPing: Thanks for your comments. We highly value your work, though I do want to offer some clarifications on my statements. i) Thanks for #16407. I see that Deepak has already reviewed it and we would like for you to merge it. @mohsaka and I are both working on our own features without taking any of yours. We both strongly prefer that you submit your code in IBM#1836, as it's your contribution and we prefer to take responsibility for our own. I will defer to the IBM managers/Velox PLC now, as I wasn't really part of Iceberg until very recently and I'm mostly coming from all the new Iceberg work we want to do. |
…files. Co-authored-by: Ping Liu <ping.liu.ping@gmail.com>
|
@aditi-pandit Thanks for the clarification.
Could you please clarify what you mean by this? For example:
So I’m having trouble reconciling those PRs with the statement that this is independent work.
Could you also clarify the reasoning and process here? If the issue was in Presto OSS or a downstream integration, what was the basis for reverting the Velox OSS PR rather than filing a bug, asking the original author to help, or reverting only in the downstream project until the fix was available?
Good to know, thanks. |
|
Both #17582 and #17599 are derived from IBM#1836 and you are credited as author. They are not independent work. @mohsaka has done several Iceberg PRs of his own (https://github.com/prestodb/presto/pulls?q=is%3Apr+author%3Amohsaka+is%3Aclosed) and, like I said, we are mostly concerned about our new work. Our schema evolution work depends on your previous contributions in IBM#1836. About your Iceberg stats PR: Presto OSS advances Velox as a submodule to pick up new changes. As the Velox submodule is not versioned, we don't have any way to disable individual commits and need to revert problematic changes before advancing Velox the next time. Presto OSS issue prestodb/presto#27450 was created for this. We hadn't advanced Velox in Presto OSS for over a week at this point and didn't want to continue without a release. |
|
@kKPulla has imported this pull request. If you are a Meta employee, you can view this in D106649890. |
Summary
Refactor and comment cleanup of velox/dwio Parquet writer files to improve code readability, consistency, and add input validation.
Impact
Derived from:
IBM/velox#1836
Co-authored-by: @PingLiuPing