[10125] [encode path] Minor optimizations to arrow-flight#10137
[10125] [encode path] Minor optimizations to arrow-flight#10137Rich-T-kid wants to merge 6 commits into
Conversation
| } | ||
|
|
||
| /// Place the `FlightData` in the queue to send | ||
| #[inline] |
There was a problem hiding this comment.
The compiler very likely could have inlined this, but I think its work adding this explicitly.
|
run benchmarks flight |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-T-kid/minor-arrow-flight-opt (d02e297) to 826b808 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
seems like its mostly noise |
its interesting that this seems to always regress |
2c00600 to
337abd5
Compare
337abd5 to
094579b
Compare
|
@Jefffrey I meant to ping you on this PR . Sorry about that! |
|
run benchmarks flight |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-T-kid/minor-arrow-flight-opt (505fb20) to 826b808 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
Nice, regressions are gone. should re-run when 54faeda gets merged. I expected a larger improvement for larger rows/columns batches. I'll profile & update the PR |
37c7231 to
166e2e6
Compare
|
@alamb could you run the benchmarks for arrow-flight? 🚀 |
|
run benchmark flight |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-T-kid/minor-arrow-flight-opt (166e2e6) to 826b808 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
similar story here. encode is path shows good improvements but the roundtrip is the same or slightly worse. |
Thank you Even though it might feel silly to make 5 small PRs, it is much easier to review them in isolation and thus they will likely get merged much quicker than 1 PR with 5 independent changes |
# Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - works towards #10125 & works with #10137. # Rationale for this change there were no benchmarks for the decode path <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> # What changes are included in this PR? single benchmark that measures the time it takes to decode a stream of `flight_data` <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> # Are these changes tested? n/a <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? If this PR claims a performance improvement, please include evidence such as benchmark results. --> # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. --> no
166e2e6 to
ff6d10d
Compare
34a0b87 to
a918b8c
Compare
a918b8c to
bbca95e
Compare
|
@alamb this PR is ready for review |
|
I'll experiment with a buffer pool in a separate PR, a pool of pre-allocated buffers sized to |
|
@Rich-T-kid I am sorry I have lost track of all the PRs that are currently outstanding. Which is the most important / the most ready for review? |
| /// Note this value would normally be 4MB, but the size calculation is | ||
| /// somewhat inexact, so we set it to 2MB. | ||
| pub const GRPC_TARGET_MAX_FLIGHT_SIZE_BYTES: usize = 2097152; | ||
| /// gRPC's default max message size is 4MB; this 2MB target gives headroom for |
There was a problem hiding this comment.
Im going to replace this with the original values. I initially changed this when i was tuning buffer size math. Will revert!
Which issue does this PR close?
starting small 😄
Rationale for this change
The arrow-flight encode path was allocating intermediate Vecs to hold data that was immediately iterated and discarded. Replacing these with lazy iterators and inlining the one helper that existed only to loop removes allocations that served no purpose beyond bridging two adjacent lines of code.
What changes are included in this PR?
[commit #1]
Impl<Iterator>.[commit #2]
[commit #3]
CompressionContexttoIpcWriteContextand added anfbb: FlatBufferBuilder<'static>field to it[commit #4]
IpcWriteContextgains a scratch:Vec<u8>field. When set before a call toIpcDataGenerator::encode(), the existing allocation is reused instead of allocating a fresh buffer for each batch's arrow data body.FlightIpcEncodermaintains anArrowDataPool, a small pool of Arc<Mutex<Vec<Vec>>> buffers pre-sized to the gRPC message limit (2 MiB). Before eachencode()call, a buffer is acquired from the pool and placed inIpcWriteContext::scratch. After encoding, the buffer is wrapped inPooledBufand handed to Bytes::from_owner; when the Bytes is dropped (after the gRPC frame is sent), the buffer is automatically returned to the pool rather than freed.[commit #5 & 6]
acquiremethod to also pre-allocate 2MB of space in the vectoripc::encode()calls. the buffers are pre-allocated to themax_flight_data_sizeas an estimate. This means no intermediate vector copies to larger vectorscommit [ #7] final commit
arrow-ipc::encode()sink fromIpcBodySink::Write()toIpcBodySink::collect()memcpyto new vectors.memcpy. this is also why the profile shows alot of memcpy or _platform_memmove on mac.output buffer size changes
This PR also changes the size of buffers being output by
split_batch_for_grpc_response()The old algorithm computed n_batches first via ceiling division, then derived rows_per_batch from that:
This evenly distributes rows across chunks, meaning each buffer ends up smaller than max on average. Thus leaving capacity unused.
The new algorithm works directly from the target size:
This packs each buffer as close to max as possible before moving on.
This matters because the output buffers are pre-allocated to max_flight_data_size. Since the allocation cost is already paid upfront, the only cost of filling a buffer is the memcpy itself. As the profiles show, most time is spent serializing RecordBatches to IPC format, doing that for as many rows as possible in one pass, followed by a single large memcpy, is faster than multiple smaller serializations and copies. Leaving pre-allocated capacity unused means splitting work across more messages than necessary, each carrying its own network and serialization overhead.
note: I expect to have to tune this a bit. this is because the size that is used to determine the total size of the record batch isn't exact. strings vary from row to row, so its hard to get the math 100% correct. but the closer we can get to the max_size the better
Are these changes tested?
yes
Are there any user-facing changes?
no