[10125] [encode path] Minor optimizations to arrow-flight by Rich-T-kid · Pull Request #10137 · apache/arrow-rs

Rich-T-kid · 2026-06-12T13:13:21Z

Which issue does this PR close?

works towards closing Optimize arrow-flight #10125

starting small 😄

Rationale for this change

The arrow-flight encode path was allocating intermediate Vecs to hold data that was immediately iterated and discarded. Replacing these with lazy iterators and inlining the one helper that existed only to loop removes allocations that served no purpose beyond bridging two adjacent lines of code.

What changes are included in this PR?

[commit #1]

Remove intermediate Vec allocations in the encode path and replace them with impl Iterator return values.
Cache num_rows before split closure
Remove queue_messages, inline call site, mark queue_message #[inline]
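
As a rough illustration of the first item, here is a minimal sketch. The function names and the (offset, len) representation are hypothetical and not taken from the arrow-flight source; the point is only that a caller who immediately loops over the result does not need the Vec.

```rust
/// Eager version: allocates a Vec that the caller iterates once and drops.
fn chunk_ranges_vec(num_rows: usize, rows_per_chunk: usize) -> Vec<(usize, usize)> {
    let step = rows_per_chunk.max(1);
    let mut out = Vec::new();
    let mut offset = 0;
    while offset < num_rows {
        let len = step.min(num_rows - offset);
        out.push((offset, len));
        offset += len;
    }
    out
}

/// Lazy version: yields the same (offset, len) pairs on demand, with no heap
/// allocation. `num_rows` and `step` are computed once and moved into the closure.
fn chunk_ranges(num_rows: usize, rows_per_chunk: usize) -> impl Iterator<Item = (usize, usize)> {
    let step = rows_per_chunk.max(1);
    (0..num_rows)
        .step_by(step)
        .map(move |offset| (offset, step.min(num_rows - offset)))
}
```

Both functions produce the same pairs; only the lazy one avoids the temporary allocation when the caller just loops over the result.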

[commit #2]

Pre-allocate the vector used to hold uncompressed data.
- avoids growing the buffer through a series of reallocations [64k,512k,4MB,12MB...]
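
A small std-only sketch of why pre-sizing helps. count_regrowths is a hypothetical helper, not part of the PR, and the exact growth steps depend on Vec's growth policy, so the code only counts how often the capacity changes while a fixed amount of data is appended.

```rust
/// Appends `total` bytes to `buf` in pieces of `chunk` bytes and returns how
/// many times the Vec's capacity changed (each change implies a reallocation).
fn count_regrowths(mut buf: Vec<u8>, total: usize, chunk: usize) -> usize {
    let chunk = chunk.max(1); // guard against a zero-sized step
    let piece = vec![0u8; chunk];
    let mut cap = buf.capacity();
    let mut regrowths = 0;
    let mut written = 0;
    while written < total {
        let n = chunk.min(total - written);
        buf.extend_from_slice(&piece[..n]);
        written += n;
        if buf.capacity() != cap {
            cap = buf.capacity();
            regrowths += 1;
        }
    }
    regrowths
}
```

Called with Vec::with_capacity(total) the count is zero, while starting from Vec::new() it is greater than zero.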

[commit #3]

Renamed CompressionContext to IpcWriteContext and added an fbb: FlatBufferBuilder<'static> field to it
This avoids repeated heap allocations by reusing the same FlatBufferBuilder across writes, using its reset() method to clear state without deallocating
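
The reuse pattern can be sketched without the flatbuffers crate. The example below swaps in a plain Vec<u8> for the FlatBufferBuilder, with Vec::clear() playing the role of reset(); WriteContext and its length-prefixed encode() are hypothetical stand-ins, not the IpcWriteContext API.

```rust
/// Hypothetical write context that owns one buffer for its whole lifetime.
struct WriteContext {
    scratch: Vec<u8>,
}

impl WriteContext {
    fn new() -> Self {
        Self { scratch: Vec::new() }
    }

    /// Writes a length-prefixed copy of `payload` into the reused buffer.
    fn encode(&mut self, payload: &[u8]) -> &[u8] {
        // clear() sets the length to zero but keeps the allocation.
        self.scratch.clear();
        self.scratch
            .extend_from_slice(&(payload.len() as u32).to_le_bytes());
        self.scratch.extend_from_slice(payload);
        &self.scratch
    }
}
```

After the first large write, later smaller writes land in the same allocation, so steady-state encoding performs no further heap allocations in this sketch.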

[commit #4]

IpcWriteContext gains a scratch: Vec<u8> field. When set before a call to IpcDataGenerator::encode(), the existing allocation is reused instead of allocating a fresh buffer for each batch's arrow data body.
arrow-flight's FlightIpcEncoder maintains an ArrowDataPool, a small pool of buffers held in an Arc<Mutex<Vec<Vec<u8>>>> and pre-sized to the gRPC message limit (2 MiB). Before each encode() call, a buffer is acquired from the pool and placed in IpcWriteContext::scratch. After encoding, the buffer is wrapped in PooledBuf and handed to Bytes::from_owner; when the Bytes is dropped (after the gRPC frame is sent), the buffer is automatically returned to the pool rather than freed.
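
A std-only sketch of the return-on-drop idea described in this commit. BufPool and Pooled are hypothetical names and this is not the PR's ArrowDataPool or PooledBuf code. The wrapper implements AsRef<[u8]>, which is the kind of owner Bytes::from_owner accepts, but the bytes crate itself is not used here.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical pool of reusable byte buffers.
#[derive(Clone)]
struct BufPool {
    free: Arc<Mutex<Vec<Vec<u8>>>>,
    buf_capacity: usize,
}

impl BufPool {
    fn new(buf_capacity: usize) -> Self {
        Self { free: Arc::new(Mutex::new(Vec::new())), buf_capacity }
    }

    /// Hands out a recycled buffer if one is idle, else allocates a pre-sized one.
    fn acquire(&self) -> Vec<u8> {
        let recycled = self.free.lock().unwrap().pop();
        recycled.unwrap_or_else(|| Vec::with_capacity(self.buf_capacity))
    }

    /// Number of buffers currently waiting in the pool.
    fn idle(&self) -> usize {
        self.free.lock().unwrap().len()
    }
}

/// Owner wrapper: gives the buffer back to the pool when dropped.
struct Pooled {
    buf: Vec<u8>,
    pool: BufPool,
}

impl AsRef<[u8]> for Pooled {
    fn as_ref(&self) -> &[u8] {
        &self.buf
    }
}

impl Drop for Pooled {
    fn drop(&mut self) {
        // Take the buffer out, reset its length, and keep the allocation.
        let mut buf = std::mem::take(&mut self.buf);
        buf.clear();
        self.pool.free.lock().unwrap().push(buf);
    }
}
```

Every acquire and every drop takes the mutex in this sketch, which is one plausible source of the overhead that later led to the pool being removed.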

[commit #5 & 6]

Tuned the buffer pool: updated the acquire method to also pre-allocate 2MB of space in the vector.
Keep the scratch buffer across multiple ipc::encode() calls. The buffers are pre-allocated to max_flight_data_size as an estimate, which means no intermediate copies into larger vectors.

[commit #7] final commit

Remove the buffer pool. It was actually causing more overhead than letting the memory allocator handle pooling memory.
Switched the arrow-ipc::encode() sink from IpcBodySink::Write() to IpcBodySink::collect().
Ideally all RecordBatch buffers are written in one pass, with no need to re-allocate and memcpy into new vectors.
- extend_from_slice boils down to a very fast memcpy. This is also why the profile shows a lot of memcpy, or _platform_memmove on macOS.
- All buffers are written at once into a pre-sized destination, which improves branch prediction and allows the CPU to use SIMD for the copy.
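
The "pre-sized destination" idea can be sketched as follows. gather is a hypothetical helper, not the IpcBodySink API: it sums the source lengths first, allocates once, and then copies each buffer with extend_from_slice.

```rust
/// Copies all source buffers into one destination that is sized up front,
/// so the copies never trigger a reallocation.
fn gather(buffers: &[&[u8]]) -> Vec<u8> {
    let total: usize = buffers.iter().map(|b| b.len()).sum();
    let mut out = Vec::with_capacity(total);
    for b in buffers {
        out.extend_from_slice(b);
    }
    out
}
```

Because the total is known before the first copy, the loop body is a bounds check plus a memcpy for each buffer.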

output buffer size changes

This PR also changes the size of buffers being output by split_batch_for_grpc_response()
The old algorithm computed n_batches first via ceiling division, then derived rows_per_batch from that:

n_batches    = ceil(size / max)
rows_per_batch = num_rows / n_batches

This evenly distributes rows across chunks, so each buffer ends up smaller than max on average, leaving capacity unused.

The new algorithm works directly from the target size:

rows_per_batch = max * num_rows / size

This packs each buffer as close to max as possible before moving on.

This matters because the output buffers are pre-allocated to max_flight_data_size. Since the allocation cost is already paid upfront, the only cost of filling a buffer is the memcpy itself. As the profiles show, most time is spent serializing RecordBatches to IPC format. Doing that for as many rows as possible in one pass, followed by a single large memcpy, is faster than multiple smaller serializations and copies. Leaving pre-allocated capacity unused means splitting work across more messages than necessary, each carrying its own network and serialization overhead.
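
The two formulas can be compared with a small worked sketch. The function names are hypothetical and the code assumes uniformly sized rows; it mirrors the formulas above rather than the actual split_batch_for_grpc_response() implementation. For a 5 MiB batch of 1000 rows and a 2 MiB limit, the old formula gives 333 rows per chunk (about 1.67 MiB each), while the new one gives 400 rows (about 2 MiB each).

```rust
/// Old scheme: fix the number of chunks first, then spread rows evenly.
fn rows_per_batch_old(size: usize, max: usize, num_rows: usize) -> usize {
    let n_batches = size.div_ceil(max.max(1)).max(1);
    (num_rows / n_batches).max(1)
}

/// New scheme: work directly from the target size, so each chunk holds as
/// many rows as the size estimate says will fit under `max`.
fn rows_per_batch_new(size: usize, max: usize, num_rows: usize) -> usize {
    if size == 0 {
        return num_rows.max(1);
    }
    // Widen to u128 so `max * num_rows` cannot overflow.
    let rows = (max as u128 * num_rows as u128 / size as u128) as usize;
    // At least one row per chunk, and never more rows than the batch has.
    rows.clamp(1, num_rows.max(1))
}
```

The clamp on the new formula is an added guard for batches that already fit under the limit and for very wide rows.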

Note: I expect to have to tune this a bit, because the size used to determine the total size of the record batch isn't exact. Strings vary from row to row, so it's hard to get the math 100% correct, but the closer we can get to max_size the better.

Are these changes tested?

yes

Are there any user-facing changes?

no

Rich-T-kid · 2026-06-12T13:19:23Z

    }

    /// Place the `FlightData` in the queue to send
+    #[inline]


The compiler very likely could have inlined this, but I think it's worth adding this explicitly.

gabotechs · 2026-06-12T13:24:01Z

run benchmarks flight

adriangbot · 2026-06-12T13:27:50Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4691665801-559-vg5z6 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/minor-arrow-flight-opt (d02e297) to 826b808 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-06-12T13:40:15Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                         main                                   rich-T-kid_minor-arrow-flight-opt
-----                         ----                                   ---------------------------------
encode/dict/65536x1           1.02    283.2±1.04µs   887.9 MB/sec    1.00    278.5±1.31µs   902.7 MB/sec
encode/dict/65536x8           1.01      8.7±0.07ms   232.0 MB/sec    1.00      8.5±0.18ms   235.3 MB/sec
encode/dict/8192x1            1.00     35.2±0.02µs   928.0 MB/sec    1.02     35.8±0.03µs   913.1 MB/sec
encode/dict/8192x8            1.02    301.6±1.68µs   866.3 MB/sec    1.00    296.2±1.29µs   882.1 MB/sec
encode/fixed/65536x1          1.03     10.2±0.02µs    47.8 GB/sec    1.00      9.9±0.02µs    49.2 GB/sec
encode/fixed/65536x8          1.02   1121.6±1.92µs     3.5 GB/sec    1.00   1099.7±2.33µs     3.6 GB/sec
encode/fixed/8192x1           1.01      3.2±0.01µs    19.2 GB/sec    1.00      3.1±0.01µs    19.5 GB/sec
encode/fixed/8192x8           1.00     17.7±0.04µs    27.6 GB/sec    1.03     18.2±0.02µs    26.8 GB/sec
encode/nested/65536x1         1.01     38.9±0.29µs    31.4 GB/sec    1.00     38.4±0.17µs    31.8 GB/sec
encode/nested/65536x8         1.03      3.1±0.01ms     3.2 GB/sec    1.00      3.0±0.01ms     3.3 GB/sec
encode/nested/8192x1          1.00      5.7±0.01µs    26.9 GB/sec    1.01      5.8±0.01µs    26.5 GB/sec
encode/nested/8192x8          1.00     48.9±0.13µs    25.0 GB/sec    1.00     48.8±0.08µs    25.0 GB/sec
encode/variable/65536x1       1.00     73.4±0.26µs    29.9 GB/sec    1.01     73.9±0.31µs    29.7 GB/sec
encode/variable/65536x8       1.00      5.2±0.06ms     3.4 GB/sec    1.00      5.2±0.07ms     3.4 GB/sec
encode/variable/8192x1        1.00      6.9±0.01µs    40.1 GB/sec    1.02      7.0±0.01µs    39.1 GB/sec
encode/variable/8192x8        1.01     89.4±0.15µs    24.6 GB/sec    1.00     88.9±0.22µs    24.7 GB/sec
roundtrip/dict/65536x1        1.00  1275.9±46.22µs   197.0 MB/sec    1.01  1284.9±45.94µs   195.7 MB/sec
roundtrip/dict/65536x8        1.00     14.4±0.63ms   140.0 MB/sec    1.14     16.3±0.56ms   123.2 MB/sec
roundtrip/dict/8192x1         1.00    205.6±5.43µs   158.8 MB/sec    1.01    208.7±5.77µs   156.5 MB/sec
roundtrip/dict/8192x8         1.00  1313.8±42.83µs   198.9 MB/sec    1.00  1315.5±50.14µs   198.6 MB/sec
roundtrip/fixed/65536x1       1.00    305.2±3.84µs  1638.6 MB/sec    1.02    310.5±4.65µs  1610.4 MB/sec
roundtrip/fixed/65536x8       1.01      2.2±0.07ms  1855.0 MB/sec    1.00      2.1±0.04ms  1870.2 MB/sec
roundtrip/fixed/8192x1        1.02     90.3±1.35µs   693.3 MB/sec    1.00     88.9±1.07µs   703.7 MB/sec
roundtrip/fixed/8192x8        1.00    323.9±3.75µs  1545.8 MB/sec    1.02    330.9±5.18µs  1513.4 MB/sec
roundtrip/nested/65536x1      1.00   843.8±41.42µs  1481.6 MB/sec    1.00   841.6±41.74µs  1485.6 MB/sec
roundtrip/nested/65536x8      1.00      9.4±0.67ms  1066.8 MB/sec    1.12     10.5±0.37ms   949.0 MB/sec
roundtrip/nested/8192x1       1.00    156.6±5.36µs   999.1 MB/sec    1.01    157.9±4.96µs   990.6 MB/sec
roundtrip/nested/8192x8       1.00   889.4±42.46µs  1407.3 MB/sec    1.01   896.2±45.08µs  1396.6 MB/sec
roundtrip/variable/65536x1    1.00  1203.2±34.81µs  1870.1 MB/sec    1.04  1254.1±70.01µs  1794.3 MB/sec
roundtrip/variable/65536x8    1.03     16.4±0.51ms  1094.7 MB/sec    1.00     16.0±0.43ms  1124.1 MB/sec
roundtrip/variable/8192x1     1.00    204.6±5.86µs  1375.8 MB/sec    1.01    206.5±5.97µs  1362.6 MB/sec
roundtrip/variable/8192x8     1.00  1204.0±33.06µs  1869.9 MB/sec    1.01  1217.2±28.50µs  1849.6 MB/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	340.1s
Peak memory	98.5 MiB
Avg memory	36.7 MiB
CPU user	345.0s
CPU sys	73.4s
Peak spill	0 B

branch

Metric	Value
Wall time	335.1s
Peak memory	99.5 MiB
Avg memory	36.6 MiB
CPU user	339.9s
CPU sys	76.9s
Peak spill	0 B

Rich-T-kid · 2026-06-12T13:44:27Z

Seems like it's mostly noise.

Rich-T-kid · 2026-06-12T13:45:33Z

roundtrip/nested/65536x8      1.00      9.4±0.67ms  1066.8 MB/sec    1.12     10.5±0.37ms   949.0 MB/sec

It's interesting that this seems to always regress.

Rich-T-kid · 2026-06-13T02:31:37Z

@Jefffrey I meant to ping you on this PR. Sorry about that!

Jefffrey · 2026-06-13T02:33:22Z

run benchmarks flight

adriangbot · 2026-06-13T02:36:19Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4697178377-565-qnbtn 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

Comparing rich-T-kid/minor-arrow-flight-opt (505fb20) to 826b808 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-06-13T02:53:25Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                          main                                   rich-T-kid_minor-arrow-flight-opt
-----                          ----                                   ---------------------------------
encode/dict/65536x1            1.01    273.2±1.44µs   920.4 MB/sec    1.00    271.3±0.45µs   926.7 MB/sec
encode/dict/65536x16                                                  1.00     17.3±0.22ms   232.8 MB/sec
encode/dict/65536x4                                                   1.00   1180.2±4.95µs   852.1 MB/sec
encode/dict/65536x8            1.42      8.9±0.19ms   225.1 MB/sec    1.00      6.3±0.11ms   318.6 MB/sec
encode/dict/8192x1             1.00     35.2±0.03µs   928.7 MB/sec    1.00     35.2±0.04µs   927.0 MB/sec
encode/dict/8192x16                                                   1.00    630.1±2.04µs   829.3 MB/sec
encode/dict/8192x4                                                    1.00    143.2±0.12µs   912.3 MB/sec
encode/dict/8192x8             1.00    298.5±2.75µs   875.4 MB/sec    1.00    298.1±0.83µs   876.5 MB/sec
encode/fixed/65536x1           1.08     10.6±0.02µs    46.0 GB/sec    1.00      9.8±0.01µs    49.7 GB/sec
encode/fixed/65536x16                                                 1.00      2.4±0.03ms     3.3 GB/sec
encode/fixed/65536x4                                                  1.00     49.8±0.17µs    39.3 GB/sec
encode/fixed/65536x8           1.00   1110.2±5.22µs     3.5 GB/sec    1.02   1135.8±3.38µs     3.4 GB/sec
encode/fixed/8192x1            1.00      3.2±0.01µs    19.0 GB/sec    1.03      3.3±0.01µs    18.5 GB/sec
encode/fixed/8192x16                                                  1.00     36.2±0.18µs    27.0 GB/sec
encode/fixed/8192x4                                                   1.00      8.8±0.01µs    27.8 GB/sec
encode/fixed/8192x8            1.04     17.4±0.05µs    28.1 GB/sec    1.00     16.7±0.02µs    29.3 GB/sec
encode/nested/65536x1          1.00     28.1±0.20µs    43.5 GB/sec    1.04     29.3±0.30µs    41.7 GB/sec
encode/nested/65536x16                                                1.00      7.1±0.18ms     2.8 GB/sec
encode/nested/65536x4                                                 1.00  1485.8±19.84µs     3.3 GB/sec
encode/nested/65536x8          1.00      3.2±0.06ms     3.0 GB/sec    1.00      3.2±0.08ms     3.0 GB/sec
encode/nested/8192x1           1.16      6.8±0.01µs    22.6 GB/sec    1.00      5.8±0.01µs    26.2 GB/sec
encode/nested/8192x16                                                 1.00    148.7±0.41µs    16.4 GB/sec
encode/nested/8192x4                                                  1.00     21.3±0.03µs    28.7 GB/sec
encode/nested/8192x8           1.00     46.2±0.23µs    26.4 GB/sec    1.06     48.8±0.11µs    25.0 GB/sec
encode/variable/65536x1        1.59     81.4±0.51µs    27.0 GB/sec    1.00     51.2±0.22µs    42.9 GB/sec
encode/variable/65536x16                                              1.00     11.2±0.14ms     3.1 GB/sec
encode/variable/65536x4                                               1.00      2.4±0.05ms     3.6 GB/sec
encode/variable/65536x8        1.05      5.4±0.08ms     3.2 GB/sec    1.00      5.1±0.10ms     3.4 GB/sec
encode/variable/8192x1         1.17      7.0±0.01µs    39.1 GB/sec    1.00      6.0±0.01µs    45.8 GB/sec
encode/variable/8192x16                                               1.00   1171.6±7.63µs     3.8 GB/sec
encode/variable/8192x4                                                1.00     24.9±0.04µs    44.2 GB/sec
encode/variable/8192x8         1.06     80.7±0.13µs    27.2 GB/sec    1.00     76.0±0.22µs    28.9 GB/sec
roundtrip/dict/65536x1         1.01  1330.0±45.25µs   189.0 MB/sec    1.00  1315.9±45.27µs   191.1 MB/sec
roundtrip/dict/65536x16                                               1.00     29.5±1.10ms   136.5 MB/sec
roundtrip/dict/65536x4                                                1.00      6.7±0.23ms   150.9 MB/sec
roundtrip/dict/65536x8         1.06     15.3±0.72ms   131.9 MB/sec    1.00     14.3±0.54ms   140.2 MB/sec
roundtrip/dict/8192x1          1.00    212.8±5.92µs   153.4 MB/sec    1.00    212.4±6.06µs   153.8 MB/sec
roundtrip/dict/8192x16                                                1.00      2.4±0.05ms   216.8 MB/sec
roundtrip/dict/8192x4                                                 1.00   687.6±23.18µs   190.0 MB/sec
roundtrip/dict/8192x8          1.00  1355.1±49.83µs   192.8 MB/sec    1.00  1357.6±52.34µs   192.5 MB/sec
roundtrip/fixed/65536x1        1.01    319.7±3.74µs  1564.3 MB/sec    1.00    315.2±4.71µs  1586.4 MB/sec
roundtrip/fixed/65536x16                                              1.00      7.0±0.22ms  1142.9 MB/sec
roundtrip/fixed/65536x4                                               1.00  1306.1±82.37µs  1531.6 MB/sec
roundtrip/fixed/65536x8        1.00      2.3±0.08ms  1733.1 MB/sec    1.00      2.3±0.06ms  1727.3 MB/sec
roundtrip/fixed/8192x1         1.04     95.5±1.40µs   655.5 MB/sec    1.00     92.2±1.00µs   678.7 MB/sec
roundtrip/fixed/8192x16                                               1.00    654.1±8.15µs  1531.1 MB/sec
roundtrip/fixed/8192x4                                                1.00    197.5±3.38µs  1267.8 MB/sec
roundtrip/fixed/8192x8         1.00    339.6±4.53µs  1474.5 MB/sec    1.00    338.7±5.18µs  1478.5 MB/sec
roundtrip/nested/65536x1       1.03   882.8±43.55µs  1416.1 MB/sec    1.00   859.9±42.06µs  1453.9 MB/sec
roundtrip/nested/65536x16                                             1.00     19.3±0.68ms  1036.5 MB/sec
roundtrip/nested/65536x4                                              1.00      3.8±0.23ms  1305.6 MB/sec
roundtrip/nested/65536x8       1.24     10.7±0.73ms   931.4 MB/sec    1.00      8.7±0.28ms  1152.5 MB/sec
roundtrip/nested/8192x1        1.03    162.9±5.47µs   960.6 MB/sec    1.00    158.7±5.99µs   986.1 MB/sec
roundtrip/nested/8192x16                                              1.00  1628.2±41.63µs  1537.4 MB/sec
roundtrip/nested/8192x4                                               1.00   470.5±21.40µs  1330.1 MB/sec
roundtrip/nested/8192x8        1.00   930.5±41.73µs  1345.1 MB/sec    1.00   926.5±44.00µs  1350.9 MB/sec
roundtrip/variable/65536x1     1.01  1249.7±39.83µs  1800.5 MB/sec    1.00  1236.9±36.21µs  1819.2 MB/sec
roundtrip/variable/65536x16                                           1.00     31.3±1.17ms  1150.4 MB/sec
roundtrip/variable/65536x4                                            1.00      8.1±0.31ms  1115.6 MB/sec
roundtrip/variable/65536x8     1.04     17.0±0.50ms  1059.5 MB/sec    1.00     16.4±0.70ms  1100.1 MB/sec
roundtrip/variable/8192x1      1.03    214.7±5.60µs  1310.8 MB/sec    1.00    208.7±6.21µs  1348.2 MB/sec
roundtrip/variable/8192x16                                            1.00      3.3±0.27ms  1367.4 MB/sec
roundtrip/variable/8192x4                                             1.00   680.5±24.00µs  1654.1 MB/sec
roundtrip/variable/8192x8      1.03  1267.2±30.87µs  1776.7 MB/sec    1.00  1228.6±32.30µs  1832.4 MB/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	340.1s
Peak memory	100.2 MiB
Avg memory	38.1 MiB
CPU user	338.8s
CPU sys	75.8s
Peak spill	0 B

branch

Metric	Value
Wall time	660.1s
Peak memory	146.3 MiB
Avg memory	47.0 MiB
CPU user	620.5s
CPU sys	187.9s
Peak spill	0 B

Rich-T-kid · 2026-06-13T03:59:39Z

Nice, the regressions are gone. We should re-run when 54faeda gets merged. I expected a larger improvement for batches with more rows/columns. I'll profile and update the PR.

Rich-T-kid · 2026-06-18T18:06:08Z

@alamb could you run the benchmarks for arrow-flight? 🚀

alamb · 2026-06-18T19:20:19Z

run benchmark flight

adriangbot · 2026-06-18T19:23:22Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4745385483-591-gfbs4 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

Comparing rich-T-kid/minor-arrow-flight-opt (166e2e6) to 826b808 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-06-18T19:41:31Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                          main                                   rich-T-kid_minor-arrow-flight-opt
-----                          ----                                   ---------------------------------
encode/dict/65536x1            1.06    286.7±2.38µs   876.8 MB/sec    1.00    269.5±1.65µs   932.9 MB/sec
encode/dict/65536x16                                                  1.00     19.4±0.30ms   207.7 MB/sec
encode/dict/65536x4                                                   1.00      4.0±0.31ms   252.0 MB/sec
encode/dict/65536x8            1.00      8.1±0.73ms   248.1 MB/sec    1.24     10.1±0.86ms   199.9 MB/sec
encode/dict/8192x1             1.02     36.2±0.04µs   902.3 MB/sec    1.00     35.5±0.08µs   921.2 MB/sec
encode/dict/8192x16                                                   1.00    606.9±5.60µs   861.1 MB/sec
encode/dict/8192x4                                                    1.00    139.2±0.19µs   938.4 MB/sec
encode/dict/8192x8             1.08    311.4±3.10µs   839.1 MB/sec    1.00    287.3±2.07µs   909.6 MB/sec
encode/fixed/65536x1           1.04     10.3±0.03µs    47.5 GB/sec    1.00      9.8±0.03µs    49.6 GB/sec
encode/fixed/65536x16                                                 1.00      2.2±0.01ms     3.6 GB/sec
encode/fixed/65536x4                                                  1.00     50.4±1.04µs    38.8 GB/sec
encode/fixed/65536x8           8.56  1087.5±30.99µs     3.6 GB/sec    1.00    127.1±9.93µs    30.7 GB/sec
encode/fixed/8192x1            1.02      3.1±0.02µs    19.5 GB/sec    1.00      3.1±0.01µs    19.9 GB/sec
encode/fixed/8192x16                                                  1.00     37.7±0.59µs    26.0 GB/sec
encode/fixed/8192x4                                                   1.00      8.5±0.02µs    28.8 GB/sec
encode/fixed/8192x8            1.03     17.1±0.03µs    28.5 GB/sec    1.00     16.6±0.08µs    29.5 GB/sec
encode/nested/65536x1          1.00     28.8±0.21µs    42.5 GB/sec    1.13     32.5±0.59µs    37.5 GB/sec
encode/nested/65536x16                                                1.00      5.7±0.44ms     3.4 GB/sec
encode/nested/65536x4                                                 1.00   178.4±10.14µs    27.4 GB/sec
encode/nested/65536x8          1.26      2.9±0.11ms     3.3 GB/sec    1.00      2.3±0.14ms     4.2 GB/sec
encode/nested/8192x1           1.15      6.7±0.02µs    22.9 GB/sec    1.00      5.8±0.01µs    26.4 GB/sec
encode/nested/8192x16                                                 1.00    101.4±5.36µs    24.1 GB/sec
encode/nested/8192x4                                                  1.00     20.0±0.11µs    30.5 GB/sec
encode/nested/8192x8           1.06     46.9±0.21µs    26.1 GB/sec    1.00     44.2±0.32µs    27.6 GB/sec
encode/variable/65536x1        1.44     64.0±2.37µs    34.3 GB/sec    1.00     44.6±1.05µs    49.3 GB/sec
encode/variable/65536x16                                              1.00     11.2±0.98ms     3.1 GB/sec
encode/variable/65536x4                                               1.00   279.5±29.03µs    31.5 GB/sec
encode/variable/65536x8        1.63      5.4±0.50ms     3.3 GB/sec    1.00      3.3±0.03ms     5.3 GB/sec
encode/variable/8192x1         1.23      7.4±0.01µs    37.3 GB/sec    1.00      6.0±0.01µs    45.8 GB/sec
encode/variable/8192x16                                               1.00   159.3±13.58µs    27.6 GB/sec
encode/variable/8192x4                                                1.00     26.2±0.32µs    42.0 GB/sec
encode/variable/8192x8         1.53     83.3±1.90µs    26.4 GB/sec    1.00     54.5±2.36µs    40.3 GB/sec
roundtrip/dict/65536x1         1.00  1284.5±49.82µs   195.7 MB/sec    1.02  1306.8±57.29µs   192.4 MB/sec
roundtrip/dict/65536x16                                               1.00     29.9±4.15ms   134.5 MB/sec
roundtrip/dict/65536x4                                                1.00      7.2±0.35ms   138.8 MB/sec
roundtrip/dict/65536x8         1.00     15.2±0.88ms   132.4 MB/sec    1.09     16.5±1.06ms   121.7 MB/sec
roundtrip/dict/8192x1          1.00    208.4±6.37µs   156.7 MB/sec    1.01    211.2±5.66µs   154.7 MB/sec
roundtrip/dict/8192x16                                                1.00      2.7±0.09ms   197.1 MB/sec
roundtrip/dict/8192x4                                                 1.00   687.5±25.80µs   190.0 MB/sec
roundtrip/dict/8192x8          1.01  1325.5±65.68µs   197.1 MB/sec    1.00  1308.0±45.54µs   199.8 MB/sec
roundtrip/fixed/65536x1        1.01    308.2±3.29µs  1622.9 MB/sec    1.00    305.6±4.94µs  1636.5 MB/sec
roundtrip/fixed/65536x16                                              1.00      6.7±0.18ms  1192.1 MB/sec
roundtrip/fixed/65536x4                                               1.00  1227.5±60.18µs  1629.6 MB/sec
roundtrip/fixed/65536x8        1.02      2.1±0.03ms  1880.7 MB/sec    1.00      2.1±0.10ms  1922.6 MB/sec
roundtrip/fixed/8192x1         1.00     89.1±1.29µs   702.8 MB/sec    1.03     91.6±1.07µs   683.2 MB/sec
roundtrip/fixed/8192x16                                               1.00   666.1±17.36µs  1503.4 MB/sec
roundtrip/fixed/8192x4                                                1.00    198.5±2.38µs  1261.2 MB/sec
roundtrip/fixed/8192x8         1.00    328.2±4.59µs  1525.5 MB/sec    1.02    333.5±3.15µs  1501.5 MB/sec
roundtrip/nested/65536x1       1.02   854.9±53.14µs  1462.3 MB/sec    1.00   841.7±44.42µs  1485.2 MB/sec
roundtrip/nested/65536x16                                             1.00     20.2±0.98ms   990.8 MB/sec
roundtrip/nested/65536x4                                              1.00      3.1±0.19ms  1593.3 MB/sec
roundtrip/nested/65536x8       1.00     10.8±0.67ms   924.3 MB/sec    1.07     11.6±0.75ms   863.1 MB/sec
roundtrip/nested/8192x1        1.00    159.2±6.06µs   983.0 MB/sec    1.00    159.2±5.22µs   982.8 MB/sec
roundtrip/nested/8192x16                                              1.00  1797.0±85.29µs  1393.0 MB/sec
roundtrip/nested/8192x4                                               1.00   479.2±28.79µs  1306.0 MB/sec
roundtrip/nested/8192x8        1.03   930.8±63.29µs  1344.7 MB/sec    1.00   906.9±53.55µs  1380.1 MB/sec
roundtrip/variable/65536x1     1.00  1299.0±83.27µs  1732.3 MB/sec    1.13  1466.0±112.68µs  1534.9 MB/sec
roundtrip/variable/65536x16                                           1.00     31.3±1.00ms  1149.4 MB/sec
roundtrip/variable/65536x4                                            1.00      7.6±0.35ms  1185.0 MB/sec
roundtrip/variable/65536x8     1.00     15.9±0.76ms  1130.8 MB/sec    1.11     17.6±0.97ms  1021.0 MB/sec
roundtrip/variable/8192x1      1.00    203.5±5.47µs  1383.1 MB/sec    1.01    206.3±6.08µs  1364.1 MB/sec
roundtrip/variable/8192x16                                            1.00      2.7±0.15ms  1669.1 MB/sec
roundtrip/variable/8192x4                                             1.00   662.3±33.08µs  1699.8 MB/sec
roundtrip/variable/8192x8      1.00  1209.5±25.57µs  1861.5 MB/sec    1.10  1331.0±46.97µs  1691.5 MB/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	335.1s
Peak memory	95.4 MiB
Avg memory	37.8 MiB
CPU user	341.0s
CPU sys	72.3s
Peak spill	0 B

branch

Metric	Value
Wall time	675.1s
Peak memory	203.1 MiB
Avg memory	54.2 MiB
CPU user	667.9s
CPU sys	148.2s
Peak spill	0 B


Rich-T-kid · 2026-06-18T23:35:54Z

Similar story here: the encode path shows good improvements, but the roundtrip is the same or slightly worse.
I'm going to start looking at the decode path. If the scope of that grows too big, I'll split it into a separate PR.

alamb · 2026-06-19T11:10:46Z

if the scope of that grows too big I'll split it into a separate PR

Thank you

Even though it might feel silly to make 5 small PRs, it is much easier to review them in isolation, and thus they will likely get merged much quicker than 1 PR with 5 independent changes.

# Which issue does this PR close?

- works towards #10125 & works with #10137.

# Rationale for this change

there were no benchmarks for the decode path

# What changes are included in this PR?

single benchmark that measures the time it takes to decode a stream of `flight_data`

# Are these changes tested?

n/a

# Are there any user-facing changes?

no


Rich-T-kid · 2026-06-24T20:26:27Z

@alamb this PR is ready for review

Rich-T-kid · 2026-06-24T20:31:17Z

I'll experiment with a buffer pool in a separate PR, a pool of pre-allocated buffers sized to max_bytes_size that get reused over the lifetime of FlightDataEncoder. Realistically this only pays off for large transfers (128KB+), since for smaller allocations the pooling overhead outweighs the benefit and the allocator handles those cases well on its own.

There is a small attempt at this in a previous commit in this PR.
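To make the pooling idea above concrete, here is a minimal std-only sketch of a pool whose buffers return themselves on drop. It is not the code from this PR: `BufferPool`, `PooledBuf`, `buf_capacity`, `max_pooled`, and `idle` are hypothetical names chosen for illustration, and the real implementation hands the buffer to `Bytes::from_owner` rather than using it directly.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical pool of reusable byte buffers (illustrative only, not the PR's API).
#[derive(Clone)]
struct BufferPool {
    /// Idle buffers waiting to be reused.
    free: Arc<Mutex<Vec<Vec<u8>>>>,
    /// Capacity used when a fresh buffer has to be allocated.
    buf_capacity: usize,
    /// Upper bound on idle buffers, so the pool cannot grow without limit.
    max_pooled: usize,
}

impl BufferPool {
    fn new(buf_capacity: usize, max_pooled: usize) -> Self {
        Self {
            free: Arc::new(Mutex::new(Vec::new())),
            buf_capacity,
            max_pooled,
        }
    }

    /// Take an idle buffer if one exists, otherwise allocate a new one.
    fn acquire(&self) -> PooledBuf {
        let buf = self
            .free
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| Vec::with_capacity(self.buf_capacity));
        PooledBuf {
            buf: Some(buf),
            pool: self.clone(),
        }
    }

    /// Number of buffers currently idle in the pool.
    fn idle(&self) -> usize {
        self.free.lock().unwrap().len()
    }
}

/// A buffer that goes back to its pool when dropped.
struct PooledBuf {
    buf: Option<Vec<u8>>,
    pool: BufferPool,
}

impl PooledBuf {
    fn as_mut_vec(&mut self) -> &mut Vec<u8> {
        self.buf.as_mut().expect("buffer is present until drop")
    }
}

impl Drop for PooledBuf {
    fn drop(&mut self) {
        if let Some(mut buf) = self.buf.take() {
            // `clear` resets the length but keeps the allocation.
            buf.clear();
            let mut free = self.pool.free.lock().unwrap();
            if free.len() < self.pool.max_pooled {
                free.push(buf);
            }
            // Otherwise the buffer is simply freed here.
        }
    }
}
```

Every `acquire` and every drop takes the mutex, which is one source of the fixed overhead mentioned above; for small buffers that cost can plausibly exceed what the allocator would have charged for a fresh allocation.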

alamb · 2026-06-25T17:58:20Z

@Rich-T-kid I am sorry I have lost track of all the PRs that are currently outstanding. Which is the most important / the most ready for review?

Rich-T-kid · 2026-06-25T18:21:01Z

#10137 (this PR) & #10206
[encode path] [decode path]

Rich-T-kid · 2026-06-25T19:03:55Z

-/// Note this value would normally be 4MB, but the size calculation is
-/// somewhat inexact, so we set it to 2MB.
-pub const GRPC_TARGET_MAX_FLIGHT_SIZE_BYTES: usize = 2097152;
+/// gRPC's default max message size is 4MB; this 2MB target gives headroom for


I'm going to replace this with the original values. I initially changed this when I was tuning the buffer size math. Will revert!
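For context on why the target size matters to the buffer math, the sketch below shows the kind of arithmetic a size target implies: a ceiling division to get a chunk count, then an even split of rows. It is an assumption-laden illustration, not arrow-flight's actual splitting code; `TARGET_MAX_BYTES`, `chunks_needed`, and `rows_per_chunk` are hypothetical names, and `estimated_bytes` stands in for whatever (inexact) size estimate the encoder uses.

```rust
/// Assumed 2 MiB target, matching the value 2097152 in the diff above.
const TARGET_MAX_BYTES: usize = 2 * 1024 * 1024;

/// Number of chunks needed so that each chunk holds at most `target`
/// estimated bytes. Uses ceiling division and never returns 0.
/// Assumes `target > 0`.
fn chunks_needed(estimated_bytes: usize, target: usize) -> usize {
    ((estimated_bytes + target - 1) / target).max(1)
}

/// Rows per chunk when `num_rows` rows are spread evenly over the chunks.
/// Clamped to at least 1 so that splitting always makes progress.
fn rows_per_chunk(num_rows: usize, estimated_bytes: usize, target: usize) -> usize {
    (num_rows / chunks_needed(estimated_bytes, target)).max(1)
}
```

Because the size estimate is inexact, a chunk sized this way can overshoot the target, which is why a 2 MiB target under a 4 MB gRPC limit leaves headroom.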

github-actions Bot added arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Jun 12, 2026

Rich-T-kid commented Jun 12, 2026


Rich-T-kid changed the title ~~[10125] Minor optimizations to arrow-flight~~ [10125] [encode path] Minor optimizations to arrow-flight Jun 12, 2026

Rich-T-kid force-pushed the rich-T-kid/minor-arrow-flight-opt branch 2 times, most recently from 2c00600 to 337abd5 Compare June 12, 2026 18:02

Rich-T-kid commented Jun 12, 2026


Comment thread arrow-ipc/src/compression.rs Outdated

Rich-T-kid force-pushed the rich-T-kid/minor-arrow-flight-opt branch from 337abd5 to 094579b Compare June 12, 2026 21:03

Rich-T-kid mentioned this pull request Jun 13, 2026

[arrow-flight] Optimize flight, remove some allocations, add dictionary focused benchmarks #10126

Merged

Rich-T-kid mentioned this pull request Jun 17, 2026

Add arrow-flight test coverage for IPC compression #10097

Merged

Rich-T-kid force-pushed the rich-T-kid/minor-arrow-flight-opt branch 2 times, most recently from 37c7231 to 166e2e6 Compare June 18, 2026 17:45

Rich-T-kid mentioned this pull request Jun 23, 2026

introduce decode benchmarks #10202

Merged

Rich-T-kid mentioned this pull request Jun 23, 2026

[10125] arrow-flight decode path optimizations #10206

Open

Rich-T-kid added 5 commits June 24, 2026 15:06

avoid re-allocations for uncompressed path

ec8342f

re-use flatbuffer allocations across calls

5453202

introduce buffer pool to avoid repeatily allocating 2/4/12 MB's of me…

5ceb08d

…mory to fill array data

checkpoint for buffer observibility

35ea3af

resolve merge conflicts

3c737be

Rich-T-kid force-pushed the rich-T-kid/minor-arrow-flight-opt branch from 166e2e6 to ff6d10d Compare June 24, 2026 19:51

Rich-T-kid commented Jun 24, 2026


Comment thread arrow-flight/src/encode.rs Outdated

Rich-T-kid commented Jun 24, 2026


Comment thread arrow-ipc/src/writer.rs Outdated

Rich-T-kid commented Jun 24, 2026


Comment thread arrow-ipc/src/writer.rs Outdated

Rich-T-kid force-pushed the rich-T-kid/minor-arrow-flight-opt branch 2 times, most recently from 34a0b87 to a918b8c Compare June 24, 2026 20:15

trim PR

bbca95e

Rich-T-kid force-pushed the rich-T-kid/minor-arrow-flight-opt branch from a918b8c to bbca95e Compare June 24, 2026 20:16

Rich-T-kid commented Jun 25, 2026


5 participants
