feat: coalesce contiguous range reads for partial segment downloads #19652

Draft
capistrant wants to merge 1 commit into apache:master from capistrant:range-reading-coalesce

@capistrant

Contributor

Description

First cut at adding coalescing to range reads for partial segment downloads. The overarching goal is to combine what would have been multiple reads of contiguous (or near-contiguous, see below) ranges into a single read when the merged read meets certain thresholds. I say near-contiguous because this PR adds a knob that defines the largest gap, in bytes, between two required reads that can be opportunistically read through so that a single ranged read covers both required internal files. If unneeded files are downloaded as part of a coalesced range, they are still marked downloaded on the host and become queryable. This means never-requested data can become resident on the data server and cause evictions on a host under disk pressure. Operators can raise or lower the new configs to make coalescing more or less aggressive. The hope is that the defaults are generally good for everyone, but they may need tuning after real-world learnings.

Configs

| Config | Default | Description |
| --- | --- | --- |
| `druid.segmentCache.virtualStorageCoalesceMaxGapBytes` | 1048576 (1 MiB) | Largest unwanted gap, in bytes, read through to merge two adjacent requested files into a single deep-storage range read. Larger values trade over-fetched bytes for fewer requests; 0 merges only truly-adjacent files. Must be >= 0. |
| `druid.segmentCache.virtualStorageCoalesceMaxChunkBytes` | 16777216 (16 MiB) | Largest size, in bytes, of a single coalesced range read. Bounds how big one fetch can grow and keeps a wide request split into several reads that can download concurrently rather than collapsing into one serial read. A single file larger than this is still fetched whole (the cap only limits how many files are merged). Must be >= 1. |

Both are validated at startup; an invalid value fails service startup with a clear message.
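
To make the interaction between the two thresholds concrete, here is a minimal sketch of one way a greedy coalescing pass could work. It is not the code in this PR: the `RangeCoalescer` and `Range` names, the greedy left-to-right policy, and the assumption that requested files do not overlap are all illustrative assumptions. The argument checks mirror the documented bounds (gap >= 0, chunk >= 1).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; names and policy are assumptions, not the PR's classes.
final class RangeCoalescer
{
  /** A requested byte range [start, end) inside the segment container. */
  static final class Range
  {
    final long start;
    final long end;

    Range(long start, long end)
    {
      this.start = start;
      this.end = end;
    }

    long length()
    {
      return end - start;
    }
  }

  /** Greedily merges sorted ranges while the gap and total chunk size stay within the limits. */
  static List<Range> coalesce(List<Range> requested, long maxGapBytes, long maxChunkBytes)
  {
    if (maxGapBytes < 0) {
      throw new IllegalArgumentException("maxGapBytes must be >= 0, got " + maxGapBytes);
    }
    if (maxChunkBytes < 1) {
      throw new IllegalArgumentException("maxChunkBytes must be >= 1, got " + maxChunkBytes);
    }
    List<Range> sorted = new ArrayList<>(requested);
    sorted.sort(Comparator.comparingLong(r -> r.start));

    List<Range> out = new ArrayList<>();
    Range current = null;
    for (Range next : sorted) {
      if (current == null) {
        current = next;
        continue;
      }
      long gap = next.start - current.end;
      long mergedLength = Math.max(current.end, next.end) - current.start;
      if (gap <= maxGapBytes && mergedLength <= maxChunkBytes) {
        // Read through the gap so both files arrive in one range read.
        current = new Range(current.start, Math.max(current.end, next.end));
      } else {
        // Close the chunk; a lone file bigger than maxChunkBytes is still emitted whole.
        out.add(current);
        current = next;
      }
    }
    if (current != null) {
      out.add(current);
    }
    return out;
  }

  /** Bytes fetched that were not part of any requested range (assumes no overlaps). */
  static long gapBytes(List<Range> requested, List<Range> coalesced)
  {
    long fetched = coalesced.stream().mapToLong(Range::length).sum();
    long wanted = requested.stream().mapToLong(Range::length).sum();
    return fetched - wanted;
  }
}
```

Under this sketch, a gap limit of 0 merges only files that touch, and lowering the chunk limit splits a wide request into several reads that could be issued concurrently.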

New metric

| Metric | Description | Dimensions |
| --- | --- | --- |
| `storage/virtual/read/gapBytes` | Of `storage/virtual/read/bytes`, the bytes read that were not part of a requested file (unrequested files spanned to coalesce, plus inter-file padding). Ratio to `read/bytes` is the over-fetch fraction. | `location` |
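
As a worked example of the over-fetch fraction, the sketch below divides a gap-bytes total by a read-bytes total. The `OverFetch` class is a hypothetical helper for illustration, not something this PR adds, and the sample figures are made up: two 4 MiB requested files separated by a 512 KiB gap, fetched as one coalesced read.

```java
// Hypothetical helper for illustration; not part of the PR.
final class OverFetch
{
  /** Fraction of bytes read that were not part of a requested file. */
  static double fraction(long readBytes, long gapBytes)
  {
    if (readBytes < 0 || gapBytes < 0 || gapBytes > readBytes) {
      throw new IllegalArgumentException("expected 0 <= gapBytes <= readBytes");
    }
    // Avoid dividing by zero when nothing has been read yet.
    return readBytes == 0 ? 0.0 : (double) gapBytes / readBytes;
  }

  public static void main(String[] args)
  {
    long mib = 1024L * 1024L;
    long gapBytes = mib / 2;             // 512 KiB read through between the two files
    long readBytes = 2 * 4 * mib + gapBytes; // 8.5 MiB fetched in total
    System.out.println(fraction(readBytes, gapBytes)); // 0.5 / 8.5, roughly 0.059
  }
}
```

In this made-up case roughly 6% of the bytes fetched were over-fetch, which is the kind of figure an operator might watch when tuning the gap threshold.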

Release note


Key changed/added classes in this PR
  • processing/src/main/java/org/apache/druid/segment/PartialQueryableIndex.java
  • processing/src/main/java/org/apache/druid/segment/PartialQueryableIndexCursorFactory.java
  • processing/src/main/java/org/apache/druid/segment/file/PartialSegmentFileMapperV10.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Embedded test

rename

fix checkstyle
@capistrant capistrant marked this pull request as draft July 2, 2026 21:57
* @param jsonMapper used by the metadata entry's mount path to parse the header
* @param storagePool thread pool the async cursor path submits on-demand column downloads to (which bounds
* load concurrency itself); may be null in tests that never invoke the cursor factory
* @param coalesceConfig range-coalescing thresholds applied to on-demand downloads once the entry is mounted
@jtuglu1

jtuglu1 commented Jul 3, 2026

Contributor

Question 1: Per-historical, do we track in-flight range requests so we avoid duplicate calls? E.g. query 1 calls for range A, not available locally so we initiate fetch from S3, meanwhile query 2 calls for range A. Does query 2 know about the in-flight request to avoid sending a duplicate?

Question 2: How are we balancing coalescing of ranged GET calls against maximizing parallelism at the S3 connection level?
