feat: Add skipOffsetFromEarliest to compaction configuration #19635

Draft
ashwintumma23 wants to merge 5 commits into apache:master from ashwintumma23:feature/add-skipOffsetFromEarliest

Conversation

@ashwintumma23

Contributor

Description

This PR adds a skipOffsetFromEarliest parameter to the compaction configuration, providing the inverse functionality of the existing skipOffsetFromLatest parameter. This allows users to skip re-compacting older historical data while experimenting with compaction configuration changes on recent data.

Problem

Currently, Druid supports skipOffsetFromLatest to avoid re-compacting recent data (e.g., last 24 hours). However, there is no way to skip re-compacting old historical data. This becomes problematic in scenarios such as:

  1. Changing partition dimensions - When experimenting with new partition dimensions on large tables, all historical data gets re-compacted unnecessarily
  2. Switching shard specs (dynamic ↔ range) - Triggers re-compaction of the entire dataset
  3. Long-retention tables - Tables with months of historical data where old segments don't need re-compaction

Solution

Added a skipOffsetFromEarliest field that mirrors skipOffsetFromLatest:

  • skipOffsetFromLatest: Skips data at the END of the timeline, i.e. the interval from (latest timestamp - offset) to the latest timestamp
  • skipOffsetFromEarliest: Skips data at the START of the timeline, i.e. the interval from the earliest timestamp to (earliest timestamp + offset)
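To make the two intervals concrete, here is a minimal sketch of the arithmetic. It is not the code from this PR: the class and method names are hypothetical, java.time stands in for the Joda-Time types Druid uses, and any alignment to segment granularity is ignored.

```java
import java.time.Period;
import java.time.ZonedDateTime;

// Hypothetical illustration only; not DataSourceCompactibleSegmentIterator.
final class SkipIntervalSketch {

    /** Plain start/end pair standing in for an interval type. */
    static final class Span {
        final ZonedDateTime start;
        final ZonedDateTime end;

        Span(ZonedDateTime start, ZonedDateTime end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return start + "/" + end;
        }
    }

    /** Interval skipped at the START of the timeline: earliest .. earliest + offset. */
    static Span earliestSkip(ZonedDateTime earliest, Period offset) {
        return new Span(earliest, earliest.plus(offset));
    }

    /** Interval skipped at the END of the timeline: latest - offset .. latest. */
    static Span latestSkip(ZonedDateTime latest, Period offset) {
        return new Span(latest.minus(offset), latest);
    }
}
```

With an earliest timestamp of 2024-01-01 and an offset of P30D, earliestSkip covers 2024-01-01 up to 2024-01-31; a Period.ZERO offset yields an empty interval, which matches the "no skipping" default described in this PR.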

Example configuration, which skips the first 30 days and the last 1 day of the timeline:

{
  "dataSource": "my_datasource",
  "skipOffsetFromEarliest": "P30D",
  "skipOffsetFromLatest": "P1D"
}

Implementation Details

  • DataSourceCompactionConfig interface: Added getSkipOffsetFromEarliest() with default Period.ZERO
  • InlineSchemaDataSourceCompactionConfig & CatalogDataSourceCompactionConfig: Implemented field with JSON serialization
  • DataSourceCompactibleSegmentIterator: Added computeEarliestSkipInterval() method (mirror of computeLatestSkipInterval())
  • Skip interval merging: Updated sortAndAddSkipIntervals() to handle both earliest and latest skip intervals
  • CascadingReindexingTemplate: Added support with validation to ensure mutual exclusivity of skip offset strategies
  • Backward compatibility: New field is optional and defaults to Period.ZERO (no skipping), ensuring no breaking changes
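For reviewers unfamiliar with the skip-interval handling, the following is a rough sketch of the general idea behind combining the earliest and latest skip intervals with any other skip intervals: sort by start, then coalesce overlaps. It is an assumption about the technique, not a copy of sortAndAddSkipIntervals(); the class name is hypothetical and intervals are modelled as [start, end) millisecond pairs instead of Joda-Time Interval objects.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; the PR's sortAndAddSkipIntervals() may differ.
final class SkipIntervalMergeSketch {

    /**
     * Sorts [start, end) pairs by start (then end) and merges any that
     * overlap or abut, returning a new list and leaving the input untouched.
     */
    static List<long[]> sortAndMerge(List<long[]> intervals) {
        List<long[]> sorted = new ArrayList<>(intervals);
        sorted.sort(Comparator.comparingLong((long[] i) -> i[0])
                              .thenComparingLong(i -> i[1]));

        List<long[]> merged = new ArrayList<>();
        for (long[] current : sorted) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && current[0] <= last[1]) {
                // Overlaps or touches the previous interval: extend it.
                last[1] = Math.max(last[1], current[1]);
            } else {
                // Copy so the caller's arrays are never mutated.
                merged.add(new long[] {current[0], current[1]});
            }
        }
        return merged;
    }
}
```

For example, passing an earliest skip interval of [0, 30), a user-supplied skip interval of [25, 40), and a latest skip interval of [100, 110) yields two intervals: [0, 40) and [100, 110).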

Release note

Added skipOffsetFromEarliest compaction configuration parameter to skip re-compacting old historical data. This complements the existing skipOffsetFromLatest parameter and is useful when experimenting with partition dimensions or shard spec changes on large tables where you want to apply changes only to recent data.


Key changed/added classes in this PR
  • DataSourceCompactionConfig
  • InlineSchemaDataSourceCompactionConfig
  • CatalogDataSourceCompactionConfig
  • DataSourceCompactibleSegmentIterator
  • CascadingReindexingTemplate
  • DataSourceCompactibleSegmentIteratorSkipOffsetTest

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

This commit adds a skipOffsetFromEarliest parameter to the compaction
configuration, providing the inverse functionality of skipOffsetFromLatest.
This allows users to skip re-compacting older data while experimenting
with changes to recent data.

Changes include:
- Add skipOffsetFromEarliest field to DataSourceCompactionConfig interface
  with default value of Period.ZERO (no skipping)
- Implement in InlineSchemaDataSourceCompactionConfig and
  CatalogDataSourceCompactionConfig with JSON serialization support
- Add computeEarliestSkipInterval() method to DataSourceCompactibleSegmentIterator
  for computing skip interval from earliest timestamp
- Update sortAndAddSkipIntervals() to handle both earliest and latest skip offsets
- Support in CascadingReindexingTemplate with validation to ensure mutual
  exclusivity of skipOffsetFromNow, skipOffsetFromLatest, and skipOffsetFromEarliest
- Add unit tests in DataSourceCompactibleSegmentIteratorSkipOffsetTest

Ashwin Tumma added 4 commits June 29, 2026 05:16

Update test files to include the new skipOffsetFromEarliest parameter
in CatalogDataSourceCompactionConfig constructor calls:
- CatalogDataSourceCompactionConfigTest: Add null for skipOffsetFromEarliest
- CompactSegmentsTest: Add null for skipOffsetFromEarliest

Add skipOffsetFromEarliest parameter (null) to all CascadingReindexingTemplate
constructor calls in test file. The new parameter goes between skipOffsetFromLatest
and skipOffsetFromNow parameters.

Add missing skipOffsetFromEarliest parameter to exception test constructors:
- test_constructor_setBothSkipOffsetStrategiesThrowsException
- test_constructor_nullDataSourceThrowsException
- test_constructor_nullRuleProviderThrowsException
- test_constructor_nullDefaultSegmentGranularityThrowsException
- test_constructor_tuningConfigWithPartitionsSpecThrowsException

Also update assertion for setBothSkipOffsetStrategiesThrowsException to match
the new error message format.
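
The mutual-exclusivity validation mentioned in these commits could look roughly like the sketch below. Everything here is an assumption for illustration: the class name, the exception type, and the error message wording are hypothetical and are not taken from CascadingReindexingTemplate, and java.time.Period stands in for the Joda-Time Period that Druid uses.

```java
import java.time.Period;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the PR's CascadingReindexingTemplate code.
final class SkipOffsetValidationSketch {

    /**
     * Rejects configurations that set more than one of the three skip offset
     * strategies. A null argument means "not set".
     */
    static void validateAtMostOne(Period fromLatest, Period fromEarliest, Period fromNow) {
        List<String> configured = new ArrayList<>();
        if (fromLatest != null) {
            configured.add("skipOffsetFromLatest");
        }
        if (fromEarliest != null) {
            configured.add("skipOffsetFromEarliest");
        }
        if (fromNow != null) {
            configured.add("skipOffsetFromNow");
        }
        if (configured.size() > 1) {
            // Message format is made up for this sketch.
            throw new IllegalArgumentException(
                "Only one skip offset strategy may be set, but found: " + configured);
        }
    }
}
```

Collecting the names of the configured fields before throwing keeps the error message specific about which combination was rejected, which is the kind of detail the updated test assertion would check.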