Cap refresh start when hypertable has tiered data#9811
@melihmutlu, @natalya-aksman: please review this pull request.
Currently, a refresh also processes the ranges before the earliest chunk in the hypertable. This can lead to missing data when tiered data is present but tiered reads are disabled during the refresh. By capping the refresh range at the start of the earliest chunk, invalidations before the earliest chunk are left in place instead of being processed and removed. Thus, a subsequent refresh with tiered reads enabled will materialize that data in the CAgg. We only need to do this when the hypertable has tiered data. Any new tiered data will either have been processed before tiering (since it existed in some chunk) or will be inserted after this refresh (which will generate invalidations).
Capping is done incorrectly when the OSM chunk is the earliest chunk.
Incremental batch generation would generate batches considering tiered data ranges, even if tiered reads are disabled. This can lead to invalid batches since we cap the refresh start to the earliest chunk in the hypertable during the refresh.
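To illustrate the batching issue, here is a minimal sketch of batch generation that applies the cap before splitting the range. It is not the TimescaleDB implementation: `Batch`, `generate_batches`, and all parameters are hypothetical names, the ranges are half-open `[start, end)`, and batches are produced oldest first purely for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open batch range [start, end). */
typedef struct Batch
{
	int64_t start;
	int64_t end;
} Batch;

/*
 * Split [start, end) into batches of at most batch_width, after raising
 * start to capped_start. Returns the number of batches written to out.
 */
static size_t
generate_batches(int64_t start, int64_t end, int64_t capped_start, int64_t batch_width,
				 Batch *out, size_t max_out)
{
	assert(batch_width > 0);

	/* Apply the cap first so no batch covers a range the refresh will skip. */
	if (start < capped_start)
		start = capped_start;

	size_t n = 0;
	while (start < end && n < max_out)
	{
		/* Compare against the remaining length to avoid overflowing start + width. */
		int64_t batch_end = (end - start > batch_width) ? start + batch_width : end;

		out[n].start = start;
		out[n].end = batch_end;
		n++;
		start = batch_end;
	}
	return n;
}
```

With a range of `[0, 100)`, a cap of 40, and a batch width of 25, this yields three batches starting at 40; without the cap, the first batch would start at 0 and cover a range the capped refresh never processes.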
    ----------------------------------------------------------------------
    -- Test 5: When the OSM chunk's range is updated to precede the
    -- earliest real chunk, the wrong dimension slice is picked up
    -- and the refresh is not capped correctly.
    ----------------------------------------------------------------------
The comment needs an update, since the refresh is now capped correctly in this case.
    *slice = dimension_slice_from_slot(ti->slot);
    MemoryContextSwitchTo(old);
    Chunk *chunk = ts_chunk_get_by_id((*slice)->fd.chunk_id, true);

    if (IS_OSM_CHUNK(chunk))
    {
        return SCAN_CONTINUE;
    }
*slice is assigned here whether or not the chunk is an OSM chunk. Consider the case where there is no non-OSM chunk and all data is tiered: the scan would end with *slice already assigned. The caller assumes the slice belongs to a non-OSM chunk as long as it is not NULL, which does not hold in this case.
We should either check again in the caller, or set *slice to NULL in the IS_OSM_CHUNK branch here.
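A minimal sketch of the second option, resetting the output pointer in the OSM branch, might look like this. The `Slice` struct, the `ScanResult` enum, and both function names are simplified stand-ins invented for illustration, not the TimescaleDB types or scanner API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a dimension slice; not the TimescaleDB struct. */
typedef struct Slice
{
	int64_t range_start;
	bool is_osm; /* true if the slice belongs to the OSM (tiered) chunk */
} Slice;

/* Hypothetical scan result codes for this sketch. */
typedef enum
{
	SCAN_DONE,
	SCAN_CONTINUE
} ScanResult;

/* Per-tuple callback: never leave an OSM slice in the output pointer. */
static ScanResult
earliest_slice_found(const Slice *candidate, const Slice **slice)
{
	assert(candidate != NULL && slice != NULL);

	if (candidate->is_osm)
	{
		/* Reset so the caller's "not NULL means non-OSM" assumption holds. */
		*slice = NULL;
		return SCAN_CONTINUE;
	}

	*slice = candidate;
	return SCAN_DONE;
}

/* Walk slices ordered by range_start; return the first non-OSM one or NULL. */
static const Slice *
find_earliest_non_osm_slice(const Slice *slices, size_t n)
{
	const Slice *result = NULL;

	for (size_t i = 0; i < n; i++)
	{
		if (earliest_slice_found(&slices[i], &result) == SCAN_DONE)
			break;
	}
	return result;
}
```

When every slice is an OSM slice, the scan finishes with `result` still NULL, so the caller does not need a second OSM check.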
    int64 earliest_start = invalidation_get_earliest_chunk_start(cagg->data.raw_hypertable_id);
    if (earliest_start != INVAL_NEG_INFINITY)
    {
        Invalidation boundary = { .lowest_modified_value = earliest_start,
                                  .greatest_modified_value = earliest_start };
        invalidation_expand_to_bucket_boundaries(&boundary,
                                                 cagg->partition_type,
                                                 cagg->bucket_function);
        earliest_start = boundary.greatest_modified_value;
    }
I understand why we want to ignore anything before the value initially returned by invalidation_get_earliest_chunk_start. But why do we move earliest_start to the end of its bucket, which is a later value?
I feel that skipping the whole bucket as if it were not invalidated may trigger a rewrite to use the CAgg when that specific bucket is actually stale in the CAgg.
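For concreteness, the arithmetic in question can be sketched with fixed-width integer buckets. This does not reproduce invalidation_expand_to_bucket_boundaries; the helper names are hypothetical, bucket ends are treated as exclusive, and overflow near the int64 limits is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Start of the fixed-width bucket containing t (floors toward negative infinity). */
static int64_t
bucket_start(int64_t t, int64_t width)
{
	assert(width > 0);

	int64_t rem = t % width;
	if (rem < 0)
		rem += width; /* C's % truncates toward zero, so adjust for negative t */
	return t - rem;
}

/* Exclusive end of the bucket containing t. */
static int64_t
bucket_end(int64_t t, int64_t width)
{
	return bucket_start(t, width) + width;
}

/* First bucket boundary at or after t; t itself when it is already aligned. */
static int64_t
round_up_to_bucket(int64_t t, int64_t width)
{
	int64_t start = bucket_start(t, width);
	return (start == t) ? t : start + width;
}
```

With a bucket width of 10 and an earliest chunk start of 25, the containing bucket is `[20, 30)`. Capping at the bucket start (20) keeps that bucket in the refresh, while capping at the bucket end (30) skips it entirely, which is the case the question above is about.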
Currently, a refresh also processes the ranges before the earliest chunk in the hypertable. This can lead to missing data when tiered data is present but tiered reads are disabled during the refresh.
By capping the refresh range at the start of the earliest chunk, invalidations before the earliest chunk are left in place instead of being processed and removed. Thus, a subsequent refresh with tiered reads enabled will materialize that data in the CAgg.
We only need to do this when the hypertable has tiered data. Any new tiered data will either have been processed before tiering (since it existed in some chunk) or will be inserted after this refresh (which will generate invalidations).
We also cap the refresh start when generating batches for an incremental refresh.