hyper-parquet: Fix caching of Parquet files and add single-file variant #969

Closed
caetanosauer wants to merge 4 commits into ClickHouse:main from caetanosauer:hyper-parquet-persistent-external-plancache

@caetanosauer caetanosauer commented Jul 1, 2026

The hyper-parquet harness created a TEMP external table on every ./query
connection. With the persistent-server model (issue #936) the driver
opens a fresh connection per try, so the table was recreated each time.
This wiped our plan cache and forced the data to be re-sampled on every
hot run as well. To properly test a cached scenario, this commit changes
the harness to create persistent external tables and keep a database
attached for the server's lifetime. The query iterations themselves
still open a fresh connection each time.
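
The scoping problem described above can be illustrated with a small sketch. It uses Python's built-in `sqlite3` purely as a stand-in (it is not Hyper, and the table names `hits_temp` and `hits_persistent` are made up for illustration): a TEMP table belongs to the connection that created it, so a fresh connection per try never sees it, while a persistent table survives across connections.

```python
import os
import sqlite3
import tempfile


def table_visible(db_path: str, table: str) -> bool:
    """Open a fresh connection and report whether `table` can be queried."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(f"SELECT count(*) FROM {table}")
        return True
    except sqlite3.OperationalError:
        # Raised as "no such table" when the table is not visible here.
        return False
    finally:
        conn.close()


def demo() -> dict:
    """Create one TEMP and one persistent table, then look from a new connection."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")
        setup = sqlite3.connect(db_path)
        # Hypothetical table names; TEMP tables are scoped to this connection.
        setup.execute("CREATE TEMP TABLE hits_temp (id INTEGER)")
        setup.execute("CREATE TABLE hits_persistent (id INTEGER)")
        setup.commit()
        setup.close()
        return {
            "temp": table_visible(db_path, "hits_temp"),
            "persistent": table_visible(db_path, "hits_persistent"),
        }


if __name__ == "__main__":
    print(demo())  # {'temp': False, 'persistent': True}
```

Any state keyed on the table object (such as a plan cache, in Hyper's case) would plausibly be lost along with the TEMP table, which is the motivation for switching to persistent external tables.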

We also introduce the single variant for the parquet lane, now
reporting both hyper-parquet-single and hyper-parquet-partitioned.

As a drive-by change, we also drop the ::bigint casts from ClickBench Q29,
since they are no longer needed. This also matches the queries used by
other systems.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

caetanosauer and others added 3 commits June 23, 2026 13:35
Fixes ClickHouse#936. The shared driver
(lib/benchmark-common.sh) calls ./query once per try and, for
daemon-backed systems, keeps the server alive across tries so tries
2..N measure hot execution. Hyper's ./query instead opened a brand-new
HyperProcess on every call, so each "hot" try hit an empty buffer pool
against a just-cache-dropped file: every reported hot time was actually
cold.

Convert both hyper/ and hyper-parquet/ to the client-server model the
framework expects (mirroring umbra/):

  - start: background a supervisor that opens one long-lived hyperd and
    publishes its connection descriptor to server.endpoint. In hyper/ it
    also holds a keep-alive connection to hits.hyper so the buffer pool
    isn't torn down when each per-try ./query process exits (Hyper
    detaches a .hyper DB when its last connection closes).
  - stop: SIGTERM the supervisor (cleanly shutting down hyperd) and wait
    for it to fully exit so drop_caches isn't defeated by pinned mmap
    pages.
  - check / query / load: reconnect to the persistent server via its
    descriptor instead of spawning their own HyperProcess. Loading
    through the same server also avoids briefly running two hyperd
    instances (each claiming ~80% RAM) during the heavy COPY.
  - benchmark.sh: BENCH_RESTARTABLE=yes (there is now a real daemon
    whose lifecycle matters) and drop the BENCH_CONCURRENT_DURATION=0
    override, re-enabling the concurrent-QPS test.
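
The start / stop / reconnect split in the list above can be sketched roughly as follows. This is a minimal, stdlib-only skeleton and not the harness's actual code: the helper names (`publish_endpoint`, `read_endpoint`, `supervise`, `run_until_sigterm`) and the descriptor string are hypothetical, and the place where the real supervisor would hold the long-lived hyperd open is only marked by a comment.

```python
import os
import signal
import tempfile
import threading
from pathlib import Path


def publish_endpoint(path: Path, descriptor: str) -> None:
    """Write via a temp file plus rename so readers never see a partial file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(descriptor + "\n")
    os.replace(tmp, path)


def read_endpoint(path: Path) -> str:
    """What check / query / load would do instead of spawning their own server."""
    return path.read_text().strip()


def supervise(endpoint_file: Path, descriptor: str, stop: threading.Event) -> None:
    """Publish the descriptor, block until asked to stop, then clean up."""
    publish_endpoint(endpoint_file, descriptor)
    try:
        stop.wait()  # a real supervisor would keep the server process alive here
    finally:
        endpoint_file.unlink(missing_ok=True)


def run_until_sigterm(endpoint_file: Path, descriptor: str) -> None:
    """Entry point for a start script: SIGTERM from stop triggers a clean exit."""
    stop = threading.Event()
    signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())
    supervise(endpoint_file, descriptor, stop)


if __name__ == "__main__":
    # Non-blocking demo: publish, read back, then stop immediately.
    with tempfile.TemporaryDirectory() as tmp:
        endpoint = Path(tmp) / "server.endpoint"
        publish_endpoint(endpoint, "placeholder-descriptor")
        print(read_endpoint(endpoint))
        stop = threading.Event()
        stop.set()
        supervise(endpoint, "placeholder-descriptor", stop)
        print(endpoint.exists())
```

In this sketch a stop script would send SIGTERM to the supervisor's PID and then wait for the process to exit before dropping caches, matching the ordering the list describes.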

Net effect: the driver's cold cycle (stop -> wait -> drop_caches ->
start) gives an honest cold try 1, and tries 2..N hit the warm server
= genuinely hot.
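
As a toy model of the reasoning above (not the real driver; the `Server` class and both function names are invented for illustration), the difference between a server per ./query call and one persistent server across tries looks like this:

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Stand-in for a database server whose cache starts empty on every start."""
    cached: set = field(default_factory=set)

    def run(self, query: str) -> str:
        state = "hot" if query in self.cached else "cold"
        self.cached.add(query)  # the first execution warms the cache
        return state


def tries_with_server_per_call(query: str, n: int) -> list:
    """Old behaviour: every try starts a brand-new server, so nothing is cached."""
    return [Server().run(query) for _ in range(n)]


def tries_with_persistent_server(query: str, n: int) -> list:
    """New behaviour: one server is started after the cold cycle and reused."""
    server = Server()
    return [server.run(query) for _ in range(n)]


if __name__ == "__main__":
    print(tries_with_server_per_call("Q1", 3))    # ['cold', 'cold', 'cold']
    print(tries_with_persistent_server("Q1", 3))  # ['cold', 'hot', 'hot']
```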

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@caetanosauer caetanosauer changed the title from "Hyper: persistent server for honest hot runs; split parquet lane into single/partitioned" to "hyper-parquet: Fix caching of Parquet files and add single-file variant" Jul 1, 2026
@caetanosauer caetanosauer force-pushed the hyper-parquet-persistent-external-plancache branch from 341179e to 1bc7035 Compare July 1, 2026 14:20
@caetanosauer

Depends on #955 (persistent-server / issue #936). This PR is stacked on that branch, so until #955 merges the diff here also shows the persistent-server changes. It will collapse to just the parquet-caching work once #955 lands and this branch is rebased onto main.
