Scaling ClickHouse on Heterogeneous Clusters (CPU + GPU + NVLink)
Accelerate ClickHouse with NVLink GPUs and RISC‑V hosts. Practical guide: query placement, memory tuning, Kubernetes orchestration, and hybrid execution.
Why heterogeneous clusters matter for your ClickHouse workloads in 2026
If high query latency, exploding cloud bills, and fragmented toolchains are slowing analytics, combining ClickHouse with NVLink-connected GPU nodes and RISC‑V hosts is one of the most practical architectures to consider in 2026. Recent industry moves — ClickHouse’s major fundraise and ecosystem growth (Bloomberg, Jan 2026) and SiFive's NVLink Fusion roadmap for RISC‑V (Forbes, late 2025–2026) — make hybrid CPU+GPU clusters feasible and attractive for production OLAP at scale.
Executive summary — what you’ll get from this guide
This article is a hands-on guide for technology teams and platform engineers. You will learn:
- Architectures that combine ClickHouse storage and routing with NVLink-attached GPU accelerators and RISC‑V CPU nodes.
- Query placement patterns to push the right work to GPUs and keep ClickHouse fast and predictable.
- Memory management and NVLink-specific tuning to avoid bottlenecks and costly spills to disk.
- Hybrid execution models — from simple GPU microservices to tightly coupled GPU-backed operators.
- Operational recipes for Kubernetes, CI/CD, and observability to operate safely at scale.
Context: Why this matters now (2026 trends)
By 2026, GPU acceleration has moved beyond model training into general-purpose data processing. NVLink and NVLink Fusion reduce GPU-to-GPU and GPU-to-CPU transfer latency significantly compared with PCIe, enabling novel architectures where GPUs behave like shared memory accelerators inside the cluster. At the same time, RISC‑V silicon is starting to support tighter GPU integration, opening options for custom, lower-cost CPU+GPU nodes (see recent SiFive/NVIDIA NVLink Fusion announcements reported by Forbes).
These platform trends let you colocate high-bandwidth GPU compute with ClickHouse storage or run dedicated GPU executors that communicate over NVLink for near-zero copy transfers.
High-level architectures — patterns that work
Pattern A — ClickHouse CPU cluster + GPU compute farm (recommended initially)
Keep ClickHouse nodes on CPU-optimized servers (including RISC‑V where supported) and put GPU-heavy data transforms into a separate fleet of NVLink-connected GPU nodes. Use a lightweight router that decides, per-query, whether to execute on the CPU cluster or submit sub-queries to the GPU farm.
- Pros: Lower operational risk; easier to deploy on existing ClickHouse clusters.
- Cons: Extra network hops; requires robust inter-service protocols (gRPC/HTTP).
Pattern B — Heterogeneous ClickHouse nodes with local GPUs
Run mixed nodes: some ClickHouse replicas have local NVLink-connected GPUs and expose an API for GPU-accelerated functions. ClickHouse stores hot data locally on nodes with GPUs and routes heavy queries to them via sharding or weight-based routing.
- Pros: Lower data movement; simpler query routing for sharded data.
- Cons: More complex capacity planning and scheduling; requires ClickHouse-aware GPU drivers or external execution hooks.
Pattern C — Tightly coupled NVLink clusters (advanced)
With NVLink Fusion and NVLink fabrics, build machines where GPUs across nodes share high-bandwidth links. This is advanced and hardware-specific but gives the lowest latency and best scalability for fused query plans that mix CPU and GPU operators.
Query placement: rules, heuristics, and implementation
The core decision is which operations actually benefit from GPU acceleration. Use pragmatic heuristics and instrument them.
Which operators to offload
- Great candidates: large aggregations, group-bys with high cardinality (when using GPU radix/hash strategies), sorts on large datasets, vectorized compression/decompression, and wide-column numeric transforms (e.g., encoding/decoding, SIMD-style math).
- Poor candidates: low-row-count lookups, heavy string-processing with many small allocations, and queries dominated by random IO (where GPU won’t help).
Heuristics for placement
- Estimate input size (rows, bytes). If input > threshold (e.g., 10M rows or 10GB) and operator is GPU-friendly, consider offload.
- Estimate expected memory pressure. If combined memory usage fits NVLink GPU pool without spill, choose GPU.
- Respect latency SLOs: prefer CPU for sub-second interactive queries; prefer GPU for throughput or batch processing.
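To make these heuristics concrete, here is a minimal placement sketch in Python; the thresholds, field names, and operator categories are illustrative assumptions, not part of any ClickHouse API, and should be tuned against your own measurements.
# Hypothetical placement heuristic; all thresholds and names are assumptions.
from dataclasses import dataclass

GPU_FRIENDLY_OPS = {"group_by", "sort", "numeric_transform"}
ROW_THRESHOLD = 10_000_000        # ~10M rows
BYTE_THRESHOLD = 10 * 1024**3     # ~10 GB
GPU_POOL_BYTES = 60 * 1024**3     # usable GPU execution pool

@dataclass
class QueryEstimate:
    operator: str          # dominant operator kind
    est_rows: int          # estimated input rows
    est_bytes: int         # estimated input bytes
    est_gpu_mem: int       # estimated GPU working set in bytes
    latency_slo_ms: int    # interactive queries carry a tight SLO

def should_offload(q: QueryEstimate) -> bool:
    """Return True if the query is worth sending to the GPU farm."""
    if q.operator not in GPU_FRIENDLY_OPS:
        return False
    if q.latency_slo_ms < 1000:    # keep sub-second interactive queries on CPU
        return False
    big_enough = q.est_rows > ROW_THRESHOLD or q.est_bytes > BYTE_THRESHOLD
    fits_on_gpu = q.est_gpu_mem < GPU_POOL_BYTES
    return big_enough and fits_on_gpu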
Implementation options
- Query router: A small service that parses incoming SQL (or uses metadata tags) and routes to ClickHouse CPU cluster or to a GPU executor. Works well with existing ClickHouse clusters and is language-agnostic.
- External table functions: Define a GPU-backed table function that ClickHouse can call. This keeps SQL semantics but pushes execution out-of-process.
- Materialized views + async workers: Materialize intermediate datasets into GPU-friendly formats and run scheduled GPU jobs for heavy aggregations.
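As a sketch of the query-router option, the snippet below dispatches a query either to ClickHouse's HTTP interface (default port 8123) or to a hypothetical GPU executor endpoint; the executor URL and its JSON API are assumptions, and the placement decision could come from a heuristic like should_offload above.
# Minimal router sketch; the GPU executor service and its API are assumptions.
import requests

CLICKHOUSE_URL = "http://clickhouse.internal:8123/"
GPU_EXECUTOR_URL = "http://gpu-exec.internal:9090/run"   # hypothetical service

def run_query(sql: str, offload: bool, timeout_s: int = 300) -> str:
    """Send the query to the CPU cluster or to the GPU executor."""
    if offload:
        resp = requests.post(GPU_EXECUTOR_URL, json={"sql": sql}, timeout=timeout_s)
    else:
        resp = requests.post(CLICKHOUSE_URL, data=sql.encode(), timeout=timeout_s)
    resp.raise_for_status()
    return resp.text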
Memory management: NVLink and GPU considerations
Memory is the most frequent root cause of instability in hybrid clusters. NVLink reduces transfer cost but does not remove capacity limits. Design for predictable memory use.
Key concepts
- Pinned vs pageable memory: Use pinned host memory for fast DMA to GPU over NVLink/PCIe; avoid repeated pin/unpin churn.
- Zero-copy and Unified Virtual Addressing (UVA): When supported by the stack (CUDA + RDMA + NVLink Fusion), prefer zero-copy transfers to avoid host-side duplication.
- GPUDirect RDMA: For cross-node NVLink/RDMA fabrics, enable GPUDirect to bypass host copies when possible.
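A minimal sketch of the pinned-memory point, assuming CuPy handles the host-to-device transfers; the buffer size is illustrative. Allocating the page-locked staging buffer once and reusing it per batch avoids pin/unpin churn and lets copies overlap with compute on a non-blocking stream.
# Reusable pinned host buffer (CuPy/NumPy); sizes are illustrative.
import numpy as np
import cupy as cp

def pinned_empty(shape, dtype=np.float64) -> np.ndarray:
    """Allocate a page-locked (pinned) host array backed by CUDA pinned memory."""
    count = int(np.prod(shape))
    mem = cp.cuda.alloc_pinned_memory(count * np.dtype(dtype).itemsize)
    return np.frombuffer(mem, dtype=dtype, count=count).reshape(shape)

staging = pinned_empty((1_000_000,), np.float64)   # allocate once, reuse per batch
device_buf = cp.empty_like(staging)
stream = cp.cuda.Stream(non_blocking=True)
with stream:
    device_buf.set(staging)    # H2D copy; pinned memory allows true async DMA
stream.synchronize()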
ClickHouse-specific settings
Tune these query-level settings in users.xml profiles (or per query via the SETTINGS clause):
<!-- Example ClickHouse memory settings -->
<max_memory_usage>100000000000</max_memory_usage>
<max_bytes_before_external_group_by>5000000000</max_bytes_before_external_group_by>
<max_memory_usage_for_user>120000000000</max_memory_usage_for_user>
Notes:
- Set max_memory_usage so ClickHouse won't OOM host processes; prefer conservative values when GPU offload is active.
- max_bytes_before_external_group_by controls spill-to-disk behavior; increase for GPUs if your GPU execution engine can handle larger in-memory aggregation, but ensure host swap won’t kick in.
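Because these are query-level settings, they can also be overridden per query. A minimal sketch using the ClickHouse HTTP interface, where settings are passed as URL parameters (host and values are illustrative):
# Per-query override sketch: settings go as URL parameters on the HTTP interface.
import requests

def run_with_conservative_memory(sql: str) -> str:
    params = {
        "max_memory_usage": str(50 * 10**9),                     # 50 GB per query
        "max_bytes_before_external_group_by": str(25 * 10**9),   # spill earlier
    }
    resp = requests.post("http://clickhouse.internal:8123/",
                         params=params, data=sql.encode(), timeout=600)
    resp.raise_for_status()
    return resp.text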
GPU executor memory sizing
- Measure typical operator memory (histogram across queries).
- Reserve a GPU execution pool (e.g., 70–80% of GPU memory) for queries and keep the rest for model or filesystem caches.
- Implement preflight checks: when a query’s estimated GPU memory > available pool, degrade to CPU or split into smaller tasks.
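A preflight check can be as simple as querying free GPU memory through NVML before admitting a query; the pool fraction and the split threshold below are illustrative assumptions.
# Preflight admission sketch using NVML (pynvml); thresholds are assumptions.
import pynvml

POOL_FRACTION = 0.75    # keep ~25% of GPU memory for caches and overhead

def gpu_admission(estimated_bytes: int, device_index: int = 0) -> str:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        free_bytes = pynvml.nvmlDeviceGetMemoryInfo(handle).free
    finally:
        pynvml.nvmlShutdown()
    budget = int(free_bytes * POOL_FRACTION)
    if estimated_bytes <= budget:
        return "gpu"
    if estimated_bytes <= 4 * budget:
        return "gpu_split"    # break the job into smaller tasks that fit the pool
    return "cpu"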
Hybrid execution models — practical recipes
Recipe 1 — Offload aggregator microservice (fastest to implement)
1) Implement a microservice in C++/CUDA or Python+RAPIDS that exposes an API for group-by/aggregation operations.
2) Create a ClickHouse table function that calls the microservice with serialized chunks.
3) Stream results back into ClickHouse for the final merge.
-- Pseudocode SQL flow:
SELECT * FROM gpu_agg_table_function(
'SELECT user_id, SUM(amount) AS total FROM source_table WHERE dt=\'2026-01-01\' GROUP BY user_id'
);
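On the GPU side of Recipe 1, the aggregation core might look like the sketch below, assuming chunks are exchanged as Arrow IPC streams and the columns are named user_id and amount; the surrounding service framework (gRPC, HTTP, batching) is left out.
# GPU-side aggregation core sketch (RAPIDS cuDF + Arrow IPC); names are assumptions.
import pyarrow as pa
import pyarrow.ipc as ipc
import cudf

def gpu_group_sum(arrow_ipc_bytes: bytes) -> bytes:
    """Aggregate one serialized chunk on the GPU and return Arrow IPC bytes."""
    table = ipc.open_stream(pa.BufferReader(arrow_ipc_bytes)).read_all()
    gdf = cudf.DataFrame.from_arrow(table)                    # host -> GPU copy
    agg = gdf.groupby("user_id", as_index=False)["amount"].sum()
    result = agg.to_arrow()
    sink = pa.BufferOutputStream()
    with ipc.new_stream(sink, result.schema) as writer:
        writer.write_table(result)
    return sink.getvalue().to_pybytes()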
Recipe 2 — Pushdown via materialized views
Use materialized views to prepare data in columnar and GPU-friendly formats (e.g., arrow/parquet). A scheduled GPU worker picks these up, processes, and writes back aggregated results to a ClickHouse table.
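A minimal sketch of such a scheduled worker, assuming the materialized view lands Parquet partitions under a hypothetical /staging path and results are written back as Parquet for ClickHouse to ingest (via an S3/Parquet-backed table or a plain INSERT):
# Scheduled GPU worker sketch for Recipe 2; paths and column names are assumptions.
import cudf

def process_partition(date: str) -> None:
    df = cudf.read_parquet(f"/staging/source_table/dt={date}.parquet")
    agg = df.groupby("user_id", as_index=False)["amount"].sum()
    agg = agg.rename(columns={"amount": "total"})
    agg.to_parquet(f"/results/daily_totals/dt={date}.parquet")

process_partition("2026-01-01")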
Recipe 3 — Co-located GPU UDFs (advanced)
Embed GPU-accelerated UDFs as external functions or shared libraries that ClickHouse calls via the external function API. This reduces serialization overhead but requires careful ABI and memory management.
Orchestration, containers, and Kubernetes patterns
For production, Kubernetes is the de facto control plane. Below are practical operational patterns.
Node labeling and scheduling
# Example Pod spec snippet
nodeSelector:
  node.kubernetes.io/gpu: "true"
resources:
  limits:
    nvidia.com/gpu: 1
Use Node Feature Discovery (NFD) to expose NVLink presence and bandwidth as node labels (for example, a custom feature label such as feature.node.kubernetes.io/nvlink: "true"). Combine with taints/tolerations for workload segregation.
GPU Operator and device plugins
Use NVIDIA GPU Operator (or equivalent for other vendors) to manage drivers, device plugins, and monitoring. For NVLink features, confirm operator support for GPUDirect and peer-to-peer visibility. Also integrate driver and device patching into your CI/CD process (example approaches discussed in virtual patching guides) to reduce emergency rollouts.
SR-IOV / RDMA for NVLink fabric
If your cluster uses RDMA or RoCE to extend NVLink-like fabrics between nodes, configure SR-IOV and ensure CNI supports high-performance networking. The goal is to avoid host-level TCP copy paths when large transfers are frequent.
CI/CD and GitOps for schema and operator changes
- Store ClickHouse configs (macros, clusters, users) in Git and manage with ArgoCD/Flux.
- Deploy GPU executor images and ClickHouse schema migrations via the same pipeline.
- Use canary deploys for GPU execution changes — a bad GPU kernel can blow up several nodes quickly.
Monitoring, observability, and SLOs
You cannot tune what you don't measure. Track these metrics:
- ClickHouse: query latency P50/P95/P99, memory usage, merges, disk spills, distributed query shuffle sizes.
- GPU nodes: GPU memory utilization, PCIe/NVLink bandwidth, CUDA kernel time, queue wait times.
- Network: RDMA throughput, packet drops, link saturation.
Instrument with Prometheus exporters (ClickHouse exporter, NVIDIA DCGM exporter). Create composite SLI dashboards that show end-to-end query time and which stage (ClickHouse CPU, network, GPU compute) dominated the latency. Forensics and operational evidence capture patterns for edge fabrics are covered in operational playbooks like evidence capture and preservation.
Operational playbook — step-by-step rollout
- Start with a non-production replica of your ClickHouse cluster.
- Deploy a small GPU farm (2–4 nodes) with NVLink and NVIDIA Operator or vendor equivalent.
- Implement a simple aggregator microservice and a ClickHouse table function for offload; run shared test workloads and compare throughput and cost.
- Measure end-to-end: CPU time saved, GPU time consumed, network transfer cost, and cloud cost delta (audit and cost reduction tips in guides on how to audit hidden costs).
- Gradually expand supported query types and introduce preflight estimators in the router to avoid overcommitting GPUs (see edge migration patterns in edge migration playbooks).
- Roll out to production via GitOps with automated rollback on SLA regressions.
Performance tuning checklist
- Tune ClickHouse user and query profiles for conservative memory usage when offload is enabled.
- Use batch sizes and chunking to match GPU kernel occupancy (avoid many tiny batches; see the chunking sketch after this list).
- Enable pinned memory and GPUDirect where supported to reduce CPU-GPU copy cost.
- Set sensible timeouts and queue limits in GPU executor to prevent long-tail queries from hogging resources.
- Tune Kubernetes PodDisruptionBudgets and priorityClass to ensure GPU nodes remain stable under maintenance.
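For the batch-size item above, a small chunking helper over Arrow data keeps GPU kernels busy without creating many tiny batches; the target batch size is an illustrative assumption.
# Chunking sketch: split an Arrow table into ~256 MB record batches.
import pyarrow as pa

TARGET_BATCH_BYTES = 256 * 1024**2    # tune per GPU and kernel

def arrow_batches(table: pa.Table):
    """Yield record batches of roughly TARGET_BATCH_BYTES each."""
    if table.num_rows == 0:
        return
    avg_row_bytes = max(1, table.nbytes // table.num_rows)
    rows_per_batch = max(1, TARGET_BATCH_BYTES // avg_row_bytes)
    yield from table.to_batches(max_chunksize=rows_per_batch)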
Case study (hypothetical, based on 2026 trends)
A fintech platform ran heavy daily aggregations in ClickHouse. After deploying a 4-node NVLink GPU pool and offloading the largest GROUP BY queries to a RAPIDS-based microservice, they saw:
- 3–5x faster batch aggregation runs
- 40% reduction in ClickHouse cluster CPU costs
- Zero disk-spill events during peak consolidations thanks to GPU memory headroom
Key enablers were NVLink transfer tuning, preflight memory estimation, and conservative query routing for interactive user queries.
Risks and mitigations
- Risk: GPU hotspots and queueing. Mitigation: admission control and preflight checks.
- Risk: Vendor-specific NVLink features create lock-in. Mitigation: choose modular offload services and rely on open exchange formats (Arrow/Parquet).
- Risk: Complexity of mixed CPU architectures (RISC‑V vs x86). Mitigation: invest in portable container builds and CI tests across CPU architectures.
Future-proofing & predictions for 2026–2028
Expect three parallel trends:
- 1) Tighter hardware integration. NVLink Fusion and RISC‑V integration (SiFive + NVIDIA roadmap) will lower host CPU overhead and simplify P2P GPU transfers.
- 2) Ecosystem tooling. More mature operators and ClickHouse extensions for GPU offload will appear as commercial vendors invest in acceleration for analytics (ClickHouse’s growth is funded and accelerating — Bloomberg, Jan 2026).
- 3) Standardized execution formats. Arrow, UCX, and MOVEDATA-like protocols for zero-copy transfer between ClickHouse and GPU frameworks will be common by 2027.
Actionable checklist — Get started in 30 days
- Inventory queries and identify top-10 heavy aggregations by CPU time and I/O.
- Stand up 2 GPU nodes (NVLink if available) and deploy a simple RAPIDS aggregator service.
- Create a ClickHouse table function or materialized view to invoke the service for one query.
- Measure, tune, and add admission controls to avoid GPU OOMs.
- Formalize deployment with GitOps and add Prometheus dashboards for end-to-end traces.
Closing thoughts
Heterogeneous clusters combining ClickHouse, GPUs connected by NVLink, and emerging RISC‑V platforms unlock dramatic throughput gains for large analytical workloads. The trick is to start small, measure everything, and architect for graceful fallbacks. With careful query placement, deterministic memory management, and controlled hybrid execution models, teams can accelerate analytics without blowing up operational complexity.
Call to action
Ready to prototype? Start with the 30-day checklist above and connect with our DevOps team to build a tested CI/CD pipeline for heterogeneous ClickHouse clusters. If you want a tailored architecture review, contact our platform experts for a 2-week readiness assessment.