ClickHouse rollups for customer-facing dashboards: raw, hourly, daily
Customer-facing analytics has a tighter latency budget than internal BI. A 4-second dashboard load that's fine in Looker is a churn signal in your product. Rollups are how you keep things under 500ms once you have a few hundred million events. The interesting part isn't the rollup tables — it's the planner that picks between raw, hourly, and daily for every query, and the fallback paths when the rollup doesn't fit.
The shape of the problem
Every Emban dashboard widget is one query: a metric (count, distinct, sum, p95) over a time window with optional group-by and filters. A single tenant might have 50 million events. A typical dashboard renders 8–12 widgets at once. If each widget scans raw events at ClickHouse-native speed, you're looking at 200–400ms per widget, which compounds the moment the customer hits "this week" instead of "today."
Pre-aggregating helps, but you can't pre-aggregate everything — the cardinality of (tenant_id, event_name, period, group_by, filter_combo) explodes. The trick is to aggregate the safe combinations, and keep raw events for everything else.
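For concreteness, here is a rough sketch of the request each widget sends to the query service. The field names are assumptions for illustration, not the actual models.QueryRequest from the codebase, but the planner below only needs roughly this shape:
// Hypothetical shape of a widget query. The real models.QueryRequest may
// differ; this is just enough structure for the planner logic discussed below.
type QueryRequest struct {
    Metric      string            // "count", "count_unique", "count_unique_approx", "sum", "p95", ...
    NumericProp string            // numeric property for sum/avg/percentile metrics
    EventName   string            // which event the widget measures
    Period      string            // "24h", "7d", "30d", "90d", or a custom range
    Granularity string            // "hour", "day", "week", "month"
    GroupBy     string            // optional dimension, e.g. "dim_plan"
    Filters     map[string]string // optional property filters
}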
The tables
ClickHouse materialized views fan out events into eight rollup tables:
-- Aggregated by day, no extra dimensions. Smallest, fastest.
events_daily_mv (event_date, org_id, env_id, tenant_id, event_name, count, ...)
-- Aggregated by day, with the canonical "filterable" dimensions baked in.
events_daily_dim_rollup (event_date, ..., dim_country, dim_plan, dim_browser, count, ...)
-- Same shape but the metric is a HyperLogLog state instead of a count.
events_daily_hll_rollup (event_date, ..., uniqHLL12 state)
events_daily_dim_hll_rollup (event_date, ..., dim_*, uniqHLL12 state)
-- Hourly variants for "last 24h" and "last 48h" queries.
events_hourly_mv (event_hour, ..., count)
events_hourly_dim_rollup (event_hour, ..., dim_*, count)
events_hourly_hll_rollup (event_hour, ..., uniqHLL12 state)
events_hourly_dim_hll_rollup (event_hour, ..., dim_*, uniqHLL12 state)
-- Raw fallback. Where everything that doesn't fit the rollup runs.
events (timestamp, ..., numeric_props, string_props)
The dimension-rollup tables only carry a fixed set of columns we decided up-front were worth pre-aggregating — filterAwareRollupDims in code. Anything outside that set falls back to raw events.
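The set itself is tiny. As a sketch of what filterAwareRollupDims amounts to (the exact contents in the repo may differ; the three dimensions here just mirror the dim_* columns listed above):
// Hypothetical contents. Only dimensions listed here have a dim_* column in
// the dimension rollups, so only they can be grouped or filtered without
// falling back to raw events.
var filterAwareRollupDims = map[string]bool{
    "dim_country": true,
    "dim_plan":    true,
    "dim_browser": true,
}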
The planner
On every query, buildQueryPlan in internal/query/planner.go picks the source table and the metric expression. The decision tree is short and worth reading in full:
func buildQueryPlan(req models.QueryRequest, dateFrom, dateTo time.Time) queryPlan {
    plan := queryPlan{
        SourceTable:   "events",
        MetricExpr:    buildMetricExpr(req.Metric, req.NumericProp),
        TimeExpr:      granularityFunc(req.Granularity),
        TimeColumn:    "timestamp",
        PlannedBy:     "raw_events",
        PlannerReason: plannerRawReason(req, dateFrom, dateTo),
    }
    if p, ok := buildDailyRollupPlan(req); ok {
        plan = applyRollup(plan, p)
        return plan
    }
    if p, ok := buildHourlyRollupPlan(req, dateFrom, dateTo); ok {
        plan = applyRollup(plan, p)
        return plan
    }
    return plan // raw events fallback
}
The order matters: try daily first (cheapest), then hourly, then raw. Daily wins when the period is 7d, 30d, or 90d and the granularity is day/week/month. Hourly wins for 24h and any custom range under 48 hours.
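buildDailyRollupPlan and applyRollup aren't shown in this post. Here is a hedged sketch of the eligibility check buildDailyRollupPlan implies; the rollupPlan type and the exact branching are assumptions, only the gating rules come from the text above:
// Hypothetical sketch of the daily-rollup gate. The real implementation also
// chooses between the plain, dim, and HLL daily tables; this only shows the
// conditions under which daily is allowed at all.
func buildDailyRollupPlan(req models.QueryRequest) (rollupPlan, bool) {
    dailyPeriods := map[string]bool{"7d": true, "30d": true, "90d": true}
    dailyGrains := map[string]bool{"day": true, "week": true, "month": true}
    if !dailyPeriods[req.Period] || !dailyGrains[req.Granularity] {
        return rollupPlan{}, false
    }
    if req.Metric != "count" && req.Metric != "count_unique_approx" {
        return rollupPlan{}, false // p95/p99, sum/avg, exact distinct stay on raw
    }
    if req.GroupBy != "" && !filterAwareRollupDims[req.GroupBy] {
        return rollupPlan{}, false // custom dimension: no rollup column exists
    }
    return rollupPlan{
        SourceTable: "events_daily_mv", // or a dim/HLL variant, chosen later
        TimeColumn:  "event_date",
        PlannedBy:   "daily_rollup",
    }, true
}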
Where the planner deliberately gives up
The interesting part of plannerRawReason is what it refuses to plan:
switch req.Metric {
case "p95", "p99":
    return "Percentile metrics currently run on raw events."
case "sum", "avg":
    return "This metric does not have a rollup path yet."
}
if req.Metric == "count_unique" && (req.GroupBy != "" || hasDynamicFilters(req)) {
    return "Exact distinct queries with filters or group by stay on raw events."
}
if req.GroupBy != "" && !filterAwareRollupDims[req.GroupBy] {
    return "This group by dimension is not in the current rollup set."
}
Three things are doing real work here:
- Percentiles stay raw. A pre-aggregated p95 over arbitrary time windows is mathematically wrong — you can't combine percentiles across daily buckets to get a weekly p95. Either keep a t-digest state per day (extra storage, finicky merge) or just scan raw. We chose scan-raw with a clear reason string the user sees in the planner debug panel.
- Exact distinct with filters stays raw. Same reason — uniqExact over pre-aggregated counts isn't valid. The HLL variants do compose: that's why events_daily_hll_rollup stores a uniqHLL12 state, not a number. Switch the metric to count_unique_approx and the planner picks the HLL rollup and you get an answer in 30ms instead of 800ms, accurate to within ~1% (a sketch of that metric-expression swap follows this list).
- Group-by on a non-rollup dimension stays raw. You don't get a rollup for free on a custom dimension you added last week. The planner tells you so via PlannerReason; the dashboard shows a "running on raw events" hint so the user understands why their query took 1.2 seconds.
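As promised above, a rough sketch of the metric-expression side of that decision. buildMetricExpr isn't shown in this post; the function below is a simplified stand-in, and the column names follow the rollup tables listed earlier:
// Hypothetical stand-in for part of buildMetricExpr. Exact distinct always
// re-hashes raw user_ids; the approximate variant can either hash raw rows or
// merge the pre-aggregated uniqHLL12 states, depending on the source table.
func distinctMetricExpr(metric string, onHLLRollup bool) string {
    switch metric {
    case "count_unique":
        return "uniqExact(user_id)" // raw events only; never composed from rollups
    case "count_unique_approx":
        if onHLLRollup {
            return "uniqHLL12Merge(users_hll)" // union the stored daily/hourly states
        }
        return "uniqHLL12(user_id)" // same estimator, computed from raw rows
    default:
        return "count()"
    }
}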
The HLL trick
Distinct counts are the case people get wrong most often. SELECT uniqExact(user_id) over 30 days of events for a busy tenant means scanning hundreds of millions of rows while holding every distinct user_id in a hash table in memory. Pre-aggregating it as a daily distinct count and summing gives the wrong answer (the same user counted twice across two days).
HyperLogLog states do merge. Store the state per day, union them at query time, get the cardinality of the union — that's the trick. ClickHouse's uniqHLL12 uses about 2.5KB per state, with error on the order of 1–2% for the cardinalities we care about. The materialized view stores the state; the query reads a few states and unions them:
-- materialized view definition (excerpt)
SELECT
    toDate(timestamp) AS event_date,
    org_id, env_id, tenant_id, event_name,
    uniqHLL12State(user_id) AS users_hll
FROM events
GROUP BY event_date, org_id, env_id, tenant_id, event_name
-- read path
SELECT
    toStartOfWeek(event_date) AS ts,
    uniqHLL12Merge(users_hll) AS value
FROM events_daily_hll_rollup
WHERE org_id = ? AND tenant_id = ? AND event_date >= ?
GROUP BY ts
ORDER BY ts
For a customer-facing "weekly active users" widget over 90 days, this is the difference between scanning ~80M raw rows and reading 90 HLL states of a few kilobytes each. The wall-clock difference is roughly two orders of magnitude.
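On the application side, every plan slots into the same query template. The query builder isn't shown in this post; the sketch below is a simplified assumption of how the plan fields could be stitched together, ignoring group-by and dynamic filters:
import "fmt"

// Hypothetical rendering of a widget query from a plan. Assumes TimeExpr is a
// bare function name like "toStartOfWeek", matching the read-path query above.
func renderWidgetSQL(p queryPlan) string {
    return fmt.Sprintf(
        "SELECT %s(%s) AS ts, %s AS value FROM %s "+
            "WHERE org_id = ? AND tenant_id = ? AND %s >= ? AND %s < ? "+
            "GROUP BY ts ORDER BY ts",
        p.TimeExpr, p.TimeColumn, p.MetricExpr, p.SourceTable,
        p.TimeColumn, p.TimeColumn,
    )
}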
Hourly is for the live dashboard
The 24h widgets — the ticking ones at the top of every dashboard — are the hottest by request count. They run against events_hourly_mv, which is one row per (hour, tenant, event_name). A 24-hour range over a busy tenant is ~24 rows after aggregation. The query returns in single-digit milliseconds before any caching.
The threshold at which hourly applies isn't just "period == 24h" — isHourlyRollupWindow also catches custom ranges narrow enough to fit:
func isHourlyRollupWindow(period string, dateFrom, dateTo time.Time) bool {
    if period == "24h" {
        return true
    }
    return dateTo.Sub(dateFrom) <= 48*time.Hour
}
So a "drag-to-zoom" on a chart that selects a 6-hour window automatically routes to the hourly rollup. The user gets a snappy zoom; the planner picks the right table without any extra configuration.
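A quick usage example, with "custom" as an assumed period value for arbitrary date ranges; only the width of the range matters:
from := time.Now().Add(-6 * time.Hour)
fmt.Println(isHourlyRollupWindow("custom", from, time.Now())) // true: a 6h zoom hits the hourly rollup

weekAgo := time.Now().AddDate(0, 0, -7)
fmt.Println(isHourlyRollupWindow("custom", weekAgo, time.Now())) // false: falls through to daily or raw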
What we measured
Numbers below are from our own demo workload — one tenant on a single ClickHouse box, ~250 events/sec, a 12-widget dashboard mixing 24h, 7d, and 30d windows. They're not a universal benchmark; they're what the planner buys you on the workload we actually run in production for the live demo.
- Cold dashboard load (no cache, all rollups): ~247ms total.
- Cold load with planner forced to raw events: ~3.4s total.
- Warm dashboard load (in-process cache hit): ~34ms total.
The big win is the cache, but the cache only works because the rollups made the underlying queries cheap enough to re-execute on every TTL miss. Without rollups, the cache eviction rate during a tenant's busy hour would push you back toward 3-second loads.
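For context on what "warm" means here, a minimal sketch of the kind of in-process TTL cache assumed above; the real implementation isn't shown in this post, and the key scheme is an assumption:
import (
    "sync"
    "time"
)

// Hypothetical in-process cache keyed on tenant + rendered SQL. Entries expire
// after a short TTL, so every miss re-runs the (cheap, rollup-backed) query.
type cacheEntry struct {
    result    []byte
    expiresAt time.Time
}

type widgetCache struct {
    mu      sync.Mutex
    ttl     time.Duration
    entries map[string]cacheEntry
}

func (c *widgetCache) get(key string) ([]byte, bool) {
    c.mu.Lock()
    defer c.mu.Unlock()
    e, ok := c.entries[key]
    if !ok || time.Now().After(e.expiresAt) {
        return nil, false // TTL miss: caller re-executes against the rollup
    }
    return e.result, true
}

func (c *widgetCache) put(key string, result []byte) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.entries[key] = cacheEntry{result: result, expiresAt: time.Now().Add(c.ttl)}
}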
The PlannerReason string is what makes this debuggable in production.
What we'd add next
- quantilesTDigestState for percentile rollups, so p95 dashboards over 30-day windows stop running on raw events. The merge math works; it's a pure storage cost question.
- A second-tier rollup: 5-minute buckets for the live "last hour" widget. Right now those still hit raw events because the hourly bucket would smear away the resolution.
- Per-customer rollup eligibility, so a customer who routinely filters by a custom dimension can opt that dimension into the rollup set without bloating it for everyone else.