Building a query result cache that doesn't poison itself
We added a 90-line in-memory cache to the query engine and saw dashboard repeat-load drop from 247 ms to 34 ms — a 7x win for the most common operation in the product. The same change introduced a subtle pointer aliasing bug that would have silently corrupted KPI deltas under load. This post is the bug, the fix, and the rule it taught us about every cache going forward.
The setup
Emban dashboards are a fanout. A typical published dashboard has 8–15 widgets, each issuing its own ClickHouse query when the embed iframe loads. The queries are fast (≈10–15 ms each) but the network round trip plus ClickHouse parse cost adds up — full dashboard load was consistently 200–250 ms.
Most of those queries are identical between consecutive embed loads: same org, same tenant, same time window, same dimension filter. So we added a TTL cache keyed by a SHA-256 of the request scope, with TTL scaling by window length: 30 seconds for hour-scale windows, 30 minutes for month-scale ones. Standard pattern.
```go
func (c *resultCache) get(key string) *models.QueryResponse {
    c.mu.RLock()
    entry, ok := c.entries[key]
    c.mu.RUnlock()
    if !ok || time.Now().After(entry.expiresAt) {
        return nil
    }
    return entry.value
}
```

Looks fine. Tests pass. Latency drops. Ship.
What the cache stored
A QueryResponse is a struct around a slice of DataPoint. The DataPoint struct has a few pointer fields like PreviousValue *float64 and DeltaPct *float64 — populated only when a KPI is rendered with a comparison enabled.
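A minimal sketch of those shapes, and of why handing out the struct is dangerous. The types here are hypothetical stand-ins for the real `models` package, which has more fields:

```go
package main

import "fmt"

// DataPoint mirrors the post's description: scalar fields plus pointer
// fields that are nil unless a comparison is rendered.
type DataPoint struct {
    GroupValue    string
    Value         float64
    PreviousValue *float64 // nil on the bare Execute() path
    DeltaPct      *float64 // nil on the bare Execute() path
}

type QueryResponse struct {
    Data []DataPoint
}

func main() {
    r := QueryResponse{Data: []DataPoint{{GroupValue: "DE", Value: 42}}}
    // Copying the struct copies the slice header, not the backing array:
    // both values still point at the same DataPoint storage.
    shallow := r
    shallow.Data[0].Value = 7
    fmt.Println(r.Data[0].Value) // 7 — the "copy" aliases the original slice
}
```

A slice field is just a header (pointer, length, capacity), so a plain struct copy still aliases the cached data. That aliasing is the whole bug.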
The crucial detail: the comparison fields are not populated by the cached call path. A bare Execute() writes them as nil. The orchestrator that runs comparison — ExecuteWithCompare — runs two range queries (current period and previous period), then walks the current response and mutates each DataPoint's PreviousValue/DeltaPct in place.
```go
current, _ := e.executeRange(orgID, env, req, dateFrom, dateTo) // CACHED
previous, _ := e.executeRange(orgID, env, req, prevFrom, prevTo) // CACHED

// Mutates current.Data — but current.Data points into the cache.
for i := range current.Data {
    prevVal := previousMap[current.Data[i].GroupValue]
    current.Data[i].PreviousValue = &prevVal
    delta := ((current.Data[i].Value - prevVal) / prevVal) * 100
    current.Data[i].DeltaPct = &delta
}
```

ExecuteWithCompare mutates the cached response in place. The next call that hits the cache sees a "cached" result that is no longer what we executed: its DeltaPct reflects the previous comparison, not the current one.

How it would have manifested
On the live demo dashboard, the values barely move minute to minute, so the bug wouldn't have looked broken — DeltaPct would just lag. On a real customer dashboard with active traffic, KPI deltas would have drifted in non-obvious ways. Hard to debug, easy to ship.
Worse: two concurrent requests against the same cache key would both write to the same DataPoint slice. That's a data race the Go race detector would flag immediately under load — but the unit tests we'd written exercised get/set in serial.
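The concurrent case is easy to reproduce in isolation. This is a self-contained sketch of the access pattern, not the real cache: two goroutines write the same element of a shared slice, which is exactly what two concurrent compare paths would do to a shared cached `Data` slice, and exactly what `go test -race` flags:

```go
package main

import (
    "fmt"
    "sync"
)

// point stands in for a cached DataPoint with a pointer field.
type point struct{ prev *float64 }

// raceWrite has two goroutines write the same slice element without
// synchronization — the shape of the pre-fix bug under concurrent load.
// Run with -race to see the detector fire.
func raceWrite() float64 {
    shared := []point{{}} // stands in for the shared cached Data slice
    var wg sync.WaitGroup
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func(n float64) {
            defer wg.Done()
            shared[0].prev = &n // concurrent unsynchronized write
        }(float64(i))
    }
    wg.Wait()
    return *shared[0].prev // last writer wins, nondeterministically
}

func main() {
    fmt.Println(raceWrite())
}
```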
The fix
Clone the response on cache hit. The structure is shallow — the only thing that needs duplicating is the Data slice; Meta is a value type with no internal pointers.
```go
func (c *resultCache) get(key string) *models.QueryResponse {
    c.mu.RLock()
    entry, ok := c.entries[key]
    c.mu.RUnlock()
    if !ok || time.Now().After(entry.expiresAt) {
        return nil
    }
    return cloneResponse(entry.value)
}

func cloneResponse(src *models.QueryResponse) *models.QueryResponse {
    if src == nil {
        return nil
    }
    out := *src
    if src.Data != nil {
        out.Data = make([]models.DataPoint, len(src.Data))
        copy(out.Data, src.Data)
    }
    return &out
}
```

DataPoint is a value type containing pointers, but the pointers point to scalars (*float64). When ExecuteWithCompare assigns current.Data[i].PreviousValue = &prevVal, it overwrites the field in the cloned slice element, not the cached one. The cached entry stays clean.
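Note the shallow copy is only safe because the compare path *replaces* the pointers rather than writing through them. If any caller ever did `*dp.PreviousValue = x`, the clone would need to duplicate the pointed-to floats too. A hardening sketch under that assumption (not what we shipped, and `DataPoint` here is a trimmed stand-in for the real type):

```go
package main

import "fmt"

type DataPoint struct {
    Value         float64
    PreviousValue *float64
    DeltaPct      *float64
}

// cloneDeep duplicates the pointed-to scalars as well, so even a caller
// that writes *through* the pointers cannot reach the cached copy.
func cloneDeep(src []DataPoint) []DataPoint {
    out := make([]DataPoint, len(src))
    copy(out, src)
    for i := range out {
        if out[i].PreviousValue != nil {
            v := *out[i].PreviousValue
            out[i].PreviousValue = &v
        }
        if out[i].DeltaPct != nil {
            v := *out[i].DeltaPct
            out[i].DeltaPct = &v
        }
    }
    return out
}

func main() {
    prev := 10.0
    orig := []DataPoint{{Value: 1, PreviousValue: &prev}}
    cl := cloneDeep(orig)
    *cl[0].PreviousValue = 99 // write through the pointer in the clone
    fmt.Println(*orig[0].PreviousValue) // 10 — the original is untouched
}
```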
The test that catches this
The bug was invisible to "does the cache return what was put in" tests. The test that catches it has to mutate the returned value, then ask for it again:
```go
func TestCache_CloneIsolatesMutation(t *testing.T) {
    c := newResultCache(8)
    resp := &models.QueryResponse{Data: []models.DataPoint{{Value: 1}}}
    c.set("k", resp, time.Minute)

    first := c.get("k")
    first.Data[0].Value = 999

    second := c.get("k")
    if second.Data[0].Value != 1 {
        t.Fatalf("cache leaked mutation: got %v, want 1", second.Data[0].Value)
    }
}
```

That test failed before the clone. After: green, including under -race.
Rules of thumb
- A cache that returns a pointer is implicitly promising callers won't mutate. Almost no API documents that promise. Almost no caller respects it.
- Either return a copy or return an interface that's read-only at the type level. Cloning is cheaper than tracking down a mystery intermittent bug six months later.
- Test that mutating the returned value doesn't affect the next get. That single test would have caught this on the first run.
- Run new caches with the race detector under realistic concurrency before claiming they work. Serial unit tests aren't enough.
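The "read-only at the type level" option from the second rule can be sketched like this. The interface name and methods are assumptions, and the trade-off is real: callers lose direct slice access, but the compiler now enforces the promise the pointer-returning API only implied:

```go
package main

import "fmt"

// ReadOnlyResult exposes cached data without handing out a mutable slice.
// A hypothetical shape, not the API this post shipped.
type ReadOnlyResult interface {
    Len() int
    Value(i int) float64
}

// cachedResult keeps the slice unexported, so callers outside the package
// can only read through the interface.
type cachedResult struct{ values []float64 }

func (c cachedResult) Len() int            { return len(c.values) }
func (c cachedResult) Value(i int) float64 { return c.values[i] }

func main() {
    var r ReadOnlyResult = cachedResult{values: []float64{1, 2, 3}}
    fmt.Println(r.Len(), r.Value(1)) // 3 2
}
```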
Where this landed
The cache shipped to the live emban.sidelabs.dev dashboard. With the clone fix in place, repeat dashboard load is 34 ms (down from 247 ms), sustained throughput is 380 req/s at 16 concurrent clients, and we haven't seen a single race-condition flag in two days of synthetic traffic.
Source for the cache and concurrency limiter is in internal/query/cache.go and internal/query/concurrency.go. The full set of tests, including the mutation-isolation one, is in cache_test.go.