Observability Patterns for Continuous Model Evaluation in OLAP Stores

2026-02-13

Use ClickHouse as a telemetry-first OLAP to join late labels, run cohort evaluations, backfill metrics, and detect drift with reproducible pipelines.

Why your ML observability fails when labels arrive late, and how OLAP fixes it

Teams building production AI struggle with a familiar problem: models are running in prod, but ground-truth labels trickle in late, metrics diverge, and audits become a painful manual exercise. You need repeatable, fast retrospective evaluation, cohort analysis, and a scalable way to backfill metrics for drift detection and compliance. Using an OLAP database such as ClickHouse changes this workflow: store every model output as telemetry, join late-arriving labels to that telemetry, run retrospective evaluations at scale, and backfill computed metrics for alerts and audits — all cost-effectively.

Executive summary — the pattern you can implement today

In 2026, OLAP systems (ClickHouse in particular) have become a default choice for large-scale model telemetry because of their high ingestion throughput, compact columnar storage, growing extension ecosystem (including vector functions), and fast analytic queries. This article gives you a practical blueprint:

  • Schema and ingestion patterns for storing model outputs and labels
  • Query patterns for retrospective evaluation and cohort analysis
  • Backfill workflows and materialized views for metric recomputation
  • Drift-detection recipes (PSI, KS approximations, histograms, KL proxies)
  • Operational tips: partitioning, compression, sampling, TTLs and cost controls

The big idea: telemetry-first model evaluation

Instead of only logging metrics in monitoring systems, persist raw model outputs in an OLAP table: timestamped events containing request identifiers, model version, input hashes (or feature vectors), predictions (scores, labels), and metadata. When ground-truth labels arrive later (batch jobs, human review, or data lake events), write them to the same OLAP namespace using the same request ID. The OLAP store lets you run retrospective joins, windowed aggregations, and heavy cohort queries cheaply — a critical capability for production MLOps.

Core benefits

  • Retrospective joins: Match predictions to labels arriving minutes, hours, or months later.
  • Fast cohort queries: Compute metrics for arbitrary slices (customer tier, feature bucket, country, model_version).
  • Backfillable metrics: Recompute historical metrics and insert them into pre-aggregated tables for dashboards/alerts.
  • Efficient storage: Columnar compression reduces costs compared to row stores; ClickHouse improvements in 2024–2025 further optimized telemetry use cases.

Designing your OLAP schema for telemetry

Start with a lightweight, consistent schema. Key design goals: joinability on request IDs, compactness for cost control, and partitioning for query performance.

CREATE TABLE analytics.model_outputs
  (
    event_time DateTime64(3),
    event_date Date DEFAULT toDate(event_time),
    request_id String,
    model_version String,
    prediction Float32,
    predicted_label UInt8,
    input_hash String,
    feature_keys Array(String),
    feature_values Array(Float32),
    metadata Nested(key String, value String),
    ingestion_version UInt32
  )
  ENGINE = MergeTree()
  PARTITION BY toYYYYMM(event_date)
  ORDER BY (event_date, model_version, request_id)
  SETTINGS index_granularity = 8192;
  

Notes: use input_hash instead of raw inputs for privacy (store raw inputs only if you have encryption and audit controls). The Nested metadata column allows flexible tags (user_segment, country, pipeline_id). Including request_id in the ORDER BY keeps rows for the same request physically close together, which helps joins and point lookups.

Ground-truth table: labels

CREATE TABLE analytics.labels
  (
    label_time DateTime64(3),
    label_date Date DEFAULT toDate(label_time),
    request_id String,
    ground_truth UInt8,
    label_source String,
    label_version UInt32
  )
  ENGINE = ReplacingMergeTree(label_version)
  PARTITION BY toYYYYMM(label_date)
  ORDER BY (label_date, request_id);
  

Use ReplacingMergeTree when labels can be corrected (store label_version). This pattern makes late-arriving corrections simple: write again with higher label_version.
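
For example, a hypothetical correction for request 'req-123' is just another insert with a higher label_version; reading with FINAL (or an argMax aggregation) keeps only the latest version per request:

INSERT INTO analytics.labels
  VALUES (now64(3), today(), 'req-123', 1, 'human_review', 2);

  -- FINAL collapses rows with the same sorting key, keeping the highest label_version
  SELECT request_id, ground_truth
  FROM analytics.labels FINAL
  WHERE request_id = 'req-123';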

Ingestion patterns: streaming + batch

Combine streaming for real-time telemetry and periodic batch jobs for labels and backfills. Common architectures in 2026:

  • Model inference writes to Kafka (or Kinesis); ClickHouse consumes via Kafka engine or an ingestion service.
  • Labels are written by batch ETL jobs (Spark/Beam) into ClickHouse using bulk inserts or lake connectors.
  • Materialized views compute pre-aggregates as events arrive.

Example: consume the inference topic with a ClickHouse Kafka engine table that acts as a buffer, then let a materialized view move parsed rows into model_outputs. This reduces downstream pressure and gives you a place to deduplicate on insert.
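
A minimal sketch of that wiring, assuming a JSONEachRow topic named 'model-inference' on a broker at kafka:9092 (both placeholders); columns the view does not select fall back to their table defaults:

CREATE TABLE analytics.model_outputs_kafka
  (
    event_time DateTime64(3),
    request_id String,
    model_version String,
    prediction Float32,
    predicted_label UInt8,
    input_hash String
  )
  ENGINE = Kafka
  SETTINGS kafka_broker_list = 'kafka:9092',
           kafka_topic_list = 'model-inference',
           kafka_group_name = 'clickhouse-model-outputs',
           kafka_format = 'JSONEachRow';

  -- The materialized view drains the Kafka table and writes into the MergeTree target
  CREATE MATERIALIZED VIEW analytics.model_outputs_kafka_mv
  TO analytics.model_outputs AS
  SELECT
    event_time,
    toDate(event_time) AS event_date,
    request_id,
    model_version,
    prediction,
    predicted_label,
    input_hash
  FROM analytics.model_outputs_kafka;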

Retrospective evaluation: joining predictions to labels

Once you have both tables, retrospective evaluation is mostly SQL joins and windowed aggregations. Here are production-ready patterns.

Basic retrospective metrics (per model version, per day)

SELECT
    mo.model_version,
    mo.event_date,
    count() AS total_preds,
    countIf(l.request_id != '') AS labeled_preds,
    countIf(mo.predicted_label = 1 AND l.ground_truth = 1) AS true_positives,
    countIf(mo.predicted_label = 1 AND l.ground_truth = 0 AND l.request_id != '') AS false_positives,
    round(true_positives / NULLIF(true_positives + false_positives, 0), 4) AS precision
  FROM analytics.model_outputs AS mo
  LEFT JOIN analytics.labels AS l
    ON mo.request_id = l.request_id
  WHERE mo.event_date BETWEEN '2025-12-01' AND '2026-01-15'
  GROUP BY mo.model_version, mo.event_date
  ORDER BY mo.event_date, mo.model_version;
  

Use LEFT JOIN to include predictions without labels (important for coverage metrics such as labeled_preds above). With ClickHouse's default join behavior, unmatched label columns come back as type defaults (0, empty string) rather than NULLs, which is why the query checks l.request_id != ''. Joins stay fast when both sides are filtered to the same date range and the label side fits comfortably in the hash-join memory budget.

Handling late labels and deduplication

Late labels can arrive after your initial metric computation. Two practical strategies:

  1. Immutable telemetry + replacing labels: Keep model_outputs immutable and write labels to the ReplacingMergeTree table. Recompute metrics by joining; backfill runs will re-evaluate any changed label_version.
  2. Deduplication keys on ingest: If duplicate events are common, deduplicate at write time (hash + version) or use the Collapsing/Replacing MergeTree engines.

Cohort analysis patterns

Cohorts help you find degradations that affect a subset of users. Cohorts can be defined by user attributes, feature buckets, or time windows.

Example: cohort by user signup month and model_version

SELECT
    c.signup_month,
    mo.model_version,
    count() AS preds,
    round(countIf(mo.predicted_label = 1 AND l.ground_truth = 1)
          / NULLIF(countIf(mo.predicted_label = 1), 0), 4) AS precision
  FROM
  (
    -- Pull the user_id tag out of the Nested metadata column once, up front
    SELECT *, metadata.value[indexOf(metadata.key, 'user_id')] AS user_id
    FROM analytics.model_outputs
    WHERE event_date >= addMonths(today(), -3)
  ) AS mo
  LEFT JOIN analytics.labels AS l ON mo.request_id = l.request_id
  LEFT JOIN analytics.users AS c ON c.user_id = mo.user_id
  GROUP BY c.signup_month, mo.model_version
  ORDER BY c.signup_month, mo.model_version;
  

Tip: pre-join user attributes into an enrichment table or use a materialized view to avoid repeated lookups.

Windowed cohort conversion (retention style)

Use array aggregation to compute retention-like metrics: how predictions for a signup cohort perform over successive weeks.

SELECT
    signup_month,
    arrayMap(w -> tupleElement(w, 1), weeks) AS week_starts,
    arrayMap(w -> tupleElement(w, 2), weeks) AS weekly_precision
  FROM (
    SELECT
      signup_month,
      arraySort(x -> tupleElement(x, 1), groupArray((week_start, precision))) AS weeks
    FROM (
      SELECT
        c.signup_month AS signup_month,
        toStartOfWeek(mo.event_date) AS week_start,
        round(countIf(mo.predicted_label = 1 AND l.ground_truth = 1)
              / NULLIF(countIf(mo.predicted_label = 1), 0), 4) AS precision
      FROM
      (
        SELECT *, metadata.value[indexOf(metadata.key, 'user_id')] AS user_id
        FROM analytics.model_outputs
        WHERE event_date >= addMonths(today(), -6)
      ) AS mo
      LEFT JOIN analytics.labels AS l ON mo.request_id = l.request_id
      LEFT JOIN analytics.users AS c ON c.user_id = mo.user_id
      GROUP BY signup_month, week_start
    )
    GROUP BY signup_month
  );
  

Backfill strategies: recompute, store, and audit

Backfills are essential when labels arrive late or when you release a new metric definition. Your backfill should be reproducible and idempotent.

Backfill pipeline checklist

  1. Identify the missing time window(s) and metric definitions.
  2. Run a deterministic SELECT that joins telemetry + labels over that window.
  3. INSERT results into a time-series metrics table (with run_id and checksum).
  4. Store run metadata (start/end times, query hash, operator) in a backfill_runs table for audit.
  5. Mark dashboards and alerts as using backfilled vs live metrics.

Example: computed metrics table and backfill insert

CREATE TABLE analytics.daily_metrics
  (
    metric_date Date,
    model_version String,
    metric_name String,
    metric_value Float64,
    run_id UUID
  ) ENGINE = MergeTree()
  PARTITION BY toYYYYMM(metric_date)
  ORDER BY (metric_date, model_version, metric_name, run_id);

  INSERT INTO analytics.daily_metrics
  SELECT
    mo.event_date AS metric_date,
    mo.model_version,
    'precision' AS metric_name,
    round(countIf(mo.predicted_label = 1 AND l.ground_truth = 1)
          / countIf(mo.predicted_label = 1), 4) AS metric_value,
    toUUID('00000000-0000-0000-0000-000000000001') AS run_id
  FROM analytics.model_outputs AS mo
  LEFT JOIN analytics.labels AS l ON mo.request_id = l.request_id
  WHERE mo.event_date BETWEEN '2025-12-01' AND '2025-12-31'
  GROUP BY mo.event_date, mo.model_version
  HAVING countIf(mo.predicted_label = 1) > 0;
  

Use a unique run_id per backfill so you can safely re-run with different metric definitions and trace changes. Because run_id is part of the sorting key, all runs are kept; dashboards should select the run they trust, typically the most recent run_id for a given window.

Drift detection: practical OLAP recipes

Detecting drift requires comparing current distributions to a baseline. OLAP excels here: compute histograms and quantiles for features/predictions and run lightweight statistical proxies in SQL.

Population Stability Index (PSI) approximation

PSI measures change between baseline and current distributions across bins. A simple approximation uses fixed-width bins over the score range; the sketch below assumes prediction is a score in [0, 1] and uses ten bins (quantile-based bins work the same way once you materialize the cut points):

-- 10 equal-width bins over [0, 1]; bins missing on either side are skipped by the inner join.
  -- log() is the natural logarithm in ClickHouse.
  SELECT
    sum((cur_pct - base_pct) * log(cur_pct / base_pct)) AS psi
  FROM
  (
    SELECT
      least(floor(prediction * 10), 9) AS bin,
      count() / (SELECT count() FROM analytics.model_outputs
                 WHERE event_date BETWEEN '2025-11-01' AND '2025-11-30') AS cur_pct
    FROM analytics.model_outputs
    WHERE event_date BETWEEN '2025-11-01' AND '2025-11-30'
    GROUP BY bin
  ) AS cur
  INNER JOIN
  (
    SELECT
      least(floor(prediction * 10), 9) AS bin,
      count() / (SELECT count() FROM analytics.model_outputs
                 WHERE event_date BETWEEN '2024-11-01' AND '2024-11-30') AS base_pct
    FROM analytics.model_outputs
    WHERE event_date BETWEEN '2024-11-01' AND '2024-11-30'
    GROUP BY bin
  ) AS base USING bin;
  

Practical tip: compute baseline histograms once and store them in a small table. Then compute current histograms daily and compare to baseline with a lightweight JOIN.
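
One way to do that, as a sketch (table layout and the 2024 baseline window are illustrative; bins again assume a score in [0, 1]): snapshot baseline bin fractions once, so the daily drift job only has to bin the current window and join against this table as in the PSI query above.

CREATE TABLE analytics.baseline_histograms
  (
    bin UInt8,
    base_pct Float64
  )
  ENGINE = MergeTree()
  ORDER BY bin;

  INSERT INTO analytics.baseline_histograms
  SELECT
    toUInt8(least(floor(prediction * 10), 9)) AS bin,
    count() / (SELECT count() FROM analytics.model_outputs
               WHERE event_date BETWEEN '2024-11-01' AND '2024-11-30') AS base_pct
  FROM analytics.model_outputs
  WHERE event_date BETWEEN '2024-11-01' AND '2024-11-30'
  GROUP BY bin;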

Kolmogorov–Smirnov (KS) proxy and quantile diffs

Exact KS is expensive; compare a set of quantiles instead and take the max absolute difference as a KS proxy.

-- Compare matched quantiles from the two windows; the largest absolute gap is the KS proxy.
  WITH
    (
      SELECT quantiles(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99)(prediction)
      FROM analytics.model_outputs
      WHERE event_date BETWEEN today() - 7 AND today()
    ) AS cur_q,
    (
      SELECT quantiles(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99)(prediction)
      FROM analytics.model_outputs
      WHERE event_date BETWEEN toDate('2025-11-01') AND toDate('2025-11-30')
    ) AS base_q
  SELECT arrayMax(arrayMap((c, b) -> abs(c - b), cur_q, base_q)) AS ks_proxy;
  

Monitoring categorical drift

For categorical features, compute daily top-k distributions and compare with baseline using KL divergence or TVD (total variation distance).
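
A sketch of the TVD variant, assuming the category of interest is a 'country' tag in the Nested metadata column and using illustrative comparison windows; categories present on only one side contribute their full share via the FULL OUTER JOIN defaults:

SELECT 0.5 * sum(abs(cur_pct - base_pct)) AS tvd
  FROM
  (
    SELECT
      metadata.value[indexOf(metadata.key, 'country')] AS category,
      count() / (SELECT count() FROM analytics.model_outputs
                 WHERE event_date >= today() - 7) AS cur_pct
    FROM analytics.model_outputs
    WHERE event_date >= today() - 7
    GROUP BY category
  ) AS cur
  FULL OUTER JOIN
  (
    SELECT
      metadata.value[indexOf(metadata.key, 'country')] AS category,
      count() / (SELECT count() FROM analytics.model_outputs
                 WHERE event_date BETWEEN toDate('2025-11-01') AND toDate('2025-11-30')) AS base_pct
    FROM analytics.model_outputs
    WHERE event_date BETWEEN toDate('2025-11-01') AND toDate('2025-11-30')
    GROUP BY category
  ) AS base USING category;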

Query performance and cost optimization

Telemetry tables can get huge. Use ClickHouse features to control query cost and latency.

Partitioning & ORDER BY

  • Partition by month (toYYYYMM(event_date)) for efficient range deletes and backfills.
  • ORDER BY (event_date, model_version, request_id) to make JOINs and GROUP BYs faster.

Compression & codecs

Choose codecs per column: LZ4 or ZSTD for general columns; Delta/DoubleDelta for timestamps; run column-level profiling to find best tradeoffs.
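
A sketch of what per-column codecs can look like, on a hypothetical variant of the telemetry table; the codec choices here are common starting points, not tuned recommendations:

CREATE TABLE analytics.model_outputs_codecs
  (
    event_time DateTime64(3) CODEC(DoubleDelta, ZSTD(1)),
    event_date Date DEFAULT toDate(event_time),
    request_id String CODEC(ZSTD(3)),
    model_version LowCardinality(String),
    prediction Float32 CODEC(Gorilla, ZSTD(1)),
    input_hash String CODEC(ZSTD(3))
  )
  ENGINE = MergeTree()
  PARTITION BY toYYYYMM(event_date)
  ORDER BY (event_date, model_version, request_id);

You can check the effect per column via system.columns (data_compressed_bytes vs data_uncompressed_bytes).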

Sampling for interactive analysis

For exploratory cohort queries use sampling clauses (SAMPLE) to get approximate answers fast. Pair sampling with determinism (sample by request_id) for reproducible investigation.
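
Note that SAMPLE only works if the table declares a SAMPLE BY expression that is part of its sorting key, which the schema above does not; a hypothetical sampled variant and an approximate query would look like this:

CREATE TABLE analytics.model_outputs_sampled
  (
    event_time DateTime64(3),
    event_date Date DEFAULT toDate(event_time),
    request_id String,
    model_version String,
    prediction Float32
  )
  ENGINE = MergeTree()
  PARTITION BY toYYYYMM(event_date)
  ORDER BY (event_date, model_version, cityHash64(request_id))
  SAMPLE BY cityHash64(request_id);

  -- Roughly 10% of rows, deterministic for a given request_id
  SELECT model_version, avg(prediction) AS avg_score
  FROM analytics.model_outputs_sampled SAMPLE 0.1
  WHERE event_date >= today() - 7
  GROUP BY model_version;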

Materialized views & pre-aggregations

Compute daily aggregates into a metrics table via materialized views to offload expensive joins from dashboards. Use SummingMergeTree or AggregatingMergeTree for efficient rollups.
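
A sketch using AggregatingMergeTree with state/merge combinators (table and view names are illustrative); the view keeps daily partial states up to date as events arrive, and dashboards finish the aggregation with the -Merge combinators:

CREATE TABLE analytics.daily_model_rollup
  (
    event_date Date,
    model_version String,
    preds AggregateFunction(count),
    avg_score AggregateFunction(avg, Float32)
  )
  ENGINE = AggregatingMergeTree()
  PARTITION BY toYYYYMM(event_date)
  ORDER BY (event_date, model_version);

  CREATE MATERIALIZED VIEW analytics.daily_model_rollup_mv
  TO analytics.daily_model_rollup AS
  SELECT
    event_date,
    model_version,
    countState() AS preds,
    avgState(prediction) AS avg_score
  FROM analytics.model_outputs
  GROUP BY event_date, model_version;

  -- Dashboards read with -Merge combinators
  SELECT event_date, model_version,
         countMerge(preds) AS preds,
         avgMerge(avg_score) AS avg_score
  FROM analytics.daily_model_rollup
  GROUP BY event_date, model_version;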

TTL and data lifecycle

Use TTLs to automatically drop raw inputs after compliance windows (e.g., keep raw telemetry 90 days, keep aggregated metrics 2 years). If privacy rules require deletion per-request, plan for tombstone patterns with CollapsingMergeTree.
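
As a minimal sketch, using the illustrative retention windows from the paragraph above:

ALTER TABLE analytics.model_outputs
  MODIFY TTL event_date + INTERVAL 90 DAY;

  ALTER TABLE analytics.daily_metrics
  MODIFY TTL metric_date + INTERVAL 2 YEAR;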

Operational patterns: alerts, audits and governance

Build observability and governance into the pipeline:

  • Backfill_runs table: record all backfill jobs, SQL hash, operator, and checksums of produced metrics.
  • Model registry link: join model_metrics to your model registry (version metadata) to audit which code produced which outputs.
  • Alerting: compute drift and metric thresholds in ClickHouse and forward alerts to Prometheus/Grafana or an alert manager (a simple threshold query is sketched after this list). Use monotonic counters for trusted alerting signals.
  • Access controls: avoid raw prompt storage; use hashing, encryption, and RBAC for sensitive columns.
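
A minimal threshold check that a scheduler or a Grafana alert rule could run against the metrics table; it assumes the drift job writes a 'psi' metric, and the 0.2 cutoff is a common rule of thumb rather than a standard:

SELECT model_version, metric_value AS psi
  FROM analytics.daily_metrics
  WHERE metric_name = 'psi'
    AND metric_date = yesterday()
    AND metric_value > 0.2;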

Example end-to-end backfill run (step-by-step)

  1. Lock the backfill window and create a run_id.
  2. Snapshot baseline histograms (if needed) into analytics.baseline_histograms.
  3. Run the deterministic SELECT query joining model_outputs and labels for the window.
  4. INSERT aggregated metrics into analytics.daily_metrics with the run_id.
  5. Store a checksum of the query results (e.g., a hash over the result rows; see the sketch after this list) in backfill_runs for reproducibility.
  6. Notify dashboards and update annotations so stakeholders know metrics were backfilled.
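
A sketch of the audit table and an order-independent checksum over one run's output (the layout, hash choice, and run_id value are illustrative):

CREATE TABLE analytics.backfill_runs
  (
    run_id UUID,
    started_at DateTime,
    finished_at DateTime,
    operator String,
    query_hash String,
    metrics_checksum UInt64
  )
  ENGINE = MergeTree()
  ORDER BY (started_at, run_id);

  -- XOR of per-row hashes: stable regardless of row order
  SELECT groupBitXor(cityHash64(metric_date, model_version, metric_name, metric_value)) AS metrics_checksum
  FROM analytics.daily_metrics
  WHERE run_id = toUUID('00000000-0000-0000-0000-000000000001');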

Why teams standardize on OLAP telemetry in 2026

By 2026, companies increasingly standardize on OLAP systems for telemetry due to several trends:

  • ClickHouse and other OLAP projects have matured, with broad adoption and sustained development making them production-ready at telemetry scale.
  • Vector functions and embedding workflows have been integrated into many ClickHouse deployments, enabling hybrid telemetry (scores + embeddings) for richer drift analysis.
  • Cost pressures force teams to store raw telemetry economically and compute heavy metrics on-demand or via scheduled backfills.
  • Regulation and audit requirements mean teams must keep reproducible trails: original prediction event + joined label + backfill metadata.

Security & compliance checklist

  • Hash or tokenize PII before writing to model_outputs; store raw prompts only in encrypted buckets with strict access controls.
  • Enable TLS in transit and disk encryption for ClickHouse data disks.
  • Use RBAC to restrict who can run backfills and modify metrics tables.
  • Maintain a backfill_runs audit table with operator, timestamp, and query hash for audits.

Common pitfalls and how to avoid them

  • No request_id: If predictions and labels lack the same ID, you’ll need fuzzy joins that are expensive. Always propagate a request_id.
  • Over-retaining raw inputs: Keep raw text only when necessary — store input_hash and feature snapshots instead.
  • Unaligned partitions: Mismatched partitioning across tables kills join performance — align partitions on event_date/month.
  • No backfill metadata: Recomputing metrics without a run_id makes audits impossible. Always record run metadata.

Actionable checklist to implement in your next sprint

  1. Define and enforce a global telemetry schema that includes request_id, model_version, and input_hash.
  2. Set up ClickHouse ingestion from your inference Kafka topic using a buffer + materialized view.
  3. Create ReplacingMergeTree labels table and ensure pipelines write label_version for corrections.
  4. Implement a daily backfill job that computes core metrics and writes to a metrics table with run_id.
  5. Add a baseline histogram table and a daily drift job that computes PSI/quantile diffs and fires alerts.

Conclusion & next steps

Putting OLAP at the center of model observability turns late labels from a headache into a solvable data engineering problem. ClickHouse gives you the throughput and analytical horsepower to join predictions with labels, run cohort analyses, backfill metrics reliably, and detect drift early — while keeping costs manageable through compression, partitioning, and pre-aggregation. In 2026, teams that adopt these OLAP patterns move faster on experiments, ship safer models, and provide auditable metrics for business stakeholders.

"Persist raw telemetry, join late labels, backfill deterministically, and track every backfill — that's the core of repeatable model observability in production."

Call to action

Ready to operationalize this pattern? Start with a 2-week spike: create the telemetry schema in ClickHouse, ingest one week of predictions and labels, and run a backfill. If you want a checklist, scripts, or a sample ClickHouse deployment optimized for telemetry, request our MLOps playbook — we’ll share schema templates, backfill runners, and alerting examples you can plug into your pipeline.
