Today's VLMs are strong on video analysis — if you prompt them correctly. "Is there a knife on the floor?" → correct. "Does the baby hold a knife?" → correct.

But that's reactive. You're enumerating the world for the model in advance. Even looping "is everything safe?" doesn't fix it — you're still defining the safety surface upfront. The cognitive work of deciding what to look for still lives in the human.

Real proactive would mean the model watches, builds its own understanding of what's normal for the scene, and triggers when something deviates. No prompt list. No question loop.

The question is what that actually looks like as a system.

Why prompt loops aren't proactive

The "loop the safety prompt every N seconds" trick has a fundamental ceiling: every loop only checks the closure of the prompt vocabulary. If your prompt list doesn't anticipate "child climbed onto the kitchen counter," the loop won't catch it — no matter how often you ask.

More prompts = bigger closure, but the closure is still upfront-defined, and adversarial cases are always one step outside it. This is the same failure pattern that fixed-vocabulary detectors had pre-CLIP: enumerating classes fundamentally can't keep up with the long tail.
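
A toy illustration of that ceiling, as a minimal sketch. The prompt list and event labels are invented for the example, and plain string matching stands in for a VLM that answers every question it is asked correctly. Even with that perfect oracle, anything outside the list goes unflagged.

```python
# Toy model of a prompt loop: it can only flag what its prompt list anticipates.
# PROMPTS and the event strings are hypothetical; equality stands in for a VLM
# that answers each individual question correctly.

PROMPTS = {
    "knife on the floor",
    "baby holding a knife",
    "stove left on",
}

def prompt_loop(observed_events, prompts=PROMPTS):
    """Return the events the loop would flag: those that match some prompt."""
    return [event for event in observed_events if event in prompts]

stream = [
    "knife on the floor",
    "child climbed onto the kitchen counter",  # nobody wrote a prompt for this
]

flagged = prompt_loop(stream)
missed = [event for event in stream if event not in flagged]
print("flagged:", flagged)
print("missed:", missed)
```

Running the loop more often changes nothing here: the missed event stays missed until a human adds a prompt for it.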

The whole point of proactive should be inverting the direction of inquiry: model → human, not human → model.

What "proactive" decomposes into

I think a proactive VLM = four capabilities, in order:

  1. Continuous scene understanding. A persistent latent of this particular scene, not just per-frame features. This is essentially a world model in the sense of blogs 01 and 02 — predicts forward, accumulates a state-of-the-scene, conditions on context.
  2. Normalcy baseline. A learned distribution over "what usually happens here." The same kitchen at 7am vs 7pm is different; the same ICU room with a different patient is different. Normalcy is scene-conditional, not universal.
  3. Surprise signal. A scalar (or structured) deviation between predicted and observed state. The natural object is prediction error in latent space — exactly what a JEPA-style predictor already produces as a byproduct of training. Free energy / predictive coding framing fits cleanly.
  4. Verbalizer. When surprise spikes, describe what's different in natural language, using the VLM's existing concept vocabulary. The model doesn't have to invent the word "knife" — it just has to invoke it when the predicted latent and observed latent diverge in a region whose nearest CLIP concept is "knife."

The first three are world-model territory. The fourth is what makes it a VLM rather than just an anomaly detector.
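
Capabilities 3 and 4 can be sketched in a few lines, under heavy assumptions. The function names, the 3-d latent space, and the hand-made concept vectors are all illustrative. A real system would use a learned predictor and a real concept bank, and would likely localize the divergence spatially rather than use one global residual direction as done here.

```python
import numpy as np

def surprise(predicted, observed):
    """Squared L2 prediction error between predicted and observed latents."""
    diff = observed - predicted
    return float(diff @ diff)

def nearest_concept(predicted, observed, concept_names, concept_embeddings):
    """Name the concept whose embedding best aligns (cosine) with the residual."""
    residual = observed - predicted
    residual = residual / (np.linalg.norm(residual) + 1e-12)
    norms = np.linalg.norm(concept_embeddings, axis=1, keepdims=True)
    scores = (concept_embeddings / norms) @ residual
    return concept_names[int(np.argmax(scores))]

# Toy 3-d latent space with hand-made concept directions (hypothetical).
names = ["knife", "shadow", "cup"]
embeddings = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])

predicted = np.array([0.1, 0.2, 0.3])
observed = np.array([0.9, 0.25, 0.3])  # diverges mostly along the "knife" axis

s = surprise(predicted, observed)
label = nearest_concept(predicted, observed, names, embeddings)
print(s, label)
```

The point of the sketch is the division of labor: the scalar decides whether to speak, and the concept lookup decides what to say.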

Architecture sketch (the minimum viable proactive VLM)

Figure: The minimum viable proactive VLM, a verbalizer wired onto the surprise signal of a predictive world model. A video stream feeds a vision encoder; a latent dynamics model predicts the next latent; a surprise gate compares the predicted and observed latents; a verbalizer LLM produces a natural-language alert.
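
The wiring described above can be sketched as a streaming loop. This is a sketch under stated assumptions, not a reference implementation: `ProactiveLoop` and its field names are hypothetical, and the components in the usage lines are trivial stand-ins (a flattening "encoder", an identity "dynamics model", a format string as "verbalizer").

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional

import numpy as np

Latent = np.ndarray

@dataclass
class ProactiveLoop:
    encode: Callable[[np.ndarray], Latent]      # vision encoder
    predict_next: Callable[[Latent], Latent]    # latent dynamics model
    verbalize: Callable[[Latent, Latent], str]  # verbalizer (LLM stand-in)
    threshold: float                            # surprise gate level

    def run(self, frames: Iterable[np.ndarray]) -> Iterator[str]:
        """Yield an alert whenever prediction error exceeds the threshold."""
        prev: Optional[Latent] = None
        for frame in frames:
            observed = self.encode(frame)
            if prev is not None:
                predicted = self.predict_next(prev)
                error = float(np.sum((observed - predicted) ** 2))
                if error > self.threshold:
                    yield self.verbalize(predicted, observed)
            prev = observed

# Stand-in components, chosen only to make the sketch runnable.
loop = ProactiveLoop(
    encode=lambda frame: frame.ravel().astype(float),
    predict_next=lambda z: z,  # persistence: "the next latent equals this one"
    verbalize=lambda p, o: f"scene changed (error={float(np.sum((o - p) ** 2)):.1f})",
    threshold=0.5,
)
frames = [np.zeros((2, 2)), np.zeros((2, 2)), np.ones((2, 2))]
alerts = list(loop.run(frames))
print(alerts)
```

Nothing in the loop mentions a prompt: the only input is the stream, and the only trigger is the gap between predicted and observed latents.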

The interesting parts are not at the boxes — they're at the arrows:

  • The threshold is a calibration problem. Too low = alert fatigue. Too high = miss the things you actually care about. Probably needs to be learned per-scene, conditioned on time-of-day, occupant, etc.
  • The metric matters enormously. L2 in latent space treats all dimensions equally; a knife on the floor and a shadow on the floor can produce comparable L2 surprise even though they differ enormously in how much they matter. You probably want a metric weighted by downstream task or by concept salience.
  • The verbalizer's prior matters too. Most surprise events are uninteresting (lighting flicker, someone walked through). The verbalizer needs an implicit "what's worth saying" filter — closer to a journalism model than a captioning model.
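
The first two points can be made concrete with a small sketch. Everything here is an assumption made for illustration: the diagonal weighting, the rolling-quantile gate, the `(scene, time-of-day)` context key, and all parameter values. A learned threshold would replace the quantile in a real system.

```python
from collections import defaultdict, deque

import numpy as np

def weighted_surprise(predicted, observed, weights):
    """Squared error with per-dimension weights (a diagonal metric)."""
    diff = observed - predicted
    return float(np.sum(weights * diff * diff))

class ContextualThreshold:
    """Keeps recent surprise scores per context and gates on a high quantile."""

    def __init__(self, quantile=0.99, window=1000, min_history=20):
        self.quantile = quantile
        self.min_history = min_history
        self.history = defaultdict(lambda: deque(maxlen=window))

    def should_alert(self, context, score):
        past = self.history[context]
        # Decide against the history seen so far, then record the new score.
        enough = len(past) >= self.min_history
        alert = enough and score > np.quantile(list(past), self.quantile)
        past.append(score)
        return bool(alert)

# Equal L2 distance, different weighted surprise (weights are hypothetical).
w = np.array([1.0, 0.1])
knife_like = weighted_surprise(np.zeros(2), np.array([1.0, 0.0]), w)
shadow_like = weighted_surprise(np.zeros(2), np.array([0.0, 1.0]), w)

gate = ContextualThreshold(quantile=0.95, min_history=20)
rng = np.random.default_rng(0)
for score in rng.uniform(0.0, 1.0, size=50):
    gate.should_alert(("kitchen", "morning"), float(score))

hot = gate.should_alert(("kitchen", "morning"), 5.0)  # far above this context's history
cold = gate.should_alert(("kitchen", "night"), 5.0)   # no history yet, gate stays quiet
print(knife_like, shadow_like, hot, cold)
```

Note that this gate records every score, including the alarming ones, so the baseline itself moves with whatever it observes.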

The deep version: surprise as the only signal

The really clean framing — straight from predictive coding / free-energy literature:

Action and attention are both consequences of minimizing prediction error. Speech is one form of action. A proactive VLM speaks when prediction error exceeds a learned threshold.
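
One way to write that rule down, as a sketch only. The notation is introduced here for illustration and is not from the predictive-coding literature verbatim: the squared-error form and the context-dependent threshold are assumptions.

```latex
% Illustrative notation (assumed, not from the original post):
% z_t: observed latent at time t;  \hat{z}_t: latent predicted from the past;
% c_t: scene context (time of day, occupant, ...);  \tau: learned threshold.
\hat{z}_t = f_\theta(z_{<t}, c_t), \qquad
e_t = \lVert z_t - \hat{z}_t \rVert^2, \qquad
\mathrm{speak}_t = \mathbf{1}\!\left[\, e_t > \tau(c_t) \,\right]
```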

In this framing the system has no objectives other than predicting its own next latent. Everything else — anomaly detection, alerting, captioning — is downstream of the same signal.

It's a strong claim, and I'm not sure it survives scrutiny in practice. But the elegance is real: every existing component (JEPA, continuous-time dynamics from blog 02, an LLM head, episodic memory) is already trained or trainable in the service of next-latent prediction. There's no extra "anomaly head" to engineer.

Failure modes to take seriously

  • Alert fatigue. By any naive surprise metric, most of life is an "anomaly": lighting, weather, normal motion variance. Generate 10× more "proactive alerts" and fewer of them get read. This is a signal-to-noise problem before it's a modeling problem.
  • Ontology drift. "Normal for this scene" needs to update — a new baby in the family, a renovation, seasonal change. But it can't update so fast that genuinely novel threats get absorbed into the new normal in two weeks. Catastrophic forgetting of safety.
  • Coverage of the long tail. Surprise-based triggering inherits all of the vision encoder's blind spots, CLIP's included. If the encoder has never seen a particular hazardous object class, "high surprise" carries no usable meaning: it just routes to a verbalizer that doesn't have the word either.
  • Adversarial normalcy. A scene operator who wants to evade detection can drift the baseline gradually until something dangerous looks normal. Surprise-based systems need adversarial robustness baked in, not bolted on.
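
The tension between ontology drift and adversarial normalcy can be illustrated with a two-timescale baseline. This is one possible mitigation, not something the argument above commits to, and it is far from real adversarial robustness. `GuardedBaseline`, its update rates, and `max_drift` are all hypothetical.

```python
import numpy as np

class GuardedBaseline:
    """Fast-adapting baseline, checked against a slowly moving anchor."""

    def __init__(self, initial, fast_rate=0.2, slow_rate=0.001, max_drift=1.0):
        self.fast = np.asarray(initial, dtype=float).copy()
        self.anchor = self.fast.copy()
        self.fast_rate = fast_rate
        self.slow_rate = slow_rate
        self.max_drift = max_drift

    def update(self, observed):
        """Blend the observation into both baselines; report excessive drift."""
        observed = np.asarray(observed, dtype=float)
        self.fast += self.fast_rate * (observed - self.fast)
        self.anchor += self.slow_rate * (observed - self.anchor)
        drift = float(np.linalg.norm(self.fast - self.anchor))
        return drift > self.max_drift

# A slow ramp: each step moves the scene by only 0.05, so no single frame
# looks surprising to the fast baseline, yet the total shift is large.
baseline = GuardedBaseline(initial=[0.0], max_drift=1.0)
flags = [baseline.update([0.05 * t]) for t in range(1, 61)]
print("first flagged step:", flags.index(True) + 1)
```

The fast baseline lets "normal" follow a renovation or a new baby; the anchor is what notices that normal has moved a long way in small steps. Choosing how far is too far is the same calibration problem as the alert threshold.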

The thesis

The proactive VLM isn't a new architecture so much as wiring a verbalizer onto the surprise signal of a predictive world model. The hard problems are calibration, scene-conditional priors, and adversarial robustness — not "do we even know how to build it."