What does a proactive VLM actually look like?

Ask today's VLM "is there a knife on the floor?" and it will usually answer correctly. It will never tell you, unprompted, that the baby is crawling toward it. That gap, between answering and noticing, is the whole problem.

The model is reactive: you enumerate the world for it in advance. Even looping "is everything safe?" doesn't fix it — you're still defining the safety surface upfront. The cognitive work of deciding what to look for still lives with the human.

A truly proactive model would watch, learn how the scene usually unfolds, and speak up on its own when it sees things heading somewhere they rarely go. No prompt list. No question loop.

The question is what that actually looks like as a system.

Why prompt loops aren't proactive

The "loop the safety prompt every N seconds" trick has a fundamental ceiling: every loop only checks the closure of the prompt vocabulary, that is, the set of situations your prompts can express. If your prompt list doesn't anticipate "baby is crawling toward a dropped knife," the loop won't catch it, no matter how often you ask.

More prompts = bigger closure, but the closure is still upfront-defined, and adversarial cases are always one step outside it. This is the same failure pattern that fixed-vocabulary detectors had pre-CLIP: enumerating classes fundamentally can't keep up with the long tail.

The whole point of being proactive is to invert the direction of inquiry: model → human, not human → model.
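To make the closure argument concrete, here is a deliberately tiny sketch. The `prompt_loop` function is a hypothetical stand-in for "ask the VLM each question in the list"; it assumes the model answers every question perfectly, which is the generous case. The scene facts and prompt strings are made up for illustration.

```python
def prompt_loop(scene_facts: set[str], prompt_list: list[str]) -> list[str]:
    """Hypothetical stand-in for a VLM question loop.

    Assumes a perfect answerer: a prompt is flagged exactly when the
    thing it asks about is present in the scene.
    """
    return [p for p in prompt_list if p in scene_facts]


# The safety surface, defined upfront by a human.
prompts = ["knife on floor", "stove left on", "door open"]

# What is actually happening in the scene.
scene = {"knife on floor", "baby crawling toward knife"}

flagged = prompt_loop(scene, prompts)   # what the loop can report
missed = scene - set(prompts)           # what no amount of looping will report

print(flagged)  # ['knife on floor']
print(missed)   # {'baby crawling toward knife'}
```

Running the loop more often changes nothing here: the missed fact is outside the prompt list, so it is outside what the loop can ever say.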

What "proactive" decomposes into

At its core it's one shift: the model should reason from a predictive state, not a per-frame one. Four steps:

Hold a predictive state of the scene. Not features of the current frame — a latent that captures where things are and where they're heading: the person mid-stride, the pan heating, the baby crawling toward a knife on the floor. A state you can roll forward — a world model in the sense of blogs 01 and 02.
Predict trajectories in latent space. Given the current state, roll it forward as a continuous-time latent trajectory — the Neural ODE move (Chen et al., 2018): learn the dynamics dz/dt = f(z, t) and integrate. I've built exactly this kind of latent-trajectory model — Mixed-Effects Neural ODE for the dynamics of longitudinal data, and Variational Sampling of Temporal Trajectories for learning a distribution over trajectories instead of a single path. That distribution is the point: you don't want one predicted future, you want the cone of plausible ones.
Ask whether the trajectory is a usual one. A distribution over trajectories hands you the test for free — a likelihood. Ordinary evolutions (someone walks through, the pan heats normally) score high; the baby's path that ends at a knife on the floor is low-likelihood — an outlier in trajectory space. This is precisely the out-of-distribution / abnormal-trajectory detection that Variational Sampling of Temporal Trajectories was built for. No separate anomaly head, no hand-listed hazards.
Speak when it's unusual — without being asked. That's the proactive part, and it's the easy part once the first three exist. The model isn't answering "is there a knife?" It's noticing that the way this scene is evolving is one it rarely sees, and flagging it on its own.

That's the entire inversion. A reactive system waits for your question about a state. A proactive one watches the trajectory of the scene and speaks the moment that trajectory stops looking like the ones it knows.
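The first three steps can be sketched in a few lines of plain Python. This is a toy under loud assumptions, not the Mixed-Effects Neural ODE or Variational Sampling of Temporal Trajectories models mentioned above: `dynamics` is a hand-written stand-in for a learned f(z, t), the "distribution over trajectories" is just noisy Euler rollouts, and the likelihood is a per-step diagonal Gaussian fitted to those rollouts. All names and constants are hypothetical.

```python
import math
import random


def dynamics(z, t):
    """Hypothetical stand-in for a learned f(z, t): drift toward a resting point."""
    rest = (1.0, 0.0)
    return tuple(-0.5 * (zi - ri) for zi, ri in zip(z, rest))


def rollout(z0, rng, steps=20, dt=0.1, noise=0.05):
    """Euler-integrate dz/dt = f(z, t), adding Gaussian noise at each step
    so that repeated calls sample different plausible futures."""
    z, path = tuple(z0), []
    for k in range(steps):
        dz = dynamics(z, k * dt)
        z = tuple(
            zi + dt * di + noise * math.sqrt(dt) * rng.gauss(0, 1)
            for zi, di in zip(z, dz)
        )
        path.append(z)
    return path


def trajectory_nll(observed, samples):
    """Negative log-likelihood of `observed` under per-step diagonal Gaussians
    fitted to the sampled trajectories (a crude density over the cone)."""
    n, nll = len(samples), 0.0
    for k, z_obs in enumerate(observed):
        for d, x in enumerate(z_obs):
            vals = [s[k][d] for s in samples]
            mean = sum(vals) / n
            var = sum((v - mean) ** 2 for v in vals) / (n - 1) + 1e-6
            nll += 0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
    return nll


z0 = (0.0, 0.0)
rng = random.Random(42)
cone = [rollout(z0, rng) for _ in range(200)]  # the cone of plausible futures

usual = rollout(z0, random.Random(7))          # one more ordinary evolution
# An unusual evolution: the second latent coordinate drifts steadily away.
unusual = [(z[0], z[1] + 0.05 * (k + 1)) for k, z in enumerate(usual)]

nll_usual = trajectory_nll(usual, cone)
nll_unusual = trajectory_nll(unusual, cone)
print(nll_unusual > nll_usual)  # True: the drifting path is the outlier
```

Step four is then a comparison of `nll_unusual` against a threshold. Where that threshold comes from is the hard part, not the comparison.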

Architecture sketch (the minimum viable proactive VLM)

[Figure: a video stream feeds a vision encoder, a latent dynamics model rolls the scene state forward as a trajectory, an outlier gate checks whether that trajectory is low-likelihood, and a notifier LLM produces a natural-language alert. Caption: The minimum viable proactive VLM, a notifier wired onto the likelihood of a predictive world model's trajectory.]
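As a rough illustration of how the four boxes in the figure wire together, here is a minimal sketch. Every name in it (`ProactivePipeline`, `encode`, `score`, `describe`, `window`) is hypothetical, and the components are passed in as plain callables so that real models could be swapped in. The toy lambdas in the usage are placeholders, not real encoders or LLMs.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Latent = Sequence[float]


@dataclass
class ProactivePipeline:
    encode: Callable[[object], Latent]           # vision encoder: frame -> latent
    score: Callable[[Sequence[Latent]], float]   # dynamics model: trajectory -> surprise
    describe: Callable[[Sequence[Latent]], str]  # notifier LLM: trajectory -> alert text
    threshold: float                             # outlier gate opens above this surprise
    window: int = 8                              # trajectory length the gate looks at

    def __post_init__(self):
        self.history: list = []

    def step(self, frame) -> Optional[str]:
        """Consume one frame; return an alert only when the outlier gate opens."""
        self.history.append(self.encode(frame))
        recent = self.history[-self.window:]
        if len(recent) < self.window:
            return None  # not enough context to judge a trajectory yet
        if self.score(recent) <= self.threshold:
            return None  # ordinary evolution: stay silent
        return self.describe(recent)


# Toy usage: frames are numbers, surprise is net displacement over the window.
pipe = ProactivePipeline(
    encode=lambda frame: [float(frame)],
    score=lambda traj: abs(traj[-1][0] - traj[0][0]),
    describe=lambda traj: f"unusual motion: {traj[0][0]:.1f} -> {traj[-1][0]:.1f}",
    threshold=2.0,
    window=3,
)
alerts = [pipe.step(f) for f in [0.0, 0.1, 0.2, 0.3, 5.0]]
print(alerts)  # [None, None, None, None, 'unusual motion: 0.2 -> 5.0']
```

Note that nothing calls `describe` unless the gate opens. The notifier is downstream of the likelihood, which is the whole design.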

Building those components is the easy part. The hard problems are three design decisions:

The threshold is a calibration problem. How unlikely is unlikely enough to interrupt someone? Too loose = alert fatigue; too tight = you miss what matters. It almost certainly has to be learned per-scene, conditioned on time-of-day, occupant, and so on.
What "unlikely" means matters enormously. A raw likelihood treats every low-probability trajectory the same — a knife sliding to the floor and a shadow sweeping across it can be equally rare yet wildly different in importance. You want the outlier score weighted by downstream stakes or concept salience, not just probability mass.
The notifier needs a prior on what's worth saying. Most unusual trajectories are still boring — the light flickered, someone you know walked through. The model needs an implicit editorial filter, closer to a journalist deciding what's newsworthy than a captioner describing a frame.
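The first two decisions can at least be prototyped cheaply. The sketch below is one possible shape, under assumptions: the threshold is an empirical quantile of past surprise scores kept separately per context (for example a scene and time-of-day pair), and "stakes" is a number in [0, 1] that some salience model would have to supply. `ContextualThreshold`, `weighted_surprise`, and the numbers are all hypothetical. In a real system the history would hold the same weighted score that is later compared against the threshold.

```python
import math
from collections import defaultdict


class ContextualThreshold:
    """Per-context alert threshold set at an empirical quantile of past scores."""

    def __init__(self, quantile=0.99, min_history=50):
        self.quantile = quantile
        self.min_history = min_history
        self.scores = defaultdict(list)

    def observe(self, context, score):
        self.scores[context].append(score)

    def threshold(self, context):
        hist = sorted(self.scores[context])
        if len(hist) < self.min_history:
            # Design choice for this sketch: with too little data, never interrupt.
            return math.inf
        idx = min(len(hist) - 1, int(self.quantile * len(hist)))
        return hist[idx]


def weighted_surprise(nll, stakes):
    """Scale raw surprise by a stakes weight in [0, 1] (assumed to come from
    a separate salience model that this sketch does not implement)."""
    return nll * stakes


gate = ContextualThreshold(quantile=0.95, min_history=20)
for i in range(100):
    gate.observe(("kitchen", "evening"), float(i))  # toy history of scores 0..99

tau = gate.threshold(("kitchen", "evening"))
knife = weighted_surprise(120.0, stakes=0.9)    # rare and high-stakes
shadow = weighted_surprise(120.0, stakes=0.05)  # equally rare, low-stakes

print(knife > tau, shadow > tau)  # True False
print(gate.threshold(("garage", "night")))  # inf: no history for this context yet
```

The third decision, the editorial prior, does not reduce to a formula this easily, which is part of why it is hard.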

The deep version: prediction is the only objective

The cleanest framing — straight from predictive-coding / free-energy ideas:

Action and attention are both consequences of minimizing prediction error. Speaking is just another action. A proactive VLM speaks when its realized trajectory lands far enough into the tail of what it predicted.

In this view the system has no objective other than predicting its own latent trajectory. Everything downstream — outlier detection, alerting, describing — falls out of that one signal.

It's a strong claim, and I'm not sure it survives contact with practice. But the elegance is real: every component you'd need (JEPA, the continuous-time dynamics from blog 02, a distribution over trajectories, an LLM head, episodic memory) is already trained — or trainable — in the service of predicting the next stretch of trajectory. There's no separate "anomaly head" to bolt on.
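One way to write the single-signal view down, as a sketch rather than a derivation: let z be the latent state from the dynamics above, and let the model predict a distribution over trajectories. The window length H, the threshold τ, and the parameter symbol θ are notation introduced here for illustration.

```latex
\begin{aligned}
S_t &= -\log p_\theta\!\left(z_{t-H:t} \mid z_{t-H}\right)
  && \text{(surprise of the realized trajectory)} \\
\theta^\star &= \arg\min_\theta \; \mathbb{E}_t\!\left[ S_t \right]
  && \text{(the only training objective)} \\
\mathrm{speak}_t &= \mathbb{1}\!\left[ S_t > \tau \right]
  && \text{(the only trigger)}
\end{aligned}
```

The same quantity S_t is what training pushes down on average and what the gate reads at run time. That is the sense in which everything downstream falls out of one signal.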

Failure modes to take seriously

Alert fatigue. By any naive likelihood threshold, most of life is an outlier — lighting, weather, ordinary motion variance. The first time you 10× the alerts, people stop reading them. This is a signal-to-noise problem before it's a modeling one.
Distribution drift. "Usual for this scene" has to move — a new baby in the family, a renovation, seasonal change. But it can't move so fast that a genuinely dangerous trajectory gets absorbed into the new normal within two weeks. Catastrophic forgetting, but for safety.
Coverage of the long tail. Trajectory-likelihood triggering inherits every blind spot of the encoder underneath it. If the latent space can't represent a hazard, a trajectory passing through it isn't "unlikely" — it's invisible, and it routes to a notifier that has no word for it either.
Adversarial normalcy. An operator who wants to evade detection can drift the baseline gradually until a dangerous trajectory looks ordinary. Outlier-based systems need adversarial robustness baked in, not bolted on.
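The drift and adversarial-normalcy failure modes can be shown with a toy baseline. The sketch below assumes "usual" is tracked as an exponential moving average of a scalar score, and it adds one possible mitigation that is my own illustration rather than a known fix: a frozen anchor that the adaptive baseline may not wander far from. `AnchoredBaseline`, `count_flags`, and every constant are hypothetical.

```python
class AnchoredBaseline:
    """Adaptive 'usual' level that tracks drift slowly but cannot move
    more than `max_shift` away from a frozen anchor."""

    def __init__(self, anchor, rate=0.01, max_shift=0.5):
        self.anchor = anchor        # baseline frozen at deployment (assumed trusted)
        self.level = anchor         # current adaptive baseline
        self.rate = rate            # EMA step size: small means slow adaptation
        self.max_shift = max_shift  # hard cap on |level - anchor|

    def update(self, value):
        self.level += self.rate * (value - self.level)
        lo, hi = self.anchor - self.max_shift, self.anchor + self.max_shift
        self.level = min(hi, max(lo, self.level))
        return self.level

    def is_outlier(self, value, margin=1.0):
        return abs(value - self.level) > margin


def count_flags(max_shift, steps=300, slope=0.01):
    """Feed a slowly ramping score and count how often it is flagged."""
    baseline = AnchoredBaseline(anchor=0.0, rate=0.05, max_shift=max_shift)
    flags = 0
    for step in range(1, steps + 1):
        value = slope * step
        flags += baseline.is_outlier(value)  # judge first, then adapt
        baseline.update(value)
    return flags


unanchored = count_flags(max_shift=float("inf"))  # baseline free to follow the ramp
anchored = count_flags(max_shift=0.5)             # baseline capped near its anchor

print(unanchored)    # 0: the slow ramp is absorbed into the new normal
print(anchored > 0)  # True: the cap lets the ramp eventually register
```

The cap trades one failure for another: it resists gradual evasion, but it also refuses legitimate change such as the renovation or the new baby, so the anchor itself would need a trusted update path.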

The thesis

A proactive VLM is possible — but it isn't a better VLM, it's a different job description. Today's models answer questions about what is. A proactive one lives in what might be: at every moment it carries many predicted futures for the scene, and the real work is triage — deciding which of those futures are worth saying out loud, and how likely they are, before it ever speaks.

That reframing is the contribution. Not a new head bolted onto a captioner, but a system whose core loop is predict many futures → judge which are both likely enough and important enough → notify on those. The hard parts — calibration, scene-conditional priors, adversarial robustness — are engineering on a framework we already know how to build, not open questions about whether it can exist at all.