Side Project: The 32ms Barrier: Why Real-Time Multimodal AI Is Still Failing

Voice-driven real-time video generation projector concept

Motivation

I love sci-fi and film, but imagining a scene purely from audio can be surprisingly hard, especially when the story is told quickly or in a language I’m not fluent in. This project is my attempt to make “spoken stories” instantly visual: you talk, and a projector plays a cinematic sequence in near real time.

The Product Goal

The hardest constraint is latency. If the system takes 2–3 minutes to respond, it stops feeling like a projector and starts feeling like offline generation. The goal is to keep the loop interactive:

  • Streaming input: mic audio captured continuously
  • Fast understanding: incremental ASR + lightweight scene parsing
  • Stable visuals: avoid style/identity drift while the story evolves
  • Projector-ready output: consistent resolution, frame pacing, and audio-video sync

System Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                          Voice → Video Projector                          │
├──────────────────────────────────────────────────────────────────────────┤
│ Mic → VAD → Streaming ASR → Scene Planner (LLM) → Shot List               │
│                         │                         │                       │
│                         │                         ├─► Visual Spec (style) │
│                         │                         └─► Timing (beats)      │
│                         ▼                                                 │
│                 Context Store (characters / locations / props)            │
│                         │                                                 │
│                         ▼                                                 │
│               Keyframes (image gen) → Motion (video gen)                  │
│                         │                                                 │
│                         ▼                                                 │
│             Post-process (upscale, color, fps, captions)                  │
│                         │                                                 │
│                         ▼                                                 │
│                     Render Loop → Projector Output                        │
└──────────────────────────────────────────────────────────────────────────┘
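
The Context Store in the diagram is what keeps recurring characters, locations, and props stable across shots. A minimal sketch of the idea, assuming plain text descriptions per entity (field and method names are illustrative, not a fixed API):

from dataclasses import dataclass, field

@dataclass
class ContextStore:
    # name -> canonical description, re-injected into every downstream prompt
    characters: dict[str, str] = field(default_factory=dict)
    locations: dict[str, str] = field(default_factory=dict)
    props: dict[str, str] = field(default_factory=dict)
    style: str = "cinematic, 35mm, soft key light"   # shared visual spec

    def describe(self, name: str) -> str:
        # Fall back to the raw name if the entity hasn't been registered yet
        return (self.characters.get(name)
                or self.locations.get(name)
                or self.props.get(name)
                or name)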

Key Design Choices

1) Streaming ASR + VAD

Voice input is segmented with VAD (voice activity detection) so the system can react to “beats” rather than waiting for a full paragraph. The ASR layer emits partial transcripts; I treat these as tentative and replan when a segment stabilizes.
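
A minimal sketch of that “tentative until stable” rule, assuming the streaming ASR hands back (text, is_final) updates per VAD segment (the interface is an assumption, not a specific library):

import time

class SegmentStabilizer:
    # Treat partial hypotheses as tentative; commit once they stop changing.
    def __init__(self, stable_after_s: float = 0.6):
        self.stable_after_s = stable_after_s
        self._text = ""
        self._last_change = time.monotonic()

    def update(self, text: str, is_final: bool) -> str | None:
        # Returns a committed segment, or None while the hypothesis is still moving.
        now = time.monotonic()
        if text != self._text:
            self._text = text
            self._last_change = now
        if is_final or (self._text and now - self._last_change >= self.stable_after_s):
            committed, self._text = self._text, ""
            return committed
        return None

A lower threshold replans faster but churns more; I’d expect to tune it against the scene-planning latency in the budget below.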

2) Scene Planning: from transcript to shots

The LLM’s job isn’t to write prose; it’s to produce a constrained shot plan that downstream models can execute (one possible schema is sketched after this list):

  • Shot type: wide / medium / close-up
  • Subject: character + attributes (clothing, mood)
  • Environment: location, time, lighting
  • Action: what changes across frames
  • Camera: pan / dolly / handheld feel
  • Duration: seconds and transition style
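
For concreteness, here is one way the plan could be typed; the field names are my guess at a workable schema, not a fixed spec:

from dataclasses import dataclass, field

@dataclass
class Shot:
    shot_type: str                    # "wide" | "medium" | "close-up"
    subject: str                      # character + attributes (clothing, mood)
    environment: str                  # location, time of day, lighting
    action: str                       # what changes across frames
    camera: str                       # "pan" | "dolly" | "handheld"
    duration_s: float                 # seconds
    transition: str = "cut"           # how this shot hands off to the next
    style_tags: list[str] = field(default_factory=list)   # ties back to the visual spec

Keeping the plan this constrained is deliberate: a small schema is easy to validate, easy to cache on, and hard for the LLM to wander away from.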

3) Keyframes first, motion second

The pipeline is more stable when it generates strong keyframes (a few anchor images) and then animates between them. Keyframes reduce identity drift and give the motion model a clear target.
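
Sketched as code, the ordering looks something like this; generate_image and animate are placeholders for whichever image and video models get wired in, and Shot / ContextStore are the sketches above:

def render_shot(shot, context, generate_image, animate, n_keyframes: int = 2):
    # Anchor images pin identity and style before any motion is generated.
    base = (f"{shot.shot_type} shot of {context.describe(shot.subject)} "
            f"in {shot.environment}, {shot.action}")
    keyframes = [generate_image(f"{base}, beat {i + 1} of {n_keyframes}")
                 for i in range(n_keyframes)]
    # The motion model fills in between consecutive anchors, so it always has a
    # concrete start and end target instead of free-running from text alone.
    seconds_per_clip = shot.duration_s / max(1, n_keyframes - 1)
    return [animate(start=a, end=b, seconds=seconds_per_clip)
            for a, b in zip(keyframes, keyframes[1:])]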

4) A projector is a display system, not a batch renderer

A real-time projector needs predictable frame pacing. I run a render loop that always has something to display: if the next clip isn’t ready, it can loop the last shot, crossfade, or show an in-universe “loading” cut.
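
A sketch of that loop, assuming a fixed 24 fps output and a clip object that can keep looping its last frames (the projector and clip interfaces are placeholders):

import queue
import time

def render_loop(projector, clip_queue, fps: int = 24):
    frame_interval = 1.0 / fps
    current = None                      # last fully rendered clip, reused as fallback
    while True:
        tick = time.monotonic()
        try:
            current = clip_queue.get_nowait()       # swap in new content when it exists
        except queue.Empty:
            pass                                    # keep showing what we already have
        if current is not None:
            projector.show(current.next_frame(loop=True))   # loop/crossfade if starved
        else:
            projector.show_placeholder()            # in-universe "loading" cut
        # Sleep the rest of the frame budget so pacing stays predictable.
        time.sleep(max(0.0, frame_interval - (time.monotonic() - tick)))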

Latency Budget (Practical)

The real trick is scheduling. Generation is spiky: one slow step can stall the whole UX. I treat each stage as a queue and keep a small buffer of upcoming shots.

Stage            What it does                        Latency target
VAD + ASR        Partial + stabilized transcript     0.2–1.0s
Scene planning   Transcript → shot plan JSON         0.3–1.5s
Keyframes        Shot → anchor images                1–5s
Motion           Animate to short video clips        3–15s
Post-process     Upscale / fps / color / captions    0.5–3s
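
The buffering itself can be as boring as bounded queues with one worker per stage; the sizes below are guesses to illustrate back-pressure, not measured values, and the step functions are hypothetical wrappers around the model calls in the table:

import queue
import threading

shot_q = queue.Queue(maxsize=4)       # planned shots waiting for keyframes
keyframe_q = queue.Queue(maxsize=3)   # keyframes waiting for motion
clip_q = queue.Queue(maxsize=2)       # clips waiting for post-process / render

def stage_worker(pull_q, push_q, step):
    # Generic stage: block on input, do one unit of work, block if downstream is full.
    while True:
        item = pull_q.get()
        push_q.put(step(item))        # the bounded put() is the back-pressure

# threading.Thread(target=stage_worker, args=(shot_q, keyframe_q, make_keyframes), daemon=True).start()
# threading.Thread(target=stage_worker, args=(keyframe_q, clip_q, make_motion), daemon=True).start()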

Implementation Notes

This is the kind of system where “glue code” matters more than any single model. It needs state management (characters + style), retries, caching, and graceful degradation.

# Pseudo-loop (object names are placeholders, not a specific library API)
# - always keep a few seconds buffered for the projector
# - treat every stage as streaming / incremental

while running:
    audio_chunk = mic.read()
    vad_events = vad.update(audio_chunk)              # segments speech into "beats"
    asr_partial, asr_final = asr.update(audio_chunk)  # partials are tentative

    if asr_final:                                     # only stabilized segments trigger planning
        shots = planner.plan(asr_final, state=context)
        for shot in shots:
            keyframe_queue.enqueue(shot)

    # each worker does at most one unit of work per tick, so the loop never blocks
    keyframes = keyframe_worker.try_generate(keyframe_queue)
    clips = motion_worker.try_generate(keyframes)
    ready_frames = postprocess.try_finalize(clips)

    # fallback_frames: loop or crossfade the last rendered shot when nothing is ready
    projector.render(ready_frames or fallback_frames)
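
For the retries and graceful degradation, a wrapper like the one below is the shape I have in mind; the names are illustrative and the cache can be as simple as a dict keyed by shot:

def generate_with_fallback(step, item, cache, key, retries: int = 2):
    # Retry a flaky generation step a couple of times, then fall back to the
    # last cached result so the projector never blocks on a failed call.
    for attempt in range(retries + 1):
        try:
            result = step(item)
            cache[key] = result
            return result
        except Exception:
            if attempt == retries:
                break
    return cache.get(key)             # may be None; the render loop then shows fallback frames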

What’s Next

  • Better continuity: longer-term memory for character identity and locations
  • Interactive edits: “make it darker”, “change camera angle”, “slow down”
  • Audio-to-action alignment: synchronize motion beats to speech rhythm
  • Reliability: robust failure recovery so the projector never freezes