Side Project: The 32ms Barrier: Why Real-Time Multimodal AI Is Still Failing

Voice-driven real-time video generation projector concept

Motivation

I love sci-fi and film, but imagining a scene purely from audio can be surprisingly hard, especially when the story is told quickly or in a language I’m not fluent in. This project is my attempt to make “spoken stories” instantly visual: you talk, and a projector plays a cinematic sequence in near real time.

The Product Goal

The hardest constraint is latency. If the system takes 2–3 minutes to respond, it stops feeling like a projector and starts feeling like offline generation. The goal is to keep the loop interactive:

  • Streaming input: mic audio captured continuously
  • Fast understanding: incremental ASR + lightweight scene parsing
  • Stable visuals: avoid style/identity drift while the story evolves
  • Projector-ready output: consistent resolution, frame pacing, and audio-video sync

System Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                          Voice → Video Projector                          │
├──────────────────────────────────────────────────────────────────────────┤
│ Mic → VAD → Streaming ASR → Scene Planner (LLM) → Shot List               │
│                         │                         │                       │
│                         │                         ├─► Visual Spec (style) │
│                         │                         └─► Timing (beats)      │
│                         ▼                                                 │
│                 Context Store (characters / locations / props)            │
│                         │                                                 │
│                         ▼                                                 │
│               Keyframes (image gen) → Motion (video gen)                  │
│                         │                                                 │
│                         ▼                                                 │
│             Post-process (upscale, color, fps, captions)                  │
│                         │                                                 │
│                         ▼                                                 │
│                     Render Loop → Projector Output                        │
└──────────────────────────────────────────────────────────────────────────┘
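
The Context Store in the diagram is what keeps recurring characters, locations, and props stable across shots. A minimal sketch of the idea, assuming plain text descriptions per entity (field and method names are illustrative, not a fixed API):

from dataclasses import dataclass, field

@dataclass
class ContextStore:
    # name -> canonical description, re-injected into every downstream prompt
    characters: dict[str, str] = field(default_factory=dict)
    locations: dict[str, str] = field(default_factory=dict)
    props: dict[str, str] = field(default_factory=dict)
    style: str = "cinematic, 35mm, soft key light"   # shared visual spec

    def describe(self, name: str) -> str:
        # Fall back to the raw name if the entity hasn't been registered yet
        return (self.characters.get(name)
                or self.locations.get(name)
                or self.props.get(name)
                or name)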

Key Design Choices

1) Streaming ASR + VAD

Voice input is segmented with VAD (voice activity detection) so the system can react to “beats” rather than waiting for a full paragraph. The ASR layer emits partial transcripts; I treat these as tentative and replan when a segment stabilizes.
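
A minimal sketch of that “tentative until stable” rule, assuming the streaming ASR hands back (text, is_final) updates per VAD segment (the interface is an assumption, not a specific library):

import time

class SegmentStabilizer:
    # Treat partial hypotheses as tentative; commit once they stop changing.
    def __init__(self, stable_after_s: float = 0.6):
        self.stable_after_s = stable_after_s
        self._text = ""
        self._last_change = time.monotonic()

    def update(self, text: str, is_final: bool) -> str | None:
        # Returns a committed segment, or None while the hypothesis is still moving.
        now = time.monotonic()
        if text != self._text:
            self._text = text
            self._last_change = now
        if is_final or (self._text and now - self._last_change >= self.stable_after_s):
            committed, self._text = self._text, ""
            return committed
        return None

A lower threshold replans faster but churns more; I’d expect to tune it against the scene-planning latency in the budget below.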

2) Scene Planning: from transcript to shots

The LLM’s job isn’t to write prose; it’s to produce a constrained shot plan that downstream models can execute (one possible schema is sketched after this list):

  • Shot type: wide / medium / close-up
  • Subject: character + attributes (clothing, mood)
  • Environment: location, time, lighting
  • Action: what changes across frames
  • Camera: pan / dolly / handheld feel
  • Duration: seconds and transition style
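
For concreteness, here is one way the plan could be typed; the field names are my guess at a workable schema, not a fixed spec:

from dataclasses import dataclass, field

@dataclass
class Shot:
    shot_type: str                    # "wide" | "medium" | "close-up"
    subject: str                      # character + attributes (clothing, mood)
    environment: str                  # location, time of day, lighting
    action: str                       # what changes across frames
    camera: str                       # "pan" | "dolly" | "handheld"
    duration_s: float                 # seconds
    transition: str = "cut"           # how this shot hands off to the next
    style_tags: list[str] = field(default_factory=list)   # ties back to the visual spec

Keeping the plan this constrained is deliberate: a small schema is easy to validate, easy to cache on, and hard for the LLM to wander away from.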

3) Keyframes first, motion second

The pipeline is more stable when it generates strong keyframes (a few anchor images) and then animates between them. Keyframes reduce identity drift and give the motion model a clear target.
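
Sketched as code, the ordering looks something like this; generate_image and animate are placeholders for whichever image and video models get wired in, and Shot / ContextStore are the sketches above:

def render_shot(shot, context, generate_image, animate, n_keyframes: int = 2):
    # Anchor images pin identity and style before any motion is generated.
    base = (f"{shot.shot_type} shot of {context.describe(shot.subject)} "
            f"in {shot.environment}, {shot.action}")
    keyframes = [generate_image(f"{base}, beat {i + 1} of {n_keyframes}")
                 for i in range(n_keyframes)]
    # The motion model fills in between consecutive anchors, so it always has a
    # concrete start and end target instead of free-running from text alone.
    seconds_per_clip = shot.duration_s / max(1, n_keyframes - 1)
    return [animate(start=a, end=b, seconds=seconds_per_clip)
            for a, b in zip(keyframes, keyframes[1:])]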

4) A projector is a display system, not a batch renderer

A real-time projector needs predictable frame pacing. I run a render loop that always has something to display: if the next clip isn’t ready, it can loop the last shot, crossfade, or show an in-universe “loading” cut.
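
A sketch of that loop, assuming a fixed 24 fps output and a clip object that can keep looping its last frames (the projector and clip interfaces are placeholders):

import queue
import time

def render_loop(projector, clip_queue, fps: int = 24):
    frame_interval = 1.0 / fps
    current = None                      # last fully rendered clip, reused as fallback
    while True:
        tick = time.monotonic()
        try:
            current = clip_queue.get_nowait()       # swap in new content when it exists
        except queue.Empty:
            pass                                    # keep showing what we already have
        if current is not None:
            projector.show(current.next_frame(loop=True))   # loop/crossfade if starved
        else:
            projector.show_placeholder()            # in-universe "loading" cut
        # Sleep the rest of the frame budget so pacing stays predictable.
        time.sleep(max(0.0, frame_interval - (time.monotonic() - tick)))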

Latency Budget (Practical)

The real trick is scheduling. Generation is spiky: one slow step can stall the whole UX. I treat each stage as a queue and keep a small buffer of upcoming shots.

Stage            What it does                        Latency target
VAD + ASR        Partial + stabilized transcript     0.2–1.0s
Scene planning   Transcript → shot plan JSON         0.3–1.5s
Keyframes        Shot → anchor images                1–5s
Motion           Animate to short video clips        3–15s
Post-process     Upscale / fps / color / captions    0.5–3s
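
The buffering itself can be as boring as bounded queues with one worker per stage; the sizes below are guesses to illustrate back-pressure, not measured values, and the step functions are hypothetical wrappers around the model calls in the table:

import queue
import threading

shot_q = queue.Queue(maxsize=4)       # planned shots waiting for keyframes
keyframe_q = queue.Queue(maxsize=3)   # keyframes waiting for motion
clip_q = queue.Queue(maxsize=2)       # clips waiting for post-process / render

def stage_worker(pull_q, push_q, step):
    # Generic stage: block on input, do one unit of work, block if downstream is full.
    while True:
        item = pull_q.get()
        push_q.put(step(item))        # the bounded put() is the back-pressure

# threading.Thread(target=stage_worker, args=(shot_q, keyframe_q, make_keyframes), daemon=True).start()
# threading.Thread(target=stage_worker, args=(keyframe_q, clip_q, make_motion), daemon=True).start()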

Implementation Notes

This is the kind of system where “glue code” matters more than any single model. It needs state management (characters + style), retries, caching, and graceful degradation.

# Pseudo-loop (object names are placeholders, not a specific library API)
# - always keep a few seconds buffered for the projector
# - treat every stage as streaming / incremental

while running:
    audio_chunk = mic.read()
    vad_events = vad.update(audio_chunk)              # segments speech into "beats"
    asr_partial, asr_final = asr.update(audio_chunk)  # partials are tentative

    if asr_final:                                     # only stabilized segments trigger planning
        shots = planner.plan(asr_final, state=context)
        for shot in shots:
            keyframe_queue.enqueue(shot)

    # each worker does at most one unit of work per tick, so the loop never blocks
    keyframes = keyframe_worker.try_generate(keyframe_queue)
    clips = motion_worker.try_generate(keyframes)
    ready_frames = postprocess.try_finalize(clips)

    # fallback_frames: loop or crossfade the last rendered shot when nothing is ready
    projector.render(ready_frames or fallback_frames)
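
For the retries and graceful degradation, a wrapper like the one below is the shape I have in mind; the names are illustrative and the cache can be as simple as a dict keyed by shot:

def generate_with_fallback(step, item, cache, key, retries: int = 2):
    # Retry a flaky generation step a couple of times, then fall back to the
    # last cached result so the projector never blocks on a failed call.
    for attempt in range(retries + 1):
        try:
            result = step(item)
            cache[key] = result
            return result
        except Exception:
            if attempt == retries:
                break
    return cache.get(key)             # may be None; the render loop then shows fallback frames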

What’s Next

  • Better continuity: longer-term memory for character identity and locations
  • Interactive edits: “make it darker”, “change camera angle”, “slow down”
  • Audio-to-action alignment: synchronize motion beats to speech rhythm
  • Reliability: robust failure recovery so the projector never freezes