CVPR 2026

LiVER

Lighting-grounded Video Generation with Renderer-based Agent Reasoning

LiVER turns a coarse 3D scene, camera trajectory, and HDR lighting into physically grounded render passes, then injects these cues into a video diffusion model for controllable photorealistic generation.

Peking University · BAAI · OpenBayes · Beijing University of Posts and Telecommunications

Teaser panels: reference lighting, scene proxy, and generated video.
Abstract

Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation that conditions synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation, and integrates them into a foundational video diffusion model through a lightweight conditioning module and a progressive training strategy that ensure stable convergence and high fidelity. The framework enables a wide range of applications, including image-to-video and video-to-video synthesis in which the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.

Method

Reason in 3D, synthesize in video space.

LiVER uses rendering as the contract between high-level scene intent and generative video synthesis. The agent prepares the scene; the renderer exposes physical cues; the diffusion model turns them into photorealistic frames.
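This division of labor can be sketched as a three-stage flow. The function names, scene-spec fields, and return values below are illustrative stand-ins, not LiVER's actual interfaces:

```python
# Illustrative agent -> renderer -> diffusion flow. All functions are
# hypothetical stubs; none of these signatures come from the paper.

def agent_plan(instruction: str) -> dict:
    """Translate a natural-language instruction into an explicit scene spec."""
    return {
        "layout": ["tower"],                                     # coarse object placement
        "camera": [{"pos": (0, -5, 2), "look_at": (0, 0, 1)}],   # camera trajectory
        "lighting": "night_city.hdr",                            # HDR environment map
    }

def render_passes(scene: dict) -> dict:
    """Expose physical lighting cues as compact render passes."""
    return {name: f"{name}_frames"
            for name in ("diffuse", "rough_ggx", "glossy_ggx")}

def diffusion_generate(passes: dict, prompt: str) -> str:
    """Condition a video diffusion model on the rendered cues."""
    return f"video({prompt}, cues={sorted(passes)})"

scene = agent_plan("a tower at night, slow orbit")
video = diffusion_generate(render_passes(scene), "photorealistic tower")
```

The key design point is that the render passes, not free-form text, carry the lighting and layout information into the generator.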

01

Renderer-based agent

Converts natural scene instructions into layout, camera trajectory, lighting, and renderable scene structure.

02

Lighting-aware proxy

Uses diffuse, rough GGX, and glossy GGX passes to preserve physically meaningful illumination while staying compact.

03

Progressive training

Aligns renderer signals with a video foundation model while preserving temporal stability and visual fidelity.
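The paper's exact schedule is not reproduced here, but a common pattern for progressive conditioning is to warm up with the pretrained backbone frozen, then linearly ramp the weight of the new conditioning branch. A minimal sketch under those assumptions (all phase boundaries are hypothetical):

```python
# Hypothetical progressive-training schedule: ramp the render-cue
# conditioning weight and unfreeze the video backbone late. The warmup,
# ramp, and freeze durations are illustrative, not LiVER's reported recipe.

def conditioning_weight(step: int, warmup: int = 1000, ramp: int = 4000) -> float:
    """Linearly ramp the conditioning-branch weight from 0 to 1 after warmup."""
    if step < warmup:
        return 0.0
    return min(1.0, (step - warmup) / ramp)

def backbone_trainable(step: int, freeze_until: int = 5000) -> bool:
    """Keep the pretrained video backbone frozen early for stable convergence."""
    return step >= freeze_until
```

Freezing the backbone first lets the conditioning module adapt to the renderer signals without degrading the foundation model's temporal priors.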

LiVER overall framework teaser

Pipeline

From agent reasoning to grounded render passes.

LiVER method pipeline
Agent-guided scene construction and video conditioning pipeline.

Scene Proxy

Compact render signals expose lighting behavior.

Rendered scene proxy evolving into the final video.
Diffuse render pass
Diffuse
Rough material render pass
Rough GGX
Glossy material render pass
Glossy GGX
Generated output frame
Output
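In physically based renderers, per-component light passes combine additively in linear HDR space, which is what makes a small set of passes a faithful lighting proxy. A minimal NumPy sketch of recombining the three passes named above (the pass contents here are synthetic constants, and the tonemap is illustrative, not LiVER's):

```python
import numpy as np

# Synthetic HDR passes of shape (H, W, 3); in practice these come from
# the renderer. The constant values are placeholders for real pixel data.
h, w = 4, 4
diffuse    = np.full((h, w, 3), 0.30)  # diffuse reflection pass
rough_ggx  = np.full((h, w, 3), 0.10)  # rough GGX specular lobe
glossy_ggx = np.full((h, w, 3), 0.05)  # sharp GGX specular lobe

# Light passes sum additively in linear HDR space.
beauty = diffuse + rough_ggx + glossy_ggx

# Simple Reinhard tonemap plus gamma for display (illustrative only).
display = (beauty / (1.0 + beauty)) ** (1 / 2.2)
```

Because the decomposition is additive, each pass isolates one illumination behavior, so the diffusion model receives disentangled lighting cues rather than a single baked image.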

Control

Same scene and camera trajectory, different illumination.

Each pair keeps the scene content and viewpoint aligned while swapping the lighting condition, making the relighting effect explicit rather than mixing unrelated setups.

Tower scene

Day lighting
Night lighting

Launch scene

Day lighting
Night lighting

Monument scene

Day lighting
Night lighting

LiVERSet

A lighting-aware video dataset with dense scene annotations.

LiVERSet combines real-world videos and synthetic physically based renders, with scene geometry, HDR environment maps, camera poses, and text descriptions.

11K+ videos
81 frames each
720 × 1280 resolution
10K / 1K train / eval split
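One way to picture a LiVERSet record is as a dictionary bundling the video with its scene annotations. The field names and file layout below are hypothetical; the released dataset may organize its annotations differently:

```python
# Hypothetical LiVERSet sample record. Only the stats (81 frames,
# 720x1280, train/eval split) come from the page; everything else is
# an illustrative guess at how such annotations could be bundled.
sample = {
    "video": "clips/000123.mp4",        # 81 frames at 720 x 1280
    "caption": "a monument at dusk, slow push-in",
    "geometry": "proxies/000123.glb",   # coarse 3D scene proxy
    "env_map": "hdr/000123.exr",        # HDR environment lighting
    "camera": [                          # one pose per frame
        {"frame": i, "extrinsics": [[1, 0, 0, 0],
                                    [0, 1, 0, 0],
                                    [0, 0, 1, 0]]}
        for i in range(81)
    ],
    "split": "train",                    # 10K train / 1K eval
}
```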
LiVERSet data annotation pipeline
Data annotation pipeline for geometry, lighting, and camera cues.
Annotated LiVERSet examples (four clips with geometry, lighting, and camera annotations).

Results

Photorealistic videos with grounded lighting cues.

Generated result 01
Generated result 02
Generated result 03
Generated result 04
Generated result 05

Comparison

Renderer cues help preserve layout, motion, and illumination.

3D-aware baseline
Reference

Citation


@inproceedings{cai2026liver,
  title={Lighting-grounded Video Generation with Renderer-based Agent Reasoning},
  author={Cai, Ziqi and Yang, Taoyu and Chang, Zheng and Li, Si and Jiang, Han and Weng, Shuchen and Shi, Boxin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}