¹Apple
²University of Illinois Urbana-Champaign
Rendering scenes observed in a monocular video from novel viewpoints is a challenging problem. For static scenes, the community has studied both scene-specific optimization techniques, which optimize on every test scene, and generalized techniques, which only run a deep net forward pass on a test scene. In contrast, for dynamic scenes, scene-specific optimization techniques exist, but, to the best of our knowledge, there is currently no generalized method for dynamic novel view synthesis from a given monocular video. We cannot help but ask a natural question:
"Is generalized dynamic novel view synthesis from monocular videos possible today?"
To answer this question, we establish an analysis framework based on existing techniques and work toward the generalized approach. We find:
"A pseudo-generalized approach, i.e., no scene-specific appearance optimization, is possible, but geometrically and temporally consistent depth estimates are needed."
To clarify
Despite no scene-specific appearance optimization, the pseudo-generalized approach improves upon some scene-specific methods.
Below, we present the frames that are used for quantitative evaluation.
Please click each frame to see detailed videos.
On the left, we visualize the rendering camera trajectories:
(1) The green ones are the source cameras;
(2) The grey ones represent the rendering camera trajectories;
(3) The red one represents the current camera, corresponding to the rendering on the right.
Although the rendering trajectories are reasonably far from the input source views, our pseudo-generalized approach performs decently.
The following videos present the spatio-temporal interpolation renderings for other scenes.
The following videos present the spatio-temporal interpolation renderings for DAVIS.
@inproceedings{zhao2024pgdvs,
title = {{Pseudo-Generalized Dynamic View Synthesis from a Video}},
author = {Xiaoming Zhao
and Alex Colburn
and Fangchang Ma
and Miguel Ángel Bautista
and Joshua M. Susskind
and Alexander G. Schwing},
booktitle = {ICLR},
year = {2024},
}
Work done as part of Xiaoming Zhao's internship at Apple.
We thank Zhengqi Li for fruitful discussions and providing rendered images from DVS and NSFF.