Pseudo-Generalized Dynamic View Synthesis from a Video

(Originally titled "Is Generalized Dynamic Novel View Synthesis from Monocular Videos Possible Today?")

ICLR 2024


Xiaoming Zhao1,2,    Alex Colburn1,    Fangchang Ma1,    Miguel Ángel Bautista1,    Joshua M. Susskind1,    Alexander G. Schwing1


1Apple   
2University of Illinois Urbana-Champaign



Rendering scenes observed in a monocular video from novel viewpoints is a challenging problem. For static scenes, the community has studied both scene-specific optimization techniques, which optimize on every test scene, and generalized techniques, which only run a deep-net forward pass on a test scene. In contrast, for dynamic scenes, scene-specific optimization techniques exist, but, to the best of our knowledge, there is currently no generalized method for dynamic novel view synthesis from a given monocular video. This raises a natural question:

"Is generalized dynamic novel view synthesis from monocular videos possible today?"

To answer this question, we establish an analysis framework based on existing techniques and work toward the generalized approach. We find

"A pseudo-generalized approach, i.e., no scene-specific appearance optimization, is possible, but geometrically and temporally consistent depth estimates are needed."

To clarify:

  1. We use the word pseudo because of the required scene-specific consistent depth optimization, which is already utilized in many scene-specific approaches and can be replaced with depth from physical sensors, e.g., an iPhone LiDAR;
  2. We call it generalized because no costly scene-specific appearance fitting is needed.

Despite no scene-specific appearance optimization, the pseudo-generalized approach improves upon some scene-specific methods.



[NVIDIA Dynamic Scenes] Videos from Frames for Quantitative Evaluations

In the following, we present the frames used for quantitative evaluation.
Please click each frame to see the corresponding videos.


[DyCheck iPhone] Videos from Frames for Quantitative Evaluations

In the following, we present the frames used for quantitative evaluation.
Please click each frame to see the corresponding videos.


[NVIDIA Dynamic Scenes] Spatio-temporal Interpolation


(ZERO scene-specific appearance optimization on these scenes)

On the left, we visualize the rendering camera trajectories: (1) green cameras mark the source views; (2) grey cameras trace the rendering camera trajectory; (3) the red camera is the current camera corresponding to the rendering on the right.

Although these trajectories are reasonably far from the input source views, our pseudo-generalized approach still performs decently.


The following videos present the spatio-temporal interpolation renderings for other scenes.


[DAVIS] Spatio-temporal Interpolation


(ZERO scene-specific appearance optimization on these scenes)

The following videos present the spatio-temporal interpolation renderings for DAVIS.



Bibtex

						
			@inproceedings{zhao2024pgdvs,
				title = {{Pseudo-Generalized Dynamic View Synthesis from a Video}},
				author = {Xiaoming Zhao
					and Alex Colburn
					and Fangchang Ma
					and Miguel Ángel Bautista
					and Joshua M. Susskind
					and Alexander G. Schwing},
				booktitle = {ICLR},
				year = {2024},
			}
						
					

Acknowledgements

Work done as part of Xiaoming Zhao's internship at Apple.
We thank Zhengqi Li for fruitful discussions and providing rendered images from DVS and NSFF.