Learning Bilateral Cost Volume for Rolling Shutter Temporal Super-Resolution

TPAMI 2024


Bin Fan, Yuchao Dai, Hongdong Li

School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China   

Abstract


Rolling shutter temporal super-resolution (RSSR), which aims to synthesize intermediate global shutter (GS) video frames between two consecutive rolling shutter (RS) frames, has made remarkable progress with the development of deep convolutional neural networks in recent years. Existing methods cascade multiple separate networks to sequentially estimate intermediate motion fields and synthesize the target GS frames. Nevertheless, they are typically complex, do not facilitate the interaction of complementary motion and appearance information, and suffer from problems such as pixel aliasing or poor interpretability. In this paper, we derive uniform bilateral motion fields for RS-aware backward warping, which endow our network with a more explicit geometric meaning by injecting spatio-temporal consistency information through time-offset embedding. More importantly, we develop a unified, single-stage RSSR pipeline to recover the latent GS video in a coarse-to-fine manner. It first extracts pyramid features from the given inputs, and then refines the bilateral motion fields together with the anchor frame until generating the desired output. With the help of our proposed bilateral cost volume, which uses the anchor frame as a common reference to model the correlation with the two RS frames, the gradually refined anchor frames not only facilitate intermediate motion estimation, but also compensate for contextual details, making additional frame synthesis or refinement networks unnecessary. Meanwhile, an asymmetric bilateral motion model built on top of the symmetric bilateral motion model further improves generality and adaptability, yielding better GS video reconstruction performance. Extensive quantitative and qualitative experiments on synthetic and real data demonstrate that our method achieves new state-of-the-art results.
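The time-offset idea behind the bilateral motion fields can be illustrated with a minimal NumPy sketch. This is our own simplified illustration, not the authors' implementation: it assumes constant velocity between the two RS frames, a readout time equal to the frame interval, and it ignores the cross-row coupling that a full RS geometric model handles; all function and variable names are ours.

```python
import numpy as np

def rs_bilateral_fields(flow_0to1, t, num_rows):
    """Sketch of per-row bilateral motion fields for RS-aware backward warping.

    flow_0to1: (H, W, 2) optical flow from RS frame 0 to RS frame 1,
               treated as a constant velocity over one frame interval.
    t:         normalized time of the target GS anchor frame in [0, 2).
    Returns fields warping the GS anchor back toward RS frames 0 and 1.
    """
    # Normalized exposure times of each scanline (row) of the two RS frames.
    tau0 = np.arange(num_rows) / num_rows   # rows of RS frame 0
    tau1 = 1.0 + tau0                       # rows of RS frame 1
    # Per-row time offsets between the anchor time and each row's readout;
    # embedding these offsets is what makes the warping RS-aware.
    w0 = (t - tau0)[:, None, None]
    w1 = (t - tau1)[:, None, None]
    # Backward-warping fields: displacement is proportional to the offset.
    field_to_0 = -w0 * flow_0to1
    field_to_1 = -w1 * flow_0to1
    return field_to_0, field_to_1
```

Under this toy model, a row exposed exactly at the anchor time gets a zero field, while rows farther from the anchor time receive proportionally larger corrections, which is the geometric intuition the uniform bilateral motion model formalizes.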


Contribution


  • We develop a uniform bilateral motion model for RS-aware frame warping, which is crucial for exploring the intrinsic geometric properties of RSSR tasks.
  • We propose a single-stage pipeline to jointly perform bilateral motion estimation and intermediate frame refinement for efficient GS video extraction.
  • Experiments demonstrate that our approach achieves state-of-the-art results while maintaining a compact and efficient network design.

Network Architecture



Architecture overview of the proposed LBCNet. Our pipeline is an efficient and unified encoder-decoder-based network. In the encoder phase, a shared feature pyramid extractor is used to generate L-level pyramid contextual features. In the decoder phase, we propose a joint motion estimation and occlusion reasoning module (JMOM) to simultaneously estimate the bilateral motion fields and the anchor frame.
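The core operation of the bilateral cost volume can be sketched as a local correlation that uses the anchor-frame features as a common reference for both RS frames. The NumPy code below is a hedged illustration under our own simplifications (dot-product correlation over a square search window with zero padding); the function names and the search radius are ours, not from LBCNet.

```python
import numpy as np

def local_cost_volume(anchor, other, radius):
    """Correlate anchor features with a shifted neighborhood of `other`.

    anchor, other: (H, W, C) feature maps.
    Returns an (H, W, (2*radius + 1)**2) volume of matching costs.
    """
    H, W, C = anchor.shape
    pad = np.pad(other, ((radius, radius), (radius, radius), (0, 0)))
    costs = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = pad[dy:dy + H, dx:dx + W, :]
            # Channel-wise dot product, normalized by feature dimension.
            costs.append((anchor * shifted).sum(axis=-1) / C)
    return np.stack(costs, axis=-1)

def bilateral_cost_volume(anchor, feat_rs0, feat_rs1, radius=1):
    # The anchor frame serves as the shared reference, so the two
    # resulting volumes are directly comparable for motion estimation.
    return (local_cost_volume(anchor, feat_rs0, radius),
            local_cost_volume(anchor, feat_rs1, radius))
```

In a coarse-to-fine pipeline such a volume would be recomputed at each pyramid level with features warped by the current motion estimate, letting the refined anchor frame and the bilateral motion fields improve each other jointly.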


Quantitative comparisons on recovering GS images at time stamp



Qualitative results on recovering GS images at the middle time stamp



Citation


@article{fan_lbcnet_PAMI24,
  author={Fan, Bin and Dai, Yuchao and Li, Hongdong},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={Learning Bilateral Cost Volume for Rolling Shutter Temporal Super-Resolution}, 
  year={2024},
  volume={46},
  number={5},
  pages={3862-3879},
  doi={10.1109/TPAMI.2024.3350900}
}