Joint Appearance and Motion Learning for Efficient Rolling Shutter Correction

CVPR 2023


Bin Fan, Yuxin Mao, Yuchao Dai, Zhexiong Wan, Qi Liu

School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China   

Abstract


Rolling shutter correction (RSC) is becoming increasingly popular for rolling shutter (RS) cameras, which are widely used in commercial and industrial applications. Despite promising performance, existing RSC methods typically employ a two-stage network structure that ignores intrinsic information interactions and hinders fast inference. In this paper, we propose a single-stage encoder-decoder-based network, named JAMNet, for efficient RSC. It first extracts pyramid features from consecutive RS inputs, and then simultaneously refines two complementary sources of information (i.e., the global shutter (GS) appearance and the undistortion motion field) to achieve mutual promotion in a joint learning decoder. To inject sufficient motion cues for guiding joint learning, we introduce a transformer-based motion embedding module and propose to pass hidden states across pyramid levels. Moreover, we present a new data augmentation strategy, “vertical flip + inverse order”, to release the potential of the RSC datasets. Experiments on various benchmarks show that our approach surpasses state-of-the-art methods by a large margin, especially with a 4.7 dB PSNR leap on real-world RSC.


Contribution


  • We propose a tractable single-stage architecture to jointly perform GS appearance refinement and undistortion motion estimation for efficient RS correction.
  • We develop a general data augmentation strategy, i.e., vertical flip and inverse order, to fully exploit the RS correction datasets (see the sketch after this list).
  • Experiments show that our approach not only achieves state-of-the-art RSC accuracy, but also enjoys fast inference and a flexible, compact network structure.
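
Conceptually, vertically flipping an RS frame reverses its row-exposure order (top-to-bottom becomes bottom-to-top), and reversing the temporal order of the input frames restores physically consistent scanline timing, so the pair together yields a new valid RS training sample. Below is a minimal sketch, assuming the inputs are torch tensors and that the GS target's timestamp remains valid under the reversal (e.g., centered between the two frames); the function name is ours, not from the released code:

import torch

def vflip_inverse_order(rs_frames, gs_target):
    # rs_frames: consecutive RS inputs, shape (T, C, H, W)
    # gs_target: GS ground-truth image, shape (C, H, W)
    rs_aug = torch.flip(rs_frames, dims=[0, 2])  # reverse frame order and flip rows
    gs_aug = torch.flip(gs_target, dims=[1])     # flip the GS target rows to match
    return rs_aug, gs_aug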

Network Architecture



Overall architecture of our JAMNet, which comprises three main components: a feature pyramid encoder, a transformer-based motion embedding module, and a joint appearance and motion decoder. After the hierarchical pyramid features are extracted, the transformer performs motion embedding to inject motion cues; a coarse-to-fine decoder then gradually refines the GS appearance and the undistortion motion field at the same time, until the final full-resolution GS image is synthesized. A hidden state $h^{j}$ is also passed sequentially across pyramid levels.
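
To make the data flow concrete, here is a heavily simplified PyTorch sketch of the three stages. All module names (PyramidEncoder, MotionEmbedding, JointDecoderBlock, JAMNetSketch), channel widths, the number of pyramid levels, and the fusion/warping details are illustrative assumptions of ours, not the released implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

def backward_warp(x, flow):
    """Sample x at positions displaced by flow (B, 2, H, W) via grid_sample."""
    b, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                            torch.arange(w, device=x.device), indexing="ij")
    grid = torch.stack((xs, ys)).float() + flow         # absolute sampling coordinates
    gx = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0         # normalize to [-1, 1]
    gy = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(x, torch.stack((gx, gy), dim=3), align_corners=True)

class PyramidEncoder(nn.Module):
    """Shared encoder that extracts fine-to-coarse pyramid features."""
    def __init__(self, ch=32, levels=3):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for _ in range(levels):
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.ReLU(True),
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(True)))
            in_ch = ch

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [finest, ..., coarsest]

class MotionEmbedding(nn.Module):
    """Stand-in for the transformer-based motion embedding: attends over the
    concatenated coarsest-level features of the two RS frames."""
    def __init__(self, ch=32, heads=4):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(2 * ch, heads, batch_first=True)
        self.proj = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, f1, f2):
        b, c, h, w = f1.shape
        tokens = torch.cat((f1, f2), 1).flatten(2).transpose(1, 2)   # (B, HW, 2C)
        tokens = self.attn(tokens).transpose(1, 2).reshape(b, 2 * c, h, w)
        return self.proj(tokens)

class JointDecoderBlock(nn.Module):
    """One pyramid level: jointly refines the GS appearance and the
    undistortion motion field, and updates the hidden state h^j."""
    def __init__(self, ch=32):
        super().__init__()
        self.fuse = nn.Conv2d(2 * ch + ch + 3 + 2, ch, 3, padding=1)
        self.to_gs = nn.Conv2d(ch, 3, 3, padding=1)   # appearance residual
        self.to_mf = nn.Conv2d(ch, 2, 3, padding=1)   # motion-field residual

    def forward(self, f1, f2, gs, motion, h):
        f2w = backward_warp(f2, motion)               # align features by current motion
        h = torch.relu(self.fuse(torch.cat((f1, f2w, gs, motion, h), 1)))
        return gs + self.to_gs(h), motion + self.to_mf(h), h

class JAMNetSketch(nn.Module):
    def __init__(self, ch=32, levels=3):
        super().__init__()
        self.encoder = PyramidEncoder(ch, levels)
        self.embed = MotionEmbedding(ch)
        self.blocks = nn.ModuleList(JointDecoderBlock(ch) for _ in range(levels))

    def forward(self, rs1, rs2):
        f1, f2 = self.encoder(rs1), self.encoder(rs2)
        h = self.embed(f1[-1], f2[-1])                # motion cues seed the hidden state
        gs = F.interpolate(rs2, size=f1[-1].shape[-2:], mode="bilinear")
        motion = torch.zeros_like(gs[:, :2])
        for lvl in reversed(range(len(f1))):          # coarse-to-fine refinement
            gs, motion, h = self.blocks[lvl](f1[lvl], f2[lvl], gs, motion, h)
            if lvl > 0:                               # pass states up one pyramid level
                size = f1[lvl - 1].shape[-2:]
                gs = F.interpolate(gs, size=size, mode="bilinear")
                motion = 2.0 * F.interpolate(motion, size=size, mode="bilinear")
                h = F.interpolate(h, size=size, mode="bilinear")
        return F.interpolate(gs, size=rs2.shape[-2:], mode="bilinear")

The single-stage design is the point of the sketch: appearance, motion field, and hidden state are all updated together in every decoder block, rather than estimating motion in one network and correcting appearance in another.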


Quantitative comparisons


[Figure: quantitative comparison and ablation results]

Qualitative results


[Figure: qualitative comparison results]

Citation


@inproceedings{fan2023joint,
  title={Joint Appearance and Motion Learning for Efficient Rolling Shutter Correction},
  author={Fan, Bin and Mao, Yuxin and Dai, Yuchao and Wan, Zhexiong and Liu, Qi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5671--5681},
  year={2023}
}