Northwestern Polytechnical University
* denotes equal contribution
# corresponding author
juanjuanzhu2022@mail.nwpu.edu.cn, wanzhexiong@mail.nwpu.edu.cn, daiyuchao@nwpu.edu.cn
Recently, the task of Video Frame Prediction (VFP), which predicts future video frames from previous ones through extrapolation, has made remarkable progress. However, the performance of existing VFP methods is still far from satisfactory because they rely on fixed-framerate videos: 1) they have difficulty handling complex dynamic scenes; 2) they cannot predict future frames at flexible prediction time intervals. Event cameras record intensity changes asynchronously with very high temporal resolution, providing rich dynamic information about the observed scene. In this paper, we propose to predict video frames from a single image and the subsequent events, which makes it possible both to handle complex dynamic scenes and to predict future frames at flexible prediction time intervals. First, we introduce a symmetrical cross-modal attention augmentation module to enhance the complementary information between images and events. Second, we propose to jointly perform optical flow estimation and frame generation by combining the motion information of events with the semantic information of the image, and then inpaint the holes produced by forward warping to obtain an ideal predicted frame. Based on these designs, we propose a lightweight pyramidal coarse-to-fine model that can predict a 720p frame within 25 ms. Extensive experiments show that our proposed model significantly outperforms state-of-the-art frame-based and event-based VFP methods and has the fastest runtime.
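To make the warping-and-inpainting idea concrete, below is a minimal PyTorch sketch (an illustration, not our released implementation) of nearest-neighbor forward warping: each source pixel is splatted to its flow-displaced target location, and target pixels that receive no source pixel form the hole mask that the inpainting step must fill. The function name and the nearest-neighbor splatting scheme are simplifying assumptions.

import torch

def forward_warp_nearest(img, flow):
    """Illustrative nearest-neighbor forward warping.
    img:  (B, C, H, W) source frame
    flow: (B, 2, H, W) optical flow (dx, dy) from source to target time
    Returns the warped frame and a hole mask (1 where nothing landed).
    """
    B, C, H, W = img.shape
    device = img.device
    # Source pixel grid.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device),
        torch.arange(W, device=device),
        indexing="ij",
    )
    # Displace each pixel by its flow and round to the nearest target pixel.
    xt = (xs[None] + flow[:, 0]).round().long()   # (B, H, W)
    yt = (ys[None] + flow[:, 1]).round().long()
    valid = (xt >= 0) & (xt < W) & (yt >= 0) & (yt < H)

    warped = torch.zeros_like(img)
    count = torch.zeros(B, 1, H, W, device=device)
    for b in range(B):  # batch loop kept explicit for clarity
        idx = yt[b][valid[b]] * W + xt[b][valid[b]]          # flat targets
        src = img[b].reshape(C, -1)[:, valid[b].reshape(-1)]
        warped[b].view(C, -1).index_add_(1, idx, src)
        count[b].view(1, -1).index_add_(
            1, idx, torch.ones(1, idx.numel(), device=device))
    hole_mask = (count == 0).float()        # positions no source pixel reached
    warped = warped / count.clamp(min=1.0)  # average colliding pixels
    return warped, hole_mask

# Example: zero flow leaves the frame unchanged and produces no holes.
frame = torch.rand(1, 3, 720, 1280)
flow = torch.zeros(1, 2, 720, 1280)
out, holes = forward_warp_nearest(frame, flow)

Because many-to-one mappings and disocclusions leave unfilled target pixels, the hole mask highlights exactly the regions our inpainting step is responsible for.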
In our framework, we first use two encoders to extract pyramid features from the image and the events. Then we apply a coarse-to-fine joint decoder to obtain the synthesized feature and the optical flow at each pyramid level. In the decoder, we utilize Symmetrical Cross-modal Attention (SCA) to augment both the image and event features. We also introduce a Warping and Inpainting Module (WIM) to repair the holes caused by forward warping and obtain spatially aligned image features. Finally, we adopt Weighted Fusion (WF) to produce the final frame prediction from the synthesized and warped frames.
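The following simplified PyTorch module illustrates the SCA idea: image features attend to event features and vice versa, so each modality is augmented with the other's complementary information. The single-head formulation, 1x1-convolution projections, and residual connections are illustrative assumptions, not the exact architecture of our decoder.

import torch
import torch.nn as nn

class SymmetricalCrossModalAttention(nn.Module):
    """Sketch of symmetrical cross-modal attention between two modalities."""
    def __init__(self, channels):
        super().__init__()
        # Separate query/key/value projections for each attention direction.
        self.q_img = nn.Conv2d(channels, channels, 1)
        self.k_evt = nn.Conv2d(channels, channels, 1)
        self.v_evt = nn.Conv2d(channels, channels, 1)
        self.q_evt = nn.Conv2d(channels, channels, 1)
        self.k_img = nn.Conv2d(channels, channels, 1)
        self.v_img = nn.Conv2d(channels, channels, 1)
        self.scale = channels ** -0.5

    def _attend(self, q, k, v):
        B, C, H, W = q.shape
        q = q.flatten(2).transpose(1, 2)   # (B, HW, C)
        k = k.flatten(2)                   # (B, C, HW)
        v = v.flatten(2).transpose(1, 2)   # (B, HW, C)
        attn = torch.softmax(q @ k * self.scale, dim=-1)
        return (attn @ v).transpose(1, 2).reshape(B, C, H, W)

    def forward(self, f_img, f_evt):
        # Each modality queries the other; residuals keep original features.
        img_aug = f_img + self._attend(self.q_img(f_img),
                                       self.k_evt(f_evt),
                                       self.v_evt(f_evt))
        evt_aug = f_evt + self._attend(self.q_evt(f_evt),
                                       self.k_img(f_img),
                                       self.v_img(f_img))
        return img_aug, evt_aug

# Example at a coarse pyramid level, where full attention stays affordable.
sca = SymmetricalCrossModalAttention(64)
f_img = torch.rand(1, 64, 32, 32)
f_evt = torch.rand(1, 64, 32, 32)
img_aug, evt_aug = sca(f_img, f_evt)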
@inproceedings{Zhu_VFPSIE_AAAI_2024,
title={Video Frame Prediction from a Single Image and Events},
author={Zhu, Juanjuan and Wan, Zhexiong and Dai, Yuchao},
booktitle={AAAI Conference on Artificial Intelligence (AAAI)},
year={2024},
}
This research was supported in part by the National Natural Science Foundation of China (62271410, 62001394), Zhejiang Lab (No. 2021MC0AB05), the Fundamental Research Funds for the Central Universities, and the Innovation Foundation for Doctor Dissertation of Northwestern Polytechnical University (CX2023013).
Thanks to the ACs and the reviewers for their comments, which were very helpful in improving our paper.
Thanks to the following helpful open-source projects: Time Lens, IFRNet, RAFT, esim_py, DSEC.