MUNet: Motion Uncertainty-aware Semi-supervised Video Object Segmentation

arXiv: 2111.14646

Jiadai Sun1*, Yuxin Mao1*, Yuchao Dai1, Yiran Zhong2, Jianyuan Wang3

1Northwestern Polytechnical University    2SenseTime    3Australian National University
* denotes equal contribution


Semi-supervised video object segmentation (VOS) has advanced greatly, and dense matching-based methods now achieve state-of-the-art performance. Recent methods leverage space-time memory (STM) networks and learn to retrieve relevant information from all available sources: the past frames with object masks form an external memory, and the current frame, as the query, is segmented using the mask information in the memory. However, when forming the memory and performing matching, these methods exploit only appearance information and ignore motion information. In this paper, we advocate the return of motion information and propose a motion uncertainty-aware framework (MUNet) for semi-supervised VOS. First, we propose an implicit method to learn the spatial correspondences between neighboring frames, building upon a correlation cost volume. To handle the challenging cases of occlusion and textureless regions when constructing dense correspondences, we incorporate uncertainty into dense matching and obtain a motion uncertainty-aware feature representation. Second, we introduce a motion-aware spatial attention module to effectively fuse the motion feature with the semantic feature. Comprehensive experiments on challenging benchmarks show that combining a small amount of training data with powerful motion information brings a significant performance boost. We achieve 76.5% J&F using only DAVIS17 for training, which significantly outperforms the SOTA methods under the low-data protocol.
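To make the idea of a correlation cost volume between neighboring frames concrete, here is a minimal NumPy sketch. It is not MUNet's implementation: the function name `local_cost_volume`, the search `radius`, the zero padding, and the channel-averaged dot product are all assumptions chosen for illustration.

```python
import numpy as np

def local_cost_volume(feat_t, feat_prev, radius=1):
    """Correlate each pixel of feat_t with a local window of feat_prev.

    feat_t, feat_prev: feature maps of shape (C, H, W).
    Returns an array of shape ((2*radius+1)**2, H, W). Channel k stores the
    channel-averaged dot product between feat_t[:, y, x] and
    feat_prev[:, y+dy, x+dx] for the k-th displacement (dy, dx).
    Hypothetical helper for illustration, not the paper's code.
    """
    c, h, w = feat_t.shape
    # Zero-pad so that displaced windows never index outside the map.
    padded = np.pad(feat_prev, ((0, 0), (radius, radius), (radius, radius)))
    volume = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = padded[:, radius + dy:radius + dy + h,
                             radius + dx:radius + dx + w]
            volume.append((feat_t * shifted).sum(axis=0) / c)
    return np.stack(volume)
```

Taking the argmax over the first axis gives a hard per-pixel displacement estimate. The abstract describes the correspondences as learned implicitly, so a network would presumably consume the whole volume rather than its argmax.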


MUNet Architecture


First, we use Enc-Q and Enc-R to extract the semantic features $\mathcal{F}^{s}_{t}, \mathcal{F}^{s}_{t-1}$ of $\mathcal{I}_t$ and $\mathcal{I}_{t-1}$ from image appearance. Then, we use the MU-Layer to calculate the motion feature $\mathcal{F}^{m}_{t}$, and feed it into the MSAM to guide and enhance the semantic feature $\mathcal{F}^{s}_{t}$. The matching results of the Key-value Matcher are fed into the Decoder to generate the object mask $\widehat{\mathcal{M}}_t$, and the memory feature bank is updated dynamically over time.
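The key-value matching step can be illustrated with the memory read used in STM-style networks: a dot-product affinity between query keys and memory keys, followed by a softmax over memory locations. The sketch below assumes that formulation. MUNet's own Key-value Matcher may differ in detail, and the name `memory_read` and the flattened `(channels, locations)` layout are hypothetical.

```python
import numpy as np

def memory_read(query_key, memory_keys, memory_values):
    """STM-style memory read (illustrative sketch, not the paper's code).

    query_key:     (Ck, Nq) keys of the current frame, flattened over space.
    memory_keys:   (Ck, Nm) keys stored in the memory bank.
    memory_values: (Cv, Nm) values stored in the memory bank.
    Returns (Cv, Nq): for each query location, a softmax-weighted
    average of the memory values.
    """
    affinity = memory_keys.T @ query_key                    # (Nm, Nq)
    # Subtract the column-wise maximum for numerical stability.
    affinity = affinity - affinity.max(axis=0, keepdims=True)
    weights = np.exp(affinity)
    weights = weights / weights.sum(axis=0, keepdims=True)  # softmax over memory
    return memory_values @ weights
```

In such a design the retrieved values are typically combined with the query features before decoding. Here that role would fall to the Decoder described above.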

Qualitative comparison


Qualitative comparison with some competing methods on three sequences of the DAVIS17 validation set. By implicitly modeling object motion and uncertainty, we obtain more accurate results, especially for the movement of multiple bodies (rows 1 and 3) and thin lines (row 2). Inaccurate regions are marked with yellow dashed boxes.

Visualization of the displacement and uncertainty map


We project the uncertainty map into a heat map and add it to the original image for better visualization.
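One plausible way to produce such an overlay is sketched below. The blue-to-red color ramp, the min-max normalization, and the `alpha` blending weight are assumptions; the page does not say which colormap or blending the authors used.

```python
import numpy as np

def overlay_uncertainty(image, uncertainty, alpha=0.5):
    """Blend a heat map of `uncertainty` onto `image` (illustrative sketch).

    image:       (H, W, 3) uint8 RGB frame.
    uncertainty: (H, W) float map; larger means less certain.
    alpha:       weight of the heat map in the blend (assumed parameter).
    Returns a (H, W, 3) uint8 image.
    """
    u = uncertainty.astype(np.float64)
    span = u.max() - u.min()
    # Min-max normalize to [0, 1]; a constant map becomes all zeros.
    u = (u - u.min()) / span if span > 0 else np.zeros_like(u)
    # Simple ramp: low uncertainty is blue, high uncertainty is red.
    heat = np.stack([u, np.zeros_like(u), 1.0 - u], axis=-1) * 255.0
    blended = (1.0 - alpha) * image.astype(np.float64) + alpha * heat
    return np.clip(blended, 0, 255).round().astype(np.uint8)
```

A perceptually uniform colormap from a plotting library would work equally well; the hand-written ramp only keeps the sketch free of extra dependencies.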

IoU per-frame over time


Per-frame IoU over time for our method, STM, and AFB-URR on five video sequences from the DAVIS17 validation set. The first two columns contain only a single object, while the last three contain multiple objects. Because we effectively use the motion information between adjacent frames, the IoU remains high even in the later part of the video sequences.
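For reference, the per-frame IoU (Jaccard index) plotted in such curves can be computed as in the sketch below. The function name `per_frame_iou` is hypothetical, and scoring a frame as 1.0 when both masks are empty is an assumed convention, not something stated on this page.

```python
import numpy as np

def per_frame_iou(pred_masks, gt_masks):
    """Intersection over union for each frame of a sequence.

    pred_masks, gt_masks: binary masks of shape (T, H, W) for one object.
    Returns a float array of shape (T,).
    """
    pred = np.asarray(pred_masks, dtype=bool)
    gt = np.asarray(gt_masks, dtype=bool)
    inter = np.logical_and(pred, gt).sum(axis=(1, 2))
    union = np.logical_or(pred, gt).sum(axis=(1, 2))
    # Assumed convention: a frame where both masks are empty scores 1.0.
    return np.where(union > 0, inter / np.maximum(union, 1), 1.0)
```

For multi-object sequences, one option is to call this once per object and average the resulting curves.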


  title={MUNet: Motion Uncertainty-aware Semi-supervised Video Object Segmentation},
  author={Sun, Jiadai and Mao, Yuxin and Dai, Yuchao and Zhong, Yiran and Wang, Jianyuan},
  journal={arXiv preprint arXiv:2111.14646},