¹School of Electronics and Information, Northwestern Polytechnical University
²Department of Computer Science, National University of Singapore
³School of Intelligence Science and Technology, Peking University
Moving object segmentation plays a crucial role in understanding dynamic scenes. Existing frame-based methods struggle to distinguish whether the pixel displacements of an object are caused by camera motion or object motion, owing to the difficulty of accurate image-based motion modeling. Recent methods that use event data alone for motion segmentation exploit the motion sensitivity of event cameras. However, they struggle to accurately segment a pixel-level mask for each object because events lack dense texture. To address these limitations of unimodal data, we propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues. Our model incorporates implicit cross-modal masked attention augmentation, together with explicit contrastive feature learning and flow-guided motion enhancement, to exploit the dense texture information of a single image and the rich motion information of events, respectively. By leveraging the augmented texture and motion features, we decouple mask segmentation from motion classification to handle varying numbers of independently moving objects. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on both real and simulated datasets, confirming the effectiveness of combining the two modalities.
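As a rough illustration of the cross-modal masked attention idea, the sketch below lets image-derived texture tokens attend to event-derived motion tokens while background event locations are masked out. This is a minimal sketch under assumed conventions, not the authors' exact implementation: the module name `CrossModalMaskedAttention`, the tensor shapes, and the foreground-masking scheme are all illustrative assumptions.

```python
# Minimal sketch of cross-modal masked attention fusing texture and
# motion features. Shapes, names, and the masking scheme are
# illustrative assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn

class CrossModalMaskedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tex_feats, mot_feats, fg_mask):
        # tex_feats: (B, N, C) texture tokens from the image branch
        # mot_feats: (B, N, C) motion tokens from the event branch
        # fg_mask:   (B, N) bool, True where event tokens show motion
        # Mask background event tokens so texture queries attend only
        # to locations carrying motion evidence.
        key_padding_mask = ~fg_mask  # True entries are ignored
        augmented, _ = self.attn(
            query=tex_feats, key=mot_feats, value=mot_feats,
            key_padding_mask=key_padding_mask,
        )
        return self.norm(tex_feats + augmented)  # residual fusion

# Usage with random features for a 32x32 feature map:
B, N, C = 2, 32 * 32, 256
fuse = CrossModalMaskedAttention(C)
out = fuse(torch.randn(B, N, C), torch.randn(B, N, C),
           torch.rand(B, N) > 0.5)
print(out.shape)  # torch.Size([2, 1024, 256])
```

The residual connection keeps the dense texture representation intact while the attention injects motion cues only where events indicate independent movement, which matches the decoupling of mask segmentation from motion classification described above.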
We thank the following helpful open-source projects: mmdetection, Restormer, pytorch_metric_learning.