## Efficient Spatial-Temporal Information Fusion for LiDAR-Based 3D Moving Object Segmentation

#### IROS 2022

###### Jiadai Sun1, Yuchao Dai1, Xianjing Zhang2, Jintao Xu2, Rui Ai2, Weihao Gu2, Xieyuanli Chen

1Northwestern Polytechnical University    2HAOMO.AI

### Abstract

Accurate moving object segmentation is an essential task for autonomous driving. It can provide effective information for many downstream tasks, such as collision avoidance, path planning, and static map construction. How to effectively exploit the spatial-temporal information is a critical question for 3D LiDAR moving object segmentation (LiDAR-MOS). In this work, we propose a novel deep neural network exploiting both spatial-temporal information and different representation modalities of LiDAR scans to improve LiDAR-MOS performance. Specifically, we first use a range image-based dual-branch structure to separately deal with spatial and temporal information that can be obtained from sequential LiDAR scans, and later combine them using motion-guided attention modules. We also use a point refinement module via 3D sparse convolution to fuse the information from both LiDAR range image and point cloud representations and reduce the artifacts on the borders of the objects. We verify the effectiveness of our proposed approach on the LiDAR-MOS benchmark of SemanticKITTI. Our method outperforms the state-of-the-art methods significantly in terms of LiDAR-MOS IoU. Benefiting from the devised coarse-to-fine architecture, our method operates online at sensor frame rate.
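The abstract mentions two LiDAR representations used by the method: spherical range images and residual images computed from sequential scans. The following is a minimal sketch of how such inputs are commonly built (the function names, the 64×2048 resolution, and the +3°/−25° vertical field of view are illustrative assumptions following the usual SemanticKITTI setup, not the paper's exact implementation; the paper's residual images are also computed after pose alignment, which is omitted here):

```python
import numpy as np

def range_projection(points, h=64, w=2048, fov_up=3.0, fov_down=-25.0):
    """Project a LiDAR point cloud (N, 3) onto a spherical range image (h, w)."""
    fov_up_rad = np.radians(fov_up)
    fov_down_rad = np.radians(fov_down)
    fov = abs(fov_up_rad) + abs(fov_down_rad)

    depth = np.linalg.norm(points, axis=1)                    # range per point
    yaw = -np.arctan2(points[:, 1], points[:, 0])             # azimuth
    pitch = np.arcsin(points[:, 2] / np.maximum(depth, 1e-8)) # elevation

    u = 0.5 * (yaw / np.pi + 1.0) * w                         # column index
    v = (1.0 - (pitch + abs(fov_down_rad)) / fov) * h         # row index
    u = np.clip(np.floor(u), 0, w - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, h - 1).astype(np.int32)

    range_img = np.full((h, w), -1.0, dtype=np.float32)       # -1 marks no return
    range_img[v, u] = depth
    return range_img

def residual_image(range_cur, range_prev):
    """Normalized range difference; pixels without a return in either scan are 0."""
    valid = (range_cur > 0) & (range_prev > 0)
    res = np.zeros_like(range_cur)
    res[valid] = np.abs(range_cur[valid] - range_prev[valid]) / range_cur[valid]
    return res
```

Large residual values indicate pixels whose range changed between scans, i.e. candidate moving objects, which is the temporal cue the motion branch consumes.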

### MotionSeg3D Architecture

Overview of our method. We extend and modify SalsaNext into a dual-branch, dual-head architecture, consisting of a range image branch (Enc-A) that encodes appearance features and a residual image branch (Enc-M) that encodes temporal motion information, fused by multi-scale motion-guided attention modules. An image head with skip connections then decodes the fused features. Finally, we back-project the 2D features to 3D points and use a point head to further refine the segmentation results. Specifically, BlockA and BlockE are ResBlocks with dilated convolutions, BlockB is a pooling layer with optional dropout, BlockC is a PixelShuffle layer with optional dropout, BlockD is a skip connection with optional dropout, and BlockF is a fully connected layer.
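The core fusion idea, motion features gating appearance features, can be sketched as follows. This is a minimal NumPy illustration of the gating mechanism only (the actual modules also contain convolutions and are applied at multiple scales; `motion_guided_attention` and its residual form are assumptions for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def motion_guided_attention(app_feat, motion_feat):
    """Gate appearance features with a per-pixel motion attention map,
    keeping an identity shortcut so static context is not suppressed.

    app_feat, motion_feat: (C, H, W) feature maps from Enc-A and Enc-M.
    """
    attn = sigmoid(motion_feat)        # motion attention in (0, 1)
    return app_feat * attn + app_feat  # attended features + residual path
```

The residual path means pixels with weak motion evidence still pass their appearance features through unchanged, while pixels with strong motion evidence are emphasized.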

### Ablation study

Ablation study of components on the validation set (seq08). "$\Delta$" shows the improvement compared to the vanilla baseline (a).

### Comparison of our Point-Head and kNN post-processing

Compared to kNN post-processing, our point head produces cleaner object boundaries with noticeably fewer mis-predictions.
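For reference, the kNN baseline being compared against smooths each point's label by majority vote over its nearest neighbors in 3D. A brute-force sketch of that idea (the function name and interface are illustrative, not the benchmark's implementation, which uses a range-image neighborhood for efficiency):

```python
import numpy as np
from collections import Counter

def knn_vote(points, labels, query, k=5):
    """Assign the query point the majority label among its k nearest neighbors."""
    dists = np.linalg.norm(points - query, axis=1)  # brute-force distances
    idx = np.argsort(dists)[:k]                     # k nearest neighbors
    return Counter(labels[idx]).most_common(1)[0][0]
```

Because the vote only redistributes labels already predicted on the range image, it cannot recover points lost to projection artifacts, which is where a learned point head on the raw cloud helps.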

### More qualitative comparison

Qualitative results of different methods for LiDAR-MOS on the validation set of the SemanticKITTI-MOS dataset. Blue circles highlight incorrect predictions and blurred boundaries. Best viewed in color and zoom in for details.

### Acknowledgements

We would like to thank Yufei Wang and Mochu Xiang for their insightful and effective discussions, and HAOMO.AI for supporting this work.

### Citation

```bibtex
@inproceedings{sun2022mos3d,
  title        = {Efficient Spatial-Temporal Information Fusion for LiDAR-Based 3D Moving Object Segmentation},
  author       = {Sun, Jiadai and Dai, Yuchao and Zhang, Xianjing and Xu, Jintao and Ai, Rui and Gu, Weihao and Chen, Xieyuanli},
  booktitle    = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year         = {2022},
  organization = {IEEE}
}
```