## Efficient Spatial-Temporal Information Fusion for LiDAR-Based 3D Moving Object Segmentation

#### IROS 2022

###### Jiadai Sun1, Yuchao Dai1, Xianjing Zhang2, Jintao Xu2, Rui Ai2, Weihao Gu2, Xieyuanli Chen

1Northwestern Polytechnical University    2HAOMO.AI

### Abstract

Accurate moving object segmentation is an essential task for autonomous driving. It can provide effective information for many downstream tasks, such as collision avoidance, path planning, and static map construction. How to effectively exploit the spatial-temporal information is a critical question for 3D LiDAR moving object segmentation (LiDAR-MOS). In this work, we propose a novel deep neural network exploiting both spatial-temporal information and different representation modalities of LiDAR scans to improve LiDAR-MOS performance. Specifically, we first use a range image-based dual-branch structure to separately deal with spatial and temporal information that can be obtained from sequential LiDAR scans, and later combine them using motion-guided attention modules. We also use a point refinement module via 3D sparse convolution to fuse the information from both LiDAR range image and point cloud representations and reduce the artifacts on the borders of the objects. We verify the effectiveness of our proposed approach on the LiDAR-MOS benchmark of SemanticKITTI. Our method outperforms the state-of-the-art methods significantly in terms of LiDAR-MOS IoU. Benefiting from the devised coarse-to-fine architecture, our method operates online at sensor frame rate.
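The abstract mentions two LiDAR representations used by the method: spherical range images and residual images computed from sequential scans. The following is a minimal sketch of how such inputs are commonly built (the function names, the 64×2048 resolution, and the +3°/−25° vertical field of view are illustrative assumptions following the usual SemanticKITTI setup, not the paper's exact implementation; the paper's residual images are also computed after pose alignment, which is omitted here):

```python
import numpy as np

def range_projection(points, h=64, w=2048, fov_up=3.0, fov_down=-25.0):
    """Project a LiDAR point cloud (N, 3) onto a spherical range image (h, w)."""
    fov_up_rad = np.radians(fov_up)
    fov_down_rad = np.radians(fov_down)
    fov = abs(fov_up_rad) + abs(fov_down_rad)

    depth = np.linalg.norm(points, axis=1)                    # range per point
    yaw = -np.arctan2(points[:, 1], points[:, 0])             # azimuth
    pitch = np.arcsin(points[:, 2] / np.maximum(depth, 1e-8)) # elevation

    u = 0.5 * (yaw / np.pi + 1.0) * w                         # column index
    v = (1.0 - (pitch + abs(fov_down_rad)) / fov) * h         # row index
    u = np.clip(np.floor(u), 0, w - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, h - 1).astype(np.int32)

    range_img = np.full((h, w), -1.0, dtype=np.float32)       # -1 marks no return
    range_img[v, u] = depth
    return range_img

def residual_image(range_cur, range_prev):
    """Normalized range difference; pixels without a return in either scan are 0."""
    valid = (range_cur > 0) & (range_prev > 0)
    res = np.zeros_like(range_cur)
    res[valid] = np.abs(range_cur[valid] - range_prev[valid]) / range_cur[valid]
    return res
```

Large residual values indicate pixels whose range changed between scans, i.e. candidate moving objects, which is the temporal cue the motion branch consumes.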

### MotionSeg3D Architecture

Overview of our method. We extend and modify SalsaNext into a dual-branch, dual-head architecture, consisting of a range image branch (Enc-A) that encodes appearance features and a residual image branch (Enc-M) that encodes temporal motion information, fused by multi-scale motion-guided attention modules. An image head with skip connections then decodes the fused features. Finally, we back-project the 2D features to 3D points and use a point head to further refine the segmentation results. Specifically, BlockA and BlockE are ResBlocks with dilated convolutions, BlockB is a pooling layer with optional dropout, BlockC is a PixelShuffle layer with optional dropout, BlockD is a skip connection with optional dropout, and BlockF is a fully connected layer.
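The core fusion idea, motion features gating appearance features, can be sketched as follows. This is a minimal NumPy illustration of the gating mechanism only (the actual modules also contain convolutions and are applied at multiple scales; `motion_guided_attention` and its residual form are assumptions for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def motion_guided_attention(app_feat, motion_feat):
    """Gate appearance features with a per-pixel motion attention map,
    keeping an identity shortcut so static context is not suppressed.

    app_feat, motion_feat: (C, H, W) feature maps from Enc-A and Enc-M.
    """
    attn = sigmoid(motion_feat)        # motion attention in (0, 1)
    return app_feat * attn + app_feat  # attended features + residual path
```

The residual path means pixels with weak motion evidence still pass their appearance features through unchanged, while pixels with strong motion evidence are emphasized.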

### Ablation study

Ablation study of components on the validation set (seq08). "$\Delta$" shows the improvement compared to the vanilla baseline (a).

### Comparison of our Point-Head and kNN post-processing

Compared to kNN post-processing, our point head produces cleaner object boundaries with noticeably fewer mis-predictions.
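For reference, the kNN baseline being compared against smooths each point's label by majority vote over its nearest neighbors in 3D. A brute-force sketch of that idea (the function name and interface are illustrative, not the benchmark's implementation, which uses a range-image neighborhood for efficiency):

```python
import numpy as np
from collections import Counter

def knn_vote(points, labels, query, k=5):
    """Assign the query point the majority label among its k nearest neighbors."""
    dists = np.linalg.norm(points - query, axis=1)  # brute-force distances
    idx = np.argsort(dists)[:k]                     # k nearest neighbors
    return Counter(labels[idx]).most_common(1)[0][0]
```

Because the vote only redistributes labels already predicted on the range image, it cannot recover points lost to projection artifacts, which is where a learned point head on the raw cloud helps.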

### More qualitative comparison

Qualitative results of different methods for LiDAR-MOS on the validation set of the SemanticKITTI-MOS dataset. Blue circles highlight incorrect predictions and blurred boundaries. Best viewed in color and zoom in for details.

### Acknowledgements

We would like to thank Yufei Wang and Mochu Xiang for their insightful and effective discussions, and HAOMO.AI for supporting this work.

### Citation

```bibtex
@inproceedings{sun2022mos3d,
  title        = {Efficient Spatial-Temporal Information Fusion for LiDAR-Based 3D Moving Object Segmentation},
  author       = {Sun, Jiadai and Dai, Yuchao and Zhang, Xianjing and Xu, Jintao and Ai, Rui and Gu, Weihao and Chen, Xieyuanli},
  booktitle    = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year         = {2022},
  organization = {IEEE}
}
```