Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud Sequence Representation Learning

1Tsinghua University, 2Shanghai Qi Zhi Institute, 3Shanghai Artificial Intelligence Laboratory

C2P is a new 4D self-supervised pre-training method that exploits the synergy of geometry and motion for point cloud sequence representation learning.

Abstract

Recent work on 4D point cloud sequences has attracted considerable attention. However, obtaining exhaustively labeled 4D datasets is expensive and laborious, so it is especially important to investigate how to utilize raw unlabeled data. Most existing self-supervised point cloud representation learning methods only consider geometry from a static snapshot, omitting the fact that sequential observations of dynamic scenes could reveal more comprehensive geometric details. Video representation learning frameworks, meanwhile, mostly model motion as image-space flows and are not 3D-geometry-aware. To overcome these issues, this paper proposes a new 4D self-supervised pre-training method called Complete-to-Partial 4D Distillation (C2P). Our key idea is to formulate 4D self-supervised representation learning as a teacher-student knowledge distillation framework and let the student learn useful 4D representations under the guidance of the teacher. Experiments show that this approach significantly outperforms previous pre-training approaches on a wide range of 4D point cloud sequence understanding tasks, in both indoor and outdoor scenarios.

Problem Overview

Labeled 4D point cloud data is expensive to obtain, yet existing self-supervised methods learn geometry only from static snapshots, and image-based video frameworks are not 3D-geometry-aware. We therefore ask how to learn strong spatio-temporal point cloud representations from raw, unlabeled sequences.

Method Overview

Our main idea is to distill the spatio-temporal information of a complete point cloud sequence into the representation of a partial point cloud sequence: a teacher network encodes complete sequences, a student network sees only partial observations, and the student learns to match the teacher's 4D features. This lets the network extract strong features in a self-supervised manner.
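
To make the distillation setup concrete, here is a minimal PyTorch sketch of the complete-to-partial idea. Everything in it is illustrative: ToyPointEncoder, make_partial, and the frame-wise MSE objective are placeholder choices for exposition, not the paper's actual 4D backbone, masking scheme, or loss.

    # Minimal, hypothetical sketch of complete-to-partial distillation.
    # The encoder, partial-view generation, and loss are illustrative
    # placeholders, not the exact C2P architecture.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyPointEncoder(nn.Module):
        """Stand-in per-frame encoder; the paper uses a 4D point cloud backbone."""
        def __init__(self, dim=128):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, seq):                 # seq: (T, N, 3) point cloud sequence
            feats = self.mlp(seq)               # (T, N, dim) per-point features
            return feats.mean(dim=1)            # (T, dim) per-frame descriptors

    def make_partial(seq, keep_ratio=0.5):
        """Simulate a partial observation by randomly dropping points per frame."""
        T, N, _ = seq.shape
        keep = max(1, int(N * keep_ratio))
        idx = torch.stack([torch.randperm(N)[:keep] for _ in range(T)])  # (T, keep)
        return torch.gather(seq, 1, idx.unsqueeze(-1).expand(-1, -1, 3))

    teacher, student = ToyPointEncoder(), ToyPointEncoder()
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)

    complete = torch.randn(8, 1024, 3)          # T=8 frames, N=1024 points, xyz
    partial = make_partial(complete)            # student's degraded view

    with torch.no_grad():                       # teacher encodes the complete sequence
        target = teacher(complete)
    pred = student(partial)                     # student encodes only partial views

    opt.zero_grad()
    loss = F.mse_loss(pred, target)             # pull student features toward teacher's
    loss.backward()
    opt.step()

In a real setup the teacher would not remain at random initialization; common choices for such frameworks include a jointly trained teacher or a momentum-updated (EMA) copy of the student.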

Results

We evaluate our C2P method on a variety of downstream tasks, including action segmentation, semantic segmentation, and action recognition. Consistent performance improvements across all experiments show that our method is effective for different kinds of downstream tasks and generalizes across diverse data.