Rotationally Equivariant 3D Object Detection

CVPR 2022

  • Stanford University

  • Tsinghua University

Abstract

Rotation equivariance has recently become a strongly desired property in the 3D deep learning community. Yet most existing methods focus on equivariance regarding a global input rotation while ignoring the fact that rotation symmetry has its own spatial support. Specifically, we consider the object detection problem in 3D scenes, where an object bounding box should be equivariant regarding the object pose, independent of the scene motion. This suggests a new desired property we call object-level rotation equivariance. To incorporate object-level rotation equivariance into 3D object detectors, we need a mechanism to extract equivariant features with local object-level spatial support while being able to model cross-object context information. To this end, we propose Equivariant Object detection Network (EON) with a rotation equivariance suspension design to achieve object-level equivariance. EON can be applied to modern point cloud object detectors, such as VoteNet and PointRCNN, enabling them to exploit object rotation symmetry in scene-scale inputs. Our experiments on both indoor scene and autonomous driving datasets show that significant improvements are obtained by plugging our EON design into existing state-of-the-art 3D object detectors.

Video

Rotation Equivariance in 3D Detection

If we rotate an object in a scene, we want to see that:

  • The bounding box rotates along with the object, and keeps its shape unchanged.
  • Other boxes stay still.
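The two bullet points above can be written compactly (the notation here is ours, for illustration): let f map a scene S to a set of boxes {b_i}, and let R_k(S) denote rotating only object k about its own center by R. Object-level rotation equivariance then asks

```latex
% f maps a scene S to boxes {b_i}; R_k(S) rotates only object k
% about its own center by R.
f\bigl(R_k(S)\bigr)_k = R \cdot f(S)_k,
\qquad
f\bigl(R_k(S)\bigr)_j = f(S)_j \quad (j \neq k).
```

By contrast, the usual global equivariance only requires f(R·S)_i = R·f(S)_i for all i when the whole scene rotates.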

In the figure below we rotate a chair in a room. While a non-equivariant detector produces jittery predictions, ours rotates the target box smoothly and has a much smaller influence on the other boxes.
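This behavior can be checked numerically with a toy sketch (all names and the centroid/SVD "detector" below are ours for illustration, not the paper's method): a detector whose features have per-object spatial support is unaffected when a different object moves, and its box center and size follow the rotated object.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_box(points):
    """Toy per-object 'detector': box center = centroid,
    box size = spread along principal axes (singular values)."""
    center = points.mean(axis=0)
    # singular values of the centered cloud are rotation-invariant
    size = np.linalg.svd(points - center, compute_uv=False)
    return center, size

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# two "objects" in a scene
chair = rng.normal(size=(100, 3)) * np.array([1.0, 0.5, 1.5]) + np.array([2.0, 0.0, 0.0])
table = rng.normal(size=(80, 3)) + np.array([-3.0, 1.0, 0.0])

# rotate only the chair, about its own centroid
R = rot_z(np.pi / 5)
c = chair.mean(axis=0)
chair_rot = (chair - c) @ R.T + c

# object-level equivariance: the chair's box follows the chair
# (center fixed under rotation about the centroid, size invariant),
# while the table's points, and hence its box, are untouched
c0, s0 = toy_box(chair)
c1, s1 = toy_box(chair_rot)
assert np.allclose(c0, c1) and np.allclose(s0, s1)
```

A detector built from whole-scene features would not pass such a check automatically, since rotating one object changes the global input.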

Reference: rotating a chair

VoteNet predictions

EON-VoteNet (Ours)

Collectively, when we rotate the whole scene, we want to see the left figure below. A non-equivariant detector has difficulty generalizing to unseen object poses, yielding unstable predictions. Our method predicts more consistent boxes.

Reference: rotating a scene

VoteNet predictions

EON-VoteNet (Ours)

Quantitative Results

Our method works on both indoor and outdoor datasets, different 3D detector architectures, and different backbones.

Indoor (ScanNet V2)

Method                  mAP
VoteNet [1]             50.4
EON-VoteNet (Ours)      56.7
Transformer [2]         52.4
EON-Transformer (Ours)  61.4

Outdoor (KITTI 3D)

Method                  Ped-AP
PointRCNN [3]           55.9
EON-PointRCNN (Ours)    61.1

[1] C. R. Qi et al., "Deep Hough Voting for 3D Object Detection in Point Clouds," ICCV 2019.
[2] Z. Liu et al., "Group-Free 3D Object Detection via Transformers," ICCV 2021.
[3] S. Shi et al., "PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud," CVPR 2019.

Please send any questions to Koven Yu.