PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model

1Mohamed bin Zayed University of Artificial Intelligence, UAE
Teaser figure.

We propose PARIS3D, a model that is capable of segmenting parts of 3D objects based on implicit textual queries and generating natural language explanations corresponding to 3D object segmentation requests. The figure shows our semantic segmentation results.

Abstract

Recent advancements in 3D perception systems have significantly improved their ability to perform visual recognition tasks such as segmentation. However, these systems still heavily rely on explicit human instruction to identify target objects or categories, lacking the capability to actively reason about and comprehend implicit user intentions. We introduce a novel segmentation task known as reasoning part segmentation for 3D objects, aiming to output a segmentation mask based on complex and implicit textual queries about specific parts of a 3D object. To facilitate evaluation and benchmarking, we present a large 3D dataset comprising over 60k instructions paired with corresponding ground-truth part segmentation annotations, specifically curated for reasoning-based 3D part segmentation. We propose a model that is capable of segmenting parts of 3D objects based on implicit textual queries and generating natural language explanations corresponding to 3D object segmentation requests. Experiments show that our method achieves performance competitive with models that use explicit queries, with the additional abilities to identify part concepts, reason about them, and complement them with world knowledge.

Video

Architecture

Semantic Segmentation Results

Comparison with previous 3D part segmentation methods. Object category mIoU (%) is reported. In the 45x8+28k setting, baseline models use an additional 28k training shapes for the 17 object categories that overlap with the PartNet dataset. For the remaining 28 non-overlapping object categories, only 8 shapes per category are available during training. PartSLIP* indicates that a separate model is trained for each category; + denotes our implementation of PartSLIP in which a single model is trained on all categories together.
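To make the reported metric concrete, the following is a minimal sketch of how per-object-category mIoU can be computed from predicted and ground-truth part labels; the function and variable names are illustrative and not part of the released code.

    import numpy as np

    def category_miou(pred_labels, gt_labels, num_parts):
        """Mean IoU over the part classes of a single object category.

        pred_labels, gt_labels: (N,) integer part labels, one per point.
        num_parts: number of part classes defined for this category.
        """
        ious = []
        for part in range(num_parts):
            pred_mask = pred_labels == part
            gt_mask = gt_labels == part
            union = np.logical_or(pred_mask, gt_mask).sum()
            if union == 0:  # part absent in both prediction and ground truth
                continue
            inter = np.logical_and(pred_mask, gt_mask).sum()
            ious.append(inter / union)
        return float(np.mean(ious)) if ious else 0.0

The category-level scores in the table are then averages of this quantity over the shapes of each object category.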

Generalizability Experiments

To demonstrate the generalizability of PARIS3D to real-world data, we run our 3D segmentation on real point clouds captured with a smartphone's LiDAR sensor, as suggested by this paper. The image shows qualitative examples: the fused point clouds are passed through the PARIS3D architecture to obtain part segmentation labels, as in the previous experiments, without a large drop in performance. A rough sketch of how such a scan could be prepared for the model is given below.
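The sketch below uses open3d only for point-cloud I/O; the file path and the final step are illustrative and do not reflect the released PARIS3D interface.

    import numpy as np
    import open3d as o3d

    # Load a fused RGB point cloud captured with a smartphone LiDAR sensor.
    # "scans/chair_scan.ply" is an illustrative path, not a file shipped with PARIS3D.
    pcd = o3d.io.read_point_cloud("scans/chair_scan.ply")

    xyz = np.asarray(pcd.points)   # (N, 3) point coordinates
    rgb = np.asarray(pcd.colors)   # (N, 3) colours in [0, 1]
    points = np.concatenate([xyz, rgb], axis=1)  # (N, 6) coloured point cloud

    # An implicit, reasoning-style query rather than an explicit part name.
    query = "Which part of this object would you hold to pour water out of it?"

    # `points` and `query` would then be passed to the PARIS3D model to obtain a
    # per-point segmentation mask and a natural-language explanation.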

RPSeg Dataset

Given a coloured point cloud P, the goal of a standard 3D segmentation model is to predict a label for each point. In our reasoning segmentation task, we go further: given an input point cloud and an implicit text query, the model must output a 3D segmentation mask M. The complexity of the query text is the key differentiator of reasoning part segmentation. Instead of naming the parts, the query may contain more intricate expressions that require an understanding of the structure, geometry, and semantics of 3D objects. By introducing this task, we aim to bridge the gap between user intent and system response, enabling more intuitive and dynamic interactions in 3D object perception.
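Assuming each point is encoded by its xyz coordinates and RGB colour (the exact encoding is an assumption, not stated above), the task can be summarised as

    M = f_{\theta}(P, x_{\mathrm{txt}}), \qquad P \in \mathbb{R}^{N \times 6}, \quad M \in \{0, 1\}^{N},

where x_txt is the implicit query text and M_i = 1 marks point i as belonging to the requested part.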

BibTeX

@misc{kareem2024paris3d,
      title={PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model},
      author={Amrin Kareem and Jean Lahoud and Hisham Cholakkal},
      year={2024},
      eprint={2404.03836},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}