CoPart

Contextual Part Latents for 3D Generation

ICCV 2025


1HKUST    2CUHK    3SenseTime Research   

*Equal contribution    Corresponding author


CoPart: high-quality part-based 3D generation.


PartVerse Dataset

We are pleased to release PartVerse, the first large-scale, manually annotated 3D object part dataset.



We follow a "raw data → mesh segmentation algorithm → human post-correction" pipeline to produce the part-level data.



Part-level text captions are provided, covering appearance, shape, and the relationship between each part and the whole object.


Abstract

To generate 3D objects, early research focused on multi-view-driven approaches that rely solely on 2D renderings. Recently, the 3D native latent diffusion paradigm has demonstrated superior performance in 3D generation, because it fully leverages the geometric information provided in ground-truth 3D data. Despite this rapid progress, 3D diffusion still faces three challenges. First, most of these methods represent a 3D object with a single latent, regardless of its complexity, which can lead to detail loss when generating objects with multiple complicated parts. Second, most 3D assets are designed part by part, yet the current holistic latent representation overlooks both the independence of these parts and their interrelationships, limiting the model's generative ability. Third, current methods rely on global conditions (e.g., text, image, point cloud) to control the generation process and therefore lack fine-grained controllability. Motivated by how 3D designers create a 3D object, we present a new part-based 3D generation framework, CoPart, which represents a 3D object with multiple contextual part latents and simultaneously generates coherent 3D parts. This part-based framework has several advantages: i) it reduces the encoding burden of intricate objects by decomposing them into simpler parts, ii) it facilitates part learning and part-relationship modeling, and iii) it naturally supports part-level control. To ensure the coherence of part latents and to harness the powerful priors of foundation models, we propose a novel mutual guidance strategy that fine-tunes pre-trained diffusion models for joint part-latent denoising. We also provide part-level text captions for each part, describing its shape, appearance, and relationship to the whole object.

Method Overview



The CoPart framework operates as follows: Gaussian noise is added to the part image tokens and geometric tokens extracted by the VAEs, which are then fed into the 2D and 3D denoisers. Mutual guidance (a) facilitates information exchange between the 3D and 2D modalities (via Cross-Modality Attention) and between different parts (via Cross-Part Attention). Additionally, (b) the 3D bounding boxes are treated as cube meshes, and the extracted box tokens are injected into the 3D denoiser through cross-attention; simultaneously, the boxes are rendered as 2D images and injected into the 2D denoiser via ControlNet.
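The two attention mechanisms behind mutual guidance can be sketched as below. This is a minimal illustration only, not the authors' implementation: the module name `MutualGuidanceBlock`, the token shapes, and the pre-norm residual layout are all assumptions; the actual denoisers, token dimensions, and conditioning paths differ.

```python
import torch
import torch.nn as nn

class MutualGuidanceBlock(nn.Module):
    """Hypothetical sketch of mutual guidance: 3D part tokens first attend
    to the matching 2D image tokens (cross-modality), then all parts
    attend to each other (cross-part)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_modality = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_part = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens_3d: torch.Tensor, tokens_2d: torch.Tensor) -> torch.Tensor:
        # tokens_3d, tokens_2d: (num_parts, seq_len, dim), one row per part.
        # Cross-Modality Attention: each part's 3D tokens query its 2D tokens.
        q = self.norm1(tokens_3d)
        attn_out, _ = self.cross_modality(q, tokens_2d, tokens_2d)
        tokens_3d = tokens_3d + attn_out
        # Cross-Part Attention: flatten parts into one sequence so every
        # token can attend to tokens of every other part.
        p, s, d = tokens_3d.shape
        flat = tokens_3d.reshape(1, p * s, d)
        q = self.norm2(flat)
        attn_out, _ = self.cross_part(q, q, q)
        return (flat + attn_out).reshape(p, s, d)

# Usage: 4 parts, each with 16 geometric tokens of width 64.
block = MutualGuidanceBlock(dim=64)
t3d = torch.randn(4, 16, 64)   # noised geometric tokens per part
t2d = torch.randn(4, 16, 64)   # matching 2D image tokens per part
out = block(t3d, t2d)
print(out.shape)  # torch.Size([4, 16, 64])
```

A symmetric block would let the 2D tokens query the 3D tokens in the same way, so that both denoisers stay consistent across parts during joint denoising.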

BibTeX

@inproceedings{dong2025copart,
  title={From One to More: Contextual Part Latents for 3D Generation},
  author={Dong, Shaocong and Ding, Lihe and Chen, Xiao and Li, Yaokun and Wang, Yuxin and Wang, Yucheng and Wang, Qi and Kim, Jaehyeok and Gao, Chenjian and Huang, Zhanpeng and Wang, Zibin and Xue, Tianfan and Xu, Dan},
  booktitle={ICCV},
  year={2025}
}