Abstract
While generative models have seen significant adoption across a wide range of data modalities, including 3D data, no consensus has yet been reached on which model is best suited for which task. Furthermore, some forms of conditioning, such as text and images, are frequently used to steer the generation process, while others, such as partial 3D data, have not been thoroughly evaluated. In this work, we compare two of the most promising generative models, Denoising Diffusion Probabilistic Models and Autoregressive Causal Transformers, which we adapt to the tasks of generative shape modeling and completion. We conduct a thorough quantitative evaluation of both tasks, including a baseline discriminative model and an extensive ablation study. Our results show that (1) the diffusion model with continuous latents outperforms both the discriminative model and the autoregressive approach, delivering state-of-the-art performance on multi-modal shape completion from a single, noisy depth image under realistic conditions, and (2) when compared on the same discrete latent space, the autoregressive model can match or exceed diffusion performance on these tasks.
Method
Generative shape completion pipeline. (1) Given an input point cloud sampled from an object's surface, we apply positional encoding and aggregate the encoded points into a farthest-point-sampled subset to form a latent code. (2) We model these latents either as a diagonal multivariate Gaussian for diffusion models or quantize them with a codebook for autoregressive models, forming our (VQ-)VAE encoder. For shape completion, we condition the generative model on the encoding of a partial view obtained with a pre-trained feature extractor. (3) We predict occupancy probabilities via cross-attention between query points and latents sampled from the generative model. (4) Optionally, a mesh can be extracted using Marching Cubes. During inference, we sample latent codes either autoregressively or via denoising.
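Step (1) relies on farthest-point sampling to pick a compact, well-spread subset of the input points around which latents are aggregated. The following is a minimal, dependency-free sketch of the greedy algorithm; the function name and interface are illustrative and not taken from the paper's implementation, which operates on batched tensors.

```python
import math
import random

def farthest_point_sampling(points, k, seed=0):
    """Greedily select k points that are maximally spread out:
    start from a random point, then repeatedly add the point
    farthest from the current selection (by nearest-neighbor distance)."""
    rng = random.Random(seed)
    selected = [rng.randrange(len(points))]
    # Distance of every point to its nearest selected point so far.
    dist = [math.dist(p, points[selected[0]]) for p in points]
    while len(selected) < k:
        idx = max(range(len(points)), key=lambda i: dist[i])
        selected.append(idx)
        # Update nearest-selected distances with the newly added point.
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return [points[i] for i in selected]
```

In the pipeline, per-point features are then pooled onto each sampled point (e.g., via attention over its neighborhood) so that the latent set covers the surface evenly rather than clustering in densely sampled regions.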
Real-World Results
[Qualitative results gallery: Input · Ground Truth · Generative (Best) · Discriminative]
Results on the Automatica/YCB dataset. When generating multiple completions and selecting the best one, the generative model consistently outperforms the discriminative baseline across all metrics. The generative approach produces plausible completions that better capture fine details like handles and spouts, especially when the input view is highly ambiguous.
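Selecting the best of several sampled completions requires a similarity metric against a reference point set; a common choice for point clouds is the symmetric Chamfer distance. Below is a minimal sketch of best-of-N selection; the function names and the use of Chamfer distance as the selection metric are illustrative assumptions, not the paper's exact evaluation protocol.

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets:
    mean nearest-neighbor distance from a to b plus from b to a."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def best_of_n(completions, reference):
    """Return the sampled completion closest to the reference set."""
    return min(completions, key=lambda c: chamfer_distance(c, reference))
```

This brute-force nearest-neighbor search is quadratic in the number of points; practical evaluations typically use a KD-tree or GPU batching instead.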
BibTeX
@inproceedings{humt2026evaluating,
  title     = {Evaluating Latent Generative Paradigms for High-Fidelity 3D Shape Completion from a Single Depth Image},
  author    = {Humt, Matthias and Hillenbrand, Ulrich and Triebel, Rudolph},
  booktitle = {International Conference on 3D Vision (3DV)},
  year      = {2026}
}