Combining Shape Completion and Grasp Prediction for Fast and Versatile Grasping with a Multi-Fingered Hand

Humt, Matthias; Winkelbauer, Dominik; Hillenbrand, Ulrich; Bäuml, Berthold

Matthias Humt¹,²*, Dominik Winkelbauer¹,²*, Ulrich Hillenbrand¹, Berthold Bäuml¹,³

¹German Aerospace Center (DLR) ²TU Munich ³Deggendorf Institute of Technology
Humanoids 2023

* Equal contribution

Pipeline overview: inference and training flow for shape completion and grasp prediction

Overview of our pipeline. During inference, a partial point cloud from a depth sensor is completed using VQDIF, then grasps are predicted using our two-stage architecture. During training, we generate realistic simulated Kinect depth images and use an analytical grasp planner to create ground truth labels. The complete pipeline takes only ~1 second from perception to grasp.

Abstract

Grasping objects with limited or no prior knowledge about them is a highly relevant skill in assistive robotics. In this general setting, however, it remains an open problem, especially under partial observability and when versatile grasps with multi-fingered hands are required.

We present a novel, fast, and high-fidelity deep learning pipeline consisting of a shape completion module that operates on a single depth image, followed by a grasp predictor that operates on the predicted object shape. The shape completion network is based on VQDIF and predicts spatial occupancy values at arbitrary query points. As the grasp predictor, we use our two-stage architecture, which first generates hand poses using an autoregressive model and then regresses finger joint configurations for each pose.

Critical factors turn out to be sufficient data realism and augmentation, as well as special attention to difficult cases during training. Experiments on a physical robot platform demonstrate successful grasping of a wide range of household objects based on a depth image from a single viewpoint. The whole pipeline is fast, taking only about 1 second for completing the object's shape (0.7s) and generating 1000 grasps (0.3s).

Pipeline Overview

Our pipeline consists of two steps during inference. Given a partial point cloud of an object, obtained through rendering during training or from a depth sensor during inference, a shape completion network based on VQDIF predicts the full object geometry implicitly as occupancy probabilities at query points on a grid. Once this network is trained, its predicted occupancy probabilities serve as input to the grasp predictor, both during training on ground truth grasps and during deployment on our robot Agile Justin.
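The grid query step can be sketched as follows. This is a minimal illustration, not the authors' implementation: toy_occupancy is a hypothetical stand-in for the trained network (a soft sphere), and the grid resolution, cube extent, and 0.5 threshold are assumptions.

```python
import numpy as np

def make_query_grid(resolution: int, half_extent: float = 0.5) -> np.ndarray:
    """Return (resolution**3, 3) query points on a regular grid inside a cube."""
    axis = np.linspace(-half_extent, half_extent, resolution)
    xx, yy, zz = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([xx, yy, zz], axis=-1).reshape(-1, 3)

def toy_occupancy(points: np.ndarray, radius: float = 0.3) -> np.ndarray:
    """Hypothetical stand-in for the trained network: a soft sphere.

    A real model would also condition on the partial point cloud.
    """
    dist = np.linalg.norm(points, axis=1)
    return 1.0 / (1.0 + np.exp((dist - radius) / 0.02))

def occupancy_grid(predict, resolution: int = 32) -> np.ndarray:
    """Evaluate an occupancy predictor on the grid and reshape to a volume."""
    points = make_query_grid(resolution)
    probs = predict(points)
    return probs.reshape(resolution, resolution, resolution)

grid = occupancy_grid(toy_occupancy, resolution=32)
occupied = grid > 0.5  # assumed threshold for a binary occupancy volume
```

The resulting volume of probabilities is the kind of input the grasp predictor consumes; how the real pipeline encodes it is not specified here.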

Architecture Details

Shape Completion

Grasp Prediction

Left: The shape completion network uses VQDIF to predict occupancy probabilities at continuous query points from a partial point cloud. Right: The two-stage grasp predictor first generates hand pose distributions using a generative network, then regresses finger joint configurations and grasp quality scores for each pose.

Realistic Depth Simulation

Comparison of standard depth rendering, Kinect simulation, and real Kinect data

Comparison of depth rendering methods. Left: Standard depth rendering. Middle: Kinect depth simulation using an optimized Blensor model. Right: Real Kinect depth image. The simulation more closely resembles the real depth image, showing similar noise patterns and regions of missing data compared to the standard depth rendering.
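A toy sketch of the two effects named in the caption, depth noise and regions of missing data, is given below. It is not the optimized Blensor model used in the pipeline: the quadratic growth of noise with depth, the gradient-based dropout rule, and the parameter values noise_scale and edge_threshold are all assumptions chosen for illustration.

```python
import numpy as np

def degrade_depth(depth: np.ndarray, noise_scale: float = 0.002,
                  edge_threshold: float = 0.05, seed: int = 0) -> np.ndarray:
    """Add depth-dependent noise and drop pixels at depth discontinuities.

    Missing pixels are marked with 0.0 (an assumed convention).
    """
    rng = np.random.default_rng(seed)
    # Assumed noise model: standard deviation grows with the square of depth.
    noisy = depth + rng.normal(0.0, 1.0, depth.shape) * noise_scale * depth ** 2
    # Pixels with a large depth gradient are treated as unreliable edges.
    grad_rows, grad_cols = np.gradient(depth)
    edges = np.hypot(grad_rows, grad_cols) > edge_threshold
    noisy[edges] = 0.0
    return noisy

# A 16x16 scene: background at 2 m with a box at 1 m in the middle.
clean = np.full((16, 16), 2.0)
clean[4:12, 4:12] = 1.0
simulated = degrade_depth(clean)
num_missing = int((simulated == 0.0).sum())
```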

Training Improvements

Effect of Kinect Simulation

Basic Depth Rendering

With Kinect Simulation

Shape completion trained on Kinect simulation succeeds

Effects of Kinect depth simulation. Training on unrealistic depth data leads to failure in challenging regions far away from input data and training distribution (left). Accurately simulating sensor characteristics resolves this (right).

Effect of Finetuning

Without Finetuning

With Finetuning

Effects of finetuning. The standard training methodology can lead to failure cases on challenging objects and viewpoints. Automated finetuning on difficult examples through importance sampling greatly reduces this problem.
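One way to focus finetuning on difficult examples is to draw training samples with probability proportional to their current loss. The sketch below assumes exactly that; the text does not say how difficulty is measured or weighted, so the loss-proportional rule, the temperature parameter, and the example losses are all hypothetical.

```python
import numpy as np

def sampling_probabilities(losses: np.ndarray, temperature: float = 1.0,
                           floor: float = 1e-3) -> np.ndarray:
    """Turn per-example losses into sampling probabilities.

    The floor keeps easy examples from vanishing entirely; a higher
    temperature flattens the distribution towards uniform sampling.
    """
    weights = np.power(np.maximum(losses, floor), 1.0 / temperature)
    return weights / weights.sum()

rng = np.random.default_rng(0)
per_example_loss = np.array([0.05, 0.04, 0.90, 0.06, 0.70])  # made-up values
probs = sampling_probabilities(per_example_loss)

# Draw a finetuning set: hard examples (indices 2 and 4) dominate.
batch = rng.choice(len(per_example_loss), size=1000, replace=True, p=probs)
counts = np.bincount(batch, minlength=len(per_example_loss))
```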

Results on Realistic Synthetic Data

Evaluated on 6,700 simulated depth images from 67 novel objects (29 Automatica + 38 YCB)
High-quality meshes from laser scanned household objects

Volumetric Metrics

Model	IoU ↑	F1 ↑	Precision ↑	Recall ↑
Kinect Finetune	66.7	75.7	73.5	83.3
Kinect Scale	60.4	71.0	75.5	73.6
Kinect	58.0	68.9	75.7	70.6
Basic	49.5	61.1	74.0	59.5
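For reference, the four volumetric metrics can be computed from a pair of binary occupancy grids as sketched below. The function name and the handling of empty grids are assumptions; the definitions themselves are the standard ones.

```python
import numpy as np

def volumetric_metrics(pred: np.ndarray, target: np.ndarray) -> dict:
    """IoU, precision, recall and F1 between two binary occupancy grids."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = int(np.logical_and(pred, target).sum())   # occupied in both
    fp = int(np.logical_and(pred, ~target).sum())  # predicted only
    fn = int(np.logical_and(~pred, target).sum())  # ground truth only
    union = tp + fp + fn
    iou = tp / union if union else 1.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}

# Two 32-voxel slabs in a 4x4x4 grid that overlap in 16 voxels.
target = np.zeros((4, 4, 4), dtype=bool)
target[0:2] = True
pred = np.zeros((4, 4, 4), dtype=bool)
pred[1:3] = True
metrics = volumetric_metrics(pred, target)
# iou = 16 / 48, precision = recall = f1 = 0.5
```

The tabulated F1 values are not the harmonic mean of the tabulated precision and recall, which is what one would expect if each metric is computed per example and then averaged; that reading is an inference, not something stated here.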

Mesh Surface Metrics (CD ×10)

Model	CD ↓	F1 ↑	Precision ↑	Recall ↑
Kinect Finetune	0.258	41.5	43.4	40.2
Kinect Scale	0.276	43.4	44.6	42.8
Kinect	0.290	42.7	44.1	42.0
Basic	0.455	34.1	36.8	32.6
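The Chamfer distance (CD) compares two point sets sampled from the predicted and ground truth surfaces. Conventions differ (squared or unsquared distances, sum or mean of the two directions), and the text does not say which one is used, so the sketch below assumes the symmetric mean of unsquared nearest-neighbour distances and leaves out the ×10 scaling.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Brute force O(N * M); fine for small sets, use a KD-tree for large ones.
    """
    diff = a[:, None, :] - b[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # (N, M) pairwise distances
    a_to_b = dist.min(axis=1).mean()      # each point in a to nearest in b
    b_to_a = dist.min(axis=0).mean()      # each point in b to nearest in a
    return float(0.5 * (a_to_b + b_to_a))

points_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
points_b = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.2]])
cd = chamfer_distance(points_a, points_b)  # 0.1 for this pair
```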

Shape completion results on diverse objects from Automatica and YCB datasets

Results on Real Sensor Data

Evaluated on 25 objects with real Kinect scans
Ground truth alignment via global-to-local pose estimation with manual verification

Volumetric Metrics on Real Scans (CD ×10)

Model	CD ↓	IoU ↑	F1 ↑	Precision ↑	Recall ↑
Kinect Finetune	0.134	74.7	83.0	79.2	89.2
Kinect Scale	0.147	72.0	81.1	79.8	84.2
Kinect	0.128	74.2	82.5	81.3	85.3
Basic	0.143	71.2	80.4	82.3	80.7

Key findings: The Kinect Finetune model achieves the best IoU (74.7%), F1 (83.0%), and recall (89.2%), while Kinect achieves the lowest Chamfer distance (0.128) and Basic the highest precision (82.3%). Because of the small sample size and high variance (~0.05 CD, ~2.4% IoU), further investigation is needed to differentiate between the Kinect variants.

Qualitative Shape Completion Results

Qualitative results on the Automatica/YCB dataset. Our approach yields detailed object geometry from partial, noisy inputs. The completed meshes shown are produced by our Kinect Finetune model. The system successfully reconstructs a diverse range of household objects, including bottles, cans, bowls, and complex shapes.

Handling Object Pose Uncertainty

Without Pose Uncertainty

Grasp without considering pose uncertainty - finger on edge

With Pose Uncertainty

Comparing grasp predictions with and without considering object pose uncertainty during training data generation. When not accounting for pose uncertainty, fingers are sometimes placed close to or on object edges, which can cause the grasp to fail due to small positioning errors. By adapting our grasp planner's objective to diminish the quality of grasps that are prone to missing the object due to small pose deviations, fingers are placed further from edges, resulting in more robust grasps.
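The idea of penalizing grasps that are likely to miss the object can be illustrated with a one-dimensional toy model. This is not the planner's actual objective: the Gaussian pose error, the hit test on a single face of width 2 * half_width, and the multiplicative penalty are all assumptions made for the sketch.

```python
import numpy as np

def hit_fraction(contact_x: float, half_width: float, pose_sigma: float,
                 n_samples: int = 2000, seed: int = 0) -> float:
    """Fraction of sampled pose shifts for which the finger still hits the face."""
    rng = np.random.default_rng(seed)
    shifts = rng.normal(0.0, pose_sigma, size=n_samples)
    return float(np.mean(np.abs(contact_x + shifts) <= half_width))

def robust_quality(nominal_quality: float, contact_x: float,
                   half_width: float, pose_sigma: float) -> float:
    """Scale the nominal grasp quality by the probability of not missing."""
    return nominal_quality * hit_fraction(contact_x, half_width, pose_sigma)

# A 6 cm wide face with 5 mm pose uncertainty (hypothetical numbers).
edge_grasp = robust_quality(1.0, contact_x=0.029, half_width=0.03, pose_sigma=0.005)
center_grasp = robust_quality(1.0, contact_x=0.0, half_width=0.03, pose_sigma=0.005)
```

With equal nominal quality, the contact 1 mm from the edge is scored well below the centered contact, so a planner maximizing this score moves fingers away from edges.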

Handling Joint Configuration Ambiguity

Output	s	c
Single Output	7.85	-
Head 0	1.37	3%
Head 1	8.93	50%
Head 2	5.34	6%
Head 3	5.45	39%
Head 4	1.26	2%

Comparing single-output network (left) with multi-output network (right) on predicting the grasp for a butter box. The single-output network predicts a mixture of all modes, leading to one finger intersecting the object. The multi-output network with 5 heads can explicitly handle ambiguities: two modes are actively used (heads 1 and 3), corresponding to a 4-finger grasp and a 3-finger grasp respectively. The classification logits (c) indicate which modes are most prominent in the training data.
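Why a single regression output blends modes, and how multiple heads avoid it, can be shown with a small numeric example. The training loss is not described here; the sketch assumes a winner-takes-all scheme, a common choice for multi-hypothesis prediction, and uses two made-up joint configurations. Selecting the head with the highest c at inference is likewise an assumption.

```python
import numpy as np

def winner_takes_all_loss(head_predictions: np.ndarray, target: np.ndarray):
    """Return the squared error of the best head and that head's index.

    head_predictions: (num_heads, num_joints), target: (num_joints,).
    Only the winning head would receive a gradient during training.
    """
    errors = np.mean((head_predictions - target) ** 2, axis=1)
    winner = int(np.argmin(errors))
    return float(errors[winner]), winner

# Two equally valid joint configurations for the same hand pose (hypothetical).
modes = np.array([[0.0, 1.2], [1.2, 0.0]])

# A single output trained with mean squared error converges to the average.
single_output = modes.mean(axis=0)
single_error = float(np.mean((modes - single_output) ** 2))

# Two heads can each specialize on one mode.
heads = modes.copy()
multi_error = float(np.mean([winner_takes_all_loss(heads, t)[0] for t in modes]))

# At inference, pick the head with the highest c (values from the figure).
confidences = np.array([0.03, 0.50, 0.06, 0.39, 0.02])
selected_head = int(np.argmax(confidences))
```

The averaged single output matches neither valid configuration, which mirrors the finger intersecting the object in the left panel.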

Grasping Performance

We evaluate the grasping stage in simulation on 66 objects from the Automatica/YCB dataset. Each object is grasped 10 times with random rotations around the up-axis. Of the 660 grasps tested, 95.2% were successful.

The full pipeline of predicting 1000 grasps for a given object takes only about 1 second:

Shape completion: 0.7 seconds
Grasp prediction (1000 grasps): 0.3 seconds
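The figures above can be checked with a few lines of arithmetic. The number of successful grasps is not reported; the snippet only searches for integer counts out of 660 that round to 95.2%, so the resulting count is an inference.

```python
# Timing budget from the two stages.
completion_s = 0.7
prediction_s = 0.3
num_grasps = 1000
total_s = completion_s + prediction_s
per_grasp_ms = 1000.0 * prediction_s / num_grasps  # amortized cost per grasp

# Success counts consistent with the reported rate.
trials = 66 * 10
consistent_counts = [
    k for k in range(trials + 1)
    if round(100.0 * k / trials, 1) == 95.2
]
# consistent_counts == [628], i.e. 32 failed grasps
```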

All evaluated objects are unknown to the grasping network and not part of the training dataset. Even for unusual objects (e.g., stuffed toys), the completion only gradually breaks down, leading to locally correct completions that still allow for successful grasps.

BibTeX

@inproceedings{humt2023combining,
  title={Combining Shape Completion and Grasp Prediction for Fast and Versatile Grasping with a Multi-Fingered Hand},
  author={Humt, Matthias and Winkelbauer, Dominik and Hillenbrand, Ulrich and B{\"a}uml, Berthold},
  booktitle={IEEE-RAS International Conference on Humanoid Robots (Humanoids)},
  year={2023}
}