Abstract
Grasping objects with limited or no prior knowledge about them is a highly relevant skill in assistive robotics. Still, in this general setting, it has remained an open problem, especially under partial observability and for versatile grasping with multi-fingered hands.
We present a novel, fast, and high-fidelity deep learning pipeline consisting of a shape completion module that operates on a single depth image, followed by a grasp predictor that operates on the predicted object shape. The shape completion network is based on VQDIF and predicts spatial occupancy values at arbitrary query points. As grasp predictor, we use our two-stage architecture, which first generates hand poses using an autoregressive model and then regresses finger joint configurations per pose.
Critical factors turn out to be sufficient data realism and augmentation, as well as special attention to difficult cases during training. Experiments on a physical robot platform demonstrate successful grasping of a wide range of household objects based on a depth image from a single viewpoint. The whole pipeline is fast, taking only about 1 second to complete the object's shape (0.7 s) and generate 1000 grasps (0.3 s).
Pipeline Overview
Our pipeline consists of two steps during inference. From a partial point cloud of an object obtained through rendering (training) or a depth sensor (inference), we use a shape completion network based on VQDIF to predict the full object geometry implicitly as occupancy probabilities at query points on a grid. Once this network is trained, the predicted occupancy probabilities are used as input for the grasp predictor, both during training on ground truth grasps and during deployment on our robot Agile Justin.
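As a rough illustration of querying occupancy on a grid, the sketch below evaluates an occupancy function at regular grid points and keeps the points above a threshold. The function `sphere_occupancy` is a hypothetical stand-in for the trained network, and the grid resolution, bounds, and 0.5 threshold are assumptions, not values taken from the paper.

```python
import itertools
import math


def make_grid(resolution, lo=-0.5, hi=0.5):
    """Return query points on a regular resolution^3 grid in [lo, hi]^3."""
    step = (hi - lo) / (resolution - 1)
    axis = [lo + i * step for i in range(resolution)]
    return list(itertools.product(axis, axis, axis))


def sphere_occupancy(point, radius=0.3):
    """Hypothetical stand-in for the network: soft occupancy of a sphere."""
    dist = math.sqrt(sum(c * c for c in point))
    # Logistic falloff across the surface: close to 1 inside, close to 0 outside.
    return 1.0 / (1.0 + math.exp(40.0 * (dist - radius)))


def occupied_points(occupancy_fn, resolution=16, threshold=0.5):
    """Query the occupancy function on the grid and keep confident points."""
    return [p for p in make_grid(resolution) if occupancy_fn(p) > threshold]


points = occupied_points(sphere_occupancy)
print(f"{len(points)} of {16 ** 3} grid points classified as occupied")
```

In a setup like the one described above, the per-point probabilities themselves (rather than the thresholded points) would be handed to the grasp predictor; the thresholding here only makes the result easy to inspect.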
Architecture Details
Shape Completion
Grasp Prediction
Left: The shape completion network uses VQDIF to predict occupancy probabilities at continuous query points from a partial point cloud. Right: The two-stage grasp predictor first generates hand pose distributions using a generative network, then regresses finger joint configurations and grasp quality scores for each pose.
Realistic Depth Simulation
Comparison of depth rendering methods. Left: Standard depth rendering. Middle: Kinect depth simulation using an optimized Blensor model. Right: Real Kinect depth image. The simulation more closely resembles the real depth image, showing similar noise patterns and regions of missing data compared to the standard depth rendering.
Training Improvements
Effect of Kinect Simulation
Basic Depth Rendering
With Kinect Simulation
Effects of Kinect depth simulation. Training on unrealistic depth data leads to failure in challenging regions far away from input data and training distribution (left). Accurately simulating sensor characteristics resolves this (right).
Effect of Finetuning
Without Finetuning
With Finetuning
Effects of finetuning. The standard training methodology can lead to failure cases on challenging objects and viewpoints. Automated finetuning on difficult examples through importance sampling greatly reduces this problem.
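One simple way to emphasize difficult examples is to draw finetuning samples with probability proportional to their current loss. The sketch below assumes exactly that weighting; the example names, loss values, and the `floor` parameter are hypothetical, and the actual sampling scheme used for the results above may differ.

```python
import random


def sampling_weights(losses, floor=1e-3):
    """Turn per-example losses into normalized sampling probabilities.

    The floor keeps easy examples from being dropped entirely.
    """
    raw = [max(loss, floor) for loss in losses]
    total = sum(raw)
    return [w / total for w in raw]


def sample_finetune_batch(examples, losses, batch_size, rng):
    """Draw a batch (with replacement) that favors high-loss examples."""
    weights = sampling_weights(losses)
    return rng.choices(examples, weights=weights, k=batch_size)


rng = random.Random(0)
examples = ["mug_view_3", "bowl_view_7", "drill_view_1"]  # hypothetical names
losses = [0.05, 0.10, 0.85]  # hypothetical per-example losses
batch = sample_finetune_batch(examples, losses, batch_size=1000, rng=rng)
for name in examples:
    print(name, batch.count(name))
```

With these weights the hardest example dominates the batch, while the easier ones still appear occasionally.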
Results on Realistic Synthetic Data
- Evaluated on 6,700 simulated depth images from 67 novel objects (29 Automatica + 38 YCB)
- High-quality meshes from laser-scanned household objects
Volumetric Metrics
| Model | IoU ↑ | F1 ↑ | Precision ↑ | Recall ↑ |
|---|---|---|---|---|
| Kinect Finetune | 66.7 | 75.7 | 73.5 | 83.3 |
| Kinect Scale | 60.4 | 71.0 | 75.5 | 73.6 |
| Kinect | 58.0 | 68.9 | 75.7 | 70.6 |
| Basic | 49.5 | 61.1 | 74.0 | 59.5 |
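For reference, the volumetric metrics in this table can be computed from binary occupancy values as sketched below. This is a minimal version using the standard definitions; how occupancy is thresholded and how values are aggregated over objects are not specified here and are left out.

```python
def volumetric_metrics(pred, target):
    """Compute IoU, F1, precision, and recall for binary occupancy values.

    pred and target are equally long flat sequences of truthy/falsy values,
    one entry per query point.
    """
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"iou": iou, "f1": f1, "precision": precision, "recall": recall}


pred = [1, 1, 1, 0, 0, 0]
target = [1, 1, 0, 1, 0, 0]
metrics = volumetric_metrics(pred, target)
print(metrics)
```

Note that if such metrics are averaged per object (an assumption about the evaluation), the mean F1 generally differs from the harmonic mean of the mean precision and mean recall.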
Mesh Surface Metrics (CD ×10)
| Model | CD ↓ | F1 ↑ | Precision ↑ | Recall ↑ |
|---|---|---|---|---|
| Kinect Finetune | 0.258 | 41.5 | 43.4 | 40.2 |
| Kinect Scale | 0.276 | 43.4 | 44.6 | 42.8 |
| Kinect | 0.290 | 42.7 | 44.1 | 42.0 |
| Basic | 0.455 | 34.1 | 36.8 | 32.6 |
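The Chamfer distance (CD) compares points sampled from the predicted and ground truth surfaces. Several variants exist (squared or unsquared distances, summed or averaged directions); the sketch below assumes the unsquared, symmetric, averaged form and may not match the exact variant or scaling used for the table.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets.

    Averages the nearest-neighbor distance from A to B and from B to A.
    Brute force, so only suitable for small point sets.
    """
    def one_sided(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_sided(points_a, points_b) + one_sided(points_b, points_a))


a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(a, b)
print(cd)
```

For point sets of realistic size, a spatial index such as a k-d tree would replace the brute-force nearest-neighbor search.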
Results on Real Sensor Data
- Evaluated on 25 objects with real Kinect scans
- Ground truth alignment via global-to-local pose estimation with manual verification
Volumetric Metrics on Real Scans (CD ×10)
| Model | CD ↓ | IoU ↑ | F1 ↑ | Precision ↑ | Recall ↑ |
|---|---|---|---|---|---|
| Kinect Finetune | 0.134 | 74.7 | 83.0 | 79.2 | 89.2 |
| Kinect Scale | 0.147 | 72.0 | 81.1 | 79.8 | 84.2 |
| Kinect | 0.128 | 74.2 | 82.5 | 81.3 | 85.3 |
| Basic | 0.143 | 71.2 | 80.4 | 82.3 | 80.7 |
Key findings: The Kinect Finetune model achieves the best IoU (74.7%), F1 (83%), and recall (89.2%), while Kinect achieves the lowest Chamfer distance (0.128) and Basic the highest precision (82.3%). Because of the small sample size and high variance (~0.05 CD, ~2.4% IoU), further investigation is needed to differentiate between the Kinect variants.
Qualitative Shape Completion Results
Qualitative results on the Automatica/YCB dataset. Our approach yields detailed object geometry from partial, noisy inputs. The complete meshes correspond to our Kinect Finetune model. The system successfully reconstructs a diverse range of household objects including bottles, cans, bowls, and complex shapes.
Handling Object Pose Uncertainty
Without Pose Uncertainty
With Pose Uncertainty
Comparing grasp predictions with and without considering object pose uncertainty during training data generation. When not accounting for pose uncertainty, fingers are sometimes placed close to or on object edges, which can cause the grasp to fail due to small positioning errors. By adapting our grasp planner's objective to diminish the quality of grasps that are prone to missing the object due to small pose deviations, fingers are placed further from edges, resulting in more robust grasps.
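The idea of penalizing grasps that are sensitive to pose deviations can be illustrated with a one-dimensional toy model. The sketch below scales a nominal grasp quality by the fraction of sampled object offsets under which a finger contact still lands on the object. The geometry, the offsets, and the scaling rule are all assumptions for illustration; the grasp planner's actual objective is not reproduced here.

```python
def contact_hits(contact_x, object_center, half_width):
    """True if the contact point lies on an object of the given half width."""
    return abs(contact_x - object_center) <= half_width


def robust_quality(nominal_quality, contact_x, half_width, offsets):
    """Scale nominal quality by the fraction of pose offsets that still hit."""
    hits = sum(contact_hits(contact_x, off, half_width) for off in offsets)
    return nominal_quality * hits / len(offsets)


# Hypothetical object position deviations in metres.
offsets = [-0.01, -0.005, 0.0, 0.005, 0.01]
half_width = 0.05  # hypothetical object half width in metres

edge = robust_quality(1.0, contact_x=0.048, half_width=half_width, offsets=offsets)
center = robust_quality(1.0, contact_x=0.0, half_width=half_width, offsets=offsets)
print(f"edge contact: {edge:.2f}, center contact: {center:.2f}")
```

A contact near the edge loses quality because some offsets make it miss the object, while a contact near the center is unaffected, which mirrors the behavior described in the caption.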
Handling Joint Configuration Ambiguity
- Single Output: s = 7.85
- Head 0: s = 1.37, c = 3%
- Head 1: s = 8.93, c = 50%
- Head 2: s = 5.34, c = 6%
- Head 3: s = 5.45, c = 39%
- Head 4: s = 1.26, c = 2%
Comparison of a single-output network (left) with a multi-output network (right) when predicting the grasp for a butter box. The single-output network predicts a mixture of all modes, leading to one finger intersecting the object. The multi-output network with 5 heads can explicitly handle ambiguities: two modes are actively used (heads 1 and 3), corresponding to a 4-finger grasp and a 3-finger grasp, respectively. The classification logits (c) indicate which modes are most prominent in the training data.
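The sketch below uses the s and c values from the figure panels to show how the prominent heads could be picked out, and adds a toy example of why a single output blends modes. The 10% cutoff and the two joint angles are hypothetical, and the mode-averaging argument assumes a squared-error regression loss, which is not confirmed here.

```python
# Values as printed in the figure panels: per-head s and classification c.
heads = {
    0: {"s": 1.37, "c": 0.03},
    1: {"s": 8.93, "c": 0.50},
    2: {"s": 5.34, "c": 0.06},
    3: {"s": 5.45, "c": 0.39},
    4: {"s": 1.26, "c": 0.02},
}


def active_heads(heads, min_c=0.10):
    """Return head ids whose classification value reaches a hypothetical cutoff."""
    return [h for h, v in heads.items() if v["c"] >= min_c]


def most_likely_head(heads):
    """Return the head id with the highest classification value."""
    return max(heads, key=lambda h: heads[h]["c"])


print("active heads:", active_heads(heads))
print("most likely head:", most_likely_head(heads))

# Toy mode averaging: two equally frequent, valid joint angles (radians).
mode_a, mode_b = 0.2, 1.4
# Under a squared-error loss, the best single prediction is their mean,
# which matches neither valid configuration.
averaged = 0.5 * (mode_a + mode_b)
print("single-output prediction:", averaged)
```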
Grasping Performance
We evaluate the grasping stage in simulation on 66 objects from the Automatica/YCB dataset. Each object is grasped 10 times with random rotations around the up-axis. Of the 660 grasps tested, 95.2% were successful.
The full pipeline of predicting 1000 grasps for a given object takes only about 1 second:
- Shape completion: 0.7 seconds
- Grasp prediction (1000 grasps): 0.3 seconds
All evaluated objects are unknown to the grasping network and not part of the training dataset. Even for unusual objects (e.g., stuffed toys), the completion only gradually breaks down, leading to locally correct completions that still allow for successful grasps.
BibTeX
@inproceedings{humt2023combining,
title={Combining Shape Completion and Grasp Prediction for Fast and Versatile Grasping with a Multi-Fingered Hand},
author={Humt, Matthias and Winkelbauer, Dominik and Hillenbrand, Ulrich and B{\"a}uml, Berthold},
booktitle={IEEE-RAS International Conference on Humanoid Robots (Humanoids)},
year={2023}
}