Combining Shape Completion and Grasp Prediction for Fast and Versatile Grasping with a Multi-Fingered Hand

* Equal contribution
Pipeline overview: inference and training flow for shape completion and grasp prediction

Overview of our pipeline. During inference, a partial point cloud from a depth sensor is completed using VQDIF, then grasps are predicted using our two-stage architecture. During training, we generate realistic simulated Kinect depth images and use an analytical grasp planner to create ground truth labels. The complete pipeline takes only ~1 second from perception to grasp.

Abstract

Grasping objects with limited or no prior knowledge about them is a highly relevant skill in assistive robotics. In this general setting, however, it remains an open problem, especially under partial observability and when versatile grasping with multi-fingered hands is required.

We present a novel, fast, and high-fidelity deep learning pipeline consisting of a shape completion module that operates on a single depth image, followed by a grasp predictor that operates on the predicted object shape. The shape completion network is based on VQDIF and predicts spatial occupancy values at arbitrary query points. As the grasp predictor, we use our two-stage architecture, which first generates hand poses using an autoregressive model and then regresses finger joint configurations per pose.

Critical factors turn out to be sufficient data realism and augmentation, as well as special attention to difficult cases during training. Experiments on a physical robot platform demonstrate successful grasping of a wide range of household objects based on a depth image from a single viewpoint. The whole pipeline is fast, taking only about 1 second: 0.7 s for completing the object's shape and 0.3 s for generating 1000 grasps.

Pipeline Overview

Pipeline overview: from partial point cloud through shape completion to grasp prediction

Our pipeline consists of two steps during inference. From a partial point cloud of an object, obtained through rendering (training) or from a depth sensor (inference), a shape completion network based on VQDIF predicts the full object geometry implicitly as occupancy probabilities at query points on a grid. Once this network is trained, its predicted occupancy probabilities serve as input to the grasp predictor, both during training on ground truth grasps and during deployment on our robot Agile Justin.
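
As a rough illustration of querying an implicit occupancy function on a grid, the sketch below evaluates a stand-in function at regular grid points and keeps the points above a threshold. The `toy_occupancy` function (an analytic sphere with a soft boundary), the grid resolution, and the 0.5 threshold are assumptions made for this example. In the actual pipeline the probabilities would come from the trained shape completion network. Plain Python is used only to keep the sketch dependency-free.

```python
import itertools
import math


def toy_occupancy(x, y, z, radius=0.3):
    """Hypothetical stand-in for the network: soft occupancy of a sphere."""
    dist = math.sqrt(x * x + y * y + z * z)
    # Sigmoid of the signed distance: ~1 inside, ~0 outside, 0.5 on the surface.
    return 1.0 / (1.0 + math.exp((dist - radius) / 0.02))


def grid_points(resolution, lo=-0.5, hi=0.5):
    """Return the cell centers of a regular resolution^3 grid in [lo, hi]^3."""
    step = (hi - lo) / resolution
    axis = [lo + (i + 0.5) * step for i in range(resolution)]
    return itertools.product(axis, axis, axis)


def occupied_points(occupancy_fn, resolution=16, threshold=0.5):
    """Query occupancy at every grid point and keep points above threshold."""
    return [p for p in grid_points(resolution) if occupancy_fn(*p) > threshold]


points = occupied_points(toy_occupancy)
```

The resulting list of occupied grid points is one possible form in which a completed shape could be handed to a downstream grasp predictor. The source only states that occupancy probabilities on a grid are used as its input.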

Architecture Details

Shape Completion

Shape completion architecture: encoder processes point cloud, VQDIF decoder predicts occupancy probabilities

Grasp Prediction

Grasp prediction architecture: generative network for hand poses, joint predictor for finger configurations

Left: The shape completion network uses VQDIF to predict occupancy probabilities at continuous query points from a partial point cloud. Right: The two-stage grasp predictor first generates hand pose distributions using a generative network, then regresses finger joint configurations and grasp quality scores for each pose.
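
The control flow of such a two-stage predictor can be sketched as follows. This is a toy under stated assumptions, not the authors' network: `toy_conditional` and `toy_joint_regressor` are hypothetical stand-ins for the learned models, and the discretization into `NUM_BINS`, the six pose dimensions, and the twelve joints are illustrative choices that do not come from the source.

```python
import random

NUM_BINS = 8      # assumed discretization per pose dimension
POSE_DIMS = 6     # assumed: 3 translation + 3 rotation parameters
NUM_JOINTS = 12   # assumed number of finger joints


def toy_conditional(prefix):
    """Hypothetical stand-in for the generative network.

    Returns unnormalized bin weights for the next pose dimension given the
    dimensions sampled so far (here: prefer bins near the previous one).
    """
    center = prefix[-1] if prefix else NUM_BINS // 2
    return [1.0 / (1 + abs(b - center)) for b in range(NUM_BINS)]


def sample_pose(rng):
    """Stage 1: sample a pose one dimension at a time (autoregressively)."""
    pose = []
    for _ in range(POSE_DIMS):
        weights = toy_conditional(pose)
        pose.append(rng.choices(range(NUM_BINS), weights=weights)[0])
    return pose


def toy_joint_regressor(pose):
    """Stage 2 stand-in: map a pose to joint angles and a quality score."""
    joints = [0.1 * ((sum(pose) + j) % 5) for j in range(NUM_JOINTS)]
    score = 1.0 / (1 + abs(sum(pose) - POSE_DIMS * NUM_BINS / 2))
    return joints, score


def predict_grasps(num_grasps, seed=0):
    """Sample poses, regress joints per pose, and sort by predicted score."""
    rng = random.Random(seed)
    grasps = []
    for _ in range(num_grasps):
        pose = sample_pose(rng)
        joints, score = toy_joint_regressor(pose)
        grasps.append({"pose": pose, "joints": joints, "score": score})
    return sorted(grasps, key=lambda g: g["score"], reverse=True)


grasps = predict_grasps(1000)
```

Separating pose sampling from joint regression means many candidate poses can be drawn cheaply and then ranked by the predicted score.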

Realistic Depth Simulation

Comparison of standard depth rendering, Kinect simulation, and real Kinect data

Comparison of depth rendering methods. Left: Standard depth rendering. Middle: Kinect depth simulation using an optimized Blensor model. Right: Real Kinect depth image. The simulation more closely resembles the real depth image, showing similar noise patterns and regions of missing data compared to the standard depth rendering.
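
To make the two artifact types mentioned above concrete (noise and missing data), here is a deliberately simple sketch. It is not the optimized Blensor model used by the authors. It only assumes that noise grows with the square of the depth and that pixels next to depth discontinuities are lost. The function name and the parameters `noise_coeff` and `edge_jump` are hypothetical and uncalibrated.

```python
import random


def simulate_depth_artifacts(depth, noise_coeff=0.002, edge_jump=0.05, seed=0):
    """Add depth-dependent noise and drop pixels at depth discontinuities.

    depth: 2D list of depths in meters; 0.0 marks a missing measurement.
    noise_coeff and edge_jump are illustrative values, not calibrated ones.
    """
    rng = random.Random(seed)
    rows, cols = len(depth), len(depth[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            z = depth[r][c]
            if z <= 0.0:
                continue  # already missing
            # Drop pixels whose right or lower neighbor is much nearer/farther.
            neighbors = []
            if c + 1 < cols:
                neighbors.append(depth[r][c + 1])
            if r + 1 < rows:
                neighbors.append(depth[r + 1][c])
            if any(abs(z - n) > edge_jump for n in neighbors):
                continue
            # Noise standard deviation grows quadratically with depth.
            out[r][c] = z + rng.gauss(0.0, noise_coeff * z * z)
    return out


# A flat background at 1.0 m with a box at 0.6 m in the middle.
clean = [[0.6 if 2 <= r < 6 and 2 <= c < 6 else 1.0 for c in range(8)]
         for r in range(8)]
noisy = simulate_depth_artifacts(clean)
```

In this toy scene, pixels on the box boundary come back as 0.0 (missing), while interior and background pixels keep their depth with small perturbations.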

Training Improvements

Effect of Kinect Simulation

Basic Depth Rendering

Shape completion trained on basic depth rendering fails in challenging regions

With Kinect Simulation

Shape completion trained on Kinect simulation succeeds

Effects of Kinect depth simulation. Training on unrealistic depth data leads to failures in challenging regions that are far from the input data and outside the training distribution (left). Accurately simulating the sensor characteristics resolves this (right).

Effect of Finetuning

Without Finetuning

Shape completion without finetuning shows failures

With Finetuning

Shape completion with finetuning succeeds

Effects of finetuning. The standard training methodology can lead to failure cases on challenging objects and viewpoints. Automated finetuning on difficult examples through importance sampling greatly reduces these failures.
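
One simple way to focus finetuning on difficult examples is to sample training examples with probability proportional to their current loss. The sketch below illustrates that idea only. The source does not specify the exact sampling scheme, and the function names, the `temperature` and `floor` parameters, and the example losses are all hypothetical.

```python
import random


def importance_weights(losses, temperature=1.0, floor=1e-6):
    """Turn per-example losses into sampling probabilities.

    Higher loss -> higher probability. temperature and floor are
    illustrative knobs, not values from the source.
    """
    raw = [max(loss, floor) ** (1.0 / temperature) for loss in losses]
    total = sum(raw)
    return [w / total for w in raw]


def sample_finetuning_batch(example_ids, losses, batch_size, seed=0):
    """Draw a batch (with replacement) that favors difficult examples."""
    rng = random.Random(seed)
    probs = importance_weights(losses)
    return rng.choices(example_ids, weights=probs, k=batch_size)


ids = ["easy_a", "easy_b", "hard_c"]
losses = [0.05, 0.05, 0.90]  # hypothetical per-example losses
probs = importance_weights(losses)
batch = sample_finetuning_batch(ids, losses, batch_size=1000)
```

If an unbiased estimate of the average loss is needed, each sampled example's loss can be divided by N times its sampling probability. Whether the authors apply such a correction is not stated.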

Results on Realistic Synthetic Data

  • Evaluated on 6,700 simulated depth images from 67 novel objects (29 Automatica + 38 YCB)
  • High-quality meshes from laser scanned household objects

Volumetric Metrics

Model            IoU ↑   F1 ↑   Precision ↑   Recall ↑
Kinect Finetune  66.7    75.7   73.5          83.3
Kinect Scale     60.4    71     75.5          73.6
Kinect           58      68.9   75.7          70.6
Basic            49.5    61.1   74            59.5
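
For reference, the volumetric metrics in this table can be computed from predicted and ground truth occupancy as sketched below. This is a minimal sketch assuming both shapes are given as flat lists of booleans over the same query points. The function name and this data layout are choices made for the example.

```python
def volumetric_metrics(pred, target):
    """IoU, precision, recall and F1 between two boolean occupancy lists."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 1.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}


pred = [True, True, True, False, False, False]
target = [True, True, False, True, False, False]
metrics = volumetric_metrics(pred, target)
```

The reported F1 values do not equal the harmonic mean of the reported precision and recall (for Kinect Finetune, 73.5 and 83.3 would give about 78.1 rather than 75.7). This suggests, though the source does not say so, that each metric is averaged over objects separately.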

Mesh Surface Metrics (CD ×10)

Model            CD ↓    F1 ↑   Precision ↑   Recall ↑
Kinect Finetune  0.258   41.5   43.4          40.2
Kinect Scale     0.276   43.4   44.6          42.8
Kinect           0.29    42.7   44.1          42
Basic            0.455   34.1   36.8          32.6

Shape completion results on diverse objects from Automatica and YCB datasets
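
The surface metrics compare points sampled from the predicted and ground truth meshes. Chamfer distance has several variants in the literature (squared or unsquared distances, sum or mean of the two directions). The sketch below uses unsquared distances and averages the two directed means, which may differ from the authors' definition. The distance threshold `tau` for the surface F1 is likewise an assumed parameter.

```python
import math


def nearest_distances(src, dst):
    """Distance from each point in src to its nearest neighbor in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance as the average of both directed means."""
    d_pt = nearest_distances(pred, target)
    d_tp = nearest_distances(target, pred)
    return 0.5 * (sum(d_pt) / len(d_pt) + sum(d_tp) / len(d_tp))


def surface_f1(pred, target, tau):
    """Precision, recall and F1 of surface points at distance threshold tau."""
    d_pt = nearest_distances(pred, target)
    d_tp = nearest_distances(target, pred)
    precision = sum(d <= tau for d in d_pt) / len(d_pt)
    recall = sum(d <= tau for d in d_tp) / len(d_tp)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.2)]
cd = chamfer_distance(pred_points, target_points)
precision, recall, f1 = surface_f1(pred_points, target_points, tau=0.1)
```

The brute-force nearest-neighbor search is quadratic in the number of points. For real meshes a spatial index such as a k-d tree would normally be used instead.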

Results on Real Sensor Data

  • Evaluated on 25 objects with real Kinect scans
  • Ground truth alignment via global-to-local pose estimation with manual verification

Volumetric Metrics on Real Scans (CD ×10)

Model            CD ↓    IoU ↑   F1 ↑   Precision ↑   Recall ↑
Kinect Finetune  0.134   74.7    83     79.2          89.2
Kinect Scale     0.147   72      81.1   79.8          84.2
Kinect           0.128   74.2    82.5   81.3          85.3
Basic            0.143   71.2    80.4   82.3          80.7

Key findings: The Kinect Finetune model achieves the best volumetric metrics (IoU 74.7%, F1 83%) while Kinect achieves the lowest Chamfer distance (0.128). Due to small sample size and high variance (~0.05 CD, ~2.4% IoU), further investigation is needed to differentiate between Kinect variants.

Qualitative Shape Completion Results

Qualitative shape completion results on Automatica/YCB dataset showing diverse objects

Qualitative results on the Automatica/YCB dataset. Our approach recovers detailed object geometry from partial, noisy inputs. The completed meshes are produced by our Kinect Finetune model. The system successfully reconstructs a diverse range of household objects, including bottles, cans, bowls, and complex shapes.

Handling Object Pose Uncertainty

Without Pose Uncertainty

Grasp without considering pose uncertainty - finger on edge

With Pose Uncertainty

Grasp with pose uncertainty - fingers away from edges

Comparing grasp predictions with and without considering object pose uncertainty during training data generation. When not accounting for pose uncertainty, fingers are sometimes placed close to or on object edges, which can cause the grasp to fail due to small positioning errors. By adapting our grasp planner's objective to diminish the quality of grasps that are prone to missing the object due to small pose deviations, fingers are placed further from edges, resulting in more robust grasps.
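
The idea of penalizing grasps that are sensitive to small pose deviations can be illustrated with a toy one-dimensional example. This is not the authors' planner objective. It assumes the object is an interval of width `width`, the finger lands at `contact_x`, and the object pose error is a Gaussian offset with standard deviation `sigma`. All names and values are hypothetical.

```python
import random


def hits_object(contact_x, offset, width=0.10):
    """Toy 1D check: does the finger still land on the shifted object?"""
    return offset <= contact_x <= width + offset


def robust_quality(contact_x, nominal_quality, sigma=0.005,
                   samples=2000, seed=0):
    """Scale the nominal quality by the hit rate under sampled pose offsets."""
    rng = random.Random(seed)
    hits = sum(hits_object(contact_x, rng.gauss(0.0, sigma))
               for _ in range(samples))
    return nominal_quality * hits / samples


# Same nominal quality, but one contact is 1 mm from the edge.
edge_grasp = robust_quality(contact_x=0.001, nominal_quality=1.0)
center_grasp = robust_quality(contact_x=0.050, nominal_quality=1.0)
```

With equal nominal quality, the contact near the edge receives a clearly lower robust quality than the central one, so a planner maximizing this objective moves fingers away from edges.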

Handling Joint Configuration Ambiguity

Network        Head   s      c     Note
Single output  -      7.85   -     finger collision
Multi-output   0      1.37   3%
Multi-output   1      8.93   50%   4-finger grasp
Multi-output   2      5.34   6%
Multi-output   3      5.45   39%   3-finger grasp
Multi-output   4      1.26   2%

Comparing single-output network (left) with multi-output network (right) on predicting the grasp for a butter box. The single-output network predicts a mixture of all modes, leading to one finger intersecting the object. The multi-output network with 5 heads can explicitly handle ambiguities: two modes are actively used (heads 1 and 3), corresponding to a 4-finger grasp and a 3-finger grasp respectively. The classification logits (c) indicate which modes are most prominent in the training data.
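
A common way to train a multi-output network of this kind is a winner-takes-all loss: only the head closest to the ground truth receives the regression loss, and a classifier learns which head won. The source does not state the loss actually used, so the sketch below is an assumption that illustrates the general technique, with hypothetical function names.

```python
import math


def winner_takes_all_loss(head_joints, head_logits, target_joints):
    """Regression loss of the best head plus cross-entropy on its index."""
    errors = [sum((p - t) ** 2 for p, t in zip(joints, target_joints))
              for joints in head_joints]
    winner = min(range(len(errors)), key=errors.__getitem__)
    # Numerically stable log-sum-exp for the softmax normalizer.
    max_logit = max(head_logits)
    log_norm = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in head_logits))
    classification_loss = log_norm - head_logits[winner]
    return errors[winner] + classification_loss, winner


def select_head(head_logits):
    """At inference, pick the head the classifier considers most likely."""
    return max(range(len(head_logits)), key=head_logits.__getitem__)


# Two heads covering two valid joint values (0.0 and 1.0) for one joint.
loss, winner = winner_takes_all_loss(
    head_joints=[[0.0], [1.0]], head_logits=[0.0, 0.0], target_joints=[1.0])
```

A single-output network trained with a squared error on targets of 0.0 and 1.0 would be pulled toward 0.5, which is valid for neither mode. This mirrors the finger collision seen with the single-output prediction.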

Grasping Performance

We evaluate the grasping stage in simulation on 66 objects from the Automatica/YCB dataset. Each object is grasped 10 times with random rotations around the up-axis. Of the 660 grasps tested, 95.2% were successful.
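
As a quick check of these figures (the absolute success count is derived here from the rounded percentage and is not reported in the source):

```latex
66 \times 10 = 660 \ \text{grasps}, \qquad 0.952 \times 660 \approx 628 \ \text{successful grasps}
```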

The full pipeline of predicting 1000 grasps for a given object takes only about 1 second:

  • Shape completion: 0.7 seconds
  • Grasp prediction (1000 grasps): 0.3 seconds

All evaluated objects are unknown to the grasping network and are not part of the training dataset. Even for unusual objects (e.g., stuffed toys), the completion degrades only gradually, yielding locally correct completions that still allow for successful grasps.

BibTeX

@inproceedings{humt2023combining,
  title={Combining Shape Completion and Grasp Prediction for Fast and Versatile Grasping with a Multi-Fingered Hand},
  author={Humt, Matthias and Winkelbauer, Dominik and Hillenbrand, Ulrich and B{\"a}uml, Berthold},
  booktitle={IEEE-RAS International Conference on Humanoid Robots (Humanoids)},
  year={2023}
}