Deep Learning on photorealistic synthetic data

Let me preface this by encouraging you to keep reading regardless of your level of expertise in the field. I think the approach presented here is so general yet intuitive that it can benefit novices and experts alike while remaining supremely accessible.

Introduction

So, what is this and why is it exciting? You might be aware of the growing level of realism of computer-generated content in both the game and film industries, to the point where it is sometimes indistinguishable from the real world. If this is completely new to you, I would encourage you to give it a quick search online. You will be amazed by how much of a modern movie is actually not real but computer generated, to the point where only the actors’ faces remain (if at all).

Now, if you are reading this, chances are you are neither a cinematographer nor a game designer, so why should you care? Here is why I do: at work, I’m partly responsible for making our robots perceive the world. This is mostly done through images from cameras, and we use neural networks to extract meaning from them. But neural networks need to be trained, and they are not exactly quick learners. This means you need to provide tons of examples of what you want the network to learn before it can do anything useful. The most common tasks a robotic perception system needs to solve are object detection and classification, but sometimes we also need segmentation and pose estimation.

Instance Segmentation: Every pixel of every instance of each object, e.g. couch or chair, needs to be labeled in every image. The resulting instance segmentation masks can be visualized as semi-transparent overlays (hover over the image to see them). This insane amount of work leads to imperfect results: only a subset of all visible objects gets labeled, not all instances get labeled (one of the two vases in the bookshelf is missing) or they get lumped together (the lower rows of books in the shelf), masks are not pixel-perfect (the light blue mask of the armchair), and objects get wrong labels (the fireplace is labeled as tv, and the armchair in front is also labeled as couch).
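If you are wondering what such an overlay amounts to in code, here is a minimal sketch of one way to blend per-instance masks onto a photo. It assumes the image and an N x H x W array of boolean instance masks are already available; the file names and array layout are placeholders, not part of the original article.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical inputs: the photo and one boolean mask per labeled instance.
image = plt.imread("living_room.jpg").astype(float) / 255.0   # H x W x 3
instance_masks = np.load("instance_masks.npy")                # N x H x W, bool

overlay = image.copy()
rng = np.random.default_rng(0)
for mask in instance_masks:
    color = rng.random(3)                     # a random color per instance
    # Blend the instance color into the photo at 50% opacity where the mask is set.
    overlay[mask] = 0.5 * overlay[mask] + 0.5 * color

plt.imshow(overlay)
plt.axis("off")
plt.show()
```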

How do we get training data for these tasks? Well, depending on where you work and what your budget looks like, you might enlist friends, coworkers, students or paid workers online to draw those gorgeous bounding boxes around each object of interest in each image and additionally label them with a class name. For segmentation this becomes a truly daunting task, and for pose estimation you can’t even do it by any normal means1.
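To make the labeling effort concrete, a single hand-drawn detection annotation boils down to something like the record below. The field names loosely follow the COCO convention (with "bbox" as [x, y, width, height] in pixels); the actual values are invented for illustration.

```python
# One hand-made annotation for one object in one image (illustrative values).
annotation = {
    "image_id": 42,               # which photo this belongs to
    "category": "cup",            # the class name the annotator typed in
    "bbox": [311, 187, 96, 74],   # top-left corner plus width and height, in pixels
}
```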

Apart from the sheer tedium of forcing fellow human beings to do such boring work, annotators also get tired and make mistakes, resulting in wrong class labels, bounding boxes that are too large or too small, and forgotten objects. You probably see where this is going: what if we could automate this task by generating training data with pixel-perfect annotations in huge quantities? Let’s explore the potential and accompanying difficulties of this idea through a running example: the cup.

By the end of this article, we want to be able to detect the occurrence and position of this cup in real photographs (and maybe even do segmentation and pose estimation) without hand-annotating even a single training datum.

Making some data

Before we can make synthetic training data, we first need to understand what it is. It all starts with a 3D model of the object(s) we want to work with. There is a lot of great 3D modeling software out there, but we will focus on Blender because it is free, open source, cross-platform and, to be honest, simply awesome.
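As a first taste of what scripting Blender looks like, here is a minimal sketch that imports a model of the cup and renders a single image with Cycles. It assumes a Wavefront OBJ export of the cup exists; the file paths, resolution and operator version noted in the comments are assumptions, not the article’s actual pipeline.

```python
# Run inside Blender’s Python environment, e.g.:
#   blender --background --python render_cup.py
import bpy

scene = bpy.context.scene

# Import the cup model (operator valid for Blender 2.8x/3.x; 4.x uses bpy.ops.wm.obj_import).
bpy.ops.import_scene.obj(filepath="cup.obj")

# Use the physically based Cycles renderer for photorealistic output.
scene.render.engine = 'CYCLES'
scene.render.resolution_x = 640
scene.render.resolution_y = 480

# Render a single still image to disk.
scene.render.filepath = "/tmp/cup_render.png"
bpy.ops.render.render(write_still=True)
```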