Learning from Voxels

Welcome to part two of this four-part series on learning from 3D data. In the previous post we saw how to learn from point clouds, after some motivation why we would want to and which obstacles need to be overcome in order to do so. Here, we will have a look at another way to represent and work with 3D data, namely the voxel (volumetric element) grid. The agenda remains unchanged: first we need some background on the voxel representation, including its advantages and disadvantages. Then we have to understand how the learning actually works (spoiler: 3D convolutions) and how to put it to use. Finally, we see how to overcome some additional problems and maybe peek into some advanced ideas. A lot to do, but don’t worry, I’ll do my best not to get bogged down in the nitty-gritty and, as usual, there will be as many visualizations as reasonably justifiable.


If you haven’t come across voxel grids before, simply think Minecraft. In a voxel grid, everything is made up of equally sized cubes, the voxels. Below you see the same object—the Stanford Bunny from The Stanford 3D Scanning Repository—represented as a point cloud (left) and inside a voxel grid (right)1. More precisely, the second representation is referred to as an occupancy grid, where only the occupied voxels are displayed. It is easy to obtain from other representations like point clouds: store a binary variable for each voxel and set it to $1$ if the voxel contains at least one point. This corresponds to a black-and-white image in the 2D domain, i.e. there is a single channel (as opposed to three for the amount of red, green and blue in each pixel) and only two “colors”, or states, black ($0$) and white ($1$). You can drag to rotate and zoom in to reveal individual points and voxels.

Point cloud bunny
Voxel grid bunny
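To make the occupancy grid concrete, here is a minimal NumPy sketch of voxelizing a point cloud. The function name, the normalization into a unit cube, and the resolution are my own illustrative choices, not code from this series:

```python
import numpy as np

def occupancy_grid(points, resolution=32):
    """Convert an (N, 3) point cloud into a binary occupancy grid.

    A voxel is set to 1 if at least one point falls inside it.
    """
    # Normalize the cloud into the unit cube [0, 1].
    mins = points.min(axis=0)
    extent = (points.max(axis=0) - mins).max()
    normalized = (points - mins) / extent
    # Map each point to its voxel index, clipping points on the boundary.
    idx = np.clip((normalized * resolution).astype(int), 0, resolution - 1)
    grid = np.zeros((resolution, resolution, resolution), dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

# A random toy "cloud" standing in for the bunny.
cloud = np.random.rand(1000, 3)
grid = occupancy_grid(cloud, resolution=8)
```

With 1000 points distributed over an $8\times8\times8$ grid, most voxels end up occupied; a real scan of a surface would instead produce a thin occupied shell.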

While conceptually simple, this binary representation is not the only one. Just as each pixel in an image2 can be binary, grayscale (each pixel value lies between $0$ and $1$), RGB colored (three values, each between $0$ and $1$) or even RGB-D, where D corresponds to the depth, i.e. the distance to the sensor (so we are at four values per pixel now), each voxel in a voxel grid can be described by an arbitrary number of values, also called features. For example, instead of setting each voxel to $1$ as soon as a single point happens to be inside, we could store a value that grows with the number of points the voxel contains, so that a densely populated voxel gets a value close to $1$ while a voxel holding a single point gets a value close to $0$.

Another typical extension is normals, i.e. vectors orthogonal to the surrounding surface, which can be associated with each voxel by averaging the normal directions of all points residing within. There is an infinite number of things one can try, but the good news is that it doesn’t matter too much3. As soon as we throw a data representation at a deep neural network, it will extract its own features from it, and usually it does a much better job than we ever could on our own, which is the whole point of using them in the first place.
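A sketch of this normal averaging, under the assumption that per-point normals are already given (the helper name and the re-normalization to unit length are my own choices):

```python
import numpy as np

def voxel_normals(points, normals, resolution=32):
    """Average the normals of all points falling into each voxel."""
    mins = points.min(axis=0)
    extent = (points.max(axis=0) - mins).max()
    idx = np.clip(((points - mins) / extent * resolution).astype(int),
                  0, resolution - 1)
    acc = np.zeros((resolution,) * 3 + (3,))
    counts = np.zeros((resolution,) * 3)
    # Sum the normals and count the points per voxel.
    np.add.at(acc, (idx[:, 0], idx[:, 1], idx[:, 2]), normals)
    np.add.at(counts, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    # Average, then re-normalize to unit length where voxels are occupied.
    occupied = counts > 0
    acc[occupied] /= counts[occupied, None]
    norms = np.linalg.norm(acc[occupied], axis=-1, keepdims=True)
    acc[occupied] /= np.maximum(norms, 1e-12)
    return acc  # shape (R, R, R, 3): three feature channels per voxel

pts = np.random.rand(500, 3)
nrm = np.tile([0.0, 0.0, 1.0], (500, 1))  # toy normals, all pointing up
feat = voxel_normals(pts, nrm, resolution=4)
```

With all toy normals pointing up, every occupied voxel ends up with the feature vector $(0, 0, 1)$, while empty voxels stay at zero.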

Now that we are up to speed on voxel grids, let’s move on to the question of why we would want to use them as opposed to other kinds of 3D representations, and also why we would rather not.

Convolution! Out of memory…

It all starts with structure. As it turns out, structured information is not only good for computation, where stuff we want to access should be stored in adjacent blocks of memory to speed up retrieval, but it is also good for learning.

Convoluted information

Going back to images, we find that adjacent pixels are usually highly correlated while far away ones are not. This means knowing about one pixel provides some amount of information about its neighbors4. We can extract this neighborhood information by applying a convolution, i.e. a weight matrix, or kernel, to a patch of the image. As the “information” is represented by pixel values, performing arithmetic on those values, like multiplication and addition, corresponds to information processing, because different pixel and weight values produce different results. For example, keeping the weight matrix constant, as is done for inference, i.e. after the network is trained, we can extract “edge information” from the image by convolving it with an appropriate kernel (e.g. negative weights on the left and positive weights on the right to extract vertical edges).
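To see this in action, here is a naive NumPy sketch of a 2D convolution with a Prewitt-style vertical-edge kernel (negative weights in the left column, positive in the right, so uniform regions produce zero response). It is illustrative only; real frameworks use far faster implementations:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive valid cross-correlation (what deep learning frameworks
    call "convolution") of a 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# An image that is dark on the left and bright on the right ...
image = np.zeros((5, 5))
image[:, 3:] = 1.0
# ... and a Prewitt-style kernel: negative left column, positive right.
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)
response = conv2d(image, vertical_edge)
```

The response is zero over the flat dark region and large wherever a patch straddles the dark-to-bright transition, which is exactly the “vertical edge information” we were after.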

Crucially, one filter can extract the same information from anywhere in the image, meaning we only need one “vertical edge detection kernel” to extract all vertical edges5. If you wonder why we can’t apply the same trick to point clouds, have a look at this section from the previous article. Hint: point clouds are unstructured due to varying density and permutation invariance. Luckily though, we can apply convolutions to voxel grids, as they are simply three-dimensional extensions of 2D pixel grids, i.e. images.

Some numbers

In the simplest case, you have a grayscale “image”, say of size $5\times5$, and a convolutional layer with a single filter, e.g. of size $3\times3$. Each filter has as many kernels as the input has channels, so in this case one. Adding a third input dimension changes almost nothing. Instead of a plane of $5\times5$ pixels we now have a cube of $5\times5\times5$ voxels. Our filter also becomes three dimensional, i.e. $3\times3\times3$. The resulting feature map, i.e. the result of convolving the filter with the input, is of size $5\times5$, assuming a stride of one and zero padding of one pixel in case of the image, and of size $5\times5\times5$ for the voxel grid (where the zero padding is depicted as empty voxels). Easy.
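The same arithmetic can be checked with a naive single-channel 3D convolution in NumPy (stride one; the averaging kernel and function name are my own illustrative choices):

```python
import numpy as np

def conv3d(volume, kernel, padding=1):
    """Naive 3D cross-correlation with zero padding and stride one."""
    volume = np.pad(volume, padding)  # zero padding on all six faces
    k = kernel.shape[0]
    d = volume.shape[0] - k + 1
    out = np.zeros((d, d, d))
    for x in range(d):
        for y in range(d):
            for z in range(d):
                out[x, y, z] = np.sum(
                    volume[x:x + k, y:y + k, z:z + k] * kernel)
    return out

voxels = np.random.rand(5, 5, 5)   # a 5x5x5 single-channel grid
kernel = np.ones((3, 3, 3)) / 27   # a 3x3x3 averaging filter
out = conv3d(voxels, kernel, padding=1)
print(out.shape)  # (5, 5, 5): zero padding preserves the spatial size
```

Dropping the padding (`padding=0`) shrinks the output to $3\times3\times3$, following the usual formula $(n + 2p - k)/s + 1$.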