A sense of uncertainty

In this second part of my informal three part mini-series on probabilistic machine learning, we will be looking at Bayesian Neural Networks, i.e. the result of probability theory taking an interest in deep learning. Be sure to also have a look at the first part, especially if you are unfamiliar with the basics of probability theory. As usual, you can have a look at the code I used to generate the figures for this article and also play around with it Binder.

Some background

Let’s start with the topics that won’t be covered but for which I’ll supply some resources so you can brush up your knowledge if needed (just click on the small black arrow below). If you are like me and long resource lists give you anxiety because you feel obliged to read, watch and understand all of it before you can even start reading the article (which often results in an infinite regression into the depth of the Internet), don’t. Just pick whatever looks interesting or especially unclear or simply start reading the article and come back to the resources if something doesn’t make sense.

Show resources
  1. Linear algebra & calculus: Okay, I know, you see this everywhere and for me at least, it always feels discomforting. What is it supposed to mean anyway? Do I need to know all of linear algebra and calculus to understand anything? And what does “know” mean? That I can solve matrix multiplications, determinants, Eigenvectors and 10th degree derivatives by hand in a few seconds? That I can proof the fundamental equations that underlie those fields? I don’t think so.

    Usually, and this is also true here, it just means that you have an intuitive understanding of what is happening when multiplying a vector and a matrix or what a 2nd order derivative represents. Luckily, this kind of understanding can be obtained conveniently and even enjoyably by watching the following three video series (by one of my YouTube idols 3Blue1Brown who we will probably encounter again and again throughout this section and even throughout this blog):

  2. Probability theory: As you might have expected from the introduction, where there is Bayes, probability theory can’t be far. Again, an intuitive understanding will suffice to grasp what’s going on. Consider having a look at my article on the topic, which is intended specifically as a primer to probabilistic machine learning.
  3. Neural Networks: This is the second ingredient next to probability theory you need to construct a Bayesian Neural Network. 3Blue1Brown one more time.
  4. Machine Learning: Not strictly needed, but so cool that I need to share it. A visual introduction to machine learning: Part 1 and 2

What is this and why bother?

The first question coming to mind if confronted with the concept of a Bayesian Neural Network is, why even bother? My standard Neural Networks are working just fine, thank you! The answer is, that the world itself is inherently uncertain and a run-of-the-mill Neural Net has no idea what it’s talking about when it classifies your cat as a Ferrari with $99.9\%$ certainty.

When confronted with a difficult problem like “What did you eat on Monday two weeks ago?” you will probably preface whatever answer comes to mind with a “I’m not quite sure but I think…” or “It could have been…”. A standard Neural Net can’t do this. It’s more likely to conclude “She often eats spaghetti, so that’s what it was!”

A note for the critical among you: You might object that even a standard Neural Network returns a score for each class it’s predicting and you might be tempted to treat those numbers as probabilities of being correct, but there are at least two problems:

  1. Theoretical: Simply squishing an arbitrary collection of numbers through a Softmax function doesn’t magically produce real probabilities.
  2. Practical: It has been observed time-and-again now, that modern deep Neural Networks are overconfident (a notion we will come back to soon) such that the “confidence” expressed by the “probabilities” of the output layer don’t match the networks empirical frequency of being correct. In other words: A prediction of $0.7$ or $70\%$ for the class cat does not translate into $70$ out of $100$ cat images being classified correctly.

The real world is ambiguous and uncertain in all sorts of ways due to extremely complex interactions of a large number of factors (think, e.g., weather forecasting) and because we only ever observe it through some kind of interface: a camera, a microphone, our eyes. Those interfaces, usually called “sensors” in robotics, have their own problems like struggling with low light or transmitting corrupted information. An agent, be it a biological or artificial, must takes those uncertainties into account when operating within such an environment.

Uncertainty flavors

Usually, uncertainty is put into two broad categories which makes it easier to think about and model. The first, often called model uncertainty1, is inherent to the model (or agent) and describes its ignorance towards its own stupidity. A standard neural net is maximally ignorant in that it chooses one, most likely way of explaining everything—which translates into one specific set of parameters or weights—and then runs with those.

A standard neural network with specific weights [source].

This is equivalent to an old person having figured out the answers to all the important questions and being impossible to convince otherwise. A Bayesian Neural Network, just as a biological Bayesian (the person), works differently. It considers all possible ways of looking at the problem2 and weighs them by the amount of evidence it has observed for each of those ways. It then integrates them into one coherent explanation. We will see what that looks like in practice a bit later. You can think about this as having a probability distribution for each weight (the little squiggly lines in yellow below) which determines the likely and less likely values each weight can take on. Usually, we have one multi-dimensional probability distribution for the entire network (where the number of dimensions equals the number of weights), also capturing (some of) the covariances between the weights.

A Bayesian neural network with distributions over weights [source].

The second type of uncertainty is commonly referred to as data uncertainty3 and it’s exactly what it sounds like: is the information provided by the data clearly discernible or not? You might think about a fogy night in the forest where you’re trying to convince yourself, that this moving shape is just a branch of a tree swaying in the wind. You can look at it hard and from multiple angles, possibly reducing your uncertainty about the thing (model uncertainty) but you can’t change the fact that it’s night, foggy and your eyes simply aren’t cut for this kind of task (data uncertainty). This also sheds light onto the fact that model uncertainty can be reduced (with more data) but data uncertainty cannot (as it’s inherent to the data).

Below are some examples of data with low data uncertainty—the images are of good quality and the animals are clearly visible—but with a great potential for model uncertainty—though different, the animals look very similar, so we would need many examples to learn to differentiate between them.

Finally, both uncertainty flavors can be combined into an overall uncertainty about your decision: the predictive uncertainty. This is usually what one refers to when speaking about the topic of uncertainty and it is often simpler to obtain than the former two.

Modeling uncertainty

Now that we are certain about our need of uncertainty, we need to express it somehow. The only reason a human being doesn’t need a blueprint to do so is, that it has been indirectly hammered in by evolution and experience. In the sciences, this is done through the language of probability theory.

Before we can go any further, we need to sharpen up our vocabulary used to refer to specific things. Let’s first introduce our main protagonist: The neural network. It’s getting a bit more technical now, so feel free to review some of the necessary background knowledge if you’re struggling to follow.


A neural network is a non-linear mapping from input $\boldsymbol{x}$ to (predicted) output (or target) $\boldsymbol{\hat{y}}=f_W(\boldsymbol{x})$, parameterized by model parameters (or weights) $W$, where we assume the true target $\boldsymbol{y}$ was generated from our deterministic function $f$ plus noise $\epsilon$ such that $\boldsymbol{y}=f_W(\boldsymbol{x})+\epsilon$. The entirety of inputs and outputs is our data \(\mathcal{D}=\{(\boldsymbol{x}_i,\boldsymbol{y}_i)\}_{i=1}^N=X,Y\), i.e we have $N$ pairs of input and output where all the inputs are summarized in $X$ and all the outputs are summarized in $Y$. Bold symbols denote vectors while UPPERCASE symbols are matrices.

Excursus: Images

In our case, the inputs are images and the outputs are vectors of scalars, one for each possible class (or label) the network can predict (e.g. cat and dog), so our network provides a mapping from a bunch of real numbers (the RGB values of the pixels of the image) to a number of classes. This means we are dealing with a classification rather than a regression problem.

If this all makes perfect sense to you, skip ahead to the next section, otherwise, let’s quickly explore how a computer can “see” images to then tell us what’s there to be seen. As computers can only deal with bits, everything we throw at them needs to be in this format. If we are working with numbers, that’s easy to do, but images?

The first thing we need to understand is, that an image is made up of pixels. And by understand I don’t mean to know the fact that this is so, but to grasp the implications of it. Each pixel has a name (or ID) and a color. The name is its position, usually given by a $x$, $y$ coordinate in a grid with rows and columns, but can also be a single number, if an additional order, e.g. from top left to bottom right is given. The color is usually made up of a mixture of red, green and blue (RGB), one set of primary colors from which we can mix every other color. So if we are given a $32\times 32$ pixel image, what we actually work with are $5$ numbers per pixel, the row and column coordinates as well as the red, green and blue color values, times the number of pixels which turns out to be $32\times32\times5=5120$ numbers.

If you hover over the first image below, the position and RGB color values of the currently selected pixel will appear. Due to the low resolution ($128\times128$), the pixels are already visible. By selecting a portion of the image (you can click and drag a rectangle), they will become even more salient. On the right, I’ve split the image further into it’s color channels (red, green, blue). By overlaying them, the original colors appear4, but if you rotate the image (by clicking and dragging), you will notice, that in fact there are only different shades of the primary colors. You can also zoom in here (by using your mouse wheel) to expose the individual pixels again.