Looking for Lucy

You are on a large ship looking for your friend Lucy. You already have some suspicions as to her whereabouts:

1. You expect her to be on the ship and not in the water (though possible, you deem it unlikely)
2. It is 1 pm, a time when she likes to eat lunch, so there is a good chance she’s in the ship’s restaurant in the middle of the ship.$^1$

Probabilities and Priors

In probability theory, beliefs for which we have not yet seen any evidence are called prior beliefs, or simply priors. We’ve also implicitly established the parameter or variable we are trying to estimate: Lucy’s location. Let’s call it $\ell$. We can then write “I think Lucy is more likely to be on the ship than in the water” as $p(\ell=\mathrm{ship})=0.99$ and $p(\ell=\mathrm{water})=0.01$.

This reads: “The probability of $\ell$ taking the value ship is $99\%$” (and $1\%$ for water respectively). We can often drop the variable name whenever it takes on a specific value, like ship, so $p(\ell=\mathrm{ship})$ simply becomes $p(\mathrm{ship})$, to reduce clutter. Two more things to note here:

1. We’ve just defined a probability distribution of $\ell$, written $p(\ell)$, which maps each possible state of $\ell$ (ship and water) to a discrete probability ($99\%$ and $1\%$). In doing so, we promoted $\ell$ from a mere variable to a random variable. Why random? Because its value is not deterministic like $x=2+2$, but probabilistic, determined by some underlying random (or random-seeming) cause. And because Lucy can only either be on the ship or not, the probabilities of these two states need to sum to 1 (or $100\%$).
2. She also can’t be a little bit on the ship and a little bit in the water (well, technically she probably could, but let’s keep it simple) which is why we call it a discrete probability distribution rather than a continuous one, where everything in between certain values is also possible.
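The discrete distribution $p(\ell)$ can be sketched as a plain mapping from states to probabilities. The name `p_location` is ours, chosen only for illustration; the numbers are the priors from above.

```python
import math

# Discrete distribution p(l): each possible state maps to a probability.
# The name p_location is illustrative; the values are the priors from the text.
p_location = {"ship": 0.99, "water": 0.01}

# A valid distribution has non-negative entries that sum to 1.
assert all(p >= 0 for p in p_location.values())
assert math.isclose(sum(p_location.values()), 1.0)

print(p_location["ship"])  # p(l = ship)
```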

Continuous, discrete, conditional

Let’s define such a continuous probability distribution for Lucy’s location on the ship. We could of course start enumerating all locations (restaurant, toilet, her room, sun deck, …), but because she could be anywhere on the ship, that is tedious at best and impossible at worst. We want to keep using $\ell$ to distinguish between ship and water, so let’s use $e$ to talk about Lucy’s exact location on the ship. What do you think this means:

$p(e\vert \mathrm{ship})$?

It’s the probability distribution of Lucy’s exact location given she’s on the ship. The little bar $\vert$ means given or if, as in “the probability that it will rain if there are clouds”, $p(\mathrm{rain}\vert \mathrm{clouds})$. We call this a conditional probability distribution, because it depends on Lucy being on the ship. Therefore $p(e\vert \mathrm{water})=0$. No point in establishing an exact location if she’s in the ocean. Let’s say the ship is $50$ meters long, so Lucy could be anywhere between $e=0$ and $e=50$. So what’s $p(e=27\vert \mathrm{ship})$? Maybe surprisingly, it’s $0$. Why? Because a continuous random variable can take infinitely many values, so the probability of it hitting any single exact value is zero. Only intervals of values carry probability.

Interluding intervals: To understand this, think about forecasting the temperature for the next day. What’s the probability that it will be between $-40^\circ C$ and $+40^\circ C$? Probably close to $100\%$ (though never actually $100\%$). Between $0^\circ C$ and $30^\circ C$? Still quite high, say $70\%$ (the exact numbers don’t matter here). Between $25$ and $27$? Okay, that’s much, much less likely, let’s go with $2\%$. $25.1$ and $25.2$? Almost zero. $25.345363$ and $25.345364$? You get the point. So when talking about probabilities in the continuous case, always think in intervals.
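To see the shrinking intervals in numbers, here is a minimal sketch. It assumes, purely for illustration, that tomorrow’s temperature follows a normal distribution with mean $15^\circ C$ and standard deviation $10^\circ C$. Those parameters are made up, so the printed values will not match the rough guesses above; only the trend matters.

```python
from statistics import NormalDist

# Assumed forecast: temperature ~ Normal(mean 15 °C, std 10 °C).
# These parameters are invented for illustration, not taken from the text.
forecast = NormalDist(mu=15, sigma=10)

def interval_probability(dist, low, high):
    """Probability that the value falls between low and high."""
    return dist.cdf(high) - dist.cdf(low)

intervals = [(-40, 40), (0, 30), (25, 27), (25.1, 25.2), (25.345363, 25.345364)]
for low, high in intervals:
    p = interval_probability(forecast, low, high)
    print(f"p({low} <= T <= {high}) = {p:.8f}")
```

The narrower the interval, the smaller the probability, and it approaches zero as the interval collapses to a single value.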

If we believe that Lucy is somewhere in the middle of the ship, provided she is in fact on it, we could write it like this: $p(20\leq e\leq30\vert \mathrm{ship})=0.8$. Because this looks daunting, let’s call it $p(\mathrm{middle}\vert \mathrm{ship})$ instead.

The red line is called a probability density function and it describes our continuous probability distribution $p(e\vert\mathrm{ship})$. If the area under the curve between $e=20$ and $e=30$ is the probability that Lucy is in the middle of the ship, what’s on the vertical axis, you might ask? This is where the density comes into play. Just as the mass of a physical object is determined by its volume and density,$^2$ probability mass is determined by an area (in 2D) or volume (in 3D)$^3$ together with the density on the vertical axis.

If you are familiar with sums and integrals, it might be helpful to look at it this way: In the discrete case, where you simply enumerate all possible locations and attach your belief to each of them, you sum them up to get an overall estimate. Because Lucy can’t be in two places at once, this can be written as $p(\mathrm{restaurant}\cup\mathrm{room})=p(\mathrm{restaurant})+p(\mathrm{room})$, where the symbol $\cup$ means or. Imagine now you discretize the ship into smaller and smaller parts. In the limit, you have covered every micrometer of the ship with an infinite number of discrete probabilities, which is exactly how you can estimate the integral of a function, i.e. the area under the curve.
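The idea of slicing the ship into ever smaller parts can be tried out directly. The sketch below assumes that $p(e\vert\mathrm{ship})$ is a Gaussian centred at $e=25$, with its width chosen so that about $80\%$ of the mass lies between $e=20$ and $e=30$. The Gaussian shape and the helper `riemann_sum` are our assumptions, not something the text prescribes.

```python
from statistics import NormalDist

# Assumption: p(e | ship) is a Gaussian centred mid-ship (e = 25 m).
# sigma is chosen so that about 80% of the mass lies in 20 <= e <= 30.
# (A Gaussian also puts a negligible sliver of mass outside 0..50 m.)
z_90 = NormalDist().inv_cdf(0.9)
density = NormalDist(mu=25, sigma=5 / z_90)

def riemann_sum(pdf, low, high, n_slices):
    """Approximate the area under pdf between low and high with n_slices bars."""
    width = (high - low) / n_slices
    # Midpoint rule: each bar is as tall as the density at the slice's centre.
    return sum(pdf(low + (i + 0.5) * width) * width for i in range(n_slices))

exact = density.cdf(30) - density.cdf(20)
for n in (1, 2, 10, 1000):
    print(n, riemann_sum(density.pdf, 20, 30, n))
print("exact area:", exact)
```

With only a few slices the sum is a crude estimate; with many slices it converges to the area under the curve, i.e. $p(\mathrm{middle}\vert\mathrm{ship})$.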

What if we want to take into account our prior beliefs about whether she’s on the ship or not? We multiply! If we’re $90\%$ certain that she’s on the ship and $80\%$ certain that she’s in the middle of it ($20\leq e\leq 30$) if she’s on it, then our overall belief for this scenario is $0.9\cdot 0.8 = 0.72$.$^4$ We call this a joint probability distribution because it expresses our beliefs about two quantities at the same time: That Lucy is on the ship and in the middle of it. Using the quantities introduced earlier we can write it as:

$p(\mathrm{middle},\mathrm{ship})=p(\mathrm{middle}\vert\mathrm{ship})\cdot p(\mathrm{ship})$

See the little comma? That’s all there is to it. Why is this useful? Because often you only have information (or beliefs) about the individual statements but not about both of them together, or the other way round. For example, what’s the probability that Lucy is in the ship’s restaurant and eating pizza? That’s kind of hard to reason about. It’s easier to think about her being in the restaurant, which you deem likely, say $70\%$, and about her eating pizza if she’s indeed in the restaurant. Say there are two different meals, pizza and spaghetti, and you know Lucy has a slight preference for the former, so $p(\mathrm{pizza}\vert\mathrm{restaurant})=0.6$ (and therefore $p(\mathrm{spaghetti}\vert\mathrm{restaurant})=0.4$ as those are the only two options). Now you can say:

$\begin{aligned}p(\mathrm{restaurant},\mathrm{pizza})&=p(\mathrm{pizza}\vert\mathrm{restaurant})\cdot p(\mathrm{restaurant})\\&=0.6\cdot0.7=0.42\end{aligned}$
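Both joint probabilities computed so far follow the same pattern: a conditional belief times a prior. A minimal sketch, with `joint` as a helper name of our own choosing:

```python
import math

def joint(conditional, prior):
    """Product rule: p(a, b) = p(a | b) * p(b)."""
    return conditional * prior

# p(middle, ship) = p(middle | ship) * p(ship)
p_middle_and_ship = joint(conditional=0.8, prior=0.9)
# p(restaurant, pizza) = p(pizza | restaurant) * p(restaurant)
p_restaurant_and_pizza = joint(conditional=0.6, prior=0.7)

print(round(p_middle_and_ship, 2))       # 0.72
print(round(p_restaurant_and_pizza, 2))  # 0.42
```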

Independence

In our example, Lucy can’t be in the middle of the ship if she’s in the ocean. Therefore, both statements depend on each other, i.e. having a belief about one influences the other. That’s not always the case though. Saturn is in line with Venus. What can you derive from this knowledge for your personal life? Nothing. Those statements are independent, so your probability of being happy $p(\mathrm{happy})$ stays the same, regardless of what Saturn (S) and Venus (V) are up to. This also means we are allowed to do the following:

$\begin{aligned}p(\mathrm{happy},S\leftrightarrow V)&=p(\mathrm{happy}\vert S\leftrightarrow V)\cdot p(S\leftrightarrow V)\\&=p(\mathrm{happy})\cdot p(S\leftrightarrow V)\end{aligned}$

The important part is the transition from the first to the second line: the conditional probability has been replaced by a plain one.$^5$ This is very useful, e.g. in machine learning, as it makes many calculations a lot easier.
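Independence can also be checked numerically: two variables are independent if every entry of their joint distribution equals the product of the two individual probabilities. The sketch below uses made-up numbers for the planets example ($p(\mathrm{happy})=0.7$ and a $5\%$ chance of alignment are assumptions) and the ship numbers used earlier ($p(\mathrm{ship})=0.9$, $p(\mathrm{middle}\vert\mathrm{ship})=0.8$). The function `is_independent` is a hypothetical helper.

```python
import math

def is_independent(joint):
    """Check p(a, b) == p(a) * p(b) for every cell of a joint table.

    joint maps (a, b) pairs to probabilities and must list every combination.
    """
    p_a, p_b = {}, {}
    for (a, b), p in joint.items():
        p_a[a] = p_a.get(a, 0.0) + p   # marginal of the first variable
        p_b[b] = p_b.get(b, 0.0) + p   # marginal of the second variable
    return all(math.isclose(p, p_a[a] * p_b[b], abs_tol=1e-12)
               for (a, b), p in joint.items())

# Assumed numbers for the planets example (not from the text).
p_happy, p_aligned = 0.7, 0.05
planets = {
    (h, s): ph * ps
    for h, ph in (("happy", p_happy), ("unhappy", 1 - p_happy))
    for s, ps in (("aligned", p_aligned), ("not aligned", 1 - p_aligned))
}

# Ship example: p(ship) = 0.9, p(middle | ship) = 0.8, p(middle | water) = 0.
ship = {
    ("middle", "ship"): 0.72, ("elsewhere", "ship"): 0.18,
    ("middle", "water"): 0.0, ("elsewhere", "water"): 0.1,
}

print(is_independent(planets))  # True
print(is_independent(ship))     # False
```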

Choosing a team: There is another interesting observation to be made here: Those probabilities we’ve chosen are, for the most part, beliefs about how the world is. We therefore make use of Bayesian statistics instead of Frequentist statistics, where probabilities are exclusively seen as intrinsic properties of the world to be measured. A fair coin, for example, has a $50\%$ chance of landing either heads or tails, and you can find out about this fact through the observation of repeated experiments. We’ll come back to the distinction between those two views on probabilities in future posts when talking about calibration.

More dimensions and variance

Now there is actually a better way to represent our beliefs about Lucy’s exact location on the ship using a two-dimensional probability distribution! Instead of only saying where we expect her to be from back to front, i.e. rear to bow, we can now also express our belief about her position from left to right, i.e. port to starboard. Let’s call them $e_x$ for rear-bow and $e_y$ for port-starboard position.

Because the ship is longer than it is wide, there are more possible locations for Lucy along its length.$^6$ This can clearly be seen in the spread of the distribution, which is more stretched out in the $e_x$ direction. This spread is also called the variance of the probability distribution. The variance directly translates into our uncertainty about Lucy’s location. This is what it looks like from above:

Yellow signifies likely areas while purple areas are unlikely.$^7$ The lines connect coordinates of equal density, just as the lines on a map connect coordinates of equal altitude. Try rotating the figure to get a better understanding.
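A rough numerical counterpart to the figure, as a sketch under stated assumptions: we model the belief as two independent Gaussians, one along the ship and one across it. The centre and the standard deviations (8 m and 2 m) are hypothetical values picked only to make the spread along $e_x$ larger than along $e_y$.

```python
from statistics import NormalDist

# Hypothetical numbers: belief centred mid-ship, wider along the ship's length.
along = NormalDist(mu=25, sigma=8)   # e_x: rear-bow position in metres
across = NormalDist(mu=0, sigma=2)   # e_y: port-starboard offset from the centre line

def density(e_x, e_y):
    """2D probability density, assuming e_x and e_y are independent."""
    return along.pdf(e_x) * across.pdf(e_y)

print(along.variance, across.variance)      # larger variance = more uncertainty
print(density(25, 0))                       # peak of the distribution
print(density(25 + 4, 0), density(25, 4))   # same 4 m step, different drop in density
```

Stepping 4 m along the ship barely lowers the density, while the same step across it lowers the density a lot, which is why the contour lines are stretched in the $e_x$ direction.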

Another side effect of our new 2D distribution is that we now simultaneously express our belief about Lucy being in the ocean, so we can do away with our additional variable $\ell$! You can also think about what it means that there is more probability mass near the ship than further away from it.$^8$

Bayes’ Theorem

The final story I’d like to tell is this one: Suppose you ask another passenger if he has seen a hungry-looking woman recently and he tells you that, while he couldn’t tell if she was hungry, he did speak to a woman called Lucy at the rear of the ship! What a coincidence. Such information is called evidence as it tells us something about the parameters we want to model and estimate, namely Lucy’s location. Let’s give this particular piece of information a name: $I$.

How should you deal with the new information? Intuitively you might think it’s a settled case. You are looking for Lucy and there is a Lucy at the rear of the ship. This, however, would only be true if you were $100\%$ certain that the Lucy in question is in fact your friend.

Instead you need to ask the question: If this new information were true, how should it influence my belief? We can’t answer this question right away, so let’s first ask another question: How likely is it to obtain this new information if my current belief about the world is true? In our case, how likely is it that someone might have encountered a Lucy at the rear of the ship (evidence or information) if the Lucy I’m looking for is in the restaurant in the middle of the ship (prior or current belief)? The quantity we are talking about is a special kind of conditional probability and is called the likelihood. We’ll soon see how it’s used, but first we need to estimate it.

Let’s start by visualizing the total space of possibilities by a square with side length 1:

Why? Because we can use it to visualize probabilities by the area they take up in the square. Half the square: $50\%$, a quarter: $25\%$.

The Prior: Now there are two possibilities: Either your friend Lucy is in the restaurant or not. The former is a quantity we’ve already estimated, which is our joint belief that she is on the ship and in the middle of it ($72\%$).

Let’s make this our new prior, because it was our belief before we obtained the new information from the other passenger. We’ll call it $p(\mathrm{middle})=0.72$ and put it in our possibility square, by taking up $72\%$ of its width and the entire height:

What’s the remaining area? It’s the probability (or, more precisely, how likely we think it is) that Lucy is not in the middle of the ship which is $1-0.72=0.28$, i.e. $28\%$.

The Likelihood: You conveniently know that there are 100 people on the ship and overheard a conversation about another Lucy who’s on the ship. How likely is it that she’s at the back? You don’t know anything about her and you assume, for simplicity, that all 100 people are spread out equally around the ship. We already established the ship’s length to be 50 meters and let’s say the rear is 5 meters long. If the ship has approximately equal width everywhere, the rear makes up $10\%$ of the entire ship. Therefore, you would expect around $10\%$ of all people to be there. As there are 100 people on board, that’s conveniently 10 people, so the proportion of the area of the ship is the same as the probability of encountering any one specific person there. As there are two Lucys, you would have a $10\%$ chance to meet each of them, so $2\cdot0.1=0.2$ or $0.1+0.1=0.2$, i.e. roughly $20\%$, to meet at least one.
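Adding the two $10\%$ chances slightly overcounts the case where both Lucys happen to be at the rear. A minimal sketch, assuming the two are placed independently and uniformly along the ship, compares the simple sum with the chance of meeting at least one of them:

```python
# The rear is 5 m of a 50 m ship, so a given person is there with probability 0.1.
p_rear = 5 / 50
n_lucys = 2

# Simple estimate: add up the individual chances.
sum_estimate = n_lucys * p_rear

# Under the independence assumption: 1 minus the chance that neither Lucy is at the rear.
at_least_one = 1 - (1 - p_rear) ** n_lucys

print(round(sum_estimate, 2))   # 0.2
print(round(at_least_one, 2))   # 0.19
```

The two numbers are close, so $20\%$ is a reasonable working value for the likelihood.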