Welcome to this little journey in which we discover two fundamental concepts in the realm of (machine) learning: context and attention. The story is told in two acts: 1. Why and 2. How. This article covers the Why and the legacy part of the How; the next one takes a closer look at a modern approach.

Why? Conceiving context

It all begins with a little entity that makes up most of the (digital) world around you. It goes by many names: some call it word, pixel or point, but we will simply call it element. Our little element is secretive, revealing almost nothing about itself in isolation. In that regard, it is like its sibling in the real world, the atom. Both are atomic¹. It has emergent properties, though: throw a couple of thousand of them together and you get a story, an image, a 3D model. What has changed? The context.

Context: The circumstances that form the setting for an event, statement, or idea, and in terms of which it can be fully understood.

Let’s look at a couple of examples. The simplest (and therefore the one we will see most frequently throughout the article) is the word. Try to guess the meaning of the word below, then hover over it with your mouse (or tap on it) to reveal the context:

Did you guess the meaning correctly? Or was it the financial institution or the place to sit? The point is, of course, that you couldn’t have known without the context of the entire sentence, as many words are ambiguous. It doesn’t stop there, though. Even the sentence is ambiguous if your goal is to determine the title of the book it comes from or the author who wrote it. To do so, you might need a paragraph, a page or even an entire chapter of context. In machine learning lingo, a dependence on such distant context is commonly called a long-range dependency. Here is another one. Pay attention to the meaning of the word it:

Seeing tired, we know that it must refer to the animal, as roads are seldom tired. For wide it is the opposite: animals are seldom described that way, so it must refer to the road².

Below, there are two more examples of increasing dimensionality (use the little arrows to switch between them). While a sentence can be interpreted as a one-dimensional sequence of word-elements, an image is a two-dimensional grid of picture-elements (pixels) and a 3D model can be represented by a cloud of point-elements³ (or by volumetric-elements: voxels). You will notice that you can’t discern what is represented in the close-up view of the individual elements, but when zooming out (using the “Zoom out” buttons and your mouse wheel or fingers) the interpretation becomes trivial.
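
For readers who prefer code to pictures, here is a minimal sketch of the same idea in plain Python. Everything in it is an assumption made for illustration: the toy sentence, the 3×3 image, the three points and the helper names `window_1d`, `window_2d` and `nearest_points` do not come from the demos above. It only shows how "zooming out" around one element could look for a sequence, a grid and a point cloud.

```python
# Hypothetical toy data, invented for this sketch.
sentence = ["the", "cat", "sat", "on", "the", "mat"]   # 1D sequence of word-elements
image = [[0, 0, 255],
         [0, 255, 0],
         [255, 0, 0]]                                   # 2D grid of pixel intensities
point_cloud = [(0.0, 0.0, 0.0),
               (1.0, 0.0, 0.0),
               (0.0, 1.0, 0.5)]                         # unordered (x, y, z) points


def window_1d(seq, i, radius=1):
    """Return element i plus up to `radius` neighbours on each side."""
    lo = max(0, i - radius)
    return seq[lo:i + radius + 1]


def window_2d(grid, row, col, radius=1):
    """Return the square patch of the grid centred on (row, col), clipped at the borders."""
    rows = grid[max(0, row - radius):row + radius + 1]
    return [r[max(0, col - radius):col + radius + 1] for r in rows]


def nearest_points(points, query, k=2):
    """Return the k points closest to `query` (squared Euclidean distance)."""
    def dist2(p):
        return sum((a - b) ** 2 for a, b in zip(p, query))
    return sorted(points, key=dist2)[:k]


# A single element says little; its neighbourhood is the context.
print(sentence[2], "->", window_1d(sentence, 2))
print(image[1][1], "->", window_2d(image, 1, 1))
print(point_cloud[0], "->", nearest_points(point_cloud, point_cloud[0]))
```

Note the design difference between the three cases: in a sequence or grid, neighbours are found by index, whereas a point cloud has no inherent order, so the sketch falls back to a distance-based lookup.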