Proximity Exists in N-Dimensional Space

This is a trivial math point, but I’ve found it a useful tool for analysis in other areas.

Most high school students are familiar with the Pythagorean theorem, which (by way of a right triangle) lets you calculate the distance between two points in two-dimensional space. Given point A with coordinates (x1, y1) and point B with coordinates (x2, y2), the distance between the points can be expressed as:

$$ d_{A,B} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} $$

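The formula for distance in higher dimensions can be derived easily. Each dimension is orthogonal to the others, so to add a third coordinate z we can apply Pythagoras again, treating the two-dimensional distance as one leg of a right triangle and the difference in z as the other: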

$$ \begin{align} d_{A,B} & = \sqrt{ \begin{aligned} &\left(\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}\right)^2 \\ & + (z_1 - z_2)^2 \end{aligned} } \\ & = \sqrt{ \begin{aligned} &(x_1 - x_2)^2 \\ + &(y_1 - y_2)^2 \\ + &(z_1 - z_2)^2 \end{aligned} } \end{align} $$

This generalizes to an arbitrary number of dimensions n (though we lose our spatial intuition), with a second subscript now indexing each point's coordinates from 1 to n:

$$ \begin{align} d_{A,B} & = \sqrt{ \begin{aligned} &(x_{A,1} - x_{B,1})^2 \\ + &(x_{A,2} - x_{B,2})^2 \\ + &\dots \\ + &(x_{A,n} - x_{B,n})^2 \end{aligned} } \\ & = \sqrt{\sum_{i=1}^{n} (x_{A,i} - x_{B,i})^2} \end{align} $$
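The summation form translates almost directly into code. Here is a minimal Python sketch, assuming each point is a plain sequence of numbers; the function name `euclidean_distance` is my own choice for illustration.

```python
import math


def euclidean_distance(a, b):
    """Distance between two points given as equal-length coordinate sequences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared per-dimension differences, then take the square root.
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))


# Two dimensions: the familiar 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))
# Three dimensions: squared differences are 9 + 16 + 144 = 169.
print(euclidean_distance((1, 2, 3), (4, 6, 15)))
```

The same function works for any number of dimensions because nothing in it depends on n. In Python 3.8 and later, the standard library's `math.dist` computes the same quantity.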

One application of this is a simple form of image classification. Imagine a set of images, all 800x800 pixels. Let’s say each pixel is eight bits, so each pixel takes one of 256 possible color values. By serializing the pixel values from top-left to bottom-right, we can represent each image as a single point in 640,000-dimensional space (800 × 800 pixels, one coordinate per pixel)! We can then identify similar images by their proximity to each other: images with similar structure will ‘cluster’ in this high-dimensional space.

If half of the photos are school portraits and the other half are landscapes, we can identify the portraits simply by distance (either from a single reference image or via a clustering algorithm). In this high-dimensional space, images of different faces with different color backgrounds are much closer to each other than an image of a face is to an image of a mountain range.
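To make the reference-image idea concrete, here is a toy Python sketch. Everything in it is made up for illustration: the 3x3 grayscale grids stand in for 800x800 photos, the pixel values are invented, and the names `flatten` and `nearest_label` are hypothetical helpers rather than part of any library.

```python
import math


def flatten(image):
    """Serialize a grid of pixel values, row by row, into one flat vector."""
    return [pixel for row in image for pixel in row]


def nearest_label(image, references):
    """Return the label of the reference image closest to `image`."""
    vector = flatten(image)
    return min(
        references,
        key=lambda label: math.dist(vector, flatten(references[label])),
    )


# Invented 3x3 8-bit grayscale "images" (values 0-255), one reference per class.
references = {
    # Bright background with a dark vertical subject in the middle.
    "portrait": [[200, 50, 200], [200, 50, 200], [200, 50, 200]],
    # Horizontal bands: bright sky, mid-tone hills, dark ground.
    "landscape": [[220, 220, 220], [120, 120, 120], [30, 30, 30]],
}

# A noisy image with the same vertical structure as the portrait reference.
unknown = [[190, 60, 210], [205, 40, 195], [180, 70, 200]]

print(nearest_label(unknown, references))
```

Here the unknown image lands much closer to the portrait reference than to the landscape one, so it gets labeled as a portrait. Real photos would need far more care (alignment, lighting, scale), so treat this as a picture of the geometry rather than a working classifier.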

The generally applicable concept is not the Euclidean geometry, but the idea of items clustering in high dimensions, where the variance within clusters is much smaller than the variance between clusters. I’ve found this idea often applies to sets of abstract things like companies or political orientations. Once you are aware of the nature of high-dimensional clustering, it is easier to recognize within-group differences as superficial rather than structural.