Saturday, February 24, 2024

Mean dinosaurs

There is a phenomenon about dinosaurs that seems interesting to explore. If you look at a poster of dinosaurs you'll notice that none of them have wattles, or extreme features like peacocks, or trunks or frill-necks or fleshy crests. In fact none of them are old or have a broken arm, none are overweight or missing a limb etc. This is odd because you would expect to see these characteristics occasionally, but in a collage of 100 dinosaurs you don't see any.

What's going on is that each dinosaur is individually reconstructed to be the best estimate from the data. Moreover, each archetype is chosen to essentially be the average of all individuals of that species. So it has average age and average weight, and from our knowledge of that species it is unlikely to have any of the unusual features, so these are left out. Let's write mean(x) for this best-guess average of species x.

This is well and good for the individual species, but when we present a set of dinosaurs we can't just use the mean for each species. We unconsciously ignore the fact that: 

  mean({T-rex, triceratops, diplodocus, ..}) ≠ {mean(T-rex), mean(triceratops), mean(diplodocus), ..}

An average set of dinosaurs is not the same as the set of average dinosaurs. I'll show that this is related to averages over spaces with symmetries, so putting it in the same category as my previous 'mean xxx' posts.

Height

To explore this, let's start with a simpler case, a single species T-rex and a single characteristic, such as height. If we want to represent a single T-rex then using its mean height is the sensible choice. OK maybe we should use the modal height, or median height, or some other average, but for simplicity let's just use means. The same ideas apply, probably more so, for the other, less linear averages.

If however we want to show a set of two T-rex's, then both having mean height is not the average height distribution for that set. 

Here the mean height is 175cm (not a real T-rex! diagram from google) and the distribution of heights is approximately normal, with one standard deviation being 7 cm. 

It might be tempting to think that if you sample two T-rex's from the distribution, they will have on average heights 168 and 182 cm, i.e. plus and minus one standard deviation. This is not right.

You might also choose the two points shown in dark red, which are the means (centroids) of the top and bottom halves of the normal distribution. These are at roughly ±0.7979 standard deviations, which is sqrt(2/pi). The half-way area points of each half, the quartiles, would instead be at ±0.6745. This is also not right.

The way to solve this is to take the 2D distribution of the two T-rex heights, and note that the two T-rex's are interchangeable, so rather than a 2D Euclidean space, it is a topological space with reflection symmetry on the diagonal; it identifies point pairs (x,y)=(y,x). This is a valid space on which to find a mean 2D point:

Here we see the 2D normal distribution for the two dinosaur heights, and the mirror symmetry down the magenta diagonal. As a consequence of this symmetric space, the mean 2D point is shown in magenta. Its value is sqrt(0.5) times the 0.7979 value above, which is 0.564 standard deviations (exactly 1/sqrt(pi)).

If we had three T-rex's on our poster then the mean set of heights could be found in a similar way by taking Euclidean space R3 but with the order symmetries (x,y,z)=(y,x,z)=(x,z,y)=(y,z,x)=(z,x,y)=(z,y,x).

In general you can calculate the heights for n T-rex's by sampling an n-dimensional normal distribution, sorting the elements of each sample in order, then taking the average over all the ordered samples. Here are the sets for n = 1 up to 5, in standard deviations:

n   mean set
1   {0}
2   {-0.564, 0.564}
3   {-0.846, 0, 0.846}
4   {-1.03, -0.297, 0.297, 1.03}
5   {-1.16, -0.495, 0, 0.495, 1.16}

You would of course multiply these by the standard deviation and add the mean, in order to get the correct mean set of T-rex heights.
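
The sort-then-average recipe above can be checked numerically. This is a minimal sketch rather than a definitive implementation: the function name mean_set, the draw parameter, the trial count and the seed are all my own assumptions, and the result is a Monte Carlo estimate that only approaches the tabulated values as the number of trials grows.

```python
import random


def mean_set(n, draw=lambda rng: rng.gauss(0.0, 1.0), trials=100_000, seed=0):
    """Estimate the mean set of n interchangeable values.

    Each trial draws n values and sorts them, which picks one canonical
    ordering out of the n! equivalent ones. The sorted samples are then
    averaged position by position.
    """
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(trials):
        sample = sorted(draw(rng) for _ in range(n))
        for i, value in enumerate(sample):
            totals[i] += value
    return [t / trials for t in totals]


if __name__ == "__main__":
    # Mean sets in standard deviations for n = 1 to 5.
    for n in range(1, 6):
        print(n, [round(v, 2) for v in mean_set(n)])
    # A pair of heights in cm, using the 175 cm mean and 7 cm sigma above.
    print([round(175 + 7 * v, 1) for v in mean_set(2)])
```

Passing a different draw function, for example lambda rng: rng.expovariate(1.0), gives the mean set for a distribution that is not normal.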

Additionally, this approach is not just for normal distributions; it can apply to any distribution.

Height and width

What if we have two characteristics? For a single T-rex we simply use the mean height and mean width identified for that species. For a set of two T-rex's we have a 4D distribution function. The height and width are clearly interdependent characteristics, so this is not just the product of the two individual distributions. That is fine.

We then introduce the ordering symmetry (xh,xw, yh,yw) = (yh,yw, xh,xw), and find the mean 4D point, which is the weighted mean coordinate, weighted by the probability density at each coordinate.

At this point things get interesting. A mean on a non-Euclidean space is a different sort of beast. You can find it by randomly sampling the (4D) probability distribution and, for each sample, choosing whichever ordering is closest to the running average before adding it on. For the 2D case above this tends to a single result (or its reverse ordering). In 4D there is a remaining symmetry that gets broken by this process, so the mean 4D point changes each time you calculate it.

As a result, your mean value (representing a set of two T-rex height and widths) itself follows a distribution. 

When the two distributions are uncorrelated and equal, the mean set is uniformly distributed on a circle of radius 0.564 around the mean: one dinosaur has offset 0.564(sin a, cos a) and the other has offset 0.564(-sin a, -cos a). The angle a results from spontaneous symmetry breaking, and comes out different each time the mean set is calculated.

If the two distributions are correlated, then the mean set distribution will be non-circular, and in fact elliptical if the distributions are a simple multivariate Gaussian. For highly correlated characteristics this random angle a will tend to give an offset in the long axis, giving consistent results each time it is calculated, in particular:

If height and width are positively correlated (as they usually are) then the distribution of means clusters around a set which is one short, narrow T-rex and one tall, wide T-rex.

If height and width are negatively correlated, then the distribution of means clusters around one short, wide T-rex and one tall, narrow T-rex.
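
The running-average procedure for a pair with two characteristics can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the names draw_trex and mean_pair, the correlation parameter rho, the trial count and the use of unit-sigma Gaussians are all my own choices. The first sample fixes an arbitrary starting orientation, and every later sample is added in whichever of its two orderings is closer to the running mean.

```python
import math
import random


def draw_trex(rng, rho):
    """One (height, width) offset in sigmas, with correlation rho."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2)


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_pair(rho=0.0, trials=40_000, seed=None):
    """Running mean of an unordered pair of (height, width) points."""
    rng = random.Random(seed)
    mean = None
    for k in range(1, trials + 1):
        x, y = draw_trex(rng, rho), draw_trex(rng, rho)
        if mean is None:
            # The first sample seeds the running mean.
            mean = [list(x), list(y)]
            continue
        # Pick whichever of the two orderings is closer to the running mean.
        keep = sq_dist(x, mean[0]) + sq_dist(y, mean[1])
        swap = sq_dist(y, mean[0]) + sq_dist(x, mean[1])
        pair = (x, y) if keep <= swap else (y, x)
        # Incremental update: mean += (sample - mean) / k
        for point, target in zip(mean, pair):
            for i in range(2):
                point[i] += (target[i] - point[i]) / k
    return mean


if __name__ == "__main__":
    for seed in range(3):
        a, b = mean_pair(rho=0.0, seed=seed)
        print("rho=0.0", [round(v, 2) for v in a], [round(v, 2) for v in b])
    a, b = mean_pair(rho=0.9, seed=0)
    print("rho=0.9", [round(v, 2) for v in a], [round(v, 2) for v in b])
```

With rho=0.0 the two offsets should come out roughly equal and opposite at a different angle for each seed, while a strongly positive rho should pull both coordinates of each offset to the same sign.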

If there are three T-rex's in the set then the mean set is still a distribution over the (width,height) direction angles. But unlike the single-characteristic case, the three (width,height) vectors are not in a line. For equal unit sigma Gaussians, a calculated mean set is {(0.33,0.707), (-0.78,-0.06), (0.44,-0.647)}. This distribution of the variation onto the 3 dinosaurs is rotated each time it is calculated, with the sum of the squares of the values always being 1.84. These three points, as you might have guessed, form an equilateral triangle.

Above: width,height values for set of 3 T-rex's, with the standard deviation for the covariance in red.
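
For three or more T-rex's the same running average needs a search over all n! orderings. Below is a minimal sketch, assuming uncorrelated unit-sigma Gaussians; the function name mean_set_2d, the trial count and the seed parameter are hypothetical. It is a stochastic estimate, so the sum of squares only lands near the quoted 1.84, and the orientation differs from run to run.

```python
import itertools
import random


def mean_set_2d(n=3, trials=30_000, seed=None):
    """Running mean of n unordered (width, height) points, unit sigmas."""
    rng = random.Random(seed)
    orderings = list(itertools.permutations(range(n)))
    # The first sample seeds the running mean and its orientation.
    mean = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]
    for k in range(2, trials + 1):
        pts = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
        # The closest ordering maximises the dot product with the mean,
        # because the squared lengths are the same for every ordering.
        best = max(
            orderings,
            key=lambda p: sum(
                pts[p[i]][0] * mean[i][0] + pts[p[i]][1] * mean[i][1]
                for i in range(n)
            ),
        )
        # Incremental update of each of the n mean points.
        for i in range(n):
            for c in range(2):
                mean[i][c] += (pts[best[i]][c] - mean[i][c]) / k
    return mean


if __name__ == "__main__":
    result = mean_set_2d(3, seed=0)
    print([tuple(round(v, 2) for v in point) for point in result])
    print(round(sum(x * x + y * y for x, y in result), 2))
```

The brute-force search over permutations is only practical for small n; a larger set would need an assignment solver instead.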

In general, the mean set of m characteristics for n T-rex's distributes the n points over the m-dimensional ellipsoid of the distribution (or other shape), up to a rotation in m-dimensional space.

Species

Instead of characteristics, what if we have a known list of species with probabilities of occurring, and we wish to represent that as well as possible with a set of just two dinosaurs?

For independent classes like this, we have to use the mode rather than the mean. This is not such a big change: we already have integer means that round to the nearest integer, and integer weighted means. The mode is just a weighted mean like this, but for the n classes placed at the corners of an (n-1)-simplex.

Imagine the probability of spotting a T-rex is 50% and a triceratops is 50%, then to find the modal set {a,b} you have four possibilities: {T-rex, T-rex}, {T-rex, Triceratops}, {Triceratops, T-rex} and {Triceratops, Triceratops}, each with 1/4 likelihood.

But the middle two are equivalent in a set, so the {T-rex, Triceratops} set has twice the likelihood of the other two. 

This means that the modal set is {T-rex, Triceratops}. 

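Even if the percentages are 66% and 34%, the modal set still contains one of each, since 2 x 0.66 x 0.34 = 0.449 beats 0.66 x 0.66 = 0.436. The crossover is at two thirds: once the T-rex probability passes 66.7%, the modal set becomes just two T-rex's.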

For sets of three from two dinosaurs you have the options {a,a,a},{a,a,b},{a,b,a},{a,b,b},{b,a,a},{b,a,b},{b,b,a},{b,b,b}. But with the order symmetry there are only: {a,a,a},3{a,a,b},3{a,b,b},{b,b,b} with the relative likelihoods expressed with those coefficients. As a result, even if the Triceratops has a probability of only 25.1%, it will now appear in the set of 3 dinosaurs.

Above: relative proportion of T-rex (b) with the mean set below it.

This generalises to n set elements using the Pascal's triangle pattern. The result is that the sets distribute evenly over the possible relative proportions of the two dinosaurs.
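
The modal sets quoted above can be reproduced by brute force. This is a minimal sketch: the function names set_probability and modal_set are my own, and the probabilities in the usage lines are the illustrative ones from the text. Each unordered set is weighted by the number of orderings that collapse onto it, which for two species is the Pascal's triangle coefficient.

```python
import itertools
import math
from collections import Counter


def set_probability(members, probs):
    """Probability of drawing this unordered set of species."""
    counts = Counter(members)
    # Number of distinct orderings of the set (multinomial coefficient).
    orderings = math.factorial(len(members))
    for c in counts.values():
        orderings //= math.factorial(c)
    return orderings * math.prod(probs[s] ** c for s, c in counts.items())


def modal_set(probs, size):
    """Most likely unordered set of `size` dinosaurs."""
    candidates = itertools.combinations_with_replacement(sorted(probs), size)
    return max(candidates, key=lambda m: set_probability(m, probs))


if __name__ == "__main__":
    print(modal_set({"T-rex": 0.5, "Triceratops": 0.5}, 2))
    print(modal_set({"T-rex": 0.66, "Triceratops": 0.34}, 2))
    print(modal_set({"T-rex": 0.749, "Triceratops": 0.251}, 3))
```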

If there is a third dinosaur involved, and the mean set has three elements, then they have the following weightings: 1{a,a,a},3{a,a,b},3{a,b,b},1{b,b,b},3{a,a,c},3{a,c,c},1{c,c,c},3{b,b,c},3{b,c,c},6{a,b,c}. The result of this weighting is that the sets distribute evenly over the space of relative proportions of the three dinosaurs, which is a triangular space.

The mean set is therefore a proportional representation of the dinosaur weights. I fully expect the generalisation to n dinosaurs and m members of the set to be a proportional representation too.
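
The weightings for three species can be checked by counting orderings. A minimal sketch, with multiset_weights as a hypothetical helper name: there should be ten distinct unordered sets, and their weights should add up to the 3 x 3 x 3 = 27 ordered triples.

```python
import itertools
import math
from collections import Counter


def multiset_weights(species, size):
    """Number of orderings that collapse onto each unordered set."""
    weights = {}
    for members in itertools.combinations_with_replacement(species, size):
        # Multinomial coefficient: size! divided by each repeat count!.
        orderings = math.factorial(size)
        for c in Counter(members).values():
            orderings //= math.factorial(c)
        weights[members] = orderings
    return weights


if __name__ == "__main__":
    for members, weight in multiset_weights("abc", 3).items():
        print(weight, "{" + ",".join(members) + "}")
```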

Species with characteristics

What if we have a T-rex with a height distribution and a Triceratops with a different height distribution function? This becomes a hybrid of the discrete and continuous methods. 

First we look at the total probability (the area under the probability density function) for the two dinosaurs. These are our probabilities that will decide whether the mean set will be {a,a}, {a,b} or {b,b}. 

Then, let's say it is {a,b} and heights x,y, we generate the 2D probability distribution function P(x,y) = Pa(x)Pb(y) + Pb(x)Pa(y)   (the sum here due to the {a,b}={b,a} symmetry).

The centroid of this combined probability distribution gives you the height of the dinosaur a (x) and dinosaur b (y) to give the mean set {(a,x), (b,y)}. However, this mean set is itself a distribution, with some percentage also being {(b,x),(a,y)} depending on the relative total probabilities in the sum above.

If the T-rex is usually tall and the Triceratops usually short, then this will give an essentially singular mean set {(T-rex, tall), (Triceratops, short)} as expected. If they have the same distribution then there are two equally possible mean sets {(T-rex,mean+k),(Triceratops,mean-k)} or {(T-rex,mean-k),(Triceratops,mean+k)}.

General characteristics

If dinosaurs' characteristics are highly correlated to their species, then your representative set mean({a,b,c}) really is close to {mean(a), mean(b), mean(c)}. For example, if the set was a microraptor, a T-rex and a brachiosaurus, and the characteristics were height and neck length, then the set {mean(microraptor), mean(T-rex), mean(brachiosaurus)} is representative.

If however the characteristics are more uncorrelated with the species and with each other, then the mean set itself has a distribution of values, and your best strategy is to sample one at random. This is the case for things like 'contains a wattle', 'is overweight', 'is old', 'has a broken arm'. They all could happen to any species, and are uncorrelated with each other. 

So in this case your pair of dinosaurs should sample at a calculated distance k (a Mahalanobis distance) from the mean, producing a pair that is equal and opposite about the mean in this random direction.

Generalising to n dinosaurs in the set, the n dinosaurs sample n representative points, measured in sigmas (Mahalanobis space) from the mean, in a random direction.
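
For a pair, the sampling step can be sketched like this. Everything here is an assumption for illustration: the function names, the example means and sigmas, and the simplification that the characteristics are uncorrelated, so that the Mahalanobis distance reduces to a per-characteristic scaling by sigma. The value k=0.564 is the pair offset found earlier.

```python
import math
import random


def representative_pair(means, sigmas, k=0.564, rng=random):
    """Two opposite points, k sigmas from the mean in a random direction."""
    # A normalised Gaussian vector is uniformly distributed over directions.
    raw = [rng.gauss(0.0, 1.0) for _ in means]
    norm = math.sqrt(sum(v * v for v in raw))
    unit = [v / norm for v in raw]
    first = [m + k * s * u for m, s, u in zip(means, sigmas, unit)]
    second = [m - k * s * u for m, s, u in zip(means, sigmas, unit)]
    return first, second


def mahalanobis(point, means, sigmas):
    """Distance from the mean in sigmas, for uncorrelated characteristics."""
    return math.sqrt(
        sum(((p - m) / s) ** 2 for p, m, s in zip(point, means, sigmas))
    )


if __name__ == "__main__":
    # Hypothetical characteristics: height (cm), age (years), weight (kg).
    means, sigmas = [175.0, 40.0, 6000.0], [7.0, 5.0, 800.0]
    a, b = representative_pair(means, sigmas)
    print([round(v, 1) for v in a], [round(v, 1) for v in b])
```

Correlated characteristics would need the full covariance matrix in place of the list of sigmas.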

This means that sets of dinosaurs recover the variety of characteristics that are not evident in the {mean(a), mean(b), mean(c), ...} set.

In Short

Anyone who depicts a representative set of dinosaurs by using individual mean dinosaurs is only correct when the characteristics are strongly correlated to the species and with each other.

For more uncorrelated characteristics, larger sets are increasingly poor representations, and lack the diversity that they should show. 

We often see posters of dinosaurs where they are all the same age, all uninjured, none with unexpected fleshy features, etc etc. This all comes down to missing the key fact that:

  mean({a,b,c,...}) ≠ {mean(a), mean(b), mean(c), ...}
