In general you can calculate the heights for n T-rex's by sampling an n-dimensional normal distribution, sorting the elements of each sample in order, then averaging each ordered position over all samples. Here are the sets for n = 1 up to 5, in standard deviations:
You would of course multiply these by the standard deviation and add the mean, in order to get the correct mean set of T-rex heights.
Additionally, this approach is not just for normal distributions; it can apply to any distribution.
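The sample, sort and average recipe above can be sketched in a few lines. This is only a rough Monte Carlo illustration: the function name `mean_ordered_set`, the sample count and the seed are my own choices, and the printed values are estimates that wobble slightly with the seed.

```python
import random

def mean_ordered_set(n, samples=50_000, seed=0):
    """Average the sorted values of n standard-normal draws (Monte Carlo)."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(samples):
        # One sample of the n-dimensional distribution, put in order.
        draw = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        for i, value in enumerate(draw):
            totals[i] += value
    # Average each ordered position over all samples.
    return [t / samples for t in totals]

for n in range(1, 6):
    print(n, [round(v, 2) for v in mean_ordered_set(n)])
```

The results are in standard deviations, so they still need scaling by the standard deviation and shifting by the mean. Swapping `rng.gauss` for another sampler is how you would try a non-normal distribution.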
Height and width
What if we have two characteristics? For a single T-rex we simply use the mean height and mean width identified for that species. For a set of two T-rex's we have a 4D distribution function. The height and width are clearly interdependent characteristics, so it is not just the product of the two individual distributions. That is fine.
We then introduce the ordering symmetry (xh,xw, yh,yw) = (yh,yw, xh,xw), and find the mean 4D point, which is the weighted mean coordinate, weighted by the probability density at each coordinate.
At this point things get interesting. A mean on a non-Euclidean space is a different sort of beast. You can find it by randomly sampling the (4D) probability distribution, then choosing the closest ordering to the running average before adding it on. For the 2D case above this tends to a single result (or its reverse ordering), but in 4D there is a remaining symmetry that gets broken by this process. The result is a mean 4D point that changes each time you find the mean value.
As a result, your mean value (representing a set of two T-rex heights and widths) itself follows a distribution.
When the two distributions are uncorrelated and equal, the mean set is a uniform circular distribution around the mean, of radius 0.564: one dinosaur has offset 0.564(sin a, cos a) and the other has offset 0.564(-sin a, -cos a). The angle a results from spontaneous symmetry breaking, so it differs each time the mean set is calculated.
If the two distributions are correlated, then the mean set distribution will be non-circular, and in fact elliptical if the distributions are a simple multivariate Gaussian. For highly correlated characteristics this random angle a will tend to give an offset in the long axis, giving consistent results each time it is calculated, in particular:
If height and width are positively correlated (as they usually are) then the distribution of means clusters around a set which is one short, narrow T-rex and one tall, wide T-rex.
If height and width are negatively correlated, then the distribution of means clusters around one short, wide T-rex and one tall, narrow T-rex.
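Here is a minimal sketch of the running-average procedure for a pair, assuming both characteristics are unit-variance Gaussians measured in sigmas with a single correlation `rho`. The function name `mean_pair` and its parameters are hypothetical, chosen just for this illustration.

```python
import math
import random

def mean_pair(rho=0.0, samples=20_000, seed=1):
    """Running-average mean set for two T-rexes, two characteristics each.

    rho is the assumed correlation between the two characteristics.
    Returns the two mean points, in sigmas from the species mean.
    """
    rng = random.Random(seed)

    def draw():
        # Correlated unit-variance pair built from two independent normals.
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        return (z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2)

    sum_a, sum_b = [0.0, 0.0], [0.0, 0.0]
    for _ in range(samples):
        p, q = draw(), draw()
        # The swapped ordering is closer to the running average exactly when
        # (p - q) points against (sum_a - sum_b), so test that dot product.
        dot = sum((p[i] - q[i]) * (sum_a[i] - sum_b[i]) for i in range(2))
        if dot < 0.0:
            p, q = q, p
        for i in range(2):
            sum_a[i] += p[i]
            sum_b[i] += q[i]
    mean_a = tuple(s / samples for s in sum_a)
    mean_b = tuple(s / samples for s in sum_b)
    return mean_a, mean_b

a, b = mean_pair(rho=0.0)
radius = math.hypot((a[0] - b[0]) / 2, (a[1] - b[1]) / 2)
print(a, b, radius)
```

With `rho=0.0` the radius should land near 0.564 while the direction depends on the seed. With a strong positive `rho` the two offsets should line up along the long axis, so both characteristics move together.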
If there are three T-rex's in the set then the mean set is still a distribution over the (width,height) direction angles. But unlike the single-characteristic case, the three (width,height) vectors are not in a line. For equal unit sigma Gaussians, a calculated mean set is {(0.33,0.707), (-0.78,-0.06), (0.44,-0.647)}. This distribution of the variation onto the 3 dinosaurs is rotated each time it is calculated, with the sum of the squares of the values always being 1.84. These three points, as you might have guessed, form an equilateral triangle.
Above: width,height values for a set of 3 T-rex's, with the standard deviation for the covariance in red.
In general, the mean set of m characteristics for n T-rex's distributes the n points over the m-dimensional ellipsoid of the distribution (or other shape), up to a rotation in m-dimensional space.
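The same running-average idea extends to n dinosaurs by trying every ordering of each sample. The sketch below assumes independent unit-variance Gaussians for all m characteristics. The name `mean_set` is hypothetical, and brute-forcing all n! orderings is only practical for small sets.

```python
import itertools
import random

def mean_set(n=3, m=2, samples=10_000, seed=2):
    """Running-average mean set of n points, each with m unit Gaussians."""
    rng = random.Random(seed)
    sums = [[0.0] * m for _ in range(n)]
    for _ in range(samples):
        pts = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
        # The ordering closest to the running average is the one with the
        # largest total dot product against the running sums.
        best = max(
            itertools.permutations(range(n)),
            key=lambda perm: sum(
                pts[perm[j]][i] * sums[j][i] for j in range(n) for i in range(m)
            ),
        )
        for j in range(n):
            for i in range(m):
                sums[j][i] += pts[best[j]][i]
    return [[s / samples for s in row] for row in sums]

points = mean_set()
spread = sum(v * v for p in points for v in p)
print(points, spread)
```

For three dinosaurs and two characteristics, `spread` is the sum of squares mentioned above, so it should come out in the region of 1.84, with the triangle's orientation changing with the seed.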
Species
Instead of characteristics, what if we have a known list of species with probabilities of occurring, and we wish to represent that as well as possible with a set of just two dinosaurs?
For independent classes like this, we have to use the mode rather than the mean value. This is not such a big change; we already have integer means that round to the nearest integer, and integer weighted means. The mode is just a weighted mean like this, but for the n classes placed at the corners of an n-simplex.
Imagine the probability of spotting a T-rex is 50% and a triceratops is 50%, then to find the modal set {a,b} you have four possibilities: {T-rex, T-rex}, {T-rex, Triceratops}, {Triceratops, T-rex} and {Triceratops, Triceratops}, each with 1/4 likelihood.
But the middle two are equivalent in a set, so the {T-rex, Triceratops} set has twice the likelihood of the other two.
This means that the modal set is {T-rex, Triceratops}.
Even if the percentages are 66% and 34%, the modal set still contains one of each. Once the Triceratops falls below one third, the modal set becomes just two T-rex's.
For sets of three from two dinosaurs you have the options {a,a,a},{a,a,b},{a,b,a},{a,b,b},{b,a,a},{b,a,b},{b,b,a},{b,b,b}. But with the order symmetry there are only: {a,a,a},3{a,a,b},3{a,b,b},{b,b,b} with the relative likelihoods expressed with those coefficients. As a result, even if the Triceratops has a probability of only 25.1%, it will appear in the modal set of 3 dinosaurs.
Above: relative proportion of T-rex (b) with the mean set below it.
This generalises to n set elements using the Pascal's triangle pattern. The result is that the sets distribute evenly over the possible relative proportions of the two dinosaurs.
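To make the modal set concrete, here is a small brute-force sketch. The function name `modal_set` is my own, and it assumes each set member is an independent sighting with the given probabilities.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

def modal_set(probabilities, size):
    """Return the most likely unordered set of `size` independent sightings."""
    def likelihood(members):
        counts = Counter(members)
        # Number of orderings of this set (the Pascal's triangle coefficient).
        orderings = math.factorial(size)
        for c in counts.values():
            orderings //= math.factorial(c)
        return orderings * math.prod(probabilities[s] for s in members)

    return max(
        combinations_with_replacement(sorted(probabilities), size),
        key=likelihood,
    )

print(modal_set({"T-rex": 0.5, "Triceratops": 0.5}, 2))
print(modal_set({"T-rex": 0.66, "Triceratops": 0.34}, 2))
print(modal_set({"T-rex": 0.749, "Triceratops": 0.251}, 3))
```

All three calls should return a set containing a Triceratops, matching the thresholds discussed above. Enumerating every set is fine for a handful of species but grows quickly with the set size.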
If there is a third dinosaur involved, and the mean set has three elements, then they have the following weightings: 1{a,a,a},3{a,a,b},3{a,b,b},1{b,b,b},3{a,a,c},3{a,c,c},3{b,b,c},3{b,c,c},1{c,c,c},6{a,b,c}. The result of this weighting is that the sets distribute evenly over the space of relative proportions of the three dinosaurs, which is a triangular space:
The mean set is therefore a proportional representation of the dinosaur weights. I fully expect the generalisation to n dinosaurs and m members of the set to be a proportional representation.
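The three-species weightings can be checked by counting orderings directly. This is just a brute-force tally, with the species labelled a, b and c as in the text.

```python
from collections import Counter
from itertools import product

# Every ordered triple of species a, b, c, collapsed to an unordered set.
weights = Counter(tuple(sorted(triple)) for triple in product("abc", repeat=3))

for members, weight in sorted(weights.items()):
    print(weight, "{" + ",".join(members) + "}")
```

There are ten distinct sets, and the weights add up to the 27 ordered triples. Multiplying each weight by the product of its members' probabilities gives the likelihood of that set.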
Species with characteristics
What if we have a T-rex with a height distribution and a Triceratops with a different height distribution function? This becomes a hybrid of the discrete and continuous methods.
First we look at the total probability (the area under the probability density function) for the two dinosaurs. These are our probabilities that will decide whether the mean set will be {a,a}, {a,b} or {b,b}.
Then, supposing it is {a,b} with heights x and y, we generate the 2D probability distribution function P(x,y) = Pa(x)Pb(y) + Pb(x)Pa(y) (the sum here is due to the {a,b}={b,a} symmetry).
The centroid of this combined probability distribution gives you the height of the dinosaur a (x) and dinosaur b (y) to give the mean set {(a,x), (b,y)}. However, this mean set is itself a distribution, with some percentage also being {(b,x),(a,y)} depending on the relative total probabilities in the sum above.
If the T-rex is usually tall and the Triceratops usually short, then this will give an essentially singular mean set {(T-rex, tall), (Triceratops, short)} as expected. If they have the same distribution then there are two equally possible mean sets {(T-rex,mean+k),(Triceratops,mean-k)} or {(T-rex,mean-k),(Triceratops,mean+k)}.
General characteristics
If dinosaurs' characteristics are highly correlated to their species, then your representative set mean{a,b,c} really is close to {mean(a), mean(b), mean(c)}. For example, if it was a microraptor, T-rex and brachiosaurus and the characteristics were height and neck length, then the set {mean(microraptor), mean(T-rex), mean(brachiosaurus)} is representative.
If however the characteristics are more uncorrelated with the species and with each other, then the mean set itself has a distribution of values, and your best strategy is to sample one at random. This is the case for things like 'contains a wattle', 'is overweight', 'is old', 'has a broken arm'. They all could happen to any species, and are uncorrelated with each other.
So in this case your pair of dinosaurs should sample at a calculated k (Mahalanobis distance) from the mean, and produce the pair that are equal and opposite from the mean in this random direction.
Generalising to n dinosaurs in the set, the n dinosaurs sample n representative points in terms of sigmas (Mahalanobis space) from the mean, in a random direction.
This means that sets of dinosaurs recover the variety of characteristics that are not evident in the {mean(a), mean(b), mean(c), ...} set.
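As a rough sketch of the pair case, the code below picks a random direction in Mahalanobis space and places two points at equal and opposite offsets. It assumes two continuous characteristics with a known 2x2 covariance and uses k = 0.564 (the pair radius from earlier) as a default. The function names and the example numbers are made up for illustration.

```python
import math
import random

def representative_pair(mean, cov, k=0.564, rng=random):
    """Two points at +/- k Mahalanobis units from `mean`, random direction.

    mean is a pair of characteristic means; cov is their 2x2 covariance.
    k = 0.564 is an assumed default taken from the pair radius above.
    """
    # 2x2 Cholesky factor L, so that L times its transpose equals cov.
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    # Random unit direction in Mahalanobis (whitened) space.
    angle = rng.uniform(0.0, 2.0 * math.pi)
    u = (math.cos(angle), math.sin(angle))
    # Map the direction back into real units.
    offset = (k * l11 * u[0], k * (l21 * u[0] + l22 * u[1]))
    first = (mean[0] + offset[0], mean[1] + offset[1])
    second = (mean[0] - offset[0], mean[1] - offset[1])
    return first, second

def mahalanobis(point, mean, cov):
    """Mahalanobis distance of a 2D point from the mean."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    # Quadratic form with the closed-form inverse of a symmetric 2x2 matrix.
    q = (cov[1][1] * dx * dx - 2.0 * cov[0][1] * dx * dy
         + cov[0][0] * dy * dy) / det
    return math.sqrt(q)

example_mean = (4.0, 1.5)                      # hypothetical values
example_cov = [[0.25, 0.05], [0.05, 0.04]]     # hypothetical covariance
pair = representative_pair(example_mean, example_cov, rng=random.Random(3))
print(pair, [mahalanobis(p, example_mean, example_cov) for p in pair])
```

Both points sit exactly k sigmas from the mean by construction, and their midpoint is the mean itself. For n dinosaurs you would replace the two opposite offsets with n representative points rotated by the same random direction.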
In Short
Anyone who depicts a representative set of dinosaurs by using individual mean dinosaurs is only correct when their characteristics are strongly correlated to the species and with each other.
For less correlated characteristics, larger sets of individual means are increasingly poor representations, and lack the diversity that they should show.
We often see posters of dinosaurs where they are all the same age, all uninjured, none with unexpected fleshy features, etc. This all comes down to missing the key fact that:
mean({a,b,c,...}) ≠ {mean(a), mean(b), mean(c), ...}