A random vector is a function X : Ω → ℝ^{p} where Ω is a probability space. This means that there
exists a σ-algebra ℱ of measurable sets and a probability measure P : ℱ → [0,1]. In practice,
people often don’t worry too much about the underlying probability space and instead pay more
attention to the distribution measure of the random variable. For E a suitable subset of ℝ^{p}, this
measure gives the probability that X has values in E. There are often excellent reasons for
believing that a random vector is normally distributed. This means that the probability that X
has values in a set E is given by

\[
P(X \in E) = \int_E \frac{1}{(2\pi)^{p/2}\det(\Sigma)^{1/2}} \exp\left(-\frac{1}{2}(x-m)^*\Sigma^{-1}(x-m)\right)\,dx.
\]
The expression in the integral is called the normal probability density function. There are two
parameters, m and Σ, where m is called the mean and Σ is called the covariance matrix. The
covariance matrix is symmetric, so all of its eigenvalues are real, and they are assumed to be
positive. While it may be reasonable
to assume this is the distribution, in general, you won’t know m and Σ and in order to
use this formula to predict anything, you would need to know these quantities. I am
following a nice discussion given in Wikipedia which makes use of the existence of square
roots.
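The density just described is easy to evaluate numerically. The following is a minimal sketch in NumPy (the function name `normal_pdf` and the example data are illustrative, not from the text); it includes the (2π)^{p∕2} factor that is dropped later in the likelihood:

```python
import numpy as np

def normal_pdf(x, m, Sigma):
    """Multivariate normal density at x with mean m and covariance Sigma."""
    p = len(m)
    diff = x - m
    # quadratic form (x - m)^* Sigma^{-1} (x - m), via a solve instead of an explicit inverse
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# Example: the standard bivariate normal at the origin has density 1/(2*pi)
m = np.zeros(2)
Sigma = np.eye(2)
print(normal_pdf(np.zeros(2), m, Sigma))  # ≈ 0.15915 (= 1/(2π))
```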

What people do to estimate these is to take n independent observations x_{1}, ⋯, x_{n} and try to
predict what m and Σ should be based on these observations. One criterion used for making
this determination is the method of maximum likelihood. In this method, you seek to
choose the two parameters in such a way as to maximize the likelihood which is given
as

\[
\prod_{i=1}^{n} \frac{1}{\det(\Sigma)^{1/2}} \exp\left(-\frac{1}{2}(x_i - m)^* \Sigma^{-1} (x_i - m)\right).
\]

For convenience the term (2π)^{p∕2} was ignored. Maximizing the above is equivalent to maximizing
the ln of the above. So taking ln,

\[
\frac{n}{2}\ln\left(\det\left(\Sigma^{-1}\right)\right) - \frac{1}{2}\sum_{i=1}^{n}(x_i - m)^* \Sigma^{-1}(x_i - m)
\]

Note that the above is a function of the entries of m. Take the partial derivative with respect
to m_{l} and set it equal to 0. Since the matrix Σ^{−1} is symmetric, this implies

\[
\sum_{i=1}^{n}\sum_{r}(x_{ir} - m_r)\,\Sigma^{-1}_{rl} = 0 \quad \text{for each } l.
\]

Written in terms of vectors,

\[
\sum_{i=1}^{n}(x_i - m)^* \Sigma^{-1} = 0
\]

and so, multiplying by Σ on the right and then taking adjoints, this yields

\[
\sum_{i=1}^{n}(x_i - m) = 0, \qquad nm = \sum_{i=1}^{n} x_i, \qquad m = \frac{1}{n}\sum_{i=1}^{n} x_i \equiv \bar{x}.
\]
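As a quick sanity check of this conclusion (a sketch using NumPy with synthetic data; Σ is held fixed at the identity for simplicity, and the variable names are illustrative), the log-likelihood, viewed as a function of m alone, is maximized exactly at the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # n = 100 synthetic observations in R^3
Sigma = np.eye(3)               # Sigma held fixed for this check
Sinv = np.linalg.inv(Sigma)
n = X.shape[0]

def log_likelihood(m):
    # (n/2) ln det(Sigma^{-1}) - (1/2) sum_i (x_i - m)^* Sigma^{-1} (x_i - m)
    diffs = X - m
    quad = np.einsum('ij,jk,ik->', diffs, Sinv, diffs)
    return (n / 2) * np.log(np.linalg.det(Sinv)) - 0.5 * quad

xbar = X.mean(axis=0)
# Any perturbation away from the sample mean strictly decreases the likelihood
for _ in range(5):
    assert log_likelihood(xbar + rng.normal(scale=0.1, size=3)) < log_likelihood(xbar)
```

Since the log-likelihood is a strictly concave quadratic in m for fixed Σ, the first-order condition above identifies the unique maximizer, which is what the perturbation test confirms.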

Now that m is determined, it remains to find the best estimate for Σ.

(x_{i} − m)^{∗}Σ^{−1}(x_{i} − m) is a
scalar, so since trace(AB) = trace(BA),

\[
(x_i - m)^* \Sigma^{-1}(x_i - m) = \operatorname{trace}\left((x_i - m)^* \Sigma^{-1}(x_i - m)\right)
= \operatorname{trace}\left((x_i - m)(x_i - m)^* \Sigma^{-1}\right)
\]

Therefore, the thing to maximize is

\[
n\ln\left(\det\left(\Sigma^{-1}\right)\right) - \sum_{i=1}^{n}\operatorname{trace}\left((x_i - m)(x_i - m)^* \Sigma^{-1}\right)
= n\ln\left(\det\left(\Sigma^{-1}\right)\right) - \operatorname{trace}\Bigl(\underbrace{\sum_{i=1}^{n}(x_i - m)(x_i - m)^*}_{S}\,\Sigma^{-1}\Bigr)
\]

We assume that S has rank p. Thus it is a self-adjoint matrix which has all positive eigenvalues,
and so it has a self-adjoint square root S^{1∕2}. Therefore, using trace(AB) = trace(BA) again, the
thing to maximize is

\[
n\ln\left(\det\left(\Sigma^{-1}\right)\right) - \operatorname{trace}\left(S^{1/2}\,\Sigma^{-1} S^{1/2}\right)
\]

Now let B = S^{1∕2}Σ^{−1}S^{1∕2}. Then B is positive and self-adjoint also, and so there exists U unitary
such that B = U^{∗}DU where D is the diagonal matrix having the positive scalars λ_{1}, ⋯, λ_{p} down
the main diagonal. Solving for Σ^{−1} in terms of B, this yields S^{−1∕2}BS^{−1∕2} = Σ^{−1} and
so

\[
\ln\left(\det\left(\Sigma^{-1}\right)\right) = \ln\left(\det\left(S^{-1/2}\right)\det(B)\det\left(S^{-1/2}\right)\right)
= \ln\left(\det\left(S^{-1}\right)\right) + \ln\left(\det(B)\right)
\]

which yields

\[
C(S) + n\ln\left(\det(B)\right) - \operatorname{trace}(B)
\]

as the thing to maximize. Of course this yields

\[
C(S) + n\ln\left(\prod_{i=1}^{p}\lambda_i\right) - \sum_{i=1}^{p}\lambda_i
= C(S) + n\sum_{i=1}^{p}\ln(\lambda_i) - \sum_{i=1}^{p}\lambda_i
\]

as the quantity to be maximized. To do this, take ∂∕∂λ_{k} and set it equal to 0; this gives n∕λ_{k} − 1 = 0, so λ_{k} = n.
Therefore, from the above, B = U^{∗}nIU = nI. Also from the above,

\[
B^{-1} = \frac{1}{n} I = S^{-1/2}\,\Sigma\, S^{-1/2}
\]

and so

\[
\Sigma = \frac{1}{n} S = \frac{1}{n}\sum_{i=1}^{n}(x_i - m)(x_i - m)^*
\]

This has shown that the maximum likelihood estimates are

\[
m = \bar{x} \equiv \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \Sigma = \frac{1}{n}\sum_{i=1}^{n}(x_i - m)(x_i - m)^*.
\]