356 CHAPTER 13. MATRICES AND THE INNER PRODUCT
The expression in the integral is called the normal probability density function. There are two parameters, $m$ and $\Sigma$, where $m$ is called the mean and $\Sigma$ is called the covariance matrix. It is a symmetric matrix which has all real eigenvalues which are all positive. While it may be reasonable to assume this is the distribution, in general, you won’t know $m$ and $\Sigma$, and in order to use this formula to predict anything, you would need to know these quantities. I am following a nice discussion given in Wikipedia which makes use of the existence of square roots.
What people do to estimate $m$ and $\Sigma$ is to take $n$ independent observations $x_1, \cdots, x_n$ and try to predict what $m$ and $\Sigma$ should be based on these observations. One criterion used for making this determination is the method of maximum likelihood. In this method, you seek to choose the two parameters in such a way as to maximize the likelihood, which is given as
$$\prod_{i=1}^{n} \frac{1}{\det(\Sigma)^{1/2}} \exp\left(-\frac{1}{2}(x_i-m)^{*}\Sigma^{-1}(x_i-m)\right).$$
For convenience the term $(2\pi)^{p/2}$ was ignored. Maximizing the above is equivalent to maximizing the $\ln$ of the above. So taking $\ln$,
$$\frac{n}{2}\ln\left(\det\left(\Sigma^{-1}\right)\right) - \frac{1}{2}\sum_{i=1}^{n}(x_i-m)^{*}\Sigma^{-1}(x_i-m)$$
Note that the above is a function of the entries of $m$. Take the partial derivative with respect to $m_l$ and set it equal to zero. Since the matrix $\Sigma^{-1}$ is symmetric, this implies
$$\sum_{i=1}^{n}\sum_{r}(x_{ir}-m_r)\,\Sigma^{-1}_{rl} = 0 \quad \text{for each } l.$$
Written in terms of vectors,
$$\sum_{i=1}^{n}(x_i-m)^{*}\Sigma^{-1} = 0$$
and so, multiplying by $\Sigma$ on the right and then taking adjoints, this yields
$$\sum_{i=1}^{n}(x_i-m) = 0, \qquad n\,m = \sum_{i=1}^{n} x_i, \qquad m = \frac{1}{n}\sum_{i=1}^{n} x_i \equiv \bar{x}.$$
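The conclusion that the sample mean is the stationary point can be checked numerically. The following is only a minimal sketch, assuming real-valued data (so that $*$ is an ordinary transpose) and NumPy; the helper name `log_likelihood`, the random test data, and the perturbation size are illustrative choices, not part of the discussion above.

```python
import numpy as np

def log_likelihood(X, m, Sigma):
    """Reduced log-likelihood with the (2*pi)^(p/2) factor dropped:
    (n/2) ln det(Sigma^{-1}) - (1/2) sum_i (x_i - m)^T Sigma^{-1} (x_i - m).
    Rows of X are the observations x_i (assumed real)."""
    n = X.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet_inv = np.linalg.slogdet(Sigma_inv)
    D = X - m  # row i holds x_i - m
    quad = np.einsum("ij,jk,ik->", D, Sigma_inv, D)  # sum of the quadratic forms
    return 0.5 * n * logdet_inv - 0.5 * quad

# Hypothetical test data: n observations in dimension p.
rng = np.random.default_rng(0)
p, n = 3, 200
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)  # symmetric with positive eigenvalues
X = rng.multivariate_normal(mean=np.array([1.0, -2.0, 0.5]), cov=Sigma, size=n)

x_bar = X.mean(axis=0)  # candidate maximizer: the sample mean

# Stationarity condition: sum_i (x_i - m)^T Sigma^{-1} = 0 at m = x_bar.
grad = (X - x_bar).sum(axis=0) @ np.linalg.inv(Sigma)

# Moving m away from x_bar should only lower the log-likelihood.
best = log_likelihood(X, x_bar, Sigma)
perturbed = [
    log_likelihood(X, x_bar + 0.1 * rng.normal(size=p), Sigma)
    for _ in range(20)
]
print(np.allclose(grad, 0.0, atol=1e-9), all(best > v for v in perturbed))
```

Here $\Sigma$ is held fixed at an arbitrary positive definite matrix, which reflects the fact that the stationarity condition for $m$ does not depend on which $\Sigma$ is used.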
Now that m is determined, it remains to find the best estimate for Σ.
$$(x_i-m)^{*}\Sigma^{-1}(x_i-m)$$
is a scalar, so since $\operatorname{trace}(AB) = \operatorname{trace}(BA)$,
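$$\begin{aligned}
(x_i-m)^{*}\Sigma^{-1}(x_i-m) &= \operatorname{trace}\left((x_i-m)^{*}\Sigma^{-1}(x_i-m)\right) \\
&= \operatorname{trace}\left((x_i-m)(x_i-m)^{*}\Sigma^{-1}\right)
\end{aligned}$$
Therefore, the thing to maximize is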
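$$\begin{aligned}
& n\ln\left(\det\left(\Sigma^{-1}\right)\right) - \sum_{i=1}^{n}\operatorname{trace}\left((x_i-m)(x_i-m)^{*}\Sigma^{-1}\right) \\
&\quad = n\ln\left(\det\left(\Sigma^{-1}\right)\right) - \operatorname{trace}\left(\overbrace{\left(\sum_{i=1}^{n}(x_i-m)(x_i-m)^{*}\right)}^{S}\,\Sigma^{-1}\right)
\end{aligned}$$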
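
The rewriting of the sum of quadratic forms as a single trace involving $S$ can be sanity-checked numerically. This is a rough sketch under the assumption of real data (adjoint equals transpose) using NumPy; the variable names and the random test matrices are hypothetical and serve only as an illustration.

```python
import numpy as np

# Hypothetical data: n observations in dimension p, stored as rows of X.
rng = np.random.default_rng(1)
p, n = 4, 50
X = rng.normal(size=(n, p))
m = X.mean(axis=0)  # the estimate of the mean found earlier

# An arbitrary symmetric positive definite Sigma to test the identity with.
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)
Sigma_inv = np.linalg.inv(Sigma)

D = X - m  # row i holds x_i - m

# Left side: the quadratic forms summed one observation at a time.
quad_sum = sum(d @ Sigma_inv @ d for d in D)

# Right side: S = sum_i (x_i - m)(x_i - m)^T, then trace(S Sigma^{-1}).
S = D.T @ D
trace_form = np.trace(S @ Sigma_inv)

# Both forms of the quantity to maximize.
_, logdet_inv = np.linalg.slogdet(Sigma_inv)
objective_sum = n * logdet_inv - quad_sum
objective_trace = n * logdet_inv - trace_form
print(np.isclose(objective_sum, objective_trace))
```

Forming `S` once as `D.T @ D` avoids looping over the observations each time the objective is evaluated for a new candidate $\Sigma$.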