Kenneth Kuttler

356 CHAPTER 13. MATRICES AND THE INNER PRODUCT

The expression in the integral is called the normal probability density function. There are two parameters, m and Σ, where m is called the mean and Σ is called the covariance matrix. It is a symmetric matrix whose eigenvalues are all real and positive. While it may be reasonable to assume this is the distribution, in general you won't know m and Σ, and in order to use this formula to predict anything, you would need to know these quantities. I am following a nice discussion given in Wikipedia which makes use of the existence of square roots.

What people do to estimate m and Σ is to take n independent observations $x_1, \cdots, x_n$ and try to predict what m and Σ should be based on these observations. One criterion used for making this determination is the method of maximum likelihood. In this method, you seek to choose the two parameters in such a way as to maximize the likelihood, which is given as

\[
\prod_{i=1}^{n} \frac{1}{\det(\Sigma)^{1/2}} \exp\left(-\frac{1}{2}(x_i - m)^{*}\,\Sigma^{-1}(x_i - m)\right).
\]

For convenience the term $(2\pi)^{p/2}$ was ignored. Maximizing the above is equivalent to maximizing the ln of the above. So taking ln,

\[
\frac{n}{2}\ln\left(\det\left(\Sigma^{-1}\right)\right) - \frac{1}{2}\sum_{i=1}^{n}(x_i - m)^{*}\,\Sigma^{-1}(x_i - m).
\]
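As a quick numerical check of the step from the product to its logarithm, here is a minimal sketch in plain Python for the case p = 2. The observations `xs`, the candidate mean `m`, and the candidate covariance `sigma` are made-up values used only for illustration, and the helpers `det2`, `inv2`, and `quad_form` are hypothetical names, not anything from the text.

```python
import math

# Hypothetical 2-D observations, chosen only for illustration.
xs = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (0.0, 1.0)]
m = (1.5, 2.0)                      # a candidate mean (an assumption)
sigma = ((2.0, 0.5), (0.5, 1.0))    # a candidate symmetric positive definite matrix


def det2(a):
    """Determinant of a 2x2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def inv2(a):
    """Inverse of a 2x2 matrix with nonzero determinant."""
    d = det2(a)
    return ((a[1][1] / d, -a[0][1] / d), (-a[1][0] / d, a[0][0] / d))


def quad_form(x, mean, a):
    """Compute (x - mean)^* a (x - mean) for real 2-vectors."""
    v = (x[0] - mean[0], x[1] - mean[1])
    return sum(v[r] * a[r][s] * v[s] for r in range(2) for s in range(2))


sigma_inv = inv2(sigma)
n = len(xs)

# Likelihood as a product; the (2*pi)^(p/2) factor is ignored, as in the text.
likelihood = 1.0
for x in xs:
    likelihood *= math.exp(-0.5 * quad_form(x, m, sigma_inv)) / math.sqrt(det2(sigma))

# Log-likelihood from the formula obtained by taking ln.
log_likelihood = (n / 2) * math.log(det2(sigma_inv)) - 0.5 * sum(
    quad_form(x, m, sigma_inv) for x in xs
)

# The two quantities should agree up to floating-point rounding.
print(math.isclose(math.log(likelihood), log_likelihood))
```

Working with the logarithm is also the safer choice numerically, since a product of many small factors can underflow while the sum of their logarithms does not.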

Note that the log-likelihood is a function of the entries of m. Take the partial derivative with respect to $m_l$ and set it equal to zero. Since the matrix $\Sigma^{-1}$ is symmetric, this implies

\[
\sum_{i=1}^{n}\sum_{r}(x_{ir} - m_r)\,\Sigma^{-1}_{rl} = 0 \quad \text{for each } l.
\]
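To see where this condition comes from, here is a sketch of the computation, assuming all entries are real (so that $*$ is just the transpose) and writing $q_i$ for the i-th quadratic form, a name introduced here only for convenience:

\[
q_i \equiv (x_i - m)^{*}\,\Sigma^{-1}(x_i - m) = \sum_{r}\sum_{s}(x_{ir} - m_r)\,\Sigma^{-1}_{rs}\,(x_{is} - m_s),
\]
\[
\frac{\partial q_i}{\partial m_l} = -\sum_{s}\Sigma^{-1}_{ls}(x_{is} - m_s) - \sum_{r}(x_{ir} - m_r)\,\Sigma^{-1}_{rl} = -2\sum_{r}(x_{ir} - m_r)\,\Sigma^{-1}_{rl}.
\]

The last equality uses the symmetry $\Sigma^{-1}_{ls} = \Sigma^{-1}_{sl}$. The term involving $\ln(\det(\Sigma^{-1}))$ does not depend on m, so setting the partial derivative of the whole expression equal to zero and dropping the constant factor leaves exactly the double sum displayed above.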

Written in terms of vectors,

\[
\sum_{i=1}^{n}(x_i - m)^{*}\,\Sigma^{-1} = 0
\]

and so, multiplying by Σ on the right and then taking adjoints, this yields

\[
\sum_{i=1}^{n}(x_i - m) = 0, \quad n\,m = \sum_{i=1}^{n} x_i, \quad m = \frac{1}{n}\sum_{i=1}^{n} x_i \equiv \bar{x}.
\]
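The estimate for m is just the componentwise average of the observations. A minimal sketch in plain Python, using made-up observations `xs` and a hypothetical helper `sample_mean`, computes it and confirms that the deviations sum to the zero vector:

```python
# Hypothetical 2-D observations, chosen only for illustration.
xs = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (0.0, 1.0)]


def sample_mean(points):
    """Return the componentwise average of a list of equal-length vectors."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[r] for p in points) / n for r in range(dim))


m_hat = sample_mean(xs)

# The deviations x_i - m_hat should sum to the zero vector (up to rounding).
residual = tuple(sum(p[r] - m_hat[r] for p in xs) for r in range(len(m_hat)))

print(m_hat)
print(residual)
```

Note that this estimate does not involve Σ at all, which is why m can be settled before turning to Σ.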

Now that m is determined, it remains to find the best estimate for Σ.

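\[
(x_i - m)^{*}\,\Sigma^{-1}(x_i - m)
\]

is a scalar, so since $\operatorname{trace}(AB) = \operatorname{trace}(BA)$,

\[
\begin{aligned}
(x_i - m)^{*}\,\Sigma^{-1}(x_i - m) &= \operatorname{trace}\left((x_i - m)^{*}\,\Sigma^{-1}(x_i - m)\right) \\
&= \operatorname{trace}\left((x_i - m)(x_i - m)^{*}\,\Sigma^{-1}\right).
\end{aligned}
\]

Therefore, after multiplying the log-likelihood by 2, which does not change where the maximum occurs, the thing to maximize is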

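\[
\begin{aligned}
& n\ln\left(\det\left(\Sigma^{-1}\right)\right) - \sum_{i=1}^{n}\operatorname{trace}\left((x_i - m)(x_i - m)^{*}\,\Sigma^{-1}\right) \\
={}& n\ln\left(\det\left(\Sigma^{-1}\right)\right) - \operatorname{trace}\Bigg(\overbrace{\Bigg(\sum_{i=1}^{n}(x_i - m)(x_i - m)^{*}\Bigg)}^{S}\,\Sigma^{-1}\Bigg)
\end{aligned}
\]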
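The trace rewriting can be checked numerically. The following is a minimal sketch in plain Python for p = 2, with made-up observations `xs` and an arbitrary candidate `sigma`. The helpers `det2`, `inv2`, `matmul2`, and `trace2` are hypothetical names introduced for this illustration. It builds the matrix S with m taken to be the sample mean and compares the sum of quadratic forms with the trace of S times the inverse of the candidate.

```python
import math

# Hypothetical 2-D observations, chosen only for illustration.
xs = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (0.0, 1.0)]
n = len(xs)
m = tuple(sum(x[r] for x in xs) / n for r in range(2))   # sample mean
sigma = ((2.0, 0.5), (0.5, 1.0))   # an arbitrary candidate covariance (an assumption)


def det2(a):
    """Determinant of a 2x2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def inv2(a):
    """Inverse of a 2x2 matrix with nonzero determinant."""
    d = det2(a)
    return ((a[1][1] / d, -a[0][1] / d), (-a[1][0] / d, a[0][0] / d))


def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[r][k] * b[k][s] for k in range(2)) for s in range(2)]
            for r in range(2)]


def trace2(a):
    """Trace of a 2x2 matrix."""
    return a[0][0] + a[1][1]


sigma_inv = inv2(sigma)
deviations = [(x[0] - m[0], x[1] - m[1]) for x in xs]

# S is the sum of the outer products (x_i - m)(x_i - m)^*.
S = [[0.0, 0.0], [0.0, 0.0]]
for v in deviations:
    for r in range(2):
        for s in range(2):
            S[r][s] += v[r] * v[s]

# Sum of the scalar quadratic forms (x_i - m)^* sigma_inv (x_i - m).
quad_sum = sum(v[r] * sigma_inv[r][s] * v[s]
               for v in deviations for r in range(2) for s in range(2))

# The same quantity written as a single trace.
trace_form = trace2(matmul2(S, sigma_inv))

objective_sum = n * math.log(det2(sigma_inv)) - quad_sum
objective_trace = n * math.log(det2(sigma_inv)) - trace_form

print(S)
print(math.isclose(objective_sum, objective_trace))
```

The practical benefit of the trace form is that S is computed once from the data, after which the expression depends on the unknown Σ only through its inverse.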


