the sum of the squares of the singular values of A.
Why is the singular value decomposition important? It implies
\[
A = U \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix} V^{\ast}
\]
where σ is the diagonal matrix having the singular values down the diagonal. Now sometimes A is
a huge matrix, 1000×2000 or something like that. This happens in applications to situations where
the entries of A describe a picture. What also happens is that most of the singular values are very
small. What if you deleted those which were very small, say σ_{i} for all i > l, and got a new
matrix
\[
A' \equiv U \begin{pmatrix} \sigma' & 0 \\ 0 & 0 \end{pmatrix} V^{\ast},
\]
where σ^{′} is the l × l upper left corner of σ?
Then the entries of A^{′} would end up being close to the entries of A but there is much less
information to keep track of. This turns out to be very useful. More precisely,
\[
\lVert A - A' \rVert^{2} = \sum_{j=l+1}^{r} \sigma_{j}^{2},
\]
the sum of the squares of the deleted singular values.
Thus A is approximated by A^{′} where A^{′} has rank l < r. In fact, it is also true that out of
all matrices of rank l, this A^{′} is the one which is closest to A in the Frobenius norm.
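This can be checked numerically. The following is a minimal NumPy sketch (the matrix size and the choice l = 2 are arbitrary, chosen only for illustration): form A^{′} by keeping the l largest singular values and verify that the squared Frobenius error equals the sum of the squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))  # any m x n matrix

# Thin SVD: A = U diag(s) Vt with s in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

l = 2  # keep the l largest singular values
A_prime = U[:, :l] @ np.diag(s[:l]) @ Vt[:l, :]  # rank-l approximation

# Squared Frobenius error equals the sum of squares of discarded singular values
err_sq = np.linalg.norm(A - A_prime, 'fro') ** 2
assert np.isclose(err_sq, np.sum(s[l:] ** 2))
```

Note that `A_prime` here is exactly U (σ′ 0; 0 0) V^{∗} written in thin form, with σ′ the l × l upper left corner of σ.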
Here is roughly why this is so. Suppose B̃ approximates
\[
A = \begin{pmatrix} \sigma_{r \times r} & 0 \\ 0 & 0 \end{pmatrix}
\]
as well as possible out of all matrices B̃ having rank no more than l < r, where r is the size of
the matrix σ_{r×r}. Suppose the rank of B̃ is l. Then obviously no column x_{j} of B̃ in a basis
for the column space can have j > r since if so, the approximation of A could be improved by
simply making this column into a zero column. Therefore there are $\binom{r}{l}$ choices for
columns for a basis for the column space of B̃.
Suppose you pick the first l for instance. Thus the first column of B̃ should be σ_{1}e_{1} to
make the approximation up to the first column as good as possible. Now consider approximating as
well as possible up to the first two columns. Clearly the second column should be σ_{2}e_{2} and
in this way, the approximation up to the first two columns is exact. Continue this way till the
l^{th} column. Then since B̃ has rank l, all other columns should be zero columns since you
cannot have a nonzero entry in any diagonal position and keep the rank of B̃ only l.
Then since it is desired to get the best approximation of A, you wouldn't want any off
diagonal nonzero terms either. The square of the error in doing this, picking the first
l columns as a basis, would be ∑_{j=l+1}^{r}σ_{j}^{2}. On the other hand, if you picked other
columns than the first l in the basis for the column space of B̃, you would have a larger
error because you would include sums involving the larger singular values. Thus, letting
σ^{′} denote the l × l upper left corner of σ, B̃ should be of the form
\[
\begin{pmatrix} \sigma' & 0 \\ 0 & 0 \end{pmatrix}.
\]
For example,
\[
\begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\]
is best approximated by the rank 2 matrix
\[
\begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.
\]
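The example above can also be verified by brute force (a small NumPy sketch): among the rank 2 approximations obtained by keeping two of the three diagonal entries, keeping the two largest singular values, 3 and 2, gives the smallest Frobenius error, and that error is the discarded singular value 1.

```python
import numpy as np
from itertools import combinations

A = np.array([[3., 0., 0., 0.],
              [0., 2., 0., 0.],
              [0., 0., 1., 0.]])

def keep(indices):
    """Rank-len(indices) approximation keeping only the chosen diagonal entries."""
    B = np.zeros_like(A)
    for i in indices:
        B[i, i] = A[i, i]
    return B

# Frobenius error for each choice of two diagonal entries to keep
errors = {idx: np.linalg.norm(A - keep(idx), 'fro')
          for idx in combinations(range(3), 2)}
best = min(errors, key=errors.get)

assert best == (0, 1)                 # keep 3 and 2, the two largest
assert np.isclose(errors[best], 1.0)  # error is the deleted singular value
```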
Now suppose A is an m × n matrix. Let U, V be unitary and of the right size such
that
\[
U^{\ast} A V = \begin{pmatrix} \sigma_{r \times r} & 0 \\ 0 & 0 \end{pmatrix}.
\]
Then suppose B approximates A as well as possible in the Frobenius norm. Then you would
want
\[
\lVert A - B \rVert = \lVert U^{\ast} A V - U^{\ast} B V \rVert =
\left\lVert \begin{pmatrix} \sigma_{r \times r} & 0 \\ 0 & 0 \end{pmatrix} - U^{\ast} B V \right\rVert
\]
to be as small as possible. Therefore, from the above discussion, you should have