Now, we can ignore the constant terms, since they do not affect the result of the maximization. We therefore only need to focus on the terms that depend on $\boldsymbol{\Sigma}_k$, namely
$$
-\frac{1}{2} \sum_{n=1}^N \gamma\left(z_{n k}\right) (\mathbf{x}_n - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) + \frac{1}{2} \sum_{n=1}^N \gamma\left(z_{n k}\right) \ln \left|\boldsymbol{\Sigma}_k^{-1}\right|.
$$
To maximize this expression, we take the derivative with respect to $\boldsymbol{\Sigma}_k$ and set it equal to zero; the resulting stationarity condition yields the maximizer. Specifically, we have
$$
\frac{\partial}{\partial \boldsymbol{\Sigma}_k} \left(-\frac{1}{2} \sum_{n=1}^N \gamma\left(z_{n k}\right) (\mathbf{x}_n - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) + \frac{1}{2} \sum_{n=1}^N \gamma\left(z_{n k}\right) \ln \left|\boldsymbol{\Sigma}_k^{-1}\right|\right) = 0.
$$
First, let's consider the first term:
\[
-\frac{1}{2} \sum_{n=1}^N \gamma(z_{nk}) (\mathbf{x}_n - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k).
\]
Using matrix calculus, the derivative of \(\mathbf{a}^T \mathbf{A}^{-1} \mathbf{a}\) with respect to a symmetric matrix \(\mathbf{A}\) is:
\[
\frac{\partial}{\partial \mathbf{A}} (\mathbf{a}^T \mathbf{A}^{-1} \mathbf{a}) = -\mathbf{A}^{-1} \mathbf{a} \mathbf{a}^T \mathbf{A}^{-1}.
\]
Therefore, for the first term:
\[
\frac{\partial}{\partial \boldsymbol{\Sigma}_k} \left( -\frac{1}{2} \sum_{n=1}^N \gamma(z_{nk}) (\mathbf{x}_n - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) \right) = \frac{1}{2} \sum_{n=1}^N \gamma(z_{nk}) \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) (\mathbf{x}_n - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1}.
\]
Next, consider the second term:
\[
\frac{1}{2} \sum_{n=1}^N \gamma(z_{nk}) \ln \left|\boldsymbol{\Sigma}_k^{-1}\right|.
\]
We know that \(\ln \left|\mathbf{A}^{-1}\right| = -\ln |\mathbf{A}|\), and the derivative of \(\ln |\mathbf{A}|\) with respect to a symmetric \(\mathbf{A}\) is \(\mathbf{A}^{-1}\). Thus,
\[
\frac{\partial}{\partial \boldsymbol{\Sigma}_k} \left( \frac{1}{2} \sum_{n=1}^N \gamma(z_{nk}) \ln \left|\boldsymbol{\Sigma}_k^{-1}\right| \right) = -\frac{1}{2} \sum_{n=1}^N \gamma(z_{nk}) \boldsymbol{\Sigma}_k^{-1}.
\]
Combining the derivatives of both terms and setting the result to zero, we get:
\[
\frac{1}{2} \sum_{n=1}^N \gamma(z_{nk}) \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) (\mathbf{x}_n - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} - \frac{1}{2} \sum_{n=1}^N \gamma(z_{nk}) \boldsymbol{\Sigma}_k^{-1} = 0.
\]
Multiplying this equation on the left and on the right by \(\boldsymbol{\Sigma}_k\) and solving, we obtain the M-step update
\[
\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^N \gamma(z_{nk}) (\mathbf{x}_n - \boldsymbol{\mu}_k) (\mathbf{x}_n - \boldsymbol{\mu}_k)^T,
\]
where \(N_k = \sum_{n=1}^N \gamma(z_{nk})\).
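As a quick numerical illustration of this update, the following is a minimal NumPy sketch (the names \texttt{update\_covariances}, \texttt{X}, \texttt{resp}, and \texttt{mu} are our own illustrative choices, not part of the derivation above):
\begin{verbatim}
import numpy as np

def update_covariances(X, resp, mu):
    # X: (N, D) data; resp: (N, K) responsibilities gamma(z_nk);
    # mu: (K, D) component means. Returns (K, D, D) covariances.
    N, D = X.shape
    K = resp.shape[1]
    Nk = resp.sum(axis=0)                  # effective counts N_k
    Sigma = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]                   # deviations x_n - mu_k, shape (N, D)
        # Sigma_k = (1 / N_k) * sum_n gamma(z_nk) diff_n diff_n^T
        Sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
    return Sigma
\end{verbatim}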
\newpage
To find the optimal value of \(\pi_k\), we can use the Lagrange multiplier method.
Consider the maximization of the following expression with respect to \(\pi_k\) while keeping the responsibilities \(\gamma(z_{nk})\) fixed:
\[
\mathbb{E}_{\mathbf{Z}}[\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi})] = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk}) \left\{\ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right\}.
\]
Let's define the Lagrangian function as:
\[
\mathcal{L}(\boldsymbol{\pi}, \lambda) = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk}) \left\{\ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right\} + \lambda \left(\sum_{k=1}^K \pi_k - 1\right).
\]
Taking the derivative of \(\mathcal{L}\) with respect to \(\pi_k\) and setting it to zero:
\[
\frac{\partial}{\partial \pi_k} \mathcal{L}(\boldsymbol{\pi}, \lambda) = \sum_{n=1}^N \frac{\gamma(z_{nk})}{\pi_k} + \lambda = 0.
\]
Multiplying both sides by \(\pi_k\), we have:
\[
\sum_{n=1}^N \gamma(z_{nk}) = -\lambda \pi_k.
\]
Summing both sides over \(k\), and using \(\sum_{k=1}^K \pi_k = 1\) together with \(\sum_{k=1}^K \gamma(z_{nk}) = 1\), we can write:
\[
N = \sum_{k=1}^K \sum_{n=1}^N \gamma(z_{nk}) = -\lambda \sum_{k=1}^K \pi_k = -\lambda.
\]
Thus, \(-\lambda = N\).
Substituting \(-\lambda = N\) back into the first equation:
\[
\sum_{n=1}^N \gamma(z_{nk}) = N \pi_k.
\]
Solving for \(\pi_k\), we get:
\[
\pi_k = \frac{N_k}{N},
\]
where \(N_k = \sum_{n=1}^N \gamma(z_{nk})\).
Therefore, using the Lagrange multiplier method, we have shown that \(\pi_k = \frac{N_k}{N}\) maximizes the given expression while keeping \(\gamma(z_{nk})\) fixed.
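As a sketch of how this update looks in code (assuming the responsibilities are stored in an \((N, K)\) NumPy array; the name \texttt{resp} is our own choice):
\begin{verbatim}
import numpy as np

def update_mixing_coefficients(resp):
    # resp: (N, K) responsibilities gamma(z_nk). Returns pi, shape (K,).
    Nk = resp.sum(axis=0)      # N_k = sum_n gamma(z_nk)
    return Nk / resp.shape[0]  # pi_k = N_k / N
\end{verbatim}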
\newpage
Alternatively, the same result can be obtained by maximizing the incomplete-data log-likelihood directly with respect to \(\pi_k\). \\
Here, we need to find the derivative of the following expression with respect to \(\pi_k\) and then solve for \(\pi_k\):
\[
\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) + \lambda \left( \sum_{k=1}^K \pi_k - 1 \right)
\]
First, let's write out the log-likelihood function. Suppose \(\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}\) are the observations, and \(\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_K)\) are the mixing coefficients for the Gaussian distributions. The log-likelihood function is:
\[
\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^N \ln \left( \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)
\]
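As an aside, this log-likelihood is straightforward to evaluate numerically. A minimal sketch, assuming SciPy is available (the names \texttt{log\_likelihood}, \texttt{pi}, \texttt{mu}, and \texttt{Sigma} are our own illustrative choices):
\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, pi, mu, Sigma):
    # ln p(X | pi, mu, Sigma) = sum_n ln( sum_k pi_k N(x_n | mu_k, Sigma_k) )
    dens = sum(pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
               for k in range(len(pi)))    # mixture density, shape (N,)
    return np.log(dens).sum()
\end{verbatim}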
Next, consider the objective function with the Lagrange multiplier:
\[
\mathcal{L} = \sum_{n=1}^N \ln \left( \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right) + \lambda \left( \sum_{k=1}^K \pi_k - 1 \right)
\]
We take the derivative of \(\mathcal{L}\) with respect to \(\pi_k\):
\[
\frac{\partial \mathcal{L}}{\partial \pi_k} = \sum_{n=1}^N \frac{\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} + \lambda
\]
Define \(\gamma(z_{nk})\) as the posterior probability that data point \(\mathbf{x}_n\) belongs to the \(k\)-th Gaussian component:
\[
\gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
\]
Thus, we can rewrite the derivative as:
\[
\frac{\partial \mathcal{L}}{\partial \pi_k} = \sum_{n=1}^N \frac{\gamma(z_{nk})}{\pi_k} + \lambda
\]
To find the optimal value, set the derivative to zero:
\[
\sum_{n=1}^N \frac{\gamma(z_{nk})}{\pi_k} + \lambda = 0
\]
Multiplying the equation through by \(\pi_k\):
\[
\sum_{n=1}^N \gamma(z_{nk}) = -\lambda \pi_k
\]
We know that \(\sum_{k=1}^K \pi_k = 1\) and \(\sum_{k=1}^K \gamma(z_{nk}) = 1\); summing over \(k\) therefore gives:
\[
\sum_{k=1}^K \sum_{n=1}^N \gamma(z_{nk}) = -\lambda \sum_{k=1}^K \pi_k = -\lambda
\]
Thus,
\[
-\lambda = N
\]
Substitute \(-\lambda\) back into the previous equation:
\[
\sum_{n=1}^N \gamma(z_{nk}) = N \pi_k
\]
Solve for \(\pi_k\):
\[
\pi_k = \frac{\sum_{n=1}^N \gamma(z_{nk})}{N}
\]
Defining \(N_k = \sum_{n=1}^N \gamma(z_{nk})\) as the effective number of samples assigned to the \(k\)-th Gaussian component, we again arrive at:
\[
\pi_k = \frac{N_k}{N}
\]
which agrees with the result obtained earlier from the expected complete-data log-likelihood.
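Finally, since every update above is driven by the responsibilities \(\gamma(z_{nk})\), here is a minimal sketch of the E-step that computes them (assuming SciPy; all names are our own illustrative choices, not from the derivation):
\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mu, Sigma):
    # gamma(z_nk) = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
    weighted = np.column_stack(
        [pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
         for k in range(len(pi))])         # shape (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)
\end{verbatim}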