
MLE and MAP

Physvillain 2021. 7. 31. 11:57

In this post, I'm planning to cover some basic statistics required for ML: the maximum likelihood estimator (MLE) and the maximum a posteriori (MAP) estimator. The latter is based on the Bayesian approach.

 

[Definition 1] If $X_1, \cdots, X_n$ are independent and identically distributed (i.i.d. for short) random variables with the pdf or pmf $f(x\vert\theta)$, then the function of $\theta$ defined by $L(\theta\vert\mathbf{x}):=f(\mathbf{x}\vert\theta)=\prod_{i=1}^{n} f\left(x_i \vert \theta \right)$, $\mathbf{x}=\left( x_1, \cdots, x_n \right)$, is called the likelihood function. The maximum likelihood estimator of $\theta$ is defined by $\theta_{MLE}(\mathbf{x}):=\mathrm{argmax}_\theta L\left(\theta\vert\mathbf{x}\right)$, i.e., the value of $\theta$ that maximizes the likelihood function $L\left( \theta \vert \mathbf{x} \right)$.

 

[Definition 2] Let $X_1, \cdots, X_n$ be i.i.d. random variables with pdf or pmf $f(x\vert\theta)$. In Bayesian statistics, the uncertainty about $\theta$ is described by a probability distribution $\pi(\theta)$ called the prior. Then, with the sampling distribution $f\left( \mathbf{x}\vert\theta \right) = \prod_{i=1}^{n} f\left( x_i \vert \theta \right)$, we get the posterior distribution

$$\pi(\theta\vert\mathbf{x})=\frac{f\left(\mathbf{x}\vert\theta\right)\pi(\theta)}{\int f\left( \mathbf{x} \vert \theta \right) \pi(\theta) d\theta}$$

Also, the maximum a posteriori estimator of $\theta$ is defined by $\theta_{MAP} (\mathbf{x}) := \mathrm{argmax}_{\theta} \pi(\theta\vert\mathbf{x})$.

 

We often calculate the MLE and MAP as $\theta_{MLE}(\mathbf{x})=\mathrm{argmax}_\theta \log\left( L(\theta\vert\mathbf{x}) \right)$ and $\theta_{MAP}(\mathbf{x})=\mathrm{argmax}_\theta \log \left( \pi \left( \theta\vert\mathbf{x} \right)\right)$. This works because log is a monotonically increasing function. Also, for MAP, note that $\int f(\mathbf{x}\vert\theta) \pi(\theta) d\theta$ is not a function of $\theta$, so it does not affect the calculation of $\mathrm{argmax}_\theta$. Here are some examples.
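Before the examples, here is a minimal numerical sketch of this idea (my own illustration, not part of the derivations in this post): it maximizes a log-likelihood and a log-posterior with scipy for the mean of a normal distribution with known variance $1$ and a hypothetical $N(0,1)$ prior. The data and the prior are assumptions made purely for the demo.

```python
# A minimal sketch (assumed setup, not from the derivations above): MLE and MAP for
# the mean of a normal distribution with known variance 1 and a hypothetical N(0, 1)
# prior, both obtained by maximizing the log-likelihood / log-posterior.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)          # synthetic observations

def log_likelihood(theta):
    return np.sum(-0.5 * (x - theta) ** 2)           # up to an additive constant

def log_posterior(theta):
    return log_likelihood(theta) - 0.5 * theta ** 2  # add the log of the N(0, 1) prior

theta_mle = minimize_scalar(lambda t: -log_likelihood(t)).x
theta_map = minimize_scalar(lambda t: -log_posterior(t)).x

print(theta_mle, x.mean())                          # MLE equals the sample mean
print(theta_map, len(x) * x.mean() / (len(x) + 1))  # MAP shrinks toward the prior mean 0
```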

 

[Example 1] Let $X_1, \cdots, X_n$ be i.i.d. exponential random variables with pdf $f(x\vert \lambda)=\lambda e^{-\lambda x} I(\{x>0\})$. Then the MLE of $\lambda$ can be found as follows.

$$\begin{align*} L(\lambda\vert\mathbf{x}) & = \lambda^n \prod_{i=1}^{n} e^{-\lambda x_i} I(\{x_i >0\}) \\ l(\lambda\vert\mathbf{x}) & = \log \left( L(\lambda \vert \mathbf{x})\right) = n \log \lambda - \sum_{i=1}^{n} \left( \lambda x_i -\log \left( I(\{x_i > 0\}) \right) \right) \\ \frac{dl}{d\lambda} & = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i \end{align*}$$

Since log is monotonically increasing, finding the MLE of $\lambda$ is equivalent to finding the $\lambda$ that maximizes $l(\lambda\vert\mathbf{x})$. Setting $\frac{dl}{d\lambda}=0$, we get $\lambda_{MLE} (\mathbf{x}) = \frac{n}{\sum_i x_i}$.
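As a quick sanity check (my own sketch, on synthetic data with an arbitrary true rate $\lambda=3$), numerically maximizing the log-likelihood reproduces $n/\sum_i x_i$:

```python
# Numerical check of Example 1 (assumed synthetic data): for i.i.d. Exponential(lambda)
# samples, the maximizer of the log-likelihood should equal n / sum(x_i).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
lam_true = 3.0
x = rng.exponential(scale=1.0 / lam_true, size=1000)   # numpy parametrizes by scale = 1/lambda

def neg_log_likelihood(lam):
    return -(len(x) * np.log(lam) - lam * x.sum())

lam_mle = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded").x
print(lam_mle, len(x) / x.sum())    # both close to lam_true = 3.0
```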

 

[Example 2] Let $Y$ be a $\chi_1^2$ random variable and $X_1, \cdots, X_n$ be i.i.d. random variables whose pdf equals that of $\theta Y$ for some $\theta>0$. First, we need the distribution of $\theta Y$. Since the pdf of $Y$ is

$$f_Y(y)=\frac{1}{\Gamma \left(\frac{1}{2}\right)\sqrt{2}} y^{-\frac{1}{2}} e^{-\frac{y}{2}} I\left(\{y>0\}\right)$$

using $X=\theta Y$, we get

$$ f_X(x\vert\theta)=\frac{1}{\Gamma\left(\frac{1}{2}\right)\sqrt{2}} \left(\frac{x}{\theta}\right)^{-\frac{1}{2}} \frac{e^{-\frac{x}{2\theta}}}{\theta} I\left(\{x>0\}\right)=\frac{1}{\Gamma\left(\frac{1}{2}\right)(2\theta)^{\frac{1}{2}}} x^{-\frac{1}{2}} e^{-\frac{x}{2\theta}} I\left( \{ x>0 \} \right)$$
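This derived density is exactly a Gamma density with shape $\frac{1}{2}$ and scale $2\theta$. Here is a small sanity check of the change of variables (my own sketch, with an arbitrary $\theta=2$), comparing the formula with scipy's gamma pdf and with simulated values of $\theta Y$:

```python
# Sanity check of the derived pdf (assumed theta = 2.0 for illustration): f_X(x | theta)
# should coincide with the Gamma(shape = 1/2, scale = 2*theta) density, and samples of
# theta * Y with Y ~ chi^2_1 should have mean theta * E[chi^2_1] = theta.
import numpy as np
from scipy import stats
from scipy.special import gamma as Gamma

theta = 2.0
x = np.linspace(0.1, 20.0, 7)
f_derived = x ** (-0.5) * np.exp(-x / (2 * theta)) / (Gamma(0.5) * np.sqrt(2 * theta))
print(np.allclose(f_derived, stats.gamma.pdf(x, a=0.5, scale=2 * theta)))   # True

rng = np.random.default_rng(2)
samples = theta * rng.chisquare(df=1, size=100_000)
print(samples.mean())   # close to theta = 2.0
```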

First, we can find the MLE of $\theta$. Since the pdf of $X_i$ is equal to the pdf of $\theta Y$,

$$\begin{align*} L(\theta\vert\mathbf{x}) & =\prod_{i} f_{X_i}\left(x_i \vert \theta \right) = \left( \Gamma \left(\frac{1}{2}\right) (2\theta)^{\frac{1}{2}} \right)^{-n} \prod_i x_i^{-\frac{1}{2}} e^{-\frac{x_i}{2\theta}} I\left(\{x_i>0\}\right) \\ l(\theta\vert\mathbf{x}) & =\log \left( L(\theta\vert\mathbf{x})\right) = C - \frac{n}{2} \log\theta + \sum_i -\frac{x_i}{2\theta} \\ \frac{dl}{d\theta} &= -\frac{n}{2\theta} + \frac{1}{2\theta^2}\sum_i x_i \end{align*}$$

Setting $\frac{dl}{d\theta}=0$, we easily get $\theta_{MLE}(\mathbf{x})=\frac{1}{n}\sum_i x_i$.
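Again as a sketch of my own (synthetic data with an arbitrary true $\theta=2$), a numerical maximization of $l(\theta\vert\mathbf{x})$ agrees with the sample mean:

```python
# Numerical check of the MLE in Example 2 (assumed synthetic data): maximizing
# l(theta | x) should return the sample mean of the x_i.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
theta_true = 2.0
x = theta_true * rng.chisquare(df=1, size=500)

def neg_log_likelihood(theta):
    return 0.5 * len(x) * np.log(theta) + x.sum() / (2 * theta)   # minus the non-constant part of l

theta_mle = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded").x
print(theta_mle, x.mean())   # both close to theta_true = 2.0
```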

Now, suppose $\theta$ has the prior distribution $\pi(\theta)=\frac{1}{\theta^2} e^{-\frac{1}{\theta}} I\left(\{\theta>0\}\right)$. In this case (and in general, for any given prior), we can find not only the posterior distribution of $\theta$ but also the MAP estimator of $\theta$. The posterior is obtained directly from the definition.

$$\begin{align*} \pi(\theta\vert\mathbf{x}) & =\frac{f(\mathbf{x}\vert\theta)\pi(\theta)}{\int f(\mathbf{x}\vert\theta) \pi(\theta) d\theta} = C'\theta^{-\frac{n}{2}} \left( \prod_{i} e^{-\frac{x_i}{2\theta}} \right) \frac{e^{-\frac{1}{\theta}}}{\theta^2} I\left( \{ \theta>0 \}\right) \\ &= C' \theta^{-\frac{n}{2}-2}e^{-\left(\left(\sum_i x_i/2\right)+1\right)/\theta} I\left(\{ \theta>0 \}\right) \end{align*}$$

Note that $\int \pi(\theta\vert\mathbf{x}) d\theta=1$ and $\int_0^{\infty} x^{-\alpha-1}e^{-\frac{1}{\beta x}} dx = \Gamma(\alpha) \beta^\alpha$. Applying the second identity with $\alpha=\frac{n+2}{2}$ and $\frac{1}{\beta}=\left(\sum_i \frac{x_i}{2}\right)+1$, we get

$$C'=\Gamma\left(\frac{n+2}{2}\right)^{-1} \left(\left(\sum_i \frac{x_i}{2}\right) +1 \right)^{\frac{n+2}{2}}$$

Therefore, the posterior distribution of $\theta$ is

$$\pi(\theta\vert\mathbf{x})=\Gamma\left(\frac{n+2}{2}\right)^{-1} \left(\left(\sum_i \frac{x_i}{2}\right) +1 \right)^{\frac{n+2}{2}} \theta^{-\frac{n+4}{2}} e^{-\frac{\left(\sum_i \frac{x_i}{2}\right)+1}{\theta}} I\left(\{ \theta > 0 \}\right)$$
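This is an inverse-gamma density with shape $\frac{n+2}{2}$ and scale $\left(\sum_i \frac{x_i}{2}\right)+1$. As a quick check of the normalization (a sketch of my own, on synthetic data), the formula above matches scipy's invgamma pdf pointwise:

```python
# Check of the posterior normalization (assumed synthetic data): with the constant
# above, pi(theta | x) is an inverse-gamma density with shape (n + 2) / 2 and scale
# sum(x_i) / 2 + 1, so it should match scipy's invgamma pdf pointwise.
import numpy as np
from scipy import stats
from scipy.special import gamma as Gamma

rng = np.random.default_rng(4)
x = 2.0 * rng.chisquare(df=1, size=20)   # synthetic data from the model with theta = 2
n, B = len(x), x.sum() / 2 + 1

theta = np.linspace(0.5, 10.0, 7)
posterior = B ** ((n + 2) / 2) / Gamma((n + 2) / 2) * theta ** (-(n + 4) / 2) * np.exp(-B / theta)
print(np.allclose(posterior, stats.invgamma.pdf(theta, a=(n + 2) / 2, scale=B)))   # True
```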

And the MAP estimator of $\theta$ is

$$\begin{align*} \theta_{MAP}(\mathbf{x}) & =\mathrm{argmax}_\theta \log \left(\pi(\theta\vert\mathbf{x})\right) \\ & = \mathrm{argmax}_\theta \left[ -\frac{n+4}{2} \log\theta - \frac{1}{\theta}\left( \frac{1}{2}\sum_i x_i + 1 \right) \right] \end{align*}$$

To maximize this, we set the derivative of the expression inside the argmax with respect to $\theta$ to zero, which gives $\theta_{MAP}(\mathbf{x})=\frac{\sum_i x_i +2}{n+4}$.
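Finally, as one more sketch of my own on the same synthetic data, numerically maximizing the log-posterior reproduces this closed form:

```python
# Numerical check of the MAP estimator (assumed synthetic data): maximizing the
# log-posterior should reproduce the closed form (sum(x_i) + 2) / (n + 4).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = 2.0 * rng.chisquare(df=1, size=20)   # same synthetic data as in the previous snippet
n, B = len(x), x.sum() / 2 + 1

def neg_log_posterior(theta):
    return (n + 4) / 2 * np.log(theta) + B / theta   # minus the log-posterior kernel

theta_map = minimize_scalar(neg_log_posterior, bounds=(1e-6, 50.0), method="bounded").x
print(theta_map, (x.sum() + 2) / (n + 4))   # the two values agree
```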
