Maximum Likelihood Estimator

Introduction

The maximum likelihood estimator (MLE) is a popular approach to estimation problems. Firstly, if an efficient unbiased estimator exists, it is the MLE. Secondly, even if no efficient estimator exists, the mean and the variance of the MLE converge asymptotically to the true parameter and the CRLB as the number of observations increases. Thus, the MLE is asymptotically unbiased and asymptotically efficient. The principle of the MLE is to maximize the so-called likelihood function, which is a known function of the observed data and the unknown parameter. Another important function in the context of the MLE is the log-likelihood function. It reveals the fundamental connection of the MLE to the CRLB. For simple problems, the maximum can be found analytically. However, for more complex estimation problems, we rely on numerical methods addressed in the module on numerical methods.

Maximum Likelihood Estimation

Before defining the MLE, we define the likelihood function $\mathcal{L}(\mathbf{x};\theta)$. The likelihood function is the PDF $\mathcal{L}(\mathbf{x};\theta)=p(\mathbf{x};\theta)$ viewed for a given, fixed observation $\mathbf{x}$. Since the observation $\mathbf{x}$ is fixed, the PDF $p(\mathbf{x};\theta)$ depends only on the unknown parameter $\theta$. The value of $\theta$ that maximizes the likelihood function is the maximum likelihood estimate $\hat{\theta}_{\text{ML}}$, i.e., \begin{equation} \hat\theta_{\text{ML}} = \underset{\theta}{\operatorname{arg max}} \; p(\mathbf{x};\theta). \end{equation} In other words, the maximum likelihood estimate is the value of $\theta$ that most likely caused the observation $\mathbf{x}$. Note that, depending on the estimation problem, no maximum or multiple maxima may exist.
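To make the definition concrete, the following sketch evaluates the likelihood over a grid of candidate values of $\theta$ and picks the maximizer. The model (a DC level in white Gaussian noise with known variance), the grid, and all numerical values are assumptions chosen purely for illustration.

```python
import numpy as np

# Illustrative model (an assumption for this sketch): x[n] = theta + w[n],
# with w[n] IID Gaussian noise of known variance sigma2.
rng = np.random.default_rng(0)
theta_true, sigma2, N = 1.5, 0.5, 100
x = theta_true + rng.normal(scale=np.sqrt(sigma2), size=N)

# Evaluate ln p(x; theta) on a grid of candidate values. Working with the
# logarithm avoids numerical underflow and does not change the arg max;
# constants independent of theta are dropped for the same reason.
theta_grid = np.linspace(0.0, 3.0, 1001)
log_p = np.array([-0.5 * np.sum((x - t) ** 2) / sigma2 for t in theta_grid])

theta_ml = theta_grid[np.argmax(log_p)]
print(f"grid-search estimate: {theta_ml:.3f} (true value: {theta_true})")
```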


Example:

Let $x[0],x[1],\dots,x[N-1]$ be IID random variables with PDF \begin{equation} p(x[i];\theta) = \begin{cases} \theta^{-1}, & 0 \leq x[i] \leq \theta,\\ 0, & \text{else}. \end{cases} \end{equation} The unknown parameter $\theta>0$ determines the length of the interval. Due to the IID assumption, we further have that \begin{equation} p(\mathbf{x};\theta) = \begin{cases} \theta^{-N}, & \text{if } 0 \leq x[i] \leq \theta \text{ for } 0\leq i \leq N-1, \\ 0, & \text{else}. \end{cases} \end{equation}

The value of $\theta$ must be larger than or equal to the largest value in the observed data; otherwise, the data would have zero probability of occurring. Thus, we can equivalently express the PDF as \begin{equation} p(\mathbf{x};\theta) = \begin{cases} \theta^{-N}, & 0 \leq \max(x[0],x[1],\dots,x[N-1]) \leq \theta, \\ 0, & \text{else}. \end{cases} \end{equation} As a function of $\theta$, the likelihood is strictly monotonically decreasing, so it is maximized by choosing $\theta$ as small as possible. However, since $\theta \geq \max(x[0],x[1],\dots,x[N-1])$, we get \begin{equation} \hat\theta_{\text{ML}} = \max(x[0],x[1],\dots,x[N-1]). \end{equation}
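A minimal numerical check of this result, assuming we draw $N$ samples from a uniform distribution on $[0,\theta]$ (the sample size and true $\theta$ below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, N = 2.0, 50

# IID samples uniform on [0, theta_true]
x = rng.uniform(0.0, theta_true, size=N)

# The MLE of the interval length is the largest observation
theta_ml = x.max()
print(f"true theta: {theta_true}, MLE: {theta_ml:.3f}")
# The estimate never exceeds theta_true and approaches it as N grows.
```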


Instead of maximizing the likelihood function, we can also maximize the so-called log-likelihood function defined as \begin{equation} \ell(\mathbf{x};\theta) = \ln \mathcal{L}(\mathbf{x};\theta). \end{equation} Since the logarithm is a monotonically increasing function, $\ln p(\mathbf{x};\theta)$ and $p(\mathbf{x};\theta)$ attain their maxima at the same value of $\theta$. If the log-likelihood function is differentiable and the maximum is at an interior point, the derivative is zero at the maximum, i.e., \begin{equation} \left.\frac{\partial}{\partial \theta} \ell(\mathbf{x};\theta) \right\rvert_{\theta = \hat\theta_{\text{ML}}} = 0. \label{eq:log_ll_max} \end{equation} The above equation is referred to as the likelihood equation. We already encountered the log-likelihood function in the module on the CRLB, which indicates its fundamental importance in estimation theory. If the log-likelihood function is not differentiable, other techniques have to be applied.

We have shown that an efficient estimator can be obtained if the derivative of the log-likelihood function can be expressed as \begin{equation} \frac{\partial}{\partial\theta} \ell(\mathbf{x};\theta)=\mathcal{I}(\theta)(g(\mathbf{x})-\theta). \label{eq:efficient} \end{equation} Combining \eqref{eq:log_ll_max} and \eqref{eq:efficient} yields \begin{equation} \left.\mathcal{I}(\theta)(g(\mathbf{x})-\theta) \right\rvert_{\theta = \hat\theta_{\text{ML}}} = 0. \label{eq:ml_efficient} \end{equation} Since the Fisher information is a strictly positive quantity $(\mathcal{I}(\theta)>0)$, we require that \begin{equation} g(\mathbf{x})=\hat\theta_{\text{ML}}. \end{equation} Consequently, if an efficient estimator exists, then it is the MLE.


Example:

We have already seen that the sample mean is an efficient estimator for estimating the DC level $A$ in the presence of additive white Gaussian noise. Moreover, we have just shown that if an efficient estimator exists, it is the MLE, and thus the sample mean is also the MLE. We can verify this by looking at the partial derivative of the log-likelihood function, which is \begin{equation} \frac{\partial}{\partial A} \ell(\mathbf{x};A) = \frac{1}{\sigma^2} \sum_{n=0}^{N-1}(x[n]-A). \end{equation} Equating it with zero and solving for $A$ yields \begin{equation} \hat{A}_{\text{ML}} = \frac{1}{N}\sum_{n=0}^{N-1}x[n], \end{equation} which is the efficient estimator found in the module on the CRLB.
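As a sanity check, the sketch below numerically maximizes the log-likelihood of the DC-level model with scipy's scalar minimizer (applied to the negative log-likelihood) and compares the result to the closed-form sample mean; the values of $A$, $\sigma^2$, and $N$ are arbitrary choices for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
A_true, sigma2, N = 1.0, 0.8, 200
x = A_true + rng.normal(scale=np.sqrt(sigma2), size=N)

# Negative log-likelihood of the DC level in white Gaussian noise
# (additive constants independent of A are dropped).
def neg_log_lik(A):
    return 0.5 * np.sum((x - A) ** 2) / sigma2

res = minimize_scalar(neg_log_lik)
print(f"numerical MLE: {res.x:.4f}, sample mean: {x.mean():.4f}")
```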


Properties of the Maximum Likelihood Estimator

The MLE has many appealing properties for large sample sizes that can be summarized as follows: \begin{align} \mathbb{E}\left[\hat\theta_{\text{ML}}\right] &\rightarrow \theta,\\ \mathrm{Var}\left[\hat\theta_{\text{ML}}\right] &\rightarrow \mathcal{I}^{-1}(\theta). \end{align} Moreover, $\hat\theta_{\text{ML}}$ is asymptotically normal. Combining these properties, we have for $N\rightarrow\infty$ that \begin{equation} \hat\theta_{\text{ML}}\sim \mathcal{N}\left(\theta,\mathcal{I}^{-1}(\theta)\right), \end{equation} i.e., the MLE is consistent and the maximum likelihood estimate is asymptotically normally distributed.
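The asymptotic behaviour can be visualized with a small Monte Carlo experiment; here we reuse the DC-level model, for which $\mathcal{I}^{-1}(A) = \sigma^2/N$, and compare the empirical mean and variance of the MLE over many trials with these asymptotic values (all numerical settings below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(3)
A_true, sigma2, N, trials = 1.0, 1.0, 500, 20000

# Monte Carlo: the MLE (sample mean) of the DC level in each trial
x = A_true + rng.normal(scale=np.sqrt(sigma2), size=(trials, N))
A_ml = x.mean(axis=1)

print(f"empirical mean:     {A_ml.mean():.4f}  (theta = {A_true})")
print(f"empirical variance: {A_ml.var():.6f}  (CRLB = {sigma2 / N:.6f})")
# A histogram of A_ml would be close to a Gaussian centred at A_true.
```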

For some problems, we are interested in estimating a transformed parameter $\psi(\theta)$, i.e., a parameter that depends on $\theta$. If $\psi$ is a one-to-one mapping, then the maximum likelihood estimate of $\psi$ is \begin{equation} \hat\psi_{\text{ML}} = \psi\left(\hat\theta_{\text{ML}}\right), \end{equation} which is known as the invariance property of the maximum likelihood estimator.
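A short illustration of the invariance property, assuming we are interested in the hypothetical transformed parameter $\psi(A) = e^{A}$ of the DC-level model: the estimate of $\psi$ is obtained simply by applying the transformation to $\hat{A}_{\text{ML}}$.

```python
import numpy as np

rng = np.random.default_rng(4)
A_true, sigma2, N = 0.5, 0.4, 1000
x = A_true + rng.normal(scale=np.sqrt(sigma2), size=N)

# MLE of A, then MLE of psi = exp(A) via the invariance property
# (exp is one-to-one, so the property applies directly).
A_ml = x.mean()
psi_ml = np.exp(A_ml)
print(f"psi(A_true) = {np.exp(A_true):.3f}, psi_ML = {psi_ml:.3f}")
```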

Maximum Likelihood Estimator for Vector Parameter

The concept of the MLE can also be applied to estimating multiple parameters. If the maximum is at an interior point and the partial derivatives with respect to all parameters exist, then a necessary condition for the maximum is \begin{equation} \frac{\partial}{\partial\boldsymbol\theta} \ell(\mathbf{x};\boldsymbol\theta) = \mathbf{0}. \end{equation} Thus, the maximum can be found by equating the gradient of the log-likelihood function with zero.

Properties of the Maximum Likelihood Estimator for Vector Parameter

The MLE for a vector parameter possesses the same asymptotic properties as the MLE for a scalar parameter. We summarize them as follows: for $N\rightarrow \infty$, the maximum likelihood estimate is distributed as \begin{equation} \hat{\boldsymbol\theta}_{\text{ML}} \sim \mathcal{N}\left(\boldsymbol\theta,\mathbf{I}^{-1}(\boldsymbol\theta)\right), \end{equation} where $\mathbf{I}(\boldsymbol\theta)$ is the Fisher information matrix evaluated at the true value of the unknown parameter.


Example:

Consider the example of estimating the DC level and the noise variance presented in the module on the CRLB. There we already derived the expressions needed to evaluate the gradient, namely \begin{align} \frac{\partial}{\partial A} \ell(\mathbf{x};\boldsymbol\theta) &= \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x[n]-A),\\ \frac{\partial}{\partial \sigma^2} \ell(\mathbf{x};\boldsymbol\theta) &= -\frac{N}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{n=0}^{N-1}(x[n]-A)^2. \label{eq:variance} \end{align}

We already obtained the maximum likelihood estimate for the DC level, which is the sample mean \begin{equation} \hat{A} = \bar{x} = \frac{1}{N}\sum_{n=0}^{N-1}x[n]. \label{eq:samplemean} \end{equation} Equating \eqref{eq:variance} to zero and solving for $\sigma^2$ with $A$ replaced by $\bar{x}$ yields \begin{equation} \hat{\sigma^2} = \frac{1}{N}\sum_{n=0}^{N-1}(x[n]-\bar{x})^2. \end{equation}
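The joint estimate can be checked numerically; the sketch below computes both closed-form estimates from simulated data and verifies that the gradient of the log-likelihood is numerically zero there, in line with the vector likelihood equation. The simulation parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
A_true, sigma2_true, N = 1.0, 2.0, 1000
x = A_true + rng.normal(scale=np.sqrt(sigma2_true), size=N)

# Closed-form joint MLE of the DC level and the noise variance
A_hat = x.mean()
sigma2_hat = np.mean((x - A_hat) ** 2)

# Gradient of the log-likelihood evaluated at the estimates (should be ~0)
dA = np.sum(x - A_hat) / sigma2_hat
dsigma2 = -N / (2 * sigma2_hat) + np.sum((x - A_hat) ** 2) / (2 * sigma2_hat ** 2)

print(f"A_hat = {A_hat:.4f}, sigma2_hat = {sigma2_hat:.4f}")
print(f"gradient at the MLE: ({dA:.2e}, {dsigma2:.2e})")
```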

The sample mean $\bar{x}$ is a scaled sum of IID Gaussian random variables and is therefore again Gaussian, with mean $A$ and variance $\sigma^2/N$, i.e., \begin{equation} \bar{x} \sim \mathcal{N}\left(A,\frac{\sigma^2}{N}\right). \end{equation}

On the other hand, invoking the central limit theorem, it can be shown that \begin{equation} \hat{\sigma^2} \sim \mathcal{N}\left(\frac{N-1}{N}\sigma^2, \frac{2(N-1)}{N^2}\sigma^4\right), \end{equation} which, for large $N$, can be approximated by \begin{equation} \hat{\sigma^2} \sim \mathcal{N}\left(\sigma^2, \frac{2}{N}\sigma^4\right). \end{equation} Thus, both estimates are asymptotically unbiased, and their variances approach the CRLB.
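These asymptotic distributions can again be checked by simulation: over many trials, the empirical mean and variance of $\hat{\sigma^2}$ should be close to $\frac{N-1}{N}\sigma^2$ and $\frac{2(N-1)}{N^2}\sigma^4$, respectively. All numerical settings below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
A_true, sigma2, N, trials = 1.0, 2.0, 200, 50000

# Monte Carlo: variance estimate around the sample mean in each trial
x = A_true + rng.normal(scale=np.sqrt(sigma2), size=(trials, N))
x_bar = x.mean(axis=1, keepdims=True)
sigma2_hat = np.mean((x - x_bar) ** 2, axis=1)

mean_theory = (N - 1) / N * sigma2
var_theory = 2 * (N - 1) / N**2 * sigma2**2
print(f"empirical mean: {sigma2_hat.mean():.4f}  (theory: {mean_theory:.4f})")
print(f"empirical var:  {sigma2_hat.var():.5f}  (theory: {var_theory:.5f})")
```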