Bayesian Estimator

Introduction

The Bayesian estimator takes its name from Bayes’ theorem, which relates the conditional probabilities of two events: \begin{equation} P[B|A]=\frac{P[A|B]P[B]}{P[A]}. \end{equation} Bayes’ theorem originates from the question: “How is a belief updated after a new observation?”

The answer to this question is not only Bayes’ theorem, but also the Bayesian approach to probability. In this module, two estimation methods are covered: minimum mean squared error (MMSE) and maximum a posteriori (MAP). Both methods are described in detail in the following sections. Before that, the fundamental difference between Bayesian estimators and the other estimators covered in this part of the course is explained below.

Deterministic vs. Random Parameters

In the estimation methods presented so far, the parameter $\theta$ was considered deterministic but unknown. These approaches are referred to as classical estimation. In Bayesian estimation, on the other hand, $\theta$ is considered a random variable with a probability distribution. For example, consider the observations $\mathbf{x}$, which are assumed to be generated by a signal model with additive noise: \begin{equation} x[n]=s[n,\theta]+w[n]. \end{equation}
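As a concrete illustration, the short sketch below generates observations from this additive-noise model. It is a hypothetical example, not part of the original development: the signal shape $s[n,\theta]=\theta$ (a constant DC level), the sample size, and the noise level are all assumptions made for the example.

```python
# Hypothetical example: x[n] = s[n, theta] + w[n] with s[n, theta] = theta
# (a constant DC level) and white Gaussian noise w[n].
import numpy as np

rng = np.random.default_rng(0)

N = 50            # number of samples (assumed)
theta = 1.5       # parameter value, fixed but unknown in the classical view
sigma_w = 0.8     # noise standard deviation (assumed)

s = theta * np.ones(N)                 # s[n, theta]
w = rng.normal(0.0, sigma_w, size=N)   # w[n]
x = s + w                              # observed data x[n]
```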

When the parameter $\theta$ is a deterministic but unknown value, we consider the PDF of observing the samples $\mathbf{x}$ for a specific value of $\theta$, which is formally expressed as $p(\mathbf{x};\theta)$. When the value of $\theta$ changes, so does the PDF $p(\mathbf{x};\theta)$.

When the parameter $\theta$ is a random variable itself, we consider the conditional PDF of observing the samples $\mathbf{x}$ given the value of $\theta$, which is formally expressed as $p(\mathbf{x}|\theta)$. Using Bayes’ theorem, we can start with an initial PDF $p(\theta)$ and update it by calculating $p(\theta|\mathbf{x})$.

Bayesian Estimation Terminology

Rewriting Bayes’ theorem with the notation adopted in the other estimation modules, \begin{equation} p(\theta|\mathbf{x})=\frac{p(\mathbf{x}|\theta)p(\theta)}{p(\mathbf{x})}, \end{equation}

  • The prior $p(\theta)$ is the PDF of the parameters controlling the model that generates the data, before any observation is made.
  • The likelihood $p(\mathbf{x}|\theta)$ is the conditional PDF of the observed data $\mathbf{x}$ given the parameter.
  • The posterior $p(\theta|\mathbf{x})$ is the updated PDF of the parameters after the observation is made.
  • The evidence $p(\mathbf{x})$ is the marginal PDF of the observed data $\mathbf{x}$, which normalizes the posterior.
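The sketch below is a minimal numerical illustration of these terms under an assumed Gaussian example (not taken from the text): the prior is placed on a grid of candidate $\theta$ values, the likelihood is evaluated from the data, the evidence is obtained as the normalizing integral, and the posterior follows from Bayes’ theorem.

```python
# Assumed example: x[n] = theta + w[n], Gaussian noise, Gaussian prior on theta.
import numpy as np

rng = np.random.default_rng(1)
sigma_w = 1.0                            # noise std, assumed known
x = rng.normal(0.7, sigma_w, size=20)    # observations generated with theta = 0.7

theta_grid = np.linspace(-3.0, 3.0, 601) # discretized parameter values
d_theta = theta_grid[1] - theta_grid[0]

# Prior p(theta): zero-mean, unit-variance Gaussian (an assumption of the example).
prior = np.exp(-0.5 * theta_grid**2) / np.sqrt(2.0 * np.pi)

# Likelihood p(x | theta) evaluated on the grid (up to a constant factor).
log_lik = np.array([-0.5 * np.sum((x - t) ** 2) / sigma_w**2 for t in theta_grid])
likelihood = np.exp(log_lik - log_lik.max())

# Evidence p(x): the normalizing integral over theta (the constant factor cancels).
evidence = np.sum(likelihood * prior) * d_theta

# Posterior p(theta | x) from Bayes' theorem; it integrates to 1 over the grid.
posterior = likelihood * prior / evidence
```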

The Maximum A Posteriori (MAP) Estimation

In the Bayesian setting, the parameter $\theta$ to be estimated has the a priori PDF $p(\theta)$, which is updated based on the data $\mathbf{x}$ to obtain the a posteriori PDF $p(\theta|\mathbf{x})$. An intuitive approach is to choose the argument $\hat\theta$ that maximizes the a posteriori PDF: \begin{equation} \hat\theta=\underset{\theta}{\operatorname{arg max}}p(\theta|\mathbf{x}). \end{equation} Bayes’ theorem is utilized to rewrite this expression as \begin{equation} \hat\theta=\underset{\theta}{\operatorname{arg max}}\frac{p(\mathbf{x}|\theta)p(\theta)}{p(\mathbf{x})}, \end{equation} which is further simplified by the fact that $p(\mathbf{x})$ does not depend on $\theta$: \begin{equation} \hat\theta=\underset{\theta}{\operatorname{arg max}}p(\mathbf{x}|\theta)p(\theta). \end{equation} When the samples are conditionally independent given $\theta$, the likelihood $p(\mathbf{x}|\theta)$ can be written as the product of the PDFs of the individual samples: \begin{equation} \hat\theta=\underset{\theta}{\operatorname{arg max}}\prod_n p(x[n]|\theta)p(\theta). \end{equation}

To maximize the a posteriori PDF, its first derivative with respect to $\theta$ is set equal to zero and solved for $\theta$. Since the logarithm is monotonically increasing, the logarithm of a function can be maximized instead of the function itself, which simplifies the expression to be maximized in this case: \begin{equation} \hat\theta=\underset{\theta}{\operatorname{arg max}}\ln\left[\prod_n p(x[n]|\theta)p(\theta)\right]=\underset{\theta}{\operatorname{arg max}} \left[\sum_n \ln p(x[n]|\theta) + \ln p(\theta)\right]. \end{equation}
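As an illustration, the sketch below (continuing the same assumed Gaussian example) evaluates $\sum_n \ln p(x[n]|\theta) + \ln p(\theta)$ on a grid and picks the maximizing $\theta$. For a Gaussian prior and Gaussian likelihood, this grid search agrees with the closed-form expression included at the end for comparison; the model and parameter values remain assumptions of the example.

```python
# MAP estimation sketch for the assumed model x[n] = theta + w[n],
# w[n] ~ N(0, sigma_w^2), with prior theta ~ N(0, sigma_theta^2).
import numpy as np

rng = np.random.default_rng(2)
sigma_w, sigma_theta = 1.0, 1.0
x = rng.normal(0.7, sigma_w, size=20)    # observations, true theta = 0.7

theta_grid = np.linspace(-3.0, 3.0, 601)
log_prior = -0.5 * theta_grid**2 / sigma_theta**2
log_lik = np.array([np.sum(-0.5 * (x - t) ** 2 / sigma_w**2) for t in theta_grid])

theta_map = theta_grid[np.argmax(log_lik + log_prior)]   # grid-search MAP estimate

# Closed form for this Gaussian/Gaussian case: a weighted combination of the
# prior mean (0 here) and the sample mean, for comparison with the grid search.
N = len(x)
theta_map_closed = (N / sigma_w**2) * x.mean() / (N / sigma_w**2 + 1.0 / sigma_theta**2)
```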

The Minimum Mean Square Error Estimation

The mean square error is a powerful concept that allows us to analyze the error made by an estimator. The squared estimation error is $(\hat\theta-\theta)^2$; the square is used because the magnitude of the deviation of $\hat\theta$ from $\theta$ is important rather than its sign. The mean squared error is also the starting point for describing the minimum variance unbiased estimator (MVUE). The classical mean squared error (MSE) accounts for the randomness of the estimate through \begin{equation} \mathrm{mse}\left(\hat\theta\right)=\mathbb{E}\left[\left(\hat\theta-\theta\right)^2\right]=\int\left(\hat\theta-\theta\right)^2p(\mathbf{x};\theta)d\mathbf{x}, \end{equation} where $\theta$ is considered a deterministic but unknown value.

The Bayesian mean squared error treats the parameter $\theta$ itself as a random variable and is defined through the joint PDF $p(\mathbf{x},\theta)$ of the observed data $\mathbf{x}$ and the parameter $\theta$: \begin{equation} \mathrm{bmse}\left(\hat\theta\right)=\int\int\left(\hat\theta-\theta\right)^2p(\mathbf{x},\theta)d\mathbf{x}d\theta. \end{equation}
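A Monte Carlo sketch of this definition is given below, again for an assumed Gaussian example: $\theta$ is drawn from its prior, the data are drawn given $\theta$, and the squared error of a chosen estimator is averaged over many draws, which approximates the double integral over the joint PDF $p(\mathbf{x},\theta)$. The estimator used here (the sample mean) is only an example.

```python
# Monte Carlo approximation of the Bayesian MSE for an example estimator
# (the sample mean), under the assumed model theta ~ N(0, sigma_theta^2),
# x[n] | theta ~ N(theta, sigma_w^2).
import numpy as np

rng = np.random.default_rng(3)
N, sigma_w, sigma_theta = 20, 1.0, 1.0
trials = 20000

sq_err = 0.0
for _ in range(trials):
    theta = rng.normal(0.0, sigma_theta)       # theta drawn from the prior
    x = rng.normal(theta, sigma_w, size=N)     # data drawn given theta
    theta_hat = x.mean()                       # estimator under study
    sq_err += (theta_hat - theta) ** 2

bmse = sq_err / trials   # approximates the double integral defining bmse(theta_hat)
```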

Just as the MVUE is motivated by the attempt to minimize the classical MSE for the estimate $\hat\theta$, the Bayesian MMSE estimator is obtained by minimizing the Bayesian MSE. The value of $\hat\theta$ that minimizes this expression is found by setting its first derivative with respect to $\hat\theta$ equal to zero and solving for $\hat\theta$. Towards this goal, the first step is to simplify the expression to be minimized. The joint PDF is factored as \begin{equation} p(\mathbf{x},\theta)=p(\theta|\mathbf{x})p(\mathbf{x}). \end{equation} This factorization allows the derivative of the Bayesian MSE to be written as \begin{equation} \frac{\partial}{\partial\hat\theta}\int\left[\int(\hat\theta-\theta)^2p(\theta|\mathbf{x})d\theta\right]p(\mathbf{x})d\mathbf{x}. \end{equation}

There are two reasons why this expression leads to a simpler solution. First, the integrals are grouped such that the outer integral is only over the data $\mathbf{x}$; the derivative with respect to $\hat\theta$ acts only on the inner integral. Second, the function $p(\mathbf{x})$ is a probability density function and therefore nonnegative, so the overall expression is minimized by minimizing the inner integral for every $\mathbf{x}$. We can therefore focus on the derivative of the inner integral and set it equal to zero.

As the integral and the derivative act on different variables, and assuming the functions are sufficiently smooth, the order of differentiation and integration can be exchanged: \begin{equation} \frac{\partial}{\partial\hat\theta}\int(\hat\theta-\theta)^2p(\theta|\mathbf{x})d\theta=2\int(\hat\theta-\theta)p(\theta|\mathbf{x})d\theta=0. \end{equation} Since the integral is linear, we get \begin{equation} \hat\theta\int p(\theta|\mathbf{x})d\theta=\int\theta p(\theta|\mathbf{x})d\theta. \end{equation}

The final simplification is due to the fact that the integral of the PDF $p(\theta|\mathbf{x})$ over the whole parameter space is equal to 1. The Bayesian MMSE estimator is therefore \begin{equation} \hat\theta_{\text{MMSE}}=\int\theta p(\theta|\mathbf{x})d\theta=\mathbb{E}\left[\theta|\mathbf{x}\right]. \end{equation}

In other words, the Bayesian MMSE estimate is the mean value of the a posteriori PDF. The a posteriori PDF $p(\theta|\mathbf{x})$ itself is found using Bayes’ theorem.
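The sketch below computes this estimator numerically for the same assumed Gaussian example used earlier: the posterior is evaluated on a grid, normalized, and its mean is taken as $\hat\theta_{\text{MMSE}}$. All model choices and parameter values are assumptions of the example.

```python
# Numerical MMSE estimate: posterior mean on a grid, for the assumed model
# x[n] = theta + w[n], w[n] ~ N(0, sigma_w^2), prior theta ~ N(0, sigma_theta^2).
import numpy as np

rng = np.random.default_rng(4)
sigma_w, sigma_theta = 1.0, 1.0
x = rng.normal(0.7, sigma_w, size=20)

theta_grid = np.linspace(-3.0, 3.0, 601)
d_theta = theta_grid[1] - theta_grid[0]

# Unnormalized log posterior: log prior + log likelihood on the grid.
log_post = (-0.5 * theta_grid**2 / sigma_theta**2
            + np.array([-0.5 * np.sum((x - t) ** 2) / sigma_w**2 for t in theta_grid]))
post = np.exp(log_post - log_post.max())
post /= post.sum() * d_theta                      # normalize: integrates to 1

theta_mmse = np.sum(theta_grid * post) * d_theta  # E[theta | x]
```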

Conclusion

Bayesian estimators are founded on a major paradigm shift relative to the other estimators: the parameter to be estimated is itself a random variable. We can consider the Bayesian approach as the formalization of our assumptions regarding the parameters to be estimated.

Bayesian estimators are powerful tools; however, constructing Bayesian estimators that allow closed-form or sequential solutions is possible only for a limited number of cases. For example, Gaussian distributions for the noise and the parameters are commonly assumed in many applications to allow a tractable implementation of Bayesian estimators. It should also be noted that the primary function of a Bayesian estimator is to update the probability density function of the parameters. There are different methods to extract a value for the estimate from the probability density function; we covered two such methods. The utility of each method depends on the underlying probability distributions of the parameters.