Random vectors

Introduction

Previously, we saw how the notion of uncertainty can be captured in probability functions and what happens when an outcome is conditioned on an event. Finally, a brief discussion was provided for the case when multiple random variables are involved; however, only the situation with 2 random variables was discussed. This module generalizes this theory to multiple random variables. In this case an outcome of an experiment comprises $N$ observed quantities. An example of such an observation is a noisy electroencephalography (EEG) measurement, which represents the electrical activity of the brain and is measured over several channels.



Screencast video



Multivariate joint probability distributions

Let us denote each of these measured quantities by the random variable $X_n$, where $n$ ranges from $1$ up to $N$. Using this definition, the multivariate (meaning that multiple variables are involved) joint probability functions can be introduced. For notation purposes, all random variables $X_n$ can be grouped in a random vector $\mathbf{X} = [X_1, X_2, \ldots, X_N]^\top$, where the operator $\top$ denotes the transpose, turning this row vector into a column vector. The bold capital letter distinguishes the random vector containing multiple random variables from a single random variable. Similarly, a specific realization of this random vector can be written in lower-case as $\mathbf{x} = [x_1, x_2, \ldots, x_N]^\top$.

Multivariate joint cumulative distribution function

The multivariate joint cumulative distribution function of the random vector $\mathbf{X}$ containing random variables $X_1, X_2, \ldots, X_N$ is defined as
$$P_\mathbf{X}(\mathbf{x}) = P_{X_1,\ldots,X_N}(x_1,\ldots,x_N) = \Pr[X_1 \leq x_1, \ldots, X_N \leq x_N]. \tag{1}$$
This definition holds for both discrete and continuous random variables.

Multivariate joint probability mass function

The multivariate joint probability mass function of the random vector $\mathbf{X}$ containing discrete random variables $X_1, X_2, \ldots, X_N$ is similarly defined as
$$p_\mathbf{X}(\mathbf{x}) = p_{X_1,\ldots,X_N}(x_1,\ldots,x_N) = \Pr[X_1 = x_1, \ldots, X_N = x_N]. \tag{2}$$

Multivariate joint probability density function

The multivariate joint probability density function of the random vector $\mathbf{X}$ containing continuous random variables $X_1, X_2, \ldots, X_N$ is defined from the multivariate joint cumulative distribution function as
$$p_\mathbf{X}(\mathbf{x}) = p_{X_1,\ldots,X_N}(x_1,\ldots,x_N) = \frac{\partial^N P_{X_1,\ldots,X_N}(x_1,\ldots,x_N)}{\partial x_1 \cdots \partial x_N}. \tag{3}$$

Generalized probability axioms for multivariate joint distributions

From these definitions several multivariate joint probability axioms can be determined, which are similar to the case of two random variables as discussed in the last reader.

  1. It holds that $p_\mathbf{X}(\mathbf{x}) \geq 0$, where $\mathbf{X}$ is a continuous or discrete random vector.
  2. From the multivariate joint probability density function it follows that $P_{X_1,\ldots,X_N}(x_1,\ldots,x_N) = \int_{-\infty}^{x_1}\cdots\int_{-\infty}^{x_N} p_{X_1,\ldots,X_N}(u_1,\ldots,u_N)\,\mathrm{d}u_1\cdots\mathrm{d}u_N$ holds for continuous random vectors.
  3. Through the law of total probability it holds that
    1. $\sum_{x_1\in S_{X_1}}\cdots\sum_{x_N\in S_{X_N}} p_{X_1,\ldots,X_N}(x_1,\ldots,x_N) = 1$ for discrete random vectors and
    2. $\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} p_{X_1,\ldots,X_N}(x_1,\ldots,x_N)\,\mathrm{d}x_1\cdots\mathrm{d}x_N = 1$ for continuous random vectors.
  4. The probability of an event $A$ can be determined as
    1. $\Pr[A] = \sum_{\mathbf{x}\in A} p_{X_1,\ldots,X_N}(x_1,\ldots,x_N)$ for discrete random variables and
    2. $\Pr[A] = \int\cdots\int_A p_{X_1,\ldots,X_N}(x_1,\ldots,x_N)\,\mathrm{d}x_1\cdots\mathrm{d}x_N$ for continuous random variables.

Axiom 1 simply states that a probability (density) cannot be smaller than 0, since negative probabilities do not exist by definition. The second axiom is a direct consequence of integrating both sides of the definition of the multivariate joint probability density function, allowing us to determine the multivariate joint cumulative distribution function from the multivariate joint probability density function. The third axiom is a direct consequence of the law of total probability, where the probabilities of all possible outcomes together equal 1. The final axiom tells us to sum or integrate over all possible outcomes belonging to an event $A$ in order to calculate its probability.
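To make these axioms concrete, the short sketch below checks them numerically for a small, hypothetical joint PMF of two discrete random variables; the array `p` and the event `A` are made up purely for illustration.

```python
import numpy as np

# Hypothetical joint PMF of two discrete random variables X1 in {0,1,2} and X2 in {0,1},
# stored as a 2-D array with p[i, j] = Pr[X1 = i, X2 = j].
p = np.array([[0.10, 0.05],
              [0.30, 0.15],
              [0.25, 0.15]])

# Axiom 1: all probabilities are non-negative.
assert np.all(p >= 0)

# Axiom 3 (discrete case): summing over all outcomes yields 1.
assert np.isclose(p.sum(), 1.0)

# Axiom 4 (discrete case): Pr[A] for the event A = {X1 + X2 <= 1}
# is the sum of the PMF over all outcomes contained in A.
x1, x2 = np.meshgrid(np.arange(3), np.arange(2), indexing="ij")
prob_A = p[(x1 + x2) <= 1].sum()
print(prob_A)  # 0.10 + 0.05 + 0.30 = 0.45
```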



Probability distributions of multiple random vectors

The notation of a random vector allows us to easily include multiple random variables in a single vector. Suppose now that our random vector $\mathbf{Z}$ contains 2 different types of random variables, where for example each random variable corresponds to a different type of measurement. If we were to distinguish between these types of random variables using two generalized random variables $X_i$ and $Y_i$, the random vector $\mathbf{Z}$ could be written as $\mathbf{Z} = [X_1, X_2, \ldots, X_N, Y_1, Y_2, \ldots, Y_M]^\top$. If we now were to define the random vectors $\mathbf{X} = [X_1, X_2, \ldots, X_N]^\top$ and $\mathbf{Y} = [Y_1, Y_2, \ldots, Y_M]^\top$, it becomes evident that we could simplify the random vector $\mathbf{Z}$ as $\mathbf{Z} = [\mathbf{X}^\top, \mathbf{Y}^\top]^\top$.

This shows that it is also possible for joint probability distributions to depend on multiple random vectors, each of which can be regarded as a subset of all random variables. This notation can prove useful when there is a clear distinction between the subsets, but it is purely a matter of notation. A probability distribution depending on multiple random vectors can be regarded in all aspects as a probability distribution depending on a single random vector (which is the concatenation of all random variables), and all calculations can be performed accordingly. A probability distribution involving multiple random vectors can for example be written as $p_{\mathbf{X},\mathbf{Y}}(\mathbf{x},\mathbf{y})$.



Conditional probabilities

Similarly to the previous reader, the conditional probability can be determined by normalizing the joint probability by the probability of the conditioning event through

$$p_{\mathbf{X}|B}(\mathbf{x}) = \begin{cases} \dfrac{p_\mathbf{X}(\mathbf{x})}{\Pr[B]}, & \text{when } \mathbf{x} \in B, \\ 0, & \text{otherwise.} \end{cases} \tag{4}$$
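As an illustration, the sketch below conditions a made-up discrete joint PMF on an event $B$ by keeping only the outcomes inside $B$ and renormalizing by $\Pr[B]$; the numbers are hypothetical.

```python
import numpy as np

# Hypothetical joint PMF p[i, j] = Pr[X1 = i, X2 = j] (same made-up example as before).
p = np.array([[0.10, 0.05],
              [0.30, 0.15],
              [0.25, 0.15]])

# Event B = {X1 >= 1}: a Boolean mask over the outcomes.
x1, x2 = np.meshgrid(np.arange(3), np.arange(2), indexing="ij")
B = x1 >= 1

prob_B = p[B].sum()                    # Pr[B] = 0.85
p_cond = np.where(B, p / prob_B, 0.0)  # p_{X|B}(x): renormalize inside B, zero outside

assert np.isclose(p_cond.sum(), 1.0)   # the conditional PMF is again a valid PMF
```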



Marginal probabilities

Because the notation of a random vector $\mathbf{X}$ is just a shorter notation for the set of random variables $X_1, X_2, \ldots, X_N$, it is possible to calculate the marginalized probability distribution of a subset of random variables. This subset can also consist of just a single random variable. Again this operation is performed through marginalization as discussed in the previous reader. For the case that we are given the probability distribution $p_\mathbf{X}(\mathbf{x})$ and we would like to know the marginalized probability distribution $p_{X_2,X_3}(x_2,x_3)$, this can be calculated as
$$p_{X_2,X_3}(x_2,x_3) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} p_\mathbf{X}(\mathbf{x})\,\mathrm{d}x_1\,\mathrm{d}x_4\cdots\mathrm{d}x_N \tag{5}$$
for continuous random variables and as
$$p_{X_2,X_3}(x_2,x_3) = \sum_{x_1\in S_{X_1}}\sum_{x_4\in S_{X_4}}\cdots\sum_{x_N\in S_{X_N}} p_\mathbf{X}(\mathbf{x}) \tag{6}$$
for discrete random variables. Here we have integrated or summed over all possible values of all random variables except for the ones that we are interested in.
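For discrete random vectors stored as a multidimensional array, marginalization amounts to summing over the axes of the random variables that are removed. The sketch below illustrates this for a randomly generated, hypothetical four-dimensional PMF.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint PMF of four discrete random variables X1, X2, X3, X4,
# stored as a 4-D array indexed as p[x1, x2, x3, x4].
p = rng.random((2, 3, 4, 2))
p /= p.sum()                  # normalize so that the PMF sums to 1

# Marginalize out X1 and X4 (axes 0 and 3) to obtain p_{X2,X3}(x2, x3).
p_x2x3 = p.sum(axis=(0, 3))

assert p_x2x3.shape == (3, 4)
assert np.isclose(p_x2x3.sum(), 1.0)   # a marginal PMF is again a valid PMF
```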



Independence

Independence is a term in probability theory which reflects that the probability of an event $A$ is not changed after observing an event $B$, meaning that $\Pr[A|B] = \Pr[A]$. In other words, the occurrence of an event $B$ has no influence on the probability of an event $A$. Keep in mind that this does not mean that the physical occurrences of events $A$ and $B$ are unrelated; it just means that the probability of the occurrence of event $A$ is unrelated to whether event $B$ occurs or not.

Independent random variables

This notion of independence can be extended to probability functions. The random variables $X_1, X_2, \ldots, X_N$ can be regarded as independent if and only if the following factorization holds:
$$p_{X_1,X_2,\ldots,X_N}(x_1,x_2,\ldots,x_N) = p_{X_1}(x_1)\, p_{X_2}(x_2)\cdots p_{X_N}(x_N). \tag{7}$$
This equation states that the joint probability can be written as a multiplication of the individual probabilities of the random variables. From a probability point of view (not a physical one) we can conclude that the random variables are independent, because the joint probability solely depends on the individual contributions of the random variables. Random variables that satisfy the independence equation and are distributed according to the same probability density function are regarded as independent and identically distributed (IID or i.i.d.) random variables. This notion will become important later on when discussing random signals.
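The sketch below illustrates the factorization numerically: the joint PMF of two independent random variables is the outer product of their marginals, and the empirical pair frequencies of IID samples approach this product. The marginal PMF `p_marginal` is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical marginal PMF on {0, 1, 2}, used for every X_n (illustration only).
p_marginal = np.array([0.2, 0.5, 0.3])

# For independent random variables, the joint PMF is the outer product of the
# marginals: p_{X1,X2}(x1, x2) = p_{X1}(x1) * p_{X2}(x2).
p_joint_indep = np.outer(p_marginal, p_marginal)
assert np.isclose(p_joint_indep.sum(), 1.0)

# IID samples: every column is drawn independently from the same marginal.
N, num_samples = 4, 100_000
samples = rng.choice(3, size=(num_samples, N), p=p_marginal)

# Empirical check of the factorization for (X1, X2): the relative frequency of
# each pair should be close to the product of the marginals.
counts = np.zeros((3, 3))
np.add.at(counts, (samples[:, 0], samples[:, 1]), 1)
print(np.max(np.abs(counts / num_samples - p_joint_indep)))  # small, e.g. < 0.01
```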

Independent random vectors

It is also possible to extend the definition of independence to random vectors. Two random vectors $\mathbf{X}$ and $\mathbf{Y}$ can be regarded as independent if and only if the probability function can be written as
$$p_{\mathbf{X},\mathbf{Y}}(\mathbf{x},\mathbf{y}) = p_\mathbf{X}(\mathbf{x})\, p_\mathbf{Y}(\mathbf{y}). \tag{8}$$



Statistical characterization of random vectors

In the previous reader on random variables, several characteristics were discussed for probability distributions depending on a single random variable, such as the mean, the variance and the moments of a random variable. This section will extend these characterizations to random vectors.

Expected value

The expected value of a random vector $\mathbf{X}$ is defined as the vector containing the expected values of the individual random variables $X_1, X_2, \ldots, X_N$ as
$$\mathrm{E}[\mathbf{X}] = \boldsymbol{\mu}_\mathbf{X} = [\mu_1, \mu_2, \ldots, \mu_N]^\top = \big[\mathrm{E}[X_1], \mathrm{E}[X_2], \ldots, \mathrm{E}[X_N]\big]^\top. \tag{9}$$

Expected value of a function

When we are interested in the expected value of a certain function $g(\mathbf{X})$, which accepts a random vector as argument and transforms it into a single value, this can be determined by multiplying the function's result with its corresponding probability and summing or integrating over all possible realizations of $\mathbf{X}$. For a discrete random vector $\mathbf{X}$ consisting of random variables $X_1, X_2, \ldots, X_N$ the expected value of a function $g(\mathbf{X})$ can be determined as
$$\mathrm{E}[g(\mathbf{X})] = \sum_{x_1\in S_{X_1}}\cdots\sum_{x_N\in S_{X_N}} g(\mathbf{x})\, p_\mathbf{X}(\mathbf{x}) \tag{10}$$
and for a continuous random vector as
$$\mathrm{E}[g(\mathbf{X})] = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(\mathbf{x})\, p_\mathbf{X}(\mathbf{x})\,\mathrm{d}x_1\cdots\mathrm{d}x_N. \tag{11}$$
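The sketch below evaluates equation (10) exactly for a small, hypothetical joint PMF and compares it with a Monte Carlo estimate obtained by averaging $g$ over samples drawn from that PMF; the PMF and the choice $g(\mathbf{X}) = (X_1 - X_2)^2$ are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical joint PMF of (X1, X2) on {0,1,2} x {0,1} (illustration only).
p = np.array([[0.10, 0.05],
              [0.30, 0.15],
              [0.25, 0.15]])
x1, x2 = np.meshgrid(np.arange(3), np.arange(2), indexing="ij")

# Exact expectation of g(X) = (X1 - X2)^2 via equation (10): sum g(x) p(x) over all outcomes.
g = (x1 - x2) ** 2
exact = np.sum(g * p)

# Monte Carlo estimate: draw outcomes according to p and average g over the samples.
flat_idx = rng.choice(p.size, size=200_000, p=p.ravel())
s1, s2 = np.unravel_index(flat_idx, p.shape)
estimate = np.mean((s1 - s2) ** 2)

print(exact, estimate)  # the two values should be close
```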

Covariance

Previously, we introduced the second central moment of a univariate random variable as the variance. While the variance denotes the spread in the univariate case, it does not tell the whole story in the multivariate case. Have a look at Fig. 1, where two contour plots are shown for two distinct multivariate Gaussian probability density functions. The exact mathematical description of a multivariate Gaussian probability density function is introduced here.

Figure 1: Visualisation of the concept of covariance. Two different multivariate Gaussian probability density functions are shown, from each of which 10000 random two-dimensional samples are generated. The individual distributions of the sample elements $x_1$ and $x_2$ are approximated using a histogram and a fitted Gaussian distribution. The variances of both $x_1$ and $x_2$ are identical; however, the underlying multivariate distributions are definitely not identical. This phenomenon is caused by the covariance between $x_1$ and $x_2$.

It can be noted that the distributions in Fig. 1 are different, since the first contour plot shows clear circles and the second contour plot shows tilted ellipses. For both distributions, 10000 random realizations $\mathbf{x} = [x_1, x_2]^\top$ are generated and the marginal distributions of both $X_1$ and $X_2$ are shown using a histogram and a fitted Gaussian distribution. It can be seen that the individual distributions of $X_1$ and $X_2$ are exactly equal for both multivariate distributions, but still the multivariate distributions are different. This difference can be explained by the covariance between the random variables $X_1$ and $X_2$ of the random vector $\mathbf{X}$.

The covariance is a measure of the relationship between 2 random variables. The second distribution in Fig. 1 has a negative covariance between $X_1$ and $X_2$, because if $X_1$ increases, $X_2$ decreases. No such thing can be said about the first distribution in Fig. 1, where $X_1$ and $X_2$ seem to have no relationship and behave independently of each other.

The formal definition of the covariance between two random variables $X_1$ and $X_2$ is given by
$$\mathrm{Cov}[X_1, X_2] = \mathrm{E}\big[(X_1 - \mu_{X_1})(X_2 - \mu_{X_2})\big], \tag{12}$$
which is very similar to the definition of the variance, even to such an extent that it actually reduces to the variance if $X_1 = X_2$. Intuitively, one might regard the covariance as the expected value of the multiplication of $X_1$ and $X_2$ after their means have been subtracted. If the centered $X_1$ and $X_2$ have the same sign, their multiplication is positive, and if they have different signs, their multiplication is negative. The covariance may therefore be regarded as a measure that indicates how $X_2$ behaves if $X_1$ increases or decreases.
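The sketch below estimates the covariance of equation (12) from samples of a hypothetical bivariate Gaussian with a negative covariance, similar in spirit to the second distribution in Fig. 1, and compares it with numpy's built-in estimator.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical bivariate Gaussian with a negative covariance (illustration only).
mean = np.array([0.0, 0.0])
cov_true = np.array([[1.0, -0.7],
                     [-0.7, 1.0]])
samples = rng.multivariate_normal(mean, cov_true, size=100_000)

# Sample covariance following equation (12): average of the product of the
# mean-subtracted (centered) variables.
x1, x2 = samples[:, 0], samples[:, 1]
cov_hat = np.mean((x1 - x1.mean()) * (x2 - x2.mean()))

print(cov_hat)               # close to -0.7
print(np.cov(x1, x2)[0, 1])  # numpy's (unbiased) estimate agrees closely
```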



Exercise


Recalling the exercise from the previous section, let random variables $X$ and $Y$ have joint PDF
$$f_{X,Y}(x,y) = \begin{cases} 5x^2/2, & -1 \leq x \leq 1,\ 0 \leq y \leq x^2, \\ 0, & \text{otherwise.} \end{cases}$$
In the previous section, we found that:
  • $\mathrm{E}[X] = 0$ and $\mathrm{Var}[X] = 10/14$,
  • $\mathrm{E}[Y] = 5/14$ and $\mathrm{Var}[Y] = 5/27 - (5/14)^2 = 0.0576$.
With this knowledge, can you compute $\mathrm{Var}[X+Y]$?

The variance of $X+Y$ is$^1$
$$\mathrm{Var}[X+Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{E}[(X-\mu_X)(Y-\mu_Y)].$$
Since $\mathrm{E}[X] = 0$, the cross term equals $\mathrm{E}[XY] = \int_{-1}^{1}\int_{0}^{x^2} xy\,\frac{5x^2}{2}\,\mathrm{d}y\,\mathrm{d}x = \int_{-1}^{1}\frac{5x^7}{4}\,\mathrm{d}x = 0$, as the integrand is an odd function of $x$. Therefore
$$\mathrm{Var}[X+Y] = 5/7 + 0.0576 = 0.7719.$$

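Assuming the joint PDF as stated above, the result can also be sanity-checked numerically. The sketch below samples $X$ from its marginal PDF $f_X(x) = 5x^4/2$ by inverse-transform sampling and, given $X = x$, draws $Y$ uniformly on $[0, x^2]$ (the joint PDF is constant in $y$ on that interval).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# Marginal PDF of X: f_X(x) = 5x^4 / 2 on [-1, 1], with CDF F_X(x) = (x^5 + 1) / 2.
# Inverse-transform sampling: X = (2U - 1)^(1/5) for U uniform on [0, 1].
u = rng.uniform(0.0, 1.0, n)
v = 2.0 * u - 1.0
x = np.sign(v) * np.abs(v) ** (1 / 5)

# Given X = x, Y is uniform on [0, x^2].
y = rng.uniform(0.0, 1.0, n) * x ** 2

print(np.var(x + y))  # close to 5/7 + 0.0576 = 0.7719
```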

Correlation

The definition of the covariance can be rewritten as
$$\begin{aligned} \mathrm{Cov}[X_1,X_2] &= \mathrm{E}\big[(X_1-\mu_{X_1})(X_2-\mu_{X_2})\big], \\ &= \mathrm{E}\big[X_1 X_2 - \mu_{X_1} X_2 - \mu_{X_2} X_1 + \mu_{X_1}\mu_{X_2}\big], \\ &= \mathrm{E}[X_1X_2] - \mu_{X_1}\mathrm{E}[X_2] - \mu_{X_2}\mathrm{E}[X_1] + \mu_{X_1}\mu_{X_2}, \\ &= \mathrm{E}[X_1X_2] - \mu_{X_1}\mu_{X_2} - \mu_{X_1}\mu_{X_2} + \mu_{X_1}\mu_{X_2}, \\ &= \mathrm{E}[X_1X_2] - \mu_{X_1}\mu_{X_2}. \end{aligned} \tag{13}$$
The term $\mathrm{E}[X_1X_2]$ is called the correlation $r_{X_1,X_2}$ of $X_1$ and $X_2$ and is defined as
$$r_{X_1,X_2} = \mathrm{E}[X_1X_2]. \tag{14}$$
This correlation can be regarded as a non-centralized version of the covariance. These two terms are related through
$$\mathrm{Cov}[X_1,X_2] = r_{X_1,X_2} - \mu_{X_1}\mu_{X_2}. \tag{15}$$
It can be noted that the correlation and covariance of two random variables are equal if the mean values of both random variables are $0$.

Uncorrelated random variables

Two random variables are called uncorrelated if the covariance between both random variables equals $0$, i.e.
$$\mathrm{Cov}[X_1,X_2] = 0. \tag{16}$$
Although the term suggests that it is related to the correlation between two random variables, uncorrelatedness is defined through a zero covariance; the correlation itself has a different definition.

Orthogonality

Two random variables are called orthogonal if the correlation between both random variables equals $0$, i.e.
$$r_{X_1,X_2} = 0. \tag{17}$$

Correlation coefficient

The value of the covariance depends significantly on the variances of both random variables and is therefore unbounded. In order to express the relationship between two random variables without a dependence on their variances, the covariance first has to be normalized. Therefore the correlation coefficient is introduced as
$$\rho_{X_1,X_2} = \frac{\mathrm{Cov}[X_1,X_2]}{\sqrt{\mathrm{Var}[X_1]\,\mathrm{Var}[X_2]}} = \frac{\mathrm{Cov}[X_1,X_2]}{\sigma_{X_1}\sigma_{X_2}}. \tag{18}$$
Please note that this represents the normalized covariance and not the normalized correlation between two random variables, although the name suggests otherwise. Because of this normalization, the correlation coefficient is bounded between $-1$ and $1$ as
$$-1 \leq \rho_{X_1,X_2} \leq 1. \tag{19}$$
Fig. 2 shows realizations of three different probability distributions with negative, zero and positive correlation coefficients.

Figure 2: Visualisation of the concept of the correlation coefficient. Scatter plots of random realizations of random variables $X_1$ and $X_2$ with negative, zero and positive correlation coefficients.
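The sketch below mimics Fig. 2: for hypothetical bivariate Gaussians with negative, zero and positive correlation, the correlation coefficient is computed as in equation (18) and compared with numpy's `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical bivariate Gaussians with negative, zero and positive correlation,
# mirroring the three scatter plots of Fig. 2.
for rho in (-0.8, 0.0, 0.8):
    cov = np.array([[2.0, rho * np.sqrt(2.0 * 0.5)],
                    [rho * np.sqrt(2.0 * 0.5), 0.5]])
    x1, x2 = rng.multivariate_normal([0, 0], cov, size=100_000).T

    # Correlation coefficient following equation (18): covariance normalized
    # by the product of the standard deviations.
    rho_hat = np.cov(x1, x2)[0, 1] / (np.std(x1) * np.std(x2))
    print(rho, rho_hat, np.corrcoef(x1, x2)[0, 1])  # all three values agree closely
```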

Exercise


$X$ and $Y$ are identically distributed random variables with $\mathrm{E}[X] = \mathrm{E}[Y] = 0$, covariance $\mathrm{Cov}[X,Y] = 3$ and correlation coefficient $\rho_{X,Y} = 1/2$. For nonzero constants $a$ and $b$, $U = aX$ and $V = bY$.
  1. Find $\mathrm{Cov}[U,V]$.
  2. Find the correlation coefficient $\rho_{U,V}$.
  3. Let $W = U + V$. For what values of $a$ and $b$ are $X$ and $W$ uncorrelated?
  1. Since $X$ and $Y$ have zero expected value, $\mathrm{Cov}[X,Y] = \mathrm{E}[XY] = 3$, $\mathrm{E}[U] = a\,\mathrm{E}[X] = 0$ and $\mathrm{E}[V] = b\,\mathrm{E}[Y] = 0$. It follows that $$\mathrm{Cov}[U,V] = \mathrm{E}[UV] = \mathrm{E}[abXY] = ab\,\mathrm{E}[XY] = ab\,\mathrm{Cov}[X,Y] = 3ab.$$
  2. We start by observing that $\mathrm{Var}[U] = a^2\,\mathrm{Var}[X]$ and $\mathrm{Var}[V] = b^2\,\mathrm{Var}[Y]$. It follows that $$\rho_{U,V} = \frac{\mathrm{Cov}[U,V]}{\sqrt{\mathrm{Var}[U]\,\mathrm{Var}[V]}} = \frac{ab\,\mathrm{Cov}[X,Y]}{\sqrt{a^2\,\mathrm{Var}[X]\,b^2\,\mathrm{Var}[Y]}} = \frac{ab}{\sqrt{a^2b^2}}\,\rho_{X,Y} = \frac{1}{2}\frac{ab}{|ab|}.$$ Note that $ab/|ab|$ is $1$ if $a$ and $b$ have the same sign and $-1$ if they have opposite signs.
  3. Since $\mathrm{E}[X] = 0$, $$\mathrm{Cov}[X,W] = \mathrm{E}[XW] - \mathrm{E}[X]\,\mathrm{E}[W] = \mathrm{E}[XW] = \mathrm{E}[X(aX + bY)] = a\,\mathrm{E}[X^2] + b\,\mathrm{E}[XY] = a\,\mathrm{Var}[X] + b\,\mathrm{Cov}[X,Y].$$ Since $X$ and $Y$ are identically distributed, $\mathrm{Var}[X] = \mathrm{Var}[Y]$ and $$\frac{1}{2} = \rho_{X,Y} = \frac{\mathrm{Cov}[X,Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}} = \frac{\mathrm{Cov}[X,Y]}{\mathrm{Var}[X]} = \frac{3}{\mathrm{Var}[X]}.$$ This implies $\mathrm{Var}[X] = 6$, so that $\mathrm{Cov}[X,W] = 6a + 3b = 0$, or $b = -2a$.
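A quick numerical sanity check of this exercise is sketched below, assuming (purely for illustration) that $X$ and $Y$ are jointly Gaussian with the stated moments, i.e. $\mathrm{Var}[X] = \mathrm{Var}[Y] = 6$ and $\mathrm{Cov}[X,Y] = 3$.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical zero-mean Gaussian (X, Y) matching the exercise: Cov[X, Y] = 3 and
# rho = 1/2 imply Var[X] = Var[Y] = 6.
cov = np.array([[6.0, 3.0],
                [3.0, 6.0]])
x, y = rng.multivariate_normal([0, 0], cov, size=500_000).T

a, b = 1.5, -3.0           # b = -2a, so X and W should be uncorrelated
u, v = a * x, b * y
w = u + v

print(np.cov(u, v)[0, 1])  # close to 3ab = -13.5
print(np.cov(x, w)[0, 1])  # close to 0
```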

Cross-covariance matrix

We previously discussed how we could determine the covariance of two random variables. Let us now turn to the covariance of two random vectors $\mathbf{X} = [X_1, X_2, \ldots, X_N]^\top$ and $\mathbf{Y} = [Y_1, Y_2, \ldots, Y_N]^\top$. Intuitively, one might say that this covariance cannot be described by a single number, because there is more than one combination of random variables of which we want to calculate the covariance. As an example, we could determine the covariances of $X_1$ and $Y_1$, of $X_1$ and $Y_2$, and of $X_N$ and $Y_1$. In order to accommodate all these possible combinations, we introduce the cross-covariance matrix $\mathbf{\Gamma}_{\mathbf{XY}}$, which contains the covariances of all possible combinations of the random variables in the random vectors $\mathbf{X}$ and $\mathbf{Y}$.

The cross-covariance matrix is formally defined as
$$\mathbf{\Gamma}_{\mathbf{XY}} = \mathrm{E}\big[(\mathbf{X} - \boldsymbol{\mu}_\mathbf{X})(\mathbf{Y} - \boldsymbol{\mu}_\mathbf{Y})^\top\big] = \begin{bmatrix} \gamma_{11} & \gamma_{12} & \cdots & \gamma_{1N} \\ \gamma_{21} & \gamma_{22} & \cdots & \gamma_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_{N1} & \gamma_{N2} & \cdots & \gamma_{NN} \end{bmatrix}, \tag{20}$$
where the individual coefficients correspond to
$$\gamma_{nm} = \mathrm{Cov}[X_n, Y_m] = \mathrm{E}\big[(X_n - \mu_{X_n})(Y_m - \mu_{Y_m})\big]. \tag{21}$$
The transpose operator in the first equation creates a matrix from the two column vectors, filled with the covariances of all possible combinations of random variables. For each of these covariances, the correlation coefficient $\rho_{nm}$ can be calculated similarly using the definition of the correlation coefficient. Two random vectors are called uncorrelated if $\mathbf{\Gamma}_{\mathbf{XY}} = \mathbf{0}$.
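The sketch below estimates a cross-covariance matrix from samples. The construction of Y as a noisy linear mixture of X is a made-up example; for that particular construction the cross-covariance matrix should be approximately $\mathbf{M}^\top$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, N = 200_000, 3

# Hypothetical example: Y is a noisy linear mixture of X, so X and Y are correlated.
X = rng.normal(size=(n, N))
M = rng.normal(size=(N, N))
Y = X @ M.T + 0.1 * rng.normal(size=(n, N))

# Cross-covariance matrix following equation (20): E[(X - mu_X)(Y - mu_Y)^T],
# estimated here by averaging the outer products over the samples.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)
Gamma_XY = (Xc.T @ Yc) / n

print(Gamma_XY.shape)                   # (3, 3); entry (n, m) estimates Cov[X_n, Y_m]
print(np.abs(Gamma_XY - M.T).max())     # small: for this construction, Gamma_XY ≈ M^T
```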

Auto-covariance matrix of a random vector

For the special case that $\mathbf{X} = \mathbf{Y}$, the cross-covariance matrix is called the auto-covariance matrix, which contains the covariances between all random variables in $\mathbf{X}$. The definition is the same as the definition of the cross-covariance matrix, where $\mathbf{\Gamma}_{\mathbf{XX}}$ is often simplified to $\mathbf{\Gamma}_\mathbf{X}$.

Exercise


An $n$-dimensional Gaussian vector $\mathbf{W}$ has a block diagonal covariance matrix
$$\mathbf{C}_\mathbf{W} = \begin{bmatrix} \mathbf{C}_\mathbf{X} & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_\mathbf{Y} \end{bmatrix}, \tag{22}$$
where $\mathbf{C}_\mathbf{X}$ is $m \times m$ and $\mathbf{C}_\mathbf{Y}$ is $(n-m) \times (n-m)$. Show that $\mathbf{W}$ can be written in terms of component vectors $\mathbf{X}$ and $\mathbf{Y}$ in the form
$$\mathbf{W} = \begin{bmatrix} \mathbf{X} \\ \mathbf{Y} \end{bmatrix}, \tag{23}$$
such that $\mathbf{X}$ and $\mathbf{Y}$ are independent Gaussian random vectors.
As given in the problem statement, we define the $m$-dimensional vector $\mathbf{X}$, the $(n-m)$-dimensional vector $\mathbf{Y}$ and $\mathbf{W} = [\mathbf{X}^\top, \mathbf{Y}^\top]^\top$. Note that $\mathbf{W}$ has expected value
$$\boldsymbol{\mu}_\mathbf{W} = \mathrm{E}[\mathbf{W}] = \mathrm{E}\begin{bmatrix} \mathbf{X} \\ \mathbf{Y} \end{bmatrix} = \begin{bmatrix} \mathrm{E}[\mathbf{X}] \\ \mathrm{E}[\mathbf{Y}] \end{bmatrix} = \begin{bmatrix} \boldsymbol{\mu}_\mathbf{X} \\ \boldsymbol{\mu}_\mathbf{Y} \end{bmatrix}.$$
The covariance matrix of $\mathbf{W}$ is
$$\begin{aligned} \mathbf{C}_\mathbf{W} &= \mathrm{E}\big[(\mathbf{W}-\boldsymbol{\mu}_\mathbf{W})(\mathbf{W}-\boldsymbol{\mu}_\mathbf{W})^\top\big] \\ &= \mathrm{E}\left[\begin{bmatrix} \mathbf{X}-\boldsymbol{\mu}_\mathbf{X} \\ \mathbf{Y}-\boldsymbol{\mu}_\mathbf{Y} \end{bmatrix} \begin{bmatrix} (\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})^\top & (\mathbf{Y}-\boldsymbol{\mu}_\mathbf{Y})^\top \end{bmatrix}\right] \\ &= \begin{bmatrix} \mathrm{E}[(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})^\top] & \mathrm{E}[(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})(\mathbf{Y}-\boldsymbol{\mu}_\mathbf{Y})^\top] \\ \mathrm{E}[(\mathbf{Y}-\boldsymbol{\mu}_\mathbf{Y})(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})^\top] & \mathrm{E}[(\mathbf{Y}-\boldsymbol{\mu}_\mathbf{Y})(\mathbf{Y}-\boldsymbol{\mu}_\mathbf{Y})^\top] \end{bmatrix} = \begin{bmatrix} \mathbf{C}_\mathbf{X} & \mathbf{C}_{\mathbf{XY}} \\ \mathbf{C}_{\mathbf{YX}} & \mathbf{C}_\mathbf{Y} \end{bmatrix}. \end{aligned}$$
The assumption that $\mathbf{X}$ and $\mathbf{Y}$ are independent implies that
$$\mathbf{C}_{\mathbf{XY}} = \mathrm{E}\big[(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})(\mathbf{Y}-\boldsymbol{\mu}_\mathbf{Y})^\top\big] = \mathrm{E}\big[\mathbf{X}-\boldsymbol{\mu}_\mathbf{X}\big]\,\mathrm{E}\big[(\mathbf{Y}-\boldsymbol{\mu}_\mathbf{Y})^\top\big] = \mathbf{0}.$$
This also implies that $\mathbf{C}_{\mathbf{YX}} = \mathbf{C}_{\mathbf{XY}}^\top = \mathbf{0}$. Thus
$$\mathbf{C}_\mathbf{W} = \begin{bmatrix} \mathbf{C}_\mathbf{X} & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_\mathbf{Y} \end{bmatrix}.$$

Cross-correlation matrix

Similarly to the cross-covariance matrix, the cross-correlation matrix can be defined, containing the correlations of all combinations of the random variables in $\mathbf{X}$ and $\mathbf{Y}$. The cross-correlation matrix of random vectors $\mathbf{X}$ and $\mathbf{Y}$ is denoted by $\mathbf{R}_{\mathbf{XY}}$ and is defined as
$$\mathbf{R}_{\mathbf{XY}} = \mathrm{E}[\mathbf{X}\mathbf{Y}^\top] = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1N} \\ r_{21} & r_{22} & \cdots & r_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ r_{N1} & r_{N2} & \cdots & r_{NN} \end{bmatrix},$$
where the individual coefficients correspond to the individual correlations $r_{nm} = \mathrm{E}[X_n Y_m]$. Two random vectors are called orthogonal if $\mathbf{R}_{\mathbf{XY}} = \mathbf{0}$. Furthermore, it can be proven that the cross-covariance matrix and the cross-correlation matrix are related through
$$\mathbf{\Gamma}_{\mathbf{XY}} = \mathbf{R}_{\mathbf{XY}} - \boldsymbol{\mu}_\mathbf{X}\boldsymbol{\mu}_\mathbf{Y}^\top.$$

Auto-correlation matrix of a random vector

For the special case that $\mathbf{X} = \mathbf{Y}$, the cross-correlation matrix is called the auto-correlation matrix, which contains the correlations between all random variables in $\mathbf{X}$. The definition is the same as the definition of the cross-correlation matrix, where $\mathbf{R}_{\mathbf{XX}}$ is often simplified to $\mathbf{R}_\mathbf{X}$.

Linear transformations of random vectors

In the previous reader some calculation rules were determined for the mean and variance of a linearly transformed random variable. This subsection will continue this line of thought, but now for random vectors. We define an invertible transformation matrix $\mathbf{A}$, with dimensions $(N \times N)$, which linearly maps a random vector $\mathbf{X}$ of length $N$ to a random vector $\mathbf{Y}$, again of length $N$, after adding an equally long column vector $\mathbf{b}$, through
$$\mathbf{Y} = g(\mathbf{X}) = \mathbf{A}\mathbf{X} + \mathbf{b}.$$

Probability density function

From the initial multivariate probability density function of $\mathbf{X}$, $p_\mathbf{X}(\mathbf{x})$, the new probability density function of $\mathbf{Y}$ can be determined as
$$p_\mathbf{Y}(\mathbf{y}) = \frac{p_\mathbf{X}\big(g^{-1}(\mathbf{y})\big)}{|\det \mathbf{A}|} = \frac{p_\mathbf{X}\big(\mathbf{A}^{-1}(\mathbf{y}-\mathbf{b})\big)}{|\det \mathbf{A}|},$$
where $|\det \mathbf{A}|$ is the absolute value of the determinant of $\mathbf{A}$.

Mean vector

The new mean vector of the random vector $\mathbf{Y}$ can be determined as
$$\boldsymbol{\mu}_\mathbf{Y} = \mathrm{E}[\mathbf{Y}] = \mathrm{E}[\mathbf{A}\mathbf{X} + \mathbf{b}] = \mathbf{A}\,\mathrm{E}[\mathbf{X}] + \mathbf{b} = \mathbf{A}\boldsymbol{\mu}_\mathbf{X} + \mathbf{b}.$$

Cross-covariance and cross-correlation matrix

By the definition of the cross-covariance matrix, the cross-covariance matrices $\mathbf{\Gamma}_{\mathbf{XY}}$ and $\mathbf{\Gamma}_{\mathbf{YX}}$ can be determined from the original auto-covariance matrix $\mathbf{\Gamma}_\mathbf{X}$ of $\mathbf{X}$ through
$$\begin{aligned} \mathbf{\Gamma}_{\mathbf{XY}} &= \mathrm{E}\big[(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})(\mathbf{Y}-\boldsymbol{\mu}_\mathbf{Y})^\top\big], \\ &= \mathrm{E}\big[(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})\big(\mathbf{A}\mathbf{X}+\mathbf{b}-(\mathbf{A}\boldsymbol{\mu}_\mathbf{X}+\mathbf{b})\big)^\top\big], \\ &= \mathrm{E}\big[(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})\big(\mathbf{A}(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})\big)^\top\big], \\ &= \mathrm{E}\big[(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})^\top\mathbf{A}^\top\big], \\ &= \mathrm{E}\big[(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})^\top\big]\mathbf{A}^\top = \mathbf{\Gamma}_\mathbf{X}\mathbf{A}^\top \end{aligned}$$
and similarly we can find the result $\mathbf{\Gamma}_{\mathbf{YX}} = \mathbf{A}\mathbf{\Gamma}_\mathbf{X}$. The new cross-correlation matrices $\mathbf{R}_{\mathbf{XY}}$ and $\mathbf{R}_{\mathbf{YX}}$ can be determined as
$$\begin{aligned} \mathbf{R}_{\mathbf{XY}} &= \mathrm{E}[\mathbf{X}\mathbf{Y}^\top], \\ &= \mathrm{E}\big[\mathbf{X}(\mathbf{A}\mathbf{X}+\mathbf{b})^\top\big], \\ &= \mathrm{E}\big[\mathbf{X}(\mathbf{A}\mathbf{X})^\top + \mathbf{X}\mathbf{b}^\top\big], \\ &= \mathrm{E}\big[\mathbf{X}\mathbf{X}^\top\mathbf{A}^\top\big] + \mathrm{E}[\mathbf{X}]\mathbf{b}^\top, \\ &= \mathbf{R}_\mathbf{X}\mathbf{A}^\top + \boldsymbol{\mu}_\mathbf{X}\mathbf{b}^\top \end{aligned}$$
and similarly as $\mathbf{R}_{\mathbf{YX}} = \mathbf{A}\mathbf{R}_\mathbf{X} + \mathbf{b}\boldsymbol{\mu}_\mathbf{X}^\top$.

Auto-covariance and auto-correlation matrix

The auto-covariance matrix of $\mathbf{Y}$ can be determined through
$$\begin{aligned} \mathbf{\Gamma}_\mathbf{Y} &= \mathrm{E}\big[(\mathbf{Y}-\boldsymbol{\mu}_\mathbf{Y})(\mathbf{Y}-\boldsymbol{\mu}_\mathbf{Y})^\top\big], \\ &= \mathrm{E}\big[\big(\mathbf{A}\mathbf{X}+\mathbf{b}-(\mathbf{A}\boldsymbol{\mu}_\mathbf{X}+\mathbf{b})\big)\big(\mathbf{A}\mathbf{X}+\mathbf{b}-(\mathbf{A}\boldsymbol{\mu}_\mathbf{X}+\mathbf{b})\big)^\top\big], \\ &= \mathrm{E}\big[\big(\mathbf{A}(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})\big)\big(\mathbf{A}(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})\big)^\top\big], \\ &= \mathrm{E}\big[\mathbf{A}(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})^\top\mathbf{A}^\top\big], \\ &= \mathbf{A}\,\mathrm{E}\big[(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})(\mathbf{X}-\boldsymbol{\mu}_\mathbf{X})^\top\big]\mathbf{A}^\top = \mathbf{A}\mathbf{\Gamma}_\mathbf{X}\mathbf{A}^\top. \end{aligned}$$
In a similar fashion the new auto-correlation matrix of $\mathbf{Y}$ can be calculated as
$$\begin{aligned} \mathbf{R}_\mathbf{Y} &= \mathrm{E}[\mathbf{Y}\mathbf{Y}^\top], \\ &= \mathrm{E}\big[(\mathbf{A}\mathbf{X}+\mathbf{b})(\mathbf{A}\mathbf{X}+\mathbf{b})^\top\big], \\ &= \mathrm{E}\big[(\mathbf{A}\mathbf{X}+\mathbf{b})(\mathbf{X}^\top\mathbf{A}^\top+\mathbf{b}^\top)\big], \\ &= \mathrm{E}\big[\mathbf{A}\mathbf{X}\mathbf{X}^\top\mathbf{A}^\top + \mathbf{A}\mathbf{X}\mathbf{b}^\top + \mathbf{b}\mathbf{X}^\top\mathbf{A}^\top + \mathbf{b}\mathbf{b}^\top\big], \\ &= \mathbf{A}\,\mathrm{E}[\mathbf{X}\mathbf{X}^\top]\mathbf{A}^\top + \mathbf{A}\,\mathrm{E}[\mathbf{X}]\mathbf{b}^\top + \mathbf{b}\,\mathrm{E}[\mathbf{X}]^\top\mathbf{A}^\top + \mathbf{b}\mathbf{b}^\top, \\ &= \mathbf{A}\mathbf{R}_\mathbf{X}\mathbf{A}^\top + \mathbf{A}\boldsymbol{\mu}_\mathbf{X}\mathbf{b}^\top + \mathbf{b}\boldsymbol{\mu}_\mathbf{X}^\top\mathbf{A}^\top + \mathbf{b}\mathbf{b}^\top. \end{aligned}$$
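The sketch below verifies the transformation rules $\boldsymbol{\mu}_\mathbf{Y} = \mathbf{A}\boldsymbol{\mu}_\mathbf{X} + \mathbf{b}$ and $\mathbf{\Gamma}_\mathbf{Y} = \mathbf{A}\mathbf{\Gamma}_\mathbf{X}\mathbf{A}^\top$ empirically for a hypothetical choice of $\mathbf{A}$, $\mathbf{b}$ and the distribution of $\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500_000

# Hypothetical invertible A, offset b, and a Gaussian X with known mean and covariance.
A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, -1.0],
              [1.0, 0.0, 3.0]])
b = np.array([1.0, -2.0, 0.5])
mu_X = np.array([0.5, 1.0, -1.5])
Gamma_X = np.array([[2.0, 0.3, 0.0],
                    [0.3, 1.0, 0.2],
                    [0.0, 0.2, 0.5]])

X = rng.multivariate_normal(mu_X, Gamma_X, size=n)
Y = X @ A.T + b                # Y = A X + b, applied sample-wise

print(Y.mean(axis=0))          # close to A @ mu_X + b
print(A @ mu_X + b)
print(np.cov(Y.T))             # close to A @ Gamma_X @ A.T
print(A @ Gamma_X @ A.T)
```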


  1. Proof that $\mathrm{Var}[X+Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{E}[(X-\mu_X)(Y-\mu_Y)]$:
$$\begin{aligned} \mathrm{Var}[X+Y] &= \mathrm{E}\big[(X+Y-(\mu_X+\mu_Y))^2\big] \\ &= \mathrm{E}\big[((X-\mu_X)+(Y-\mu_Y))^2\big] \\ &= \mathrm{E}\big[(X-\mu_X)^2 + 2(X-\mu_X)(Y-\mu_Y) + (Y-\mu_Y)^2\big] \\ &= \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{E}[(X-\mu_X)(Y-\mu_Y)]. \end{aligned}$$ ↩︎