Functions and pairs of random variables

Functions of random variables

When a signal generated according to a certain probability distribution is transformed, the probability distribution of the result is usually difficult to determine; the expected value and variance of the transformed signal, however, are often much easier to compute.

Linear combinations of random variables

One common category of such transformations consists of linear combinations of a continuous random variable $X$, with mean $\mu_X$ and variance $\sigma_X^2$. Suppose that the random variable $Y$ is a linear combination of $X$ defined by
$$ Y = aX + b, \tag{1}$$
with $a$ and $b$ deterministic constants. The mean of the random variable $Y$ can be determined as
$$ \mathrm{E}[Y] = \mathrm{E}[aX+b] = \int_{-\infty}^{\infty} (ax+b)\, p_X(x)\, \mathrm{d}x = a\int_{-\infty}^{\infty} x\, p_X(x)\, \mathrm{d}x + b\int_{-\infty}^{\infty} p_X(x)\, \mathrm{d}x = a\,\mathrm{E}[X] + b = a\mu_X + b. \tag{2}$$
From this we have now obtained the rules that allowed us to rewrite the variance in an earlier subsection. Similarly, the variance of $Y$ can be determined using the previous result as
$$ \mathrm{Var}[Y] = \mathrm{E}\big[(Y-\mu_Y)^2\big] = \mathrm{E}\big[(aX+b-(a\mu_X+b))^2\big] = \mathrm{E}\big[a^2(X-\mu_X)^2\big] = \int_{-\infty}^{\infty} a^2 (x-\mu_X)^2\, p_X(x)\, \mathrm{d}x = a^2 \int_{-\infty}^{\infty} (x-\mu_X)^2\, p_X(x)\, \mathrm{d}x = a^2\,\mathrm{Var}[X]. \tag{3}$$
A very similar derivation also holds for discrete random variables. Please note that the constant $b$ does not have an effect on the variance of $Y$. Intuitively this makes sense, since the variance is a measure of the spread of a variable: adding a constant does not change the spread of the probability distribution.
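As a quick sanity check of these rules, the following sketch (not part of the original text; it assumes NumPy and arbitrarily picks $X \sim \text{Exp}(\lambda=1)$, $a = 2$ and $b = 3$) estimates the mean and variance of $Y = aX + b$ from random samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example: X ~ Exp(lambda = 1), so E[X] = 1 and Var[X] = 1.
x = rng.exponential(scale=1.0, size=1_000_000)

a, b = 2.0, 3.0      # deterministic constants
y = a * x + b        # linear combination Y = aX + b

print(y.mean())      # ~ a * E[X] + b = 5
print(y.var())       # ~ a**2 * Var[X] = 4
```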

Continuous elementary functions

Sometimes it can be useful to define probability distributions in the continuous domain more flexibly. Up until now we were limited to the given families of distributions, but one might, for example, want to alter these distributions slightly according to the situation. Suppose there is a random variable $X$ that behaves as a uniform distribution for $X < 0$ and as an exponential distribution for $X \geq 0$. How can we describe this distribution mathematically? For this the Dirac delta pulse function and the unit step function need to be introduced.

Dirac delta pulse function

The Dirac delta pulse function is a function that is zero everywhere except at a single point of its domain. The value at this point is undefined, but it guarantees that the integral over its domain equals 1. The Dirac delta pulse is therefore defined as
$$ \delta(x) = \begin{cases} +\infty, & \text{for } x = 0 \\ 0, & \text{for } x \neq 0 \end{cases} \tag{4}$$
under the constraint
$$ \int_{-\infty}^{\infty} \delta(x)\, \mathrm{d}x = \int_{0^-}^{0^+} \delta(x)\, \mathrm{d}x = 1, \tag{5}$$
where $0^-$ and $0^+$ denote the limits towards 0 from below and above, respectively.

Sifting property

The definition of the Dirac delta pulse function allows us to extract values from a function by simply multiplying with the delta pulse and integrating over the domain. This delta pulse can be shifted on its domain, allowing for the extraction of the function value at any point. This process can be regarded as sampling. It is known as the sifting property and is defined as
$$ \int_{-\infty}^{\infty} g(x)\, \delta(x - x_0)\, \mathrm{d}x = \int_{x_0^-}^{x_0^+} g(x)\, \delta(x - x_0)\, \mathrm{d}x = g(x_0). \tag{6}$$
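The sifting property can also be verified symbolically. The sketch below assumes SymPy is available and arbitrarily chooses $g(x) = \sin(x)$ and $x_0 = 2$; it is an illustration, not part of the original derivation.

```python
import sympy as sp

x = sp.symbols('x', real=True)
x0 = 2                               # arbitrary shift of the delta pulse

g = sp.sin(x)                        # arbitrary test function g(x)
result = sp.integrate(g * sp.DiracDelta(x - x0), (x, -sp.oo, sp.oo))

print(result)                        # sin(2), i.e. g(x0)
```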

Discrete to continuous probability functions

With the introduction of the Dirac delta pulse, we are now also capable of transforming a discrete probability mass function into a continuous probability density function. Each value of the probability mass function can be represented as a delta pulse multiplied with the corresponding probability. As an example, the probability mass function of the discrete Bernoulli($p$) distribution can be written as a probability density function as
$$ p_X(x) = (1-p)\,\delta(x) + p\,\delta(x-1). \tag{7}$$
Similarly, all other discrete probability mass functions can be rewritten as continuous probability density functions. For more details about the Bernoulli distribution, check the section Families of random variables.

Unit step function

It is also possible that a probability density function is zero up to a threshold $x_0$, after which it nicely follows a known distribution $p_X(x)$. The mathematical description of this piece-wise probability density function can be determined using the unit step function, which is defined as
$$ u(x) = \begin{cases} 1, & \text{for } x \geq 0 \\ 0, & \text{for } x < 0. \end{cases} \tag{8}$$
This function allows us to set a known probability density function $p_X(x)$ to zero up to a certain threshold $x_0$, whilst retaining the other part of the function. A generalization of such a new probability density function $p_{X_1}(x)$ can be written as
$$ p_{X_1}(x) = c\, p_X(x)\, u(x - x_0), \tag{9}$$
where $c$ is a constant that is required to satisfy the total probability axiom ($\int_{-\infty}^{\infty} p_{X_1}(x)\,\mathrm{d}x = 1$).
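To illustrate how the constant $c$ can be obtained in practice, the following sketch (an assumed example using SciPy, with a standard Gaussian for $p_X(x)$ and a threshold $x_0 = 1$ chosen arbitrarily) computes the normalization numerically.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

x0 = 1.0                             # arbitrary threshold

# p_X1(x) = c * p_X(x) * u(x - x0) with p_X a standard Gaussian density.
# Below x0 the density is zero, so the normalization integral starts at x0.
area, _ = quad(norm.pdf, x0, np.inf)
c = 1.0 / area

print(c)                             # ~ 6.30, i.e. 1 / (1 - Phi(1))
```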

Mixed random variables

When the probability density function of a random variable consists of a combination of both delta pulses and regular probability density components, it can be regarded as a mixed random variable. An example of such a random variable is given below. The definition of a mixed random variable is

$X$ is a mixed random variable if and only if $p_X(x)$ consists of both impulses and non-zero, finite values.

Example

Suppose we are monitoring the duration of phone calls. The measured duration $x$ in minutes can be regarded as a continuous random variable $X$, which is distributed as an exponential random variable with rate parameter $\lambda = 1$. This can be written as $X \sim \text{Exp}(\lambda = 1)$ with probability density function
$$ p_X(x) = \begin{cases} e^{-x}, & \text{for } x \geq 0 \\ 0, & \text{for } x < 0. \end{cases} \tag{10}$$
The service provider decides for simplicity to round all phone calls lasting less than 60 seconds to exactly 1 minute. This administrative duration $y$ in minutes can be represented by a mixed random variable $Y$. The administrative duration can be mathematically determined as
$$ y = \begin{cases} 1, & \text{for } x < 1 \\ x, & \text{for } x \geq 1. \end{cases} \tag{11}$$
Let us now derive the probability density function of $Y$. In order to do so, the derivation is split into the cases $Y < 1$, $Y = 1$ and $Y > 1$.

We can easily see that $p_Y(y) = 0$ for $y < 1$, because the duration is always rounded to at least 1 minute.

For $Y = 1$ a true probability is involved, since there is a non-zero probability that the administrative duration is exactly 1 minute. Because the integral of the probability density function should return a non-zero value at $y = 1$ when integrating $\int_{1^-}^{1^+} p_Y(y)\,\mathrm{d}y$, a Dirac delta pulse should be located at $y = 1$ in order to make sure that the integral returns the true probability. This delta pulse should be multiplied with the probability of the occurrence of this event. This probability can be determined as
$$ \Pr[Y = 1] = \Pr[X < 1] = \int_{-\infty}^{1} p_X(x)\,\mathrm{d}x = \int_{0}^{1} e^{-x}\,\mathrm{d}x = \big[-e^{-x}\big]_0^1 = 1 - e^{-1}. \tag{12}$$
For $Y > 1$ the probability density function of $Y$ simply follows the exponential distribution of $X$. In order to cope with this partial domain, $p_X(x)$ needs to be multiplied with the shifted unit step function.

So to conclude, the probability density function of $Y$ can be written as
$$ p_Y(y) = (1 - e^{-1})\,\delta(y - 1) + e^{-y}\, u(y - 1). \tag{13}$$
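This mixed distribution can be checked with a short Monte Carlo simulation. The sketch below is illustrative only; it assumes NumPy and simply applies the rounding rule of equation (11) to exponentially distributed call durations.

```python
import numpy as np

rng = np.random.default_rng(1)

# X ~ Exp(lambda = 1): measured call durations in minutes.
x = rng.exponential(scale=1.0, size=1_000_000)

# Administrative duration: calls shorter than 1 minute count as exactly 1 minute.
y = np.where(x < 1.0, 1.0, x)

print(np.mean(y == 1.0))   # empirical Pr[Y = 1]
print(1 - np.exp(-1))      # theoretical value from equation (12), ~ 0.632
```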



Pairs of random variables

So far this module has only focused on events that produce a single outcome and can therefore be described by a single random variable. However, it is also possible for an event to return multiple outcomes, described by multiple random variables.

Screencast video [⯈]



Example

Suppose we have an experiment in which we throw 2 dice consecutively and we record the total number of eyes after the first and after the second die is thrown. Assuming fair dice, the total number of eyes observed after the first throw is an integer between 1 and 6. But how can we describe the total number of eyes after throwing the second die? It is clear that this second observation will be between 2 and 12 eyes. However, we do need to keep in mind that the second observation depends greatly on the first observation. It is, for example, impossible to first observe a total of 5 eyes after throwing the first die and then to observe a total of 4 eyes after throwing the second die.

Joint random variables

Through the previous example it is clear that there is a need to combine multiple random variables into a single function in order to draw conclusions about observations and their probabilities. Multiple random variables that are all associated with an event are called joint random variables. These random variables are usually denoted with different capital letters (e.g. $X$ and $Y$). The theory that follows holds for any number of joint random variables; however, for simplicity this reader will only discuss random processes described by 2 joint random variables.

Joint cumulative distribution function

In order to describe the probabilities that are associated with the observations $(x, y)$, we will introduce the joint cumulative distribution function, which is a variant of the cumulative distribution function. This joint cumulative distribution function is defined for the random variables $X$ and $Y$ as
$$ P_{X,Y}(x,y) = \Pr[X \leq x, Y \leq y], \tag{14}$$
where the comma in the probability operator indicates that both conditions need to hold.

Joint probability axioms

Similarly to the single random variable case, axioms can be determined for the joint cumulative distribution function. These axioms include

  1. It holds that $0 \leq P_{X,Y}(x,y) \leq 1$.
  2. It holds that $P_{X,Y}(\infty,\infty) = 1$ and that $P_{X,Y}(x,-\infty) = P_{X,Y}(-\infty,y) = 0$.
  3. It holds that $P_X(x) = P_{X,Y}(x,\infty)$ and similarly $P_Y(y) = P_{X,Y}(\infty,y)$.
  4. The function is non-decreasing (i.e. for $x \geq x_0$ and $y \geq y_0$ it holds that $P_{X,Y}(x,y) \geq P_{X,Y}(x_0,y_0)$).
The first axiom is a consequence of the total probability axiom, where a probability cannot be negative nor larger than 1.

A total probability of 1 only occurs when both $X \leq x$ and $Y \leq y$ are always satisfied, which is the case for $P_{X,Y}(\infty,\infty)$. Similarly, there is by definition no event for which $X \leq -\infty$ or $Y \leq -\infty$ holds, and therefore this (cumulative) probability is always 0.

When one of the conditions $X \leq x$ and $Y \leq y$ is always satisfied, the cumulative distribution does not depend on that variable anymore and therefore that variable can be removed from the equation, leaving a distribution that now depends on one single random variable. This process is called marginalization and will be discussed later on.

Finally, as seen previously, the cumulative distribution is a sum or integral of non-negative values over an ever-increasing domain, leading to a non-decreasing function.

Joint probability density function

The previous definition of the joint cumulative distribution function holds both for continuous and discrete random variables. For continuous random variables the joint probability density function can be defined through the relationship
$$ P_{X,Y}(x,y) = \int_{-\infty}^{x} \int_{-\infty}^{y} p_{X,Y}(u,v)\, \mathrm{d}v\, \mathrm{d}u \tag{15}$$
and can be calculated from the joint cumulative distribution function as
$$ p_{X,Y}(x,y) = \frac{\partial^2 P_{X,Y}(x,y)}{\partial x\, \partial y}. \tag{16}$$

Joint probability mass function

In the case of discrete random variables, the joint probability mass function can be defined explicitly as
$$ p_{X,Y}(x,y) = \Pr[X = x, Y = y], \tag{17}$$
because the values that the function returns are true probabilities.

Expected value

The expected value of joint random variables can be determined similarly to the single variable case. Given a new random variable $W$, which is an arbitrary function of the random variables $X$ and $Y$ through $W = g(X,Y)$, the expected value of $W$ can be determined as
$$ \mathrm{E}[W] = \sum_{x \in S_X} \sum_{y \in S_Y} g(x,y)\, p_{X,Y}(x,y) \tag{18}$$
in the case of discrete random variables and as
$$ \mathrm{E}[W] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y)\, p_{X,Y}(x,y)\, \mathrm{d}x\, \mathrm{d}y \tag{19}$$
in the case of continuous random variables.
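As a small illustration of the discrete double sum in equation (18), the sketch below (an assumed example with two independent fair dice and $g(x,y) = xy$, not taken from the text) evaluates $\mathrm{E}[W]$ directly.

```python
import numpy as np

# Two fair dice, assumed independent for this illustration, so that
# p_{X,Y}(x, y) = 1/36 for every pair (x, y).
values = np.arange(1, 7)
p_xy = 1.0 / 36.0

# W = g(X, Y) = X * Y, evaluated with the double sum of equation (18).
expectation = sum(x * y * p_xy for x in values for y in values)

print(expectation)         # 12.25, which equals E[X] * E[Y] = 3.5**2
```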

Marginalization

In many engineering problems you are only interested in the outcome of one random variable. When a joint probability function is available it is possible to marginalize over all other random variables except for the one random variable in which you are interested, leaving you with the probability function of just that random variable. This marginalization is performed for discrete random variables through
$$ p_X(x) = \sum_{y \in S_Y} p_{X,Y}(x,y) \tag{20}$$
and for continuous random variables through
$$ p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x,y)\, \mathrm{d}y. \tag{21}$$
This operation can be understood when viewing the summation or integration as “taking all possibilities of all other random variables into account”. The proof for continuous random variables follows as
$$ p_X(x) = \frac{\partial P_X(x)}{\partial x} = \frac{\partial}{\partial x}\Pr[X \leq x] = \frac{\partial}{\partial x}\Pr[X \leq x, Y \leq \infty] = \frac{\partial}{\partial x}\int_{-\infty}^{x}\left(\int_{-\infty}^{\infty} p_{X,Y}(v,u)\,\mathrm{d}u\right)\mathrm{d}v = \int_{-\infty}^{\infty} p_{X,Y}(x,u)\,\mathrm{d}u. \tag{22}$$
A similar proof can be found for the marginalization of discrete random variables.
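A minimal sketch of marginalization for a discrete case is given below; the joint probability mass function is made up purely for illustration, and the sums implement equation (20).

```python
import numpy as np

# A made-up joint PMF p_{X,Y}(x, y) for x in {0, 1, 2} and y in {0, 1}.
# Rows index x, columns index y; all entries sum to 1.
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.05, 0.20]])

# Marginalize over Y (equation (20)): sum the joint PMF over y for each x.
p_x = p_xy.sum(axis=1)
# Marginalize over X to obtain p_Y(y).
p_y = p_xy.sum(axis=0)

print(p_x)   # [0.3  0.45 0.25]
print(p_y)   # [0.45 0.55]
```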

Conditional probabilities

When considering events with multiple outcomes we might be able to deduce one outcome from another. Take the example of the two dice. If we know that the first die shows 5 eyes, then we automatically know that the total number of eyes after the second throw lies between 6 and 11 with equal probability. Similarly, we may deduce with 100% certainty that the first throw must have given 6 eyes when 12 eyes are observed after the second throw.

This line of reasoning is an example of conditional probability. Conditional probability involves the probability of some event when another event is observed. In the reader on probability theory, the definition of conditional probability was given and explained. If this definition is combined with random variables, two separate cases can be identified.

First, it is possible to determine the conditional probability function, which is the probability mass function or probability density function of observing outcomes $(x,y)$ when it is known that the outcomes can be found somewhere in an event set $B$, which is a subset of the sample space $S$. When the initial joint probability function is given by $p_{X,Y}(x,y)$, the conditional probability function of $X$ and $Y$ given $B$ (i.e. $X,Y|B$) can be determined as
$$ p_{X,Y|B}(x,y) = \begin{cases} \dfrac{p_{X,Y}(x,y)}{\Pr[B]}, & \text{when } (x,y) \in B \\ 0, & \text{otherwise.} \end{cases} \tag{23}$$
This operation can be regarded as assigning all outcomes outside set $B$ to zero and normalizing the remaining function to satisfy the total probability axiom.

Secondly, it is also possible that one outcome is observed and we would like to find the probability function of the other random variable. Let us now suppose that the random variable $Y$ is observed as $y$ and we are interested in finding the conditional probability function of $X$, which is $p_{X|Y}(x|y)$. For discrete random variables this conditional probability mass function can be determined as
$$ p_{X|Y}(x|y) = \Pr[X = x \mid Y = y] = \frac{\Pr[X = x, Y = y]}{\Pr[Y = y]} = \frac{p_{X,Y}(x,y)}{p_Y(y)}. \tag{24}$$
Keep in mind that the probability mass function $p_Y(y)$ can be calculated using marginalization. Similarly to the derivation for discrete random variables, the conditional probability density function of continuous random variables can be determined as
$$ p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)} = \frac{p_{X,Y}(x,y)}{\int_{-\infty}^{\infty} p_{X,Y}(x,y)\, \mathrm{d}x}. \tag{25}$$

Example

Suppose we are given the joint probability density function
$$ p_{X,Y}(x,y) = \begin{cases} C e^{-x}, & \text{for } x \geq 0 \text{ and } -x \leq y \leq x \\ 0, & \text{otherwise,} \end{cases} \tag{26}$$
where $C$ is a constant to be determined. Let us now derive the value of $C$, the marginal probability density functions $p_X(x)$ and $p_Y(y)$, and the conditional probability density function $p_{Y|X}(y|x)$.

In order to calculate $C$ we need to make use of the total probability axiom, which is given as
$$ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p_{X,Y}(x,y)\, \mathrm{d}x\, \mathrm{d}y = 1 \tag{27}$$
for the joint random variables case. If we substitute the given joint probability density function with the given constraints we find
$$ \int_{0}^{\infty} \int_{-x}^{x} C e^{-x}\, \mathrm{d}y\, \mathrm{d}x = C \int_{0}^{\infty} e^{-x} \int_{-x}^{x} 1\, \mathrm{d}y\, \mathrm{d}x = C \int_{0}^{\infty} e^{-x} \big[y\big]_{-x}^{x}\, \mathrm{d}x = 2C \int_{0}^{\infty} x e^{-x}\, \mathrm{d}x \overset{[1]}{=} -2C\big[x e^{-x}\big]_0^{\infty} + 2C \int_{0}^{\infty} e^{-x}\, \mathrm{d}x = -2C\big[x e^{-x}\big]_0^{\infty} - 2C\big[e^{-x}\big]_0^{\infty} \overset{[2]}{=} -2C(0-0) - 2C(0-1) = 2C = 1. \tag{28}$$
From this it follows that $C = \tfrac{1}{2}$.


In order to calculate $p_X(x)$ we will marginalize $p_{X,Y}(x,y)$ over $Y$ as
$$ \int_{-\infty}^{\infty} p_{X,Y}(x,y)\, \mathrm{d}y = \int_{-x}^{x} C e^{-x}\, \mathrm{d}y = C e^{-x} \int_{-x}^{x} 1\, \mathrm{d}y = 2C x e^{-x}. \tag{29}$$
When substituting the value of $C$ and considering that the function only exists for $x \geq 0$, the marginal probability density function can be determined as
$$ p_X(x) = \begin{cases} x e^{-x}, & \text{for } x \geq 0 \\ 0, & \text{otherwise.} \end{cases} \tag{30}$$

Similarly we can find the marginal probability density function $p_Y(y)$. One might say that the marginalization integral has a lower bound for $x$, namely $x \geq 0$. However, this is not entirely correct. When rewriting the constraint $-x \leq y \leq x$ as $x \geq |y|$ it becomes clear that $x$ is not bounded by 0, but by $|y|$. The marginal probability density function is not bounded on $y$ by a constant and can therefore be written as
$$ p_Y(y) = \int_{|y|}^{\infty} C e^{-x}\, \mathrm{d}x = C\big[-e^{-x}\big]_{|y|}^{\infty} = C\left(0 + e^{-|y|}\right) = \tfrac{1}{2} e^{-|y|}. \tag{31}$$

Now finally we can use our previous results in order to determine the conditional probability density function as
$$ p_{Y|X}(y|x) = \frac{p_{X,Y}(x,y)}{p_X(x)} = \begin{cases} \dfrac{\frac{1}{2} e^{-x}}{x e^{-x}}, & \text{for } x \geq 0 \text{ and } -x \leq y \leq x \\ 0, & \text{otherwise} \end{cases} = \begin{cases} \dfrac{1}{2x}, & \text{for } x \geq 0 \text{ and } -x \leq y \leq x \\ 0, & \text{otherwise.} \end{cases} \tag{32}$$
One could apply the axiom of total probability to all marginalized and conditional probability density functions over their domain and note that the axiom is satisfied.
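The results of this example can be double-checked symbolically. The sketch below assumes SymPy and re-derives $C$ and both marginal densities; it is a verification aid, not part of the original derivation.

```python
import sympy as sp

x = sp.symbols('x', positive=True)
y = sp.symbols('y', real=True)
C = sp.symbols('C', positive=True)

# Joint PDF C * exp(-x) on the wedge x >= 0, -x <= y <= x (equation (26)).
joint = C * sp.exp(-x)

# Total probability axiom (equations (27)-(28)): solve for C.
total = sp.integrate(joint, (y, -x, x), (x, 0, sp.oo))
C_val = sp.solve(sp.Eq(total, 1), C)[0]
print(C_val)                                      # 1/2

# Marginal densities (equations (29)-(31)).
p_x = sp.integrate(joint.subs(C, C_val), (y, -x, x))
p_y = sp.integrate(joint.subs(C, C_val), (x, sp.Abs(y), sp.oo))
print(sp.simplify(p_x))                           # x*exp(-x)
print(sp.simplify(p_y))                           # exp(-Abs(y))/2
```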



Exercise


Random variables $X$ and $Y$ have the joint PDF
$$ f_{X,Y}(x,y) = \begin{cases} 5x^2/2, & -1 \leq x \leq 1,\ 0 \leq y \leq x^2 \\ 0, & \text{otherwise.} \end{cases}$$
Answer the following questions:
  1. What are $\mathrm{E}[X]$ and $\mathrm{Var}[X]$?
  2. What are $\mathrm{E}[Y]$ and $\mathrm{Var}[Y]$?
  3. What is $\mathrm{E}[X+Y]$?
  1. The first moment of $X$ is $\mathrm{E}[X] = \int_{-1}^{1} \int_{0}^{x^2} x \cdot \frac{5x^2}{2}\, \mathrm{d}y\, \mathrm{d}x = \int_{-1}^{1} \frac{5x^5}{2}\, \mathrm{d}x = \left.\frac{5x^6}{12}\right|_{-1}^{1} = 0$. Since $\mathrm{E}[X] = 0$, the variance of $X$ and the second moment are both $\mathrm{Var}[X] = \mathrm{E}[X^2] = \int_{-1}^{1} \int_{0}^{x^2} x^2 \cdot \frac{5x^2}{2}\, \mathrm{d}y\, \mathrm{d}x = \left.\frac{5x^7}{14}\right|_{-1}^{1} = \frac{10}{14}$.
  2. The first and second moments of $Y$ are $\mathrm{E}[Y] = \int_{-1}^{1} \int_{0}^{x^2} y \cdot \frac{5x^2}{2}\, \mathrm{d}y\, \mathrm{d}x = \frac{5}{14}$ and $\mathrm{E}[Y^2] = \int_{-1}^{1} \int_{0}^{x^2} y^2 \cdot \frac{5x^2}{2}\, \mathrm{d}y\, \mathrm{d}x = \frac{5}{27}$. Therefore, $\mathrm{Var}[Y] = 5/27 - (5/14)^2 \approx 0.0576$.
  3. The expected value of the sum $X + Y$ is $\mathrm{E}[X+Y] = \mathrm{E}[X] + \mathrm{E}[Y] = \frac{5}{14}$.
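The answers above can be verified symbolically. The following sketch assumes SymPy; the helper `moment` is a hypothetical convenience function introduced here to evaluate the double integral of equation (19) over the support of the joint PDF.

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)

# Joint PDF 5*x**2/2 on the support -1 <= x <= 1, 0 <= y <= x**2.
f = sp.Rational(5, 2) * x**2

def moment(g):
    """E[g(X, Y)] as the double integral of g * f over the support."""
    return sp.integrate(sp.integrate(g * f, (y, 0, x**2)), (x, -1, 1))

EX, EX2 = moment(x), moment(x**2)
EY, EY2 = moment(y), moment(y**2)

print(EX, EX2 - EX**2)    # 0, 5/7 (= 10/14)        -> E[X], Var[X]
print(EY, EY2 - EY**2)    # 5/14, 305/5292 ≈ 0.0576 -> E[Y], Var[Y]
print(moment(x + y))      # 5/14                    -> E[X + Y]
```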

Central limit theorem

Screencast video [⯈]



We have briefly discussed what happens with the signal statistics, especially the mean and variance, when a random variable is linearly transformed. Now let us focus on the specific situation where we take the normalized sum of $N$ independent continuous random variables. We will define a random variable $Y$ as
$$ Y = \frac{X_1 + X_2 + \ldots + X_N}{N} \tag{33}$$
and have a look at what the probability density function of $Y$ looks like.

In Fig. 1, 100000 samples are generated from three different probability distributions and the resulting histograms are plotted. This is repeated for an increasing number N, and the histograms are plotted after averaging N similarly generated sets of data. It can be noted that after averaging multiple realizations of a random variable generated from the same (arbitrary) distribution, the distribution of the averaged random variables converges to a Gaussian distribution. This result is known as the central limit theorem (CLT).

Visualisation of the central limit theorem.
Demonstration of the CLT for three different probability distributions, i.e. $U(0,5)$, $U(-10,10)$ and $\text{Exp}(\lambda=1)$, where $N$ sets of 100000 realizations are averaged and the new distributions of the averaged 100000 samples are shown for different values of $N$. A Gaussian function is fitted over the distributions and it can be seen that after averaging over more realizations, the new distribution converges to the Gaussian distribution.

The formal definition is given as:

Let $X_1, X_2, \ldots, X_N$ be a set of $N$ independent identically-distributed (i.i.d.) random variables, where each $X_i$ follows the same arbitrary probability distribution $p(x)$ with finite mean $\mu_i = \mu$ and finite standard deviation $\sigma_i = \sigma$. If the sample size $N$ is “sufficiently large”, then the CDF of the normalized sum converges to a Gaussian CDF.

In simple words, the CLT states that the normalized sum of a sufficient number of i.i.d. random variables tends to a Gaussian distribution. It is also possible to predict the mean and the variance of the Gaussian probability density function after a sufficiently large set of N random variables has been summed. The minimum value of N for the CLT approximation to be valid depends on the distribution of the individual random variables.

Under certain conditions, the CLT can be extended to the sum of random variables that are independent but not necessarily identically distributed (i.e., under the Lyapunov condition). In both cases, the mean or expected value of the approximate Gaussian probability density function can be determined as
$$ \mathrm{E}[Y] = \frac{1}{N} \sum_{i=1}^{N} \mathrm{E}[X_i], \tag{34}$$
which is simply the average value of all individual expected values. Similarly the variance can be determined as
$$ \mathrm{Var}[Y] = \sigma_Y^2 = \frac{1}{N^2} \sum_{i=1}^{N} \sigma_{X_i}^2 = \frac{1}{N^2} \sum_{i=1}^{N} \mathrm{Var}[X_i]. \tag{35}$$
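The predicted mean and variance can be reproduced with a short simulation. The sketch below assumes NumPy and arbitrarily averages $N = 50$ i.i.d. $\text{Exp}(\lambda=1)$ variables, comparing the empirical moments of $Y$ with equations (34) and (35).

```python
import numpy as np

rng = np.random.default_rng(2)

# Average N i.i.d. Exp(lambda = 1) variables (mean 1, variance 1);
# N and the distribution are arbitrary choices for this illustration.
N, realizations = 50, 100_000
x = rng.exponential(scale=1.0, size=(realizations, N))
y = x.mean(axis=1)       # Y = (X_1 + ... + X_N) / N

print(y.mean())          # ~ E[Y] = 1            (equation (34))
print(y.var())           # ~ Var[Y] = 1/N = 0.02 (equation (35))
```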

Similarly, the random variable $Z$, which is the normalized sum of standardized random variables, can be defined as
$$ Z = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \frac{X_i - \mathrm{E}[X_i]}{\sigma_{X_i}}. \tag{36}$$
Through the central limit theorem it follows that the probability density function of the random variable $Z$ converges to the normalized Gaussian distribution $\mathcal{N}(0,1)$.


  1. Integration by parts: $\int_a^b f(x)\, g'(x)\, \mathrm{d}x = \big[f(x)g(x)\big]_a^b - \int_a^b f'(x)\, g(x)\, \mathrm{d}x$

  2. Growth property of exponentials: $\lim_{x \to \infty} x^a e^{-x} = 0$