Many events are filled with uncertainty and randomness, but some events are more likely than others. In that case, we speak about probability: “What is the likelihood of an event occurring?” For example, what is the likelihood that it is going to rain today? The answer depends on various factors, e.g. location, season, past weather and many more. We also need a value that tells us whether it is highly likely, less likely or very unlikely to rain today. This measure is the ‘probability’. In principle such a measure could be expressed on an arbitrarily wide scale, but a very large range does not express likelihood well. Hence, we always normalize the value and measure probability between 0 and 1. ‘0’ represents that the event is impossible and ‘1’ represents that the event is certain. All the values in between measure the ‘likeliness’ of the event happening. If the probability is 0.5, there is a 50% chance of the event happening.
Let us assume that we are tossing a coin. Is there any certainty that we get a head (H) every time? What about getting a tail (T)? Since no outcome is certain, we say that tossing a coin is a probabilistic event, and we can only express the outcome in terms of probability. If we toss a fair coin 100 times, which is more likely to come up, H or T? There is no absolute favorite: both H and T are equally likely, each with probability 0.5, and the outcome of one toss is independent of the outcome of any other toss.
This approach can be generalized by stating that: “For independent events, the joint probability is the product of all individual probabilities.”
If $A$, $B$ and $C$ are independent events, then the joint probability is:

$$P(A, B, C) = P(A)\,P(B)\,P(C)$$
The joint probability is also represented by placing the intersection sign between the events:

$$P(A \cap B \cap C) = P(A)\,P(B)\,P(C)$$
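The product rule can be checked by brute-force counting. The following is a minimal sketch, assuming three tosses of a fair coin and taking the illustrative events $A$, $B$, $C$ to be "toss 1, 2 or 3 shows a head"; the helper `prob` is a hypothetical name introduced only for this example.

```python
from fractions import Fraction
from itertools import product

# Sample space of three fair coin tosses: 8 equally likely outcomes.
outcomes = list(product("HT", repeat=3))

def prob(event):
    """Probability of an event (a predicate on outcomes) by counting."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

# Illustrative events: A, B, C = "toss 1 / 2 / 3 shows a head".
p_a = prob(lambda o: o[0] == "H")
p_b = prob(lambda o: o[1] == "H")
p_c = prob(lambda o: o[2] == "H")

# Joint probability that all three events occur together.
p_joint = prob(lambda o: o == ("H", "H", "H"))

print(p_joint, p_a * p_b * p_c)  # both are 1/8
```

Counting the single favourable outcome among eight gives the same value as multiplying the three individual probabilities, which is what independence means here.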
But what does this mean? In the conditional probability $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$, the expression in the numerator is the probability of both $A$ and $B$ occurring and the denominator is the probability of $A$ occurring. Dividing the joint probability $P(A \cap B)$ by the marginal probability $P(A)$ intuitively means that we restrict the sample space to the region where event $A$ has occurred. In simple words: first find the sample space where event $A$ has occurred, then find the probability of event $B$ occurring within that sample space.
Consider an example: what is the probability of drawing an ace from a standard deck of 52 cards, given that the value of the card is less than 10 (counting the ace as 1)?
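At first we have,

$$P(\text{value} < 10) = \frac{36}{52}, \qquad P(\text{Ace} \cap \text{value} < 10) = \frac{4}{52}$$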
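Using the formula, the conditional probability is:

$$P(\text{Ace} \mid \text{value} < 10) = \frac{P(\text{Ace} \cap \text{value} < 10)}{P(\text{value} < 10)} = \frac{4/52}{36/52} = \frac{1}{9}$$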
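Intuitively, we can easily calculate $P(\text{Ace} \mid \text{value} < 10)$ by reducing the sample space to the cards that satisfy the condition $\text{value} < 10$. The total number of cards that satisfy the condition is 36 and the total number of aces that satisfy it is 4. Therefore the probability of an ace given $\text{value} < 10$ is $\frac{4}{36} = \frac{1}{9}$.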
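Both routes to the conditional probability can be reproduced by enumerating a deck. This is a small sketch that assumes a standard 52-card deck in which the ace counts as 1 and the values 11 to 13 stand for jack, queen and king; the variable names are illustrative.

```python
from fractions import Fraction

# Assumption: ace counts as 1, and 11-13 stand for J, Q, K.
ranks = range(1, 14)
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

below_ten = [card for card in deck if card[0] < 10]
aces_below_ten = [card for card in below_ten if card[0] == 1]

# Route 1, the formula: P(Ace | value < 10) = P(Ace and value < 10) / P(value < 10)
p_joint = Fraction(len(aces_below_ten), len(deck))
p_condition = Fraction(len(below_ten), len(deck))
p_formula = p_joint / p_condition

# Route 2, the reduced sample space: count aces among the qualifying cards.
p_reduced = Fraction(len(aces_below_ten), len(below_ten))

print(len(below_ten), p_formula, p_reduced)  # 36 1/9 1/9
```

The two routes agree because dividing by $P(\text{value} < 10)$ is the same operation as discarding every card that fails the condition.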
Hence, we can express the joint probability as:

$$P(A \cap B) = P(B \mid A)\,P(A) = P(A \mid B)\,P(B)$$
Combining, we can express another formula,

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$
This is one of the most famous formulas in probability theory, known as “Bayes' Theorem”, valid for $P(A) \neq 0$. We will now discuss the significance of this formula.
We divide this formula into three terms: $P(B \mid A)$, $P(B)$, and $\frac{P(A \mid B)}{P(A)}$. The probability $P(B)$, which is independent of $A$, is called the prior probability, and the conditional probability $P(B \mid A)$ is the posterior probability. Bayes' Theorem shows how the posterior probability is related to the prior probability. The term $\frac{P(A \mid B)}{P(A)}$ is the factor by which the probability of $B$ changes due to knowledge of $A$.
Example: suppose you have tested positive for a disease $D$. What is the probability that you actually have the disease? This depends on how accurate and sensitive the test is and on the prior probability of the disease. Let $P(D)$ represent the prior probability of the disease. A positive test result can arise in two cases:

Case I: you have the disease and the test is positive (true positive).
Case II: you do not have the disease but the test is positive anyway (false positive).
The probability due to case I is represented as $P(+ \mid D)$ and the probability due to case II is represented as $P(+ \mid \neg D)$. We want to determine $P(D \mid +)$.
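We can use Bayes' Theorem for this as:

$$P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+)} = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)}$$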
Assume the following values for the given probabilities:

$$P(D) = 0.01, \qquad P(+ \mid D) = 0.95, \qquad P(+ \mid \neg D) = 0.10$$
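Now we can simply substitute the values into Bayes' formula to calculate the probability of actually suffering from the disease given that the test is positive:

$$P(D \mid +) = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.10 \times 0.99} = \frac{0.0095}{0.1085} \approx 0.087$$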
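This means that even if you test positive for the disease, there is only about an 8.7% chance that you actually have it, for a test with a true positive rate of 95% and a false positive rate of 10%. The reason is that the disease itself is rare, so the false positives among the many healthy people outnumber the true positives among the few sick ones. Patients with a good knowledge of Bayesian statistics therefore know that a positive result alone should not be taken as conclusive.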
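The calculation is easy to reproduce and to vary. The sketch below assumes a prior of 1% for the disease, a value chosen here because it is consistent with the quoted result of roughly 8.7%; the function name `posterior` and its parameter names are illustrative, not part of any library.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(D | +) from Bayes' Theorem, with P(+) expanded over D and not-D."""
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence

# Assumed prior of 1%; the 95% and 10% rates are the ones quoted in the text.
p_disease = posterior(prior=0.01, true_positive_rate=0.95, false_positive_rate=0.10)
print(f"{p_disease:.4f}")

# The same test applied to a population where the disease is common (assumed 30%).
p_common = posterior(prior=0.30, true_positive_rate=0.95, false_positive_rate=0.10)
print(f"{p_common:.4f}")
```

Changing only the prior moves the posterior substantially, which shows that the reliability of a positive result depends on how common the disease is and not only on the accuracy of the test.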
But while solving mathematical problems, we want to express the outcomes in a sample space as real numbers. A function that maps each outcome in the sample space to a real value is called a random variable, usually represented by a capital letter. A random variable that represents a fair coin toss can be expressed as:

$$X = \begin{cases} 1, & \text{if the outcome is H} \\ 0, & \text{if the outcome is T} \end{cases}$$
In this regard, we see that a random variable is neither random nor a variable, but a function that represents random outcomes numerically. Random variables can be discrete or continuous. Discrete random variables take only a countable number of distinct values, for example the number of employees working in a company, the number of students in a classroom, or the number of defective apples in a box. Continuous random variables, in contrast, can take any of the infinitely many values within a range, for example the height of students in a class, the price of a house in Hamburg, the amount of sugar added to a cup of tea, or the weight of people on a street in Hamburg.
In the case of discrete random variables, we assign a probability to each possible value of the random variable. This assignment is called the probability distribution. It is represented by a function called the probability mass function (PMF). If a random variable $X$ takes $n$ different values $x_1, x_2, \dots, x_n$ with probabilities $p_i = P(X = x_i)$, then the PMF can be visualized as a probability histogram. Remember that the probabilities should satisfy the following conditions:

$$0 \le p_i \le 1, \qquad \sum_{i=1}^{n} p_i = 1$$
Let us assume that a die thrown 100 times shows the values 1 to 6 with the following probabilities,
We can plot the probability histogram as shown below:
In many cases, we work with the cumulative distribution function (CDF) of a random variable. The CDF is the distribution function which gives the probability that a random variable $X$ is less than or equal to a value $x$. If we calculate the CDF for the above example, we can state as follows:
Thus, the CDF is expressed as:

$$F_X(x) = P(X \le x) = \sum_{x_i \le x} P(X = x_i)$$
The probability that $X$ lies in the semi-closed interval $(a, b]$ for any values $a < b$ can be expressed in terms of the CDF as:

$$P(a < X \le b) = F_X(b) - F_X(a)$$
If we plot the histogram for cdf, it is obtained as shown below:
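The relationship between PMF and CDF can be sketched in a few lines. Since the empirical probabilities of the 100 throws are not reproduced here, the example assumes an ideal fair die with probability 1/6 per face; the names `pmf` and `cdf` are illustrative.

```python
from fractions import Fraction
from itertools import accumulate

faces = [1, 2, 3, 4, 5, 6]

# Assumption: a fair die, not the empirical values from the 100 throws.
pmf = {face: Fraction(1, 6) for face in faces}

# The two PMF conditions: every probability lies in [0, 1] and they sum to 1.
assert all(0 <= p <= 1 for p in pmf.values())
assert sum(pmf.values()) == 1

# CDF as the running sum of the PMF: F(x) = P(X <= x).
cdf = dict(zip(faces, accumulate(pmf[face] for face in faces)))

# Interval probability from the CDF: P(2 < X <= 5) = F(5) - F(2).
p_interval = cdf[5] - cdf[2]

print(cdf[3], cdf[6], p_interval)  # 1/2 1 1/2
```

The CDF only ever increases and ends at 1, and the interval probability matches the direct sum of the PMF over the faces 3, 4 and 5.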
For continuous random variables, the probability at a specific value cannot be defined in the same way, because the variable can take infinitely many possible values. Instead, we define the probability over an interval of values. This probability is given by the area under a curve, which is obtained mathematically by integrating over that range. The “curve” is represented by a function called the “probability density function” (PDF).

Let us understand the PDF with an example. Suppose we want to answer the question “What is the probability that the weight of a student in a school is 60 kg?” If we try to find the probability $P(X = 60)$ by treating $X$ as a discrete value, we are using a PMF. But since weight is a continuous value, we are instead interested in the probability within a range of weights, such as 55 kg to 75 kg, written as $P(55 \le X \le 75)$. In this case, we assign probability to a range of weights rather than to a single weight. In terms of the PDF, a specific value is represented by a line of zero width, which gives an area of 0, and this is also the probability of that specific value.

If a random variable $X$ has a probability density function $f_X(x)$, then the probability between any two values $a$ and $b$ is given by the area under the curve between $a$ and $b$. Mathematically, this is the integral of the PDF over the interval from $a$ to $b$:

$$P(a \le X \le b) = \int_{a}^{b} f_X(x)\,dx$$
Since the area under the PDF curve measures probability and the total probability is 1, the total area under the curve of any PDF is always equal to 1. This is stated as:

$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1$$
As defined earlier, the CDF is the probability that a random variable is less than or equal to a value $x$. If the CDF observed at $x$ is represented by $F_X(x)$ for a random variable $X$, then we can express:

$$F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t)\,dt$$
If the CDF $F_X(x)$ is differentiable at $x$, we can also express the PDF in terms of the CDF as:

$$f_X(x) = \frac{d F_X(x)}{dx}$$
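The three statements about the PDF (area gives probability, a single point has probability zero, the PDF is the slope of the CDF) can be checked numerically. The sketch below assumes, purely for illustration, that student weights follow a normal distribution with mean 65 kg and standard deviation 10 kg; `integrate` is a hand-written midpoint rule, not a library routine.

```python
import math

MU, SIGMA = 65.0, 10.0  # assumed mean and standard deviation of weights in kg

def pdf(x):
    """Normal density with the assumed parameters."""
    return math.exp(-((x - MU) ** 2) / (2 * SIGMA**2)) / (SIGMA * math.sqrt(2 * math.pi))

def cdf(x):
    """Normal CDF written with the error function from the standard library."""
    return 0.5 * (1 + math.erf((x - MU) / (SIGMA * math.sqrt(2))))

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

# P(55 <= X <= 75) as an area under the PDF and as a difference of CDF values.
p_area = integrate(pdf, 55, 75)
p_cdf = cdf(75) - cdf(55)

# P(X = 60): an interval of zero width has zero area.
p_point = integrate(pdf, 60, 60)

# The PDF as the derivative of the CDF, via a central finite difference.
h = 1e-5
slope = (cdf(60 + h) - cdf(60 - h)) / (2 * h)

print(round(p_area, 4), round(p_cdf, 4), p_point)
print(round(slope, 6), round(pdf(60), 6))
```

Under these assumptions the interval from 55 kg to 75 kg spans one standard deviation on each side of the mean, so both routes give a probability of about 0.68.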
where $p$ is the probability that the random variable takes the value 1. $p$ is also the parameter of the Bernoulli distribution.
If a random variable $X$ follows a normal distribution, it has the parameters mean ($\mu$) and variance ($\sigma^2$) and is represented as:

$$X \sim \mathcal{N}(\mu, \sigma^2)$$
If $Q$ is not concentrated at a single point but distributed over a continuous region, then we obtain the moment by integrating the density of $Q$ over that region. If $\rho(r)$ represents the density of the physical quantity at distance $r$ from the reference point, then the $n$-th order moment is expressed as:

$$m_n = \int r^n \rho(r)\,dr$$
If the distribution is discrete, the integral sign is replaced by a summation and the moment is given by:

$$m_n = \sum_i r_i^n\,Q_i$$

where $Q_i$ is the quantity located at distance $r_i$.
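We use the same idea from physics to define moments in statistics. For any data point $x$ obtained from a random variable $X$, if $c$ is the center (for example the mean value), then the distance is $r = x - c$ and the $n$-th order moment can be expressed in terms of the PDF $f_X(x)$ as:

$$m_n = \int_{-\infty}^{\infty} (x - c)^n f_X(x)\,dx$$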
When we place $c = 0$ in the above equation, the moment obtained is called the ‘raw moment’, while the moment with $c$ equal to the mean value $\mu$ is called the ‘central moment’.
It is interesting to observe that the zeroth order moment of any distribution is always equal to the total area under the distribution curve, which is always equal to 1, i.e., $m_0 = \int_{-\infty}^{\infty} f_X(x)\,dx = 1$, while the first order raw moment is given by:

$$m_1 = \int_{-\infty}^{\infty} x\,f_X(x)\,dx = E[X]$$
where $E[X]$ is the expected value of the random variable $X$, which is also equal to the mean value $\mu$.
Higher order raw moments are rarely used on their own, while the second order central moment is equal to the variance of the random variable, as given by:

$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\,dx = E[(X - \mu)^2]$$
Central moments of order higher than 2 are usually not used directly. Instead, the central moment is normalized with respect to powers of the standard deviation ($\sigma$). The normalized moment of order $n$ is expressed as:

$$\tilde{m}_n = \frac{E[(X - \mu)^n]}{\sigma^n}$$
For the normalized moment, $n = 3$ gives the skewness and $n = 4$ gives the kurtosis of the distribution.
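For a finite data set the integrals become averages, so the moments can be computed directly. The sketch below uses a small made-up sample and the population convention (dividing by the number of points); the function names are illustrative, and the orders 3 and 4 of the normalized moment are the quantities commonly called skewness and kurtosis.

```python
def central_moment(data, order):
    """Average of (x - mean)**order over the data (population convention)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)

def normalized_moment(data, order):
    """Central moment divided by the matching power of the standard deviation."""
    sigma = central_moment(data, 2) ** 0.5
    return central_moment(data, order) / sigma**order

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample values

mean = sum(data) / len(data)            # first raw moment
variance = central_moment(data, 2)      # second central moment
skewness = normalized_moment(data, 3)   # third normalized moment
kurtosis = normalized_moment(data, 4)   # fourth normalized moment

print(mean, variance, skewness, kurtosis)  # 5.0 4.0 0.65625 2.78125
```

The positive third normalized moment reflects the longer tail of this sample towards large values.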
Different distributions have different moments; in other words, the moments characterize a distribution function. As the name suggests, moment generating functions (MGF) are functions defined to generate the moments of a distribution. The MGF of a random variable $X$ is defined as a function of a real value $t$ by the following equation:

$$M_X(t) = E[e^{tX}]$$
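We can expand $e^{tX}$ using the Taylor series expansion as:

$$e^{tX} = 1 + tX + \frac{t^2 X^2}{2!} + \frac{t^3 X^3}{3!} + \dots$$

$$M_X(t) = E[e^{tX}] = 1 + t\,E[X] + \frac{t^2}{2!}\,E[X^2] + \frac{t^3}{3!}\,E[X^3] + \dots$$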
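The $n$-th moment is obtained by calculating the $n$-th order derivative of the MGF with respect to $t$ and substituting $t = 0$ in the final answer. For example:

$$E[X] = \left.\frac{d M_X(t)}{dt}\right|_{t=0}, \qquad E[X^2] = \left.\frac{d^2 M_X(t)}{dt^2}\right|_{t=0}$$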
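The derivative rule can be tried out numerically on a simple case. The sketch below assumes a Bernoulli variable with success probability 0.3, whose MGF is $(1 - p) + p\,e^{t}$, and approximates the derivatives at $t = 0$ with central finite differences rather than symbolic differentiation; `derivative_at_zero` is a hypothetical helper.

```python
import math

P = 0.3  # assumed success probability of the Bernoulli variable

def mgf(t):
    """MGF of a Bernoulli(P) variable: E[e^(tX)] = (1 - P) + P * e^t."""
    return (1 - P) + P * math.exp(t)

def derivative_at_zero(f, order, h=1e-3):
    """Central finite-difference estimate of the 1st or 2nd derivative at t = 0."""
    if order == 1:
        return (f(h) - f(-h)) / (2 * h)
    if order == 2:
        return (f(h) - 2 * f(0.0) + f(-h)) / h**2
    raise ValueError("only orders 1 and 2 are implemented in this sketch")

first_moment = derivative_at_zero(mgf, 1)    # approximates E[X] = P
second_moment = derivative_at_zero(mgf, 2)   # approximates E[X^2] = P
variance = second_moment - first_moment**2   # approximates P * (1 - P)

print(round(first_moment, 6), round(second_moment, 6), round(variance, 6))
```

For a variable that only takes the values 0 and 1, $X^2 = X$, so the first and second raw moments coincide and the variance comes out as $p(1 - p)$.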
The MGF is thus an alternative function to describe a probability distribution besides the PDF and CDF. It makes calculations easy when we want to find the distribution of a sum of independent random variables, because the exponential term in the MGF turns a sum into a product. For example, if we want to find the MGF of $Z = X + Y$, where $X$ and $Y$ are two independent random variables:

$$M_Z(t) = E[e^{t(X + Y)}] = E[e^{tX}]\,E[e^{tY}] = M_X(t)\,M_Y(t)$$
The MGF does not always exist, even when all the moments exist. But there is another function, called the characteristic function (CF), which is expressed in a similar way to the MGF and always exists for a given distribution, even if the MGF and PDF do not exist. The CF of a random variable $X$ is equal to the MGF of that random variable evaluated on the imaginary axis. The CF is defined as:

$$\varphi_X(t) = E[e^{itX}]$$
Thus a CF completely characterizes the probability distribution of a real valued random variable.
By now, we know that there are many different probability distributions and they have been given different names. In real life, some of the probability distributions are more common than others and have huge applications. Next we are going to discuss some of the important probability distributions along with some real life applications.
If $x$ is a possible outcome of the Bernoulli distribution with parameter $p$, we can express the PMF for this distribution as:

$$P(X = x) = p^x (1 - p)^{1 - x}, \qquad x \in \{0, 1\}$$
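A random variable $X$ which follows a Binomial distribution with $n$ independent trials and success probability $p$ is represented as $X \sim B(n, p)$. The PMF of a Binomial distribution, giving the probability of exactly $k$ successes, is represented as:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$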
where $k$ is any whole number with $0 \le k \le n$, and the binomial coefficient is

$$\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$$
Using the PMF, we can now easily calculate the probability of getting exactly 4 heads in 5 tosses of a fair coin as:

$$P(X = 4) = \binom{5}{4} (0.5)^4 (0.5)^{1} = \frac{5}{32} \approx 0.156$$
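The same number falls out of a direct implementation of the PMF. This is a minimal sketch using `math.comb` from the Python standard library for the binomial coefficient; the function name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Exactly 4 heads in 5 tosses of a fair coin.
p_four_heads = binomial_pmf(4, 5, 0.5)
print(p_four_heads)  # 0.15625

# The probabilities over all possible k add up to 1.
total = sum(binomial_pmf(k, 5, 0.5) for k in range(6))
print(total)
```

The value 0.15625 is exactly 5/32: there are five positions the single tail can take, and each specific sequence of five tosses has probability 1/32.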
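then the random vector $X = (X_1, X_2, \dots, X_k)$ with probabilities $p_1, p_2, \dots, p_k$ has a multinomial distribution, represented as $X \sim \text{Multinomial}(n; p_1, \dots, p_k)$. The PMF of such a multinomial distribution is given by:

$$P(X_1 = x_1, \dots, X_k = x_k) = \frac{n!}{x_1!\,x_2! \cdots x_k!}\; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}, \qquad \sum_{i=1}^{k} x_i = n$$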
The distribution of each individual random variable, however, is a binomial distribution, i.e., $X_1 \sim B(n, p_1)$, $X_2 \sim B(n, p_2)$, and so on.
Assume only three possible outcomes from an experiment, so that the random variable is $X = (X_1, X_2, X_3)$, and then

$$P(X_1 = x_1) = \binom{n}{x_1} p_1^{x_1} (1 - p_1)^{n - x_1}$$
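Next, we need to express the probability of $X_2 = x_2$ given $X_1 = x_1$, which is a conditional probability. That means we limit our total trials to $n - x_1$, and the new probability is $\frac{p_2}{1 - p_1}$:

$$P(X_2 = x_2 \mid X_1 = x_1) = \binom{n - x_1}{x_2} \left(\frac{p_2}{1 - p_1}\right)^{x_2} \left(\frac{p_3}{1 - p_1}\right)^{n - x_1 - x_2}$$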
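Now, for the last event, $P(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2)$ is 1, as $x_3 = n - x_1 - x_2$ is the only possible outcome remaining given all other outcomes. Thus, we can express the joint probability using the chain rule as:

$$P(x_1, x_2, x_3) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2) = \frac{n!}{x_1!\,x_2!\,x_3!}\; p_1^{x_1} p_2^{x_2} p_3^{x_3}$$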
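The chain-rule argument can be verified numerically by multiplying the binomial steps and comparing the result with the multinomial PMF. The counts and probabilities below are made-up illustration values, and both function names are hypothetical.

```python
from math import comb, factorial

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def multinomial_pmf(counts, probs):
    """n! / (x1! x2! ... xk!) * p1^x1 * p2^x2 * ... * pk^xk."""
    coefficient = factorial(sum(counts))
    for x in counts:
        coefficient //= factorial(x)
    result = float(coefficient)
    for x, p in zip(counts, probs):
        result *= p**x
    return result

# Made-up example: n = 10 trials with three possible outcomes.
x1, x2, x3 = counts = (2, 3, 5)
p1, p2, p3 = probs = (0.2, 0.3, 0.5)
n = sum(counts)

step1 = binomial_pmf(x1, n, p1)                    # P(X1 = x1)
step2 = binomial_pmf(x2, n - x1, p2 / (1 - p1))    # P(X2 = x2 | X1 = x1)
step3 = 1.0                                        # X3 is fixed once X1, X2 are known
chain = step1 * step2 * step3

direct = multinomial_pmf(counts, probs)
print(chain, direct)
```

The two values agree up to floating-point rounding, because the factors of $(1 - p_1)$ cancel and the two binomial coefficients multiply to the multinomial coefficient.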
Assume that $X$ is the number of times an event occurs in an interval. Thus, $X$ takes whole-number values $k = 0, 1, 2, \dots$. If the average rate of occurrence of the event is $\lambda$ and the occurrence of one event is not affected by the others, the probability of observing $k$ events in an interval is given by the following PMF,

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
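If $r$ is the rate of an event occurring per unit time and $t$ is the length of the time interval, then $\lambda = rt$ and the PMF can be expressed as:

$$P(X = k) = \frac{(rt)^k e^{-rt}}{k!}$$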
Again, let us take the example of World Cup football matches. If the average number of goals scored per game in all World Cup matches ever played is $\lambda$, the number of goals $k$ scored in a game obeys a Poisson distribution. Hence,

$$P(k \text{ goals in a game}) = \frac{\lambda^k e^{-\lambda}}{k!}$$
The PMF of the Poisson distribution is plotted below, with the y-axis showing the probability and the x-axis showing the number of goals scored per game:
The expected value and the variance of any Poisson distributed random variable $X$ are both equal to the average rate of occurrence of the event, $\lambda$:

$$E[X] = \text{Var}(X) = \lambda$$
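The Poisson PMF and the claim that mean and variance both equal $\lambda$ can be checked with a short script. The rate of 2.5 goals per game is an assumed value chosen for illustration, not a figure taken from match statistics, and the sums are truncated at 60 goals, beyond which the probabilities are negligible.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the average rate is lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)

LAM = 2.5  # assumed average number of goals per game (illustrative only)

for goals in range(6):
    print(goals, round(poisson_pmf(goals, LAM), 4))

# Truncated sums over k = 0..59 approximate the infinite sums.
ks = range(60)
total = sum(poisson_pmf(k, LAM) for k in ks)
mean = sum(k * poisson_pmf(k, LAM) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, LAM) for k in ks)

print(round(total, 6), round(mean, 6), round(variance, 6))
```

With this assumed rate, the most likely outcome is two goals, and both the mean and the variance come out at 2.5.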
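The PDF for given values of the interval limits $a$ and $b$ is plotted as shown below:
The CDF for such a continuous random variable is given by:

$$F_X(x) = \begin{cases} 0, & x < a \\ \dfrac{x - a}{b - a}, & a \le x \le b \\ 1, & x > b \end{cases}$$

The CDF for the same values of $a$ and $b$ is plotted as shown below: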
The mean and variance of a uniformly distributed random variable in the interval $[a, b]$ are given by:

$$E[X] = \frac{a + b}{2}, \qquad \text{Var}(X) = \frac{(b - a)^2}{12}$$
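The closed-form mean and variance can be compared against the defining integrals. The sketch below assumes the interval $[2, 6]$ as an example and uses a hand-written midpoint rule for the integration; all names are illustrative.

```python
A, B = 2.0, 6.0  # assumed interval limits for the example

def uniform_pdf(x):
    """Constant density 1 / (B - A) inside the interval, 0 outside."""
    return 1.0 / (B - A) if A <= x <= B else 0.0

def uniform_cdf(x):
    """0 below A, rising linearly to 1 at B."""
    if x < A:
        return 0.0
    if x > B:
        return 1.0
    return (x - A) / (B - A)

def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

mean = integrate(lambda x: x * uniform_pdf(x), A, B)
variance = integrate(lambda x: (x - mean) ** 2 * uniform_pdf(x), A, B)

print(round(mean, 6), (A + B) / 2)
print(round(variance, 6), (B - A) ** 2 / 12)
print(uniform_cdf(4.0))  # 0.5, halfway through the interval
```

For this interval the formulas give a mean of 4 and a variance of 16/12, and the numerical integrals agree with both.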
We can represent many real-life data sets with the normal distribution. The ‘central limit theorem (CLT)’ states that when independent random variables with finite variance are added, their normalized sum tends toward a normal distribution as the number of variables grows, even if the original random variables are not normally distributed. The CLT provides the theoretical basis for approximating the sum of many random variables by a normal distribution. This is also why, in many real-life instances where a value results from many independent contributions, it is often reasonable to assume that such values are approximately normally distributed.
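A small simulation illustrates the theorem. The sketch below adds uniform random numbers, which are clearly not normally distributed, and checks whether the normalized sums behave like a standard normal variable. The number of terms per sum, the number of sums and the random seed are all arbitrary choices made for this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

N_TERMS = 30      # uniform variables added per sum (assumed)
N_SUMS = 20_000   # number of sums simulated (assumed)

# Each term is Uniform(0, 1), with mean 1/2 and variance 1/12.
mean_of_sum = N_TERMS * 0.5
std_of_sum = (N_TERMS / 12) ** 0.5

# Normalized sums: subtract the mean of the sum and divide by its standard deviation.
normalized = [
    (sum(random.random() for _ in range(N_TERMS)) - mean_of_sum) / std_of_sum
    for _ in range(N_SUMS)
]

sample_mean = statistics.mean(normalized)
sample_std = statistics.pstdev(normalized)
share_within_one = sum(abs(z) <= 1 for z in normalized) / N_SUMS

print(round(sample_mean, 3), round(sample_std, 3), round(share_within_one, 3))
```

If the normalized sums are close to standard normal, the sample mean should be near 0, the standard deviation near 1, and roughly 68% of the values should fall within one standard deviation of zero.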
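The PDF of a normally distributed random variable $X$ can be characterized in terms of its mean ($\mu$) and variance ($\sigma^2$) as given by the following equation:

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$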
When $\mu = 0$ and $\sigma^2 = 1$, the distribution is called the standard normal distribution and the PDF of such a standard normal distribution is given by:

$$f_X(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$
Below, you can see the plot of the PDF of a standard normal distributed random variable with 100000 samples. The black line is the mean value, which is very close to zero. The green lines mark one standard deviation in both directions. For any standard normal distributed random variable, the interval within 1 standard deviation of the mean covers about 68.27% of the total area, the interval within 2 standard deviations covers about 95.45%, and the interval within 3 standard deviations covers about 99.73%.
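These coverage figures follow from the standard normal CDF and can be computed with the error function from the Python standard library, using the identity $P(|Z| \le k) = \operatorname{erf}(k / \sqrt{2})$. The function name `coverage` is illustrative.

```python
import math

def coverage(k):
    """P(|Z| <= k) for a standard normal Z, which equals erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, f"{coverage(k):.2%}")
```

The three values are the basis of the familiar 68-95-99.7 rule of thumb.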
We can express a general normal distribution in terms of the standard normal distribution.
The term $z = \frac{x - \mu}{\sigma}$ is called the standard score or z score, which gives the number of standard deviations by which the data point $x$ differs from the mean value. For any normally distributed random variable $X$ with mean $E[X] = \mu$ and standard deviation $\sigma$, another random variable

$$Z = \frac{X - \mu}{\sigma}$$

is standard normal distributed.
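Standardization means that any question about a general normal variable can be answered with the standard normal distribution alone. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the mean of 170 and standard deviation of 8, read for instance as heights in cm, are assumed values.

```python
from statistics import NormalDist

MU, SIGMA = 170.0, 8.0  # assumed mean and standard deviation, e.g. heights in cm

def z_score(x, mu, sigma):
    """Number of standard deviations by which x differs from the mean."""
    return (x - mu) / sigma

x = 182.0
z = z_score(x, MU, SIGMA)
print(z)  # 1.5

# P(X <= x) from the general normal distribution ...
p_general = NormalDist(MU, SIGMA).cdf(x)
# ... equals P(Z <= z) from the standard normal distribution.
p_standard = NormalDist().cdf(z)

print(round(p_general, 4), round(p_standard, 4))
```

Both probabilities coincide, which is why tables of the standard normal distribution are sufficient for any choice of mean and standard deviation.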
When the variance $\sigma^2$ increases, the spread of the curve increases, as shown in the figure below for a fixed mean value $\mu$. A wider curve indicates that the data varies widely, i.e. the range of the data is large.
When the mean $\mu$ changes, the position of the curve shifts away from the center, as shown in the figure below for a fixed variance $\sigma^2$. When $\mu$ is zero, the distribution is symmetric about zero, so positive and negative values are equally likely. If the data represents the error of a measurement (the difference between the estimated and the actual value), then it is likely to contain positive and negative values almost equally, with a mean value close to zero. Such data is obtained from an ‘unbiased estimator’ and the distribution is called unbiased. However, when the mean shifts from 0 towards a positive or negative value, the data is biased towards positive or negative error and the estimator is not perfect. Such data is obtained from a ‘biased estimator’.
While designing an estimator, we want to minimize the bias as much as possible, but we also want to minimize the variance. For given data, reducing the bias typically increases the variance and vice versa.
We will learn more about bias and variance while dealing with supervised learning models. There we will touch upon an interesting topic called “Bias Variance trade-off” while explaining terms such as “overfitting” and “underfitting”.
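The CDF of a random variable $X$ with normal distribution is expressed in terms of its PDF as:

$$F_X(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(t - \mu)^2}{2\sigma^2}\right) dt$$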
and the CDF of a random variable $X$ which has the standard normal distribution is expressed in terms of its PDF as:

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\frac{t^2}{2}\right) dt$$
The CDF of a normal distribution with different values of mean and variance is plotted as shown below: