Truly understanding logistic regression - Part 2

In this article, we show how the mathematics of logistic regression emerges naturally from basic probability theory, which explains why it is so widely used.

What is supervised learning?

Supervised learning is a machine learning paradigm in which the algorithm is trained on a labeled dataset, meaning that each training input is paired with a corresponding output label. The goal of supervised learning is for the algorithm to learn the mapping between the input features and the output labels. Once the model is trained, it can make predictions or decisions when given new, unseen input data.

The process involves the following key steps:

  • A labeled dataset is used for training, consisting of input-output pairs. The input represents the features or attributes of the data, while the output is the labeled or desired outcome.

  • An algorithm processes the training data to learn the patterns and relationships between the input features and output labels. During this training phase, the algorithm adjusts its internal parameters to minimize the difference between its predictions and the actual labels.

  • Once trained and validated, the model can be used to make predictions or decisions on new, unseen data. It takes the input features and generates output predictions based on the learned patterns.

Supervised learning is itself categorized into two main types: regression and classification.

Example of regression

In a regression task, the algorithm predicts a continuous value or quantity. As an example, we can forecast the price of a house by considering its size (in square meters). A sample of training data is presented below.

Size (in square meters) | Price (in $)
120 | 500000
40 | 100000
42 | 105000
64 | 165000
76 | 185000
55 | 145000
... | ...

What is the price of a 100-square-meter house?

This example illustrates precisely what we've defined: the training data is merely the table above, with one input variable (size) and one output variable (price). The next step involves selecting an algorithm to establish the relationship between this input and output, enabling us to predict unseen data. For instance, we could determine the price of a 100-square-meter house using this trained model.
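As a rough sketch of how such a model could be obtained in practice, we can fit a straight line to the table above with NumPy; the choice of a linear model and the resulting prediction are purely illustrative.

```python
import numpy as np

# Training data from the table above (size in square meters, price in $)
sizes = np.array([120, 40, 42, 64, 76, 55], dtype=float)
prices = np.array([500000, 100000, 105000, 165000, 185000, 145000], dtype=float)

# Fit a straight line: price ≈ slope * size + intercept (least squares)
slope, intercept = np.polyfit(sizes, prices, deg=1)

# Predict the price of a 100-square-meter house
print(slope * 100 + intercept)
```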

Example of classification

In a classification task, the algorithm assigns input data to predefined categories or classes. As an illustration, we can predict the gender (M or W) of an individual by taking into account their height (in centimeters). A sample of training data is presented below.

What is the gender of a 172-centimeter individual?

This data highlights the challenge of a classification task. It is evident from the chart above that predicting the gender of an individual who is 172 centimeters tall is not straightforward: it could be either a man or a woman. Instead, the idea is to determine the probability that an individual is a man or a woman based on their height. In mathematical terms, we aim to predict $P(M|h)$, where $h$ denotes the height, and given the binary nature of the classes, we can also write $P(W|h) = 1 - P(M|h)$.

Disclaimer 1

The latter situation is unrealistic. In a real-world context, there would typically be several observed variables, such as weight or foot size. However, the calculations become more intricate in multiple dimensions and the visualizations are harder to produce, so for illustrative purposes we intentionally confine ourselves to a one-dimensional setting.

Disclaimer 2

The example we used here pertains to a binary classification between two classes. However, it's important to note that the principles we're outlining remain applicable for scenarios involving $K$ classes, where $K \gt 2$.

In the subsequent parts of this series, our emphasis is directed toward classification tasks.

Enter Bayes' theorem

Bayes' theorem is a fundamental principle in probability theory that describes the probability of an event based on prior knowledge of conditions that might be related to the event. Named after the Reverend Thomas Bayes, who formulated the theorem, it is particularly useful in updating probabilities when new evidence becomes available.

The theorem is expressed mathematically as $P(A|B)= \dfrac{P(B|A)P(A)}{P(B)}$

  • $P(A|B)$ is the probability of event $A$ occurring given that event $B$ has occurred.
  • $P(B|A)$ is the probability of event $B$ occurring given that event $A$ has occurred.
  • $P(A)$ is the prior probability of event $A$.
  • $P(B)$ is the marginal probability of event $B$ (the evidence).

Bayes' theorem allows us to update our belief in the probability of an event ($A$) based on new evidence ($B$).

Applying Bayes' theorem to our earlier classification task, we can express it as $P(M|h)= \dfrac{P(h|M)P(M)}{P(h)}$.

From this point, we can use the law of total probability to express $P(h)$. Indeed, the $M$ and $W$ categories form a set of pairwise disjoint events whose union is the entire sample space.

$$P(h)=P(h|M)P(M)+P(h|W)P(W)$$

It now remains to combine this formula with Bayes' theorem.

$$P(M|h)=\dfrac{P(h|M)P(M)}{P(h|M)P(M)+P(h|W)P(W)}$$

$$P(M|h)=\dfrac{P(h|M)P(M)}{P(h|M)P(M)(1+\dfrac{P(h|W)P(W)}{P(h|M)P(M)})}$$

$$P(M|h)=\dfrac{1}{1+e^{-\ln{\dfrac{P(h|M)P(M)}{P(h|W)P(W)}}}}$$

Thus, $P(M|h)=\sigma(\ln{\dfrac{P(h|M)P(M)}{P(h|W)P(W)}})$ where $\sigma$ is conventionally referred to as the sigmoid.

$$\sigma(x)=\dfrac{1}{1+e^{-x}}$$

We are simply engaging in a straightforward rewriting of mathematical functions, and at first glance this process may appear somewhat trivial. However, it serves to demonstrate that the sigmoid function is not arbitrarily chosen but emerges naturally from the formulas. Moreover, we will explore specific cases to illustrate why such a rewrite is useful.
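Before exploring those cases, here is a quick numerical sanity check of the rewrite. The class-conditional probabilities and priors below are made-up values, chosen purely for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values of the quantities involved (illustration only)
p_h_given_M, p_M = 0.10, 0.5  # class-conditional probability and prior for M
p_h_given_W, p_W = 0.05, 0.5  # class-conditional probability and prior for W

# Bayes' theorem combined with the law of total probability
direct = (p_h_given_M * p_M) / (p_h_given_M * p_M + p_h_given_W * p_W)

# The same quantity rewritten as a sigmoid of a log-ratio
rewritten = sigmoid(math.log((p_h_given_M * p_M) / (p_h_given_W * p_W)))

print(direct, rewritten)  # the two values are identical (about 0.667)
```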

What happens if the distribution is Gaussian?

In the preceding formulas, we need to specify the values of $P(h|M)$, $P(h|W)$, $P(M)$ and $P(W)$.

  • The quantities $P(h|M)$ and $P(h|W)$ are class-conditional probabilities: given, for example, that an individual belongs to the $M$ category, how is their height $h$ distributed?

A picture is worth a thousand words, so let's visualize the distribution of points belonging to the $M$ category.

Caution: The distribution above does not constitute a probability distribution.

The distribution bears a strong resemblance to a Gaussian, indicating that we can effectively model the class-conditional probability using such a distribution.

$$P(h|M)=\dfrac{1}{\sigma_{M}\sqrt{2\pi }}e^{-{\dfrac{(h-\mu_{M})^2}{2\sigma_{M}^2}}}$$

$\mu_{M}$ represents the mean of the Gaussian ($\mu_{M}=180$ from the figure above), and $\sigma_{M}$ is the standard deviation.

Similarly, we can model the class-conditional probability for the $W$ category.

Caution: The distribution above does not constitute a probability distribution.

Once again, the distribution bears a strong resemblance to a Gaussian, indicating that we can effectively model the class-conditional probability using such a distribution.

$$P(h|W)=\dfrac{1}{\sigma_{W}\sqrt{2\pi }}e^{-{\dfrac{(h-\mu_{W})^2}{2\sigma_{W}^2}}}$$

$\mu_{W}$ represents the mean of the Gaussian ($\mu_{W}=165$ from the figure), and $\sigma_{W}$ is the standard deviation.

We can reasonably assume that $\sigma_{M}=\sigma_{W}=\sigma$ (indicating the same standard deviations in both cases).

$$\ln{\dfrac{P(h|M)}{P(h|W)}}=\ln{\left(\dfrac{\sigma\sqrt{2\pi}}{\sigma\sqrt{2\pi}}\,e^{\dfrac{(h-\mu_{W})^2}{2\sigma^2}-\dfrac{(h-\mu_{M})^2}{2\sigma^2}}\right)}$$

Expanding the squares, the quadratic terms in $h$ cancel and we are left with a linear function of $h$:

$$\ln{\dfrac{P(h|M)}{P(h|W)}}=\dfrac{2h(\mu_{M}-\mu_{W})+(\mu_{W}^2-\mu_{M}^2)}{2\sigma^2}$$

  • $P(M)$ and $P(W)$ are the prior probabilities of being in the $M$ or $W$ category. Given that there are roughly equal numbers of men and women in the overall population, we will assume that $P(M)=P(W)$, so the term $\ln{\dfrac{P(M)}{P(W)}}$ vanishes from the argument of the sigmoid.

Therefore, by plugging all these expressions into the calculations above, we can ultimately write $P(M|h)$ as a sigmoid whose argument is a linear function of $h$.

$$P(M|h)=\sigma(wh+b)$$
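Reading the slope and intercept off the derivation above (equal priors, shared standard deviation $\sigma$), we get

$$w=\dfrac{\mu_{M}-\mu_{W}}{\sigma^2} \qquad b=\dfrac{\mu_{W}^2-\mu_{M}^2}{2\sigma^2}$$

We can check this numerically by comparing the posterior obtained from Bayes' theorem with Gaussian class-conditionals against $\sigma(wh+b)$. The sketch below uses the means from the figures and an assumed standard deviation of 7 centimeters, an arbitrary illustrative value.

```python
import math

def gaussian_pdf(x, mu, sigma):
    # Density of a normal distribution with mean mu and standard deviation sigma
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Means read from the figures above; the standard deviation is an assumed value
mu_M, mu_W, std = 180.0, 165.0, 7.0
p_M = p_W = 0.5  # equal priors

# Slope and intercept predicted by the derivation
w = (mu_M - mu_W) / std ** 2
b = (mu_W ** 2 - mu_M ** 2) / (2 * std ** 2)

for h in (160.0, 172.0, 185.0):
    bayes = gaussian_pdf(h, mu_M, std) * p_M / (
        gaussian_pdf(h, mu_M, std) * p_M + gaussian_pdf(h, mu_W, std) * p_W
    )
    print(h, round(bayes, 6), round(sigmoid(w * h + b), 6))  # the two columns match
```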

What happens with discrete features?

In the case of the Bernoulli distribution, for example, we can show that the argument of the sigmoid is still a linear function of the features. Readers interested in a detailed proof of this assertion can refer to Pattern Recognition and Machine Learning.

Under rather general assumptions, the posterior probability of class $C_{1}$ can be written as a logistic sigmoid acting on a linear function of the feature vector $\phi$ so that $P(C_{1}|\phi)=\sigma(w^T\phi)$.
Bishop (Pattern Recognition and Machine Learning)
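To get a feel for why this holds, consider the simplest case of a single binary feature $x \in \{0,1\}$ whose class-conditional distributions are Bernoulli, say $P(x|M)=p_{M}^{x}(1-p_{M})^{1-x}$ and $P(x|W)=p_{W}^{x}(1-p_{W})^{1-x}$ (the notation simply mirrors our earlier example). Taking the log-ratio as before gives

$$\ln{\dfrac{P(x|M)}{P(x|W)}}=x\ln{\dfrac{p_{M}}{p_{W}}}+(1-x)\ln{\dfrac{1-p_{M}}{1-p_{W}}}$$

which is again a linear function of the feature $x$, so the posterior is once more a sigmoid applied to a linear expression.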

What is logistic regression?

The goal is to extend the previous findings by stipulating that $P(C|\text{input variables})$ is a sigmoid whose argument is a linear function of the features. It's important to note that we do not assume the class-conditional probabilities to follow a Gaussian or Bernoulli distribution. Instead, we draw inspiration from these models to prescribe the form of the probability.

This method of imposing a priori a linear function of the features inside the sigmoid is termed logistic regression. The inverse of the sigmoid, the logit, is referred to as the link function.

Important 1

This approach can be more broadly generalized by allowing flexibility in choosing the link function (in this case, we mandated it to be a sigmoid). This extension gives rise to the concept of a generalized linear model (GLM), a topic extensively explored in the literature.

Important 2

Logistic regression belongs to the family of models known as discriminative models. In this approach, we impose the form of the posterior probability a priori, without making specific assumptions about the class-conditional probabilities and without employing Bayes' theorem. In contrast, models that start by modeling the class-conditional probabilities are referred to as generative models.
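To make the distinction concrete, here is a minimal sketch of the discriminative route using scikit-learn. The data is synthetic: heights drawn from the two Gaussians above, with an assumed standard deviation of 7 centimeters, so the numbers are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic heights: means from the figures above, assumed standard deviation of 7 cm
heights_M = rng.normal(180.0, 7.0, size=500)
heights_W = rng.normal(165.0, 7.0, size=500)

X = np.concatenate([heights_M, heights_W]).reshape(-1, 1)
y = np.concatenate([np.ones(500), np.zeros(500)])  # 1 = M, 0 = W

# Discriminative approach: fit sigma(w*h + b) directly,
# without modeling P(h|M) or P(h|W) and without Bayes' theorem
clf = LogisticRegression()
clf.fit(X, y)

print(clf.coef_[0, 0], clf.intercept_[0])  # learned w and b
print(clf.predict_proba([[172.0]])[0, 1])  # estimated P(M | h = 172)
```

Up to sampling noise and the L2 regularization scikit-learn applies by default, the learned coefficients should be close to the $w$ and $b$ derived analytically above.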

It is time to delve into the process of training the logistic regression algorithm, which will be the focus of the next article.