Implementing decision trees with the C4.5 algorithm - Part 2

In this post, we provide a precise definition of entropy to better prepare for the upcoming articles.

Truly understanding entropy

As previously mentioned, our decision trees are built with the C4.5 algorithm, which in turn relies on the concept of entropy. In this section, we will attempt to clarify this crucial notion, introduced by Shannon in the late 1940s.

Very important

We should not be intimidated by the term "entropy". Although it may evoke memories of thermodynamics and seem daunting, we will discover that it is entirely comprehensible when presented effectively.

You should call it entropy (...). As nobody knows what entropy is, whenever you use the term you will always be at an advantage! (Von Neumann to Shannon)

Entropy originates from information theory, and it is essential to address this aspect initially.

What is information theory?

Strictly speaking, information theory is a branch of applied mathematics and electrical engineering involving the quantification of information. Specifically, it seeks to quantify the information content within a message and explore methods for compressing it to facilitate efficient transmission to another location.

This definition is somewhat informal and may appear abstract, but in reality, we unconsciously apply its principles in our daily lives.

In our everyday life, we compress information.

Here, David is condensing information by informing Samantha that John visited Italy during his holidays. Once Samantha learns that John discussed Italy, she can infer that he might have highlighted various famous monuments or shared other well-known facts about the country. In any case, she wouldn't be astonished if David mentioned that John talked about the Colosseum or mentioned visiting Rome or Milan.

But if David suddenly mentioned something unexpected, Samantha might be surprised because this information was not evident within the given context.

Entropy simply involves expressing this situation in mathematical terms, nothing more and nothing less. Entropy attempts to quantify the level of surprise or uncertainty in a message.

$$entropy(\text{He visited Rome})=0$$ $$entropy(\text{He met his girlfriend})>0$$

And how do we do that? Enter the surprise concept

Therefore, entropy seeks to mathematically express the idea that certain events are predictable and do not provide significant information, while others are highly rare (or even impossible) and, conversely, offer valuable information. Reconsider the previous sentence: we use terms like events, certainty, and impossibility. This terminology is linked to probability, and it is fitting to utilize probability theory to articulate our concepts.

Informally, the lower the probability we assign to an event, the more surprising it is, so surprise seems to be some kind of decreasing function of probability.

$$P(\text{He visited Italy and met Italian people}) = 1 \implies surprise(\text{He visited Italy and met Italian people})=0$$ $$P(\text{He visited Italy and met Dante}) = 0 \implies surprise(\text{He visited Italy and met Dante})=+\infty$$

As a probability can only exist within the range of 0 and 1, our objective is to identify a function, denoted as $I$, which maps from $[0,1]$ to $[0, +\infty]$, with the conditions that $I(0)=+\infty$ and $I(1)=0$.

The following functions satisfy these conditions and can be considered as candidates for our definition of surprise.

$$I(P)=\dfrac{1}{P}-1$$ $$I(P)=-\log_{2}P$$ $$I(P)=-\log_{10}P$$ $$I(P)=-\ln P$$ $$I(P)=...$$
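
To make this concrete, here is a small Python sketch (the article does not ship any code at this point, so the names `candidates` and `surprise` are purely illustrative) that evaluates these candidate functions at a certain event and at a very unlikely one:

```python
import math

# Candidate "surprise" functions I : [0, 1] -> [0, +inf],
# with I(1) = 0 (no surprise) and I(P) -> +inf as P -> 0 (total surprise).
candidates = {
    "1/P - 1":   lambda p: 1 / p - 1,
    "-log2(P)":  lambda p: -math.log2(p),
    "-log10(P)": lambda p: -math.log10(p),
    "-ln(P)":    lambda p: -math.log(p),
}

for name, surprise in candidates.items():
    # A certain event carries no surprise; a very unlikely one, a lot of it.
    print(f"{name:<10} I(1) = {surprise(1.0):.2f}, I(0.001) = {surprise(0.001):.2f}")
```

All of them vanish for a certain event and grow without bound as the probability approaches zero.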

Which function should we ultimately choose?

In fact, we are not at the end of the story. Since we can choose among several functions, it would be advantageous to select one that not only quantifies surprise but also satisfies some desirable properties. Wouldn't it be wonderful if we could kill two birds with one stone?

Consider event $A$: "John met his girlfriend at the Tower of Pisa" and event $B$: "John found money on the ground". These two events are clearly independent (John could meet his girlfriend without finding money or find money without meeting his girlfriend), and we can observe that Samantha's surprise progressively increases when she hears about both of these events: her surprise is ADDITIVE.

Mathematically, we would like to have $I(P(A,B))=I(P(A)P(B))=I(P(A))+I(P(B))$. This is a Cauchy functional equation, and we know that its (continuous) solutions are logarithms.

More precisely, there must exist $a \in \mathbb{R}$ such that $I(P)=a\ln P$. If $a=-1$, then $I(P)=-\ln P$; if $a=-\dfrac{1}{\ln 2}$, then $I(P)=-\log_{2}P$, and so forth.

Shannon chose $a=-\dfrac{1}{\ln 2}$, that is, the base-2 logarithm.

$$\boxed{I(P)=-\log_{2}P}$$
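
As a quick sanity check (in Python, purely for illustration; the probabilities 0.25 and 0.5 below are made up), we can verify that this choice of $I$ is indeed additive for independent events:

```python
import math

def surprise(p: float) -> float:
    """Shannon's surprise (self-information), in bits, of an event of probability p."""
    return -math.log2(p)

# Two independent events with hypothetical probabilities.
p_a, p_b = 0.25, 0.5

joint = surprise(p_a * p_b)               # I(P(A)P(B)), since A and B are independent
separate = surprise(p_a) + surprise(p_b)  # I(P(A)) + I(P(B))

print(joint, separate)  # both equal 3.0 bits: the surprise is additive
```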

And where did entropy go?

Consider the two following scenarios:

  • A fair coin is tossed once. What is the expected outcome? If it lands heads, we would not be very surprised, but we have gained information we did not have. If it lands tails, we would not be very surprised either, but once again we have gained information we did not have. In any case, regardless of the outcome, we have gained information.

  • A biased coin (which lands heads 9 times out of 10) is tossed once. What is the expected outcome? If it lands heads, we would not be surprised at all, and, in fact, we can say that not much information has been gained. Conversely, if it lands tails, we would be quite surprised and would have gained information.

Entropy attempts to quantify these facts: how much information can we, on average, obtain after the toss? In this sense, entropy is simply an expectation in the probabilistic sense: the expected value of the surprise.

Entropy is usually denoted as $H$. In the first case, $H(P)=\dfrac{1}{2}(-\log_{2}\dfrac{1}{2}) + \dfrac{1}{2}(-\log_{2}\dfrac{1}{2})=1$. In the second case, $H(P)=\dfrac{1}{10}(-\log_{2}\dfrac{1}{10}) + \dfrac{9}{10}(-\log_{2}\dfrac{9}{10})\approx 0.469$.

This result aligns with intuition: in the first case, we consistently gain information every time, whereas in the second case, we gain information infrequently. It seems normal that the entropy in the first case is higher than in the second case.
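
For the record, here is how these two values can be reproduced in a few lines of Python (a throwaway sketch, not necessarily the implementation of the upcoming article):

```python
import math

def entropy(probabilities):
    """Average surprise, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # fair coin   -> 1.0
print(entropy([0.1, 0.9]))  # biased coin -> ~0.469
```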

Very important

More generally, entropy is a measure of the average amount of uncertainty or surprise associated with a random variable. The entropy $H$ of a discrete random variable $X$ is defined by the formula $H(X)=-\displaystyle\sum_{i=1}^{n}P(x_{i})\log_{2}P(x_{i})$

Here, $P(x_{i})$ is the probability of outcome $x_{i}$ and the sum is taken over all possible outcomes of $X$.

In simple terms, entropy quantifies the amount of information contained in a random variable. It reaches its maximum when all outcomes are equally likely, and it decreases as the probability distribution becomes more certain or focused.

In the context of decision trees and the C4.5 algorithm, entropy will often be used to measure the impurity or disorder in a set of data points, helping guide the process of choosing splits to build an effective decision tree.
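
To anticipate a little, here is a sketch of how this could look once applied to a dataset's class labels, which is roughly the form in which C4.5 uses entropy to measure impurity (the function name `label_entropy` and the toy labels are illustrative, not the code of the upcoming article):

```python
from collections import Counter
import math

def label_entropy(labels):
    """Entropy, in bits, of the class-label distribution of a dataset."""
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())

print(label_entropy(["yes", "no", "yes", "yes"]))   # ~0.811: impure (mixed) subset
print(label_entropy(["yes", "yes", "yes", "yes"]))  # 0.0: pure subset, nothing left to learn
print(label_entropy(["yes", "no", "yes", "no"]))    # 1.0: maximal uncertainty for two classes
```

A pure subset has zero entropy, while a maximally mixed one has the highest, which is exactly the notion of impurity mentioned above.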

What does it quantify concretely?

Imagine a scenario where an individual is tasked with selecting a number between 1 and 100, and our goal is to guess this number by asking binary (yes/no) questions. How should we approach this? As developers, we understand that the most effective method is to employ a dichotomy (binary search), as depicted in the figure below.

Drawing on our knowledge from computer science, we recognize that the number of questions required to arrive at the chosen number is about $\log_{2}100 \approx 6.64$, that is, at most $7$. If we calculate the entropy $H$ for this scenario, we have $H=-\displaystyle\sum_{i=1}^{100}\dfrac{1}{100}\log_{2}\dfrac{1}{100}=\log_{2}100$. Therefore, entropy measures the minimum average number of binary questions (bits, in information-theory jargon) needed to acquire all the information contained in a message.
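
The sketch below (again in Python, with a hypothetical `count_questions` helper) plays the guessing game by dichotomy and confirms that 7 questions are always enough for 100 equally likely numbers, in line with $\log_{2}100 \approx 6.64$:

```python
import math

def count_questions(secret: int, low: int = 1, high: int = 100) -> int:
    """Number of yes/no questions a dichotomy needs to pin down `secret` in [low, high]."""
    questions = 0
    while low < high:
        mid = (low + high) // 2
        questions += 1
        if secret <= mid:   # question: "Is the number <= mid?"
            high = mid
        else:
            low = mid + 1
    return questions

worst_case = max(count_questions(n) for n in range(1, 101))
print(worst_case)          # 7 = ceil(log2(100)) questions in the worst case
print(math.log2(100))      # ~6.64 bits: the entropy of a uniform choice among 100 numbers
```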

Now, it's time to put this theory into practice with our decision trees. Stay tuned for the upcoming article on this topic.