Implementing decision trees with the C4.5 algorithm - Part 4

In this final article, we explore the C4.5 algorithm as an enhancement of the ID3 algorithm and how to implement it in ML.NET.

What is the C4.5 algorithm ?

The C4.5 algorithm is a successor to the ID3 algorithm. It was developed by Ross Quinlan and improves on ID3 in its handling of both categorical and continuous data, as well as missing values. While ID3 works well with categorical data, C4.5 introduces a mechanism to handle continuous data by selecting optimal thresholds for splitting. This makes C4.5 more versatile across different types of datasets.

In summary

In summary, the C4.5 algorithm is an extension and improvement upon the ID3 algorithm, making it more robust and applicable to a broader range of datasets, including those with continuous attributes and missing values.

Important

Additionally, C4.5 incorporates a pruning mechanism to avoid overfitting. After building the tree, branches that do not significantly improve the model's accuracy on unseen data may be pruned away. This prevents the tree from becoming too complex and too specific to the training data, promoting better generalization. **We won't cover this aspect in this post.**

Understanding drawbacks of the information gain

The ID3 algorithm is straightforward to comprehend and implement, especially once we grasp the concept of entropy. However, this simplicity conceals several significant drawbacks. For instance, in the dataset we previously used, the values of each feature were fairly evenly distributed across the records.

| Height | Hair   | Eyes  | Attractive ? |
|--------|--------|-------|--------------|
| Small  | Blonde | Brown | No           |
| Tall   | Dark   | Brown | No           |
| Tall   | Blonde | Blue  | Yes          |
| Tall   | Dark   | Blue  | No           |
| Small  | Dark   | Blue  | No           |
| Tall   | Red    | Blue  | Yes          |
| Tall   | Blonde | Brown | No           |
| Small  | Blonde | Blue  | Yes          |

Indeed, the dataset comprises a total of 8 rows. Among these 8 rows, the Height feature has two different values quite evenly distributed (3 Small and 5 Tall), the Hair feature has three different values quite evenly distributed (4 Blonde, 3 Dark and 1 Red), and the Eyes feature has two different values quite evenly distributed (3 Brown and 5 Blue).

But consider now a dataset where one of the features is the age of a woman's first pregnancy. Typically, in many countries, women have their first child between 20 and 35 years old. Some women have children after 40 and even up to 50, but these cases are relatively rare.

[Illustrations: Samantha is not surprised by Jennifer's age. / Samantha is very surprised by Jennifer's age.]

The dataset could nevertheless resemble the following.

| Age first pregnancy | Hair   | Working status |
|---------------------|--------|----------------|
| 15                  | Blonde | Unemployed     |
| 15                  | Dark   | Unemployed     |
| ...                 | ...    | ...            |
| 25                  | Blonde | Employed       |
| 25                  | Dark   | Employed       |
| ...                 | ...    | ...            |
| 28                  | Dark   | Unemployed     |
| 28                  | Dark   | Employed       |
| ...                 | ...    | ...            |
| 35                  | Dark   | Unemployed     |
| 35                  | Blonde | Employed       |
| ...                 | ...    | ...            |
| 42                  | Dark   | Unemployed     |
| 44                  | Dark   | Employed       |
| 48                  | Blonde | Employed       |
| 50                  | Blonde | Unemployed     |

This dataset could comprise a total of 10000 rows. Among these 10000 rows, the Age first pregnancy feature has 36 different values that are extremely unevenly distributed (for instance, 800 records with age 25, 850 with age 28, 1 with age 44, 0 with age 46 and 1 with age 50). Splitting on this feature produces many small, nearly pure partitions whose entropy is low (uncertainty is low), resulting in a high information gain. Consequently, this feature will be chosen to split the data. But how will the tree be constructed then ?

The tree will be able to split the data further for women whose age at first pregnancy is 25, 28 or 35. But how can it proceed for the woman whose age at first pregnancy is 50 ? As there is only one example, it will simply memorize this single record (with Hair=Blonde). Now, what will happen if a dark-haired 50-year-old woman is evaluated against this decision tree ? As there is no path matching these attribute values, it will return the only answer it has. This portion of the tree is too tailored to the training dataset.

This illustrates overfitting: the model fits the training examples very well but does not generalize well to unseen data.

To remember

Information gain is biased toward features with many distinct values (high-branching features).

Enter the gain ratio

To mitigate the bias introduced by information gain, Quinlan proposed another heuristic, named the gain ratio ($IGR$). He first defines the split information value for a specific feature $f$ as follows.

$$SplitInformation(S,f)=-\displaystyle\sum_{i=1}^{N} \dfrac{N(x_{i})}{N(S)} log_{2}\dfrac{N(x_{i})}{N(S)}$$

$x_{1}$, $x_{2}$, ..., $x_{N}$ are the different possible values of $f$, $N(x_{i})$ is the number of times that $x_{i}$ occurs, and $N(S)$ is the total number of records, for all $i \in [1,N]$.

Then, he defines the gain ratio $IGR(S,f)$ for this feature as simply the ratio of the information gain $IG(S,f)$ and the split information.

$$IGR(S,f)=\dfrac{IG(S,f)}{SplitInformation(S,f)}$$

The intuition behind this formula is that a feature with a high number of distinct values (like the age in our example above) may produce a high information gain, but it also produces a high split information. Dividing by the latter therefore lowers its gain ratio.
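
To make these two formulas concrete, here is a minimal C# sketch (the class and method names `GainRatioMath`, `Entropy`, `SplitInformation` and `GainRatio` are ours, not part of any library; `Math.Log2` assumes .NET Core 3.0 or later) that computes the split information and the gain ratio from raw counts.

```csharp
using System;
using System.Linq;

public static class GainRatioMath
{
    // Shannon entropy of a discrete distribution given by raw counts.
    public static double Entropy(int[] counts)
    {
        double total = counts.Sum();
        return -counts.Where(c => c > 0)
                      .Sum(c => (c / total) * Math.Log2(c / total));
    }

    // SplitInformation(S, f): the entropy of the value distribution of the feature f itself,
    // i.e. a measure of how finely (and how evenly) f partitions the dataset.
    public static double SplitInformation(int[] valueCounts) => Entropy(valueCounts);

    // IGR(S, f) = IG(S, f) / SplitInformation(S, f).
    public static double GainRatio(double informationGain, int[] valueCounts)
    {
        double splitInformation = SplitInformation(valueCounts);
        return splitInformation == 0 ? 0 : informationGain / splitInformation;
    }
}
```

For the Height feature of the toy dataset (3 Small and 5 Tall), `GainRatioMath.SplitInformation(new[] { 3, 5 })` returns approximately 0.954, matching the hand computation below.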

A worked example, please !

We will revisit the example from the previous post regarding which features are important for the attractiveness of a person. Below, we have copied the information gains that we had calculated earlier.

$$IG(S,Height)=0.004$$ $$IG(S,Hair)=0.454$$ $$IG(S,Eyes)=0.348$$

From the above formula, we have for example $SplitInformation(S,Height)=-\dfrac{3}{8}log_{2}\dfrac{3}{8}-\dfrac{5}{8}log_{2}\dfrac{5}{8}=0.954$ (5 records labeled as "Tall" and 3 labeled as "Small").

$$IGR(S,Height)=\dfrac{IG(S,Height)}{SplitInformation(S,Height)}=0.004$$

The reader is invited to perform the remaining calculations, and we'll proceed to present the results below.

| Feature | IG    | SplitInformation | IGR   |
|---------|-------|------------------|-------|
| Height  | 0.004 | 0.954            | 0.004 |
| Hair    | 0.454 | 1.406            | 0.323 |
| Eyes    | 0.348 | 0.954            | 0.365 |

In this example, we can observe that the gain ratio has favored the feature with the smaller number of distinct values (Eyes, which can only take the values Brown and Blue) and penalized the Hair feature. Consequently, the data is now split with the Eyes feature first, in contrast to the Hair feature as discussed in the previous post.
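
As a sanity check, here is a self-contained C# sketch (the class name `GainRatioDemo` and the tuple encoding of the rows are ours) that recomputes the three columns of the table above directly from the toy dataset; the printed values match the table up to rounding of intermediate results.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GainRatioDemo
{
    // The toy dataset (Height, Hair, Eyes, Attractive ?) used throughout this series.
    static readonly (string Height, string Hair, string Eyes, string Label)[] Rows =
    {
        ("Small", "Blonde", "Brown", "No"),  ("Tall",  "Dark",   "Brown", "No"),
        ("Tall",  "Blonde", "Blue",  "Yes"), ("Tall",  "Dark",   "Blue",  "No"),
        ("Small", "Dark",   "Blue",  "No"),  ("Tall",  "Red",    "Blue",  "Yes"),
        ("Tall",  "Blonde", "Brown", "No"),  ("Small", "Blonde", "Blue",  "Yes"),
    };

    // Shannon entropy of a sequence of discrete values.
    static double Entropy(IEnumerable<string> values)
    {
        var items = values.ToList();
        return -items.GroupBy(v => v)
                     .Sum(g => ((double)g.Count() / items.Count)
                               * Math.Log2((double)g.Count() / items.Count));
    }

    static void Report(string name,
                       Func<(string Height, string Hair, string Eyes, string Label), string> feature)
    {
        double hS = Entropy(Rows.Select(r => r.Label));        // H(S) ≈ 0.954
        double ig = hS - Rows.GroupBy(feature)                 // IG(S, f)
                             .Sum(g => ((double)g.Count() / Rows.Length)
                                       * Entropy(g.Select(r => r.Label)));
        double split = Entropy(Rows.Select(feature));          // SplitInformation(S, f)
        Console.WriteLine($"{name}: IG={ig:F3}, SplitInformation={split:F3}, IGR={ig / split:F3}");
    }

    public static void Main()
    {
        // Reproduces (up to rounding) the IG, SplitInformation and IGR values of the table above.
        Report("Height", r => r.Height);
        Report("Hair",   r => r.Hair);
        Report("Eyes",   r => r.Eyes);
    }
}
```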

Important

The C4.5 algorithm (using the gain ratio) is less prone to overfitting compared to its ID3 counterpart.

How to deal with continuous features ?

Another limitation of the ID3 algorithm is the difficulty it faces with continuous variables. The C4.5 algorithm addresses this issue, and we will modify our example by replacing the Height feature with numerical values to see it in action.

| Height (cm) | Hair   | Eyes  | Attractive ? |
|-------------|--------|-------|--------------|
| 160         | Blonde | Brown | No           |
| 186         | Dark   | Brown | No           |
| 188         | Blonde | Blue  | Yes          |
| 184         | Dark   | Blue  | No           |
| 162         | Dark   | Blue  | No           |
| 192         | Red    | Blue  | Yes          |
| 188         | Blonde | Brown | No           |
| 160         | Blonde | Blue  | Yes          |

The C4.5 algorithm proposes to perform a binary split based on a threshold value. This threshold should be the value that offers the maximum gain for that attribute.

  • Sort the different values for Height in the ascending order: 160, 162, ..., 192.
  • For each value $h$, partition the records into two groups: one with Height values up to and including $h$ and the other with values greater than $h$.
  • For each of these partitions, compute the gain ratio.
  • Choose the threshold that maximizes the gain ratio, as sketched in the code below.
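
The following C# sketch implements this procedure (the names `ContinuousSplit` and `FindBestThreshold` are ours, and it assumes .NET Core 3.0 or later for `Math.Log2` and `SkipLast`): it sorts the distinct values, evaluates the binary split $\le h$ / $\gt h$ for every candidate threshold, and keeps the one with the highest gain ratio.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ContinuousSplit
{
    // Shannon entropy of a sequence of discrete labels.
    static double Entropy(IEnumerable<string> labels)
    {
        var items = labels.ToList();
        return -items.GroupBy(l => l)
                     .Sum(g => ((double)g.Count() / items.Count)
                               * Math.Log2((double)g.Count() / items.Count));
    }

    // Returns the threshold h maximizing the gain ratio of the split "value <= h" vs "value > h".
    public static (double Threshold, double GainRatio) FindBestThreshold(double[] values, string[] labels)
    {
        double hS = Entropy(labels);
        int n = values.Length;
        var best = (Threshold: double.NaN, GainRatio: double.NegativeInfinity);

        // Candidate thresholds: every distinct value except the largest one
        // (splitting at the largest value would leave the "greater than" partition empty).
        foreach (double h in values.Distinct().OrderBy(v => v).SkipLast(1))
        {
            string[] left  = labels.Where((_, i) => values[i] <= h).ToArray();
            string[] right = labels.Where((_, i) => values[i] >  h).ToArray();

            // Information gain of the binary split.
            double ig = hS - ((double)left.Length / n) * Entropy(left)
                           - ((double)right.Length / n) * Entropy(right);

            // Split information of the binary split (entropy of the two partition sizes).
            double pLeft = (double)left.Length / n, pRight = (double)right.Length / n;
            double splitInformation = -(pLeft * Math.Log2(pLeft) + pRight * Math.Log2(pRight));

            double gainRatio = ig / splitInformation;
            if (gainRatio > best.GainRatio)
                best = (h, gainRatio);
        }

        return best;
    }
}
```

Note that many implementations evaluate midpoints between consecutive distinct values instead; here we follow the procedure described above and test the observed values directly.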

A worked example again !

Remember that for our toy example, $H(S)=0.954$.

  • Sort the different values for Height in the ascending order. There are 6 distinct values for the Height feature: 160, 162, 184, 186, 188, 192.

  • For each value $h$, partition the records into two groups: one with Height values up to and including $h$ (dataset named $H_{Height\le h}$) and the other with values greater than $h$ (dataset named $H_{Height\gt h}$).

For $h=160$ for instance, $H_{Height\le 160}$ comprises 2 records.

| Height (cm) | Hair   | Eyes  | Attractive ? |
|-------------|--------|-------|--------------|
| 160         | Blonde | Brown | No           |
| 160         | Blonde | Blue  | Yes          |

$H_{Height\gt 160}$ comprises 6 records.

| Height (cm) | Hair   | Eyes  | Attractive ? |
|-------------|--------|-------|--------------|
| 162         | Dark   | Blue  | No           |
| 184         | Dark   | Blue  | No           |
| 186         | Dark   | Brown | No           |
| 188         | Blonde | Blue  | Yes          |
| 188         | Blonde | Brown | No           |
| 192         | Red    | Blue  | Yes          |

  • Compute the gain ratio

We have $H(H_{Height\le 160})=1$ and $H(H_{Height\gt 160})=-\dfrac{4}{6}log_{2}\dfrac{4}{6}-\dfrac{2}{6}log_{2}\dfrac{2}{6}=0.918$.

It follows that $IG(S,Height_{160})=H(S)-(\dfrac{2}{8}H(H_{Height\le 160})+\dfrac{6}{8}H(H_{Height\gt 160}))=0.016$

Since $SplitInformation(S,Height_{160})=-\dfrac{2}{8}log_{2}\dfrac{2}{8}-\dfrac{6}{8}log_{2}\dfrac{6}{8}=0.811$, we eventually get $IGR(S,Height_{160})=0.020$

  • Repeat this operation for all the distinct values. The largest value, 192, can be skipped, because splitting there would leave the second partition empty.

| $h$ (cm) | $IG(S,Height_{h})$ | $SplitInformation(S,Height_{h})$ | $IGR(S,Height_{h})$ |
|----------|--------------------|----------------------------------|---------------------|
| 160      | 0.016              | 0.811                            | 0.020               |
| 162      | 0.003              | 0.954                            | 0.003               |
| 184      | 0.049              | 1.000                            | 0.049               |
| 186      | 0.159              | 0.954                            | 0.167               |
| 188      | 0.199              | 0.544                            | 0.366               |

  • Choose the threshold that maximizes the gain ratio.

Thus, if we were to split the data using the Height feature, we would set the threshold at 188 cm, the value with the highest gain ratio (its very unbalanced 7/1 partition yields both the highest information gain and a low split information).
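
Assuming the hypothetical `FindBestThreshold` helper sketched earlier, this whole search can be reproduced in a few lines.

```csharp
using System;

public static class ThresholdDemo
{
    public static void Main()
    {
        // Heights and labels of the 8 records from the table above, in the same order.
        double[] heights    = { 160, 186, 188, 184, 162, 192, 188, 160 };
        string[] attractive = { "No", "No", "Yes", "No", "No", "Yes", "No", "Yes" };

        var (threshold, gainRatio) = ContinuousSplit.FindBestThreshold(heights, attractive);

        // Prints: Best threshold: 188 cm (gain ratio 0.366)
        Console.WriteLine($"Best threshold: {threshold} cm (gain ratio {gainRatio:F3})");
    }
}
```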

Important

This method involves a substantial number of computations.

That was a fair bit of theory! We will now put it into practice and see how to leverage ML.NET to witness decision trees in action. This will be the subject of the next article.