Implementing decision trees with the C4.5 algorithm - Part 5

In this final article, we demonstrate how to implement a basic version of the C4.5 algorithm in C#.

Our goal is to develop a simplified version that applies the concepts discussed in the last post. This exercise will consolidate the theoretical notions we presented and put what we've learned into practice.

Disclaimer

The code presented is not optimized, and there may be cases where it does not work. Our intention is solely to illustrate decision trees and see them in action. In particular, we won't deal with continuous variables.

Defining classes

We begin by defining classes to manage data. A feature, identified primarily by its name (for example Height, Hair or Eyes), is a distinctive attribute of a record.

public class Feature
{
    public string Name { get; set; }
}

We define FeatureValued as a feature linked to one of its possible values (for example, Height=Small or Height=Tall).

public class FeatureValued
{
    public Feature Feature { get; set; }

    public string Value { get; set; }
}

Similarly, we can define output variables using the same approach.

public class Output
{
    public string Name { get; set; }
}

public class OutputValued
{
    public Output Output { get; set; }

    public string Value { get; set; }
}

Armed with these definitions, we can straightforwardly represent a record as a list of FeatureValued objects together with an OutputValued.

public class Record
{
    public List<FeatureValued> Variables { get; set; }

    public OutputValued Output { get; set; }
}

Consequently, a dataset is essentially a list of records, along with the features and the output variable they share.

public class DataSet
{
    public List<Feature> Features { get; set; }

    public Output Output { get; set; }

    public List<Record> Records { get; set; }

    // ...
}
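
Notice the "// ..." above: the DataSet class also exposes a few members that the trainer below relies on (Statistics, OutputStatistics and ExtractSubset), whose implementation we do not reproduce here. The following sketch, inferred from how these members are used later, illustrates what they could look like; it is an assumption, not the exact code.

public class FeatureStatistics
{
    public Feature Feature { get; set; }

    // Distinct values observed for this feature in the records
    public List<string> PossibleValues { get; set; }
}

public class OutputStatistics
{
    // Distinct values observed for the output in the records
    public List<string> PossibleValues { get; set; }

    public int NumberOfDistinctValues => PossibleValues.Count;

    // Most frequent output value among the records
    public string MostRepresentative { get; set; }
}

public class DataSet
{
    // ... properties shown above ...

    // Hypothetical statistics members, computed from the records
    public List<FeatureStatistics> Statistics { get; set; }

    public OutputStatistics OutputStatistics { get; set; }

    // Hypothetical helper returning the subset of records matching a given feature value
    public DataSet ExtractSubset(FeatureValued valued)
    {
        var records = Records.Where(r => r.Variables.Any(v => v.Feature.Name == valued.Feature.Name && v.Value == valued.Value)).ToList();
        return new DataSet() { Features = Features, Output = Output, Records = records };
    }
}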

Defining decision trees

Several methods exist to define trees in C#; our implementation relies on dictionaries and follows a recursive approach, mirroring the internal mechanism of the decision trees we have previously encountered.

public class DecisionTree
{
    public DecisionTreeRule Root { get; set; }
}

public abstract class DecisionTreeRule
{
    public abstract bool IsNode { get; }

    public abstract bool IsLeaf { get; }

    public abstract string Value { get; }
}

public class DecisionTreeRuleFeature : DecisionTreeRule
{
    public Feature Feature { get; set; }

    public Dictionary<string, DecisionTreeRule> Children { get; set; }

    public override bool IsNode => true;

    public override bool IsLeaf => false;

    public override string Value => string.Empty;
}

public class DecisionTreeRuleOutput : DecisionTreeRule
{
    public OutputValued Output { get; set; }

    public override bool IsNode => false;

    public override bool IsLeaf => true;

    public override string Value => Output.Value;
}
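
To make this structure concrete, here is how a tiny tree could be assembled by hand. The feature, values and output below are purely illustrative and do not come from the training data used later.

// A purely illustrative two-level tree that splits on Hair and then concludes
var root = new DecisionTreeRuleFeature()
{
    Feature = new Feature() { Name = "Hair" },
    Children = new Dictionary<string, DecisionTreeRule>()
    {
        // Each possible value of the feature maps to a subtree or to a leaf
        ["Dark"] = new DecisionTreeRuleOutput() { Output = new OutputValued() { Output = new Output() { Name = "Class" }, Value = "No" } },
        ["Blonde"] = new DecisionTreeRuleOutput() { Output = new OutputValued() { Output = new Output() { Name = "Class" }, Value = "Yes" } }
    }
};

var tree = new DecisionTree() { Root = root };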

Defining interfaces for algorithms

Classically, the interface that algorithms must adhere to includes Fit and Predict methods.

public interface IDecisionTreeTrainer
{
    void Fit(DataSet set);

    OutputValued Predict(TestRecord record);
}
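
Note that Predict takes a TestRecord, that is, a record whose output is unknown. Its definition is not reproduced in this article; we assume the following minimal version.

// Assumed minimal definition: a record to classify, without a known output
public class TestRecord
{
    public List<FeatureValued> Variables { get; set; }
}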

Implementing the C4.5 algorithm

Finally, an implementation of this interface is required for the C4.5 algorithm. It is worth noting that the gain ratio, rather than the information gain, is employed to determine the feature on which to split the data.
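
As a reminder from the previous post, the gain ratio normalizes the information gain by the split information of the feature:

GainRatio(S, A) = (H(S) - H(S|A)) / SplitInfo(S, A)

SplitInfo(S, A) = - Σv (|Sv| / |S|) log2(|Sv| / |S|)

where the sum runs over the possible values v of the feature A and Sv is the subset of records for which A = v. This is exactly what the GainRatio method below computes.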

public class C45DecisionTreeTrainer : IDecisionTreeTrainer
{
    private DecisionTree _tree;
    private Output _output;

    public void Fit(DataSet set)
    {
        _output = set.Output;
        var features = set.Features;

        var root = BuildRule(features, _output, set);

        _tree = new DecisionTree() { Root = root };
    }

    public OutputValued Predict(TestRecord record)
    {
        return Predict(_tree.Root, record);
    }

    #region Private Methods

    private DecisionTreeRule BuildRule(List<Feature> features, Output output, DataSet set)
    {
        // Stopping criterion: all records share the same output value, return a leaf with it
        var outputStatistics = set.OutputStatistics;
        if (outputStatistics.NumberOfDistinctValues == 1)
        {
            var outputValued = new OutputValued() { Output = set.Output, Value = outputStatistics.PossibleValues.FirstOrDefault() };
            return new DecisionTreeRuleOutput() { Output = outputValued };
        }

        // Stopping criterion: no features remain, return a leaf with the most frequent output value
        if (!features.Any())
        {
            var outputValued = new OutputValued() { Output = set.Output, Value = outputStatistics.MostRepresentative };
            return new DecisionTreeRuleOutput() { Output = outputValued };
        }

        // Select the feature with the highest gain ratio
        var max = double.MinValue;
        Feature eligible = null;
        foreach (var feature in features)
        {
            var gainRatio = GainRatio(feature, set); // Should be given in the constructor with dependency injection
            if (gainRatio > max)
            {
                max = gainRatio;
                eligible = feature;
            }
        }

        var ruleNode = new DecisionTreeRuleFeature() { Feature = eligible, Children = new Dictionary<string, DecisionTreeRule>() };

        // Recursively build a subtree for each possible value of the selected feature
        var remainingFeatures = features.Where(x => x.Name != eligible.Name).ToList();
        var possibleValuesForEligible = set.Statistics.FirstOrDefault(x => x.Feature.Name == eligible.Name).PossibleValues;
        foreach (var possibleValue in possibleValuesForEligible)
        {
            var valued = new FeatureValued() { Feature = eligible, Value = possibleValue };
            var subset = set.ExtractSubset(valued);

            var rule = BuildRule(remainingFeatures, output, subset);
            ruleNode.Children.Add(possibleValue, rule);
        }

        return ruleNode;
    }

    private double GainRatio(Feature feature, DataSet set)
    {
        var nb = set.Records.Count;

        // Calculate the dataset entropy (same for all the features)
        var entropySet = 0.0;
        var possibleOutputs = set.OutputStatistics.PossibleValues;
        foreach (var possibleOutput in possibleOutputs)
        {
            var outputWithThisValue = set.Records.Select(x => x.Output).Where(t => t.Value == possibleOutput).LongCount();
            var ratio = (double)outputWithThisValue / (double)nb;

            if (ratio != 0)
                entropySet = entropySet - ratio * Math.Log2(ratio);
        }

        // Calculate the conditional entropy and the split information
        var conditionalEntropySet = 0.0; var splitInformation = 0.0;
        var records = set.Records.SelectMany(x => x.Variables).Where(t => t.Feature.Name == feature.Name);
        var possibleValuesFeature = set.Statistics.FirstOrDefault(x => x.Feature.Name == feature.Name).PossibleValues;
        foreach (var possibleValueFeature in possibleValuesFeature)
        {
            var recordCountWithThisValue = (double)records.Where(t => t.Value == possibleValueFeature).LongCount();
            if (recordCountWithThisValue == 0) continue; // Skip absent values to avoid a 0/0 division below

            var pValueFeature = recordCountWithThisValue / nb;
            splitInformation = splitInformation - pValueFeature * Math.Log2(pValueFeature);

            var entropyFeature = 0.0;
            foreach (var possibleOutput in possibleOutputs)
            {
                var outputWithThisValue = set.Records.Where(x => x.Variables.Any(t => t.Feature.Name == feature.Name && t.Value == possibleValueFeature)).Select(x => x.Output).Where(t => t.Value == possibleOutput).LongCount();
                var ratio = (double)outputWithThisValue / recordCountWithThisValue;

                if (ratio != 0)
                    entropyFeature = entropyFeature - ratio * Math.Log2(ratio);
            }

            conditionalEntropySet = conditionalEntropySet + pValueFeature * entropyFeature;
        }

        // Calculate the information gain
        var informationGain = entropySet - conditionalEntropySet;

        // Calculate the gain ratio (guard against a zero split information, e.g. when the feature takes a single value)
        if (splitInformation == 0) return 0.0;
        var gainRatio = informationGain / splitInformation;

        return gainRatio;
    }

    private OutputValued Predict(DecisionTreeRule rule, TestRecord record)
    {
        if (rule.IsLeaf) return new OutputValued() { Output = _output, Value = rule.Value };

        // Follow the branch corresponding to the record's value for the node's feature
        var ruleFeature = rule as DecisionTreeRuleFeature;
        var featureValue = record.Variables.FirstOrDefault(x => x.Feature.Name == ruleFeature.Feature.Name);

        var child = ruleFeature.Children[featureValue.Value];

        return Predict(child, record);
    }

    #endregion
}

Running the program

This program can be tested with a simple console application.

internal class Program
{
    static void Main(string[] args)
    {
        var path = AppContext.BaseDirectory + "/training.txt";
        var dataset = new DataSet(path);

        // Train the algorithm
        var trainer = new C45DecisionTreeTrainer();
        trainer.Fit(dataset);

        // Test the algorithm with an unforeseen record
        var testRecord = new TestRecord()
        {
            Variables = new List<FeatureValued>()
            {
                new FeatureValued() { Feature = new Feature() { Name = "Height" }, Value = "Small" },
                new FeatureValued() { Feature = new Feature() { Name = "Hair" }, Value = "Dark" },
                new FeatureValued() { Feature = new Feature() { Name = "Eyes" }, Value = "Brown" },
            }
        };
        var prediction = trainer.Predict(testRecord);

        // Display the predicted output value
        Console.WriteLine(prediction.Value);
    }
}
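
The format of training.txt depends on how the DataSet constructor parses it, which is not shown in this article. Assuming a simple comma-separated layout where a header line names the features and the output, the file could look like this (the rows are purely illustrative):

Height,Hair,Eyes,Class
Small,Blonde,Blue,Yes
Tall,Dark,Brown,No
Small,Dark,Blue,No
Tall,Blonde,Brown,No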

Final thoughts

In this article, our aim was to provide a comprehensible overview of what the C4.5 algorithm is and how it can be implemented to provide accurate predictions.

If you wish to delve deeper into this topic, many books encompass all the concepts emphasized in this series and explore more advanced ones.

Do not hesitate to contact me should you require further information.