Understanding the naïve Bayes algorithm with its implementation in C# - Part 4

In this article, we put the naive Bayes algorithm into practice by defining the relevant features that we will retain.

In the preceding article, we observed that the features utilized during the training phase were entirely the responsibility of developers or data scientists. Therefore, it is now time to explore what these features might entail.
We won't be particularly innovative in this regard, as others have already addressed this problem. Our proposal is to use the various biletters, triletters, and 4-letters (that is, character n-grams of length 2, 3, and 4) present in the training set.

What are biletters, triletters and ... ?

The terms "biletters" and "triletters" might sound complex, but they are, in fact, sophisticated words denoting simple concepts. A biletter, for example, is a pair of consecutive letters in a sequence of text. It represents a two-letter combination, providing insight into the relationships and context between adjacent letters. For example, in the sentence "The quick brown fox", the biletters would be "th", "he", "e ", " q", "qu" and so forth.

Triletters, similarly, are a set of three consecutive letters in a sequence of text.
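In the standard terminology of natural language processing, biletters and triletters are simply character bigrams and trigrams (character n-grams). As a minimal sketch, extraction boils down to sliding a window of size n over the text (the `NGrams.Extract` helper below is ours, for illustration; it is not part of the classifier's code):

```csharp
using System;
using System.Collections.Generic;

public static class NGrams
{
    // Returns every substring of length n from the given text,
    // sliding a window one character at a time.
    public static List<string> Extract(string text, int n)
    {
        var grams = new List<string>();
        for (var i = 0; i + n <= text.Length; i++)
            grams.Add(text.Substring(i, n));
        return grams;
    }
}
```

For instance, `NGrams.Extract("fox", 2)` yields "fo" and "ox"; applied to text padded with spaces, it reproduces the space-containing biletters such as "e " and " q" listed above.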

Hence, our features will encompass the counts of biletters, triletters, and similar constructs in the training set. As evident, this approach can yield numerous features, given the 26 letters in the Latin alphabet, along with numbers.

Information

The selection of these features aims to differentiate between languages based on their commonly associated letters. It is a known fact that each language has specific preferences for letter combinations (for instance, in English, the pair "zx" is rarely found in a word, but it might be more common in some other languages). Essentially, our features serve to mathematically formalize this straightforward observation.

Implementation in C#

In this section, we will tackle the implementation in C# of everything mentioned earlier and witness a practical example in action.

Defining the concepts

First and foremost, let's provide a brief description of the concepts we emphasized.

Defining a language

// Define Language
public class Language : IEquatable<Language>
{
    public string Name { get; set; }

    public Language(string name)
    {
        Name = name;
    }

    public bool Equals(Language? other)
    {
        return other != null && Name == other.Name;
    }

    // Keep Equals and GetHashCode consistent, since Language is used as a dictionary key.
    public override bool Equals(object? obj)
    {
        return Equals(obj as Language);
    }

    public override int GetHashCode()
    {
        return Name.GetHashCode();
    }
}

Defining a document

A document is merely a sequence of words.

// Define Document
public class Document
{
    public Document(List<string> words)
    {
        Words = words;
    }

    public List<string> Words { get; set; }
}

We also define a labelled document. A labelled document is a document that has been categorized into a specific language. This concept proves useful during the training phase of the algorithm.

// Define LabelledDocument
public class LabelledDocument
{
    public Document Document { get; set; }

    public Language Language { get; set; }
}

Defining the interfaces

Armed with these concepts, we can now define interfaces as contracts that our classes must adhere to.

public interface ILanguageClassifier
{
    void Train(List<LabelledDocument> documents);

    Language Predict(Document document);
}

There is nothing overly complex in this interface: we have a classifier with a training phase and a predict method. In our case, we will implement the naive Bayes algorithm, but we could also code logistic regression with the same contract.

Information

We also implement a number of preprocessing tasks to clean up the data: converting the content to lowercase, removing punctuation marks, and so on. This process is integral to every data mining problem, and we won't delve into the details here.
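The article does not show the processors themselves, but each one follows the same contract. As an illustration, here is what the lowercasing pass could look like (this is a sketch: the `IDocumentProcessor` interface and this implementation are our guess at the shape of the real code, and `Document` is repeated only so the snippet is self-contained):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Repeated here so the sketch is self-contained; same shape as the class defined earlier.
public class Document
{
    public Document(List<string> words) { Words = words; }
    public List<string> Words { get; set; }
}

// Contract shared by all preprocessing steps.
public interface IDocumentProcessor
{
    Document Process(Document document);
}

// Lowercases every word; the other processors (punctuation removal, etc.) follow the same pattern.
public class WriteInLowerCaseDocumentProcessor : IDocumentProcessor
{
    public Document Process(Document document)
    {
        return new Document(document.Words.Select(w => w.ToLowerInvariant()).ToList());
    }
}
```

Chaining such processors in a list, as the classifier below does, keeps each cleanup step small and testable in isolation.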

Implementing the naive Bayes algorithm

The final step involves concretely implementing the algorithm. As mentioned in the last post, we utilize Laplace smoothing to avoid zero probabilities (and thus taking the logarithm of zero) for n-grams that never appear in a given language.
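Before diving into the full class, here is the add-one smoothing formula in isolation (the `Laplace` helper and its names are ours, purely for illustration):

```csharp
using System;

// Illustration of add-one (Laplace) smoothing.
public static class Laplace
{
    // log((count + 1) / (totalCount + vocabularySize)):
    // an n-gram never observed in a language (count = 0) still gets a
    // small but finite log-probability instead of log(0) = -infinity.
    public static double LogLikelihood(long count, long totalCount, int vocabularySize)
    {
        return Math.Log((double)(count + 1) / (totalCount + vocabularySize));
    }
}
```

With made-up counts of 8 for "th" and 6 for "he" (total 14) and a vocabulary of 3 n-grams, an unseen "zx" gets log(1/17) ≈ -2.83 rather than minus infinity, which is exactly what the training code below computes.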

public class NaiveBayesLanguageClassifier : ILanguageClassifier
{
    private readonly Dictionary<Language, Dictionary<string, long>> _patterns;
    private readonly Dictionary<Language, Dictionary<int, HashSet<string>>> _terms;
    private readonly List<Language> _languages;

    private readonly List<IDocumentProcessor> _processors;

    private readonly Dictionary<Language, double> _priors;
    private readonly Dictionary<Language, Dictionary<string, double>> _likelihoods;
    private readonly HashSet<string> _vocabulary;

    public NaiveBayesLanguageClassifier()
    {
        _patterns = new Dictionary<Language, Dictionary<string, long>>();
        _terms = new Dictionary<Language, Dictionary<int, HashSet<string>>>();
        _languages = new List<Language>();

        _processors = new List<IDocumentProcessor>()
        {
            new WriteInLowerCaseDocumentProcessor(),
            new RemovePunctuationMarksDocumentProcessor(),
            new RemoveSpecialCharactersDocumentProcessor(),
            new ReplaceNumbersDocumentProcessor(),
            new PadWithWhitespacesDocumentProcessor()
        };

        _priors = new Dictionary<Language, double>();
        _likelihoods = new Dictionary<Language, Dictionary<string, double>>();
        _vocabulary = new HashSet<string>();
    }

    public void Train(List<LabelledDocument> labelledDocuments)
    {
        // Clean up every document before counting n-grams.
        var processedDocuments = new List<LabelledDocument>();
        foreach (var labelledDocument in labelledDocuments)
        {
            var processedDocument = labelledDocument.Document;
            foreach (var processor in _processors)
                processedDocument = processor.Process(processedDocument);

            processedDocuments.Add(new LabelledDocument() { Document = processedDocument, Language = labelledDocument.Language });
        }

        // Count the biletters, triletters and 4-letters of each language.
        foreach (var processedDocument in processedDocuments)
        {
            var lang = new Language(processedDocument.Language.Name);
            if (!_languages.Contains(lang)) _languages.Add(lang);

            if (!_patterns.ContainsKey(lang))
            {
                _patterns.Add(lang, new Dictionary<string, long>());
            }
            var items = _patterns[lang];

            if (!_terms.ContainsKey(lang))
            {
                _terms.Add(lang, new Dictionary<int, HashSet<string>>());
            }
            var langTerms = _terms[lang];

            var words = processedDocument.Document.Words;
            for (var i = 2; i <= 4; i++)
            {
                if (!langTerms.ContainsKey(i))
                {
                    langTerms.Add(i, new HashSet<string>());
                }
                var grams = langTerms[i];

                foreach (var word in words)
                {
                    var characters = word.ToCharArray();
                    for (var j = i; j <= characters.Length; j++)
                    {
                        var prefix = new string(characters.Skip(j - i).Take(i).ToArray());
                        if (items.ContainsKey(prefix))
                        {
                            items[prefix] = items[prefix] + 1;
                        }
                        else
                        {
                            items.Add(prefix, 1);
                        }

                        grams.Add(prefix);
                        _vocabulary.Add(prefix);
                    }
                }
            }
        }

        var numberOfDocuments = processedDocuments.Count;

        foreach (var language in _languages)
        {
            // Prior: the proportion of training documents written in this language.
            var numberOfDocumentsInLanguage = processedDocuments.Count(t => t.Language.Name == language.Name);
            _priors.Add(language, Math.Log((double)numberOfDocumentsInLanguage / numberOfDocuments));

            var vocabularyLanguage = _patterns[language];

            // Laplace smoothing: every count is incremented by 1, so the
            // denominator is the total count plus the vocabulary size.
            var denom = _vocabulary.Sum(x =>
            {
                if (vocabularyLanguage.ContainsKey(x)) return vocabularyLanguage[x] + 1;
                return 1;
            });

            if (!_likelihoods.ContainsKey(language))
            {
                _likelihoods.Add(language, new Dictionary<string, double>());
            }
            var likelihood = _likelihoods[language];

            foreach (var v in _vocabulary)
            {
                var smoothedCount = 1L;
                if (vocabularyLanguage.ContainsKey(v))
                    smoothedCount = vocabularyLanguage[v] + 1;

                likelihood.Add(v, Math.Log((double)smoothedCount / denom));
            }
        }
    }

    public Language Predict(Document document)
    {
        foreach (var processor in _processors)
            document = processor.Process(document);

        var results = new Dictionary<Language, double>();
        foreach (var language in _languages)
        {
            // Log posterior: log prior + sum of the log likelihoods of every n-gram.
            var score = _priors[language];
            var words = document.Words;
            var languageLikelihoods = _likelihoods[language];

            for (var i = 2; i <= 4; i++)
            {
                foreach (var word in words)
                {
                    var characters = word.ToCharArray();
                    for (var j = i; j <= characters.Length; j++)
                    {
                        var prefix = new string(characters.Skip(j - i).Take(i).ToArray());
                        if (!_vocabulary.Contains(prefix)) continue;

                        score = score + languageLikelihoods[prefix];
                    }
                }
            }

            results.Add(language, score);
        }

        // The predicted language is the one with the highest log posterior.
        return results.MaxBy(kvp => kvp.Value).Key;
    }
}

Run the program

We will employ a console application to test our program.

internal class Program
{
    static void Main(string[] args)
    {
        var languages = new List<string>() { "en", "fr" };
        var documents = new List<LabelledDocument>();

        // Each line of a corpus file becomes one labelled document.
        foreach (var language in languages)
        {
            var lang = new Language(language);
            var contents = File.ReadAllLines(AppContext.BaseDirectory + $"/corpus_{language}.txt");

            foreach (var content in contents)
            {
                var words = content.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries).ToList();
                var document = new Document(words);

                documents.Add(new LabelledDocument() { Document = document, Language = lang });
            }
        }

        var classifier = new NaiveBayesLanguageClassifier();

        classifier.Train(documents);

        var sentence = "My tailor is rich.";
        var testDocument = new Document(sentence.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries).ToList());
        var inferred = classifier.Predict(testDocument);

        // Display the predicted language.
        Console.WriteLine(inferred.Name);
    }
}

The training dataset consists of 3 documents in French and 3 documents in English, as depicted in the figure below.

[Figure: the training documents in French]

[Figure: the training documents in English]

It is worthwhile to test several sentences to witness the remarkable efficiency of this algorithm. Here are some results.

Sentence | Language predicted
--- | ---
My tailor is rich | en
My flowers are beautiful | en
Carmen est un opéra de Bizet | fr
Paris vaut bien une messe | fr
May the Force be with you | en

These results are even more remarkable considering the small size of our training set.

Final thoughts

If you wish to delve deeper into this topic, the following book covers all the concepts emphasized in this series and delves into more advanced ones.

Machine Learning: A Probabilistic Perspective (Murphy)

Do not hesitate to contact me should you require further information.