Understanding the naïve Bayes algorithm with its implementation in C# - Part 4
In this article, we put the naive Bayes algorithm into practice by defining the relevant features that we will retain.
In the preceding article, we observed that the features utilized during the training phase were entirely the responsibility of developers or data scientists. Therefore, it is now time to explore what these features might entail.
We won't be particularly innovative in this regard, as others have already addressed this problem. Our proposal is to use the various biletters, triletters, and 4-letters (that is, sequences of 2, 3, and 4 consecutive characters) present in the training set.
What are biletters, triletters and ... ?
The terms "biletters" and "triletters" might sound complex, but they are, in fact, sophisticated words denoting simple concepts. A biletter, for example, is a pair of consecutive letters in a sequence of text. It represents a two-letter combination, providing insight into the relationships and context between adjacent letters. For example, in the sentence "The quick brown fox", the biletters would be "th", "he", "e ", " q", "qu" and so forth.
Triletters, similarly, are a set of three consecutive letters in a sequence of text.
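To make this concrete, here is a small standalone sketch (not part of the classifier we build below) showing how such character sequences can be extracted from a word; the `NGrams` class and its `Extract` method are illustrative names, not code from this series:

```csharp
using System.Collections.Generic;

// Standalone sketch: extract all sequences of n consecutive characters
// from a word. For n = 2 these are the "biletters", for n = 3 the
// "triletters", and so on.
public static class NGrams
{
    public static IEnumerable<string> Extract(string word, int n)
    {
        for (var i = 0; i + n <= word.Length; i++)
            yield return word.Substring(i, n);
    }
}

// NGrams.Extract("the", 2) yields "th" then "he".
```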
Hence, our features will encompass the counts of biletters, triletters, and similar constructs in the training set. As evident, this approach can yield numerous features, given the 26 letters in the Latin alphabet, along with numbers.
The selection of these features aims to differentiate between languages based on their commonly associated letters. It is a known fact that each language has specific preferences for letter combinations (for instance, in English, the pair "zx" is rarely found in a word, but it might be more common in some other languages). Essentially, our features serve to mathematically formalize this straightforward observation.
Implementation in C#
In this section, we will tackle the implementation in C# of everything mentioned earlier and witness a practical example in action.
Defining the concepts
First and foremost, let's provide a brief description of the concepts we emphasized.
Defining a language
// Define Language
public class Language : IEquatable<Language>
{
    public string Name { get; set; }

    public Language(string name)
    {
        Name = name;
    }

    public bool Equals(Language? other)
    {
        return other != null && Name == other.Name;
    }

    public override int GetHashCode()
    {
        return Name.GetHashCode();
    }
}
Defining a document
A document is merely a sequence of words.
// Define Document
public class Document
{
    public Document(List<string> words)
    {
        Words = words;
    }

    public List<string> Words { get; set; }
}
We also define a labelled document. A labelled document is a document that has been assigned to a specific language. This concept proves useful during the training phase of the algorithm.
// Define LabelledDocument
public class LabelledDocument
{
    public Document Document { get; set; }

    public Language Language { get; set; }
}
Defining the interfaces
Armed with these concepts, we can now define interfaces as contracts that our classes must adhere to.
public interface ILanguageClassifier
{
    void Train(List<LabelledDocument> documents);

    Language Predict(Document document);
}
There is nothing overly complex in this interface: we have a classifier with a training phase and a predict method. In our case, we will implement the naive Bayes algorithm, but we could also code logistic regression with the same contract.
We also implement a number of preprocessing tasks to clean up the data: converting the content to lowercase, removing punctuation marks, and so on. This process is integral to every data mining problem, and we won't delve into the details here.
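The `IDocumentProcessor` contract itself is not reproduced in this post; assuming it exposes a single `Process` method, the lower-casing step might look like the following sketch (the actual implementation may differ):

```csharp
using System.Linq;

// Assumed shape of the preprocessing contract (not shown in this post).
public interface IDocumentProcessor
{
    Document Process(Document document);
}

// Hypothetical implementation of the lower-casing step: rewrite every
// word of the document in lower case.
public class WriteInLowerCaseDocumentProcessor : IDocumentProcessor
{
    public Document Process(Document document)
    {
        return new Document(document.Words.Select(w => w.ToLowerInvariant()).ToList());
    }
}
```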
Implementing the naive Bayes algorithm
The final step is to concretely implement the algorithm. As mentioned in the previous post, we use Laplace smoothing to avoid assigning a zero probability to n-grams that never appear in a language's training data.
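Concretely, if count(w, L) denotes the number of occurrences of the n-gram w in the training documents of language L, and V denotes the vocabulary of all n-grams seen during training, the Laplace-smoothed log-likelihood computed during the training phase is:

log P(w | L) = log( (count(w, L) + 1) / Σ_{v ∈ V} (count(v, L) + 1) )

The +1 in the numerator and in each term of the denominator guarantees that an n-gram absent from a language's corpus still receives a small, non-zero probability.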
public class NaiveBayesLanguageClassifier : ILanguageClassifier
{
    // Raw n-gram counts per language
    private readonly Dictionary<Language, Dictionary<string, long>> _patterns;
    // N-grams seen per language, indexed by n-gram length
    private readonly Dictionary<Language, Dictionary<int, HashSet<string>>> _terms;
    private readonly List<Language> _languages;

    private readonly List<IDocumentProcessor> _processors;

    // Log-priors and log-likelihoods computed during training
    private readonly Dictionary<Language, double> _priors;
    private readonly Dictionary<Language, Dictionary<string, double>> _likelihoods;
    private readonly HashSet<string> _vocabulary;

    public NaiveBayesLanguageClassifier()
    {
        _patterns = new Dictionary<Language, Dictionary<string, long>>();
        _terms = new Dictionary<Language, Dictionary<int, HashSet<string>>>();
        _languages = new List<Language>();

        _processors = new List<IDocumentProcessor>()
        {
            new WriteInLowerCaseDocumentProcessor(),
            new RemovePunctuationMarksDocumentProcessor(),
            new RemoveSpecialCharactersDocumentProcessor(),
            new ReplaceNumbersDocumentProcessor(),
            new PadWithWhitespacesDocumentProcessor()
        };

        _priors = new Dictionary<Language, double>();
        _likelihoods = new Dictionary<Language, Dictionary<string, double>>();
        _vocabulary = new HashSet<string>();
    }

    public void Train(List<LabelledDocument> labelledDocuments)
    {
        // Clean up every document with the preprocessing pipeline
        var processedDocuments = new List<LabelledDocument>();
        foreach (var labelledDocument in labelledDocuments)
        {
            var processedDocument = labelledDocument.Document;
            foreach (var processor in _processors)
                processedDocument = processor.Process(processedDocument);

            processedDocuments.Add(new LabelledDocument() { Document = processedDocument, Language = labelledDocument.Language });
        }

        // Count the biletters, triletters and 4-letters of each language
        foreach (var processedDocument in processedDocuments)
        {
            var lang = new Language(processedDocument.Language.Name);
            if (!_languages.Contains(lang)) _languages.Add(lang);

            if (!_patterns.ContainsKey(lang))
            {
                _patterns.Add(lang, new Dictionary<string, long>());
            }
            var items = _patterns[lang];

            if (!_terms.ContainsKey(lang))
            {
                _terms.Add(lang, new Dictionary<int, HashSet<string>>());
            }
            var langTerms = _terms[lang];

            var words = processedDocument.Document.Words;
            for (var i = 2; i <= 4; i++)
            {
                if (!langTerms.ContainsKey(i))
                {
                    langTerms.Add(i, new HashSet<string>());
                }
                var grams = langTerms[i];

                foreach (var word in words)
                {
                    var characters = word.ToCharArray();
                    for (var j = i; j <= characters.Length; j++)
                    {
                        var prefix = new string(characters.Skip(j - i).Take(i).ToArray());
                        if (items.ContainsKey(prefix))
                        {
                            items[prefix] = items[prefix] + 1;
                        }
                        else
                        {
                            items.Add(prefix, 1);
                        }

                        if (!grams.Contains(prefix))
                        {
                            grams.Add(prefix);
                        }

                        _vocabulary.Add(prefix);
                    }
                }
            }
        }

        // Compute the log-priors and the Laplace-smoothed log-likelihoods
        var numberOfDocuments = processedDocuments.Count;

        foreach (var language in _languages)
        {
            var numberOfDocumentsInLanguage = processedDocuments.Count(t => t.Language.Name == language.Name);
            _priors.Add(language, Math.Log((double)numberOfDocumentsInLanguage / (double)numberOfDocuments));

            var vocabularyLanguage = _patterns[language];

            var denom = _vocabulary.Sum(x =>
            {
                if (vocabularyLanguage.ContainsKey(x)) return vocabularyLanguage[x] + 1;
                return 1;
            });

            if (!_likelihoods.ContainsKey(language))
            {
                _likelihoods.Add(language, new Dictionary<string, double>());
            }
            var likelihood = _likelihoods[language];

            foreach (var v in _vocabulary)
            {
                var numberOfvForLanguage = 1L;
                if (vocabularyLanguage.ContainsKey(v))
                    numberOfvForLanguage = vocabularyLanguage[v] + 1;

                var n = Math.Log((double)numberOfvForLanguage / (double)denom);
                likelihood.Add(v, n);
            }
        }
    }

    public Language Predict(Document document)
    {
        // Apply the same preprocessing as during training
        foreach (var processor in _processors)
            document = processor.Process(document);

        // For each language, sum the log-prior and the log-likelihoods
        // of every known n-gram in the document
        var results = new Dictionary<Language, double>();
        foreach (var language in _languages)
        {
            var somme = _priors[language];
            var words = document.Words;
            var languageLikelihoods = _likelihoods[language];

            for (var i = 2; i <= 4; i++)
            {
                foreach (var word in words)
                {
                    var characters = word.ToCharArray();
                    for (var j = i; j <= characters.Length; j++)
                    {
                        var prefix = new string(characters.Skip(j - i).Take(i).ToArray());
                        if (!_vocabulary.Contains(prefix)) continue;

                        somme = somme + languageLikelihoods[prefix];
                    }
                }
            }

            results.Add(language, somme);
        }

        // The predicted language is the one with the highest score
        return results.MaxBy(kvp => kvp.Value).Key;
    }
}
Run the program
We will employ a console application to test our program.
internal class Program
{
    static void Main(string[] args)
    {
        var languages = new List<string>() { "en", "fr" };
        var documents = new List<LabelledDocument>();

        // Load the training corpus of each language (one document per line)
        foreach (var language in languages)
        {
            var lang = new Language(language);
            var contents = File.ReadAllLines(AppContext.BaseDirectory + $"/corpus_{language}.txt");

            foreach (var content in contents)
            {
                var words = content.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries).ToList();
                var document = new Document(words);

                documents.Add(new LabelledDocument() { Document = document, Language = lang });
            }
        }

        var classifier = new NaiveBayesLanguageClassifier();

        classifier.Train(documents);

        var sentence = "My tailor is rich.";
        var testDocument = new Document(sentence.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries).ToList());
        var inferred = classifier.Predict(testDocument);

        // Display the predicted language
        Console.WriteLine(inferred.Name);
    }
}
The training dataset consists of 3 documents in French and 3 documents in English, as depicted in the figure below.
documents in French
documents in English
It is worthwhile to test several sentences to witness the remarkable efficiency of this algorithm. Here are some results.
Sentence | Language predicted
---|---
My tailor is rich | en
My flowers are beautiful | en
Carmen est un opéra de Bizet | fr
Paris vaut bien une messe | fr
May the Force be with you | en
These results are even more remarkable considering the small size of our training set.
Final thoughts
If you wish to delve deeper into this topic, the following book covers all the concepts highlighted in this series and explores more advanced ones.
Machine Learning: A Probabilistic Perspective (Murphy)
Do not hesitate to contact me should you require further information.