Building a content-based recommendation system with the tf-idf algorithm - Part 4
In this article, we will implement what we covered theoretically in the last post and code the tf-idf algorithm (here in C#, but it can easily be adapted to another programming language).
The source code is available upon request via email (elementsofcomputerscience(at)gmail.com).
Implementing the underlying concepts
First and foremost, we will need to implement the concepts we highlighted in the previous posts.
In Visual Studio, create a new Blank Solution project and give it a name.
In this solution, add a new Class Library project and name it ContentBasedRecommendations.Data for example.
What is an article?
An article is essentially composed of a title (headline), a summary, and content.
```csharp
public class Article
{
    public string Id { get; set; }

    public string Name { get; set; }

    public string Summary { get; set; }

    public string Content { get; set; }
}
```
As mentioned earlier, since we utilize the vector space model, we will need to represent documents as vectors. This necessitates defining an ArticleVector class.
```csharp
public class ArticleVector
{
    public Dictionary<long, double> Components { get; set; } = new Dictionary<long, double>();

    public string Id { get; set; }

    public long Dimension { get; set; }

    ...
}
```
The vectors representing documents are typically very sparse (mostly zeros), since the terms appearing in any one document are only a small fraction of all the terms in the corpus. That is why we use a dictionary to store only the non-zero components of the vector; a plain array of doubles would be very inefficient here.
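The members elided above are not listed in this post, but the rest of the code calls an `(id, dimension)` constructor as well as `Set` and `Normalize` methods on `ArticleVector`. Here is a minimal sketch of what they might look like, reconstructed from that usage (the exact implementation ships with the source code available by email):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class ArticleVector
{
    public Dictionary<long, double> Components { get; set; } = new Dictionary<long, double>();

    public string Id { get; set; }

    public long Dimension { get; set; }

    public ArticleVector(string id, long dimension)
    {
        Id = id;
        Dimension = dimension;
    }

    // Store only non-zero components to keep the vector sparse.
    public void Set(long position, double value)
    {
        if (value != 0.0) Components[position] = value;
    }

    // Scale the vector to unit length (no-op for the zero vector).
    // Normalizing an already-normalized vector leaves it unchanged.
    public void Normalize()
    {
        var norm = Math.Sqrt(Components.Values.Sum(v => v * v));
        if (norm == 0.0) return;
        foreach (var key in Components.Keys.ToList())
            Components[key] /= norm;
    }
}
```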
Finally, we define the ArticleRecommended class, which models the final articles proposed to the user. Therefore, a recommendation is a list of ArticleRecommended objects.
```csharp
public class ArticleRecommended
{
    public string Id { get; set; }

    public string Name { get; set; }

    public string Summary { get; set; }

    public DateTime UpdatedAtUtc { get; set; }
}

public class Recommendation
{
    public string CurrentArticle { get; set; }

    public List<ArticleRecommended> RelatedArticles { get; set; }
}
```
Our goal is to implement an application that, for a given Article object, provides a recommendation and, consequently, for all the articles in the corpus, generates a list of recommendations.
Defining the interfaces
We will establish an interface that the algorithms must implement.
```csharp
public interface INewsRecommender
{
    List<Recommendation> CreateRecommendations();
}
```
We will also need an interface for computing the similarity between two vectors.
```csharp
public interface IComputeSimilarity
{
    double GetSimilarity(ArticleVector v1, ArticleVector v2);
}
```
This interface will be implemented by two classes: one for the Jaccard similarity and the other for the cosine similarity.
```csharp
public class JaccardComputeSimilarity : IComputeSimilarity
{
    public double GetSimilarity(ArticleVector v1, ArticleVector v2)
    {
        if (v1.Dimension != v2.Dimension)
            throw new ArgumentException("Vectors must belong to the same space.");

        // Jaccard similarity on the supports: |A ∩ B| / |A ∪ B|,
        // where A and B are the sets of non-zero components.
        var list1 = v1.Components.Keys;
        var list2 = v2.Components.Keys;

        var intersection = list1.Intersect(list2);
        var union = list1.Union(list2);

        return (double)intersection.Count() / union.Count();
    }
}
```
```csharp
public class CosineComputeSimilarity : IComputeSimilarity
{
    public double GetSimilarity(ArticleVector v1, ArticleVector v2)
    {
        if (v1.Dimension != v2.Dimension)
            throw new ArgumentException("Vectors must belong to the same space.");

        // Normalizing is idempotent, so repeated calls are harmless.
        v1.Normalize();
        v2.Normalize();

        // Dot product restricted to the keys present in both vectors.
        var res = 0.0;
        foreach (var item1 in v1.Components.Keys)
        {
            if (!v2.Components.ContainsKey(item1)) continue;
            res += v1.Components[item1] * v2.Components[item1];
        }

        return res;
    }
}
```
Here, we see the value of a data structure suited to sparse vectors instead of arrays of doubles: computing a similarity costs only a handful of multiplications, one per non-zero component shared by the two vectors.
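To make the gain concrete, consider two toy sparse vectors stored as dictionaries (the values and dimensions below are made up for illustration); the cosine dot product only touches the keys present in both, and the Jaccard similarity only compares the key sets:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SparseDemo
{
    static void Main()
    {
        // Two sparse unit vectors in a space of, say, 10,000 dimensions:
        // two stored entries each instead of 10,000 doubles.
        var v1 = new Dictionary<long, double> { [3] = 0.6, [42] = 0.8 };
        var v2 = new Dictionary<long, double> { [3] = 0.8, [99] = 0.6 };

        // Cosine similarity: one multiplication per key present in both vectors.
        var cosine = v1.Keys.Where(v2.ContainsKey).Sum(k => v1[k] * v2[k]);

        // Jaccard similarity on the supports: |{3}| / |{3, 42, 99}| = 1/3.
        var jaccard = (double)v1.Keys.Intersect(v2.Keys).Count()
                    / v1.Keys.Union(v2.Keys).Count();

        Console.WriteLine($"cosine = {cosine}, jaccard = {jaccard}");
    }
}
```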
Implementing the tf-idf algorithm
All that remains now is to implement the tf-idf algorithm itself.
```csharp
public class TfIdfNewsRecommender : INewsRecommender
{
    private readonly List<Article> _articles;
    private readonly List<IContentProcessor> _processors;
    private readonly IComputeSimilarity _similarity;

    public TfIdfNewsRecommender(List<Article> articles)
    {
        _articles = articles;

        _processors = new List<IContentProcessor>()
        {
            new WriteInLowerCaseContentProcessor(),
            new RemoveApostrophesContentProcessor(),
            new RemoveColonsContentProcessor(),
            new RemoveCommasContentProcessor(),
            new RemoveSpecialCharactersContentProcessor(),
            new RemoveExclamationsContentProcessor()
        };

        _similarity = new CosineComputeSimilarity();
        //_similarity = new JaccardComputeSimilarity();
    }

    public List<Recommendation> CreateRecommendations()
    {
        var processedArticles = new List<ArticleScoredWithTfIdf>();

        // Document frequency of each term: in how many articles it appears.
        var corpora = new Dictionary<string, long>();

        foreach (var article in _articles)
        {
            var content = article.Content.Trim();
            if (string.IsNullOrWhiteSpace(content)) continue;

            foreach (var processor in _processors)
                content = processor.Process(content);

            var words = content.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries)
                               .Where(t => t.Length >= 4)
                               .ToList();

            foreach (var word in words.Distinct())
            {
                if (corpora.ContainsKey(word))
                {
                    corpora[word] = corpora[word] + 1;
                }
                else
                {
                    corpora.Add(word, 1);
                }
            }

            var articleScored = new ArticleScoredWithTfIdf()
            {
                Id = article.Id,
                Name = article.Name,
                Summary = article.Summary,
                Terms = new Dictionary<string, int>(),
                TermsScored = new Dictionary<string, double>()
            };

            foreach (var word in words)
                articleScored.AddTerm(word);

            processedArticles.Add(articleScored);
        }

        // Compute the idf of each term: log10(N / df), where N is the number
        // of documents and df the number of documents containing the term.
        // Note the cast to double to avoid integer division.
        var documentCount = processedArticles.Count;
        var idfs = new Dictionary<string, double>();
        foreach (var word in corpora.Keys)
        {
            idfs[word] = Math.Log10((double)documentCount / corpora[word]);
        }

        // Compute the tf-idf score of each term in each document
        foreach (var articleScored in processedArticles)
        {
            foreach (var term in articleScored.Terms.Keys)
            {
                articleScored.TermsScored[term] = articleScored.ComputeTermFrequency(term) * idfs[term];
            }
        }

        // Map each term to a dimension of the vector space
        var dimensions = corpora.Keys.OrderBy(t => t).ToList();
        var dimension = dimensions.Count;
        var map = new Dictionary<string, long>();
        var i = 0L;
        foreach (var word in dimensions)
        {
            map.Add(word, i);
            i++;
        }

        // Build the sparse vector of each document
        var vectors = new List<ArticleVector>();
        foreach (var articleScored in processedArticles)
        {
            var vector = new ArticleVector(articleScored.Id, dimension);
            foreach (var term in articleScored.TermsScored.Keys)
            {
                var position = map[term];
                vector.Set(position, articleScored.TermsScored[term]);
            }

            vectors.Add(vector);
        }

        // Compute the pairwise similarities and keep the top 5 for each article
        var recommendations = new List<Recommendation>();
        var now = DateTime.UtcNow;
        foreach (var vector in vectors)
        {
            var scores = new Dictionary<string, double>();
            var remainingVectors = vectors.Where(t => t.Id != vector.Id).ToList();
            foreach (var remainingVector in remainingVectors)
            {
                var sim = _similarity.GetSimilarity(vector, remainingVector);
                scores.Add(remainingVector.Id, sim);
            }

            var recommendation = new Recommendation() { CurrentArticle = vector.Id, RelatedArticles = new List<ArticleRecommended>() };
            foreach (var score in scores.OrderByDescending(x => x.Value).Take(5))
            {
                // A dictionary keyed by id would be better than this linear search.
                var processedArticle = processedArticles.FirstOrDefault(x => x.Id == score.Key);
                if (processedArticle == null) continue;

                var recommended = new ArticleRecommended()
                {
                    Id = processedArticle.Id,
                    Name = processedArticle.Name,
                    Summary = processedArticle.Summary,
                    UpdatedAtUtc = now
                };
                recommendation.RelatedArticles.Add(recommended);
            }

            recommendations.Add(recommendation);
        }

        return recommendations;
    }
}

public class ArticleScoredWithTfIdf
{
    public string Id { get; set; }

    public string Name { get; set; }

    public string Summary { get; set; }

    public Dictionary<string, int> Terms { get; set; }

    public Dictionary<string, double> TermsScored { get; set; }

    public void AddTerm(string term)
    {
        if (Terms.ContainsKey(term))
        {
            Terms[term] = Terms[term] + 1;
        }
        else
        {
            Terms.Add(term, 1);
        }
    }

    public double ComputeTermFrequency(string term)
    {
        if (!Terms.ContainsKey(term)) return 0.0;
        return Math.Log10(1 + Terms[term]);
    }
}
```
This code is long but follows the steps emphasized in the last post: given a list of articles (and a similarity metric, chosen in the constructor), it returns a list of recommendations.
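As a sanity check on the scoring arithmetic, here is the same weighting scheme (logarithmic term frequency, base-10 idf) applied by hand to a toy three-document corpus; the documents and terms are made up purely for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class TfIdfDemo
{
    static void Main()
    {
        // Toy corpus of 3 documents, already tokenized.
        var docs = new List<string[]>
        {
            new[] { "forest", "fire", "forest" },
            new[] { "election", "vote" },
            new[] { "fire", "vote" }
        };

        int n = docs.Count;

        // Document frequency of each term.
        var df = docs.SelectMany(d => d.Distinct())
                     .GroupBy(t => t)
                     .ToDictionary(g => g.Key, g => g.Count());

        // tf-idf of "forest" in document 0, with the same weighting as the
        // recommender: log10(1 + tf) * log10(n / df).
        int tf = docs[0].Count(t => t == "forest");   // "forest" appears twice
        double score = Math.Log10(1 + tf) * Math.Log10((double)n / df["forest"]);

        // log10(3) * log10(3) ≈ 0.2276: "forest" is frequent in this document
        // and rare in the corpus, so it scores high.
        Console.WriteLine(score);
    }
}
```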
The details matter. At the start of the algorithm, a list of preprocessors formats each article's content: converting the text to lowercase and removing special characters, punctuation marks, apostrophes, and so on. This data-cleaning step is unglamorous but essential to keep the corpus clean; we won't delve into its intricacies here, but it often makes a real difference in the performance of such algorithms.
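The `IContentProcessor` interface and its implementations are not listed in this post; judging from how they are used, the interface presumably reduces to a single `Process` method. A possible sketch of two of the processors named in the constructor above (the bodies are guesses, not the author's exact code):

```csharp
// Each processor takes the content and returns a cleaned-up version of it.
public interface IContentProcessor
{
    string Process(string content);
}

public class WriteInLowerCaseContentProcessor : IContentProcessor
{
    public string Process(string content) => content.ToLowerInvariant();
}

public class RemoveCommasContentProcessor : IContentProcessor
{
    // Replace commas with spaces; the later whitespace split discards them.
    public string Process(string content) => content.Replace(",", " ");
}
```

Chaining the processors in order, as `CreateRecommendations` does, yields lowercase text stripped of punctuation, ready for tokenization.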
Running the program
Now it's time to observe the results of our efforts. We will compare the recommended articles for the two similarity metrics and examine a few articles to see what items are suggested.
Add a Console Application project named ContentBasedRecommendations.Console.
Add the following code to the Program.cs file.
```csharp
internal class Program
{
    static void Main(string[] args)
    {
        var articles = new List<Article>();
        var contents = File.ReadAllLines(Path.Combine(AppContext.BaseDirectory, "articles.txt"));

        // Split on commas that are outside double-quoted fields.
        var parser = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");

        foreach (var line in contents.Skip(1))
        {
            try
            {
                var c = parser.Split(line);
                if (c.Length != 18) continue;
                if (c[11].Length < 1000) continue;

                var article = new Article()
                {
                    Id = c[16],
                    Name = c[0],
                    Summary = c[3],
                    Content = c[11]
                };
                articles.Add(article);
            }
            catch
            {
                // Skip malformed lines.
            }
        }

        var recommender = new TfIdfNewsRecommender(articles);
        var results = recommender.CreateRecommendations();
    }
}
```
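The regular expression used to parse the CSV deserves a word: it splits on a comma only when the comma is followed by an even number of double quotes, i.e. only when it sits outside a quoted field. A quick illustration (the sample line is made up):

```csharp
using System;
using System.Text.RegularExpressions;

class CsvSplitDemo
{
    static void Main()
    {
        // A comma matches only if the rest of the line contains an even
        // number of quote characters, i.e. the comma is outside quotes.
        var parser = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");

        var line = "42,\"Hello, world\",politics";
        var fields = parser.Split(line);

        // The comma inside "Hello, world" is preserved; the field stays whole.
        Console.WriteLine(string.Join("|", fields));  // 42|"Hello, world"|politics
    }
}
```

A full CSV parser (or a library) would also handle escaped quotes inside fields, which is why the loader above wraps each line in a try/catch and skips anything malformed.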
Results with the Jaccard similarity
| Original article | Recommended articles |
|---|---|
| Welcome To The New Democratic Party | Here’s Joe Biden’s Opening Bid To The Left For The General Election<br>Obama's Trade Deals: A Test For Hillary Clinton<br>Defend Workers and the Environment Before Voting Fast Track<br>Now The GOP Is Defending Obamacare Repeal By Attacking Hillary Clinton<br>At The DNC, Democrats Embraced The Civil Rights Movement To Explain Their Place In The Present |
| It'll Be Really Hot In California This Weekend, And Wildfires Are Still Blazing | The Western Forests May Be Going Away<br>As California Burns, Trump's Proposed Cuts To Forest Service Loom<br>The World's Largest Tree At Risk As California Wildfires Head Toward Giant Sequoia Groves<br>Hurricane Laura Makes Landfall In Louisiana As Category 4 Monster<br>New Wildfire Erupts Near Reagan Library In Southern California |
| Cobarde agresión a un jugador en plena entrevista | Brilla Luis Suárez en victoria del Barcelona<br>No nos vamos a callar, este no es su país<br>MIRA: El extraño lanzamiento de Iván Rodríguez en la Serie del Caribe<br>Tachan a Chávez Jr. de bulto<br>#NoEraPenal, #RoboAPanamá |
At first glance, the recommended articles seem quite relevant. While we are not specialists in American politics, associating the Democratic Party with Clinton or Biden does not seem far-fetched, and the articles on the California wildfires are well grouped. Lastly, articles in Spanish are recommended for articles in Spanish, which matches intuition and reality.
Results with the cosine similarity
| Original article | Recommended articles |
|---|---|
| Welcome To The New Democratic Party | Here’s Joe Biden’s Opening Bid To The Left For The General Election<br>Obama's Trade Deals: A Test For Hillary Clinton<br>Clinton Won Big Tuesday Night. But Sanders Has Won Something Too.<br>Now The GOP Is Defending Obamacare Repeal By Attacking Hillary Clinton<br>'Free Trade' Agreements Promoted for Corporate Benefit Gut the Middle Class, Expand Wealth Inequity & Accelerate Immigration |
| It'll Be Really Hot In California This Weekend, And Wildfires Are Still Blazing | The Western Forests May Be Going Away<br>As California Burns, Trump's Proposed Cuts To Forest Service Loom<br>New Technology Propels Fight Against West Coast Wildfires<br>The World's Largest Tree At Risk As California Wildfires Head Toward Giant Sequoia Groves<br>New Wildfire Erupts Near Reagan Library In Southern California |
| Cobarde agresión a un jugador en plena entrevista | Brilla Luis Suárez en victoria del Barcelona<br>Cumbre de las Américas: ¿Otro desastre para Obama?<br>No nos vamos a callar, este no es su país<br>Hispanos rechazan candidatura de Ted Cruz<br>MIRA: El extraño lanzamiento de Iván Rodríguez en la Serie del Caribe |
It seems that the recommended articles are also quite relevant.
Which one to choose?
Jaccard similarity or cosine similarity? Both metrics appear to yield good results, and in this case we may need to resort to online A/B testing to determine which is better suited to our specific problem. Moreover, we can customize our algorithm further by designing our own similarity measure or by using other existing ones.
What to take from this?
The tf-idf algorithm proves to be a robust, well-established method for content-based filtering recommendations. Even on our small sample (only 5,500 articles), it manages to provide good recommendations. Its advantages are numerous:
- It can associate articles in one language with articles in the same language without the need for explicit language specification.
- There's no need to manually tag articles in advance.
- No special treatment for stop-words is required.
- In contrast to collaborative filtering, content-based filtering is not susceptible to the cold start phenomenon. It can be utilized on day-1 of a project without the need to wait for data on users (which can take a long time). This is why it is a preferred method at the initial stages of a project.
- Everything is automated.
These advantages come with certain drawbacks. Here we dealt with a relatively small corpus of articles, chosen for illustrative purposes, but in reality we may end up managing millions of items. In such scenarios, the workload must be distributed intelligently, and cloud computing can come to the rescue. This is why some commercial companies specialize in recommendations and also offer A/B testing to tailor them perfectly (for instance, BEYABLE); in a JAMStack perspective, these companies are often a good choice.
Final thoughts
In this article, our aim was to provide a comprehensible overview of what the tf-idf algorithm is and how it can be implemented to provide accurate recommendations.
If you wish to delve deeper into this topic, the following book covers all the concepts emphasized in this series and delves into more advanced ones.
Mining of Massive Datasets (Leskovec, Rajaraman, Ullman)
Do not hesitate to contact me should you require further information.