Building a content-based recommendation system with the tf-idf algorithm - Part 4

In this article, we will implement what we outlined theoretically in the last post and code the tf-idf algorithm (here in C#, but it can easily be adapted to another programming language).

The source code is available upon request via email (elementsofcomputerscience(at)gmail.com).

Implement the underlying concepts

First and foremost, we will need to implement the concepts we highlighted in the previous posts.

  • In Visual Studio, create a new Blank Solution project and give it a name.

  • In this solution, add a new Class Library project and name it ContentBasedRecommendations.Data for example.

What is an article?

An article is essentially composed of a title (headline), a summary, and content.

public class Article
{
    public string Id { get; set; }

    public string Name { get; set; }

    public string Summary { get; set; }

    public string Content { get; set; }
}

As mentioned earlier, since we utilize the vector space model, we will need to represent documents as vectors. This necessitates defining an ArticleVector class.

public class ArticleVector
{
    public Dictionary<long, double> Components { get; set; } = new Dictionary<long, double>();

    public string Id { get; set; }

    public long Dimension { get; set; }

    ...
}
Important

The vectors representing the documents are typically very sparse (mostly zeros), since the terms appearing in one document are only a small fraction of all the terms present in the corpus. That is why we use a dictionary to store only the non-zero components of the vector; an array of doubles would be very inefficient in this case.
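The elided members of ArticleVector are used later by the recommender: a constructor taking an id and a dimension, a Set method, and a Normalize method. Their bodies are not shown in this series, so the following is only a minimal sketch consistent with that usage:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A minimal sketch of the elided ArticleVector members; the bodies are
// assumptions, only the signatures are implied by their use in the recommender.
public class ArticleVector
{
    public Dictionary<long, double> Components { get; set; } = new Dictionary<long, double>();

    public string Id { get; set; }

    public long Dimension { get; set; }

    public ArticleVector(string id, long dimension)
    {
        Id = id;
        Dimension = dimension;
    }

    // Store a component only if it is non-zero, to keep the vector sparse.
    public void Set(long position, double value)
    {
        if (value != 0.0) Components[position] = value;
    }

    // Scale the vector to unit Euclidean length, as the cosine similarity expects.
    public void Normalize()
    {
        var norm = Math.Sqrt(Components.Values.Sum(v => v * v));
        if (norm == 0.0) return;
        foreach (var key in Components.Keys.ToList())
        {
            Components[key] /= norm;
        }
    }
}
```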

Finally, we define the ArticleRecommended class, which models the final articles proposed to the user. Therefore, a recommendation is a list of ArticleRecommended objects.

public class ArticleRecommended
{
    public string Id { get; set; }

    public string Name { get; set; }

    public string Summary { get; set; }

    public DateTime UpdatedAtUtc { get; set; }
}

public class Recommendation
{
    public string CurrentArticle { get; set; }

    public List<ArticleRecommended> RelatedArticles { get; set; }
}

Our goal is to implement an application that, for a given Article object, provides a recommendation and, consequently, for all the articles in the corpus, generates a list of recommendations.

Define the interfaces

We will establish an interface that the algorithms must implement.

public interface INewsRecommender
{
    List<Recommendation> CreateRecommendations();
}

We will also require an interface to specify the similarity between vectors.

public interface IComputeSimilarity
{
    double GetSimilarity(ArticleVector v1, ArticleVector v2);
}

This interface will be implemented by two classes, one for the Jaccard similarity and the other for the cosine similarity.

public class JaccardComputeSimilarity : IComputeSimilarity
{
    public double GetSimilarity(ArticleVector v1, ArticleVector v2)
    {
        if (v1.Dimension != v2.Dimension) throw new ArgumentException();

        var list1 = v1.Components.Keys;
        var list2 = v2.Components.Keys;

        var intersection = list1.Intersect(list2);
        var union = list1.Union(list2);

        return (double)intersection.Count() / (double)union.Count();
    }
}

public class CosineComputeSimilarity : IComputeSimilarity
{
    public double GetSimilarity(ArticleVector v1, ArticleVector v2)
    {
        if (v1.Dimension != v2.Dimension) throw new ArgumentException();

        v1.Normalize();
        v2.Normalize();

        var res = 0.0;
        foreach (var item1 in v1.Components.Keys)
        {
            if (!v2.Components.ContainsKey(item1)) continue;
            res += v1.Components[item1] * v2.Components[item1];
        }

        return res;
    }
}
Information

Here we see the value of using a data structure suited to sparse vectors instead of arrays of doubles: computing a similarity involves only a handful of multiplications, one per non-zero component shared by the two vectors.
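To make the point concrete, here is a small self-contained illustration (not the article's classes) of a cosine similarity computed directly on two sparse dictionaries; only the keys present in both dictionaries contribute a multiplication:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Cosine similarity on sparse vectors stored as dictionaries: we iterate over
// the smaller dictionary and look up each key in the larger one, so the cost
// is proportional to the number of non-zero components, not the dimension.
static double SparseCosine(Dictionary<long, double> a, Dictionary<long, double> b)
{
    var (small, large) = a.Count <= b.Count ? (a, b) : (b, a);
    var dot = small.Sum(kv => large.TryGetValue(kv.Key, out var x) ? kv.Value * x : 0.0);
    var normA = Math.Sqrt(a.Values.Sum(v => v * v));
    var normB = Math.Sqrt(b.Values.Sum(v => v * v));
    return dot / (normA * normB);
}

var v1 = new Dictionary<long, double> { [0] = 1.0, [3] = 2.0 };
var v2 = new Dictionary<long, double> { [3] = 2.0, [7] = 1.0 };
Console.WriteLine(SparseCosine(v1, v2)); // 4 / (sqrt(5) * sqrt(5)) = 0.8
```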

Implement the tf-idf algorithm

Now we only need to implement the tf-idf algorithm itself.

public class TfIdfNewsRecommender : INewsRecommender
{
    private readonly List<Article> _articles;
    private readonly List<IContentProcessor> _processors;
    private readonly IComputeSimilarity _similarity;

    public TfIdfNewsRecommender(List<Article> articles)
    {
        _articles = articles;

        _processors = new List<IContentProcessor>()
        {
            new WriteInLowerCaseContentProcessor(),
            new RemoveApostrophesContentProcessor(),
            new RemoveColonsContentProcessor(),
            new RemoveCommasContentProcessor(),
            new RemoveSpecialCharactersContentProcessor(),
            new RemoveExclamationsContentProcessor()
        };

        _similarity = new CosineComputeSimilarity();
        //_similarity = new JaccardComputeSimilarity();
    }

    public List<Recommendation> CreateRecommendations()
    {
        var processedArticles = new List<ArticleScoredWithTfIdf>();
        var corpora = new Dictionary<string, long>();

        foreach (var article in _articles)
        {
            var content = article.Content.Trim();
            if (string.IsNullOrWhiteSpace(content)) continue;

            foreach (var processor in _processors)
                content = processor.Process(content);

            var words = content.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries).Where(t => t.Length >= 4).ToList();

            // corpora stores the document frequency of each term (how many
            // documents contain it), hence the Distinct() below.
            foreach (var word in words.Distinct())
            {
                if (corpora.ContainsKey(word))
                {
                    corpora[word] = corpora[word] + 1;
                }
                else
                {
                    corpora.Add(word, 1);
                }
            }

            var articleScored = new ArticleScoredWithTfIdf()
            {
                Id = article.Id,
                Name = article.Name,
                Summary = article.Summary,
                Terms = new Dictionary<string, int>(),
                TermsScored = new Dictionary<string, double>()
            };

            foreach (var word in words)
                articleScored.AddTerm(word);

            processedArticles.Add(articleScored);
        }

        // Compute idf for each term: log10(N / df), where N is the number of
        // documents and df the document frequency. The cast avoids integer division.
        var N = processedArticles.Count;
        var idfs = new Dictionary<string, double>();
        foreach (var word in corpora.Keys)
        {
            idfs[word] = Math.Log10((double)N / corpora[word]);
        }

        // Compute tf-idf for each term and each document
        foreach (var articleScored in processedArticles)
        {
            foreach (var term in articleScored.Terms.Keys)
            {
                articleScored.TermsScored[term] = articleScored.ComputeTermFrequency(term) * idfs[term];
            }
        }

        // Map each term to a vector dimension
        var dimensions = corpora.Keys.OrderBy(t => t).ToList();
        var dimension = dimensions.Count;
        var map = new Dictionary<string, long>();
        var i = 0L;
        foreach (var word in dimensions)
        {
            map.Add(word, i);
            i++;
        }

        // Build the document vectors
        var vectors = new List<ArticleVector>();
        foreach (var articleScored in processedArticles)
        {
            var vector = new ArticleVector(articleScored.Id, dimension);
            foreach (var term in articleScored.TermsScored.Keys)
            {
                var position = map[term];
                vector.Set(position, articleScored.TermsScored[term]);
            }

            vectors.Add(vector);
        }

        // Compute the similarity between every pair of vectors
        var recommendations = new List<Recommendation>();
        var now = DateTime.UtcNow;
        foreach (var vector in vectors)
        {
            var scores = new Dictionary<string, double>();
            var remainingVectors = vectors.Where(t => t.Id != vector.Id).ToList();
            foreach (var remainingVector in remainingVectors)
            {
                var sim = _similarity.GetSimilarity(vector, remainingVector);
                scores.Add(remainingVector.Id, sim);
            }

            var recommendation = new Recommendation() { CurrentArticle = vector.Id, RelatedArticles = new List<ArticleRecommended>() };
            foreach (var score in scores.OrderByDescending(x => x.Value).Take(5))
            {
                var processedArticle = processedArticles.FirstOrDefault(x => x.Id == score.Key); // A dictionary keyed by id would be faster here.
                var recommended = new ArticleRecommended()
                {
                    Id = processedArticle.Id,
                    Name = processedArticle.Name,
                    Summary = processedArticle.Summary,
                    UpdatedAtUtc = now
                };
                recommendation.RelatedArticles.Add(recommended);
            }

            recommendations.Add(recommendation);
        }

        return recommendations;
    }
}

public class ArticleScoredWithTfIdf
{
    public string Id { get; set; }

    public string Name { get; set; }

    public string Summary { get; set; }

    public Dictionary<string, int> Terms { get; set; }

    public Dictionary<string, double> TermsScored { get; set; }

    public void AddTerm(string term)
    {
        if (Terms.ContainsKey(term))
        {
            Terms[term] = Terms[term] + 1;
        }
        else
        {
            Terms.Add(term, 1);
        }
    }

    public double ComputeTermFrequency(string term)
    {
        if (!Terms.ContainsKey(term)) return 0.0;
        return Math.Log10(1 + Terms[term]);
    }
}

This code is long but follows the steps emphasized in the last article: it takes a list of articles, applies the chosen similarity metric, and returns a list of recommendations.
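As a quick numeric sanity check of the scoring used above (tf = log10(1 + count), idf = log10(N / df) with N the number of documents and df the number of documents containing the term), here is a hypothetical example with made-up figures:

```csharp
using System;

// Hypothetical corpus of N = 100 documents; the term appears in df = 4 of
// them, and 9 times in the current document.
var tf = Math.Log10(1 + 9);        // log10(10) = 1
var idf = Math.Log10(100.0 / 4);   // log10(25), about 1.398
var tfIdf = tf * idf;              // about 1.398
Console.WriteLine(tfIdf);
```

A term appearing in every document gets idf = log10(N / N) = 0, so ubiquitous words (including many stop-words) are automatically neutralized.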

Important

The details matter. At the start of the algorithm, we use a list of preprocessors to format each article's content appropriately: converting the text to lowercase, removing special characters, punctuation marks, colons, and so on. This step is essential to keep the corpus clean. While we won't delve into the intricacies of this process, such data cleaning is often necessary to improve the performance and efficiency of algorithms.
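The IContentProcessor interface and its implementations are not listed in this post; a plausible sketch of the interface and two of the processors (the method bodies here are assumptions) could be:

```csharp
// Assumed shape of the IContentProcessor interface used by TfIdfNewsRecommender.
public interface IContentProcessor
{
    string Process(string content);
}

// Converts the whole text to lower case so that "Wildfire" and "wildfire"
// count as the same term.
public class WriteInLowerCaseContentProcessor : IContentProcessor
{
    public string Process(string content) => content.ToLowerInvariant();
}

// Replaces commas with spaces so they do not stick to adjacent words
// when the content is later split on spaces.
public class RemoveCommasContentProcessor : IContentProcessor
{
    public string Process(string content) => content.Replace(",", " ");
}
```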

Run the program

Now it's time to observe the results of our efforts. We will compare the recommended articles for the two similarity metrics and examine a few articles to see what items are suggested.

  • Add a Console Application project named ContentBasedRecommendations.Console.

  • Add the following code in the Program.cs file.

internal class Program
{
    static void Main(string[] args)
    {
        var articles = new List<Article>();
        var contents = File.ReadAllLines(AppContext.BaseDirectory + "/articles.txt");

        // Split on commas that are outside double-quoted fields
        var parser = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");

        foreach (var line in contents.Skip(1))
        {
            try
            {
                var c = parser.Split(line);
                if (c.Length != 18) continue;
                if (c[11].Length < 1000) continue; // keep only articles with enough content

                var article = new Article()
                {
                    Id = c[16],
                    Name = c[0],
                    Summary = c[3],
                    Content = c[11]
                };
                articles.Add(article);
            }
            catch { } // skip malformed lines
        }

        var recommender = new TfIdfNewsRecommender(articles);
        var results = recommender.CreateRecommendations();
    }
}
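The regular expression in Main splits a CSV line on commas only when they appear outside double-quoted fields: a comma is a split point only if the quotes that follow it come in complete pairs. A quick illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// A comma matches only when followed by an even number of double quotes,
// i.e. when the comma itself lies outside any quoted field.
var parser = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
var fields = parser.Split("title,\"a, quoted, summary\",body");
Console.WriteLine(fields.Length); // 3: the commas inside quotes are not split points
```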

Results with the Jaccard similarity

Original article: Welcome To The New Democratic Party
Recommended articles:
- Here’s Joe Biden’s Opening Bid To The Left For The General Election
- Obama's Trade Deals: A Test For Hillary Clinton
- Defend Workers and the Environment Before Voting Fast Track
- Now The GOP Is Defending Obamacare Repeal By Attacking Hillary Clinton
- At The DNC, Democrats Embraced The Civil Rights Movement To Explain Their Place In The Present

Original article: It'll Be Really Hot In California This Weekend, And Wildfires Are Still Blazing
Recommended articles:
- The Western Forests May Be Going Away
- As California Burns, Trump's Proposed Cuts To Forest Service Loom
- The World's Largest Tree At Risk As California Wildfires Head Toward Giant Sequoia Groves
- Hurricane Laura Makes Landfall In Louisiana As Category 4 Monster
- New Wildfire Erupts Near Reagan Library In Southern California

Original article: Cobarde agresión a un jugador en plena entrevista
Recommended articles:
- Brilla Luis Suárez en victoria del Barcelona
- No nos vamos a callar, este no es su país
- MIRA: El extraño lanzamiento de Iván Rodríguez en la Serie del Caribe
- Tachan a Chávez Jr. de bulto
- #NoEraPenal, #RoboAPanamá

At first glance, it seems that the recommended articles are quite relevant. While we are not specialists in American politics, associating the Democratic Party with Clinton or Biden does not seem too far-fetched. Additionally, topics on fires in California are also well grouped. Lastly, articles in Spanish recommend articles in Spanish, aligning with intuition and reality.

Results with the cosine similarity

Original article: Welcome To The New Democratic Party
Recommended articles:
- Here’s Joe Biden’s Opening Bid To The Left For The General Election
- Obama's Trade Deals: A Test For Hillary Clinton
- Clinton Won Big Tuesday Night. But Sanders Has Won Something Too.
- Now The GOP Is Defending Obamacare Repeal By Attacking Hillary Clinton
- Free Trade' Agreements Promoted for Corporate Benefit Gut the Middle Class, Expand Wealth Inequity & Accelerate Immigration

Original article: It'll Be Really Hot In California This Weekend, And Wildfires Are Still Blazing
Recommended articles:
- The Western Forests May Be Going Away
- As California Burns, Trump's Proposed Cuts To Forest Service Loom
- New Technology Propels Fight Against West Coast Wildfires
- The World's Largest Tree At Risk As California Wildfires Head Toward Giant Sequoia Groves
- New Wildfire Erupts Near Reagan Library In Southern California

Original article: Cobarde agresión a un jugador en plena entrevista
Recommended articles:
- Brilla Luis Suárez en victoria del Barcelona
- Cumbre de las Américas: ¿Otro desastre para Obama?
- No nos vamos a callar, este no es su país
- Hispanos rechazan candidatura de Ted Cruz
- MIRA: El extraño lanzamiento de Iván Rodríguez en la Serie del Caribe

It seems that the recommended articles are also quite relevant.

Which one to choose?

Jaccard similarity or cosine similarity? Both metrics appear to yield good results, and in this case we may need to resort to online A/B testing to determine which one is better suited to our specific problem. We can also customize our algorithm further, either by designing our own similarity measure or by using other existing ones.
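As an illustration of that last point, here is a hypothetical third metric plugged into the same abstraction: the Dice coefficient on term sets, 2|A∩B| / (|A| + |B|). ArticleVector is reduced here to the minimum needed for a self-contained example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reduced stand-ins for the article's types, just enough for this sketch.
public class ArticleVector
{
    public Dictionary<long, double> Components { get; set; } = new Dictionary<long, double>();
}

public interface IComputeSimilarity
{
    double GetSimilarity(ArticleVector v1, ArticleVector v2);
}

// Hypothetical custom metric: the Dice coefficient, which compares the sets
// of non-zero dimensions of the two vectors (like Jaccard, it ignores weights).
public class DiceComputeSimilarity : IComputeSimilarity
{
    public double GetSimilarity(ArticleVector v1, ArticleVector v2)
    {
        var keys1 = v1.Components.Keys;
        var keys2 = v2.Components.Keys;
        var intersection = keys1.Intersect(keys2).Count();
        return 2.0 * intersection / (keys1.Count + keys2.Count);
    }
}
```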

What to take from this?

The tf-idf algorithm proves to be an effective method for performing content-based filtering recommendations. Even on our small sample (only 5500 articles), it manages to provide good recommendations. Its advantages are numerous:

  • It can associate articles in one language with articles in the same language without the need for explicit language specification.
  • There's no need to manually tag articles in advance.
  • No special treatment for stop-words is required.
  • In contrast to collaborative filtering, content-based filtering is not susceptible to the cold-start phenomenon. It can be used on day one of a project, without waiting for user data (which can take a long time) to accumulate. This is why it is a preferred method in the initial stages of a project.

Everything is automated.

These advantages come with certain drawbacks. Our sample here was deliberately small for illustrative purposes, but in reality we may end up managing millions of items. In such scenarios, cloud computing can come to the rescue, and the workload must be distributed intelligently. This is why some commercial companies specialize in recommendations and also offer A/B testing to tailor recommendations perfectly (for instance, BEYABLE). In a JAMstack perspective, these companies are often a good choice.

Final thoughts

In this article, our aim was to provide a comprehensible overview of what the tf-idf algorithm is and how it can be implemented to provide accurate recommendations.

If you wish to delve deeper into this topic, the following book covers all the concepts emphasized in this series and delves into more advanced ones.

Mining of Massive Datasets (Leskovec, Rajaraman, Ullman)

Do not hesitate to contact me should you require further information.