Comparing K-Means and others algorithms for data clustering - Part 3

Jan 3, 2024 · 3 min read ·

Share on:

Before delving into the C# implementation of the three algorithms, we will briefly outline the logical structure of the Visual Studio project in this post.

As explained in the previous post, we will exclusively use a dataset with two features, X and Y. Consequently, we will have a very simple DataPoint class to model a record.

 1public class DataPoint : IEquatable<DataPoint>
 2{
 3    public double X { get; set; }
 4
 5    public double Y { get; set; }
 6
 7    public double ComputeDistance(DataPoint other)
 8    {
 9        var X0 = other.X; var Y0 = other.Y;
10        return Math.Sqrt((X-X0)*(X-X0) + (Y-Y0) * (Y-Y0));
11    }
12
13    public bool Equals(DataPoint? other)
14    {
15        return other != null && X == other.X && Y == other.Y;
16    }
17
18    public override int GetHashCode()
19    {
20        return X.GetHashCode() ^ Y.GetHashCode();
21    }
22}

Important

Note that we have implemented a ComputeDistance method to assess the similarity or dissimilarity between points. The chosen distance metric is the Euclidean one. In a real-world scenario, the distance should be provided as an argument in the constructor through dependency injection.

A dataset is simply a list of DataPoint objects.

 1public class Dataset
 2{
 3    public List<DataPoint> Points { get; set; }
 4
 5    private Dataset() { Points = new List<DataPoint>(); }
 6
 7    public static Dataset Load(string path)
 8    {
 9        var set = new Dataset();
10
11        var contents = File.ReadAllLines(path);
12        foreach (var content in contents.Skip(1))
13        {
14            var d = content.Split(';');
15            var point = new DataPoint() { X = Convert.ToDouble(d[0]), Y = Convert.ToDouble(d[1]) };
16            set.Points.Add(point);
17        }
18
19        return set;
20    }
21}

Important

Data is supplied through a file and loaded using the Load method.

As the goal is to define clusters, we need to model this entity in C#. Thus, a DataCluster is simply a list of DataPoint objects labeled with a cluster number.

 1public class DataCluster : IEquatable<DataCluster>
 2{
 3    public List<DataPoint> Points { get; set; }
 4
 5    public string Cluster { get; set; }        
 6
 7    public DataPoint GetCentroid()
 8    {
 9        var avgx = Points.Average(t => t.X);
10        var avgy = Points.Average(t => t.Y);
11
12        return new DataPoint() { X = avgx, Y = avgy };
13    }
14
15    public bool Equals(DataCluster? other)
16    {
17        return other != null && Cluster == other.Cluster;
18    }
19
20    public override int GetHashCode()
21    {
22        return Cluster.GetHashCode();
23    }
24}

With this foundational code in place, it becomes possible to define an interface that all the algorithms will be required to implement.

1public interface IClusteringStragtegy
2{
3    List<DataCluster> Cluster(Dataset set, int clusters);
4}

This interface contains only one method, the objective of which is to obtain the DataCluster objects from a dataset and a predefined number of clusters to discover.

To be comprehensive, the file containing the data resembles the following format for the first dataset presented earlier.

 1X;Y
 20.2;0.22
 30.22;0.2
 40.18;0.205
 50.188;0.18
 60.205;0.185
 70.8;0.792
 80.82;0.8
 90.78;0.82
100.795;0.825
110.815;0.78
120.34;0.84
130.32;0.86
140.335;0.88
150.345;0.82
160.35;0.855

We can now commence our exploration, starting with the venerable K-Means.

Comparing K-Means and others for data clustering - Part 4