Comparing K-Means and others algorithms for data clustering - Part 3
Before delving into the C# implementation of the three algorithms, we will briefly outline the logical structure of the Visual Studio project in this post.
As explained in the previous post, we will exclusively use a dataset with two features, X and Y. Consequently, we will have a very simple DataPoint class to model a record.
1public class DataPoint : IEquatable<DataPoint>
2{
3 public double X { get; set; }
4
5 public double Y { get; set; }
6
7 public double ComputeDistance(DataPoint other)
8 {
9 var X0 = other.X; var Y0 = other.Y;
10 return Math.Sqrt((X-X0)*(X-X0) + (Y-Y0) * (Y-Y0));
11 }
12
13 public bool Equals(DataPoint? other)
14 {
15 return other != null && X == other.X && Y == other.Y;
16 }
17
18 public override int GetHashCode()
19 {
20 return X.GetHashCode() ^ Y.GetHashCode();
21 }
22}
Note that we have implemented a ComputeDistance method to assess the similarity or dissimilarity between points. The chosen distance metric is the Euclidean one. In a real-world scenario, the distance should be provided as an argument in the constructor through dependency injection.
A dataset is simply a list of DataPoint objects.
1public class Dataset
2{
3 public List<DataPoint> Points { get; set; }
4
5 private Dataset() { Points = new List<DataPoint>(); }
6
7 public static Dataset Load(string path)
8 {
9 var set = new Dataset();
10
11 var contents = File.ReadAllLines(path);
12 foreach (var content in contents.Skip(1))
13 {
14 var d = content.Split(';');
15 var point = new DataPoint() { X = Convert.ToDouble(d[0]), Y = Convert.ToDouble(d[1]) };
16 set.Points.Add(point);
17 }
18
19 return set;
20 }
21}
Data is supplied through a file and loaded using the Load method.
As the goal is to define clusters, we need to model this entity in C#. Thus, a DataCluster is simply a list of DataPoint objects labeled with a cluster number.
1public class DataCluster : IEquatable<DataCluster>
2{
3 public List<DataPoint> Points { get; set; }
4
5 public string Cluster { get; set; }
6
7 public DataPoint GetCentroid()
8 {
9 var avgx = Points.Average(t => t.X);
10 var avgy = Points.Average(t => t.Y);
11
12 return new DataPoint() { X = avgx, Y = avgy };
13 }
14
15 public bool Equals(DataCluster? other)
16 {
17 return other != null && Cluster == other.Cluster;
18 }
19
20 public override int GetHashCode()
21 {
22 return Cluster.GetHashCode();
23 }
24}
With this foundational code in place, it becomes possible to define an interface that all the algorithms will be required to implement.
1public interface IClusteringStragtegy
2{
3 List<DataCluster> Cluster(Dataset set, int clusters);
4}
This interface contains only one method, the objective of which is to obtain the DataCluster objects from a dataset and a predefined number of clusters to discover.
To be comprehensive, the file containing the data resembles the following format for the first dataset presented earlier.
1X;Y
20.2;0.22
30.22;0.2
40.18;0.205
50.188;0.18
60.205;0.185
70.8;0.792
80.82;0.8
90.78;0.82
100.795;0.825
110.815;0.78
120.34;0.84
130.32;0.86
140.335;0.88
150.345;0.82
160.35;0.855
We can now commence our exploration, starting with the venerable K-Means.