Comparing K-Means and other algorithms for data clustering - Part 1

Today we are starting a series of articles on data clustering, comparing significant algorithms in the field (in particular K-Means and DBSCAN). Our goal is to blend theoretical concepts with practical implementation in C#, clearly illustrating the challenges involved.

Clustering is a technique in machine learning and data analysis that involves grouping similar data points together based on certain features or characteristics. The goal of clustering is to identify natural patterns or structures within a dataset, organizing it into subsets or clusters. Data points within the same cluster are expected to be more similar to each other than to those in other clusters.
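To make this concrete, here is a minimal sketch in C# (the class name, the feature values, and the choice of Euclidean distance are ours, purely for illustration): each observation is represented as a feature vector, and similarity is measured as a distance between vectors, so that points belonging to the same cluster should lie closer to one another than to points of another cluster.

```csharp
using System;

// Minimal illustrative sketch: a data point is just a feature vector,
// and "similarity" is measured here with Euclidean distance.
public static class ClusteringIntro
{
    public static double EuclideanDistance(double[] a, double[] b)
    {
        var sum = 0.0;
        for (var i = 0; i < a.Length; i++)
        {
            var diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.Sqrt(sum);
    }

    public static void Main()
    {
        // Two points that intuitively belong to the same cluster...
        var p1 = new[] { 1.0, 1.2 };
        var p2 = new[] { 0.9, 1.1 };
        // ...and one point that intuitively belongs to another cluster.
        var p3 = new[] { 8.0, 7.5 };

        Console.WriteLine(EuclideanDistance(p1, p2)); // small intra-cluster distance
        Console.WriteLine(EuclideanDistance(p1, p3)); // larger inter-cluster distance
    }
}
```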

In essence, clustering allows for the discovery of inherent structures or relationships in data, providing insights into its underlying organization. This technique is widely used in various fields, including pattern recognition, image analysis, customer segmentation, and anomaly detection. Common clustering algorithms include K-Means, hierarchical clustering, CURE and DBSCAN.
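As a small preview of how K-Means, for instance, formalizes this grouping (a hedged sketch only; the actual implementation and its pitfalls are the subject of the coming articles), its central step assigns every point to the nearest of K candidate centroids:

```csharp
// Hedged preview of the K-Means assignment step: each point is attached to
// the index of its closest centroid. Updating the centroids and iterating
// until convergence are deliberately left out here.
public static class KMeansPreview
{
    public static int[] AssignToNearestCentroid(double[][] points, double[][] centroids)
    {
        var assignments = new int[points.Length];
        for (var i = 0; i < points.Length; i++)
        {
            var best = 0;
            var bestDistance = double.MaxValue;
            for (var k = 0; k < centroids.Length; k++)
            {
                // Squared Euclidean distance is enough to find the minimum.
                var distance = 0.0;
                for (var d = 0; d < points[i].Length; d++)
                {
                    var diff = points[i][d] - centroids[k][d];
                    distance += diff * diff;
                }
                if (distance < bestDistance)
                {
                    bestDistance = distance;
                    best = k;
                }
            }
            assignments[i] = best;
        }
        return assignments;
    }
}
```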

In this series of articles, we will implement the main clustering techniques in C# and compare them, assessing their strengths and weaknesses. This exploration will provide insight into the situations and types of data for which each algorithm is best suited.

The following textbooks on the topic are worth consulting. They extend beyond data clustering and cover a broad range of general machine learning topics.

Pattern Recognition and Machine Learning (Bishop)

Machine Learning: An Algorithmic Perspective (Marsland)

Probabilistic Machine Learning: An Introduction (Murphy)

Probabilistic Machine Learning: Advanced Topics (Murphy)

Mining of Massive Datasets (Leskovec, Rajaraman, Ullman)

Without further ado and as usual, let's begin with a few prerequisites to correctly understand the underlying concepts.