Unsupervised Machine Learning: Key Algorithms

Karl Retumban
Oct 24, 2024
2 min read

In terms of dealing with unlabeled data, Unsupervised Machine Learning are applied. It enables the models to work with unlabeled data, allowing them to identify hidden patterns, relationships, and structures without predefined categories. The main objective of unsupervised learning is exploration and pattern recognition, making it valuable in areas like customer segmentation, anomaly detection, and recommendation systems.

Unlike supervised learning, where models learn from labeled examples, unsupervised learning finds intrinsic structures within data. Among the numerous techniques available, we discuss 3 essential types which are K-Means Clustering, Hierarchical Clustering, and Principal Component Analysis (PCA).

K-Means Clustering

One of the most widely used clustering algorithms, K-Means Clustering groups data points into K distinct clusters based on their similarities.

Algorithm

Randomly selecting K centroids.
Assigning each data point to the nearest centroid.
Recalculating the centroid based on the mean of all assigned points.
Repeating the process until centroids stabilize.

Strengths

Efficient for large datasets.
Works well with clearly separated clusters.

Challenges

Requires predefining K, which may not always be known.
Sensitive to outliers.

Hierarchical Clustering

Hierarchical clustering is a method that builds a tree-like structure of nested clusters, grouping data points based on their similarities. It uses either agglomerative (bottom-up) or divisive (top-down) clustering approaches.

Algorithm

Agglomerative clustering starts with individual points and merges them into larger clusters.
Divisive clustering starts with a large cluster and splits it into smaller sub-clusters.
The resulting hierarchy can be visualized using a dendrogram, which helps determine the optimal number of clusters.

Strengths

No need to predefine the number of clusters.
Useful for hierarchical or taxonomic data structures.
Provides interpretable visualizations of nested clusters.

Challenges

Computationally expensive for large datasets.
Once merged or split, clusters cannot be reassigned.
Choosing the right level of hierarchy can be subjective.

Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique rather than a clustering algorithm. It simplifies large datasets by transforming them into fewer components while retaining most of the variance.

Algorithm