K-Means Clustering Example
K-means is an unsupervised learning algorithm. k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (called the cluster center or centroid). k-means minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. Better Euclidean solutions can therefore be found using, for example, k-medians and k-medoids.
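A small numeric check makes the distinction concrete (the vector v below is an assumed toy example, not part of the data used later): the mean minimizes the sum of squared deviations, while the median minimizes the sum of absolute deviations.

```r
# Assumed toy data to contrast the two objectives
v <- c(0, 1, 10)
sq_err  <- function(c) sum((v - c)^2)   # k-means-style objective
abs_err <- function(c) sum(abs(v - c))  # k-medians-style objective
sq_err(mean(v))    < sq_err(median(v))   # TRUE: the mean wins on squared error
abs_err(median(v)) < abs_err(mean(v))    # TRUE: the median wins on absolute error
```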
The problem is computationally difficult (NP-hard), but efficient heuristic algorithms converge quickly to a local optimum. These heuristics are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, since both k-means and Gaussian mixture modeling employ an iterative refinement approach. Both use cluster centers to model the data; however, k-means tends to find clusters of comparable spatial extent, while the Gaussian mixture model allows clusters to have different shapes.
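The iterative refinement just described can be sketched as a naive Lloyd-style loop. This is an illustration under assumed helper names, not how R's kmeans() works internally (it defaults to the optimized Hartigan-Wong algorithm):

```r
# Naive Lloyd-style k-means sketch (illustration only; no convergence test
# or empty-cluster handling -- use kmeans() in practice).
lloyd <- function(x, k, iters = 10) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial centers
  cluster <- rep(1L, nrow(x))
  for (i in seq_len(iters)) {
    # Assignment step: attach each point to its nearest center
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop = FALSE]
    cluster <- max.col(-d)
    # Update step: move each center to the mean of its assigned points
    for (j in seq_len(k))
      centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
  }
  list(cluster = cluster, centers = centers)
}
```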
require(graphics)

# Generate 2-dimensional data
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
plot(x,
     col = cl$cluster,
     main = "K-means using random data separated into two clusters")
points(cl$centers, col = 1:2, pch = 8, cex = 2)  # asterisks mark the centroids
The fitted output below shows that the first six points all fall in cluster 2, whose center is at x = 0.04298836, y = -0.05387672.
# sum of squares
ss <- function(x) sum(scale(x, scale = FALSE)^2)
## cluster centers "fitted" to each obs.:
fitted.x <- fitted(cl); head(fitted.x)
## x y
## 2 0.04298836 -0.05387672
## 2 0.04298836 -0.05387672
## 2 0.04298836 -0.05387672
## 2 0.04298836 -0.05387672
## 2 0.04298836 -0.05387672
## 2 0.04298836 -0.05387672
The calculation below verifies the sum-of-squares decomposition: applying ss() to the fitted values, the residuals, and the data reproduces the between-cluster, within-cluster, and total sums of squares reported by kmeans(), so betweenss + tot.withinss = totss.
resid.x <- x - fitted(cl)
## Equalities : ----------------------------------
cbind(cl[c("betweenss", "tot.withinss", "totss")],  # the same two columns
      c(ss(fitted.x), ss(resid.x), ss(x)))
## [,1] [,2]
## betweenss 50.73954 50.73954
## tot.withinss 15.13286 15.13286
## totss 65.8724 65.8724
stopifnot(all.equal(cl$totss,        ss(x)),
          all.equal(cl$tot.withinss, ss(resid.x)),
          ## these three are the same:
          all.equal(cl$betweenss,    ss(fitted.x)),
          all.equal(cl$betweenss,    cl$totss - cl$tot.withinss),
          ## and hence also
          all.equal(ss(x), ss(fitted.x) + ss(resid.x))
          )
kmeans(x, 1)$withinss  # trivial one-cluster case: its within-SS equals ss(x)
## [1] 65.8724
The same data are then partitioned into 5 clusters, each with its own centroid.
## random starts do help here with too many clusters
## (and are often recommended anyway!):
(cl <- kmeans(x, 5, nstart = 25))
## K-means clustering with 5 clusters of sizes 19, 20, 26, 15, 20
##
## Cluster means:
## x y
## 1 0.2367049 -0.27280740
## 2 -0.2278296 -0.01554423
## 3 1.1318356 0.85540688
## 4 0.3790998 0.45791425
## 5 0.9359450 1.31075033
##
## Clustering vector:
## [1] 2 2 1 1 2 2 1 4 2 4 2 2 1 1 1 4 1 2 2 1 4 1 2 1 4 4 1 1 1 4 2 2 1 2 4 4 2
## [38] 1 1 1 4 4 2 2 1 2 2 2 2 1 5 3 4 3 3 5 3 3 3 5 4 5 4 3 5 3 4 3 5 3 5 5 3 3
## [75] 5 3 5 3 3 3 5 5 5 5 3 5 3 5 3 5 3 5 3 3 3 5 3 3 5 3
##
## Within cluster sum of squares by cluster:
## [1] 1.335033 1.277605 1.528026 1.273764 1.168803
## (between_SS / total_SS = 90.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
plot(x,
     col = cl$cluster,
     main = "K-means clustering with 5 clusters of sizes 19, 20, 26, 15, 20")
points(cl$centers, col = 1:5, pch = 8)