Personal code snippets of @tmasjc

Image by Mads Schmidt Rasmussen from unsplash.com

Minimal Bootstrap Theme by Zachary Betz

parDist

Feb 9, 2018 #parallelDist

Problem Statement:

We have a large dataset. We need to compute its distance matrix (i.e. for clustering purposes). The complexity for a N * P matrix is N(N-1)/2 * 3P

Create Sample

library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
pts <- rnorm(1e4, mean = c(5, 5, 10, 10, 15, 15), sd = 1) %>% matrix(ncol = 2, byrow = T)
pts %>% plot(xlab = "x", ylab = "y")

Package ‘parallelDist’

parallelDist is a fast parallelized alternative to R’s native dist function. The package is mainly implemented in C++ and leverages the RcppParellel package to parallelize the distance computations. In addition, it also uses Armadillo linear algebra library to optimize matrix operations during distance calculations.

In short, to compute distance matrix for large data object, use parallelDist because it is much faster.

Demo

Say we wish to compute distance matrix to compute silhouette distance for sample points above to determine the optimal number of cluster (which we already knew is 3).

# By default, parDist returns a dist object
# Here we convert to matrix for minor efficiency
dist.euclidean <- parallelDist::parDist(pts, method = "euclidean") %>% as.matrix()

# A custom function for looping
compare_silhouette <- function(k){
kmeans(pts, centers = k, nstart = 20, iter.max = 50)\$cluster %>%
# use dmatrix instead of dist
cluster::silhouette(dmatrix = dist.euclidean) %>%
summary() %>%
# extract avergae silhouette width
[[('avg.width')
}

# Here we try out various number of clusters
res <- lapply(2:5, compare_silhouette)

library(ggplot2)
data.frame(x = 2:5, y = unlist(res)) %>% ggplot(aes(x, y)) + geom_line() + geom_point() + theme_light()