Summary of Cluster Analysis Distances


Cluster analysis is one of the most useful techniques in research and applied studies across a wide range of fields. It can also be regarded as a data reduction technique, like principal component analysis (PCA), except that instead of analyzing the variables we analyze the profiles or records. The starting point of a cluster analysis is the proximity matrix, which measures the similarity between objects; this is the most important concept for building clusters.

I am not going to give the full theory of this technique; rather, I want to describe the kinds of distances that can be used to measure the similarity or dissimilarity between objects. Several software packages can perform cluster analysis, including Matlab and R (I mention these two because of the number of users they have), and this raises the question of which distance to use for the analysis, especially for people who are not deeply familiar with the theory of statistics or mathematics. So let's go.

The proximity between objects can be analyzed through similarity or dissimilarity measures. A common example of a similarity is the Pearson correlation coefficient, while the Euclidean distance is a common dissimilarity measure.
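
The post mentions Matlab and R as options; purely as an illustration, here is a minimal sketch of how such a proximity matrix can be built, written in Python with NumPy and SciPy (my own choice of language, not the post's), using the Euclidean distance as the dissimilarity measure:

```python
# Minimal sketch: build the proximity (distance) matrix that cluster
# analysis starts from. Python with NumPy/SciPy is assumed here.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))       # 5 objects (rows) measured on 3 variables

# Pairwise Euclidean distances, arranged as a symmetric 5 x 5 matrix
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))
```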

Dissimilarity

Considering two objects \(r\) and \(s\), \(p_{rs}\) is a dissimilarity measure if its values are greater than or equal to 0, \(p_{rs}=0\) when the two objects are identical, and \(p_{rs}=p_{sr}\). All of the distances below are illustrated in the code sketch after the list.

  • Euclidean Distance: This distance is the most common in cluster analysis because it measures the geometric distance between two points in an n-dimensional space, which lets us see whether two points are near to or far from each other in that space. It gives the same result whether the variables are centered or not. It works well when the variables are measured on the same scale, or when there are no big differences between their scales. This distance is expressed as follows:

    \(d_{rs}^2=\sum_{j=1}^p (x_{rj}-x_{sj})^2\)

  • Standardized Euclidean Distance: When the variables have different scales of measurement, the Euclidean distance is not a good dissimilarity index because it can be heavily influenced by the variable with the largest scale. In this situation, the standardized Euclidean distance is a good alternative. As you can see, it is similar to the Euclidean distance but weights each variable by its variance:

    \(d_{rs}^2=\sum_{j=1}^p \frac{(x_{rj}-x_{sj})^2}{s_j^2}=(x_r-x_s)'D^{-1}(x_r-x_s)\)

  • Mahalanobis Distance: This distance takes into account both the differences in variance between the variables and their covariance structure. It is equivalent to applying the Euclidean distance to the full matrix of principal component scores standardized to unit variance:

    \(d_{rs}^2=(x_r-x_s)'S^{-1}(x_r-x_s)\)

    You should note that this distance removes the effect of the covariance structure, which makes it inadequate in situations where the correlation carries important information for the distance.

  • Manhattan or City Block Metric: This distance is based on the sum of the absolute values of the differences among the coordinates. In this metric, a constant difference of \(a\) in each of the p coordinates has the same effect on the total distance as changing the difference in only one coordinate by the amount \(pa\). That is not true for the Euclidean distance; for example, \(5^2 + 5^2 \neq (5 + 5)^2\). Furthermore, this metric is much less sensitive to the presence of outliers.

    \(b_{rs}=\sum_{j=1}^p |x_{rj}-x_{sj}|\)

  • Minkowski Metric: The Minkowski metric is a more general distance that covers some of the distances presented above: it is the Euclidean distance when \(\lambda=2\) and the Manhattan distance when \(\lambda=1\). The triangle inequality \(m_{rs} \leq m_{rm}+m_{ms}\) always holds.

    \(m_{rs}=\left[\sum_{j=1}^p |x_{rj}-x_{sj}|^\lambda\right]^{1/\lambda}\)
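
To make the formulas concrete, here is a minimal sketch, again in Python with NumPy and SciPy (assumed, since the post itself is language-agnostic), that computes each of the five dissimilarities above for two profiles \(x_r\) and \(x_s\) taken from a small data matrix. Note that SciPy returns the distance \(d_{rs}\) itself, i.e. the square root of the squared forms written above:

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))      # 10 objects, p = 3 variables
x_r, x_s = X[0], X[1]

# Euclidean distance
d_euc = distance.euclidean(x_r, x_s)

# Standardized Euclidean: each variable weighted by its sample variance s_j^2
variances = X.var(axis=0, ddof=1)
d_seuc = distance.seuclidean(x_r, x_s, variances)

# Mahalanobis: uses the inverse of the full sample covariance matrix S
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
d_mah = distance.mahalanobis(x_r, x_s, S_inv)

# Manhattan / city block
d_man = distance.cityblock(x_r, x_s)

# Minkowski with lambda = 3 (p=2 recovers Euclidean, p=1 Manhattan)
d_min = distance.minkowski(x_r, x_s, p=3)

print(d_euc, d_seuc, d_mah, d_man, d_min)
```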

Similarity

Considering two objects \(r\) and \(s\), \(p_{rs}\) is a similarity measure if its values lie in the range \([0,1]\), \(p_{rs}=1\) when the two objects are identical, and \(p_{rs}=p_{sr}\). Both measures below are illustrated in the code sketch after the list.

  • Cosine: In multivariate analysis, the cosine of the angle between two vectors is used as a measure of similarity. It considers only the direction of the two vectors and does not depend on their lengths. This kind of measure is useful when you want to evaluate the structure of the profiles:

    \(c_{rs}=\sum_{j=1}^p x_{rj}x_{sj}/\sqrt{\sum_{j=1}^p x_{rj}^2 \sum_{j=1}^p x_{sj}^2}\)

  • Correlation coefficient: When the cosine is calculated on the centered variables, it is known as the Pearson correlation coefficient:

    \(q_{rs}=\sum_{j=1}^p (x_{rj}-\overline{x}_{r.})(x_{sj}-\overline{x}_{s.})/\sqrt{\sum_{j=1}^p (x_{rj}-\overline{x}_{r.})^2 \sum_{j=1}^p (x_{sj}-\overline{x}_{s.})^2}\)
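
A matching sketch for the two similarity measures, with the same Python/SciPy assumption. Note that `scipy.spatial.distance.cosine` returns the dissimilarity \(1 - c_{rs}\), so we subtract it from 1 to recover the similarity:

```python
import numpy as np
from scipy.spatial import distance
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
x_r, x_s = rng.normal(size=(2, 8))    # two profiles with p = 8 variables

# Cosine similarity c_rs
c_rs = 1 - distance.cosine(x_r, x_s)

# Pearson correlation q_rs (pearsonr also returns a p-value, ignored here)
q_rs, _ = pearsonr(x_r, x_s)

# Check: the correlation is the cosine applied to the centered profiles
q_check = 1 - distance.cosine(x_r - x_r.mean(), x_s - x_s.mean())
print(c_rs, q_rs, q_check)
```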

For more information, you can check the book by J. D. Jobson, “Applied Multivariate Data Analysis: Volume II”. I hope this information is useful for you.