Machine Learning – Cluster Analysis

Cluster Analysis

I’m running my Machine Learning revision very late and struggling with the second part of the course. If anyone wants to get in touch about any part two stuff I’d be more than happy to chat any time before the exam as it really helps. I shall leave my laptop on overnight so emails (bedforj8@cs.man.ac.uk) and skype messages/calls (Starscent) will wake me up. Here’s what I’ve managed to grasp of cluster analysis.

• A cluster is a collection of data objects.
• Cluster analysis is about trying to detect which objects have similar properties and are “clustered together”.
• Attributes of data objects can be discrete or continuous.
• The data can be represented as:
• Data matrix (an object by attribute structure) which holds n data points (objects) with p dimensions (attributes).
• Distance/dissimilarity matrix (an object by object structure) which holds the distances between the objects and is a triangular matrix structure.
• The distances can be measured using:
• Euclidean distance = sqrt(x ^ 2 + y ^ 2)
• Manhattan distance = x * y
• The ‘Cosine Measure‘.
• cos (x, j) = SUM(x1 * j1 … xn * jn) / (sqrt(SUM(x^2)) * sqrt(SUM(j^2)))
• d(x, j) = 0.5 * (1 – cos(x, j))

• K-Means Algorithm
• K-Means is a partitioning cluster analysis algorithm.
• The space between objects is “partitioned.”
• Classes are used to classify the data object points. Objects close together should belong to one “cluster class”. There shall be k number of classes (or clusters).
• The algorithm aims to converge the centres (centroids) of the clusters.
• The k-means algorithm assumes the number of clusters is already know.
• If the number of clusters is unknown then a guess must be made.
• The process if as follows:
• Randomly assign k centroids.
• Classify the training data by calculating the distance to the nearest centroid. The training data is mapped to the nearest centroid.
• Modify the centroids depending on the results. The new centroids are found by find the average coordinates of all the training data that was mapped to that centroid.
• Repeat from step 2 until there is no change.
• Lines can be drawn to indicate the classification boundaries.
• Issues:
• Computational complexity is O(tKn) where n is the number of objects in the truing data, K is the number of clusters (or classes) and t is the number of iterations before the centroids stop moving. t should be less than n.
• Problems:
• The centroids can converge to an unwanted solution.
• K needs to be known in advance.
• Noisy data can’t be handled – so a K-Medoids algorithm could be used instead.
• A series of linear decision boundaries are formed, which may not be too flexible.

• Hierarchical Clustering Approach
• Is interested in the distance between the centroids, as opposed to the distance between the the objects and the centroids.
• The data is partitioned sequentially.
• The number of clusters does not need to be known!
• The number of clusters (and the clusters themselves) are defined hierarchically using one of two strategies. Each strategy works out where the clusters are using the distances of each object to the nearest cluster.
• Agglomerative – a bottom-up strategy, starting with as many clusters are there are objects in the training set. These are known as atomic clusters as they only hold one object each and so can’t be split any further. The clusters are then merged together to create a refined set of clusters (starting with the nearest).
• Divisive – a top-down strategy, starting with all the objects joined together as a single cluster. This cluster is then broken down into a refined set of smaller clusters.
• The distances between the clusters can be measured in the following ways:
• Single link – the shortest distance between one element (any) of one cluster and one element (any) of another.
• Complete link – the furthest distance between two clusters, taking all elements of each cluster into account.
• Average – the average distance of all the distances from every node in one cluster to every node in the other cluster.
• The idea is to create clusters by merging the nearest cluster to any other cluster – if using an agglomerative strategy. The results for each stage is stored and the process is repeated until there’s one single cluster. With the information about each stage of the clusters, the most appropriate number of clusters (depending on whichever implementation one chooses to use) can be determined.
• The Agglomerative Algorithm
• Calculate the distances between all the data objects and stores the distances in a matrix.
• Each object is set within it’s own atomic cluster.
• Merge the two closest clusters.
• Update the distance matrix.
• Repeat steps 4 and 5 until there’s one single cluster.
• The information stored about the cluster merges can be represented in a dendrogram tree.
• The number of clusters can then be determined from this dendrogram using a distance threshold. But the best method of finding the number of clusters from this data is unclear and debated.
• In general, a distance threshold value can be used. When the distance between the clusters are greater than this threshold distance, then that’s the number of clusters that’s used.
• Heuristic rule (http://en.wikipedia.org/wiki/Heuristic#Computer_science) – cut the dendrogram with the maximum life time.
• I think of the life time as this, please correct me if I’m wrong. The life time is the distance going down the converging clusters. The distance measurement stops when it reaches a cluster vertical to it, or the cluster the line joins onto. I’m not sure if that’s right or makes sense but there we go it’s an alternative.
• Here’s an awesome thing I personally like. One can use this algorithm with a data set featuring fairly complex clusters, such as a circle as a cluster with a clump-like cluster in the middle. The algorithm will correctly detect the ring-like cluster as a cluster because each time steps 4 and 5
• Big problem is time complexity. The hierarchical algorithm has a time complexity of O(n^2). There are other hierarchical algorithm variants that have been developed to get around this problem.