Software > Hierarchical Clustering Using Ward's Algorithm
Agglomerative (bottom up) hierarchical clustering algorithms can be applied to create a hierarchy of clusters grouping similar data items, e.g., documents. Clustering starts with a set of singleton clusters, each containing a single document di D, i=1, ..., N, where D equals the entire set of documents and N equals the number of all documents. The two most similar clusters over the entire set D are merged to form a new cluster that covers both. This process is repeated for each of the remaining N-1 documents. A complete linkage algorithm is applied to determine the overall similarity of document clusters based on the document similarity matrix.
Merging of document clusters continues until a single, all-inclusive cluster remains. At termination, a uniform, binary hierarchy of document clusters is produced.
The figure below shows an example of how eight documents may be clustered hierarchically. Note that each document is identical to itself.
Ward's Algorithm (Ward, 1963) is a commonly used procedure for forming hierarchical groups of mutually exclusive subsets. It is particularly useful for large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical. Given n sets, this procedure reduces them to n - 1 mutually exclusive sets by considering the union of all possible n(n - 1)/2 pairs and selecting a union having a maximal value for the functional relation, or objective function, that reflects the criterion chosen by the investigator. By repeating this process until only one group remains, the complete hierarchical structure and a quantitative estimate of the loss associated with each stage in the grouping can be obtained according to:
The algorithm can produce an ordering of the objects, which may be informative for data display. Smaller clusters are generated, which may be helpful for discovery.
No provision can be made for a relocation of objects that may have been 'incorrectly' grouped at an early stage. Different distance metrices most likely result in different clusters. Performing multiple experiments and comparing the results is recommended.
Can be applied to cluster any data set for which a distance or similarity measure is available.
The implemented algorithm scans the data input file and stores it in a linked list. In each clustering step, the linked list is modified to reflect the number and content of currently existing clusters.
The algorithm was implemented and documented by Vivek Agrawal (email@example.com).