Agglomerative (bottom-up) hierarchical clustering algorithms can be applied to create a hierarchy of clusters grouping similar data items, e.g., documents. Clustering starts with a set of singleton clusters, each containing a single document d_i ∈ D, i = 1, ..., N, where D is the entire set of documents and N is the number of documents. The two most similar clusters over the entire set D are merged to form a new cluster that covers both. This merging step is repeated until a single, all-inclusive cluster remains, i.e., N-1 times in total. A complete linkage algorithm is applied to determine the overall similarity of document clusters based on the document similarity matrix. At termination, a uniform, binary hierarchy of document clusters is produced.

The figure below shows an example of how eight documents may be clustered hierarchically. Note that each document is maximally similar to itself (the diagonal of the similarity matrix).

Ward's algorithm (Ward, 1963) is a commonly used procedure for forming hierarchical groups of mutually exclusive subsets. It is particularly useful for large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical. Given n sets, this procedure reduces them to n - 1 mutually exclusive sets by considering the union of all possible n(n - 1)/2 pairs and selecting the union having an optimal value for the functional relation, or objective function, that reflects the criterion chosen by the investigator. By repeating this process until only one group remains, the complete hierarchical structure and a quantitative estimate of the loss associated with each stage of the grouping can be obtained.
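The merging loop described above can be sketched as follows. This is a minimal illustration, not the HCclustering source: it assumes a symmetric pairwise distance matrix (merging the pair with the smallest complete-linkage distance is the distance-based view of merging the two most similar clusters), and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of agglomerative clustering with complete linkage.
// Input: a symmetric N x N distance matrix; output: the N-1 merges in order.
public class CompleteLinkageSketch {

    // A cluster is represented simply as the list of document indices it contains.
    static List<List<Integer>> agglomerate(double[][] dist) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < dist.length; i++) {
            List<Integer> singleton = new ArrayList<>();
            singleton.add(i);
            clusters.add(singleton);          // start with N singleton clusters
        }
        List<List<Integer>> mergeOrder = new ArrayList<>();
        while (clusters.size() > 1) {
            // find the pair of clusters with the smallest complete-linkage distance
            int bestA = 0, bestB = 1;
            double best = Double.MAX_VALUE;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = completeLinkage(clusters.get(a), clusters.get(b), dist);
                    if (d < best) { best = d; bestA = a; bestB = b; }
                }
            }
            // merge the two clusters into their union
            List<Integer> merged = new ArrayList<>(clusters.get(bestA));
            merged.addAll(clusters.get(bestB));
            clusters.remove(bestB);           // remove the higher index first
            clusters.remove(bestA);
            clusters.add(merged);
            mergeOrder.add(merged);
        }
        return mergeOrder;                    // N-1 merges; the last covers all documents
    }

    // Complete linkage: cluster distance = maximum pairwise document distance.
    static double completeLinkage(List<Integer> a, List<Integer> b, double[][] dist) {
        double max = 0.0;
        for (int i : a)
            for (int j : b)
                max = Math.max(max, dist[i][j]);
        return max;
    }
}
```

Because every step re-scans all remaining cluster pairs, this naive sketch runs in O(N^3) time; it is meant only to make the merge order concrete.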
The algorithm can produce an ordering of the objects, which may be informative for data display, and it generates small clusters, which may be helpful for discovery. However, no provision is made for relocating objects that were 'incorrectly' grouped at an early stage, and different distance metrics will most likely result in different clusters. Performing multiple experiments and comparing the results is therefore recommended.
The algorithm can be applied to cluster any data set for which a distance or similarity measure is available.
The implemented algorithm scans the data input file and stores it in a linked list. In each clustering step, the linked list is modified to reflect the number and content of currently existing clusters.
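The linked-list bookkeeping described above can be sketched as follows. This is an assumption-laden illustration, not code from HCclustering.java: the Cluster class, its fields, and the mergeStep method are invented names, and each merge step simply splices two list nodes into one so the list always mirrors the currently existing clusters.

```java
import java.util.LinkedList;

// Illustrative sketch of a linked list holding one node per current cluster;
// each clustering step shrinks the list by one. Names are hypothetical.
public class ClusterListSketch {
    static class Cluster {
        int[] members;                       // document indices in this cluster
        Cluster(int... members) { this.members = members; }
    }

    // Replace the clusters at positions a and b (a < b) with their union,
    // reflecting one merge step of the clustering algorithm.
    static void mergeStep(LinkedList<Cluster> clusters, int a, int b) {
        Cluster cb = clusters.remove(b);     // remove the higher index first
        Cluster ca = clusters.remove(a);
        int[] union = new int[ca.members.length + cb.members.length];
        System.arraycopy(ca.members, 0, union, 0, ca.members.length);
        System.arraycopy(cb.members, 0, union, ca.members.length, cb.members.length);
        clusters.add(a, new Cluster(union)); // insert the merged cluster in place
    }
}
```

After N-1 such steps the list holds a single node covering all documents, matching the termination condition described above.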
- Download and uncompress the zipped files on your computer using WinZip or another utility.
- This should create a directory named src inside the directory to which you downloaded the archive.
- There will be a file named HCclustering.java along with input.txt and output.txt; the two .txt files are sample input and output files, respectively, and the .java file is the source file.
- Compile it with 'javac HCclustering.java' from the same directory.
- Run using 'java HCclustering input.txt output.txt'. It will read the data from the input file and write the results to the output file.
- For large networks, more memory is needed: run with the option -mx512m, which sets the maximum heap size to 512 megabytes; this value can be changed as needed.
- Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. *Journal of the American Statistical Association, 58*(301), 236-244.
The algorithm was implemented and documented by Vivek Agrawal (vivekagr@iitk.ac.in).