Optimal Leaf Ordering
Heirarchial Clustering and UPGMA
Hierarchical clustering clusters a set of input elements into a
binary tree with similar groups clustered together. Hierarchial clustering
is useful in organizing large sample of data, and has been used
extensively, amongst others, in biological study of data samples. The distance
between the two groups clustered together is based in terms of the
similarities between the groups. For instance, for a given sample of DNA sequences,
two genes with more CG content will be grouped together because they are more similar.
In this program, we use the UPGMA method of hierarchical clustering. UPGMA, acronym for
Unweighted Pair Grouping with Arithmetic Mean, is an algorithm used in bioinformatics
for the creation of a dendrogram (i.e. a rooted tree), with the assumption that
there is a constant rate of evolution (between the gene samples from orthologs/paralogs,
for instance). For more information about the UPGMA algorithm, please refer to this
wikipedia site.
Optimal Ordering
A UPGMA dendrogram with any one of its internal node flipped will yield different ordering of the leaves,
while the tree structure remains the same. Since biological analysis is often done in the context of linear
arrangement of the leaves themselves, the linear ordering of the leaves become significant.
The Optimal Leaf Ordering attempts to cluster similar groups (or leaves) taken from the UPGMA algorithm, and yields the leaf order that maximizes the sum of the similarities of adjacent leaves in the ordering. For the Optimal Leaf Ordering algorithm, please refer to the paper "Fast Optimal Leaf Ordering for Heirarchical Clustering" by Ziv B. J. et. al. (published in Bioinformatics 17 2001 S22-S29). More information about the algorithm can be found here.