In this paper we examine the use of the minimum message length criterion in the process of evaluating alternative models of data when the samples are serially ordered in space and implicitly in time. Much data from vegetation studies can be arranged in a sequence and in such cases the user may elect to constrain the clustering by zones, in preference to an unconstrained clustering. We use the minimum message length principle to determine if such a choice provides an effective model of the data. Pollen data provide a suitably organised set of samples, but have other properties which make it desirable to examine several different models for the distribution of palynomorphs within the clusters. The results suggest that zonation is not a particularly preferred model since it captures only a small part of the patterns present. It represents a user expectation regarding the nature of variation in the data and results in some patterns being neglected. By using unconstrained clustering within zones, we can recover some of this overlooked pattern. We then examine other evidence for the nature of change in vegetation and finally discuss the usefulness of the minimum message length as a guiding principle in model choice and its relationship to other possible criteria.
Many methods of cluster analysis do not explicitly account for correlation between attributes. In this paper we explicitly model any correlation using a single factor within each cluster: i.e., the correlation of atributes within each cluster is adequately described by a single component axis. However, the use of a factor is not required in every cluster. Using a Minimum Message Length criterion, we can determine the number of clusters and also whether the model of any cluster is improved by introducing a factor. The technique allows us to seek clusters which reflect directional changes rather than imposing a zonation constrained by spatial (and implicitly temporal) position. Minimal message length is a means of utilising Okham’s Razor in inductive analysis. The ‘best’ model is that which allows most compression of the data, which results in a minimal message length for the description. Fit to the data is not a sufficient criterion for choosing models because more complicated models will almost always fit better. Minimum message length combines fit to the data with an encoding of the model and provides a Bayesian probability criterion as a means of choosing between models (and classes of model). Applying the analysis to a pollen diagram from Southern Chile, we find that the introduction of factors does not improve the overall quality of the mixture model. The solution without axes in any cluster provides the most parsimonious solution. Examining the cluster with the best case for a factor to be incorporated in its description shows that the attributes highly loaded on the axis represent a contrast of herbaceous vegetation and dominant forests types. This contrast is also found when fitting the entire population, and in this case the factor solution is the preferred model. Overall, the cluster solution without factors is much preferred. Thus, in this case classification is preferred to ordination although more data are desirable to confirm such a conclusion.