Knowledge discovery is the non-trivial process of identifying valid, novel, interesting, potentially useful and ultimately understandable patterns in data. It encompasses a wide range of techniques, from data cleaning to finding manifolds and separating mixtures. Starting in the early 1950s, ecologists contributed greatly to the development of these methods and applied them to a large number of problems. However, underlying the methodology are some fundamental questions bearing on the choice and function of methods. In addition, other fields, from sociology to quantum mechanics, have developed alternatives or solutions to various problems. In this paper, I want to look at some of the general questions underlying these processes. I shall then briefly examine aspects of 3 areas (manifolds, clustering and networks), specifically the use of compression to choose between them. Finally, I shall outline some future possibilities which remain to be explored. These provide possible means of improving the results of clustering analysis in vegetation studies.
In previous studies a minimum message length fuzzy clustering method was applied to vegetation data and shown to give sensible estimates for the number of clusters as well as consistent estimates of cluster parameters. The minimum message length method provides a principled means of choosing between models and between classes of models. It comprises 2 components: one coding the model and its associated (meta)parameter values, the other coding the data given the model. The program uses uncorrelated Gaussian distributions as a model for the distribution of attributes within clusters. This assumption may not be acceptable, and in this paper a more general model, the t-distribution, is examined. The t-distribution provides a class of heavy-tailed models, including the Gaussian as a limiting case. This should be appropriate in hierarchical clustering where, even if the final clusters had internal Gaussian distributions, the upper levels would not. In addition, it may provide a better model of the within-cluster distribution of attributes even in the final clusters. Although forcing the use of t-distributions was not profitable, allowing a choice between Gaussian and t-distributions for each attribute in each class improved the results. This was despite only one attribute actually selecting the t-distribution over the Gaussian.
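The two-part message can be sketched numerically. The following toy fragment (not the program used in the study; the flat per-parameter cost is a hypothetical placeholder for a properly derived MML model part) compares message lengths for Gaussian and t-distribution models of a single attribute:

```python
import math

def nll_gaussian(xs, mu, sigma):
    # data part of the message: negative log-likelihood under
    # N(mu, sigma^2), in nats
    return sum(0.5 * math.log(2 * math.pi * sigma ** 2)
               + (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

def nll_student_t(xs, mu, sigma, nu):
    # data part under a location-scale t with nu degrees of freedom;
    # as nu grows large this approaches the Gaussian case
    c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
         - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    return sum((nu + 1) / 2 * math.log1p(((x - mu) / sigma) ** 2 / nu) - c
               for x in xs)

def two_part_length(data_nll, n_params, cost_per_param=5.0):
    # model part approximated by a flat (hypothetical) cost per stated
    # parameter; real MML derives this from the prior and the
    # precision to which each parameter is encoded
    return n_params * cost_per_param + data_nll

xs = [0.1, -0.3, 0.2, 0.4, -0.2, 0.0, 1.9]   # toy sample with one outlier
mu = sum(xs) / len(xs)
sigma = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
gauss_len = two_part_length(nll_gaussian(xs, mu, sigma), 2)
t_len = two_part_length(nll_student_t(xs, mu, sigma, nu=3.0), 3)
print(gauss_len, t_len)   # the shorter message identifies the better model
```

Allowing each attribute to choose its own model, as in the study, amounts to computing both lengths per attribute and keeping the shorter.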
In this paper, I re-examine the subtropical rainforest succession previously studied by Williams, Lance, Webb, Tracey and Dale (1969) (WLWTD) using a clustering procedure based on the Minimal Message Length principle of induction. This principle permits the optimal number of clusters to be estimated automatically. Optimality is defined here as a trade-off between quality of fit and complexity of model, both measured in message length units. Because of the common unit of measurement, we can assess the numerical effectiveness of the procedures adopted in the previous study and compare the results obtained by using density as against presence/absence data, or the value of numeric data independent of presence/absence effects. The results also bear on the “principle of explicability”, which posits that users seek interpretable results, even if they are less efficient in purely numerical terms. The optimal density result identified 8 clusters, although these were further clustered into 3 higher-level groupings. The pattern of 2 temporal stages followed by spatial segregation is clear, with extra detail concerning aberrant stands and temporal dependency in the third spatial stage also apparent. Of the analyses examined, this was the most effective at recovering structure in the data. Imposing the WLWTD analysis on density data was markedly suboptimal and even the number of clusters recognised (7) was strictly incorrect. However, by subjective interpretation WLWTD selected a number of clusters which was very close to the optimal density solution. For this reason, insight gained into the operating processes was not overly compromised. The optimal density result cleans up a few corners and adds more detail, but the main outlines are sufficiently clear in the subjectively assessed presence data.
The results from the optimal presence/absence analysis were understandable and effective, though considerably less detailed than those obtained using the density data or those from WLWTD's original analyses. Indeed, the 3 clusters established using the presence data reflect the higher level of structure recognisable in the density result. Using numeric data with 0 values treated as missing showed little of interest. Invocation of Kodratoff's principle of explicability, which argues for interpretability to dominate efficiency, was unnecessary since the efficient analyses were directly interpretable. The introduction of domain knowledge during the subjective interpretation in the original analysis was apparently sufficient to counter any losses due to the inefficiency of the clustering method. Given more effective clustering methods and density data, such introduction of domain knowledge becomes unnecessary.
Many methods of cluster analysis do not explicitly account for correlation between attributes. In this paper we explicitly model any correlation using a single factor within each cluster: i.e., the correlation of attributes within each cluster is assumed to be adequately described by a single component axis. However, the use of a factor is not required in every cluster. Using a Minimum Message Length criterion, we can determine the number of clusters and also whether the model of any cluster is improved by introducing a factor. The technique allows us to seek clusters which reflect directional changes rather than imposing a zonation constrained by spatial (and implicitly temporal) position. Minimal message length is a means of applying Ockham’s Razor in inductive analysis. The ‘best’ model is that which allows most compression of the data, resulting in a minimal message length for the description. Fit to the data is not a sufficient criterion for choosing models because more complicated models will almost always fit better. Minimum message length combines fit to the data with an encoding of the model and provides a Bayesian probability criterion as a means of choosing between models (and classes of model). Applying the analysis to a pollen diagram from Southern Chile, we find that the introduction of factors does not improve the overall quality of the mixture model. The solution without axes in any cluster provides the most parsimonious solution. Examining the cluster with the best case for a factor to be incorporated in its description shows that the attributes highly loaded on the axis represent a contrast of herbaceous vegetation and dominant forest types. This contrast is also found when fitting the entire population, and in this case the factor solution is the preferred model. Overall, the cluster solution without factors is much preferred. Thus, in this case classification is preferred to ordination, although more data are desirable to confirm such a conclusion.
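The single-factor model is cheap to evaluate because the covariance Sigma = diag(psi) + a a^T admits closed forms via the matrix determinant lemma and the Sherman-Morrison identity. A minimal sketch, with hand-set (hypothetical) parameters and a flat per-parameter cost standing in for the model part of the message:

```python
import math

def factor_nll(X, mu, a, psi):
    # Gaussian data cost with covariance Sigma = diag(psi) + a a^T;
    # the matrix determinant lemma gives log det Sigma and
    # Sherman-Morrison gives the quadratic form, so no general
    # matrix inversion is needed
    d = len(mu)
    s = 1.0 + sum(a[i] ** 2 / psi[i] for i in range(d))
    logdet = sum(math.log(p) for p in psi) + math.log(s)
    total = 0.0
    for x in X:
        y = [x[i] - mu[i] for i in range(d)]
        q_diag = sum(y[i] ** 2 / psi[i] for i in range(d))
        proj = sum(a[i] * y[i] / psi[i] for i in range(d))
        total += 0.5 * (d * math.log(2 * math.pi) + logdet
                        + q_diag - proj ** 2 / s)
    return total

# toy cluster of five samples on two strongly correlated attributes
X = [(-2.0, -1.6), (-1.0, -0.9), (0.0, 0.1), (1.0, 0.8), (2.0, 1.7)]
mu, psi = (0.0, 0.02), (0.2, 0.2)
# hypothetical 3-nat cost per stated parameter for the model part
diag_len = 4 * 3.0 + factor_nll(X, mu, (0.0, 0.0), psi)   # means + variances
fact_len = 6 * 3.0 + factor_nll(X, mu, (1.4, 1.2), psi)   # plus loadings
print(diag_len, fact_len)
```

For data this correlated the factor model yields a shorter message despite its extra loading parameters; with uncorrelated data the penalty would tip the choice the other way.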
In this paper we examine the use of the minimum message length criterion in the process of evaluating alternative models of data when the samples are serially ordered in space and implicitly in time. Much data from vegetation studies can be arranged in a sequence and in such cases the user may elect to constrain the clustering by zones, in preference to an unconstrained clustering. We use the minimum message length principle to determine if such a choice provides an effective model of the data. Pollen data provide a suitably organised set of samples, but have other properties which make it desirable to examine several different models for the distribution of palynomorphs within the clusters. The results suggest that zonation is not a particularly preferred model since it captures only a small part of the patterns present. It represents a user expectation regarding the nature of variation in the data and results in some patterns being neglected. By using unconstrained clustering within zones, we can recover some of this overlooked pattern. We then examine other evidence for the nature of change in vegetation and finally discuss the usefulness of the minimum message length as a guiding principle in model choice and its relationship to other possible criteria.
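Zonation amounts to clustering constrained to contiguous segments of the sequence, and for a one-dimensional summary score the optimal zonation can be found exactly by dynamic programming. A sketch, with the per-zone penalty as a hypothetical stand-in for the model part of a message length (the study itself modelled full palynomorph counts):

```python
def zonate(xs, penalty):
    # dynamic programme: best[j] = minimal cost of segmenting xs[:j]
    # into contiguous zones, each zone costing its within-zone sum of
    # squares plus a fixed penalty for stating the zone's parameters
    n = len(xs)
    def sse(i, j):                       # within-zone sum of squares
        seg = xs[i:j]
        m = sum(seg) / len(seg)
        return sum((x - m) ** 2 for x in seg)
    best = [0.0] + [float("inf")] * n
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + sse(i, j) + penalty
            if c < best[j]:
                best[j], cut[j] = c, i
    bounds = []                          # recover the zone boundaries
    j = n
    while j > 0:
        bounds.append((cut[j], j))
        j = cut[j]
    return list(reversed(bounds))

# hypothetical depth-ordered sample scores
scores = [0.1, 0.2, 0.0, 2.1, 2.0, 2.2, 5.0, 5.1]
print(zonate(scores, penalty=0.5))
```

Comparing the cost of the best zonation against the cost of an unconstrained clustering of the same samples is then a like-for-like comparison in message length units.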
In this paper, we use decision trees to construct models for predicting vegetation types from environmental attributes in a salt marsh. We examine a method for evaluating the worth of a decision tree and look at seven sources of uncertainty in the models produced, namely algorithmic, predictive, model, scenario, objective, context and scale. The accuracy of prediction of types was strongly affected by the scenario and scale, with the most dynamically variable attributes associated with poor prediction, while more static attributes performed better. However, examination of the misclassified samples showed that prediction of processes was much better, with local vegetation type-induced patterns nested within a broader environmental framework.
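A one-split tree (a decision stump) already illustrates the prediction task: choose the threshold on an environmental attribute that best separates the types. The attribute values and type labels below are hypothetical, and the study itself used full decision trees rather than stumps:

```python
from collections import Counter

def best_stump(values, labels):
    # try every threshold midway between consecutive sorted attribute
    # values; keep the split that misclassifies fewest samples,
    # predicting the majority type on each side
    pairs = sorted(zip(values, labels))
    best = None
    for k in range(1, len(pairs)):
        thr = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = Counter(l for v, l in pairs if v < thr)
        right = Counter(l for v, l in pairs if v >= thr)
        correct = left.most_common(1)[0][1] + right.most_common(1)[0][1]
        if best is None or correct > best[0]:
            best = (correct, thr,
                    left.most_common(1)[0][0], right.most_common(1)[0][0])
    correct, thr, lo_type, hi_type = best
    return thr, lo_type, hi_type, correct / len(pairs)

# hypothetical elevation (m) against observed marsh vegetation type
elev = [0.2, 0.3, 0.35, 0.5, 0.6, 0.7, 0.75, 0.9]
veg = ["Sarcocornia"] * 4 + ["Sporobolus"] * 4
thr, lo, hi, acc = best_stump(elev, veg)
print(thr, lo, hi, acc)
```

Examining the misclassified samples, as the paper does, then asks whether errors cluster by scenario or scale rather than at random.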
In this paper we examine the impact of runnelling on the vegetation of a salt marsh. Runnelling is a form of habitat modification used for mosquito control in Australia. Defining the states of the system through unsupervised clustering of vegetation records using the minimum message length principle, 11 states (or classes) were identified. The runnelled sites have a greater diversity of states present than the unrunnelled ones. The states at each time for each site were then used to develop transition matrices. From these, two different pathways were identified, indicating the patterns of change. The method of showing changes relied on pictures that represent average species size and density. The two main pathways of change both started with the dominant grass (Sporobolus). One led to a reduction in Sporobolus and ended in bare ground; the other included changes involving variation in the size and density of a mix of Sporobolus and Sarcocornia. The effects can be interpreted in terms of the increased access of seawater to the marsh resulting in an extension of the lower marsh. We note, however, that this methodology does not distinguish between changes of state within a single process and changes associated with a change in the actual processes operating.
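The transition matrices can be estimated directly by counting state-to-state transitions in each site's sequence of class labels and normalising the counts by row. A minimal sketch; the site histories below are hypothetical:

```python
from collections import Counter, defaultdict

def transition_matrix(sequences):
    # count state-to-state transitions across all site histories,
    # then normalise each row to transition probabilities
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(row.values()) for b, n in row.items()}
            for a, row in counts.items()}

# hypothetical site histories labelled by cluster-derived states
site_histories = [
    ["Sporobolus", "Sporobolus", "mixed", "bare"],
    ["Sporobolus", "mixed", "Sarcocornia_mix", "Sarcocornia_mix"],
]
P = transition_matrix(site_histories)
print(P)
```

Pathways of change correspond to high-probability chains through such a matrix, read off from the dominant entries in each row.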
Correspondence analysis has found widespread application in analysing vegetation gradients. However, it is not clear how robust it is in situations where structures other than a simple gradient exist. The introduction of instrumental variables in canonical correspondence analysis does not avoid these difficulties. In this paper I propose to examine some simple methods based on the notion of the plexus (sensu McIntosh), where graphs or networks are used to display some of the structure of the data so that an informed choice of models is possible. I show that two different classes of plexus model are available. These classes are distinguished by the use in one case of a global Euclidean model to obtain a well-separated pair decomposition (WSPD) of a set of points, which implicitly involves all dissimilarities, while in the other a Riemannian view is taken and emphasis is placed locally, i.e., on small dissimilarities. I show an example of each of these classes applied to vegetation data.
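The local (Riemannian) class of plexus is straightforward to sketch: connect only those pairs of stands whose dissimilarity falls below a threshold, and read off the connected groups. The coordinates and threshold below are hypothetical:

```python
from itertools import combinations

def euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def local_plexus(points, dist, eps):
    # Riemannian-flavoured plexus: keep only the small dissimilarities
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if dist(points[i], points[j]) < eps]

def components(n, edges):
    # union-find with path halving to read off the connected groups
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# hypothetical 2-D coordinates for seven stands
stands = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10)]
edges = local_plexus(stands, euclid, eps=1.5)
print(components(len(stands), edges))
```

The global (WSPD) class, by contrast, would retain information about all pairwise separations, not just the small ones.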
This paper examines how we might test the continuum theory against the community unit theory. Adherence to one or other of these models without testing is simply an assignment of an extreme prior probability to the preferred option. The question can be rephrased to ask whether, for a set of observations, a single model is adequate or whether a mixture of models would be preferable. To judge between them involves first defining the nature of the model(s) to be fitted in each case and then comparing the complexity and quality of fit. Ockham's razor suggests that we should seek the simplest model with adequate fit, with parameters estimated with optimal precision. The simplest comparison of the two theories thus requires only the estimation of the number of clusters for the chosen model(s) of within-cluster variation. If a single cluster is of adequate quality then the continuum model is appropriate, while if several are needed then the community model is preferable for that particular dataset. To establish universal applicability of either model involves investigation of many datasets. There are several ways in which model quality can be assessed, and here I concentrate on the minimal message length principle, which is a function of the prior probability of the model and its fit to the observed data, assuming the model to be correct. This principle has been shown to perform well when compared with other possibilities. I first illustrate the procedure for making a choice between models, using a simple model, then examine two alternative formulations of within-cluster models which seem more appropriate, one static, the other dynamic.
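The one-cluster-versus-several comparison can be sketched with a crude two-part criterion: fit k clusters, then charge for the parameters, for each sample's cluster label, and for the data given the model. The per-parameter cost and the floor on the standard deviation are hypothetical stand-ins for MML's proper treatment of parameter precision:

```python
import math

def kmeans_1d(xs, k, iters=25):
    # simple 1-D k-means with deterministic initialisation so the
    # sketch is reproducible
    xs = sorted(xs)
    means = [xs[i * (len(xs) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda c: abs(x - means[c]))].append(x)
        means = [sum(g) / len(g) if g else means[c]
                 for c, g in enumerate(groups)]
    return [g for g in groups if g]

def message_length(groups, cost_per_param=3.0):
    # two-part cost: a flat (hypothetical) price per parameter, a
    # uniform code for each sample's cluster label, then the Gaussian
    # data cost within each cluster
    k = len(groups)
    n = sum(len(g) for g in groups)
    length = 2 * k * cost_per_param + (n * math.log(k) if k > 1 else 0.0)
    for g in groups:
        m = sum(g) / len(g)
        sd = max((sum((x - m) ** 2 for x in g) / len(g)) ** 0.5, 0.05)
        length += sum(0.5 * math.log(2 * math.pi * sd ** 2)
                      + (x - m) ** 2 / (2 * sd ** 2) for x in g)
    return length

xs = [-0.1, 0.0, 0.1, 0.2, 9.9, 10.0, 10.1, 10.2]
best_k = min(range(1, 4), key=lambda k: message_length(kmeans_1d(xs, k)))
print(best_k)
```

Here the data fall into two well-separated groups, so several clusters give a shorter message than one; on continuum-like data the single cluster would win the same comparison.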
In this paper, I examine the choice of response curves of plants to environmental gradients. The commonest choice has been a unimodal curve, of a shape similar to a Gaussian. This is implied, for example, by correspondence analysis. However, other alternatives are possible which may lead to different interpretations. We have therefore a problem of determining an appropriate model. In part the choice is constrained by our beliefs about the nature of the gradient and by sampling considerations, particularly the modifiable areal unit problem. For the rest it is necessary to invoke Ockham's razor and employ a technique such as minimum message length estimation in order to facilitate comparison of models and classes of models. In this paper I examine the existence of gradient(s), the possible forms of response functions and the effects of data contamination.
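The candidate response curves are easily stated. The sketch below gives a Gaussian (symmetric unimodal) response and a skewed beta-type alternative; all parameter values are hypothetical illustrations:

```python
import math

def gaussian_response(x, h, u, t):
    # symmetric unimodal response: peak height h at optimum u,
    # tolerance (width) t
    return h * math.exp(-((x - u) ** 2) / (2 * t ** 2))

def beta_response(x, h, a, b, lo, hi):
    # skewed unimodal alternative on a bounded gradient [lo, hi],
    # rescaled so its maximum value is h
    if not lo < x < hi:
        return 0.0
    z = (x - lo) / (hi - lo)
    mode = a / (a + b)              # maximiser of z**a * (1 - z)**b
    return h * (z ** a * (1 - z) ** b) / (mode ** a * (1 - mode) ** b)

# same optimum height, different shapes away from the optimum
print(gaussian_response(4.0, 10.0, 4.0, 1.5))
print(beta_response(4.0, 10.0, 2.0, 4.0, 0.0, 12.0))
```

A message length comparison between such families then proceeds exactly as for the clustering models: data cost under each curve plus the cost of stating its parameters.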