
Graduate Student
Dept. of Computer Science
michael@sanjuan.cs.wwu.edu
Thesis Proposal
April 9th, 1996
I. Introduction
II. The Problem
III. The Solution
IV. Proof
V. Summary
VI. References
Introduction:
This thesis will explore the applicability of Markov models to non-discrete multivariate data sets. Recent work in natural language processing [1] and speech processing [2] have shown traditional and other types of Markov models to be extremely powerful tools. In statistical language learning each letter ('a' through 'z' plus punctuation) is considered to be a single symbol. Using various techniques the model 'learns' from a base set of text and generates probabilities for determining what the next symbol should be given the symbols previously seen. Typically the Markov model looks at the current state as well as the past two or three states when generating its probabilities. How far the model 'looks back' is called the memory length of the model.
The Problem:
Traditional and other sorts of Markov models have been applicable only to univariate data sets where the data is composed of discrete elements. Statistical language learning with its single data stream of discrete symbols ('a'..'z') is a classic example of this. In the world of environmental science however most data sets are neither univariate nor composed of discrete data elements. Both the Institute for Watershed Studies and the Institute for Environmental Toxicology and Chemistry here at Western Washington University produce numerous data sets that are both multi-variate and non-discrete [3]. Adding further complexity to these data sets (and environmental data sets in general) is the fact that the variables are often interdependent - the value of variable A at time t can affect the value of variable B at time t+1.
Several data analysis tools have been created here at W.W.U. to analyze these data sets including Riffle [4], Riggle [5] and Worm [6] but as of yet none have been developed that actually simulate the data sets and provide statistical information on specific variable interdependencies within the data sets. Since Markov modeling has proven to be a powerful tool for simulation and statistical modeling of discrete univariate data sets I propose a variation of Markov modeling that will be applicable to data sets that are multivariate and whose data elements are non-discrete. Hopefully such a system will provide powerful simulation and statistical information on the data sets it is applied to.
The Solution:
There are three fundamental barriers to the application of Markov models to multivariate non-discrete data sets. The first barrier is in the data itself. Where a language data set may be composed of 26 discrete values 'a' to 'z', the typical environmental data set is composed of non-discrete numbers which may be very large and/or very small. Since the Markov model depends on a finite number of possible variable states the direct use of non-discrete data creates an explosive growth in the size of the Markov model. Furthermore the data set itself could be considered as infinitely sparse since the number of possible discrete values between any two real numbers is infinite. It is necessary therefore to provide some mechanism for the conversion of the non-discrete data into a discrete form, and the conversion method itself may depend greatly on the nature of the experiment that produced the data. I propose two simple conversion schemes.
The first, a scalar scheme, simply utilizes the notion of a 'minimum significant value' to define change. The data set is multiplied by this value, a scalar, and rounded to the nearest unit value. The second is a logarithmic scheme. This scheme first takes a logarithm of the raw data and then multiplies by a scalar. The logarithmic scheme is particularly useful for data whose value is dependent on its prior value - for instance the 'unit change of one' for bacteria in a petri dish may be 'doubling its population'. Clearly the granularity of the choice for scalar values will greatly affect the Markov model itself since the number of discrete states a variable can have is proportional to the possible values a variable can have.
It is important to note that in the conversion of numerical data to discrete data we lose some of the information in the data set - namely the predecessor/successor relationship. In a truly discrete data set such as the alphabet there is no inherent relationship between the elements - 'b' is not any closer to 'a' than 'g' is. With numbers however there is a concept of 1001 being close to 1000 than 2000. It may prove valuable to provide some sort of technique to utilize this information in the Markov model. Furthermore it may be necessary to develop a notion of dynamic states such as 'variable A is decreasing' rather than just static states such as 'variable A is 5'. At this time I do not have any particular solutions to these problems and will address them on an 'as needed' basis if they appear during the development of the project.
The next barrier is the memory length of the Markov model. In some instances, events that occurred some time ago (in the model) may still have predictive value at the current state. If the Markov model tries to look backwards a long distance the size of the model grows exponentially. I propose using a variation of what is known as the variable memory length Markov model [7][8]. This technique increases the memory length for sequences that increase the predictive value of the model while disregarding those that do not. This allows the model to avoid the exponential growth problem yet still be able to have a very deep memory when it is useful to do so.
The final barrier is the application of the Markov model to multivariate data sets. The Markov model is applicable to only the potential states of a single variable. I propose that the model be represented as a set of super-states where each superstate is composed of the logical concatenation of the states of all the variables in the model. There are several problems with this technique. First the number of states grows exponentially with the number of variables. Secondly the inter-dependency of the variables on one another (which in some cases is the point of the experiment) is entirely lost. The system I propose utilizes the super-state notion but the super-state is composed only of those variables that offer predictive value to the model as determined by the Kullback-Liebler divergence [9]. This system ought to keep the number of states in the model at a reasonable level. Furthermore, since irrelevant variables are not included in a given super-state it is easier to obtain direct information on which variables in the data affect others, and what that effect is.
Proof:
There are two fundamental problems being addressed. The first is the simulation of the data sets and the second is the gathering of statistical information on variable interdependencies within the data sets. Although in the system I propose the two are functionally inseparable it will be necessary to prove the accuracy of each separately. I propose developing the modeling system described above and applying it to various data sets generated by the Institute of Environmental Toxicology and Chemistry.
To prove the effectiveness of the model simulation I propose training the model on a sub-set of the IETC data and then using the trained model to simulate the other sets. The correlation between the results of the original data sets and the model generated sets will yield an accurate measure of the model's ability to simulate those types of data sets in general.
To prove the effectiveness of the statistical information I propose training the model on a full set of IETC data and then reviewing the statistics on those variables that have a high correlation of interdependency. This information can then be compared to known variable interdependencies at IETC for verification.
Summary:
The system I propose will model non-discrete multi-variate data sets utilizing a variation of Markov Modeling. Before the data can be analyzed it needs to be converted from a non-discrete to a discrete form. The method of conversion will depend upon the experiment itself and it may be necessary to accommodate some new kinds of states. A variable length memory model will be used to capture statistical information about long-term data dependencies without causing an explosive growth in the size of the model. The model itself will be composed of a set of super-states each of which will be a vector of variable states. The variable states present in any given super-state will be only those that offer predictive value to the model as a whole thus avoiding explosive growth in the size of the model as well as providing better statistical information on data inter-dependencies in the model itself.
References:
- E. Charniak. Statistical Language Learning. The MIT Press, 1993
- H. Schutze, Y. Singer. Part-of-Speech Tagging Using a Variable Memory Markov Model, Center for the Study of Language & Information, Stanford, CA 1995
- G. Matthews, R. Matthews, W. Landis, Nonmetric Conceptual Clustering in Ecology and Ecotoxicology, Western Washington University, 1994
- G. Matthews, J. Hearne. Clustering without a metric, IEEE Transactions on Pattern Analysis and Machine Intelligence 13(2):175-184, 1993
- C. Pickreign, Riggle, Western Washington University, 1995
- M. Roze. Worm, Western Washington University, 1995
- H. Schutze, Y. Singer. Part-of-Speech Tagging Using a Variable Memory Markov Model, Center for the Study of Language & Information, Stanford, CA 1995
- D. Ron, Y. Singer, N. Tishby. The Power of Amnesia, Institute of Computer Science and Center for Neural Computation, Hebrew University, 1995
- S. Kullback. Information Theory and Statistics, New York, Wiley, 1959
home | research page | ftp NetCDF tools | ftp this proposal