| This site is maintained by: | Yi Wang (http://yi.wang.2005.googlepages.com) |
| This work is supported by: | Department of Computer Science, Tsinghua University, School of Creative Media, City University of Hong Kong, and IBM China Research Lab |
Our motivation of developing VLHMM comes from the following observations: (1) although the first-order HMM (HMM) is efficient in learning, it is inaccurate in modeling; (2) the fixed-length high-order HMM (n-HMM) is accurate in modeling but, because of its huge number of parameters, is not efficient; (3) Various variants to n-HMM, including the Mixture Transition Distribution (MTD) cite, introduces mathematical constraints to simplify n-HMM. Although the simplification reduces model complexity, it also reduces the generality and accuracy of modeling. Unlike these variants that compromise between HMM and n-HMM, VLHMM is both accurate and efficient.
HMM models probabilistic transitions over a set of, say S, states under the assumption of first-order Markovian dependency, which assumes that with the previous one state known, the current states can be accurately determined via a transition probability distribution function (p.d.f.). However, in most real systems, the current state may depend upon many previous states, instead of only one. So HMM is not able to accurately model sequential data that are highly varied, such as motion capture data, video, and many event logs.
n-HMM determines the current state by the context of previous n states. The n-th order Markovian dependency makes n-HMM able to capture rather complex dynamics accurately.
As the efficiency of a probabilistic model is highly related with the number of its parameters, we show the number of parameters of HMM and n-HMM by representing both of them by a weighted and directed graph.
Suppose that the current state, illustrated by qt, is determined by the context of previous n=3 states, illustrated by [qt-3,qt-2,qt-1], we have the transition:This is why an n-HMM can be represented by a graph over a set of n-contexts, and an HMM, as a specialization of n-HMM with n=1 can be represented by a graph over a set of 1-contexts, that is, states. On each context, there defined an output p.d.f..
Given the number of states, S, for n-HMM the number of contexts is Sn, whereas for HMM, it is S. Accordingly, for n-HMM, the number of output p.d.f.s and that of the transition probabilities are Sn and Sn+1 respectively, whereas for HMM, they are S and S2 respectively. The exponentially more parameters of n-HMM than HMM requires exponentially more training data and learning time for an unbiased learning, or equivalently, a sufficient statistics. When n is set large enough to capture the complex dynamics, learning the n-HMM usually has become intractable.
There do have other high-order HMMs, e.g., the Mixture Transition Distribution (MTD), that use less parameters to capture high-order Markovian dependencies. Although these models are more efficient in learning than n-HMM, their "less parameters" are achieved by introducing mathematical constraints to simplify the model. This reduces the generality and accuracy of these models. To our knowledge, all these models compromise between HMM and n-HMM for efficiency and accuracy. (c.f. the figure in Q.2.)
From above discussions we can see that: for achieve efficient learning, the model should have a compact parameter set; for accurate modeling, the model should have long contexts to capture high-order Markovian dependency. However, when the lengths of contexts are fixed (as n-HMM), we would be in a dilemma that the longer the contexts the exponentially more model parameters. So we allow the contexts of VLHMM have variable lengths in order to make VLHMM able to learn a minimum set of parameters that accurately models the hidden Markovian process.
In particular, because most real dynamical systems have their dynamical complexity changing over time, the generated sequences also have complexity changing over time. So we can learn some longer contexts to model part of the training sequence that are highly varied, whereas learn some shorter ones to model parts with regular dynamics . In short, it is not necessary to lengthen all the contexts equally to n when we are pursuing for an accurate modeling of an n-th order hidden Markovian process.
The above two figures illustrate an example. Suppose that the number of states, S, is known to be 3, the previous state, denoted by qt-1, would have S=3 choices, and qt-2 has again S=3 choices. As illustrated in figure (a), the contexts can be represented by the terminals of a tree. (The term "terminal" in graph theory, informally, means the shortest path from the root of a tree to a leaf node.)
What is shown in Figure (a) is the context tree of a 2-HMM, so all contexts have fixed length of n=2 and the context tree is a complete tree. However, it is possible that when qt-1 has a certain value, say 1, there gets no need to know qt-2 and any older states in order to determine the new state qt accurately. In this case, all children of the node qt-1=1 can be removed from the context tree. Such pruning of the tree would significantly reduce the total number of terminals, and equivalently, the number of contexts. In fact, it is provable that the reduced number of contexts is asymptotically exponential with respect to the average number of pruned levels. Comparing figure (b) from (a), the total number of contexts is reduced from Sn=9 to 6, which are:
| r--1 | means that | qt-1=1 and whatever older states are,
| r--2 | qt-1=2 and qt-2=1 or 2 |
| r--2--3 | qt-1=2 and qt-2=3 |
| r--3--1 | qt-1=3 and qt-2=1 |
| r--3--2 | qt-1=3 and qt-2=2 |
| r--3--3 | qt-1=3 and qt-2=3 |
Technically, whether a context, say [qt-1=1,qt-2=2,qt-3=3], should be pruned shorter can be judged by evaluating a Kullback-Leibler divergence between two transition p.d.f.s: f(qt)=P(qt|1,2,3) and g(qt)=P(qt|1,2). If the KL-divergence is small, which means that g() has similar form like f(), it would be OK to shorten the context [1,2,3] to [1,2] without losing much accuracy of modeling. Similarly, we can determine whether a context need to be grown longer. This pruning and growing operations provably optimizes a Minimum-Entropy criterion and estimates a minimum context tree cite.
For HMM and n-HMM, when S is given, the total number of contexts is known, and all the model parameters to be estimated, including the output p.d.f.s and the transition probabilities, are known. In the machine learning theory, it is said that the "model structure is known". Thus the parameters can be estimated by optimizing the Maximum-Likelihood criterion by deriving an EM algorithm. (The EM learning algorithm derived for HMM is called "Baum-Welch algorithm".)
Whereas, for VLHMM, even when S is given, the number of contexts and thus the model parameters are still unknown. In fact, the contexts are to be learned by optimizing the Minimum-Entropy criterion discussed in Q.7. So learning VLHMM has to optimize two criteria --- the Minimum-Entropy criterion for learning the minimum context set and the Maximum-Likelihood criterion for learning the model parameters.
We derive a structural-EM algorithm cite, which, by optimizing both criteria through its M-step, provably converges to a Maximum Bayesian Information Criterion (BIC) solution, where BIC is an approximation of the KL-divergence between the learned model and the real data distribution, which is usually regarded as the ultimate criterion of effective learning. The BIC penalizes complex model structure (redundant contexts) and encourages accurate estimation of the model parameters. The algorithm is briefly shown as follows:
The learning algorithm presented Q.9 requires knowing the number of states S, which is similar with that the Baum-Welch algorithm requires knowing S before it can learn an HMM or an n-HMM. When the user does not know the exact value of S, it should be determined automatically. This job is known as "model selection" in learning theory.
A simply method for model selection of VLHMM is similar with those for HMM and n-HMM. Starting from a guessed S=S', the learning algorithm is invoked to estimated the model. Again, for S=S'+1 and S=S'-1, the learning algorithm is also invoked. If the BIC of S=S'+1 is larger than that of S=S', we set S'=S'+1 and repeat the process; if the BIC of S=S'-1 is larger than that of S=S', we set S'=S'+1 and repeat the process; otherwise, S=S' is good enough, we step the process and use S=S'.
Simply, VLMM is an "observable" Markov model and can only model sequences of discrete values; whereas VLHMM is a "hidden" model and can model sequences of discrete/continuous and scalar/vector values. Learning VLMM only needs to optimize the Minimum-Entropy criterion, whereas learning VLHMM needs to optimize both the Minimum-Entropy and the Maximum-Likelihood criteria.
In some papers, VLMM was used to learn a sequence of continuous/vector values. This is done by (1) discarding the order of the values in the sequence, (2) clustering the values into a finite number of clusters, (3) substituting each value of the sequence by its cluster index, and (4) learning VLMM from the sequence of cluster indices. However, this method is far from optimal, because the error introduced in the clustering step (usually called quantization or discretization) caused by discarding the order information may be very large and is out of control.
In fact, learning a hidden Markovian model (HMM, n-HMM, or VLHMM) is itself a clustering process, where each cluster is represented by an output p.d.f. This is similar with the EM-clustering, which is the generalization of most well-known clustering methods such as K-means, in that each cluster is represented by a Gaussian p.d.f.. The difference between learning a hidden Markovian model and EM-clustering is that, the former is a generalization of the latter, where the former considers not only the distances between the values in the sequence, but also their order, or temporal coherence.
With this fact in mind, we see that the clustering+learning_VLMM approach compared with the learning_VLHMM approach is similar with the EM-clustering or K-means compared with the learning_HMM --- the order information which is critical for sequential data is incorrectly discarded.