The Variable-length Hidden Markov Model

This site is maintained by: Yi Wang (http://yi.wang.2005.googlepages.com)
This work is supported by: Department of Computer Science, Tsinghua University,
School of Creative Media, City University of Hong Kong, and
IBM China Research Lab

Abstract

We develop the Variable-length Hidden Markov Model (VLHMM), which automatically learns the minimum set of parameters (by optimizing a Maximum-Likelihood criterion) that accurately models a complex (high-order) hidden Markovian process (by optimizing a Minimum-Entropy criterion). The minimum number of parameters makes the learning of VLHMM efficient, and the accuracy makes VLHMM effective for many applications.

Context

  1. Introduction to VLHMM (submitted to Journal of Machine Learning Research)
  2. Application on 3D Motion Synthesis (submitted to IEEE TPAMI)
  3. Application on Data Mining (accepted by IEEE ICDM'06)
  4. Application on Data Compression (recently accepted by AAAI'07)

Introduction

In order to make the introduction brief and comprehensive, we write it in the form of Q-and-A. A formal description of VLHMM and comparisons with other models can be found in our ICDM'06 paper:
Yi Wang and et al. Mining Complex Time-Series Data by Learning Markovian Models. In IEEE ICDM'06, Dec.2006, Hong Kong.
[PDF @ mirror 1] [PDF @ mirror 2]
More detailed mathematical derivation of VLHMM can be found in the draft of a technical report (which is still preliminary and is under revision):
Yi Wang and et al. The Variable-length Hidden Markov Model and Its Applications on Sequential Data Mining. Department of Computer Science, Tsinghua University, Beijing, China.
[PDF @ mirror 1] [PDF @ mirror 2]
The Q-and-A's
Q.1 Why VLHMM in addition to the many other models?
Q.2 How does VLHMM compare with other models?
Q.3 Why HMM is inaccurate?
Q.4 Why n-HMM is accurate?
Q.5 Why HMM is efficient whereas n-HMM is not?
Q.6 What about other high-order HMMs?
Q.7 How does VLHMM achieve both accurate modeling and efficient learning?
Q.8 What is the real difficulty of learning VLHMM?
Q.9 How does VLHMM optimize both the Minimum-Entropy and the Maximum-Likelihood criteria
Q.10 How to do model selection for VLHMM?
Q.11 What is the difference between VLHMM and VLMM
Q.12 I saw VLMM was used to do the same thing as VLHMM, why?
  1. Why VLHMM in addition to the many other models?

    Our motivation of developing VLHMM comes from the following observations: (1) although the first-order HMM (HMM) is efficient in learning, it is inaccurate in modeling; (2) the fixed-length high-order HMM (n-HMM) is accurate in modeling but, because of its huge number of parameters, is not efficient; (3) Various variants to n-HMM, including the Mixture Transition Distribution (MTD) cite, introduces mathematical constraints to simplify n-HMM. Although the simplification reduces model complexity, it also reduces the generality and accuracy of modeling. Unlike these variants that compromise between HMM and n-HMM, VLHMM is both accurate and efficient.

  2. How does VLHMM compare with other models?
  3. Why HMM is inaccurate?

    HMM models probabilistic transitions over a set of, say S, states under the assumption of first-order Markovian dependency, which assumes that with the previous one state known, the current states can be accurately determined via a transition probability distribution function (p.d.f.). However, in most real systems, the current state may depend upon many previous states, instead of only one. So HMM is not able to accurately model sequential data that are highly varied, such as motion capture data, video, and many event logs.

  4. Why n-HMM is accurate?

    n-HMM determines the current state by the context of previous n states. The n-th order Markovian dependency makes n-HMM able to capture rather complex dynamics accurately.

  5. Why HMM is efficient whereas n-HMM is not?

    As the efficiency of a probabilistic model is highly related with the number of its parameters, we show the number of parameters of HMM and n-HMM by representing both of them by a weighted and directed graph.

    Suppose that the current state, illustrated by qt, is determined by the context of previous n=3 states, illustrated by [qt-3,qt-2,qt-1], we have the transition:
    [qt-3,qt-2,qt-1] --> qt
    As the context and the current state form a new context, this transition from the context to the current state can be represented equivalently by one from the context to the new context:
    [qt-3,qt-2,qt-1] --> [qt-3,qt-2,qt-1,qt]
    Because n-HMM fixes the length of context to n, so the oldest one state, qt-3, is forgotten, and the transition becomes:
    [qt-3,qt-2,qt-1] --> [qt-2,qt-1,qt]

    This is why an n-HMM can be represented by a graph over a set of n-contexts, and an HMM, as a specialization of n-HMM with n=1 can be represented by a graph over a set of 1-contexts, that is, states. On each context, there defined an output p.d.f..

    Given the number of states, S, for n-HMM the number of contexts is Sn, whereas for HMM, it is S. Accordingly, for n-HMM, the number of output p.d.f.s and that of the transition probabilities are Sn and Sn+1 respectively, whereas for HMM, they are S and S2 respectively. The exponentially more parameters of n-HMM than HMM requires exponentially more training data and learning time for an unbiased learning, or equivalently, a sufficient statistics. When n is set large enough to capture the complex dynamics, learning the n-HMM usually has become intractable.

  6. What about other high-order HMMs?

    There do have other high-order HMMs, e.g., the Mixture Transition Distribution (MTD), that use less parameters to capture high-order Markovian dependencies. Although these models are more efficient in learning than n-HMM, their "less parameters" are achieved by introducing mathematical constraints to simplify the model. This reduces the generality and accuracy of these models. To our knowledge, all these models compromise between HMM and n-HMM for efficiency and accuracy. (c.f. the figure in Q.2.)

  7. How does VLHMM achieve both accurate modeling and efficient learning?

    From above discussions we can see that: for achieve efficient learning, the model should have a compact parameter set; for accurate modeling, the model should have long contexts to capture high-order Markovian dependency. However, when the lengths of contexts are fixed (as n-HMM), we would be in a dilemma that the longer the contexts the exponentially more model parameters. So we allow the contexts of VLHMM have variable lengths in order to make VLHMM able to learn a minimum set of parameters that accurately models the hidden Markovian process.

    In particular, because most real dynamical systems have their dynamical complexity changing over time, the generated sequences also have complexity changing over time. So we can learn some longer contexts to model part of the training sequence that are highly varied, whereas learn some shorter ones to model parts with regular dynamics . In short, it is not necessary to lengthen all the contexts equally to n when we are pursuing for an accurate modeling of an n-th order hidden Markovian process.


    (a) The context tree of an n-HMM with n=2 and S=3

    (b) The context tree of a VLHMM with variable n and S=3

    The above two figures illustrate an example. Suppose that the number of states, S, is known to be 3, the previous state, denoted by qt-1, would have S=3 choices, and qt-2 has again S=3 choices. As illustrated in figure (a), the contexts can be represented by the terminals of a tree. (The term "terminal" in graph theory, informally, means the shortest path from the root of a tree to a leaf node.)

    What is shown in Figure (a) is the context tree of a 2-HMM, so all contexts have fixed length of n=2 and the context tree is a complete tree. However, it is possible that when qt-1 has a certain value, say 1, there gets no need to know qt-2 and any older states in order to determine the new state qt accurately. In this case, all children of the node qt-1=1 can be removed from the context tree. Such pruning of the tree would significantly reduce the total number of terminals, and equivalently, the number of contexts. In fact, it is provable that the reduced number of contexts is asymptotically exponential with respect to the average number of pruned levels. Comparing figure (b) from (a), the total number of contexts is reduced from Sn=9 to 6, which are:

    r--1 means that qt-1=1 and whatever older states are,
    r--2 qt-1=2 and qt-2=1 or 2
    r--2--3 qt-1=2 and qt-2=3
    r--3--1 qt-1=3 and qt-2=1
    r--3--2 qt-1=3 and qt-2=2
    r--3--3 qt-1=3 and qt-2=3

    Technically, whether a context, say [qt-1=1,qt-2=2,qt-3=3], should be pruned shorter can be judged by evaluating a Kullback-Leibler divergence between two transition p.d.f.s: f(qt)=P(qt|1,2,3) and g(qt)=P(qt|1,2). If the KL-divergence is small, which means that g() has similar form like f(), it would be OK to shorten the context [1,2,3] to [1,2] without losing much accuracy of modeling. Similarly, we can determine whether a context need to be grown longer. This pruning and growing operations provably optimizes a Minimum-Entropy criterion and estimates a minimum context tree cite.

  8. What is the real difficulty of learning VLHMM?

    For HMM and n-HMM, when S is given, the total number of contexts is known, and all the model parameters to be estimated, including the output p.d.f.s and the transition probabilities, are known. In the machine learning theory, it is said that the "model structure is known". Thus the parameters can be estimated by optimizing the Maximum-Likelihood criterion by deriving an EM algorithm. (The EM learning algorithm derived for HMM is called "Baum-Welch algorithm".)

    Whereas, for VLHMM, even when S is given, the number of contexts and thus the model parameters are still unknown. In fact, the contexts are to be learned by optimizing the Minimum-Entropy criterion discussed in Q.7. So learning VLHMM has to optimize two criteria --- the Minimum-Entropy criterion for learning the minimum context set and the Maximum-Likelihood criterion for learning the model parameters.

  9. How does VLHMM optimize both the Minimum-Entropy and the Maximum-Likelihood criteria?

    We derive a structural-EM algorithm cite, which, by optimizing both criteria through its M-step, provably converges to a Maximum Bayesian Information Criterion (BIC) solution, where BIC is an approximation of the KL-divergence between the learned model and the real data distribution, which is usually regarded as the ultimate criterion of effective learning. The BIC penalizes complex model structure (redundant contexts) and encourages accurate estimation of the model parameters. The algorithm is briefly shown as follows:

    1. Initialization:
      - guess contexts by setting up a complete context tree with arbitrary depth,
      - guess model parameters (transition probabilities and output p.d.f.s) randomly.
    2. Looping until BIC converges
      E-step:
      - infer the probability that each element in the training sequence is generated from each context. (similar with the E-step of the Baum-Welch algorithm.)
      M-step:
      - update model parameters by maximizing the log of expected complete data likelihood.
      - update the context set by minimizing the entropy by invoking the context tree pruning/growing algorithm discussed in Q.7.

  10. How to do model selection for VLHMM?

    The learning algorithm presented Q.9 requires knowing the number of states S, which is similar with that the Baum-Welch algorithm requires knowing S before it can learn an HMM or an n-HMM. When the user does not know the exact value of S, it should be determined automatically. This job is known as "model selection" in learning theory.

    A simply method for model selection of VLHMM is similar with those for HMM and n-HMM. Starting from a guessed S=S', the learning algorithm is invoked to estimated the model. Again, for S=S'+1 and S=S'-1, the learning algorithm is also invoked. If the BIC of S=S'+1 is larger than that of S=S', we set S'=S'+1 and repeat the process; if the BIC of S=S'-1 is larger than that of S=S', we set S'=S'+1 and repeat the process; otherwise, S=S' is good enough, we step the process and use S=S'.

  11. What is the difference between VLHMM and VLMM

    Simply, VLMM is an "observable" Markov model and can only model sequences of discrete values; whereas VLHMM is a "hidden" model and can model sequences of discrete/continuous and scalar/vector values. Learning VLMM only needs to optimize the Minimum-Entropy criterion, whereas learning VLHMM needs to optimize both the Minimum-Entropy and the Maximum-Likelihood criteria.

  12. I saw VLMM was used to do the same thing as VLHMM, why?

    In some papers, VLMM was used to learn a sequence of continuous/vector values. This is done by (1) discarding the order of the values in the sequence, (2) clustering the values into a finite number of clusters, (3) substituting each value of the sequence by its cluster index, and (4) learning VLMM from the sequence of cluster indices. However, this method is far from optimal, because the error introduced in the clustering step (usually called quantization or discretization) caused by discarding the order information may be very large and is out of control.

    In fact, learning a hidden Markovian model (HMM, n-HMM, or VLHMM) is itself a clustering process, where each cluster is represented by an output p.d.f. This is similar with the EM-clustering, which is the generalization of most well-known clustering methods such as K-means, in that each cluster is represented by a Gaussian p.d.f.. The difference between learning a hidden Markovian model and EM-clustering is that, the former is a generalization of the latter, where the former considers not only the distances between the values in the sequence, but also their order, or temporal coherence.

    With this fact in mind, we see that the clustering+learning_VLMM approach compared with the learning_VLHMM approach is similar with the EM-clustering or K-means compared with the learning_HMM --- the order information which is critical for sequential data is incorrectly discarded.

References