// doc/tree_externals.dox
// Copyright 2009-2012 Microsoft Corporation Johns Hopkins University (Author: Daniel Povey)
// See ../../COPYING for clarification regarding multiple authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABLITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.
namespace kaldi {
/**
\page tree_externals How decision trees are used in Kaldi
\section tree_intro Introduction
This page gives an overview of how phonetic decision trees are built and used
in Kaldi and how this interacts with training and graph-building. For a
description of the internals, of the tree-building code, see \ref
tree_internals; for more details on our approach to building decoding graphs, see \ref
graph.
The basic algorithm that is being implemented is a top-down greedy splitting, where we have a number of
ways we can split the data by asking about, say, the left phone, the right
phone, the central phone, the state we're in, and so on.
The algorithm we implement is similar to the standard algorithm,
see for example the paper "Tree-based State Tying for High Accuracy Acoustic Modeling" by
Young, Odell and Woodland. In this algorithm, we split the data up by asking the locally
optimal question, i.e. the one that gives the most likelihood increase, supposing
we model the data on each side of the split by a single Gaussian.
Differences from standard implementations include added flexibility
about how to configure the tree roots; the ability to ask questions about the HMM-state and
the central phone; and the fact that by default in the Kaldi scripts, the questions
are automatically generated by a top-down binary clustering of the data, which means
that it is not necessary to supply hand-generated questions.
Regarding the configuration of the tree roots: it's possible to
have all the statistics from all the phones in a single shared group that gets
split (including questions about the central phone and the HMM-states), or to have individual phones,
or HMM-states of phones, be the tree roots that get split, or to have groups of phones
be the tree roots. For how to configure it using the standard scripts, see
\ref data_prep. In practice we generally let each tree-root correspond to a "real phone", meaning
that we group together all word-position-dependent, tone-dependent or stress-dependent versions of
each phone into one group that becomes a tree root.
The rest of this page mostly gives details at the code level of what is happening.
\section tree_window Phonetic context windows
Here we explain how we describe phonetic context in our code.
A particular tree will have two integer values that describe the
width and "central position" of the context window. The table
below summarizes these values:
Name in code | Name in command-line arguments | Value (triphone) | Value (monophone) |
N | --context-width=? | 3 | 1 |
P | --central-position=? | 1 | 0 |
N is the width of the context window and P is the identity of the designated
"central phone". Normally P is exactly in the middle of the window
(hence the name "central-position"); for example, with N=3, we would normally
have P=1, but you are free to choose any value from 0 to N-1; for instance, P=2 and
N=3 means two phones of left context and no right context at all.
In the code, when we talk about the "central phone" we always mean the P'th
phone which may or may not actually be the central phone of the context window.
A vector of integers representing a typical triphone context window might be:
\code
// probably not valid C++
vector ctx_window = { 12, 15, 21 };
\endcode
Assuming N=3 and P=1, this would represent phone 15 with
a right context of 21 and a left context of 12. The way we handle end
effects is using zero (which is not a valid phone because it's reserved in
OpenFst for the epsilon meaning "no symbol"), so for instance:
\code
vector ctx_window = { 12, 15, 0 };
\endcode
means phone 15 with a left-context of 12 and no right-context because it's the
end of the utterance. At the end of utterance in particular, the use of zero
this way may be a little unexpected because the last "phone" is actually the
subsequential symbol "$" (see \ref graph_c), but for the convenience
of the decision-tree code we don't
put the subsequential symbol in these context windows, we put zero. Note
that if we had N=3 and P=2, the above context window would be invalid because
its P'th element would be zero which is not a real phone; also of course,
if we had a tree with N=1, neither of the windows above would be valid because they
are the wrong size. In the monophone case, we would have a window like:
\code
vector ctx_window = { 15 };
\endcode
so monophone systems are just treated as a special case of context-dependent
systems, with a window size N of 1 and a tree that doesn't do anything very
interesting.
\section tree_building The tree building process
In this section we give an overview of the tree-building process in Kaldi.
Even a monophone system has a decision tree, but a trivial one; see the functions
MonophoneContextDependency() and MonophoneContextDependencyShared() which return
such trivial trees. These are called by the command-line program \ref
gmm-init-mono.cc "gmm-init-mono"; its main input is the HmmTopology object and it
outputs the tree, normally written as an object of type ContextDependency to a
file called "tree", and also the model file (the model file contains a
TransitionModel object and an AmDiagGmm object). If the program gmm-init-mono
receives an option called --shared-phones, it will share the pdfs between
specified sets of phones; otherwise it makes all the phones separate.
After training a monophone system starting from a flat start, we take
the monophone alignments
and use the function AccumulateTreeStats() (called from \ref acc-tree-stats.cc
"acc-tree-stats") to accumulate statistics for training the tree. This program is
not limited to reading in monophone alignments; it works from context-dependent
alignments too so we can build trees based on e.g. triphone alignments.
The statistics for tree building are written to disk as the type \ref BuildTreeStatsType
(see \ref treei_stats).
The function AccumulateTreeStats() takes the values N and P, as explained in the
previous section; the command-line programs will set these by default to 3 and
1 respectively, but this can be overridden using the --context-width
and --central-position options. The program \ref acc-tree-stats.cc
"acc-tree-stats" takes a list of context-independent phones (e.g. silence), but this is
not required even if there are context-independent phones; it is just
a mechanism to reduce the size of the statistics.
For context-independent hones, the program will accumulate the
corresponding statistics without the keys corresponding to the left and right phones defined
(c.f. \ref treei_event_map).
When the statistics have been
accumulated we use the program \ref build-tree.cc "build-tree" to
build the tree. This outputs the tree.
The program \ref build-tree.cc "build-tree" requires three things:
- The statistics (of type BuildTreeStatsType)
- The questions config (of type Questions)
- The roots file (see below)
The statistics would typically come from the program acc-tree-stats;
the questions configuration class would be output by the compile-questions
program, which takes in a topology list of phonetic questions (in our
scripts, these are automatically obtained from tree-building statistics
by the program cluster-phones. The roots file specifies sets of phones
that are goint to have shared roots in the decision-tree clustering process, and says
for each phone set the following two things:
- "shared" or "not-shared" says whether or not there should be separate
roots for each of the \ref pdf_class "pdf-classes" (i.e. HMM-states,
in the typical case), or if the roots
should be shared. If we are going to be splitting (the "split" option
below) we enforce that the roots should be shared.
- "split" or "not-split" says whether or not the decision tree splitting
should actually be done for the roots in question (for silence, we
typically don't split).
Be careful because the notation is a bit tricky. The "shared" on the line of
the roots file is about whether we will share all the 3 HMM-states of the phone
in a single tree root. But we will always share together the roots of all the phones that
appear on a single lines of the roots file. This is not configurable via these
strings because if you don't want to share them, you can just put them on
separate lines of the roots file.
Below is an example of a roots file; this assumes that phone 1 is silence
and all the other phones have separate roots.
\verbatim
not-shared not-split 1
shared split 2
shared split 3
...
shared split 28
\endverbatim
Having multiple phones on the same line is most useful when we have things like position and
stress-dependent phones; in this case each "real" phone would correspond
to a set of integer phone ids. In that case we share the roots for all
versions of a particular underlying phone. Below is an example of a roots file
for Wall Street Journal, from the egs/wsj/s5 scripts (this is in text, not integer form;
it would have to be converted to integer form before being read by Kalid):
\verbatim
not-shared not-split SIL SIL_B SIL_E SIL_I SIL_S SPN SPN_B SPN_E SPN_I SPN_S NSN NSN_B NSN_E NSN_I NSN_S
shared split AA_B AA_E AA_I AA_S AA0_B AA0_E AA0_I AA0_S AA1_B AA1_E AA1_I AA1_S AA2_B AA2_E AA2_I AA2_S
shared split AE_B AE_E AE_I AE_S AE0_B AE0_E AE0_I AE0_S AE1_B AE1_E AE1_I AE1_S AE2_B AE2_E AE2_I AE2_S
shared split AH_B AH_E AH_I AH_S AH0_B AH0_E AH0_I AH0_S AH1_B AH1_E AH1_I AH1_S AH2_B AH2_E AH2_I AH2_S
shared split AO_B AO_E AO_I AO_S AO0_B AO0_E AO0_I AO0_S AO1_B AO1_E AO1_I AO1_S AO2_B AO2_E AO2_I AO2_S
shared split AW_B AW_E AW_I AW_S AW0_B AW0_E AW0_I AW0_S AW1_B AW1_E AW1_I AW1_S AW2_B AW2_E AW2_I AW2_S
shared split AY_B AY_E AY_I AY_S AY0_B AY0_E AY0_I AY0_S AY1_B AY1_E AY1_I AY1_S AY2_B AY2_E AY2_I AY2_S
shared split B_B B_E B_I B_S
shared split CH_B CH_E CH_I CH_S
shared split D_B D_E D_I D_S
\endverbatim
When creating the roots file, you should ensure that at least one phone on each line is seen.
For instance, in this case, if the phone AY was seen in at least some combination of stress and
word-position, we would be OK.
In this example, we have various word-position-dependent variants of silence and so on.
In this example they will all share their pdf's because they are on the same line and are
"not-split"-- but they may have different transition parameters. In fact, most of these
variants of silence would never be used as silence never appears inside words; this is for
future-proofing the setup in case someone does something weird in future.
We do the initial phases of Gaussian mixing up using the alignments from
the previous (e.g. monophone) build; the alignments are converted from one
tree to another using the program \ref convert-alignments.cc "convert-alignments".
\section pdf_id PDF identifiers
The PDF identifier (pdf-id) is a number, starting from zero, that is used as an index
for the probability distribution function (p.d.f.). Each p.d.f. in the system has its own
pdf-id, and these are contiguous (typically there are several thousand of these in an LVCSR
system). They are originally assigned when the tree is first built. Depending
how the tree is built, it may or may not be possible to say, for each pdf-id, which phone
it corresponds to.
\section tree_ctxdep Context dependency objects
The ContextDependencyInterface object is a virtual base-class for the
tree that specifies how it interacts with the graph-building code. This
interface contains only four functions:
- \ref ContextDependencyInterface::ContextWidth() "ContextWidth()" returns
the value of N (context-width) that the tree requires.
- \ref ContextDependencyInterface::CentralPosition() "CentralPosition()" returns
the value of P (central-position) that the tree requires.
- \ref ContextDependencyInterface::NumPdfs() "NumPdfs()" returns the number
of pdfs defined by the tree; these are numbered from zero to NumPdfs()-1.
- \ref ContextDependencyInterface::Compute() "Compute()" is the function
that computes the \ref pdf_id "pdf-id" for a particular context
(and \ref pdf_class "pdf-class").
The \ref ContextDependencyInterface::Compute() Compute() function is declared as follows:
\code
class ContextDependencyInterface {
...
virtual bool Compute(const std::vector &phoneseq, int32 pdf_class,
int32 *pdf_id) const;
}
\endcode
It returns true if it was able to compute the pdf-id for this context and
\ref pdf_class "pdf-class". A return value of false will indicate some kind
of error or mismatch. An example of using this function is:
\code
ContextDependencyInterface *ctx_dep = ... ;
vector ctx_window = { 12, 15, 21 }; // not valid C++
int32 pdf_class = 1; // probably central state of 3-state HMM.
int32 pdf_id;
if(!ctx_dep->Compute(ctx_window, pdf_class, &pdf_id))
KALDI_ERR << "Something went wrong!"
else
KALDI_LOG << "Got pdf-id, it is " << pdf_id;
\endcode
The only class that currently inherits from ContextDependencyInterface
is the class ContextDependency, which has marginally richer interface;
the only important addition is the function \ref ContextDependency::GetPdfInfo
"GetPdfInfo" which is used by the TransitionModel class to work out which
phones a particular pdf can possibly correspond to (this function could
be emulated given only the interface of ContextDependencyInterface, by
enumerating all contexts).
The ContextDependency object is actually a fairly thin wrapper for the
EventMap object; see \ref tree_internals. We wanted to hide
the actual implementation of the tree as much as possible to make it
easy to refactor the code later if needed.
\section tree_example An example of a decision tree
The decision-tree file format was not created with human readability as the first priority,
but due to popular demand we will try to explain how to interpret this file.
Look at the example below this, which is a triphone tree from the Wall Street Journal recipe.
It starts with ContextDependency which is the name of the object; then N (the context-width),
which is 3; then P (the "central position" of the context window), which is 1, i.e. the
center of the phone context positions since we are in zero-based numbering. The rest of the file
contains a single EventMap object. EventMap is a polymorphic type which may
contain pointers to other EventMap objects. See \ref treei_event_map for more details;
it is a representation of a decision tree or set of decision trees, that maps from
a set of key-value pairs (e.g. left-phone=5, central-phone=10, right-phone=11, pdf-class=2)
to a pdf-id (e.g. 158). Briefly, it comes in three types: a SplitEventMap (like a split in a decision tree),
a ConstantEventMap (like a leaf of a decision tree, containing just a number representing
a pdf-id), and a TableEventMap (which is like a table lookup containing other EventMaps). The
SplitEventMap and TableEventMap both have a "key" that they query, which in this case
would be 0, 1 or 2 corresponding to left, central or right context, or -1 corresponding
to the identity of the "pdf-class". Normally the value of the pdf-class
is the same as the index of the HMM state, i.e. 0, 1 or 2. Try not to get confused by
this: the key is -1, but the value is 0, 1 or 2, and this has no connnection to the 0, 1 or 2
which are the keys of the phones in the context-window.
The SplitEventMap has a set of values that will trigger the "yes" branch of the tree.
Below is a kind of quasi-BNF notation that explains the tree-file format.
\verbatim
EventMap := ConstantEventMap | SplitEventMap | TableEventMap | "NULL"
ConstantEventMap := "CE"
SplitEventMap := "SE" "[" yes-value-list "]" "{" EventMap EventMap "}"
TableEventMap := "TE" "(" EventMapList ")"
\endverbatim
In the example below, the top-level EventMap of the tree is a SplitEventMap (SE) that
splits on key 1, which is the central phone. In square brackets are a contiguous range
of phone-ids. As it happens, these don't represent a question, but are just a way of
splitting on phones so we can get to the "real" decision trees which are per phone.
The issue is that this tree was built with "shared roots", so there are various phone-ids,
corresponding to different word-position-and-stress-marked versions of the same phone,
that share the root. We can't use a TableEventMap (TE) at the top level of the tree,
or we'd have to repeat each decision tree several times (since the EventMap is a pure
tree, not a general graph, it has no mechanism for pointers to be "shared").
The next few instances of the "SE" label are also part of this "quasi-tree" which
is initially splitting on the central phone (as we go down this file we are going
deeper into the tree; notice that the braces "{" are opening but not yet closing).
Then we have the string
"TE -1 5 ( CE 0 CE 1 CE 2 CE 3 CE 4 )", which represents splitting with a TableEventMap
on the pdf-class "-1" (effectively, the HMM-position), and returning values 0 through 4.
The values represent the five pdf-ids
for the silence and noise phones SIL, NSN and SPN; in our setup, the pdfs are shared between these
three non-speech phones (only the transition matrix is specific to each non-speech phone).
Note: we have a 5-state rather than 3-state HMM for
these phones, hence 5 different pdf-ids. Next is "SE -1 [ 0 ]"; and this can be considered
the first "real" question in the tree. We can see from the SE questions above it that it
applies when the central phone takes values 5 through 19, which are
various versions of the phone AA; and question is asking whether the pdf-class (key -1)
has value 0 (i.e. the leftmost HMM-state). Assuming the answer is "yes", the next question
is "SE 2 [ 220 221 222 223 ]", which is asking whether the phone to the right is one of various
forms of the phone "M" (a rather unintuitive question to ask, since we're
in the leftmost HMM-state); if yes, we ask "SE 0 [ 104 105 106 107... 286 287 ]" which is
a question about the phone to the right; if yes, then the pdf-id is 5 ("CE 5") and if
no, 696 ("CE 696").
\verbatim
s3# copy-tree --binary=false exp/tri1/tree - 2>/dev/null | head -100
ContextDependency 3 1 ToPdf SE 1 [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 \
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59\
60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 9\
3 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 1\
20 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 14\
5 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170\
171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 \
196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 ]
{ SE 1 [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34\
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 6\
8 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 10\
1 102 103 104 105 106 107 108 109 110 111 ]
{ SE 1 [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34\
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 ]
{ SE 1 [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ]
{ SE 1 [ 1 2 3 ]
{ TE -1 5 ( CE 0 CE 1 CE 2 CE 3 CE 4 )
SE -1 [ 0 ]
{ SE 2 [ 220 221 222 223 ]
{ SE 0 [ 104 105 106 107 112 113 114 115 172 173 174 175 208 209 210 211 212 213 214 215 264 265 266 \
267 280 281 282 283 284 285 286 287 ]
{ CE 5 CE 696 }
SE 2 [ 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 132 \
133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 248 249 250 251 252 253 254 255 256 257 2\
58 259 260 261 262 263 268 269 270 271 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 30\
3 ]
\endverbatim
Below is a simpler example: the monophone tree from the Resource Management
recipe. The top-level EventMap is a TableEventMap ("TE 0 49 ...").
The key "0" is the phone-position of zero which represents the central (and only) phone
since the context width (N) is 1. The number of entries in the table is 49
(in this case, the number of phones plus one). The
first EventMap in the table (index zero) is NULL, because there is no phone with
index zero. The next one is a TableEventMap with three elements, corresponding
to the three HMM-states (technically, pdf-classes) of the first phone: "TE -1 3 ( CE 0 CE 1 CE 2 )".
\verbatim
s3# copy-tree --binary=false exp/mono/tree - 2>/dev/null| head -5
ContextDependency 1 0 ToPdf TE 0 49 ( NULL TE -1 3 ( CE 0 CE 1 CE 2 )
TE -1 3 ( CE 3 CE 4 CE 5 )
TE -1 3 ( CE 6 CE 7 CE 8 )
TE -1 3 ( CE 9 CE 10 CE 11 )
TE -1 3 ( CE 12 CE 13 CE 14 )
\endverbatim
\section tree_ilabel The ilabel_info object
The CLG graph (see \ref graph) has symbols
on its input side that represent context-dependent phones (as well as
disambiguation symbols and possibly epsilon symbols). In the graph, as always,
these are represented by integer labels. We use an object that, in code
and in filenames, is generally called ilabel_info. The ilabel_info object
has a strong connection to the \ref fst::ContextFst "ContextFst" objects, see \ref graph_context.
As with many other Kaldi types, ilabel_info is a generic (STL) type but
we use a consistent variable name
to make it identifiable. It is of the following type:
\code
std::vector > ilabel_info;
\endcode
It is a vector, indexed by the FST input label, that gives for each
input label the corresponding phonetic context window (see above,
\ref tree_window). For example, suppose symbol 1500 is phone
30 with a right-context of 12 and a left-context of 4, we would
have
\code
// not valid C++
ilabel_info[1500] == { 4, 30, 12 };
\endcode
In the monophone case, we would have things like:
\code
ilabel_info[30] == { 28 };
\endcode
There are special cases to deal with disambiguation symbols (see
\ref graph_disambig or the
Springer Handbook paper referenced above for an explanation of what these
are). If an ilabel_info entry corresponds to a disambiguation symbol,
we put in it the negative of the symbol-table entry of the disambiguation
symbol (note that this is not the same as the number of the printed form
of the disambiguation symbol as in #0, #1, #2 etc., it is the number
corresponding to it in a symbol-table file, which in our current scripts is
called phones_disambig.txt). For example,
\code
ilabel_info[5] == { -42 };
\endcode
would mean that symbol number 5 on HCLG corresponds to the disambiguation
symbol whose integer id is 42. We negate these for scripting convenience,
so the programs that interpret the ilabel_info object don't need to be
given a list of disambiguation symbols in order to be able to distinguish them from
real phones in the monophone case. There are two additional special cases:
we have
\code
ilabel_info[0] == { }; // epsilon
ilabel_info[1] == { 0 }; // disambig symbol #-1;
// we use symbol 1, but don't consider this hardwired.
\endcode
The first of these is the normal epsilon symbol, and we give it an empty vector
as its ilabel_info entry. This symbol would not normally appear on the left of
CLG. The second is a special disambiguation symbol that in
printed form is called "#-1". We use it where epsilons are used on the input of
the C transducer in the normal (Springer Handbook) recipe; it is needed to
ensure determinizability of CLG in the presence of words with empty
phonetic representations.
The program \ref fstmakecontextsyms.cc "fstmakecontextsyms" is able to create
a symbol-table that corresponds to the printed form of the ilabel_info object;
this is mainly of use for debugging and diagnostics.
As you can see, the ilabel_info object is not very pleasant as it relates to
disambiguation symbols, but there are not many parts of the code that interact
closely with it: only the \ref ContextFst class (and related things; see \ref
context_fst_group), the program fstmakecontextsyms.cc, and some functions listed
in \ref hmm_group_graph). The ContextDependency object, in particular, only sees
valid sequences of length N that represent phone context windows.
*/
}