Decoding

All other applications in the toolkit are developed for one reason: to make decoding possible! That's why the decoder, the heart of the toolkit, is simply called... Shout!

Run ./shout with the output meta-data file of shout_vtln (or of shout_cluster if no VTLN is needed). Below is a short description of the most important parameters. Please run ./shout -h for more help.

Model settings

The decoder needs a language model file (lm), an acoustic model file (amp) and a lexical tree file (dct). All three should be binary files created by the Shout toolkit.

Search settings

The search space of the decoder is restricted using five parameters. If these parameters are not assigned a value, the default values (shown when shout is started with -cc) will be used.

The five search restriction parameters:

  • BEAM (floating point number)
  • STATE_BEAM (floating point number)
  • END_STATE_BEAM (floating point number)
  • HISTOGRAM_STATE_PRUNING (positive number)
  • HISTOGRAM_PRUNING (positive number)
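To illustrate, a configuration fragment setting these five parameters might look like the following. The values shown here are placeholders, not Shout's defaults, and the exact configuration-file syntax may differ; run ./shout -cc to see the actual defaults.

```
BEAM                     300.0
STATE_BEAM               150.0
END_STATE_BEAM           200.0
HISTOGRAM_STATE_PRUNING  100
HISTOGRAM_PRUNING        15000
```

Wider beams and larger histogram limits enlarge the search space (slower, potentially more accurate); tighter values prune more aggressively.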

AM and LM scaling settings

The most likely paths through the jungle of feature vectors are calculated using a language model and acoustic models. The scaling between the two types of models influences the outcome of the trip through this jungle. This scaling is set using three parameters in the formula:

Score(LM_SCALE, TRANS_PENALTY, SIL_PENALTY) = ln(AMSCORE) + LM_SCALE*ln(LMSCORE) + TRANS_PENALTY*NR_WORDS + SIL_PENALTY*NR_SIL

Shout implements an efficient method of incorporating the LM score into the search. This method, Language Model Look-Ahead (LMLA), is switched on by default, but it can be switched off in the configuration file.

  • LM_SCALE (floating point number)
  • TRANS_PENALTY (floating point number)
  • SIL_PENALTY (floating point number)
  • LMLA (1=on, 0=off)
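As a sketch of how the three scaling parameters interact, the scoring formula above can be written out in Python. The default values used here are placeholders for illustration only, not Shout's actual defaults.

```python
import math

def combined_score(am_score, lm_score, nr_words, nr_sil,
                   lm_scale=30.0, trans_penalty=0.0, sil_penalty=0.0):
    """Combine acoustic and language model scores as in the formula above.

    am_score and lm_score are the raw (linear-domain) model scores;
    both are moved to the log domain before scaling. The negative
    penalties typically used in practice discourage inserting extra
    words or silences.
    """
    return (math.log(am_score)
            + lm_scale * math.log(lm_score)
            + trans_penalty * nr_words
            + sil_penalty * nr_sil)
```

Because the penalties multiply the word and silence counts, a more negative TRANS_PENALTY favours hypotheses with fewer words, while LM_SCALE controls how strongly the language model weighs against the acoustic evidence.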

Alignment

You can specify a special background dictionary if you want to perform alignment with OOV marking. To perform alignment instead of ASR, simply set the forced-alignment parameter (see ./shout -h). Make sure to add the utterance to align to the meta-data file, starting with <s> <s>. See the training use-case for more information.

Type of output (text or XML)

  • XML (output will be in XML format)

Lattice output

Shout can generate lattices and output them in PSFG format. Because lattice generation makes decoding a bit slower, it is optional: to enable it, set the lattice parameter to the path of the directory where the lattice files should be written. One lattice is created for each line in the meta-data file, and each lattice file is named after the label of the corresponding audio file (defined in the meta-data file).