Setting up the system for training your acoustic models

Training new acoustic models is done in three steps, which can be repeated to refine the models:

Create a hypothesis (alignment) file in native shout format. Alignments in other formats need to be converted first.
Make a training set using the alignment file.
Run shout_train_am for each phone, then run shout_train_finish to combine all phones into a single file. The resulting acoustic model (AM) can be used in the first step to create a better alignment.

Create an alignment of the training audio

In order to train new models, a set of training phones is needed. These training sets can only be created when a phone-level alignment of the audio and the speech transcription is available. This alignment must be in native shout format.

A shout alignment can be created by calling shout with a meta-data file in the following format:

SPEAKER [label] [VTLN factor] [begin time] [length] <NA> <NA> [SPK ID] <NA> <s> <s> [word based transcription]

Note that this file format is basically the RTTM format from NIST.

The <s> symbol represents silence. The first two symbols in the transcription are not used to align the audio, but only to create a language model history. When two <s> symbols are added, the language model history is reset to 'start of sentence'.
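
As an illustration of the line layout described above, the following minimal Python sketch assembles one meta-data line from its fields. The function name, the parameter names and the example values are hypothetical, and the way the numbers are printed (plain str() conversion) is an assumption; check the result against a meta-data file that shout is known to accept.

```python
def make_metadata_line(label, vtln_factor, begin_time, length, speaker_id, words):
    """Build one meta-data line in the field order described in the text.

    All names here are illustrative; number formatting is an assumption.
    """
    fields = [
        "SPEAKER",
        label,
        str(vtln_factor),
        str(begin_time),
        str(length),
        "<NA>",
        "<NA>",
        speaker_id,
        "<NA>",
        # Two <s> symbols reset the language model history
        # to 'start of sentence'; they are not aligned to audio.
        "<s>",
        "<s>",
    ]
    # The word-based transcription follows the two history symbols.
    fields.extend(words)
    return " ".join(fields)


# Hypothetical example values, for illustration only.
print(make_metadata_line("meeting01", 1.0, 12.5, 3.2, "SPK01", ["hello", "world"]))
```

Writing one such line per segment to a text file gives a meta-data file in the layout shown above.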

Make a training set

Once the alignment file is created, a training directory can be made with the application shout_maketrainset. This application needs a phone-set file, the path to the audio files (all files need to be in the same directory), the alignment file (the HYP file that shout prints to stdout) and the path of the training set directory that will be created.

Alternatively, shout_maketrainset can use a phone-set file and a file with a list of audio-metadata-hyp files.

In order to create acoustic phone models, make sure to set the type to PHONE in shout_maketrainset.

Training the phones

You can train each individual phone with shout_train_am. The training procedure is split up per phone so that you can run the training process in parallel. See shout_train_am -h for more info.
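
Because each phone is trained by a separate call, the calls can be started side by side. The following Python sketch shows one possible way to run a list of commands in parallel and collect their exit codes. The helper run_jobs is hypothetical and not part of shout, and the arguments of shout_train_am are deliberately not shown here because they are not given in this text; take them from shout_train_am -h. The demo at the bottom runs small Python commands as stand-ins for the real training calls.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_jobs(commands, max_workers=4):
    """Run each command (a list of strings) as its own process.

    Returns the exit codes in the same order as the input commands.
    Threads are sufficient here: each one only waits for its process.
    """
    def run(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


if __name__ == "__main__":
    # Stand-in jobs; replace each entry with one shout_train_am call
    # per phone, using the arguments listed by shout_train_am -h.
    demo = [[sys.executable, "-c", "print('job %d done')" % i] for i in range(3)]
    exit_codes = run_jobs(demo)
    print(exit_codes)
```

A non-zero exit code indicates that the corresponding job failed, so it is worth checking all codes before continuing with the next step.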

Once all phones are trained, you can use shout_train_finish to combine the models into one single binary file.

SHoUT

by Marijn Huijbregts
