User manual

SHoUT is written in C++ on a Linux platform. The code should run on other platforms (Windows, Mac) without adjustments, but I do not support porting it. You can build the toolkit by running the script configure-make.sh. The binaries will appear in release/src. The binaries do not need to be installed (by root).

The various applications of the SHoUT toolkit are discussed in the following sections. Each section describes a typical use case. For the precise syntax of all application parameters, run the application with the parameter --help (or -h); a short help text will appear.

Input/output

The SHoUT toolkit does not read the various audio and video file formats itself. You will need to convert your multimedia files into raw audio files yourself. Input must be headerless raw audio: mono, 16 kHz, 16-bit, little endian.
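A general-purpose conversion tool such as sox or ffmpeg can produce such files. To make the expected format concrete, the sketch below (not part of SHoUT) reads such a raw file and prints its duration; the file name speech.raw and the buffer size are only examples, and the code assumes a little-endian host so the samples can be read directly into 16-bit integers.

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  int main() {
      const double sampleRate = 16000.0;   // 16 kHz, mono, 16-bit little endian
      std::FILE *f = std::fopen("speech.raw", "rb");
      if (!f) { std::perror("speech.raw"); return 1; }

      std::vector<int16_t> buffer(4096);
      std::size_t total = 0;
      std::size_t n;
      while ((n = std::fread(buffer.data(), sizeof(int16_t), buffer.size(), f)) > 0) {
          total += n;                       // every int16_t is one mono sample
      }
      std::fclose(f);

      std::printf("%zu samples = %.2f seconds of audio\n", total, total / sampleRate);
      return 0;
  }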

Each application will use a 'meta-data' file for input and/or output. On each line of this file, an audio segment is specified. What is done with these segments depends on the application that uses the file. The format of each line of a meta-data file is as follows:

  • SPEAKER [label] [VTLN factor] [begin time] [length] <NA> <NA> [SPK ID] <NA> [<s> <s> followed by a word-based transcription]

  • label - the identifier of the file.
  • VTLN factor - the factor calculated by shout_vtln and used for VTLN.
  • begin time - begin time of the audio segment.
  • length - length of the segment.
  • <NA> - not applicable (kept for compatibility with NIST RTTM files).
  • SPK ID - label of the segment (SPEECH/SIL or a speaker ID).

I use these meta-data files so that the audio files do not need to be cut up into actual segments. Each application simply reads the audio file and uses only the parts of it that are defined in the meta-data file.
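To illustrate the field order, the following sketch (again not part of the toolkit) parses a single meta-data line. The example line, its label, speaker ID and numeric values are hypothetical placeholders; only the order of the fields follows the description above.

  #include <iostream>
  #include <sstream>
  #include <string>

  int main() {
      // Hypothetical example line; the values are placeholders only.
      std::string line =
          "SPEAKER file01 1.00 10.00 5.32 <NA> <NA> SPK01 <NA> <s> <s> hello world";

      std::istringstream in(line);
      std::string type, label, na1, na2, spkId, na3;
      double vtlnFactor, beginTime, length;

      in >> type >> label >> vtlnFactor >> beginTime >> length
         >> na1 >> na2 >> spkId >> na3;

      std::string transcription;            // the remainder, starting with "<s> <s>"
      std::getline(in, transcription);

      std::cout << "segment of " << spkId << ": begin " << beginTime
                << ", length " << length << ", transcription:" << transcription << "\n";
      return 0;
  }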

Typical use cases

For my work on spoken document retrieval I run the system as depicted in the figure below. First, I run Speech Activity Detection (SAD) to find the segments of the audio that contain speech. Next, I run diarization to determine 'who spoke when'. Third, I run Vocal Tract Length Normalization (VTLN) to normalize the speech of each speaker for his or her vocal tract length. After this I perform the actual decoding. Before decoding is possible, however, acoustic models, a dictionary and a language model need to be created. After the first decoding iteration it is possible to perform acoustic model adaptation and run a second decoding iteration. All these steps are explained in the use cases below.

[Figure: system-overview-small.png (system overview)]