/*************************************************************************** * This file is part of the 'Shout LVCS Recognition toolkit'. * *************************************************************************** * Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010 by Marijn Huijbregts * * * * This program is free software; you can redistribute it and/or modify * * it under the terms of the GNU General Public License as published by * * the Free Software Foundation; version 2 of the License. * * * * This program is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with this program; if not, write to the * * Free Software Foundation, Inc., * * 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. * ***************************************************************************/ // gmake -f Makefile.cvs // mkdir optimized // cd optimized // CXXFLAGS="-O3 -funroll-loops -march=pentium4 -malign-double -mfpmath=sse -msse -msse2" ../configure // gmake -j1 ///////////////////////////////////////////////////////////////////////////////////////////////////// /// \mainpage Large Vocabulary Continuous Speech Recognition /// /// During my PhD research I have developed a large vocabulary continuous speech recognition toolkit that I named SHoUT. /// SHoUT is a Dutch acronym for: 'Speech Recognition Research at the University of Twente' (I could have called the toolkit 'SRRatUoT', but 'SHoUT' just sounds better). /// I think (read: 'hope') that the toolkit is now mature enough to be used by other researchers. /// You can download the latest release from the \ref download "download page". All comments and suggestions are welcome! /// For information on how to use the toolkit, have a look at the \ref user_manual "manual" (for users) and /// at the \ref programmers_manual "API reference" (for programmers). For background information (architecture, algorithms etc.) please read my thesis. /// ///////////////////////////////////////////////////////////////////////////////////////////////////// ///////////////////////////////////////////////////////////////////////////////////////////////////// /// \page contact Contact /// /// You can reach me for questions about SHoUT at: marijn.huijbregts@gmail.com /// ///////////////////////////////////////////////////////////////////////////////////////////////////// ///////////////////////////////////////////////////////////////////////////////////////////////////// /// \page download Download /// /// Shout is a research tool that I have mainly written for myself. It is *not* a commercial tool that you can use out of the box! /// If you are not a researcher on ASR, this tool might not be very useful to you. Having said that... Have fun! /// ///
///
/// \section version_0_3 Version 0.3 released on 01/12/2010
/// Source and Linux binaries tarball
///
///
///
/// \section version_0_2 Version 0.2 released on 01/12/2008
/// Source and Linux binaries tarball
///
///
///
/// \section version_0_1 Version 0.1 released on 06/11/2007
/// Source and Linux binaries tarball
/////////////////////////////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////////////////////////////
/// \page programmers_manual Reference guide
/// This is your starting point for programming in or with the toolkit. The class reference can be found in the menu above.
///
/// I use the great application "kdevelop" to edit and build my source code, but if
/// you want to build the source code manually, this is how kdevelop does it. Start in the main shout directory and type:
/// - gmake -f Makefile.cvs
/// - mkdir optimized
/// - cd optimized
/// - CXXFLAGS="-O3 -funroll-loops -march=pentium4 -malign-double -mfpmath=sse -msse -msse2" ../configure
/// - gmake -j1
///
///
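/// Note that these CXXFLAGS are tuned for an old Pentium 4 machine. On a more recent compiler and CPU you may
/// (this is just a suggestion, not part of the official build scripts) let gcc pick the target architecture itself, for example:
/// - CXXFLAGS="-O3 -funroll-loops -march=native" ../configure
///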
///
/// \section building_blocks The building blocks of Shout
/// This software package contains multiple applications. A short description of each application is given below. If you just want
/// to use the toolkit and do not intend to do any development yourself, it is better to read the \ref user_manual "user manual".
///
///
///
/// \section app_shout shout
/// The heart of the software package, the decoder, is called shout. This is where all the models come together... During the early
/// days of development the decoder was called 'whisper'. Unfortunately, another decoder with that name already existed.
/// I have changed the decoder name, but the main class that handles the top-level recognition is still called Whisper.
/// Whisper will load all needed models. After that, most work is done by the LexicalTree class.
///
///
/// /// \section app_shout_adapt_am shout_adapt_am /// This application reads an acoustic model file and a training/adapting phone directory and creates a new acoustic model file /// that is adapted to the training data using the Structured Maximum a Posteriori Linear Regression (SMAPLR) method. The main class /// of shout_adapt_am is Adapt_AM. /// ///
/// /// \section app_shout_cluster shout_cluster /// This is the speaker diarization application. See Shout_Cluster. /// ///
/// /// \section app_shout_dct2lextree shout_dct2lextree /// This application translates a dictionary file (text file containing rows of: word - tabs or spaces - phone pronunciation) into a /// lexical Prefix Tree, also called a Pronunciation Prefix Tree (PPT), suitable for the decoder to read. The main class of /// shout_dct2lextree is Shout_dct2lextree. /// ///
/// /// \section app_shout_lm2bin shout_lm2bin /// Shout can handle uni-, bi-, tri- and four-gram ARPA language models (depending on how the distribution is compiled). /// The application shout_lm2bin will read an ARPA LM and translate it to a binary format suitable /// for the decoder. The main class of shout_lm2bin is Shout_lm2bin. /// ///
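/// The ARPA format mentioned here is the standard plain-text n-gram format (log10 probabilities, optional back-off
/// weights). A heavily truncated, purely illustrative fragment looks like this:
/// \verbatim
/// \data\
/// ngram 1=3
/// ngram 2=2
///
/// \1-grams:
/// -0.8  <s>    -0.4
/// -1.1  hello  -0.3
/// -1.2  world
///
/// \2-grams:
/// -0.5  <s> hello
/// -0.7  hello world
///
/// \end\
/// \endverbatim
///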
///
/// \section app_shout_maketrainset shout_maketrainset
/// Shout_maketrainset will read a hypothesis file (the output of Shout in native format) or a Master Label File (MLF) and will
/// store all phones in the training directory. This data directory
/// is later used by the training application (shout_train_master). The main class of shout_maketrainset is Shout_MakeTrainSet.
/// It is possible to create a training directory for either ASR models, SAD models, diarization models or VTLN models.
///
///
///
/// \section app_shout_merge_am shout_merge_am
/// If you do not want to re-train all phones, this application lets you pick phone models from two AM files and store
/// them in a new AM file. See ShoutMergeAm.
///
///
/// /// \section app_shout_segment shout_segment /// This is the speech/non-speech detector. The main class is ShoutSegment. /// ///
///
/// \section app_shout_vtln shout_vtln
/// Determines the VTLN warping factor. The main class of shout_vtln is Shout_VTLN.
///
/////////////////////////////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////////////////////////////
/// \page user_manual User manual
///
/// SHoUT is written in C++ on a Linux platform. The code should run on other platforms (Windows, Mac) without adjustments,
/// but I do not support porting it. You can build the toolkit by running the script configure-make.sh. The binaries will appear in release/src.
/// Installation (by root) of the binaries is not needed.
///
/// The various applications of the Shout toolkit will be discussed in the following sections. Each section
/// describes a typical use case. For the precise syntax of all application parameters, just run the
/// applications with the parameter --help (or -h). A short help text will appear.
///
///
///
/// \section input_output Input/output
/// The shout toolkit does not support the various audio and video file formats itself. You will need to convert your multimedia files into raw audio files
/// yourself. Input should be headerless raw audio: mono, 16 kHz, 16 bit, little endian.
///
/// Each application will use a 'meta-data' file for input and/or output. Each line of this file specifies an audio segment. What is done with these segments depends on the application that uses the file. The format of each line of a meta-data file is as follows:
/// - SPEAKER [label] [VTLN factor] [begin time] [length] \<NA\> \<NA\> [SPK ID] \<NA\> [\<s\> \<s\> followed by a word based transcription]
///
/// - label - the identifier of the file.
/// - VTLN factor - the warping factor calculated by shout_vtln, used for VTLN.
/// - begin time - begin time of the audio segment.
/// - length - length of the segment.
/// - \<NA\> - not applicable (for compatibility with NIST RTTM files).
/// - SPK ID - label of the segment (can be SPEECH/SIL or a speaker ID).
///
/// I use these meta-data files so that the audio files do not need to be cut up into actual segments. Each application simply reads the audio file and uses the parts of it that are defined in the meta-data file.
///
///
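/// To make this concrete, here is a hypothetical example line (the label, VTLN factor, times and speaker ID are made up
/// purely for illustration):
/// \verbatim
/// SPEAKER meeting1 1.05 12.34 7.80 <NA> <NA> SPK01 <NA>
/// \endverbatim
/// This tells an application to process the part of audio file 'meeting1' starting at 12.34 with length 7.80, to warp its
/// features with VTLN factor 1.05, and to treat it as belonging to speaker SPK01.
///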
///
/// \section use_cases Typical use-cases
/// For my work on spoken document retrieval I run the system as depicted in the figure below. First, I run
/// Speech Activity Detection (SAD) to find the segments in the audio that contain speech. Next, I run diarization
/// to determine 'who spoke when?'. Third, I run Vocal Tract Length Normalization (VTLN) to normalize the speech of
/// each speaker for his/her vocal tract length. After this I perform the actual decoding. But before decoding is possible,
/// acoustic models, a dictionary and a language model first need to be created. After the first decoding iteration it is
/// possible to perform acoustic model adaptation and run a second decoding iteration. All these steps are explained in the use cases below.
///
///
/// - \ref use_case_sad
/// - \ref use_case_diarization
/// - \ref use_case_vtln
/// - \ref use_case_dct_lm
/// - \ref use_case_training
/// - \ref use_case_recognition
/// - \ref use_case_adapt
/////////////////////////////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////////////////////////////
/// \page use_case_dct_lm Preparing your dictionary and language model for Shout
///
/// For decoding, three binary files are needed: a lexical tree file, a language model file and an
/// acoustic model file. The acoustic models need to be trained by Shout. The language model
/// and lexical tree are created using the applications shout_dct2lextree and shout_lm2bin.
///
/// \section preparing_dictionary Preparing a dictionary
/// The application shout_dct2lextree needs two input files and will output a binary 'lexical tree'
/// file. The input phone list file contains the complete list of phone and non-speech model names. The first two lines of this file define
/// the total number of phone and non-speech models. It uses the following syntax:
/// - "Number of phones:" [number of phone models]
/// - "Number of SIL's:" [number of non-speech models]
/// - One non-speech model name per line (times the specified number of non-speech models)
/// - One phone model name per line (times the specified number of phone models)
///
/// The pronunciation dictionary contains one word per line, followed by a sequence of phone or non-speech model names
/// separated by one or more spaces. Make sure that the first two lines in your DCT are as follows:
/// - \<s\> SIL
/// - \</s\> SIL
///
///
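/// As a purely illustrative example (the counts, phone names and words below are made up), a phone list file with one
/// non-speech model and three phone models could look like this:
/// \verbatim
/// Number of phones: 3
/// Number of SIL's: 1
/// SIL
/// aa
/// b
/// k
/// \endverbatim
/// and a matching pronunciation dictionary could start like this:
/// \verbatim
/// <s>    SIL
/// </s>   SIL
/// aback  aa b aa k
/// cab    k aa b
/// \endverbatim
///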
///
/// \section preparing_lm Preparing a language model
/// The application 'shout_lm2bin' needs a lexical tree file (the output file of shout_dct2lextree)
/// and an ARPA language model. It will create a binary language model file suitable for shout.
/// Currently, unigram, bigram, trigram and four-gram language models are supported.
/// At compile time the system can be optimized by setting a maximum depth (from bi- to four-grams) with the
/// parameter LM_NGRAM_DEPTH in standard.h. The default setting of LM_NGRAM_DEPTH is for trigrams.
/////////////////////////////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////////////////////////////
/// \page use_case_training Setting up the system for training your acoustic models
///
/// Training new acoustic models is done by repeating three steps (a sketch of the loop is given after this list):
/// - Create a hypothesis file in native shout format. Alignments in other formats need to be re-formatted.
/// - Make a training set using the alignment file.
/// - Run shout_train_am for each phone and run shout_train_finish to combine all phones into a single file.
///   The resulting AM can be used in the first step to create a better alignment.
///
///
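/// Purely as an illustration of how these three steps chain together, here is a sketch in shell-like pseudo-commands.
/// The parenthesised arguments only name the inputs described in the sections below; they are not the real command-line
/// syntax of the tools (which may use flags or a different argument order), so take the actual syntax from each tool's -h output.
/// \verbatim
/// # 1) forced alignment with the current acoustic model; shout prints the HYP file to stdout
/// shout               ...(meta-data file with transcriptions, AM, dct, LM)...  > train.hyp
/// # 2) build a phone training set from that alignment
/// shout_maketrainset  ...(phone-set file, audio path, train.hyp, new trainset directory, type PHONE)...
/// # 3) train each phone (these runs can be done in parallel), then combine all phones into one AM file
/// shout_train_am      ...(trainset directory, phone)...
/// shout_train_finish  ...(trained phones -> one binary AM file)...
/// # the new AM can be used in step 1 again to create a better alignment
/// \endverbatim
///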
///
/// \section create_alignment Create an alignment of the training audio
/// In order to train new models, a set of training phones is needed. These training sets can only be
/// created when a phone-level alignment of the audio and the speech transcription is available.
/// This alignment must be in native shout format.
///
/// A shout alignment can be created by calling shout with a meta-data file in the following format:
/// - SPEAKER [label] [VTLN factor] [begin time] [length] \<NA\> \<NA\> [SPK ID] \<NA\> \<s\> \<s\> [word based transcription]
/// Note that this file format is basically the RTTM format from NIST.
///
/// The \<s\> symbol represents silence. The first two symbols in the transcription are not used to align the audio,
/// but only to create a language model history. When two \<s\> symbols are added, the language model history is reset to 'start of sentence'.
///
///
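/// As a made-up example, a training segment of audio file 'meeting1' with the transcription "hello world" could be
/// specified like this (all values are illustrative; see the \ref input_output "meta-data format" for the meaning of each field):
/// \verbatim
/// SPEAKER meeting1 1.00 12.34 7.80 <NA> <NA> SPK01 <NA> <s> <s> hello world
/// \endverbatim
///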
///
/// \section create_trainset Make a training set
/// Once the alignment file is created, a training directory can be made with the application shout_maketrainset.
/// This application needs a phone-set file, the path to the audio files (all files
/// need to be in the same directory), the alignment file (the HYP file printed to stdout by shout) and the
/// path to the training set directory that will be created.
///
/// Alternatively, shout_maketrainset can use a phone-set file and a file with a list of audio-metadata-hyp files.
///
/// In order to create acoustic phone models, make sure to set the type to PHONE in shout_maketrainset.
///
///
///
/// \section train_phones Training the phones
/// You can train each individual phone with shout_train_am. The training procedure is split up like this so that you
/// can run the training process in parallel. See shout_train_am -h for more info.
///
/// Once all phones are trained, you can use shout_train_finish to combine the models into one single binary file.
/////////////////////////////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////////////////////////////
/// \page use_case_recognition Decoding
///
/// The one and only reason all other applications were developed is to be able to decode!
/// That's why the decoder, the heart of the toolkit, is just called... Shout!
///
/// Run ./shout with the output meta-data file of shout_vtln (or of shout_cluster if no VTLN is needed).
/// Below is a short description of the most important parameters. Please run ./shout -h for more help.
///
/// \section decoder_settings_model Model settings
/// The decoder needs a language model file (lm), an acoustic model file (amp) and a lexical tree file (dct).
/// All files should be binary files created by the shout toolkit.
///
///
///
/// \section decoder_settings_search Search settings
/// The search space of the decoder is restricted using five parameters. If these parameters are not assigned a value, the default values (shown when shout is started with -cc) will be used.
///
/// The five search restriction parameters:
/// - BEAM (floating point number)
/// - STATE_BEAM (floating point number)
/// - END_STATE_BEAM (floating point number)
/// - HISTOGRAM_STATE_PRUNING (positive number)
/// - HISTOGRAM_PRUNING (positive number)
///
///
///
/// \section decoder_settings_amlm AM and LM scaling settings
/// The most likely paths in the jungle of feature vectors are calculated using a language model and acoustic models.
/// The scaling between the two types of models influences the outcome of the trip through this jungle. This scaling is
/// set using three parameters in the formula:
///
/// Score(LM_SCALE,TRANS_PENALTY,SIL_PENALTY) = ln(AMSCORE) + LM_SCALE*ln(LMSCORE) + TRANS_PENALTY*NR_WORDS + SIL_PENALTY*NR_SIL
///
/// A small code sketch of how these parameters combine into a path score is given at the end of this section.
///
/// Shout implements an efficient method of incorporating the LM score in the search. This method, Language Model Look-Ahead,
/// is switched on by default, but it can be toggled on or off in the configuration file.
/// - LM_SCALE (floating point number)
/// - TRANS_PENALTY (floating point number)
/// - SIL_PENALTY (floating point number)
/// - LMLA (1=on, 0=off)
///
///
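/// The following is a minimal C++ sketch of this score combination, added purely to illustrate the formula above. It is
/// not code from the toolkit; the function and variable names are mine, and it assumes AMSCORE and LMSCORE are plain
/// (non-log) likelihoods.
/// \code
/// #include <cmath>
///
/// // Illustration of: ln(AMSCORE) + LM_SCALE*ln(LMSCORE) + TRANS_PENALTY*NR_WORDS + SIL_PENALTY*NR_SIL
/// double pathScore(double amScore, double lmScore, int nrWords, int nrSil,
///                  double lmScale, double transPenalty, double silPenalty)
/// {
///   return std::log(amScore)             // acoustic model contribution
///        + lmScale * std::log(lmScore)   // scaled language model contribution
///        + transPenalty * nrWords        // word transition penalty
///        + silPenalty * nrSil;           // silence penalty
/// }
/// \endcode
///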
///
/// \section decoder_settings_alignment Alignment
/// You can specify a special background dictionary if you want to perform alignment with OOV marking. For performing alignment instead
/// of ASR, simply set the forced-alignment parameter (see ./shout -h). Make sure to add the utterance to align in the meta-data file, starting
/// with \<s\> \<s\>. See \ref use_case_training "the training use-case" for more information.
///
///
/// /// \section decoder_settings_output Type of output (text or XML) /// - XML (output will be in XML format) /// ///
///
/// \section decoder_settings_lattice Lattice output
/// Shout can generate lattices and output them in PSFG format. If this is wanted (decoding will be a bit slower),
/// use the lattice parameter with the path to the directory where the lattice
/// files should be written. One lattice will be created for each line in the meta-data file. The lattice files will be
/// named after the label of each audio file (defined in the meta-data file).
///
/////////////////////////////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////////////////////////////
/// \page use_case_sad Speech/non-speech segmentation
///
/// The decoder can only handle audio containing solely speech. Some silence models may be trained
/// (like silence, lip smack, etc.) but when the audio contains other sources such as
/// music, jingles or Formula One cars, it is a good idea to segment the audio before feeding it to
/// the decoder. Shout provides this possibility with the shout_segment application. This application can
/// cluster audio into three categories: "speech", "silence" and "audible non-speech".
///
/// Refer to my thesis for detailed information on how the application works (see menu). In short,
/// this is what happens:
/// - Using a speech/silence AM, an initial segmentation is created.
/// - Iteratively, using the high-confidence fragments of the initial segmentation, three new GMMs are trained: "speech", "silence" and "audible non-speech".
/// - Using BIC (see the note after this list), the application checks if the "speech" and "audible non-speech" models are actually different from each other. If they are identical, all three models
///   are discarded and a new set of GMMs, only containing "speech" and "silence", is trained iteratively.
/// - A final Viterbi run determines the final alignment and clustering.
///
///
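/// For readers who do not know the criterion: the Bayesian Information Criterion used for this kind of model comparison
/// is, in its generic form (the exact penalty weight lambda and parameter counts used by shout_segment are described in the thesis):
/// \verbatim
/// BIC(M) = log L(X; M) - (lambda/2) * #parameters(M) * log(N)
/// \endverbatim
/// with N the number of feature vectors in X. The "speech" and "audible non-speech" GMMs are only kept as two separate
/// models if modelling their data with two separate GMMs gives a higher total BIC value than modelling all of it with one combined GMM.
///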
///
/// \section train_sad Training the GMM for the initial segmentation
/// In a future implementation I'm planning to replace the first step, which needs models to create the initial segmentation, with a method that does not require models at all.
/// But for now, the speech and non-speech models need to be trained. Training these GMMs is done in the same way as \ref use_case_training "training phones", except that
/// it is not necessary to perform multiple training iterations (especially when the initial alignment is the final alignment of the phone training step).
///
/// Training new speech/non-speech models is done in three steps:
/// - Create a hypothesis file in native shout format. Alignments in other formats need to be re-formatted.
/// - Make a training set using the alignment.
/// - Run shout_train_am and shout_train_finish. The resulting AM can be used in the first step to create
///   a better alignment.
///
///
///
/// \section sad_create_alignment Create an alignment of the training audio
/// For creating the alignment, see the section on \ref create_alignment "training acoustic models".
///
///
///
/// \section sad_create_trainset shout_maketrainset
/// You can create a training set with the application shout_maketrainset. Be sure to choose SAD as the type of trainset you want to create.
/// Run shout_maketrainset with the -h option for usage of this application.
///
/// Note 1: The feature vectors used for creating these models are different from the phone feature vectors. That's why it is not possible to re-use the training directory for phones.
///
/// Note 2: Not all phone occurrences are used for training; only the ones that are not directly next to a silence (non-speech) phone.
///
///
///
/// \section sad_train_am shout_train_am and shout_train_finish
/// Once you have a training set for your SAD-AM, you can run shout_train_am to train your SIL and SPEECH models (run the application twice),
/// and after that you can run shout_train_finish to generate a binary AM file from the two models.
/////////////////////////////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////////////////////////////
/// \page use_case_diarization Speaker Diarization
///
/// Speaker diarization is the task of determining 'who spoke when?'. If you have a speech/non-speech meta-data file
/// from shout_segment, it is very easy to perform diarization using the application shout_cluster.
/// Type ./shout_cluster -h for more information.
/////////////////////////////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////////////////////////////
/// \page use_case_adapt Adapting your acoustic models (SMAPLR)
///
/// \todo Under development. Come back later :-)
/////////////////////////////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////////////////////////////
/// \page use_case_vtln Vocal tract length normalization
///
/// Vocal Tract Length Normalization (VTLN) is a normalization step that improves ASR considerably.
/// Simply run shout_vtln on the output of the diarization step (shout_cluster) to get the normalization
/// factors for each speaker (incorporated in the output meta-data file).
/////////////////////////////////////////////////////////////////////////////////////////////////////