LanguageModel Class Reference

This class contains all language model functionality.



Public Member Functions

 LanguageModel ()
 LanguageModel (int numberOfWords)
 LanguageModel (const char *lmbinFileName)
 ~LanguageModel ()
void printInfo ()
void setUnigram (int wordID, float p)
float getUnigram (int wordID)
int getLastWordID (int *wordHistory)
void getAllP (int *lmHistory, float *allP)
virtual float getP (int newWordID, int *wordHistoryIn, int *wordHistoryOut)
bool checkValidity ()
int getNumberOfWords () const
bool compareKeys (int *key1, int *key2, int length) const

Protected Member Functions

float getUniGramP (int newWordID, int *wordHistoryNew)
float getBiGramP (int newWordID, int *wordHistory, int *wordHistoryNew)
float getTriGramP (int newWordID, int *wordHistory, int *wordHistoryNew)
float get4GramP (int newWordID, int *wordHistory, int *wordHistoryNew)
int getBiGramIndex (int *w)
int getTriGramIndex (int *w)
int get4GramIndex (int *w)

Protected Attributes

int uni_tableLength
 The length of the uni-gram (direct) lookup-table.
int bi_tableLength
 The length of the bi-gram (hash) lookup-table.
int tri_tableLength
 The length of the tri-gram (hash) lookup-table.
int four_tableLength
 The length of the 4-gram (hash) lookup-table.
int uni_startData
int bi_startData
int tri_startData
Hash * bi_Hash
 The Hash-function of the bi-gram table.
Hash * tri_Hash
 The Hash-function of the tri-gram table.
Hash * four_Hash
 The Hash-function of the 4-gram table.
LMEntryType_1 * uni_lmData
 The uni-gram direct lookup-table.
LMEntryType_2 * bi_lmData
 The bi-gram hash table.
LMEntryType_3 * tri_lmData
 The tri-gram hash table.
LMEntryType_4 * four_lmData
 The 4-gram hash table.
Hash * tri_HashListSearch
int tri_tableLengthList
LMListSearch_2 * bi_lmDataList
LMListSearch_3 * tri_lmDataList

Detailed Description

This class contains all language model functionality.

Each LanguageModel object stores its data in lookup tables: a direct uni-gram table and hash tables for bi-grams, tri-grams and 4-grams. The hash tables are queried through (minimal perfect) Hash objects, one per table. The binary data is written by the class Shout_lm2bin, which is used by the application shout_lm2bin.

The method getUnigram() retrieves the unigram probability of a word. The method getP() retrieves the LM probability of a word given a certain word history, and it handles backoff if needed.
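As an illustration of how a caller might use getP(), the sketch below scores a word sequence by summing log-domain probabilities while passing the history from one call to the next. ToyLM is a hypothetical unigram-only stand-in, not the real class. Only the getP() parameter list is taken from this page; the history length of 3 and the use of -1 for "no word yet" are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Assumption: the history holds the last 3 word IDs (enough for a 4-gram model).
const int HISTORY_LENGTH = 3;

// Hypothetical stand-in that mimics the getP() parameter list documented here.
// It only knows unigrams, so it never has to back off.
class ToyLM {
public:
    explicit ToyLM(const std::vector<float> &probabilities) {
        for (float p : probabilities) logP.push_back(std::log(p));
    }
    int getNumberOfWords() const { return static_cast<int>(logP.size()); }

    // Returns log P(newWordID) and writes the shifted history to wordHistoryOut.
    float getP(int newWordID, int *wordHistoryIn, int *wordHistoryOut) {
        assert(newWordID >= 0 && newWordID < getNumberOfWords());
        for (int i = 0; i + 1 < HISTORY_LENGTH; ++i)
            wordHistoryOut[i] = wordHistoryIn[i + 1];
        wordHistoryOut[HISTORY_LENGTH - 1] = newWordID;
        return logP[newWordID];
    }

private:
    std::vector<float> logP;
};

// Sums the log probabilities of a sequence, updating the history after each word.
float scoreSequence(ToyLM &lm, const std::vector<int> &words) {
    int history[HISTORY_LENGTH] = {-1, -1, -1};  // -1 = "no word yet" (assumption)
    int next[HISTORY_LENGTH];
    float total = 0.0f;
    for (int w : words) {
        total += lm.getP(w, history, next);
        std::copy(next, next + HISTORY_LENGTH, history);
    }
    return total;
}

// Look-ahead style query: the new history goes to a scratch buffer and is
// discarded, so the caller's history stays unchanged.
float peekP(ToyLM &lm, int candidateWordID, int *history) {
    int scratch[HISTORY_LENGTH];
    return lm.getP(candidateWordID, history, scratch);
}
```

With the real class, the same calling pattern would apply to a LanguageModel object loaded from a binary LM file, with the history buffer sized as the decoder requires.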


Constructor & Destructor Documentation

LanguageModel::LanguageModel (  ) 

LanguageModel::LanguageModel ( const char *  binlmFileName  ) 

This constructor (normally used in the shout application) loads its LM data from the file with the given name.

References bi_Hash, bi_lmData, bi_lmDataList, bi_tableLength, four_Hash, four_lmData, four_tableLength, WriteFileLittleBigEndian::freadEndianSafe(), tri_Hash, tri_HashListSearch, tri_lmData, tri_lmDataList, tri_tableLength, uni_lmData, and uni_tableLength.

LanguageModel::~LanguageModel (  ) 

If needed, the destructor releases memory that is claimed for the LM arrays and the Hash tables.

References bi_Hash, bi_lmData, bi_lmDataList, four_Hash, four_lmData, tri_Hash, tri_HashListSearch, tri_lmData, tri_lmDataList, and uni_lmData.


Member Function Documentation

bool LanguageModel::checkValidity (  ) 

This method is only used for debugging purposes. It checks whether the language model is consistent with the hash tables.

References bi_Hash, bi_lmData, bi_tableLength, Hash::getIndex(), tri_Hash, tri_lmData, tri_tableLength, LMEntryType_3::words, and LMEntryType_2::words.

bool LanguageModel::compareKeys ( int *  key1,
int *  key2,
int  length 
) const [inline]

int LanguageModel::get4GramIndex ( int *  w  )  [protected]

This method returns the index into the 4-gram array for the word sequence w[0], w[1], w[2] and w[3]. If this 4-gram does not exist, it will return -1.

References compareKeys(), four_Hash, four_lmData, and Hash::getIndex().

Referenced by get4GramP().

float LanguageModel::get4GramP ( int  newWordID,
int *  wordHistory,
int *  wordHistoryNew 
) [protected]

This method returns the 4-gram probability and updates the new LM history. If needed, it will back off to getTriGramP(). This method is a helper method of getP().

References four_lmData, four_tableLength, get4GramIndex(), getBiGramIndex(), getTriGramP(), LMEntryType_4::p, tri_lmData, and LMEntryType_3::words.

Referenced by getP().

int LanguageModel::getBiGramIndex ( int *  w  )  [protected]

This method returns the bigram index for the bigram array for the word pair w[0] and w[1]. If this bigram does not exist, it will return -1.

References bi_Hash, bi_lmData, compareKeys(), and Hash::getIndex().

Referenced by get4GramP(), getAllP(), getBiGramP(), and getP().
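The lookup described above can be sketched as follows: a hash function proposes a slot, and the stored word IDs are compared with the query before the index is trusted. The comparison matters because a minimal perfect hash assigns a unique slot to every known key, but still maps an unknown key to some slot. The entry layout, toyHash() and the table contents are hypothetical; the real Hash::getIndex() and LMEntryType_2 internals are not shown on this page.

```cpp
#include <cassert>
#include <vector>

// Hypothetical entry layout: the two word IDs plus a log probability.
struct ToyBigramEntry {
    int words[2];
    float p;
};

// Same role as compareKeys() on this page: true if the first `length` IDs match.
bool compareKeys(const int *key1, const int *key2, int length) {
    for (int i = 0; i < length; ++i)
        if (key1[i] != key2[i]) return false;
    return true;
}

// Toy hash, NOT the real Hash::getIndex(): maps any key to some slot.
int toyHash(const int *w, int tableLength) {
    assert(tableLength > 0);
    return (w[0] * 31 + w[1]) % tableLength;
}

// Returns the slot of bigram (w[0], w[1]), or -1 if the bigram is not stored.
int getBigramIndex(const std::vector<ToyBigramEntry> &table, const int *w) {
    int tableLength = static_cast<int>(table.size());
    int index = toyHash(w, tableLength);
    if (index < 0 || index >= tableLength) return -1;
    return compareKeys(table[index].words, w, 2) ? index : -1;
}

// Three bigrams placed in the slots that toyHash() assigns to them.
std::vector<ToyBigramEntry> makeToyTable() {
    return {
        {{2, 1}, -1.5f},  // (2*31 + 1) % 3 == 0
        {{0, 1}, -0.7f},  // (0*31 + 1) % 3 == 1
        {{1, 1}, -2.3f},  // (1*31 + 1) % 3 == 2
    };
}
```

In this toy table the unseen bigram (0, 4) hashes to slot 1, which holds (0, 1), so the key comparison rejects it and the lookup returns -1.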

float LanguageModel::getBiGramP ( int  newWordID,
int *  wordHistory,
int *  wordHistoryNew 
) [protected]

This method returns the bigram probability and updates the new LM history. If needed, it will back off to getUniGramP(). This method is a helper method of getP().

References bi_lmData, bi_tableLength, getBiGramIndex(), getUniGramP(), LMEntryType_2::p, and uni_lmData.

Referenced by getP(), and getTriGramP().

int LanguageModel::getNumberOfWords (  )  const [inline]

References uni_tableLength.

Referenced by LexicalTree::setLM(), and Whisper::Whisper().

float LanguageModel::getP ( int  newWordID,
int *  wordHistory,
int *  wordHistoryOut 
) [virtual]

The LM probability for the new word 'newWordID' given the LM history 'wordHistory' is returned by this method. The probability is returned in the log domain.

The new LM history is stored in 'wordHistoryOut'. This last parameter may be ignored and the old history may be maintained. This makes it possible to use this method for Language Model Look-Ahead.

This method checks which probability (unigram, bigram or trigram) needs to be calculated and, if needed, multiplies it by a backoff value (actually, because all probabilities are stored in the log domain, the backoff value is added).
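The remark about adding the backoff value follows from log(b * P) = log(b) + log(P). A minimal sketch of a bigram-to-unigram backoff in the log domain is given below. ToyBackoffLM, its containers and its numbers are hypothetical and do not reflect the table layout of this class; storing one backoff weight per history word follows the common ARPA convention and is an assumption here.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <utility>
#include <vector>

// Hypothetical toy model, all values in the log domain.
struct ToyBackoffLM {
    std::vector<float> unigramLogP;                   // log P(w)
    std::vector<float> backoffLogWeight;              // log b(h), one per history word
    std::map<std::pair<int, int>, float> bigramLogP;  // log P(w | h), key = (h, w)

    // Returns log P(newWordID | previousWordID), backing off if needed.
    float getLogP(int newWordID, int previousWordID) const {
        assert(previousWordID >= 0 &&
               previousWordID < static_cast<int>(backoffLogWeight.size()));
        auto it = bigramLogP.find(std::make_pair(previousWordID, newWordID));
        if (it != bigramLogP.end()) return it->second;  // bigram is stored
        // Backoff: b(h) * P(w) becomes a sum of logs.
        return backoffLogWeight[previousWordID] + unigramLogP[newWordID];
    }
};

// Toy numbers chosen so that P(. | word 0) sums to 1:
// 0.625 (stored bigram) + 0.5 * (0.5 + 0.25) (backed-off mass) = 1.
ToyBackoffLM makeToyBackoffLM() {
    ToyBackoffLM lm;
    lm.unigramLogP = {std::log(0.5f), std::log(0.25f), std::log(0.25f)};
    lm.backoffLogWeight = {std::log(0.5f), 0.0f, 0.0f};
    lm.bigramLogP[std::make_pair(0, 1)] = std::log(0.625f);
    return lm;
}
```

In this toy model the stored bigram (0, 1) is returned directly, while the unseen pair (0, 2) yields log(0.5) + log(0.25), the log of 0.125.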

Reimplemented in LanguageModel_Segmenter.

References bi_lmData, four_lmData, four_tableLength, get4GramP(), getBiGramIndex(), getBiGramP(), getTriGramIndex(), getTriGramP(), getUniGramP(), tri_lmData, tri_tableLength, LMEntryType_4::words, LMEntryType_3::words, and LMEntryType_2::words.

Referenced by LexicalTree::calcErrorRegionStats(), LexicalTree::createLatticeLMRescoring(), LexicalTree::createLatticeNodeGroups(), NBest::fillNBestArray(), LexicalTree::findBestToken(), NBest::NBest(), LexicalTree::processWord(), LexicalTree::safeBestRecognition(), LexicalTree::setInitialLMHistory(), and NBest::setReference().

int LanguageModel::getTriGramIndex ( int *  w  )  [protected]

This method returns the index into the trigram array for the word triple w[0], w[1] and w[2]. If this trigram does not exist, it will return -1.

References compareKeys(), Hash::getIndex(), tri_Hash, and tri_lmData.

Referenced by getP(), and getTriGramP().

float LanguageModel::getTriGramP ( int  newWordID,
int *  wordHistory,
int *  wordHistoryNew 
) [protected]

This method returns the trigram probability and updates the new LM history. If needed, it will back off to getBiGramP(). This method is a helper method of getP().

References bi_lmData, getBiGramP(), getTriGramIndex(), LMEntryType_3::p, tri_lmData, tri_tableLength, and LMEntryType_2::words.

Referenced by get4GramP(), and getP().

float LanguageModel::getUnigram ( int  wordID  ) 

The unigram probability is retrieved from the unigram array using index wordID. The validity of wordID is not checked by this method!
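Because the method does not validate wordID, a caller may want to guard the call. The sketch below shows two hypothetical wrappers (not part of this class) that work with any type offering getNumberOfWords() and getUnigram(int). It assumes that valid IDs run from 0 to getNumberOfWords() - 1. StubLM is a stand-in used only to make the example self-contained.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Throws instead of reading outside the unigram table.
template <typename LM>
float checkedUnigram(LM &lm, int wordID) {
    if (wordID < 0 || wordID >= lm.getNumberOfWords())
        throw std::out_of_range("wordID outside the unigram table");
    return lm.getUnigram(wordID);
}

// Debug-only variant: the check disappears when NDEBUG is defined.
template <typename LM>
float debugUnigram(LM &lm, int wordID) {
    assert(wordID >= 0 && wordID < lm.getNumberOfWords());
    return lm.getUnigram(wordID);
}

// Hypothetical stand-in with the two methods the wrappers rely on.
struct StubLM {
    std::vector<float> logP{-1.0f, -2.0f, -3.0f};
    int getNumberOfWords() const { return static_cast<int>(logP.size()); }
    // Like the documented method, this performs no bounds check itself.
    float getUnigram(int wordID) { return logP[wordID]; }
};
```

The throwing variant suits code that handles untrusted word IDs; the assert-based variant keeps release builds free of the extra comparison.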

References LMEntryType_1::p, and uni_lmData.

float LanguageModel::getUniGramP ( int  newWordID,
int *  wordHistoryNew 
) [protected]

This method returns the unigram probability and updates the new LM history. This method is a helper method of getP().

References LMEntryType_1::p, and uni_lmData.

Referenced by getBiGramP(), and getP().

void LanguageModel::printInfo (  ) 

This method prints some information about the language model to standard output.

References bi_tableLength, four_tableLength, tri_tableLength, and uni_tableLength.

void LanguageModel::setUnigram ( int  wordID,
float  p 
)


Member Data Documentation

Hash * LanguageModel::bi_Hash [protected]

The Hash-function of the bi-gram table.

Referenced by checkValidity(), getBiGramIndex(), LanguageModel(), Shout_lm2bin::Shout_lm2bin(), and ~LanguageModel().

int LanguageModel::bi_startData [protected]

int LanguageModel::bi_tableLength [protected]

The length of the bi-gram (hash) lookup-table.

Referenced by checkValidity(), getAllP(), getBiGramP(), LanguageModel(), printInfo(), and Shout_lm2bin::Shout_lm2bin().

Hash * LanguageModel::four_Hash [protected]

The Hash-function of the 4-gram table.

Referenced by get4GramIndex(), LanguageModel(), Shout_lm2bin::Shout_lm2bin(), and ~LanguageModel().

int LanguageModel::four_tableLength [protected]

The length of the 4-gram (hash) lookup-table.

Referenced by get4GramP(), getP(), LanguageModel(), printInfo(), and Shout_lm2bin::Shout_lm2bin().

Hash * LanguageModel::tri_Hash [protected]

The Hash-function of the tri-gram table.

Referenced by checkValidity(), getTriGramIndex(), LanguageModel(), Shout_lm2bin::Shout_lm2bin(), and ~LanguageModel().

int LanguageModel::tri_tableLength [protected]

The length of the tri-gram (hash) lookup-table.

Referenced by checkValidity(), getP(), getTriGramP(), LanguageModel(), printInfo(), and Shout_lm2bin::Shout_lm2bin().