LanguageModel Class Reference
This class contains all language model functionality.

Public Member Functions | |
LanguageModel () | |
LanguageModel (int numberOfWords) | |
LanguageModel (const char *lmbinFileName) | |
~LanguageModel () | |
void | printInfo () |
void | setUnigram (int wordID, float p) |
float | getUnigram (int wordID) |
int | getLastWordID (int *wordHistory) |
void | getAllP (int *lmHistory, float *allP) |
virtual float | getP (int newWordID, int *wordHistoryIn, int *wordHistoryOut) |
bool | checkValidity () |
int | getNumberOfWords () const |
bool | compareKeys (int *key1, int *key2, int length) const |
Protected Member Functions | |
float | getUniGramP (int newWordID, int *wordHistoryNew) |
float | getBiGramP (int newWordID, int *wordHistory, int *wordHistoryNew) |
float | getTriGramP (int newWordID, int *wordHistory, int *wordHistoryNew) |
float | get4GramP (int newWordID, int *wordHistory, int *wordHistoryNew) |
int | getBiGramIndex (int *w) |
int | getTriGramIndex (int *w) |
int | get4GramIndex (int *w) |
Protected Attributes | |
int | uni_tableLength |
The length of the uni-gram (direct) lookup-table. | |
int | bi_tableLength |
The length of the bi-gram (hash) lookup-table. | |
int | tri_tableLength |
The length of the tri-gram (hash) lookup-table. | |
int | four_tableLength |
The length of the 4-gram (hash) lookup-table. | |
int | uni_startData |
int | bi_startData |
int | tri_startData |
Hash * | bi_Hash |
The Hash-function of the bi-gram table. | |
Hash * | tri_Hash |
The Hash-function of the tri-gram table. | |
Hash * | four_Hash |
The Hash-function of the 4-gram table. | |
LMEntryType_1 * | uni_lmData |
The uni-gram direct lookup-table. | |
LMEntryType_2 * | bi_lmData |
The bi-gram hash table. | |
LMEntryType_3 * | tri_lmData |
The tri-gram hash table. | |
LMEntryType_4 * | four_lmData |
The 4-gram hash table. | |
Hash * | tri_HashListSearch |
int | tri_tableLengthList |
LMListSearch_2 * | bi_lmDataList |
LMListSearch_3 * | tri_lmDataList |
Detailed Description
This class contains all language model functionality. Each LanguageModel object stores its data in three tables (uni-, bi- and tri-gram tables). The bigram and trigram tables are queried using the two (perfect minimal) hash tables. The data is stored using the class Shout_lm2bin with the application shout_lm2bin.
With the method getUnigram() the unigram probability of a word can be retrieved. The method getP() is used to retrieve the LM probability given a certain word history. getP() will handle backoff if needed.
Constructor & Destructor Documentation
LanguageModel::LanguageModel()
The standard constructor only initialises some internal variables.
References bi_Hash, bi_lmData, bi_lmDataList, bi_tableLength, four_Hash, four_lmData, four_tableLength, tri_Hash, tri_HashListSearch, tri_lmData, tri_lmDataList, tri_tableLength, uni_lmData, and uni_tableLength.
LanguageModel::LanguageModel(int nrUni)
Creates a unigram LM.
References LMEntryType_1::backoff, bi_Hash, bi_lmData, bi_lmDataList, bi_tableLength, four_Hash, four_lmData, four_tableLength, LMEntryType_1::p, tri_Hash, tri_HashListSearch, tri_lmData, tri_lmDataList, tri_tableLength, uni_lmData, and uni_tableLength.
LanguageModel::LanguageModel(const char *binlmFileName)
This constructor (normally used in the shout application) loads its LM data from the file with the given file name.
References bi_Hash, bi_lmData, bi_lmDataList, bi_tableLength, four_Hash, four_lmData, four_tableLength, WriteFileLittleBigEndian::freadEndianSafe(), tri_Hash, tri_HashListSearch, tri_lmData, tri_lmDataList, tri_tableLength, uni_lmData, and uni_tableLength.
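The signature and exact behaviour of WriteFileLittleBigEndian::freadEndianSafe() are not shown on this page. As a rough illustration of what an endian-safe read involves, the sketch below decodes a 32-bit integer byte by byte, so the result does not depend on the byte order of the host. The helper names decodeInt32LE and readInt32LE are hypothetical, and the little-endian file layout is an assumption, not a statement about the actual binary LM format.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical helper (not part of Shout): decodes four bytes that were
// stored least-significant byte first into a signed 32-bit integer.
std::int32_t decodeInt32LE(const unsigned char *bytes)
{
  assert(bytes != nullptr);
  // Assemble the value from individual bytes, so the host's own byte
  // order never influences the result.
  const std::uint32_t raw = static_cast<std::uint32_t>(bytes[0])
                          | (static_cast<std::uint32_t>(bytes[1]) << 8)
                          | (static_cast<std::uint32_t>(bytes[2]) << 16)
                          | (static_cast<std::uint32_t>(bytes[3]) << 24);
  std::int32_t value;
  std::memcpy(&value, &raw, sizeof value);
  return value;
}

// Hypothetical helper: reads one 32-bit integer from a binary file.
// Returns false if fewer than four bytes could be read.
bool readInt32LE(std::FILE *file, std::int32_t *value)
{
  assert(file != nullptr && value != nullptr);
  unsigned char bytes[4];
  if (std::fread(bytes, 1, sizeof bytes, file) != sizeof bytes)
  {
    return false;
  }
  *value = decodeInt32LE(bytes);
  return true;
}
```

A loader built this way would read the table lengths first and then the table entries, field by field, rather than reading whole structs in one call.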

LanguageModel::~LanguageModel()
If needed, the destructor releases memory that is claimed for the LM arrays and the Hash tables.
References bi_Hash, bi_lmData, bi_lmDataList, four_Hash, four_lmData, tri_Hash, tri_HashListSearch, tri_lmData, tri_lmDataList, and uni_lmData.
Member Function Documentation
bool LanguageModel::checkValidity()
This method is only used for debugging purposes. It checks if the language model is consistent with the hash tables.
References bi_Hash, bi_lmData, bi_tableLength, Hash::getIndex(), tri_Hash, tri_lmData, tri_tableLength, LMEntryType_3::words, and LMEntryType_2::words.

bool LanguageModel::compareKeys(int *key1, int *key2, int length) const [inline]
Referenced by get4GramIndex(), getBiGramIndex(), and getTriGramIndex().
int LanguageModel::get4GramIndex(int *w) [protected]
This method returns the index into the 4-gram array for the word sequence w[0], w[1], w[2] and w[3]. If this 4-gram does not exist, it will return -1.
References compareKeys(), four_Hash, four_lmData, and Hash::getIndex().
Referenced by get4GramP().

float LanguageModel::get4GramP(int newWordID, int *wordHistory, int *wordHistoryNew) [protected]
This method returns the 4-gram probability and updates the new LM history. If needed it will back off to getTriGramP(). (This method is a helper method of getP().)
References four_lmData, four_tableLength, get4GramIndex(), getBiGramIndex(), getTriGramP(), LMEntryType_4::p, tri_lmData, and LMEntryType_3::words.
Referenced by getP().

void LanguageModel::getAllP(int *lmHistory, float *allP)
References LMEntryType_1::backoff, LMEntryType_2::backoff, bi_lmData, bi_lmDataList, bi_tableLength, getBiGramIndex(), Hash::getIndex(), LMListSearch_2::length, LMListSearch_3::length, LMListSearch_2::lowestIndex, LMListSearch_3::lowestIndex, LMEntryType_2::p, LMEntryType_3::p, LMEntryType_1::p, tri_HashListSearch, tri_lmData, tri_lmDataList, uni_lmData, uni_tableLength, LMEntryType_3::words, and LMEntryType_2::words.
Referenced by LexicalTree::getLMLATable().

int LanguageModel::getBiGramIndex(int *w) [protected]
This method returns the bigram index for the bigram array for the word pair w[0] and w[1]. If this bigram does not exist, it will return -1.
References bi_Hash, bi_lmData, compareKeys(), and Hash::getIndex().
Referenced by get4GramP(), getAllP(), getBiGramP(), and getP().
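The references above suggest a two-step lookup: the hash function proposes a slot, and compareKeys() verifies that the slot really holds the requested words. The verification matters because a perfect hash built for a fixed key set can still map an unknown key onto an occupied slot. The sketch below illustrates that pattern under stated assumptions: BigramEntry, keysEqual, toyHash and findBigramIndex are hypothetical names, and toyHash is only a stand-in, not the perfect minimal hash used by the Hash class.

```cpp
#include <cassert>
#include <vector>

// Hypothetical bigram entry: the two word IDs that form the key, plus a
// log probability.
struct BigramEntry
{
  int   words[2];
  float p;
};

// Element-wise key comparison, in the spirit of compareKeys().
bool keysEqual(const int *key1, const int *key2, int length)
{
  for (int i = 0; i < length; i++)
  {
    if (key1[i] != key2[i])
    {
      return false;
    }
  }
  return true;
}

// Toy stand-in for a hash function: maps a word pair to a slot in the table.
// A real perfect hash is constructed from the key set; this one is not.
int toyHash(const int *w, int tableLength)
{
  assert(tableLength > 0);
  const unsigned int h = static_cast<unsigned int>(w[0]) * 31u
                       + static_cast<unsigned int>(w[1]);
  return static_cast<int>(h % static_cast<unsigned int>(tableLength));
}

// Returns the slot of the bigram (w[0], w[1]), or -1 if the slot proposed by
// the hash function holds a different key.
int findBigramIndex(const std::vector<BigramEntry> &table, const int *w)
{
  if (table.empty())
  {
    return -1;
  }
  const int index = toyHash(w, static_cast<int>(table.size()));
  return keysEqual(table[index].words, w, 2) ? index : -1;
}
```

In this sketch a lookup for a bigram that was never stored returns -1 even when its hash value collides with a stored entry, which is the case the key comparison is there to catch.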

float LanguageModel::getBiGramP(int newWordID, int *wordHistory, int *wordHistoryNew) [protected]
This method returns the bigram probability and updates the new LM history. If needed it will back off to getUniGramP(). (This method is a helper method of getP().)
References bi_lmData, bi_tableLength, getBiGramIndex(), getUniGramP(), LMEntryType_2::p, and uni_lmData.
Referenced by getP(), and getTriGramP().

int LanguageModel::getLastWordID(int *wordHistory)
Given the current word history, the last wordID is returned.
References bi_lmData, tri_lmData, LMEntryType_2::words, and LMEntryType_3::words.
Referenced by LexicalTree::calcErrorRegionStats(), LexicalTree::createLatticeNodeGroups(), NBest::fillNBestArray(), LexicalTree::getBestIDSequence(), LexicalTree::getBestPath(), LexicalTree::getWordFromWLR(), LexicalTree::safeBestRecognition(), and NBest::setReference().
int LanguageModel::getNumberOfWords() const [inline]
float LanguageModel::getP(int newWordID, int *wordHistory, int *wordHistoryOut) [virtual]
The LM probability for the new word 'newWordID' given the LM history 'wordHistory' is returned by this method. The probability is returned in the log domain.
The new LM history is stored in 'wordHistoryOut'. This last parameter may be ignored and the old history may be maintained. This makes it possible to use this method for Language Model Look-Ahead.
This method will check which probability (unigram, bigram or trigram) needs to be calculated and, if needed, multiply it with a backoff value (actually, because all probabilities are stored in the log domain, the backoff is added).
Reimplemented in LanguageModel_Segmenter.
References bi_lmData, four_lmData, four_tableLength, get4GramP(), getBiGramIndex(), getBiGramP(), getTriGramIndex(), getTriGramP(), getUniGramP(), tri_lmData, tri_tableLength, LMEntryType_4::words, LMEntryType_3::words, and LMEntryType_2::words.
Referenced by LexicalTree::calcErrorRegionStats(), LexicalTree::createLatticeLMRescoring(), LexicalTree::createLatticeNodeGroups(), NBest::fillNBestArray(), LexicalTree::findBestToken(), NBest::NBest(), LexicalTree::processWord(), LexicalTree::safeBestRecognition(), LexicalTree::setInitialLMHistory(), and NBest::setReference().
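The backoff step described for getP() can be sketched with a toy bigram model. Everything in the block is an assumption made for illustration: ToyBigramModel and UnigramEntry are hypothetical types, a std::map stands in for the hash tables, the history is reduced to a single word ID, and passing a null pointer for the output history is a convention of this sketch only. What it does show is the log-domain rule from the description: when the bigram is missing, the backoff weight of the history is added to the unigram log probability.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical unigram entry: log probability and log backoff weight.
struct UnigramEntry
{
  float p;
  float backoff;
};

struct ToyBigramModel
{
  std::vector<UnigramEntry>            unigrams; // indexed by word ID
  std::map<std::pair<int, int>, float> bigrams;  // (previous, next) -> log probability

  // Returns log P(newWordID | previousWordID). If historyOut is not null,
  // the new history (here simply the new word ID) is written to it. A caller
  // doing look-ahead can pass nullptr and keep its old history.
  float getP(int newWordID, int previousWordID, int *historyOut) const
  {
    assert(newWordID >= 0 && newWordID < static_cast<int>(unigrams.size()));
    assert(previousWordID >= 0 && previousWordID < static_cast<int>(unigrams.size()));
    if (historyOut != nullptr)
    {
      *historyOut = newWordID;
    }
    const auto hit = bigrams.find(std::make_pair(previousWordID, newWordID));
    if (hit != bigrams.end())
    {
      return hit->second;
    }
    // Back off: a product of probabilities becomes a sum of log values.
    return unigrams[previousWordID].backoff + unigrams[newWordID].p;
  }
};
```

With unigram log probabilities of -1.0 and -2.0 and a backoff weight of -0.25 for word 1, a query for the unseen bigram (1, 0) in this sketch yields -0.25 + -1.0 = -1.25.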

int LanguageModel::getTriGramIndex(int *w) [protected]
This method returns the index into the trigram array for the word triple w[0], w[1] and w[2]. If this trigram does not exist, it will return -1.
References compareKeys(), Hash::getIndex(), tri_Hash, and tri_lmData.
Referenced by getP(), and getTriGramP().

float LanguageModel::getTriGramP(int newWordID, int *wordHistory, int *wordHistoryNew) [protected]
This method returns the trigram probability and updates the new LM history. If needed it will back off to getBiGramP(). (This method is a helper method of getP().)
References bi_lmData, getBiGramP(), getTriGramIndex(), LMEntryType_3::p, tri_lmData, tri_tableLength, and LMEntryType_2::words.
Referenced by get4GramP(), and getP().

float LanguageModel::getUnigram(int wordID)
The unigram probability is retrieved from the unigram array using index wordID. The validity of wordID is not checked by this method!
References LMEntryType_1::p, and uni_lmData.
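Because getUnigram() does not validate wordID, a caller may want to range-check the ID before the lookup, for example against the table length. The sketch below shows such a check on a plain vector of log probabilities. The function tryGetUnigram is hypothetical and does not wrap the real class; it only illustrates the guard.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper: looks up a log probability in a direct lookup table
// and reports failure instead of reading outside the table.
bool tryGetUnigram(const std::vector<float> &table, int wordID, float *p)
{
  assert(p != nullptr);
  if (wordID < 0 || static_cast<std::size_t>(wordID) >= table.size())
  {
    return false; // invalid word ID, *p is left untouched
  }
  *p = table[static_cast<std::size_t>(wordID)];
  return true;
}
```

Whether such a check is worth its cost depends on the caller: inside a decoder's inner loop the unchecked access may be a deliberate choice.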
float LanguageModel::getUniGramP(int newWordID, int *wordHistoryNew) [protected]
This method returns the unigram probability and updates the new LM history. (This method is a helper method of getP().)
References LMEntryType_1::p, and uni_lmData.
Referenced by getBiGramP(), and getP().
void LanguageModel::printInfo()
This method prints some information about the language model to standard output.
References bi_tableLength, four_tableLength, tri_tableLength, and uni_tableLength.
void LanguageModel::setUnigram(int wordID, float p)
References LMEntryType_1::p, uni_lmData, and uni_tableLength.
Member Data Documentation
Hash* LanguageModel::bi_Hash [protected] |
The Hash-function of the bi-gram table.
Referenced by checkValidity(), getBiGramIndex(), LanguageModel(), Shout_lm2bin::Shout_lm2bin(), and ~LanguageModel().
LMEntryType_2* LanguageModel::bi_lmData [protected] |
The bi-gram hash table.
Referenced by checkValidity(), getAllP(), getBiGramIndex(), getBiGramP(), getLastWordID(), getP(), getTriGramP(), LanguageModel(), Shout_lm2bin::Shout_lm2bin(), and ~LanguageModel().
LMListSearch_2* LanguageModel::bi_lmDataList [protected] |
Referenced by getAllP(), LanguageModel(), and ~LanguageModel().
int LanguageModel::bi_startData [protected] |
int LanguageModel::bi_tableLength [protected] |
The length of the bi-gram (hash) lookup-table.
Referenced by checkValidity(), getAllP(), getBiGramP(), LanguageModel(), printInfo(), and Shout_lm2bin::Shout_lm2bin().
Hash* LanguageModel::four_Hash [protected] |
The Hash-function of the 4-gram table.
Referenced by get4GramIndex(), LanguageModel(), Shout_lm2bin::Shout_lm2bin(), and ~LanguageModel().
LMEntryType_4* LanguageModel::four_lmData [protected] |
The 4-gram hash table.
Referenced by get4GramIndex(), get4GramP(), getP(), LanguageModel(), Shout_lm2bin::Shout_lm2bin(), and ~LanguageModel().
int LanguageModel::four_tableLength [protected] |
The length of the 4-gram (hash) lookup-table.
Referenced by get4GramP(), getP(), LanguageModel(), printInfo(), and Shout_lm2bin::Shout_lm2bin().
Hash* LanguageModel::tri_Hash [protected] |
The Hash-function of the tri-gram table.
Referenced by checkValidity(), getTriGramIndex(), LanguageModel(), Shout_lm2bin::Shout_lm2bin(), and ~LanguageModel().
Hash* LanguageModel::tri_HashListSearch [protected] |
Referenced by getAllP(), LanguageModel(), and ~LanguageModel().
LMEntryType_3* LanguageModel::tri_lmData [protected] |
The tri-gram hash table.
Referenced by checkValidity(), get4GramP(), getAllP(), getLastWordID(), getP(), getTriGramIndex(), getTriGramP(), LanguageModel(), Shout_lm2bin::Shout_lm2bin(), and ~LanguageModel().
LMListSearch_3* LanguageModel::tri_lmDataList [protected] |
Referenced by getAllP(), LanguageModel(), and ~LanguageModel().
int LanguageModel::tri_startData [protected] |
int LanguageModel::tri_tableLength [protected] |
The length of the tri-gram (hash) lookup-table.
Referenced by checkValidity(), getP(), getTriGramP(), LanguageModel(), printInfo(), and Shout_lm2bin::Shout_lm2bin().
int LanguageModel::tri_tableLengthList [protected] |
LMEntryType_1* LanguageModel::uni_lmData [protected] |
The uni-gram direct lookup-table.
Referenced by LanguageModel_Segmenter::addPenalty(), LanguageModel_Segmenter::finishModel(), getAllP(), getBiGramP(), LanguageModel_Segmenter::getP(), getUnigram(), getUniGramP(), LanguageModel(), setUnigram(), Shout_lm2bin::Shout_lm2bin(), and ~LanguageModel().
int LanguageModel::uni_startData [protected] |
int LanguageModel::uni_tableLength [protected] |
The length of the uni-gram (direct) lookup-table.
Referenced by LanguageModel_Segmenter::addPenalty(), LanguageModel_Segmenter::finishModel(), getAllP(), getNumberOfWords(), LanguageModel_Segmenter::getP(), LanguageModel(), LanguageModel_Segmenter::LanguageModel_Segmenter(), printInfo(), setUnigram(), Shout_lm2bin::Shout_lm2bin(), and LanguageModel_Segmenter::transferTo().