vocabulary

class aitoolbox.nlp.core.vocabulary.Vocabulary(name, document_level=False)

Vocabulary used for storing the tokens and converting between the indices and the tokens

Parameters:

name (str) – name or type of the vocabulary. Used only for tracking purposes.
document_level (bool) – whether the vocabulary operates at the sentence level or at the document level. A document consists of multiple sentences, so a document-level vocabulary adds extra tokens marking the start and the end of the document.
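
As a rough illustration of what the document_level flag implies, the sketch below builds a default token table with and without document boundary tokens. The helper `build_default_tokens` and the token strings (`<PAD>`, `<SOS>`, `<EOS>`, `<UNK>`, `<SOD>`, `<EOD>`) are assumptions made for this example and are not taken from aitoolbox.

```python
def build_default_tokens(document_level=False):
    # Token strings are hypothetical placeholders chosen for this sketch.
    tokens = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]
    if document_level:
        # A document-level vocabulary additionally marks document boundaries.
        tokens += ["<SOD>", "<EOD>"]
    # Assign consecutive indices in insertion order.
    return {token: index for index, token in enumerate(tokens)}


sentence_level = build_default_tokens(document_level=False)
document_level = build_default_tokens(document_level=True)
print(len(sentence_level), len(document_level))
```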

add_sentence(sentence_tokens)

Add tokenized sentence to the vocabulary

Parameters: sentence_tokens (list) – sentence tokens, e.g. a list of words representing the sentence
Returns: None

add_word(word)

Add a single word to the vocabulary
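
One plausible way add_sentence and add_word fit together is sketched below: each new word gets the next free index, and repeated words only increase a counter. The class `CountingVocab` and its attributes (`word2index`, `index2word`, `word2count`) are hypothetical names for this illustration, not the confirmed internals of Vocabulary.

```python
class CountingVocab:
    """Hypothetical stand-in used only to illustrate the idea."""

    def __init__(self):
        self.word2index = {}
        self.index2word = {}
        self.word2count = {}

    def add_word(self, word):
        if word not in self.word2index:
            # First occurrence: assign the next free index.
            index = len(self.word2index)
            self.word2index[word] = index
            self.index2word[index] = word
            self.word2count[word] = 1
        else:
            # Repeated occurrence: only the count changes.
            self.word2count[word] += 1

    def add_sentence(self, sentence_tokens):
        # A tokenized sentence is added one word at a time.
        for word in sentence_tokens:
            self.add_word(word)


vocab = CountingVocab()
vocab.add_sentence(["the", "cat", "saw", "the", "dog"])
print(vocab.word2index)
print(vocab.word2count)
```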

trim(min_count)

Remove words whose count is below the min_count threshold
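
Count-based trimming can be sketched as a filter followed by re-indexing. The function `trim_counts` is a hypothetical helper, and treating the threshold as inclusive (words seen exactly min_count times are kept) is an assumption of this example rather than documented behaviour.

```python
from collections import Counter


def trim_counts(word_counts, min_count):
    # Assumption: a word survives if it was seen at least min_count times.
    kept = [word for word, count in word_counts.items() if count >= min_count]
    # Re-index the surviving words so the indices stay contiguous.
    return {word: index for index, word in enumerate(kept)}


counts = Counter(["the", "cat", "saw", "the", "dog", "cat"])
word2index = trim_counts(counts, min_count=2)
print(word2index)
```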

convert_sent2idx_sent(sent_tokens, start_end_token=True)

Convert the given tokenized string sentence into the indices

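Parameters:

sent_tokens – tokenized string sentence to convert
start_end_token (bool) – whether the start and end of sentence tokens are included in the converted sentence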

Returns:

sentence tokens converted into the corresponding indices

Return type:

list
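
A minimal sketch of the token-to-index conversion follows. The function `sent_to_indices`, the token strings, and the fallback of unknown words to `<UNK>` are assumptions for illustration; the source does not state how out-of-vocabulary words are handled.

```python
def sent_to_indices(sent_tokens, word2index, start_end_token=True):
    # Assumption: words missing from the vocabulary map to an unknown token.
    unk_index = word2index["<UNK>"]
    indices = [word2index.get(token, unk_index) for token in sent_tokens]
    if start_end_token:
        # Wrap the sentence in start and end markers.
        indices = [word2index["<SOS>"]] + indices + [word2index["<EOS>"]]
    return indices


word2index = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3, "the": 4, "cat": 5}
with_markers = sent_to_indices(["the", "cat", "purrs"], word2index)
without_markers = sent_to_indices(["the", "cat", "purrs"], word2index, start_end_token=False)
print(with_markers, without_markers)
```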

convert_idx_sent2sent(idx_sent, rm_default_tokens=False)

Convert from token indices back to the human-readable string tokens

Parameters:

idx_sent – token indices forming the sentence
rm_default_tokens (bool) – whether default tokens, such as padding and start/end of sentence tokens, should be removed from the result

Returns:

sentence represented as a sequence of the string tokens

Return type:

list
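
The reverse conversion can be sketched in the same spirit. The function `indices_to_sent` and the token strings are hypothetical; which tokens count as "default" in the real class is not specified beyond padding and start/end of sentence tokens.

```python
def indices_to_sent(idx_sent, index2word, rm_default_tokens=False):
    # Hypothetical set of default tokens for this sketch.
    default_tokens = {"<PAD>", "<SOS>", "<EOS>"}
    tokens = [index2word[index] for index in idx_sent]
    if rm_default_tokens:
        # Drop padding and sentence boundary markers from the output.
        tokens = [token for token in tokens if token not in default_tokens]
    return tokens


index2word = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "the", 4: "cat"}
idx_sent = [1, 3, 4, 2, 0, 0]
raw_tokens = indices_to_sent(idx_sent, index2word)
clean_tokens = indices_to_sent(idx_sent, index2word, rm_default_tokens=True)
print(raw_tokens, clean_tokens)
```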