vocabulary

class aitoolbox.nlp.core.vocabulary.Vocabulary(name, document_level=False)[source]

Bases: object

Vocabulary that stores tokens and converts between tokens and their indices

Parameters
  • name (str) – name or type of the vocabulary. Used only for tracking purposes

  • document_level (bool) – whether the vocabulary operates on the document level instead of the sentence level. A document consists of multiple sentences, so a document-level vocabulary adds extra tokens marking the start and the end of the document.
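The following is a minimal, self-contained sketch of the idea behind the class: a token-to-index mapping seeded with default tokens, with two extra tokens when document_level is enabled. It is not the aitoolbox implementation. The class name SketchVocabulary, the attribute names word2index, index2word and word2count, and the token strings such as <PAD> or <SOD> are all assumptions made for illustration.

```python
class SketchVocabulary:
    """Illustrative stand-in, NOT the aitoolbox Vocabulary implementation."""

    def __init__(self, name, document_level=False):
        self.name = name
        self.document_level = document_level
        # Assumed default tokens; the real class may use other strings and indices.
        default_tokens = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]
        if document_level:
            # Extra tokens marking the start and the end of a document.
            default_tokens += ["<SOD>", "<EOD>"]
        self.word2index = {tok: idx for idx, tok in enumerate(default_tokens)}
        self.index2word = {idx: tok for tok, idx in self.word2index.items()}
        self.word2count = {}

    def add_word(self, word):
        # A new word gets the next free index; every call bumps its count.
        if word not in self.word2index:
            idx = len(self.word2index)
            self.word2index[word] = idx
            self.index2word[idx] = word
        self.word2count[word] = self.word2count.get(word, 0) + 1

    def add_sentence(self, sentence_tokens):
        for word in sentence_tokens:
            self.add_word(word)


sentence_vocab = SketchVocabulary("demo")
doc_vocab = SketchVocabulary("demo_docs", document_level=True)
sentence_vocab.add_sentence(["the", "cat", "sat", "on", "the", "mat"])
```

In this sketch the document-level vocabulary simply starts with two more entries than the sentence-level one, which mirrors the behaviour described for document_level above.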

add_sentence(sentence_tokens)[source]

Add a tokenized sentence to the vocabulary

Parameters

sentence_tokens (list) – sentence tokens, e.g. list of words representing the sentence

Returns

None

add_word(word)[source]

Add a single word to the vocabulary

Parameters

word (str) – single word string

Returns

None

trim(min_count)[source]

Remove words whose count is below a given threshold

Parameters

min_count (int) – count threshold; words that occur fewer times than this are removed

Returns

None
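As a rough illustration of what trimming by count involves, the sketch below filters a word-count table and rebuilds the index mapping. The helper trim_vocabulary, its default_tokens argument and the re-indexing step are hypothetical; the source does not say whether trim() reassigns indices or how it treats words that occur exactly min_count times (here they are kept).

```python
from collections import Counter


def trim_vocabulary(word_counts, min_count,
                    default_tokens=("<PAD>", "<SOS>", "<EOS>", "<UNK>")):
    """Hypothetical helper: keep words seen at least min_count times."""
    # Start from the (assumed) default tokens so they keep the lowest indices.
    word2index = {tok: idx for idx, tok in enumerate(default_tokens)}
    for word, count in word_counts.items():
        if count >= min_count:
            word2index[word] = len(word2index)
    return word2index


counts = Counter(["the", "cat", "sat", "on", "the", "mat", "the", "cat"])
trimmed = trim_vocabulary(counts, min_count=2)
```

With these counts only "the" and "cat" survive next to the default tokens; the rarer words would typically be mapped to an unknown-word token afterwards.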

convert_sent2idx_sent(sent_tokens, start_end_token=True)[source]

Convert a tokenized sentence of string tokens into the corresponding indices

Parameters
  • sent_tokens (list) – string tokens forming a sentence

  • start_end_token (bool) – presumably whether the start and end sentence tokens are added around the converted sentence

Returns

sentence tokens converted into the corresponding indices

Return type

list
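The conversion performed by convert_sent2idx_sent can be pictured with the standalone sketch below. The function sent_to_indices, the word2index dictionary and the token strings are assumptions for illustration; in particular, mapping unseen words to <UNK> and wrapping the result in <SOS>/<EOS> when start_end_token is True are guesses about the behaviour, not documented facts.

```python
def sent_to_indices(sent_tokens, word2index, start_end_token=True):
    """Hypothetical helper mirroring the documented signature."""
    unk_idx = word2index["<UNK>"]
    # Assumption: words missing from the vocabulary fall back to <UNK>.
    indices = [word2index.get(tok, unk_idx) for tok in sent_tokens]
    if start_end_token:
        # Assumption: the flag wraps the sentence in start/end tokens.
        indices = [word2index["<SOS>"]] + indices + [word2index["<EOS>"]]
    return indices


word2index = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3, "the": 4, "cat": 5}
wrapped = sent_to_indices(["the", "cat", "purrs"], word2index)
bare = sent_to_indices(["the", "cat", "purrs"], word2index, start_end_token=False)
```

Here wrapped is [1, 4, 5, 3, 2] and bare is [4, 5, 3]; "purrs" is not in the mapping, so it becomes the <UNK> index.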

convert_idx_sent2sent(idx_sent, rm_default_tokens=False)[source]

Convert from token indices back to the human-readable string tokens

Parameters
  • idx_sent – token indices forming the sentence

  • rm_default_tokens (bool) – whether the default tokens, such as the padding and start/end sentence tokens, should be removed from the result

Returns

sentence represented as a sequence of the string tokens

Return type

list
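The reverse direction, as done by convert_idx_sent2sent, can be sketched in the same spirit. The function indices_to_sent, the index2word dictionary and the set of tokens treated as "default" are hypothetical; whether the real method also strips an unknown-word token is not stated in the source, so this sketch leaves it in place.

```python
def indices_to_sent(idx_sent, index2word, rm_default_tokens=False,
                    default_tokens=("<PAD>", "<SOS>", "<EOS>")):
    """Hypothetical helper mirroring the documented signature."""
    # int() lets the sketch accept plain ints as well as integer-like scalars.
    tokens = [index2word[int(idx)] for idx in idx_sent]
    if rm_default_tokens:
        tokens = [tok for tok in tokens if tok not in default_tokens]
    return tokens


index2word = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "<UNK>", 4: "the", 5: "cat"}
padded = [1, 4, 5, 2, 0, 0]
raw_tokens = indices_to_sent(padded, index2word)
clean_tokens = indices_to_sent(padded, index2word, rm_default_tokens=True)
```

Here raw_tokens keeps the start, end and padding tokens, while clean_tokens is just ["the", "cat"], which is usually what you want when printing model output.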