vocabulary

class aitoolbox.nlp.core.vocabulary.Vocabulary(name, document_level=False)[source]

Bases: object

Vocabulary used for storing the tokens and converting between the indices and the tokens

Parameters:
  • name (str) – name or type of the vocabulary. Used only for tracking purposes.

  • document_level (bool) – whether the vocabulary is on the sentence level or on the document level. A document consists of multiple sentences. In effect, this means that additional tokens are added for the start and the end of the document.

add_sentence(sentence_tokens)[source]

Add a tokenized sentence to the vocabulary

Parameters:

sentence_tokens (list) – sentence tokens, e.g. list of words representing the sentence

Returns:

None

add_word(word)[source]

Add a single word to the vocabulary

Parameters:

word (str) – single word string

Returns:

None

trim(min_count)[source]

Remove words below a certain count threshold

Parameters:

min_count (int) – count threshold; words whose count is below it are removed

Returns:

None
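
The counting and trimming described for add_sentence, add_word and trim can be illustrated with a small standalone sketch. It does not use or reproduce the Vocabulary class; the helper names count_tokens and trim_counts are hypothetical, and keeping words whose count is exactly min_count is an assumption about the threshold boundary.

```python
from collections import Counter


def count_tokens(sentences):
    """Count how often each token occurs across tokenized sentences."""
    counts = Counter()
    for sentence_tokens in sentences:
        counts.update(sentence_tokens)
    return counts


def trim_counts(counts, min_count):
    """Drop tokens seen fewer than min_count times.

    Assumption: a token seen exactly min_count times is kept.
    """
    return Counter({word: n for word, n in counts.items() if n >= min_count})


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
counts = count_tokens(sentences)
trimmed = trim_counts(counts, min_count=2)
print(sorted(trimmed.items()))  # [('sat', 2), ('the', 3)]
```

Trimming rare words in this way is typically done once, after all sentences have been added, so that the counts are final.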

convert_sent2idx_sent(sent_tokens, start_end_token=True)[source]

Convert the given tokenized string sentence into the indices

Parameters:
  • sent_tokens (list) – string tokens forming a sentence

  • start_end_token (bool) – whether to add the start and end tokens to the converted sentence

Returns:

sentence tokens converted into the corresponding indices

Return type:

list
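
As a rough illustration of this kind of conversion, the sketch below maps string tokens to indices with a plain dictionary. It is not the library's implementation: the special token names (<PAD>, <OOV>, <SOS>, <EOS>), their index values, the out-of-vocabulary fallback, and the helpers build_word2index and tokens_to_indices are all assumptions made for the example.

```python
# Hypothetical special tokens and indices; the real Vocabulary may use others.
SPECIAL_TOKENS = {"<PAD>": 0, "<OOV>": 1, "<SOS>": 2, "<EOS>": 3}


def build_word2index(words):
    """Assign consecutive indices to words, after the special tokens."""
    word2index = dict(SPECIAL_TOKENS)
    for word in words:
        if word not in word2index:
            word2index[word] = len(word2index)
    return word2index


def tokens_to_indices(sent_tokens, word2index, start_end_token=True):
    """Map tokens to indices; unknown tokens fall back to the <OOV> index."""
    idx_sent = [word2index.get(tok, word2index["<OOV>"]) for tok in sent_tokens]
    if start_end_token:
        # Wrap the sentence in the start and end markers.
        idx_sent = [word2index["<SOS>"]] + idx_sent + [word2index["<EOS>"]]
    return idx_sent


word2index = build_word2index(["the", "cat", "sat"])
encoded = tokens_to_indices(["the", "cat", "ran"], word2index)
plain = tokens_to_indices(["the", "cat", "ran"], word2index, start_end_token=False)
print(encoded)  # [2, 4, 5, 1, 3]
print(plain)    # [4, 5, 1]
```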

convert_idx_sent2sent(idx_sent, rm_default_tokens=False)[source]

Convert from token indices back to the human-readable string tokens

Parameters:
  • idx_sent – token indices forming the sentence

  • rm_default_tokens (bool) – whether the default tokens, such as padding and start/end sentence tokens, should be removed from the result

Returns:

sentence represented as a sequence of the string tokens

Return type:

list
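
The reverse direction can be sketched in the same spirit. The example below is a standalone approximation, not the Vocabulary code: the index2word mapping, the special token names, and the choice to treat only padding and start/end tokens as removable defaults are assumptions.

```python
# Hypothetical mapping and special tokens, chosen only for this example.
index2word = {0: "<PAD>", 1: "<OOV>", 2: "<SOS>", 3: "<EOS>",
              4: "the", 5: "cat", 6: "sat"}
# Assumption: only padding and start/end markers count as removable defaults.
REMOVABLE_TOKENS = {"<PAD>", "<SOS>", "<EOS>"}


def indices_to_tokens(idx_sent, index2word, rm_default_tokens=False):
    """Map indices back to string tokens, optionally dropping default tokens."""
    tokens = [index2word[idx] for idx in idx_sent]
    if rm_default_tokens:
        tokens = [tok for tok in tokens if tok not in REMOVABLE_TOKENS]
    return tokens


padded = [2, 4, 5, 6, 3, 0, 0]
decoded = indices_to_tokens(padded, index2word)
cleaned = indices_to_tokens(padded, index2word, rm_default_tokens=True)
print(decoded)  # ['<SOS>', 'the', 'cat', 'sat', '<EOS>', '<PAD>', '<PAD>']
print(cleaned)  # ['the', 'cat', 'sat']
```

Removing the default tokens is mainly useful when the decoded sentence is shown to a person or compared against reference text, where padding and boundary markers carry no meaning.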