NLP_metrics

class aitoolbox.nlp.experiment_evaluation.NLP_metrics.ROUGEMetric(y_true, y_predicted, target_actual_text=False, output_text_dir=None, output_text_cleaning_regex=('<.*?>', '[^a-zA-Z0-9.?! ]+'))[source]

Bases: AbstractBaseMetric

ROUGE score calculation

From this package:

https://github.com/pltrdy/rouge
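
A minimal sketch of how the linked rouge package is used on its own (the exact scoring configuration inside ROUGEMetric may differ):

    from rouge import Rouge

    hypotheses = ["the cat was found under the bed"]
    references = ["the cat was under the bed"]

    rouge = Rouge()
    # avg=True averages ROUGE-1/2/L precision, recall and F1 over all hypothesis/reference pairs
    scores = rouge.get_scores(hypotheses, references, avg=True)
    print(scores['rouge-l']['f'])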

Parameters:
  • y_true (numpy.array or list) –

  • y_predicted (numpy.array or list) –

  • target_actual_text (bool) –

  • output_text_dir (str) –

  • output_text_cleaning_regex (list) –

calculate_metric()[source]

Perform the metric calculation and return the result.

Returns:

calculated metric result

Return type:

float or dict

prepare_text()[source]

static dump_answer_text_to_disk(true_text, pred_text, output_text_dir, output_text_cleaning_regex, target_actual_text)[source]

Problems:

Regex text cleaning was defined to deal with the ‘Illegal division by zero’ error described here: https://ireneli.eu/2018/01/11/working-with-rouge-1-5-5-evaluation-metric-in-python/

Parameters:
  • true_text (list) –

  • pred_text (list) –

  • output_text_dir (str) –

  • output_text_cleaning_regex (list) –

  • target_actual_text (bool) –

Returns:

class aitoolbox.nlp.experiment_evaluation.NLP_metrics.ROUGEPerlMetric(y_true, y_predicted, output_text_dir, output_text_cleaning_regex=('<.*?>', '[^a-zA-Z0-9.?! ]+'), target_actual_text=False)[source]

Bases: AbstractBaseMetric

ROUGE score calculation using the Perl implementation

Uses this package:

https://pypi.org/project/pyrouge/
https://github.com/bheinzerling/pyrouge
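
For orientation, pyrouge (which drives the ROUGE-1.5.5 Perl script) is typically used along these lines; the directories and filename patterns below are illustrative placeholders, not the ones ROUGEPerlMetric writes itself:

    from pyrouge import Rouge155

    r = Rouge155()
    # one summary per file in each directory (illustrative paths)
    r.system_dir = 'path/to/system_summaries'    # generated summaries
    r.model_dir = 'path/to/model_summaries'      # gold standard summaries
    r.system_filename_pattern = r'summary.(\d+).txt'
    r.model_filename_pattern = 'summary.[A-Z].#ID#.txt'

    output = r.convert_and_evaluate()
    result_dict = r.output_to_dict(output)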

Problems:

Regex text cleaning was defined to deal with the ‘Illegal division by zero’ error described here: https://ireneli.eu/2018/01/11/working-with-rouge-1-5-5-evaluation-metric-in-python/

Parameters:
  • y_true (numpy.array or list) – gold standard summaries; in pyrouge terminology these are the ‘model’ summaries

  • y_predicted (numpy.array or list) – predicted summaries; in pyrouge terminology these are the ‘system’ summaries

  • output_text_dir (str) –

  • output_text_cleaning_regex (list) –

  • target_actual_text (bool) –

calculate_metric()[source]

Perform the metric calculation and return the result.

Returns:

calculated metric result

Return type:

float or dict

static dump_answer_text_to_disk(true_text, pred_text, output_text_dir, output_text_cleaning_regex, target_actual_text)[source]

Problems:

Regex text cleaning was defined to deal with the ‘Illegal division by zero’ error described here: https://ireneli.eu/2018/01/11/working-with-rouge-1-5-5-evaluation-metric-in-python/

Parameters:
  • true_text (list) –

  • pred_text (list) –

  • output_text_dir (str) –

  • output_text_cleaning_regex (list) –

  • target_actual_text (bool) –

Returns:

static regex_clean_text(text, cleaning_regex_list)[source]

Parameters:
  • text (list) –

  • cleaning_regex_list (list) –

Return type:

list
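
The cleaning itself amounts to applying each regex in turn with re.sub; a minimal sketch of that idea using the module’s default patterns (the helper name below is hypothetical):

    import re

    def clean_text_lines(text_lines, cleaning_regex_list=('<.*?>', '[^a-zA-Z0-9.?! ]+')):
        # hypothetical helper mirroring regex_clean_text: strip every pattern from every line
        cleaned = []
        for line in text_lines:
            for pattern in cleaning_regex_list:
                line = re.sub(pattern, '', line)
            cleaned.append(line)
        return cleaned

    clean_text_lines(['<b>Result:</b> 42% correct!'])  # ['Result 42 correct!']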

class aitoolbox.nlp.experiment_evaluation.NLP_metrics.ExactMatchTextMetric(y_true, y_predicted, target_actual_text=False, output_text_dir=None)[source]

Bases: AbstractBaseMetric

Calculate exact match of answered strings

Parameters:
  • y_true (numpy.array or list) –

  • y_predicted (numpy.array or list) –

  • target_actual_text (bool) –

  • output_text_dir (str) –

calculate_metric()[source]

Perform the metric calculation and return the result.

Returns:

calculated metric result

Return type:

float or dict

static normalize_answer(text_str)[source]

Convert to lowercase and remove punctuation, articles and extra whitespace.

This method and the ones following it are taken from the official SQuAD 2.0 eval script: https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/

Parameters:

text_str (str) –

Returns:

str
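
For reference, the normalization in the SQuAD 2.0 eval script boils down to the steps sketched below, and the exact match score is then simply equality of the normalized strings (the function name here is a standalone re-implementation for illustration):

    import re
    import string

    def squad_normalize_answer(s):
        # lowercase, drop punctuation, drop articles, collapse whitespace
        s = s.lower()
        s = ''.join(ch for ch in s if ch not in set(string.punctuation))
        s = re.sub(r'\b(a|an|the)\b', ' ', s)
        return ' '.join(s.split())

    # exact match of a gold and a predicted answer, SQuAD style
    int(squad_normalize_answer('The Cat!') == squad_normalize_answer('cat'))  # 1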

class aitoolbox.nlp.experiment_evaluation.NLP_metrics.F1TextMetric(y_true, y_predicted, target_actual_text=False, output_text_dir=None)[source]

Bases: AbstractBaseMetric

Calculate F1 score of answered strings

Parameters:
  • y_true (numpy.array or list) –

  • y_predicted (numpy.array or list) –

  • target_actual_text (bool) –

  • output_text_dir (str) –

calculate_metric()[source]

Perform the metric calculation and return the result.

Returns:

calculated metric result

Return type:

float or dict

static compute_f1(a_gold, a_pred)[source]

static get_tokens(s)[source]
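
These two helpers follow the token-level F1 from the SQuAD 2.0 eval script referenced above; roughly (reusing squad_normalize_answer from the earlier sketch):

    import collections

    def squad_token_f1(a_gold, a_pred):
        # token-overlap F1 as in the SQuAD eval script (illustrative re-implementation)
        gold_toks = squad_normalize_answer(a_gold).split()
        pred_toks = squad_normalize_answer(a_pred).split()
        common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
        num_same = sum(common.values())
        if len(gold_toks) == 0 or len(pred_toks) == 0:
            return float(gold_toks == pred_toks)
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_toks)
        recall = num_same / len(gold_toks)
        return 2 * precision * recall / (precision + recall)
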
class aitoolbox.nlp.experiment_evaluation.NLP_metrics.BLEUSentenceScoreMetric(y_true, y_predicted, source_sents=None, output_text_dir=None)[source]

Bases: AbstractBaseMetric

BLEU score calculation

NLTK provides the sentence_bleu() function for evaluating a candidate sentence against one or more reference sentences.

https://machinelearningmastery.com/calculate-bleu-score-for-text-python/

The reference sentences must be provided as a list of sentences where each reference is a list of tokens. The candidate sentence is provided as a list of tokens. For example:

    from nltk.translate.bleu_score import sentence_bleu

    reference = [['this', 'is', 'a', 'test'], ['this', 'is', 'test']]
    candidate = ['this', 'is', 'a', 'test']
    score = sentence_bleu(reference, candidate)

Parameters:
  • y_true (list) –

  • y_predicted (list) –

  • source_sents (list or None) –

  • output_text_dir (str or None) –

calculate_metric()[source]

Perform the metric calculation and return the result.

Returns:

calculated metric result

Return type:

float or dict

static dump_translation_text_to_disk(source_sents, pred_translations, true_translations, sentence_bleu_results, output_text_dir)[source]

Parameters:
  • source_sents (list) –

  • pred_translations (list) –

  • true_translations (list) –

  • sentence_bleu_results (list) –

  • output_text_dir (str) –

Returns:

static check_transl_sent_num_match(sent_types)[source]

Parameters:

sent_types (list) – list of lists

Raises:

ValueError

class aitoolbox.nlp.experiment_evaluation.NLP_metrics.BLEUCorpusScoreMetric(y_true, y_predicted, source_sents=None, output_text_dir=None)[source]

Bases: AbstractBaseMetric

BLEU corpus score calculation

NLTK provides the corpus_bleu() function for calculating the BLEU score for multiple sentences such as a paragraph or a document.

https://machinelearningmastery.com/calculate-bleu-score-for-text-python/

The references must be specified as a list of documents where each document is a list of references and each alternative reference is a list of tokens, e.g. a list of lists of lists of tokens. The candidate documents must be specified as a list where each document is a list of tokens, e.g. a list of lists of tokens.

    from nltk.translate.bleu_score import corpus_bleu

    references = [[['this', 'is', 'a', 'test'], ['this', 'is', 'test']]]
    candidates = [['this', 'is', 'a', 'test']]
    score = corpus_bleu(references, candidates)

Parameters:
  • y_true (list) –

  • y_predicted (list) –

  • source_sents (list or None) –

  • output_text_dir (str or None) –

calculate_metric()[source]

Perform the metric calculation and return the result.

Returns:

calculated metric result

Return type:

float or dict

class aitoolbox.nlp.experiment_evaluation.NLP_metrics.BLEUScoreStrTorchNLPMetric(y_true, y_predicted, lowercase=False, source_sents=None, output_text_dir=None)[source]

Bases: AbstractBaseMetric

BLEU score calculation using the TorchNLP implementation

Example

    from torchnlp.metrics import get_moses_multi_bleu

    hypotheses = [
        "The brown fox jumps over the dog 笑",
        "The brown fox jumps over the dog 2 笑"
    ]
    references = [
        "The quick brown fox jumps over the lazy dog 笑",
        "The quick brown fox jumps over the lazy dog 笑"
    ]

    get_moses_multi_bleu(hypotheses, references, lowercase=True)  # 46.51

Parameters:
  • y_true (list) –

  • y_predicted (list) –

  • lowercase (bool) –

  • source_sents (list or None) –

  • output_text_dir (str or None) –

calculate_metric()[source]

Perform the metric calculation and return the result.

Returns:

calculated metric result

Return type:

float or dict

class aitoolbox.nlp.experiment_evaluation.NLP_metrics.PerplexityMetric(batch_losses)[source]

Bases: AbstractBaseMetric

Perplexity metric used in MT

Parameters:

batch_losses (numpy.array or list) –

calculate_metric()[source]

Perform the metric calculation and return the result.

Returns:

calculated metric result

Return type:

float or dict
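
Assuming batch_losses holds per-batch cross-entropy values, perplexity is conventionally the exponential of their mean; a minimal sketch of that calculation (not necessarily byte-for-byte what PerplexityMetric does):

    import numpy as np

    def perplexity_from_losses(batch_losses):
        # perplexity = exp(average cross-entropy loss)
        return float(np.exp(np.mean(batch_losses)))

    perplexity_from_losses([2.1, 1.9, 2.0])  # ~7.39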

class aitoolbox.nlp.experiment_evaluation.NLP_metrics.GLUEMetric(y_true, y_predicted, task_name)[source]

Bases: AbstractBaseMetric

GLUE evaluation metrics

Wrapper around HF Transformers glue_compute_metrics()

Parameters:
  • y_true

  • y_predicted

  • task_name (str) – name of the GLUE task
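
In older HF Transformers releases, the wrapped glue_compute_metrics() is called roughly as below; check whether your installed transformers version still exposes it, since the helper has since been deprecated:

    import numpy as np
    from transformers import glue_compute_metrics  # present in older transformers releases

    labels = np.array([1, 0, 1, 1])
    preds = np.array([1, 0, 0, 1])

    # e.g. for MRPC this returns accuracy and F1
    glue_compute_metrics('mrpc', preds, labels)  # {'acc': ..., 'f1': ..., 'acc_and_f1': ...}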

calculate_metric()[source]

Perform the metric calculation and return the result.

Returns:

calculated metric result

Return type:

float or dict

class aitoolbox.nlp.experiment_evaluation.NLP_metrics.XNLIMetric(y_true, y_predicted)[source]

Bases: AbstractBaseMetric

XNLI evaluation metrics

Wrapper around HF Transformers xnli_compute_metrics()

Parameters:
  • y_true

  • y_predicted

calculate_metric()[source]

Perform the metric calculation and return the result.

Returns:

calculated metric result

Return type:

float or dict