Metric评价指标-Perplexity语言模型

Source

欢迎关注知乎：世界是我改变的

知乎上的原文链接

一. 原理介绍

在研究生实习时候就做过语言模型的任务，当时让求PPL值，当时只是调包，不求甚解，哈哈哈，当时也没想到现在会开发这个评价指标，那现在我来讲一下我对这个指标的了解，望各位大佬多多指教。

1. 这个困惑度是如何发展来的呢？

在得到不同的语言模型（一元语言模型、二元语言模型....）的时候，我们如何判断一个语言模型是否好还是坏，一般有两种方法：

一种方法将其应用到具体的问题当中，比如机器翻译、speech recognition、spelling corrector等。然后看这个语言模型在这些任务中的表现（extrinsic evaluation，or in-vivo evaluation）。但是，这种方法一方面难以操作，另一方面可能非常耗时，可能跑一个evaluation需要大量时间，费时难操作。
针对第一种方法的缺点，大家想是否可以根据与语言模型自身的一些特性，来设计一种简单易行，而又行之有效的评测指标。于是，人们就发明了perplexity这个指标。

简单说一下语言模型：语言模型（Language Model，LM），给出一句话的前k个词，希望它可以预测第k+1个词是什么，即给出一个第k+1个词可能出现的概率的分布p(xk+1|x1,x2,...,xk)。

2. 困惑度的概念

困惑度的计算是基于单词的，在自然语言处理中，困惑度是用来衡量语言概率模型优劣的一个方法。一个语言概率模型可以看成是在整过句子或者文段上的概率分布。例如每个分词位置上有一个概率分布，这个概率分布表示了每个词在这个位置上出现的概率；或者每个句子位置上有一个概率分布，这个概率分布表示了所有可能句子在这个位置上出现的概率。

困惑度（perplexity）的基本思想是：给测试集的句子赋予较高概率值的语言模型较好,当语言模型训练完之后，测试集中的句子都是正常的句子，那么训练好的模型就是在测试集上的概率越高越好，公式如下：

结论：困惑度越小，句子概率越大，语言模型越好。

下面是一些 ngram 模型经训练文本后在测试集上的困惑度值：

preview

可以看到，之前我们学习的 trigram 模型经训练后，困惑度由955跌减至74，这是十分可观的结果。在Brown corpus (1 million words of American English of varying topics and genres) 上报告的最低的困惑度就是使用一个trigram model（三元语法模型）。在一个特定领域的语料中，常常可以得到更低的困惑度。

二. 使用场景

我们在想衡量语言模型的好坏，然后对语言模型输出的句子进行评价。场景很简单，话不多说，上代码。

三. MindSpore的Perplexity代码实现

原理就是上面说的那些，比较好理解，话不多说直接上代码。使用的是MindSpore框架实现的代码。

MindSpore代码实现

"""Perplexity"""
import math
import numpy as np
from mindspore._checkparam import Validator as validator
from .metric import Metric


class Perplexity(Metric):
   
    def __init__(self, ignore_label=None):
        super(Perplexity, self).__init__()

        if ignore_label is None:
            self.ignore_label = ignore_label
        else:
            self.ignore_label = validator.check_value_type("ignore_label", ignore_label, [int])
        self.clear()

    def clear(self):
        """清除历史数据"""
        self._sum_metric = 0.0
        self._num_inst = 0

    def update(self, *inputs):
        # 输入数量的校验
        if len(inputs) != 2:
            raise ValueError('Perplexity needs 2 inputs (preds, labels), but got {}.'.format(len(inputs)))
        
        # 将输入统一转化为numpy，变为list
        preds = [self._convert_data(inputs[0])]
        labels = [self._convert_data(inputs[1])]

        if len(preds) != len(labels):
            raise RuntimeError('preds and labels should have the same length, but the length of preds is{}, '
                               'the length of labels is {}.'.format(len(preds), len(labels)))

        loss = 0.
        num = 0
        # 遍历句子中的每一个单词，再进行维度的变换
        for label, pred in zip(labels, preds):
            if label.size != pred.size / pred.shape[-1]:
                raise RuntimeError("shape mismatch: label shape should be equal to pred shape, but got label shape "
                                   "is {}, pred shape is {}.".format(label.shape, pred.shape))
            # 将label进行重塑
            label = label.reshape((label.size,))
            # 将label_expand转为int类型
            label_expand = label.astype(int)
            # label_expand扩展一维
            label_expand = np.expand_dims(label_expand, axis=1)
            first_indices = np.arange(label_expand.shape[0])[:, None]
            pred = np.squeeze(pred[first_indices, label_expand])
            if self.ignore_label is not None:
                ignore = (label == self.ignore_label).astype(pred.dtype)
                num -= np.sum(ignore)
                pred = pred * (1 - ignore) + ignore
            loss -= np.sum(np.log(np.maximum(1e-10, pred)))
            num += pred.size
        self._sum_metric += loss
        self._num_inst += num

    def eval(self):
        # 分母不能为0
        if self._num_inst == 0:
            raise RuntimeError('Perplexity can not be calculated, because the number of samples is 0.')

        return math.exp(self._sum_metric / self._num_inst)

使用方法如下：

import numpy as np
from mindspore import Tensor
from mindspore.nn.metrics import Perplexity

x = Tensor(np.array([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]]))
y = Tensor(np.array([1, 0, 1]))
metric = Perplexity(ignore_label=None)
metric.clear()
metric.update(x, y)
perplexity = metric.eval()
print(perplexity)

2.231443166940565

每个batch（比如两组数据）进行计算的时候如下：

import numpy as np
from mindspore import Tensor
from mindspore.nn.metrics import Perplexity

metric = Perplexity(ignore_label=None)
metric.clear()
x = Tensor(np.array([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]]))
y = Tensor(np.array([1, 0, 1]))
metric.update(x, y)

x1= Tensor(np.array([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]]))
y1 = Tensor(np.array([1, 0, 1]))
metric.update(x1, y1)
perplexity = metric.eval()
print(perplexity)

Perplexity的影响因素

训练数据集越大，PPL会下降得更低，1billion dataset和10万dataset训练效果是很不一样的；
数据中的标点会对模型的PPL产生很大影响，一个句号能让PPL波动几十，标点的预测总是不稳定；
预测语句中的“的，了”等词也对PPL有很大影响，可能“我借你的书”比“我借你书”的指标值小几十，但从语义上分析有没有这些停用词并不能完全代表句子生成的好坏。

所以，语言模型评估时我们可以用perplexity大致估计训练效果，作出判断和分析，但它不是完全意义上的标准，具体问题还是要具体分析。