Evaluating Perplexity on Language Models

Repost information

Original article: https://machinelearningmastery.com/evaluating-perplexity-on-language-models/

Original author: Adrian Tam

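A language model is a probability distribution defined over sequences of tokens. When you train a language model, you want to measure how accurately it predicts human language use. This is a difficult task, and you need a metric to evaluate the model. In this article, you will learn about the perplexity metric. Specifically, you will learn:

What perplexity is and how to compute it
How to evaluate the perplexity of a language model with sample data

Let's get started.

Evaluating Perplexity on Language Models
Photo by Lucas Davis. Some rights reserved.

Overview

This article is divided into two parts:

What perplexity is and how to compute it
Evaluating the perplexity of a language model with the HellaSwag dataset

What Perplexity Is and How to Compute It

Perplexity is a measure of how well a language model predicts a sample of text. It is defined as the inverse of the geometric mean of the probabilities of the tokens in the sample. Mathematically, perplexity is defined as: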

$$
PPL(x_{1:L}) = \prod_{i=1}^L p(x_i)^{-1/L} = \exp\big(-\frac{1}{L} \sum_{i=1}^L \log p(x_i)\big)
$$

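Perplexity is a function of a specific sequence of tokens, where $p(x_i)$ is the probability the model assigns to token $x_i$ given the tokens before it. In practice, it is more convenient to compute perplexity from the mean of the log probabilities, as in the right-hand side of the formula above.

Perplexity quantifies how much, on average, a language model hesitates about the next token. If the model is absolutely certain, the perplexity is 1. If the model is completely uncertain, every token in the vocabulary is equally likely and the perplexity equals the vocabulary size. For a reasonably trained model, you should not expect the perplexity to fall outside this range.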
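The two boundary cases above can be checked with a few lines of plain Python. This is a minimal sketch: the function name `perplexity` and the toy probability lists are illustrative assumptions, not part of the original article.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence, given the probability the model assigned to each token."""
    # Mean of the log probabilities, then exponentiate the negative mean
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-mean_log_prob)

# Hypothetical probabilities, chosen only to illustrate the formula
certain = perplexity([1.0, 1.0, 1.0])    # model is sure of every token: perplexity 1
uniform = perplexity([1 / 8] * 5)        # uniform over 8 tokens: perplexity close to 8
mixed = perplexity([0.5, 0.25, 0.125])   # geometric mean is 0.25: perplexity close to 4

print(certain, uniform, mixed)
```

The sequence length does not matter in the uniform case: five tokens or five hundred give the same perplexity, because the metric is an average per token.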

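Evaluating the Perplexity of a Language Model with the HellaSwag Dataset

Perplexity is a dataset-dependent metric. HellaSwag is one dataset you can use for evaluation; it comes with train, test, and validation splits. It is available on the Hugging Face Hub, and you can load it with the following code: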

import datasets

dataset = datasets.load_dataset("HuggingFaceFW/hellaswag")
print(dataset)
for sample in dataset["validation"]:
    print(sample)
    break

Running this code prints the following:

DatasetDict({
    train: Dataset({
        features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings',
                   'source_id', 'split', 'split_type', 'label'],
        num_rows: 39905
    })
    test: Dataset({
        features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings',
                   'source_id', 'split', 'split_type', 'label'],
        num_rows: 10003
    })
    validation: Dataset({
        features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings',
                   'source_id', 'split', 'split_type', 'label'],
        num_rows: 10042
    })
})
{'ind': 24, 'activity_label': 'Roof shingle removal',
 'ctx_a': 'A man is sitting on a roof.', 'ctx_b': 'he',
 'ctx': 'A man is sitting on a roof. he', 'endings': [
     'is using wrap to wrap a pair of skis.', 'is ripping level tiles off.',
     "is holding a rubik's cube.", 'starts pulling up roofing on a roof.'
 ], 'source_id': 'activitynet~v_-JhWjGDPHMY', 'split': 'val', 'split_type': 'indomain',
 'label': '3'}

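You can see that the validation split has 10,042 samples. This is the split used in this article. Each sample is a dictionary. The key "activity_label" describes the activity category, and the key "ctx" provides the context to be completed. The model is expected to complete the sequence by choosing one of the four endings. The key "label" holds a string from "0" to "3" that indicates which ending is correct.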
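To make the sample layout concrete, here is a small sketch that rebuilds the correct completion from the sample printed above. The dictionary is copied by hand from that output (with unused keys left out) so the snippet runs without downloading the dataset; the variable names are illustrative.

```python
# Sample copied from the printed output above; keys not needed here are omitted
sample = {
    "activity_label": "Roof shingle removal",
    "ctx": "A man is sitting on a roof. he",
    "endings": [
        "is using wrap to wrap a pair of skis.",
        "is ripping level tiles off.",
        "is holding a rubik's cube.",
        "starts pulling up roofing on a roof.",
    ],
    "label": "3",
}

# The label is stored as a string, so convert it before indexing
groundtruth = int(sample["label"])

# Activity label and context form the prompt; the ending completes it
prompt = sample["activity_label"] + ". " + sample["ctx"]
completed = prompt + " " + sample["endings"][groundtruth]
print(completed)
```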

With this information, you can write a short program to evaluate your own language model. Let's use a small model from Hugging Face as an example:

import datasets
import torch
import torch.nn.functional as F
import tqdm
import transformers

model = "openai-community/gpt2"

# Load the model
torch.set_default_device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained(model)
model = transformers.AutoModelForCausalLM.from_pretrained(model)

# Load the dataset: HellaSwag has train, test, and validation splits
dataset = datasets.load_dataset("hellaswag", split="validation")

# Evaluate the model: Compute the perplexity of each ending
num_correct = 0
for sample in tqdm.tqdm(dataset):
    # tokenize text from the sample
    text = tokenizer.encode(" " + sample["activity_label"] + ". " + sample["ctx"])
    endings = [tokenizer.encode(" " + x) for x in sample["endings"]]  # 4 endings
    groundtruth = int(sample["label"])  # integer, 0 to 3

    # generate logits for each ending
    perplexities = [0.0] * 4
    for i, ending in enumerate(endings):
        # run the entire input and ending to the model
        input_ids = torch.tensor(text + ending).unsqueeze(0)
        output = model(input_ids).logits

        # extract the logits for each token in the ending
        logits = output[0, len(text)-1:, :]
        token_probs = F.log_softmax(logits, dim=-1)

        # accumulate the probability of generating the ending
        log_prob = 0.0
        for j, token in enumerate(ending):
            log_prob += token_probs[j, token]

        # convert the sum of log probabilities to perplexity
        perplexities[i] = torch.exp(-log_prob / len(ending))

    # print the perplexity of each ending
    print(sample["activity_label"] + ". " + sample["ctx"])
    correct = perplexities[groundtruth] == min(perplexities)
    for i, p in enumerate(perplexities):
        if i == groundtruth:
            symbol = '(O)' if correct else '(!)'
        elif p == min(perplexities):
            symbol = '(X)'
        else:
            symbol = ' '
        print(f"Ending {i}: {p:.4g} {symbol} - {sample['endings'][i]}")

    if correct:
        num_correct += 1

print(f"Accuracy: {num_correct}/{len(dataset)} = {num_correct / len(dataset):.4f}")


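This code loads the smallest GPT-2 model from the Hugging Face Hub. It is a 124-million-parameter model that runs comfortably on a low-end computer. The model and the tokenizer are loaded with the Hugging Face transformers library. You also load the HellaSwag validation split.

In the for loop, you tokenize the activity label and the context. You also tokenize the four endings. Note that tokenizer.encode() is a method of the tokenizer from the transformers library. It is different from the tokenizer object you used in the previous article.

Next, for each ending, you feed the concatenated input and ending into the model. The input_ids tensor is a 2D tensor of integer token IDs with a batch dimension of 1. The model returns an object from which you extract the output logits tensor. This differs from the model you built in the previous article because it is a model object from the transformers library. With a few modifications, you can easily swap in a model object that you trained yourself.

GPT-2 is a decoder-only transformer model. It processes the input with a causal mask. For an input tensor of shape $(1, L)$, the output logits tensor has shape $(1, L, V)$, where $V$ is the vocabulary size. The output at position $p$ is the model's estimate of the token at position $p+1$, conditioned on the input at positions 1 to $p$. Therefore, you extract the logits starting at offset $n-1$, where $n$ is the length of the combined activity label and context. You then convert the logits to log probabilities and average them over the length of each ending.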
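The offset of $n-1$ is easy to get wrong, so here is a small sketch of the same alignment in plain Python, with nested lists standing in for the logits tensor. The helper names (`log_softmax`, `ending_perplexity`) and the toy numbers are assumptions made for illustration; the real code above uses PyTorch tensors instead.

```python
import math

def log_softmax(row):
    """Convert one row of logits into log probabilities (numerically stable)."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def ending_perplexity(logits, n_context, ending):
    """logits holds one row of V scores per input position (context + ending).
    Row p scores the token at position p + 1, so the rows that predict the
    ending start at index n_context - 1."""
    rows = logits[n_context - 1:]
    log_prob = 0.0
    # zip stops after len(ending) rows; the last row predicts a token
    # beyond the ending and is ignored
    for row, token in zip(rows, ending):
        log_prob += log_softmax(row)[token]
    return math.exp(-log_prob / len(ending))

# Toy setup (hypothetical): vocabulary of 4 tokens, 3 context tokens, 2 ending tokens
V = 4
context = [0, 1, 2]
ending = [3, 1]
total_len = len(context) + len(ending)

# All-zero logits mean a uniform distribution, so the perplexity is close to V
uniform_logits = [[0.0] * V for _ in range(total_len)]
ppl_uniform = ending_perplexity(uniform_logits, len(context), ending)

# Make the rows at index 2 and 3 strongly favor the ending tokens 3 and 1
peaked_logits = [[0.0] * V for _ in range(total_len)]
peaked_logits[2][3] = 20.0  # last context position predicts the first ending token
peaked_logits[3][1] = 20.0  # first ending position predicts the second ending token
ppl_peaked = ending_perplexity(peaked_logits, len(context), ending)

print(ppl_uniform, ppl_peaked)
```

In the peaked case the perplexity drops to nearly 1 only because the favored rows sit at indices 2 and 3, that is, starting at `len(context) - 1`. Shifting the slice by one position would read the wrong rows.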

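The value token_probs[j, token] is the log probability of the token with ID token at position j. The perplexity is computed from the mean log probability of the tokens in each ending. A good model should identify the correct ending as the one with the lowest perplexity. You can evaluate the model by counting the number of correct predictions over the entire HellaSwag validation split. When you run this code, you will see the following: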

...
Finance and Business. [header] How to buy a peridot [title] Look at a variety of stones...
Ending 0: 13.02 (X) - You will want to watch several of the gemstones, particularly eme...
Ending 1: 30.19 - Not only are they among the delicates among them, but they can be...
Ending 2: 34.96 (!) - Familiarize yourself with the different shades that it comes in, ...
Ending 3: 28.85 - Neither peridot nor many other jade or allekite stones are necess...
Family Life. [header] How to tell if your teen is being abused [title] Pay attention to...
Ending 0: 16.58 - Try to figure out why they are dressing something that is frowned...
Ending 1: 22.01 - Read the following as a rule for determining your teen's behaviou...
Ending 2: 15.21 (O) - [substeps] For instance, your teen may try to hide the signs of a...
Ending 3: 23.91 - [substeps] Ask your teen if they have black tights (with stripper...
Accuracy: 3041/10042 = 0.3028

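The code prints the perplexity of each ending. It marks the correct answer with (O) if the model picked it or (!) if it did not, and marks the model's wrong pick with (X). You can see that GPT-2's perplexity is often between 10 and 20, and sometimes higher as in the first sample above, even for the correct answers. More advanced LLMs can achieve perplexities below 10, even with vocabularies much larger than GPT-2's. What matters more is whether the model can identify the correct ending: the one that completes the sentence naturally. That ending should have the lowest perplexity; otherwise, the model would not generate the correct ending. GPT-2 reaches an accuracy of only about 30% on this dataset, modestly above the 25% expected from guessing at random among four endings.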
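The marking rule can be isolated into a small function and checked against the two samples printed above. This is a sketch with an assumed helper name (`mark_endings`). Unlike the evaluation code, which marks every ending tied for the minimum, it picks the first minimum as the prediction. With the values shown there are no ties, so the result is the same.

```python
def mark_endings(perplexities, groundtruth):
    """Return (predicted index, whether it is correct, per-ending symbols)."""
    # The model's pick is the ending with the lowest perplexity
    predicted = min(range(len(perplexities)), key=perplexities.__getitem__)
    correct = predicted == groundtruth
    symbols = []
    for i in range(len(perplexities)):
        if i == groundtruth:
            symbols.append("(O)" if correct else "(!)")
        elif i == predicted:
            symbols.append("(X)")
        else:
            symbols.append("")
    return predicted, correct, symbols

# Perplexities copied from the two samples in the output above; both have label 2
first = mark_endings([13.02, 30.19, 34.96, 28.85], groundtruth=2)
second = mark_endings([16.58, 22.01, 15.21, 23.91], groundtruth=2)
print(first)
print(second)
```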

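You can also repeat this code with different models. The results are as follows:

Model openai-community/gpt2: This is the smallest GPT-2 model, the one used in the code, with 124 million parameters. The accuracy is 3041/10042, or 30.28%.
Model openai-community/gpt2-medium: This is a larger GPT-2 model with 355 million parameters. The accuracy is 3901/10042, or 38.85%.
Model meta-llama/Llama-3.2-1B: This is the smallest model in the Llama family, with 1 billion parameters. The accuracy is 5731/10042, or 57.07%.

As you would expect, the larger models achieve higher accuracy.

Note that you should not compare perplexity across models with very different architectures. Because perplexity ranges from 1 to the vocabulary size, it depends heavily on the tokenizer. You can see why by replacing GPT-2 with Llama 3.2 1B in the code above and comparing the perplexities: Llama 3's perplexity is an order of magnitude higher, yet its accuracy is clearly better. This is because GPT-2's vocabulary has only 50,257 tokens, while Llama 3.2 1B's vocabulary has 128,256.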

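Summary

In this article, you learned about the perplexity metric and how to evaluate the perplexity of a language model with the HellaSwag dataset. Specifically, you learned:

Perplexity measures how much, on average, a model hesitates about the next token.
Perplexity is a metric that is sensitive to the vocabulary size.
Computing perplexity means computing the inverse of the geometric mean of the token probabilities in the sample.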
