Preparing Data for BERT Training

Repost information

Original link: https://machinelearningmastery.com/preparing-data-for-bert-training/

Original author: Jason Brownlee

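BERT (Bidirectional Encoder Representations from Transformers) is a pretrained language model built on the Transformer architecture. When it was released, it achieved state-of-the-art results on a range of natural language processing (NLP) tasks, such as question answering, sentiment analysis, and named entity recognition.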

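To use a BERT model, you need to follow a specific set of data preparation steps that convert raw text into the input format the model expects. These steps usually include:

Tokenization: split the text into units the model can process, called tokens.
Adding special tokens: insert marker tokens at the start and end of the input sequence.
Generating the attention mask: tell the model which tokens are real content and which are padding.
Creating token type IDs: distinguish the different sentences in the input sequence (where applicable).

This tutorial walks through these steps and shows how to implement them with the Hugging Face transformers library, in particular with BertTokenizer.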

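1. Tokenization

BERT uses a tokenization scheme called WordPiece, which breaks text into subword units. A word that is not in the vocabulary is split into smaller pieces that are, and continuation pieces are written with a "##" prefix. You normally use the tokenizer that ships with the pretrained model so that tokenization matches what the model saw during training.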
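To make the subword idea concrete, here is a minimal sketch of greedy longest-match-first splitting, the matching strategy WordPiece tokenizers apply to each word at inference time. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made up for illustration; the real BERT vocabulary has roughly 30,000 entries and the real tokenizer also handles lowercasing and punctuation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, far smaller than the real BERT vocabulary
toy_vocab = {"token", "##izer", "##s", "example", "bert"}

print(wordpiece_tokenize("tokenizer", toy_vocab))  # ['token', '##izer']
print(wordpiece_tokenize("tokens", toy_vocab))     # ['token', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

In practice you would not write this yourself; the pretrained tokenizer applies the same idea with the vocabulary the model was trained on.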

For example, the Hugging Face Transformers library lets you load the tokenizer of a pretrained BERT model.

from transformers import BertTokenizer

# Load the tokenizer of the pretrained BERT model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Example text
text = "Hello, this is a BERT tokenizer example."

# Tokenize the text and convert the tokens to IDs
encoded_input = tokenizer.encode(text)

print(encoded_input)
# Prints a list of integer token IDs that starts with 101 ([CLS]) and ends with 102 ([SEP])

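The tokenizer adds the special tokens automatically:

[CLS]: the start-of-sequence token (ID 101 in bert-base-uncased).
[SEP]: the sequence separator token (ID 102 in bert-base-uncased).

The tokenizer also splits punctuation into separate tokens, so the comma and the final period each receive their own token ID.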

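2. BERT Input Format

BERT expects three inputs, each of which is a sequence of integers of the same length:

input_ids: the sequence of token IDs produced by tokenization.
token_type_ids: the token type IDs, used to tell two sentences apart (for sentence-pair tasks such as question answering or natural language inference).
attention_mask: the attention mask, which indicates which tokens are real and which are padding.

For single-sentence input, token_type_ids is all 0s. In attention_mask, 1 marks a real token and 0 marks a padding token.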
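To show how the three sequences relate to one another, the sketch below builds them by hand for a single sentence. It is only an illustration under stated assumptions: `build_single_input` is a hypothetical helper, the three body token IDs are placeholders, and the special-token IDs (101, 102, and 0 for [PAD]) are those of bert-base-uncased.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased


def build_single_input(token_ids, max_length):
    """Wrap token IDs in [CLS]/[SEP], then truncate and pad to max_length."""
    # Truncate first, leaving room for the two special tokens.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 = real token

    # Pad both sequences on the right up to max_length.
    n_pad = max_length - len(input_ids)
    input_ids = input_ids + [PAD_ID] * n_pad
    attention_mask = attention_mask + [0] * n_pad  # 0 = padding

    token_type_ids = [0] * max_length  # single sentence: all zeros
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Placeholder IDs standing in for three tokens of a short sentence
example = build_single_input([7592, 1010, 2023], max_length=8)
print(example["input_ids"])       # [101, 7592, 1010, 2023, 102, 0, 0, 0]
print(example["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
print(example["token_type_ids"])  # [0, 0, 0, 0, 0, 0, 0, 0]
```

The library tokenizer produces the same layout, so a hand-built version like this is mainly useful for checking your understanding of its output.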

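We can call the tokenizer directly to get all of the required inputs in one step. The older tokenizer.encode_plus() method accepts the same arguments.

# Example: call the tokenizer to get the complete input
# Padding and truncation are not applied by default, so both are requested explicitly
encoded_dict = tokenizer(
    text,
    add_special_tokens=True,     # add [CLS] and [SEP]
    max_length=128,              # maximum sequence length
    padding='max_length',        # pad up to max_length
    truncation=True,             # truncate if longer than max_length
    return_attention_mask=True,  # return the attention mask
    return_tensors='pt'          # return PyTorch tensors
)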

print('Token IDs:', encoded_dict['input_ids'])
print('Attention Mask:', encoded_dict['attention_mask'])
print('Token Type IDs:', encoded_dict['token_type_ids'])

Points to note:

attention_mask: this is the key part. It tells the model to ignore the padding tokens.
token_type_ids: for single-sentence input, these are all 0s.
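The way a mask makes the model "ignore" padding can be sketched in a few lines. One common approach, shown here in plain Python, is to add a large negative number to the attention scores of padded positions before the softmax, so those positions receive almost no weight. This is a simplified illustration, not the internals of any particular library version; `masked_softmax` and the example scores are assumptions made up for the sketch.

```python
import math


def masked_softmax(scores, attention_mask):
    """Softmax over scores where positions with mask 0 get near-zero weight."""
    # Push padded positions far down before normalising.
    masked = [s if m == 1 else s - 1e9 for s, m in zip(scores, attention_mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 0.5]  # made-up attention scores for four positions
mask = [1, 1, 0, 0]            # the last two positions are padding
weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])  # [0.731, 0.269, 0.0, 0.0]
```

Without the mask, the two padded positions would take a share of the attention weight even though they carry no content.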

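3. Preparing Sentence Pairs

BERT was designed with tasks in mind that take two input sequences (for example, deciding whether sentence B follows sentence A). In that case, every token needs a token type ID that says which of the two sequences it belongs to.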

When given two texts, the tokenizer handles the sentence-pair formatting automatically: it separates the two sentences with [SEP] and sets token_type_ids correctly. The older tokenizer.encode_plus() method behaves the same way.

sentence_a = "What is the capital of France?"
sentence_b = "The capital of France is Paris."

# Encode the two sentences as a pair
encoded_pair = tokenizer(
    sentence_a,
    sentence_b,
    add_special_tokens=True,
    max_length=128,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt'
)

print('Token IDs:', encoded_pair['input_ids'])
print('Token Type IDs:', encoded_pair['token_type_ids'])

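In the output, token_type_ids now contains both 0s and 1s. With max_length=128 the tensor has 128 entries, abbreviated here:

Token Type IDs: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ..., 0]])

This structure shows that sequence A (including [CLS] and the first [SEP]) is marked as type 0 and sequence B (including the second [SEP]) is marked as type 1. The padding positions that follow are also 0; they are excluded from attention by attention_mask, not by token_type_ids.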
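The pair layout can also be reproduced by hand, which makes the rule for the 0s and 1s explicit. The following is a minimal sketch under assumptions: `build_pair_input` is a hypothetical helper, the body token IDs are placeholders rather than real vocabulary lookups, and truncation is left out to keep the example short.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased


def build_pair_input(ids_a, ids_b, max_length):
    """Lay out [CLS] A [SEP] B [SEP] and pad to max_length (no truncation)."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Type 0 covers [CLS], sentence A and the first [SEP];
    # type 1 covers sentence B and the second [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)

    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("pair is longer than max_length; truncation is omitted in this sketch")

    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,  # padding is type 0
        "attention_mask": attention_mask + [0] * n_pad,
    }


# Placeholder IDs: three tokens for sentence A, two for sentence B
pair = build_pair_input([11, 12, 13], [21, 22], max_length=10)
print(pair["input_ids"])       # [101, 11, 12, 13, 102, 21, 22, 102, 0, 0]
print(pair["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(pair["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

Note that only attention_mask separates real tokens from padding; token_type_ids alone cannot, because both sentence A and the padding use 0.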

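4. Data Preparation in Practice

In a real project, you need to apply these steps to the whole dataset and then organize the results into a dataset that can be used to train the model.

The following example shows a general workflow for preparing a dataset with pandas and the transformers library: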

import pandas as pd
import torch

# Build a small DataFrame with a 'text' column from the earlier examples
df = pd.DataFrame({'text': [sentence_a, sentence_b, text]})

# Encode every text and collect the results
def encode_data(texts, tokenizer, max_len=128):
    input_ids = []
    attention_masks = []

    for text in texts:
        # Encode one sentence, padded or truncated to max_len
        encoding = tokenizer(
            text,
            add_special_tokens=True,
            max_length=max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
        input_ids.append(encoding['input_ids'])
        attention_masks.append(encoding['attention_mask'])

    # Concatenate the (1, max_len) tensors into (num_texts, max_len)
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)

    return input_ids, attention_masks

# Encode the dataset
input_ids, attention_masks = encode_data(df['text'].values, tokenizer)

print(f"Input IDs shape: {input_ids.shape}")
print(f"Attention Masks shape: {attention_masks.shape}")

After these steps, you have the standard input format a BERT model needs: input_ids, attention_mask (and, optionally, token_type_ids). These can be fed directly into your BERT training loop.
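A training loop usually consumes the encoded data in mini-batches rather than all at once. The sketch below shows the idea with a plain generator that slices the encoded rows in order. The name `iter_batches` and the toy rows are assumptions for illustration; in a PyTorch project this role is typically played by torch.utils.data.TensorDataset together with DataLoader, which also handle shuffling.

```python
def iter_batches(input_ids, attention_masks, batch_size):
    """Yield (input_ids, attention_mask) mini-batches in their original order."""
    if len(input_ids) != len(attention_masks):
        raise ValueError("input_ids and attention_masks must have the same length")
    for start in range(0, len(input_ids), batch_size):
        end = start + batch_size
        # Slicing works for Python lists and for tensors alike.
        yield input_ids[start:end], attention_masks[start:end]


# Toy encoded rows: five sequences already padded to length 4
toy_ids = [[101, 5, 102, 0], [101, 6, 7, 102], [101, 8, 102, 0],
           [101, 9, 102, 0], [101, 3, 4, 102]]
toy_masks = [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0],
             [1, 1, 1, 0], [1, 1, 1, 1]]

batch_sizes = [len(ids) for ids, _ in iter_batches(toy_ids, toy_masks, batch_size=2)]
print(batch_sizes)  # [2, 2, 1]
```

The last batch is smaller when the number of rows is not a multiple of the batch size, which a training loop needs to tolerate.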
