Train Your Large Model on Multiple GPUs with Pipeline Parallelism

Repost information

Original link: https://machinelearningmastery.com/train-your-large-model-on-multiple-gpus-with-pipeline-parallelism/

Original author: Adrian Tam

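Some language models are too large to train on a single GPU. If the model fits on a single GPU but cannot be trained with a large batch size, you can use data parallelism. When the model is too large to fit on a single GPU at all, however, you need to split it across multiple GPUs. In this article, you will learn how to use pipeline parallelism to split a model for training. In particular, you will learn:

What pipeline parallelism is
How to use pipeline parallelism in PyTorch
How to save and restore a model with pipeline parallelism

Let's get started!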

Train Your Large Model on Multiple GPUs with Pipeline Parallelism.
Photo by Ivan Ivankovic. Some rights reserved.

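Overview

This article is divided into six parts; they are:

Pipeline Parallelism Overview
Model Preparation for Pipeline Parallelism
Stage and Pipeline Schedule
Training Loop
Distributed Checkpointing
Limitations of Pipeline Parallelism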

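Pipeline Parallelism Overview

Pipeline parallelism means building the model as a pipeline of stages. If you have worked on a scikit-learn project, the concept of a pipeline may be familiar. A typical scikit-learn pipeline, for example, chains a StandardScaler with a LogisticRegression.

When you pass data to such a pipeline, it is first processed by the first stage (StandardScaler), and the output is then passed to the second stage (LogisticRegression).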
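As a concrete illustration, a two-stage scikit-learn pipeline might look like the sketch below. The synthetic data and the step names "scaler" and "classifier" are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumed, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Stage 1 standardizes the features; stage 2 fits a classifier on the result
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
pipeline.fit(X, y)

# predict() pushes the data through the scaler, then the classifier
predictions = pipeline.predict(X)
print(predictions.shape)
```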

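A transformer model is typically just a stack of transformer blocks. Each block takes one tensor as input and produces one tensor as output. This makes it a perfect candidate for a pipeline: each stage is a transformer block, and the blocks are chained together. Executing this pipeline is mathematically equivalent to executing the whole model.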

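For a transformer model, creating a pipeline manually is straightforward. At a high level, you place each stage on its own GPU and feed the output of one stage into the next: the first stage, stage1, runs on GPU 0 and produces a tensor output1, which is moved to GPU 1 as the input to stage2, and so on.

However, this approach is inefficient. While the stage1 model runs on GPU 0, GPU 1 and GPU 2 sit idle. Only after stage1 finishes and the tensor output1 is ready can you run the stage2 model on GPU 1, and so on.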
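The manual approach can be sketched as follows. This is only an illustration: the three small stages, their layer sizes, and the name stage3 are hypothetical stand-ins for groups of transformer blocks, and the script falls back to the CPU when fewer than three GPUs are available.

```python
import torch
import torch.nn as nn

# Use three GPUs when available; otherwise run everything on CPU so the sketch still works
if torch.cuda.device_count() >= 3:
    devices = [torch.device(f"cuda:{i}") for i in range(3)]
else:
    devices = [torch.device("cpu")] * 3

# Hypothetical stages, each placed on its own device
stage1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU()).to(devices[0])
stage2 = nn.Sequential(nn.Linear(32, 32), nn.ReLU()).to(devices[1])
stage3 = nn.Linear(32, 8).to(devices[2])

x = torch.randn(4, 16)

# Each stage waits for the previous one, so only one device is busy at a time
output1 = stage1(x.to(devices[0]))
output2 = stage2(output1.to(devices[1]))
output3 = stage3(output2.to(devices[2]))
print(output3.shape)
```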

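PyTorch provides infrastructure to manage the pipeline so that all GPUs stay busy. It is based on the concept of micro-batches: instead of processing one batch of size $N$, you split the batch into $n$ micro-batches of size $N/n$. While stage2 processes the $i$-th micro-batch, stage1 can process the $(i+1)$-th micro-batch. Once all micro-batches are processed, the results are aggregated to produce the final output.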
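To see why micro-batches keep the GPUs busy, here is a small, idealized simulation of the forward pass only. It assumes every stage takes one time step per micro-batch and ignores the backward pass, which real PyTorch schedules also interleave; the function name forward_schedule is made up for this sketch.

```python
from typing import List, Optional


def forward_schedule(num_stages: int, num_microbatches: int) -> List[List[Optional[int]]]:
    """Return schedule[t][s]: the micro-batch that stage s handles at step t, or None if idle."""
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            i = t - s  # micro-batch i reaches stage s at step i + s
            row.append(i if 0 <= i < num_microbatches else None)
        schedule.append(row)
    return schedule


schedule = forward_schedule(num_stages=3, num_microbatches=4)
for step, row in enumerate(schedule):
    cells = ["idle" if i is None else f"mb{i}" for i in row]
    print(f"step {step}: " + "  ".join(cells))
```

Under these assumptions, three stages and four micro-batches finish the forward pass in 3 + 4 - 1 = 6 steps, compared with 12 steps if each micro-batch had to clear all three stages before the next one started.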

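Let's see how to implement a training script with pipeline parallelism in PyTorch.

Warning: The pipeline parallelism API in PyTorch is still experimental and may change in the future. The code in this article was tested on PyTorch 2.9.1 and may not work on other PyTorch versions.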

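Model Preparation for Pipeline Parallelism

If your model fits on a single GPU, distributed data parallelism is the better choice. If you need pipeline parallelism, your model is probably too large to fit on a single device.

Before setting up the pipeline, you need to create the model. You have two options: either create the model for a single stage so that it fits on your GPU, or create the full model on a fake device and trim it before transferring it to the actual GPU. The former requires a stage parameter in the model's constructor so that a specific stage can be created. For the latter, you can do this: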

...
with torch.device("meta"):
    model_config = LlamaConfig()
    model = LlamaForPretraining(model_config, stage=rank)
    # Partition the model by removing some layers
    num_layers = model_config.num_hidden_layers
    partition = [num_layers // 3, 2 * num_layers // 3, num_layers]

    if rank == 0:
        # from embedding to 1/3 of the decoder layers
        for n in range(partition[0], partition[2]): 
            model.base_model.layers[str(n)] = None
        model.base_model.norm = None
        model.lm_head = None
    elif rank == 1: 
        # from 1/3 to 2/3 of the decoder layers
        model.base_model.embed_tokens = None
        for n in range(0, partition[0]): 
            model.base_model.layers[str(n)] = None
        for n in range(partition[1], partition[2]): 
            model.base_model.layers[str(n)] = None
        model.base_model.norm = None
        model.lm_head = None
    elif rank == 2: 
        # from 2/3 to the end of the decoder layers and the final norm layer, LM head
        model.base_model.embed_tokens = None
        for n in range(partition[1]): 
            model.base_model.layers[str(n)] = None
    else: 
        raise ValueError(f"Invalid rank: {rank}")

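The code above uses the LlamaForPretraining class defined in the previous post to create the model. If the model is too large, instantiating it on a real device causes an out-of-memory error. Here, you create the model on the fake device meta. When a model is created on meta, its weights are not allocated.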
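The behavior of the meta device can be checked on a single layer. The sketch below is only an illustration: the layer size is arbitrary, and materializing the layer with to_empty() followed by reset_parameters() is one possible way to obtain real weights, not necessarily the one used later in this article.

```python
import torch
import torch.nn as nn

# Parameters created under the meta device carry shape and dtype but no storage
with torch.device("meta"):
    layer = nn.Linear(1024, 1024)

was_meta = layer.weight.is_meta
print(layer.weight.device, layer.weight.shape)

# Allocate uninitialized storage on a real device, then initialize the weights
layer = layer.to_empty(device="cpu")
layer.reset_parameters()
print(layer.weight.device)
```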

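In the code above, you partition the model into three stages: at rank == 0 (the first stage), the model keeps the embedding layer and the first third of the decoder layers. At rank == 1 (the second stage), it keeps only the middle third of the decoder layers. At rank == 2 (the third stage), it keeps the last third of the decoder layers, the final normalization layer, and the prediction head. Components not needed in a particular stage are set to None. The stages do not overlap, and together they cover the entire model.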
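The partition boundaries can be checked with plain arithmetic. The helper layers_for_rank below is hypothetical, and the layer count of 32 is an assumed value for illustration; it applies the same integer division as the partition list in the code above.

```python
def layers_for_rank(num_layers: int, rank: int) -> range:
    """Decoder layer indices kept by the given rank in a three-stage split."""
    partition = [num_layers // 3, 2 * num_layers // 3, num_layers]
    start = 0 if rank == 0 else partition[rank - 1]
    return range(start, partition[rank])


num_layers = 32  # assumed value for illustration
for rank in range(3):
    kept = layers_for_rank(num_layers, rank)
    print(f"rank {rank}: layers {kept.start} to {kept.stop - 1} ({len(kept)} layers)")
```

With 32 layers, the three ranks keep 10, 11, and 11 layers respectively, so the stages are not exactly equal when the layer count is not divisible by three.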

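For such a model to work, you need to modify the model code so that a component is skipped in the forward pass when it is None. This needs to be done in both the LlamaModel and LlamaForPretraining classes: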

...
class LlamaModel(nn.Module):
    """The full Llama model without any pretraining heads."""
    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.rope = RotaryPositionEncoding(
            config.hidden_size // config.num_attention_heads,
            config.max_position_embeddings,
        )
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleDict({
            str(i): LlamaDecoderLayer(config) for i in range(config.num_hidden_layers)
        })
        self.norm = nn.RMSNorm(config.hidden_size, eps=1e-5)

    def forward(self, input_ids: Tensor) -> Tensor:
        # Convert input token IDs to embeddings
        if self.embed_tokens is not None:
            hidden_states = self.embed_tokens(input_ids)
        else:
            hidden_states = input_ids
        # Process through all transformer layers, then the final norm layer
        for n in range(len(self.layers)):
            if self.layers[str(n)] is not None:
                hidden_states = self.layers[str(n)](hidden_states, self.rope)
        if self.norm is not None:
            hidden_states = self.norm(hidden_states)
        # Return the final hidden states
        return hidden_states

class LlamaForPretraining(nn.Module):
    def __init__(self, config: LlamaConfig, stage) -> None:
        super().__init__()
        self.base_model = LlamaModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.stage = stage

    def forward(self, input_ids: Tensor) -> Tensor:
        hidden_states = self.base_model(input_ids)
        if self.lm_head is not None:
            hidden_states = self.lm_head(hidden_states)
        return hidden_states

You can see that several if statements were added to check whether a component is None before letting it process the hidden_states tensor.
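The same None-skipping pattern can be tried on a toy model to confirm that chaining the trimmed stages reproduces the full model's output. TinyStack is a hypothetical stand-in for LlamaModel, built from plain linear layers rather than decoder layers, and it is split into two stages instead of three to keep the sketch short.

```python
import copy

import torch
import torch.nn as nn


class TinyStack(nn.Module):
    """Hypothetical stand-in for LlamaModel: a few layers plus a final norm."""

    def __init__(self, dim: int = 8, num_layers: int = 4) -> None:
        super().__init__()
        self.layers = nn.ModuleDict({str(i): nn.Linear(dim, dim) for i in range(num_layers)})
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Skip any component that was set to None when the model was trimmed
        for n in range(len(self.layers)):
            if self.layers[str(n)] is not None:
                x = self.layers[str(n)](x)
        if self.norm is not None:
            x = self.norm(x)
        return x


torch.manual_seed(0)
full = TinyStack()
x = torch.randn(2, 8)

# Stage A keeps layers 0-1; stage B keeps layers 2-3 and the final norm
stage_a = copy.deepcopy(full)
stage_a.layers["2"] = None
stage_a.layers["3"] = None
stage_a.norm = None

stage_b = copy.deepcopy(full)
stage_b.layers["0"] = None
stage_b.layers["1"] = None

with torch.no_grad():
    expected = full(x)
    piped = stage_b(stage_a(x))

print(torch.allclose(piped, expected))
```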

After creating the partial model, you need to transfer it to the actual GPU. The transfer...
