📢 Repost information

Original link: https://www.microsoft.com/en-us/research/blog/renderformer-how-neural-networks-are-reshaping-3d-rendering/

Original author: Yue Dong, Lead Researcher

RenderFormer: How neural networks are reshaping 3D rendering

Published: September 10, 2025

By Yue Dong, Lead Researcher


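3D rendering, the process of converting three-dimensional models into two-dimensional images, is a foundational technology in computer graphics, widely used in gaming, film, virtual reality, and architectural visualization. Traditionally, this process has relied on physics-based techniques such as ray tracing and rasterization, which simulate the behavior of light through mathematical formulas and expert-designed models.

Now, thanks to advances in artificial intelligence (AI), especially neural networks, researchers are beginning to replace these conventional approaches with machine learning (ML). This shift has given rise to a new field known as neural rendering.

Neural rendering combines deep learning with traditional graphics techniques, allowing models to simulate complex light transport without explicitly modeling physical optics. The approach offers significant advantages: it eliminates the need for handcrafted rules, supports end-to-end training, and can be optimized for specific tasks. However, most existing neural rendering methods rely on 2D image inputs, lack support for raw 3D geometry and material data, and often require retraining for each new scene, which limits their generalizability.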

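RenderFormer: Toward a general-purpose neural rendering model

To overcome these limitations, researchers at Microsoft Research developed RenderFormer, a new neural architecture designed to support full-featured 3D rendering using only ML, with no traditional graphics computation required. RenderFormer is the first model to demonstrate that a neural network can learn a complete graphics rendering pipeline, including support for arbitrary 3D scenes and global illumination, without relying on ray tracing or rasterization. The work has been accepted at SIGGRAPH 2025 and is open-sourced on GitHub.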

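Architecture overview

As shown in Figure 1, RenderFormer represents the entire 3D scene using triangle tokens. Each token encodes spatial position, surface normal, and physical material properties such as diffuse color, specular color, and roughness. Lighting is also modeled as triangle tokens, with emission values indicating intensity.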
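To make the triangle-token idea concrete, here is a minimal Python sketch of how the per-triangle attributes listed above could be flattened into a single feature vector before embedding. The `Triangle` class, the `triangle_features` function, and the 28-value layout are assumptions made for illustration; the article does not specify RenderFormer's actual input layout.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Triangle:
    """Hypothetical container for the attributes the article lists."""
    vertices: Tuple[Vec3, Vec3, Vec3]   # spatial position of each corner
    normals: Tuple[Vec3, Vec3, Vec3]    # one normal per vertex
    diffuse: Vec3                       # diffuse color (RGB)
    specular: Vec3                      # specular color (RGB)
    roughness: float
    emission: Vec3                      # non-zero only for light sources


def triangle_features(tri: Triangle) -> List[float]:
    """Flatten one triangle into a feature vector (assumed layout: 28 floats)."""
    feats: List[float] = []
    for v in tri.vertices:
        feats.extend(v)
    for n in tri.normals:
        feats.extend(n)
    feats.extend(tri.diffuse)
    feats.extend(tri.specular)
    feats.append(tri.roughness)
    feats.extend(tri.emission)
    return feats


# A light source is just another triangle whose emission is non-zero.
light = Triangle(
    vertices=((0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 1.0)),
    normals=((0.0, -1.0, 0.0),) * 3,
    diffuse=(0.0, 0.0, 0.0),
    specular=(0.0, 0.0, 0.0),
    roughness=1.0,
    emission=(5.0, 5.0, 5.0),
)
features = triangle_features(light)
```

In the real model a learned embedding would then map such a vector to a token; that step is omitted here.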

Figure 1. Architecture of RenderFormer. The figure shows a triangle mesh scene (a 3D rabbit model inside a colored cube), a camera ray map grid, a view-independent transformer (12 layers of self-attention and feed-forward network), a view-dependent transformer (6 layers with cross-attention and self-attention), and a DPT decoder. Scene attributes (vertex normal; reflectance as diffuse, specular, and roughness; emission; and position) are embedded into triangle tokens via Linear + Norm operations. These tokens and the ray bundle tokens (from the camera ray map) are processed by the respective transformers and decoded to produce a rendered image of a glossy rabbit in a colored room.

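To describe the viewing direction, the model uses ray bundle tokens derived from a ray map; each pixel in the output image corresponds to one of these rays. To improve computational efficiency, pixels are grouped into rectangular patches, and all rays in a patch are processed together.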
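The patch grouping described above can be sketched in a few lines of Python. The function name `group_pixels_into_patches` and the 8×8 patch size are assumptions chosen for illustration; the article does not state the patch size RenderFormer uses.

```python
from typing import List, Tuple

Pixel = Tuple[int, int]


def group_pixels_into_patches(
    width: int, height: int, patch_w: int, patch_h: int
) -> List[List[Pixel]]:
    """Split an image's pixel grid into rectangular patches.

    Each returned patch lists the (x, y) pixels whose rays would be
    bundled into a single ray bundle token.
    """
    if width % patch_w or height % patch_h:
        raise ValueError("image size must be a multiple of the patch size")
    patches: List[List[Pixel]] = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            patches.append(
                [
                    (x, y)
                    for y in range(top, top + patch_h)
                    for x in range(left, left + patch_w)
                ]
            )
    return patches


# Assumed example: a 256x256 image with 8x8 patches.
patches = group_pixels_into_patches(256, 256, 8, 8)
```

With these assumed numbers, 65,536 per-pixel rays become 1,024 bundles, which is why grouping reduces the number of tokens the transformer must process.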

The model outputs a set of tokens that are decoded into image pixels, completing the rendering process entirely within the neural network.


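Dual-branch design for view-independent and view-dependent effects

The RenderFormer architecture is built around two transformers: one for view-independent features and another for view-dependent ones.

View-independent transformer: uses self-attention among triangle tokens to capture scene information that does not depend on the viewpoint, such as shadowing and diffuse light transport.
View-dependent transformer: uses cross-attention between triangle tokens and ray bundle tokens to model view-dependent effects such as visibility, reflections, and specular highlights.

Additional image-space effects, such as anti-aliasing and screen-space reflections, are handled via self-attention among ray bundle tokens.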
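The three attention patterns described above differ only in which tokens act as queries and which act as keys and values. The following pure-Python sketch shows that wiring with plain scaled dot-product attention. It is a simplification under stated assumptions: there are no learned projections, no multiple heads, no feed-forward layers, and the 2-dimensional toy tokens are made up for the example, so it illustrates the data flow rather than the actual RenderFormer layers.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """Scaled dot-product attention without learned projections."""
    dim = len(keys[0])
    outputs: List[Vector] = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [
                sum(w * v[i] for w, v in zip(weights, values))
                for i in range(len(values[0]))
            ]
        )
    return outputs


# Toy tokens (made-up values, 2 dimensions each).
triangle_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ray_tokens = [[0.5, 0.5], [2.0, 0.0]]

# View-independent branch: triangles attend to each other.
scene_tokens = attention(triangle_tokens, triangle_tokens, triangle_tokens)
# View-dependent branch: ray bundles query the scene (cross-attention).
ray_out = attention(ray_tokens, scene_tokens, scene_tokens)
# Image-space effects: ray bundles attend to each other.
image_tokens = attention(ray_out, ray_out, ray_out)
```

Note that the output always has one row per query, so the view-dependent branch yields one result per ray bundle regardless of how many triangles the scene contains.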

To validate the architecture, the team conducted ablation studies and visual analyses, confirming the importance of each component in the rendering pipeline.

Table 1. Ablation study analyzing the impact of different components and attention mechanisms on the final performance of the trained network. The table compares network variants by PSNR (↑), SSIM (↑), LPIPS (↓), and FLIP (↓).
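Of the four metrics in Table 1, PSNR is the simplest to compute by hand, and the sketch below shows the standard definition, 10 · log10(MAX² / MSE), where higher is better. The function name `psnr` and the flat list-of-floats image representation are choices made for this example. How the authors preprocess HDR renderings before computing the metric is not stated in the article, so the default `max_value=1.0` is an assumption.

```python
import math
from typing import Sequence


def psnr(
    reference: Sequence[float],
    rendered: Sequence[float],
    max_value: float = 1.0,
) -> float:
    """Peak signal-to-noise ratio in decibels between two flattened images."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives roughly 20 dB.
score = psnr([0.0, 0.0], [0.1, 0.1])
```

SSIM, LPIPS, and FLIP are perceptual metrics with more involved definitions and are not sketched here.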

To test the capabilities of the view-independent transformer, researchers trained a decoder to produce diffuse-only renderings. As shown in Figure 2, the results show that the model can accurately simulate shadows and other indirect lighting effects.

Figure 2. Effects decoded directly from the view-independent transformer, including diffuse lighting and coarse shadow effects. The figure shows four 3D-rendered objects, from left to right: a purple teapot on a green surface, a blue rectangular object on a red surface, an upside-down table casting shadows on a green surface, and a green apple-like object on a blue surface. Each object features diffuse lighting and coarse shadow effects, with distinct highlights and shadows produced by directional light sources.

Researchers evaluated the view-dependent transformer through attention visualizations. For example, in Figure 3, the attention map shows a pixel on a teapot attending to both its surface triangle and a nearby wall, capturing the effect of specular reflection. These visualizations also show how material changes affect the sharpness and intensity of reflections.

Figure 3. Visualization of attention outputs. The figure contains six panels arranged in two rows and three columns. The top row displays a teapot in a room with red and green walls under three different roughness values: 0.3, 0.7, and 0.99 (left to right). The bottom row shows the corresponding attention outputs for each roughness setting, featuring the teapot silhouette against a dark background with distinct light patterns that vary with roughness.

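Training methodology and dataset design

RenderFormer was trained on the Objaverse dataset, a collection of more than 800,000 annotated 3D objects designed to advance research in 3D modeling, computer vision, and related fields. The researchers designed four scene templates, populating each with one to three randomly selected objects and materials. Scenes were rendered in high dynamic range (HDR) using Blender's Cycles renderer, under varied lighting conditions and camera angles.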
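The scene-generation recipe above (pick a template, then place one to three random objects with random materials) can be sketched as a small sampler. Everything named here is hypothetical: the template names, the object and material identifiers, and the `sample_scene` function are placeholders for illustration, not the authors' data pipeline.

```python
import random
from typing import Dict, List

# The article says there are four templates; these names are placeholders.
SCENE_TEMPLATES = ["template_a", "template_b", "template_c", "template_d"]


def sample_scene(
    object_ids: List[str], material_ids: List[str], rng: random.Random
) -> Dict:
    """Sample one scene description: a template plus 1 to 3 placed objects."""
    template = rng.choice(SCENE_TEMPLATES)
    num_objects = rng.randint(1, 3)  # randint is inclusive on both ends
    placements = [
        {"object": rng.choice(object_ids), "material": rng.choice(material_ids)}
        for _ in range(num_objects)
    ]
    return {"template": template, "placements": placements}


# Seeded generator so the sampled dataset is reproducible.
rng = random.Random(0)
scene = sample_scene(
    ["obj_001", "obj_002", "obj_003"], ["mat_red", "mat_glossy"], rng
)
```

A full pipeline would then hand each sampled description to a renderer to produce the ground-truth HDR image; that step is outside the scope of this sketch.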

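The base model has 205 million parameters and was trained in two phases using the AdamW optimizer:

500,000 steps at 256×256 resolution, with up to 1,536 triangles.
100,000 steps at 512×512 resolution, with up to 4,096 triangles.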
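The two-phase schedule above can be written down as a small configuration table. The numbers come from the article; the `TrainingStage` class and the helper functions are hypothetical names introduced for this sketch, and no optimizer settings are shown because the article gives none beyond the use of AdamW.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrainingStage:
    resolution: int      # square image side length in pixels
    steps: int           # number of optimizer steps in this stage
    max_triangles: int   # upper bound on triangles per scene


STAGES = [
    TrainingStage(resolution=256, steps=500_000, max_triangles=1_536),
    TrainingStage(resolution=512, steps=100_000, max_triangles=4_096),
]


def total_steps(stages: List[TrainingStage]) -> int:
    """Total number of training steps across all stages."""
    return sum(stage.steps for stage in stages)


def stage_for_step(step: int, stages: List[TrainingStage]) -> TrainingStage:
    """Return the stage that a zero-based global step falls into."""
    remaining = step
    for stage in stages:
        if remaining < stage.steps:
            return stage
        remaining -= stage.steps
    raise ValueError("step is beyond the end of the schedule")
```

For example, `stage_for_step(500_000, STAGES)` returns the second stage, since the first stage covers steps 0 through 499,999.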

The model supports arbitrary triangle-based input and generalizes well to complex real-world scenes. As shown in Figure 4, it accurately reproduces shadows, diffuse shading, and specular highlights.

Figure 4. Rendering results for different 3D scenes generated by RenderFormer. The figure presents a 3×3 grid of diverse 3D scenes. In the top row, the first scene shows a room with red, green, and white walls containing two rectangular prisms; the second features a metallic tree-like structure in a blue-walled room with a reflective floor; and the third depicts a red animal figure, a black abstract shape, and a multi-faceted sphere in a purple container on a yellow surface. The middle row includes three constant-width bodies (black, red, and blue) floating above a colorful checkered floor; a green shader ball with a square cavity inside a gray-walled room; and crystal-like structures in green, purple, and red on a reflective surface. The bottom row showcases a low-poly fox near a pink tree emitting particles on grassy terrain; a golden horse statue beside a heart-shaped object split into red and grey halves on a reflective surface; and a wicker basket, a banana, and a bottle placed on a white platform.

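Because RenderFormer can model viewpoint changes and scene dynamics, it can also generate continuous video by rendering individual frames.

3D animation sequence rendered by RenderFormer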
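Frame-by-frame video generation amounts to calling the renderer once per camera pose (or per scene state). The sketch below assumes a simple circular camera orbit; `orbit_camera_positions`, `render_sequence`, and the `fake_render` stand-in are hypothetical and do not correspond to the RenderFormer code base.

```python
import math
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


def orbit_camera_positions(
    num_frames: int, radius: float, height: float
) -> List[Vec3]:
    """Camera positions evenly spaced on a horizontal circle around the origin."""
    positions: List[Vec3] = []
    for i in range(num_frames):
        angle = 2.0 * math.pi * i / num_frames
        positions.append(
            (radius * math.cos(angle), height, radius * math.sin(angle))
        )
    return positions


def render_sequence(
    render_frame: Callable[[Vec3], object], cameras: List[Vec3]
) -> List[object]:
    """Render each frame independently, one call per camera position."""
    return [render_frame(camera) for camera in cameras]


def fake_render(camera: Vec3) -> str:
    """Stand-in for a real renderer call; returns a label instead of an image."""
    return f"frame from camera at {camera}"


cameras = orbit_camera_positions(num_frames=8, radius=3.0, height=1.0)
frames = render_sequence(fake_render, cameras)
```

Because each frame is rendered independently, the same loop covers scene dynamics as well: the per-frame input would then carry updated triangle positions instead of, or in addition to, a new camera pose.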

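Looking ahead: Opportunities and challenges

RenderFormer represents a significant step forward for neural rendering. It demonstrates that deep learning can replicate and potentially replace the traditional rendering pipeline, supporting arbitrary 3D inputs and realistic global illumination, all without any hand-coded graphics computations.

However, key challenges remain. Scaling to larger and more complex scenes with intricate geometry, advanced materials, and diverse lighting conditions will require further research. Still, the transformer-based architecture provides a solid foundation for future integration with broader AI systems, including video generation, image synthesis, robotics, and embodied AI.

Researchers hope that RenderFormer will serve as a building block for future breakthroughs in both graphics and AI, opening new possibilities for visual computing and intelligent environments.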

