Evaluating AI's ability to perform scientific research tasks

Administrator
2026-01-17

📢 Repost information

Original link: https://openai.com/index/frontierscience

Original author: OpenAI


December 16, 2025



We are introducing FrontierScience, a new benchmark that measures AI's ability to perform expert-level scientific reasoning across physics, chemistry, and biology.


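Image: a partial view with a soft green-yellow gradient background; on the right, large cropped lettering beginning with "Fro…", overlaid with several text blocks showing review criteria such as "Factual", "Gradable", "Objective", and "Difficult".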

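Reasoning is at the core of scientific work. Scientists need not only to recall facts but also to generate hypotheses, test and refine them, and synthesize ideas across fields. As models become more capable, the key question is how deeply they can reason in order to genuinely contribute to scientific research.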


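Over the past year, our models have reached major milestones, including gold-medal-level performance at the International Math Olympiad and the International Olympiad in Informatics. At the same time, we are beginning to see the most capable models, such as GPT‑5, substantially accelerate scientific work. Researchers use these systems to search literature across disciplines and languages and to work through complex mathematical proofs. In many cases, the model compresses work that would have taken days or weeks into hours. This progress is documented in our November 2025 paper, "Early science acceleration experiments with GPT‑5", which presents early evidence that GPT‑5 can significantly speed up scientific research.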

Introducing FrontierScience

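Accelerating scientific progress is one of the most promising ways AI can benefit humanity. We therefore continue to improve our models on difficult math and science tasks and to build tools that help scientists get the most out of them.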


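When GPQA, a "Google-proof" science benchmark written by PhD experts, was released in November 2023, GPT‑4 scored 39%, below the expert baseline of 70%. Two years later, GPT‑5.2 scores 92%. As models' reasoning and knowledge improve, more challenging benchmarks become increasingly important for measuring and forecasting their ability to accelerate scientific research. Previous science benchmarks consist largely of multiple-choice questions, are already saturated, or are not centrally focused on science.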


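To close this gap, we are introducing FrontierScience, a new benchmark for measuring expert-level scientific capabilities. FrontierScience is written and verified by experts in physics, chemistry, and biology, and consists of hundreds of questions designed to be difficult, original, and scientifically meaningful. It has two tracks: Olympiad, which measures olympiad-style scientific reasoning, and Research, which measures research ability in realistic scientific settings. A deeper understanding of models' scientific capabilities helps us track progress and advance AI-accelerated science.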


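In our initial evaluations, GPT‑5.2 is the top-performing model on both FrontierScience-Olympiad (77%) and FrontierScience-Research (25%), ahead of other frontier models. Models have made clear progress on expert-level problems, but headroom remains, especially on open-ended, research-style tasks. For scientists, this means current models can already support parts of research that involve structured reasoning, while substantial work is still needed to improve their open-ended thinking. These results match how scientists currently use AI: they use models to shorten parts of the research workflow while relying on human judgment to frame problems and validate results, and they increasingly use models to explore ideas and connections that would otherwise take much longer to find. In some cases, this includes the model proposing new insights that experts then evaluate and verify.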


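Ultimately, the most important measure of AI's scientific capability is the new scientific discoveries it helps produce; those are the outcomes that matter to science and society. FrontierScience provides an "upstream" reference point for expert-level scientific reasoning: it lets us test models on a standardized set of questions, observe where they succeed and fail, and identify where to improve. FrontierScience is limited in important ways (for example, it focuses mainly on constrained, expert-written problems) and does not cover everything scientists do in their daily work. But the field needs benchmarks that are more difficult, original, and meaningful, and FrontierScience is a significant step in that direction.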

What FrontierScience measures and how it was built

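The full FrontierScience evaluation contains over 700 text-only questions (160 of them in the gold set) covering subfields of physics, chemistry, and biology. The benchmark consists of two parts, Olympiad and Research. FrontierScience-Olympiad contains 100 questions designed by international olympiad medalists to assess scientific reasoning in a constrained, short-answer format. The questions are theoretical and designed to be at least as difficult as problems at international olympiad competitions. FrontierScience-Research consists of 60 original research subtasks designed by PhD scientists (doctoral candidates, professors, or postdoctoral researchers) and is graded with a 10-point rubric. These tasks are meant to represent difficult, self-contained, multi-step problems that a PhD scientist might encounter while doing research.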

Sample question

B1 reacts with aqueous bromine (Br2) to form B2. B2 reacts with potassium nitrite (KNO2) to form B3. B3 is nitrated in nitric acid (HNO3) and sulfuric acid (H2SO4) to form B4.

  • B1 contains a monosubstituted aromatic 5-membered heterocycle and has a molar mass of 96.08 g/mol. It may be produced by dehydrating 5-carbon sugars (e.g. xylose) in an acid catalyst.
  • B2 has the molecular formula C4H2Br2O3 and contains a tetrasubstituted alkene with 2 substituents being bromines cis to each other.
  • B3 is a dipotassium salt with a molar mass of 269.27 g/mol. It contains 1 hydrogen.
  • B4 is an achiral pseudohalogen dimer with 2 carbons, no hydrogens and a molar mass of 300. g/mol.

When B4 decomposes in solution, it forms an intermediate B5 and 1 equivalent of dinitrogen tetroxide (N2O4) as a side product. Intermediate B5 can be trapped and detected as a Diels-Alder adduct.

Provide the structures of B1, B2, B3, B4, and B5 in the following format, "B1: X; B2: X; B3: X; B4: X; B5: X".
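The problem above constrains each compound by molar mass or molecular formula, so any candidate answer can be sanity-checked with simple arithmetic. The following is a minimal sketch of such a check, not part of the benchmark itself: the `molar_mass` helper and the rounded atomic-mass table are assumptions introduced here for illustration, and the parser handles only flat formulas without parentheses. It is applied only to formulas that appear in the problem statement.

```python
import re

# Standard atomic masses in g/mol, rounded; only elements named in the problem.
# This table is an illustrative assumption, not part of the benchmark.
ATOMIC_MASS = {
    "H": 1.008, "C": 12.011, "N": 14.007,
    "O": 15.999, "K": 39.098, "Br": 79.904,
}


def molar_mass(formula: str) -> float:
    """Return the molar mass (g/mol) of a flat formula such as 'C4H2Br2O3'."""
    # Each token is an element symbol followed by an optional count.
    tokens = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    # Reject input that the simple pattern could not fully consume.
    if "".join(symbol + count for symbol, count in tokens) != formula:
        raise ValueError(f"cannot parse formula: {formula!r}")
    total = 0.0
    for symbol, count in tokens:
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


if __name__ == "__main__":
    # Formula given for B2 and the side product named in the problem.
    for formula in ("C4H2Br2O3", "N2O4"):
        print(formula, round(molar_mass(formula), 2))
```

A candidate structure for B1, B3, or B4 can be checked the same way by comparing its computed molar mass with the value stated in the problem.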

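Every task in FrontierScience is written and verified by an expert in physics, chemistry, or biology. For the Olympiad set, all experts have won a medal in at least one (and often several) international olympiad competitions. For the Research set, all experts hold a PhD in a relevant field.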


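The Olympiad questions were written by 42 former international olympiad medalists or national team coaches in the relevant fields, who together hold 109 olympiad medals. The Research questions were created in collaboration with 45 qualified scientists and domain experts. All participating scientists are doctoral candidates, postdoctoral researchers, or professors, and their expertise spans a range of important, highly specialized disciplines, including quantum electrodynamics, synthetic organic chemistry, and evolutionary biology.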


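While building both sets, we filtered tasks against the performance of internal OpenAI models (for example, discarding questions that the models already answered correctly), so the evaluation may be somewhat biased against these internal models relative to other models. We have open-sourced the 100 Olympiad gold-set questions and the 60 Research gold-set questions, and we are holding out the remaining questions to monitor for potential contamination.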

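Image: a flowchart showing the four stages of task development (creation, review, resolution, and revision). Each step has a short description along with checkmarks for the criteria "Factual", "Gradable", "Objective", and "Difficult".

Task development goes through four stages: creation, review, resolution, and revision. Independent experts review each other's tasks to verify that they meet the criteria.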

How we evaluate model performance

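The Olympiad set is graded on short answers: an answer can be a number, an expression, or a string checked by fuzzy matching. This makes answers easy to verify, but it also limits how expressive and open-ended the questions can be. For the Research set, we use a rubric-based architecture to grade open-ended tasks. Each question comes with a rubric made up of multiple independent, objectively assessable items worth a total of 10 points. The rubric assesses not only whether the final answer is correct but also whether the intermediate reasoning steps are sound, which allows more fine-grained analysis of model performance and failure modes. A solution is considered "correct" if it scores at least 7 out of 10.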
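To make the short-answer idea concrete, here is a minimal sketch of a checker that accepts either a numeric answer within a tolerance or a fuzzy string match. It is an illustration only and not OpenAI's grader: the function names, the 1% relative tolerance, and the 0.9 similarity cutoff are all assumptions chosen for this example, and symbolic expression equivalence is deliberately left out.

```python
import math
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def grade_short_answer(
    response: str,
    reference: str,
    rel_tol: float = 1e-2,        # hypothetical tolerance for numeric answers
    min_similarity: float = 0.9,  # hypothetical cutoff for fuzzy string matches
) -> bool:
    """Return True if the response matches the reference numerically or as a fuzzy string."""
    # Numeric path: if both values parse as floats, compare with a relative tolerance.
    try:
        return math.isclose(float(response), float(reference), rel_tol=rel_tol)
    except ValueError:
        pass
    # String path: SequenceMatcher.ratio() returns a similarity in [0, 1].
    ratio = SequenceMatcher(None, normalize(response), normalize(reference)).ratio()
    return ratio >= min_similarity


if __name__ == "__main__":
    print(grade_short_answer("3.14", "3.1416"))
    print(grade_short_answer("Benzoic  Acid", "benzoic acid"))
    print(grade_short_answer("ethanol", "benzoic acid"))
```

Fixed thresholds like these are brittle for chemical names and equivalent expressions, which is one reason a grader with more judgment can be preferable to plain string comparison.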


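Responses are evaluated by a model-based grader (GPT‑5) against the short answer or the rubric. Ideally, a human expert would grade every response, but that does not scale, so we designed the rubrics to be checkable by a model-based grader. We also developed a verification pipeline to help ensure that rubrics and questions are well calibrated for difficulty and correctness.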

Sample chemistry research problem

Phthalocyanines are important macrocyclic compounds widely utilized in material and life sciences due to their outstanding physicochemical properties, including intense absorption in the visible-near-IR region and high electron-accepting ability. However, unlike porphyrins, the structural diversity of phthalocyanines has been limited by traditional synthetic methods that rely on one-step condensation of isoindoline units. The development of methodologies that allow controlled modification of the phthalocyanine framework, particularly at the meso-positions, represents a significant advancement in expanding the structural and electronic diversity of these compounds.

Analyze how the modification of meso-nitrogen atoms in nickel(II) phthalocyanine affects the electronic structure, aromaticity, spectroscopic properties, and reactivity of the resulting macrocycles. Explain the synthetic methodology that enables these modifications, the relationship between π-electron count and observed properties, and how these systems compare to traditional phthalocyanines and related porphyrinoids.

Sample response

1. Introduction: Why Modify the meso-Nitrogens of Phthalocyanines?

Phthalocyanines (Pc) are a class of macrocyclic dyes known for their remarkable chemical, electronic, and optical properties. Their widespread use in materials science, catalysis, and biology stems from strong absorptions in the visible/near-IR regions, exceptional redox characteristics (i.e., high electron affinity), and rigid π-conjugation. Traditional synthetic methods, however, typically form the macrocycle in a single step from phthalonitrile or isoindoline precursors, restricting the ability to introduce substitutions selectively at key ring positions-especially the "meso" (bridging) nitrogens. Recently, methodologies allowing targeted modification of the meso-nitrogen atoms in Ni(II) phthalocyanines have emerged as a powerful means to expand the structural and electronic diversity of these macrocycles.

2. Effect on Electronic Structure and Aromaticity

  1. π-Electron Count and Ring Currents

    Phthalocyanines are classical 18-electron aromatic systems, conferring high stability and planarity. Altering the meso-nitrogens through substitution or ring expansion can shift the effective π-electron count. For example, introducing additional unsaturated moieties or heteroatoms may expand the cyclic conjugation to 20 or 22 electrons, moving the molecule between Hückel aromaticity and other regimes that affect ring currents and magnetic properties.

  2. Charge Distribution and Frontier Orbitals

    Meso-substitution can tune the electron density distribution around the macrocycle. Electron-donating groups can push electron density toward the ring center, lowering the energy of the lowest unoccupied molecular orbital (LUMO). Electron-withdrawing groups, conversely, stabilize the highest occupied molecular orbital (HOMO) and shift redox potentials toward more positive values, altering both the electrochemical profile and the Q- and B-band positions in the UV-vis spectrum.

3. Spectroscopic Consequences

  1. UV-Vis Absorption (Q and B Bands)

    The principal absorption features of phthalocyanines lie in the visible (Q-band, typically 600-700 nm) and near-UV (B-band, typically 300-400 nm).

    Substitution that expands the ring conjugation or introduces strong electron-donating/withdrawing groups can:

    • Shift the Q-band to longer wavelengths (bathochromic shift), reaching into the near-IR, which is highly desirable for optoelectronic and photodynamic applications.
    • Alter relative intensities of these bands and merge or split them, reflecting changes in orbital symmetries and energies.
  2. NMR Spectroscopy and Aromatic Ring Currents

    Modifications to the π-electron count and distribution are directly observed in 1H and 13C NMR chemical shifts.

    More highly conjugated (or expanded) aromatic rings exhibit distinct downfield shifts for protons located within induced ring currents, while any partial loss of aromaticity or incorporation of antiaromatic segments can cause atypical shielding/deshielding patterns.

4. Reactivity and Coordination Chemistry

Because phthalocyanines are often used as redox catalysts or sensors, the meso-nitrogen modifications can significantly influence reactivity:

  • Electron-rich meso substituents facilitate nucleophilic or electrophilic attacks at the ring periphery, enabling site-selective functionalizations that are otherwise difficult.

(... shortened for the purposes of this figure)

Sample grading rubric

Analysis of Traditional Phthalocyanine Synthesis Limitations (1 point)

[Pass] 1.0 point: Correctly explains that traditional phthalocyanine synthesis involves one-step condensation with simultaneous formation of all meso-nitrogen bridges, providing limited control over substitution patterns at these positions.

0.5 point: Mentions limitations of traditional methods but without specific focus on meso-position control challenges.

0.0 point: Fails to identify key limitations of traditional synthetic approaches or provides incorrect analysis.

Thiolate-Mediated Tetramerization Process (1 point)

1.0 point: Correctly describes the thiolate-mediated reductive tetramerization and explains how counter cation size (K+ or Cs+ vs. Na+) affects selectivity between tetramer formation and direct macrocyclization.

0.5 point: Mentions thiolate-mediated tetramerization but without explaining factors controlling selectivity.

[Fail] 0.0 point: Incorrectly describes the oligomerization process or omits critical details about selectivity control.

Analysis of NMR Spectroscopic Features (1 point)

1.0 point: Correctly explains that upfield shifts in the 16π system indicate paratropic ring current (antiaromaticity), contrasts this with the broad signals in 17π systems due to paramagnetism, and connects these observations to the underlying electronic structures.

[Pass] 0.5 point: Identifies basic NMR patterns but without clear connection to ring currents or electronic structure.

0.0 point: Incorrectly interprets NMR data or fails to connect spectral features to electronic properties.

Electrochemical Property Analysis (1 point)

1.0 point: Correctly explains that the 16π system shows two reversible reductions reflecting conversion to 17π radical and 18π aromatic states, while 17π systems show narrow redox gaps due to facile interconversion between 16π, 17π, and 18π states, and relates these patterns to the underlying electronic structures.

[Pass] 0.5 point: Describes redox patterns without clearly connecting them to specific electronic state changes.

0.0 point: Incorrectly interprets electrochemical data or fails to connect redox behavior to electronic properties.

Analysis of Absorption Spectroscopy (1 point)

1.0 point: Correctly explains that the 16π system shows weak/broad absorption due to symmetry-forbidden HOMO-LUMO transitions in antiaromatic systems, while 17π systems show Q-like bands plus NIR-II absorptions characteristic of radical species, and contrasts these with typical phthalocyanine spectral features.

[Pass] 0.5 point: Describes absorption features but provides limited connection to underlying electronic structures.

0.0 point: Incorrectly interprets absorption data or fails to relate spectral features to electronic properties.

Reactivity Analysis of Antiaromatic System (1 point)

1.0 point: Correctly explains the high reactivity of the 16π system toward nucleophiles, details specific reactions with hydroxide (ring opening) and hydrazine (ring expansion), and explains how these transformations relieve antiaromatic destabilization.

0.5 point: Mentions reactivity but provides limited analysis of specific transformations or the driving forces behind them.

[Fail] 0.0 point: Incorrectly analyzes reactivity patterns or fails to connect them to the antiaromatic character of the 16π system.

(... and more)

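Each task in the Research set has a rubric worth a total of 10 points that can be used by an expert or a model-based grader. To scale up evaluation, we use another model to grade responses.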
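The rubric arithmetic can be sketched in a few lines: each item contributes the points the grader awards, and the article's rule treats a total of at least 7 out of 10 as correct. The `RubricItem` class, the function names, and the awarded values below are hypothetical and exist only for illustration. How the grader chooses a level for each item is outside this sketch.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 7.0  # from the article: at least 7 of 10 points counts as "correct"


@dataclass
class RubricItem:
    name: str
    max_points: float
    awarded: float  # points assigned by the grader, e.g. 0.0, 0.5, or 1.0


def total_score(items: list[RubricItem]) -> float:
    """Sum the awarded points after checking each one is within its item's range."""
    for item in items:
        if not 0.0 <= item.awarded <= item.max_points:
            raise ValueError(f"awarded points out of range for {item.name!r}")
    return sum(item.awarded for item in items)


def is_correct(items: list[RubricItem]) -> bool:
    """Apply the 7-out-of-10 rule to a fully graded rubric."""
    return total_score(items) >= PASS_THRESHOLD


if __name__ == "__main__":
    # Hypothetical grades for three one-point items; a real rubric totals 10 points.
    graded = [
        RubricItem("Synthesis limitations", 1.0, 1.0),
        RubricItem("NMR features", 1.0, 0.5),
        RubricItem("Reactivity analysis", 1.0, 0.0),
    ]
    print(total_score(graded), is_correct(graded))
```

Keeping per-item scores instead of only the total preserves the fine-grained view of where a response lost points, which the article cites as a benefit of rubric grading.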

Model performance

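We evaluated several frontier models on FrontierScience-Olympiad and FrontierScience-Research: GPT‑5.2, Claude Opus 4.5, Gemini 3 Pro, GPT‑4o, OpenAI o4-mini, and OpenAI o3. All reasoning models were evaluated at "high" reasoning effort, except GPT‑5.2, which was run at "xhigh". In our initial evaluations, GPT‑5.2 is the top-performing model on both FrontierScience-Olympiad (77%) and FrontierScience-Research (25%), ahead of other frontier models. Gemini 3 Pro is close to GPT‑5.2 on the Olympiad set (76%).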


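We observe that models have made substantial progress on expert-level problems, while headroom remains, especially on open-ended, research-style tasks. Transcripts of failures show that frontier models still make reasoning, logic, and calculation errors, misunderstand some niche scientific concepts, and make factual mistakes.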

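We compare accuracy across several frontier models; GPT‑5.2 performs best on both FrontierScience-Research and FrontierScience-Olympiad.

We compare the accuracy of GPT‑5.2 and o3 at different reasoning efforts. Longer thinking time leads to higher accuracy.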

Limitations and future directions

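Although FrontierScience is a step forward in the difficulty of science benchmarks, it still has many limitations. Its questions use a constrained format and focus either on the correctness of the final answer (Olympiad) or on the reasoning needed to complete a research task (Research). In addition, for longer tasks, grading with a multi-item rubric is less objective than directly checking a final answer.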


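FrontierScience gives a clearer picture of how models reason on difficult, expert-written problems, but it does not capture how scientific research works in practice. In particular, it does not evaluate important parts of research such as how models generate genuinely novel hypotheses or how they handle multimodal information, including video data and real experimental systems in the physical world.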


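Looking ahead, we expect progress in scientific reasoning to come from two sources: stronger general-purpose reasoning systems and focused work on improving scientific capabilities. FrontierScience is only one tool among many. As models improve, we plan to iterate on the benchmark, expand it to more fields, and pair it with more realistic evaluations that look at what these systems actually help scientists accomplish in real research. Benchmarks like FrontierScience help us identify where today's AI systems fall short, so that we can focus on making models reliable partners in scientific discovery.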



