

2024-11-13 17:03:12 29




(Invited speech at the Open Source Industry Ecosystem Conference hosted by the OpenAtom Foundation)   2024.09.24


............................................................Polytechnic University of Valencia 2024.09.25

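4. Continuing to develop open-source-based artificial intelligence ("open-source AI").....Lu Shouqun 2024.10.15

5. Considerations for governing open foundation models.....Percy Liang et al., Stanford University 2024.10.11


...........................................Organized by COPU, incorporating remarks by Fei-Fei Li 2024.10.20

7. Human-level AI.....................................................Yann LeCun 2024.09.10

8. Comments on the "safe AI" issue raised in Yann LeCun's speech..........Lu Shouqun 2024.10.29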





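On September 24, COPU held a small seminar on the "Strawberry" o1 model and invited researcher Yang Yaodong of the Beijing Academy of Artificial Intelligence (BAAI) to take part, give the main presentation, and answer questions. Before the meeting, COPU prepared the following questions:



3) How should we understand the claim that the o1 model comprehensively surpasses GPT-4o or sets a new SOTA? Can the o1 model reduce the inherent defects of generative language models?


5) Under the post-training scaling laws, how does OpenAI's o1 model improve reasoning ability and long-horizon problem-solving ability?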



Level 1 reasoning capability: chatbots, language interaction,






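The seminar focused on the following issues:

1) Corpus data. When collecting corpus data, people at first chose data from everyday use. As the demand for data volume grew, they moved on to industry data and massive internet data. With the development of AI, these sources no longer suffice, and synthetic data began to be created. Synthetic data can meet the growing demand for corpus volume, but it also introduces the problem of data contamination. The defects of generative large language models are related to the machine's reliance on statistical techniques, and also to contamination of the corpus data.

2) The "post-training" era has arrived. In the past, work on large language models concentrated on pre-training, up through alignment. The o1 model developed by OpenAI has opened up "post-training" (to increase reasoning ability). According to an exclusive analysis by the Peking University alignment team, a new scaling law, the post-training scaling law, has emerged, and the post-training era has arrived. Reinforcement learning has become the technical foundation of the o1 model.

3) The technical foundation of o1. For post-training, between learning and search, o1 chooses learning; reinforcement learning has become the technical foundation of the o1 model. In terms of chains of thought, GPT-4 is fast thinking (it chooses search), whereas o1 is slow thinking (because it reasons).

Where does the o1 model surpass GPT-4o?

① On reasoning-dominated performance, o1 excels (in other words, the o1 model reaches a new level on complex reasoning, mathematics and coding problems, above the level of LLMs). On reasoning-dominated tasks such as mathematics and code, competitive programming, mathematical olympiads, and PhD-level physics, biology and chemistry exams, o1 is better than GPT-4o.

② On addressing the defects of large language models, o1 is better than GPT-4o.

Overall, o1 has strong reasoning ability but weak general ability. Compared with GPT-4o, its writing ability has not improved, and its instruction following is no better.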



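1) Post-training work still belongs to the train-time scaling stage. Like pre-training, it is analogous to the source-code and compilation stage of ordinary software, whereas o1's innovation lies mainly in test-time compute, which is analogous to the runtime stage. There is also the paper co-authored by Ilya, "Let's Verify Step by Step". Organizations with the resources should run more experiments: label the "because ..., therefore ..." steps in mathematics, assign rewards to correct and incorrect intermediate rationales, and generate chains of thought (CoT).

2) o1 should have a PRM verifier network that continually compares the sizes of rewards.

3) PRM = Process Reward Model.

4) "The post-training era has arrived" is clearly a mistaken judgment.

5) Post-training and inference are not the same; inference is "test-time compute".

6) More compute is going into inference scaling, not into post-training.

7) Inference is somewhat like what is usually called runtime.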
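The verifier idea in the points above can be sketched as follows. This is only a toy illustration, not o1's actual mechanism, which is not public: `toy_prm` is a hypothetical stand-in for a trained process reward model, and scoring a chain by its weakest step is an aggregation rule assumed here for simplicity.

```python
from typing import Callable, List, Sequence

Step = str
Chain = List[Step]


def toy_prm(step: Step) -> float:
    """Hypothetical process reward model for steps of the form 'a + b = c'.

    A real PRM would be a trained network scoring each intermediate step;
    this stand-in returns 1.0 for correct arithmetic and 0.0 otherwise.
    """
    try:
        left, right = step.split("=")
        a, b = left.split("+")
        return 1.0 if int(a) + int(b) == int(right) else 0.0
    except ValueError:
        return 0.0  # steps that cannot be parsed get no reward


def chain_score(chain: Sequence[Step], prm: Callable[[Step], float]) -> float:
    """Score a reasoning chain by its weakest step (assumed aggregation)."""
    return min((prm(s) for s in chain), default=0.0)


def best_of_n(chains: Sequence[Chain], prm: Callable[[Step], float]) -> Chain:
    """Test-time selection: keep the candidate chain with the highest score."""
    return max(chains, key=lambda c: chain_score(c, prm))


candidates = [
    ["17 + 25 = 42", "42 + 8 = 51"],  # second step is wrong
    ["17 + 25 = 42", "42 + 8 = 50"],  # every step checks out
]
best = best_of_n(candidates, toy_prm)
print(best)
```

Sampling more candidate chains and comparing their step-level rewards spends extra compute at inference time rather than at training time, which is the distinction drawn in points 5) to 7).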


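Yang did not agree with Han Xianping's view and recommended the article "OpenAI o1 Technical Analysis: The Era of Reinforcement-Learning 'Post-Training' Has Arrived":

Why do we need post-training scaling laws? Under the pre-training scaling laws, as model size grows, the marginal returns from scaling up parameters during pre-training begin to diminish. To substantially improve a model's reasoning ability and long-horizon problem-solving ability, RL-based post-training will be the next breakthrough point. One reason autoregressive models struggle to improve further on mathematical reasoning is that they cannot revise their own answers. Relying only on generative methods and larger parameter counts will not yield much gain on mathematical reasoning tasks, so additional scaling laws are needed.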

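Just then, Professor Huang Tiejun, chairman of BAAI, in support of our discussion of the o1 model, also forwarded an exclusive analysis by the Peking University alignment team (advisor: Yang Yaodong), "OpenAI o1 Opens the 'Post-Training' Era", as follows: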

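A new scaling law, the post-training scaling law, has emerged, and the post-training era has arrived. OpenAI o1 opens a new learning paradigm for the "post-training" era. OpenAI o1 has made significant progress in mathematics, coding and long-horizon planning. In 2023, DeepMind CEO Demis Hassabis emphasized using tree search to strengthen a model's reasoning ability, and tree search techniques are also used in training o1. In fact, the key to the technology used by OpenAI o1 is still the exploration and learning mechanism of reinforcement learning. Building on the reasoning ability an LLM already has, the model's ability to produce reasonable reasoning processes (rationales) is bootstrapped iteratively, and the rationales are incorporated into training so that the model learns to reason. Sufficient compute is then applied to achieve scaling in the post-training stage. Note that a reasonable reasoning process here is not just a decomposition of the problem and a preliminary answer; it also includes analysis of, and reflection on, why the answer is given that way.

There are three key technical points. 1. Post-training scaling laws have emerged, and they provide strong support for the success of the technical path described above. 2. What the model learns is the process of producing sound reasoning. The role of MCTS is to induce reasonable reasoning processes, or to construct fine-grained reward signals in the form of partial-order (preference) pairs, rather than to search directly for the process and the final answer. 3. Bootstrapping the model helps build new high-quality data, and the new rationale data further improves the model's capabilities. The release of OpenAI o1 is a strong demonstration of post-training scaling laws.

At midnight Beijing time on September 13, OpenAI released the o1 series of models, designed specifically to solve hard problems. The success of OpenAI o1 depends on reinforcement-learning training in the post-training stage and on increased thinking compute in the inference stage. The new scaling law, the post-training scaling law, may prompt the community to rethink compute allocation and post-training capability. OpenAI o1 has made great progress in complex reasoning such as mathematics and code. What enabled this leap in performance is the scaling of RL compute in the post-training stage and the scaling of thinking time at test-time inference. OpenAI o1 shows no significant improvement on some routine tasks; reasoning ability and strong instruction-following ability appear to have decoupled. Under the post-training scaling law, training-stage compute is no longer tied only to growth in parameter count. It also includes the compute for LLM inference during RL exploration, and the compute for reasoning and reflection at test time also affects the model's final performance. With more reinforcement learning (train-time compute) and more thinking time (test-time compute), o1's performance keeps improving. As the marginal benefit of parameter scaling diminishes, more compute should be shifted to the post-training and inference stages. The key to OpenAI's success lies in the sound use of reinforcement-learning exploration. Monte Carlo tree search (MCTS) alone is far from enough, because MCTS alone cannot teach the model how the parts of a problem relate to one another. Behind the implicit, automated CoT is a model that has genuinely learned reasonable intermediate reasoning processes (rationales). Model outputs are optimized through chain of thought (CoT), because generating the chain of thought helps strengthen the model's reasoning ability (it performs especially well on tasks such as mathematics and code generation).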

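On September 29, Han wrote: I now understand that by "post-training" they mean post (pre-train + post-train). The training stage encodes knowledge, after which the parameters are fixed and no longer adjusted. It would be better to say "the era of inference has arrived". Inference was also first proposed by Mr. Lu.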





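In 1991, AT&T Bell Labs (USL/USG) entered into cooperation with China and opened the source code of its newly developed UNIX version, UNIX SVR4.2, to the Chinese side (the first party in the world to obtain the UNIX source code). By then AT&T had already moved UNIX from open source to closed source. In 1992 the Chinese-American cooperation translated and published the Chinese edition of UNIX SVR4.2 and declared it open source (making it the first in the world to take UNIX SVR4.2 from a closed-source English edition to an open-source Chinese edition). Counting from then, open source in China has a history of 32 years to date.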









Polytechnic University of Valencia










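Professor José Hernández-Orallo, the study's corresponding author, said: "The reliability of language models does not match human perceptions of task difficulty. Models can solve PhD-level mathematics problems, yet at the same time they may get a simple addition wrong."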






Figure 1. Key indicators for the GPT, LLaMA and BLOOM models







Figure 2. Performance of GPT and LLaMA models as difficulty increases


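Lexin Zhou, the paper's first author, said: "This may lead users who initially rely too heavily on the models to be disappointed. Moreover, unlike in humans, the tendency to avoid giving an answer does not increase with difficulty. For example, humans tend to avoid giving feedback on problems beyond their ability. This leaves users responsible for spotting errors during their interactions with the model."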




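The study found a mismatch with human perceptions of difficulty. Yael Moros-Daval, one of the paper's authors, said: "Do models fail where we expect them to? We found that models tend to be less accurate on tasks that humans consider difficult, but even on simple tasks they are not 100% accurate. This means there is no safe zone in which models can be trusted to work perfectly."

Specifically, the raw GPT and LLaMA models are extremely sensitive to the choice of prompt, especially on simple tasks; with a well-chosen prompt, their performance improves. The shaped-up models are less prompt-sensitive and more stable, although some variability remains. Compared with the raw models, the shaped-up models are more stable to prompt variation and more often correct, but they are less concordant with human judgments of difficulty and less prudent.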


Figure 3. Scaling analysis of the LLaMA and BLOOM families and the non-instruct GPT models

Larger and more instructable language models become less reliable

Polytechnic University of Valencia

Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yael Moros-Daval, Cèsar Ferri & José Hernández-Orallo



The prevailing methods to make large language models more powerful and amenable have been based on continuous scaling up (that is, increasing their size, data volume and computational resources1) and bespoke shaping up (including post-filtering2,3, fine tuning or use of human feedback4,5). However, larger and more instructable large language models may have become less reliable. By studying the relationship between difficulty concordance, task avoidance and prompting stability of several language model families, here we show that easy instances for human participants are also easy for the models, but scaled-up, shaped-up models do not secure areas of low difficulty in which either the model does not err or human supervision can spot the errors. We also find that early models often avoid user questions but scaled-up, shaped-up models tend to give an apparently sensible yet wrong answer much more often, including errors on difficult questions that human supervisors frequently overlook. Moreover, we observe that stability to different natural phrasings of the same question is improved by scaling-up and shaping-up interventions, but pockets of variability persist across difficulty levels. These findings highlight the need for a fundamental shift in the design and development of general-purpose artificial intelligence, particularly in high-stakes areas for which a predictable distribution of errors is paramount.



Millions of people are using general-purpose artificial intelligence (AI) systems based on large language models (LLMs), which have become commonplace in areas such as education6, medicine7, science8,9 and administration10,11. As these models frequently make mistakes, users have to supervise model operation and manage their expectations, for the reliable use of these systems. With language models becoming larger and more instructable, we need to analyse how this reliability has evolved. Since the early LLMs12,13,14, models have been scaled up—trained with more parameters, on larger datasets and with longer training times—and have also been shaped up with human feedback—using techniques such as instruction fine tuning4, reinforcement learning from human feedback (RLHF)5 or output-filtering moderation techniques2,3.

It may be taken for granted that as models become more powerful and better aligned by using these strategies, they also become more reliable from a human perspective, that is, their errors follow a predictable pattern that humans can understand and adjust their queries to15. For instance, early models failed at simple additions such as ‘20 + 183’. Performance was highly predictable: failure was common. As a result, users easily understood that there was no operating range for this task: nobody used these models for addition. A few scaled-up and shaped-up generations later, the models not only seemingly master these additions but also successfully perform additions of 50 digits or more. Because of this prowess, people may start using them as calculators (for example, to convert measurements to different units16). It is only in such cases that users become disappointed when the model fails at a simple prompt such as ‘Add 3913 and 92’. The user-driven reliability is then seriously damaged, because the model fails when the user thinks these digits were in the operating range. The experience becomes even more baffling when the user gets the correct answer if the question is adjusted slightly, for example to ‘3913 + 92 =’, or if it is not changed at all—because many models are configured to be non-deterministic. Although this prompt sensitivity has been analysed extensively17,18,19,20, it is poorly understood why an over-diligent system spouts a wrong answer for 100-digit addition instead of simply answering ‘I’m afraid I can’t do that’. This reckless behaviour has been incentivized by developers building models that are ‘never evasive’21.

Reliability fluctuations

To understand the evolution of reliability, we analyse the trajectory of several families of LLMs: the generative pre-training (GPT) saga developed by OpenAI, the LLaMA series developed by Meta and the BLOOM suite developed by BigScience. GPT has led the state of the art in the past few years and, according to several surveys22,23,24, is central to the LLM ecosystem, influencing transformer-based architectures, training data, evaluation frameworks and alignment techniques. LLaMA25,26 is the best example of a family for which weights have been released, and BLOOM27,28 is the result of an even more open endeavour coming from the scientific community. Each family represents a genuine effort of making LLMs more capable and better aligned at the same time. Table 1 summarizes the details of models in these three families. Scaling (increasing the number of parameters, data size and compute) has been identified as a key predictor for overall performance1, and shaping (modifying the trained systems) has improved their instructability and alignment. This creates two categories. The first includes the ‘raw’ models—GPT-3 ada, babbage, curie and davinci—the non-chat LLaMA models and the base (non-z) BLOOM models. The second comprises the shaped-up models (or instruct or chat models), which incorporate some kind of instruction adaptation22, fine tuning or safety moderation of the outputs. For our analysis, it is convenient that BLOOM and LLaMA have six and three exactly paired versions, respectively, of raw and shaped-up models to disentangle scaling up from shaping up.

Table 1 Ten GPT, ten LLaMA and twelve BLOOM models 

https://www.nature.com/articles/s41586-024-07930-y/tables/1 (Full size table)

Figure 1 represents how some key indicators show that the shaped-up models (in blue) are more stable to prompt variation and are more correct, at the cost of being less concordant with human difficulty, and having more overall failures (less prudent). The indicators summarize the behaviour of five carefully selected benchmarks in the domains of simple numeracy (‘addition’), vocabulary reshuffle (‘anagram’), geographical knowledge (‘locality’), diverse scientific skills (‘science’) and information-centric transformations (‘transforms’). This covers a range of domains and degrees of open-endedness of the answers.


The raw models (yellow to orange) and the shaped-up models (light to dark blue) cluster differently. As the answers for all these models fall into three categories (correct, avoidant and incorrect), shortened as c, a and i, respectively, we have indicators for correctness versus avoidance + incorrectness, and prudence (correctness + avoidance) versus incorrectness. Looking at the correctness indicators (top half), which represent accurate responses, we see that the shaped-up models are more stable to prompt variations and are more frequently correct (higher correctness proportion) but are less concordant with human difficulty than the raw counterparts. Looking at the prudence indicators (bottom half), we see that the shaped-up models are also more stable to prompt variations, but fail more frequently (lower prudence proportion, by avoiding less) and are not much more concordant with human difficulty. Focusing only on the shaped-up models (in blue), we observe that the most powerful GPT-4 v.2, LLaMA-2-70b-chat and BLOOMz-176b models perform best in correctness proportion and prompting stability (top and bottom), but equal to or worse than other models for all the other indicators, with many fluctuations that do not indicate a clear positive trend in these other dimensions. Details of the indicators and data used for this plot are found in the Methods. Extended Data Table 1 provides a more detailed perspective on the same results.


We identify good intrinsic proxies for human difficulty based on relevant literature in the first two domains (‘addition’ and ‘anagram’), or by identifying demand-related features in the rest (excluding ‘science’, for which multiple human difficulty assessments were already available for all the instances29). To determine their quality, we conducted an extensive human study (S1) to assess which difficulty proxies best matched human expectations, and calibrate the proxies to a normalized difficulty score, ranging from 0 to 100, representing the anticipated percentage of failure for the ‘average human’. Systematically controlling for human difficulty is crucial for the understanding of user-driven reliability: human expectations of success depend on the perception of the difficulty of instances30,31,32. Table 2 provides an overview of the five benchmarks, the intrinsic difficulty function used as a proxy for human difficulty (discussed in the Methods), some examples and the calibrated human difficulty values for the given examples.

Table 2 Five benchmarks

https://www.nature.com/articles/s41586-024-07930-y/tables/2 (Full size table)

Another necessary and innovative element in our analysis is that we consider three categories for the responses: correct, incorrect and avoidant, denoted by c, i and a, respectively. Avoidance in human participants has been extensively explored in psychology33,34,35. Such avoidant behaviours include procrastination, deviation, making excuses or simply not answering. For LLMs, avoidance is also referred to as hedging, refusal3 or evasiveness21, including fortuitous utterances or continuations that are not answers (non-conforming), and those responses at the meta-level explaining why the question is not answered (for epistemic or ethical reasons). Supplementary Table 11 shows the types of avoidance for some tasks in the five benchmarks.
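As a rough illustration of the three response categories, the sketch below computes the two groupings used for the indicators in Fig. 1: correctness (c versus a + i) and prudence (c + a versus i). The exact indicator definitions are given in the Methods, so the plain ratios and the function name `outcome_proportions` here are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable


def outcome_proportions(labels: Iterable[str]) -> Dict[str, float]:
    """Return correctness and prudence proportions for labels in {'c', 'a', 'i'}."""
    counts = Counter(labels)
    total = counts["c"] + counts["a"] + counts["i"]
    if total == 0:
        raise ValueError("no labelled responses")
    return {
        "correctness": counts["c"] / total,               # c versus a + i
        "prudence": (counts["c"] + counts["a"]) / total,  # c + a versus i
    }


# Made-up labels for eight responses: 3 correct, 2 avoidant, 3 incorrect.
labels = ["c", "c", "a", "i", "i", "c", "a", "i"]
print(outcome_proportions(labels))
```

Under this reading, a model that replaces avoidant answers with incorrect ones keeps its correctness unchanged but lowers its prudence.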

Difficulty concordance, task avoidance and prompting stability must be regarded from the point of view of human users interacting with LLMs. Our human study S1 (see Supplementary Note 6) analyses whether human perceptions of difficulty in general are aligned with actual human performance and self-confidence, because this has important implications in the tasks humans decide to delegate to language models and their prompt formulation. But as crucial as the inputs are, so is the way the outputs from the model are used, verified or supervised. The context of use of both input and output determines how reliable the use of these systems is. We conducted a second human study S2 (see Supplementary Note 7), in which we explore whether human participants can accurately assess the outputs of models and thus compensate for different types of error. With a three-valued confusion matrix with correctness, avoidance and incorrectness, we can focus on the frequency of non-avoidant cases for which humans believe the output is correct but it is not (Fig. 3).

With this setup, we investigate three core and intertwined elements that affect the reliability of LLMs from a human perspective.

1. Difficulty concordance. Are errors more likely for items that humans perceive as difficult? Do scaling and shaping eliminate errors for easy items, thereby creating areas of reliable operation?

2. Task avoidance. How often do language models give plausible but wrong answers instead of safely avoiding answering questions? Are scaled-up, shaped-up models better at avoiding errors or making them detectable for humans?

3. Prompting stability. How are correctness and avoidance affected by tangential changes in the prompt? Are scaled-up, shaped-up models less sensitive to prompt variation across difficulty levels?

We will answer these questions by using human difficulty metrics for each benchmark (see Table 2), examining different kinds of avoidance (Supplementary Table 11), and using 15 natural prompt variations—prompts conceived as genuine instructions or questions provided by humans—per benchmark (Supplementary Tables 1 and 2). Difficulty, avoidance and prompting, as well as their evolution, have been analysed from different perspectives17,18,19,36,37,38,39 (see Supplementary Note 13 for a full discussion). Here we focus on the systemic interaction of these three elements from the perspective of LLM scaling and shaping up.


Figure 2 shows the results of a selection of models in the GPT and LLaMA families, increasingly scaled up, with the shaped-up models on the right, for the five domains: ‘addition’, ‘anagram’, ‘locality’, ‘science’ and ‘transforms’. We see that the percentage of correct responses increases for scaled-up, shaped-up models, as we approach the last column. This is an expected result and holds consistently for the rest of the models, shown in Extended Data Fig. 1 (GPT), Extended Data Fig. 2 (LLaMA) and Supplementary Fig. 14 (BLOOM family).


The values are split by correct, avoidant and incorrect results. For each combination of model and benchmark, the result is the average of 15 prompt templates (see Supplementary Tables 1 and 2). For each benchmark, we show its chosen intrinsic difficulty, monotonically calibrated to human expectations on the x axis for ease of comparison between benchmarks. The x axis is split into 30 equal-sized bins, for which the ranges must be taken as indicative of different distributions of perceived human difficulty across benchmarks. For ‘science’, the transparent yellow bars at the bottom represent the random guess probability (25% of the non-avoidance answers). Plots for all GPT and LLaMA models are provided in Extended Data Figs. 1 and 2 and for the BLOOM family in Supplementary Fig. 14.
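The per-bin breakdown described in this caption can be sketched as below. The sketch assumes that "equal-sized" means bins holding roughly equal numbers of instances, which the caption does not spell out; the function name and the sample data are made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def binned_proportions(
    items: Sequence[Tuple[float, str]], n_bins: int
) -> List[Dict[str, float]]:
    """Split (difficulty, label) pairs into near-equal-count bins, easiest first.

    Each returned dict gives the share of correct ('c'), avoidant ('a')
    and incorrect ('i') responses within one bin.
    """
    ordered = sorted(items, key=lambda pair: pair[0])
    n = len(ordered)
    result = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # fewer instances than bins: skip empty bins
        labels = [label for _, label in chunk]
        result.append({k: labels.count(k) / len(labels) for k in ("c", "a", "i")})
    return result


# Hypothetical instances: calibrated difficulty (0 to 100) and response label.
sample = [(5, "c"), (10, "c"), (40, "a"), (60, "i"), (80, "i"), (95, "i")]
for row in binned_proportions(sample, n_bins=3):
    print(row)
```

The figure uses 30 bins and averages over 15 prompt templates; three bins over a single template are used here only to keep the example readable.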

Let us focus on the evolution of correctness with respect to difficulty. For ‘addition’, we use the number of carry operations in the sum (f_cry). For ‘anagram’, we use the number of letters of the given anagram (f_let). For ‘locality’, we use the inverse of city popularity (f_pop). For ‘science’, we use human difficulty (f_hum) directly. For ‘transforms’, we use a combination of input and output word counts and Levenshtein distance (f_w+l) (Table 2). As we discuss in the Methods, these are chosen as good proxies of human expectations about what is hard or easy according to human study S1 (see Supplementary Note 6). As the difficulty increases, correctness noticeably decreases for all the models. To confirm this, Supplementary Table 8 shows the correlations between correctness and the proxies for human difficulty. Except for BLOOM for addition, all of them are high.
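Three of these proxies are simple enough to sketch directly: the carry count for ‘addition’, the letter count for ‘anagram’ and the Levenshtein distance used as one ingredient for ‘transforms’. The function names are hypothetical, the precise definitions are in the Methods, and the way word counts and edit distance are combined for ‘transforms’ is not reproduced because it is not stated here.

```python
def carry_count(a: int, b: int) -> int:
    """Number of carry operations when adding two non-negative integers."""
    carries, carry = 0, 0
    while a > 0 or b > 0:
        carry = 1 if (a % 10 + b % 10 + carry) >= 10 else 0
        carries += carry
        a //= 10
        b //= 10
    return carries


def letter_count(anagram: str) -> int:
    """Number of letters in the scrambled word."""
    return sum(ch.isalpha() for ch in anagram)


def levenshtein(s: str, t: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between s and t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete from s
                            curr[j - 1] + 1,            # insert into s
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]


print(carry_count(3913, 92))             # the 'Add 3913 and 92' example
print(letter_count("tnkiet"))
print(levenshtein("kitten", "sitting"))
```

These raw values are not yet on the 0 to 100 scale: the paper calibrates each proxy against the human study before comparing benchmarks.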

However, despite the predictive power of human difficulty metrics for correctness, full reliability is not even achieved at very low difficulty levels. Although the models can solve highly challenging instances, they also still fail at very simple ones. This is especially evident for ‘anagram’ (GPT), ‘science’ (LLaMA) and ‘locality’ and ‘transforms’ (GPT and LLaMA), proving the presence of a difficulty discordance phenomenon. The discordance is observed across all the LLMs, with no apparent improvement through the strategies of scaling up and shaping up, confirmed by the aggregated metric shown in Fig. 1. This is especially the case for GPT-4, compared with its predecessor GPT-3.5-turbo, primarily increasing performance on instances of medium or high difficulty with no clear improvement for easy tasks. For the LLaMA family, no model achieves 60% correctness at the simplest difficulty level (discounting 25% random guess for ‘science’). The only exception is a region with low difficulty for ‘science’ with GPT-4, with almost perfect results up to medium difficulty levels.

Focusing on the trend across models, we also see something more: the percentage of incorrect results increases markedly from the raw to the shaped-up models, as a consequence of substantially reducing avoidance (which almost disappears for GPT-4). Where the raw models tend to give non-conforming outputs that cannot be interpreted as an answer (Supplementary Fig. 16), shaped-up models instead give seemingly plausible but wrong answers. More concretely, the area of avoidance in Fig. 2 decreases drastically from GPT-3 ada to text-davinci-003 and is replaced with increasingly more incorrect answers. Then, for GPT-3.5-turbo, avoidance increases slightly, only to taper off again with GPT-4. This change from avoidant to incorrect answers is less pronounced for the LLaMA family, but still clear when comparing the first with the last models. This is summarized by the prudence indicators in Fig. 1, showing that the shaped-up models perform worse in terms of avoidance. This does not match the expectation that more recent LLMs would more successfully avoid answering outside their operating range. In our analysis of the types of avoidance (see Supplementary Note 15), we do see non-conforming avoidance changing to epistemic avoidance for shaped-up models, which is a positive trend. But the pattern is not consistent, and cannot compensate for the general drop in avoidance.

Looking at the trend over difficulty, the important question is whether avoidance increases for more difficult instances, as would be appropriate for the corresponding lower level of correctness. Figure 2 shows that this is not the case. There are only a few pockets of correlation and the correlations are weak. This is the case for the last three GPT models for ‘anagram’, ‘locality’ and ‘science’ and a few LLaMA models for ‘anagram’ and ‘science’. In some other cases, we see an initial increase in avoidance but then stagnation at higher difficulty levels. The percentage of avoidant answers rarely rises quicker than the percentage of incorrect ones. The reading is clear: errors still become more frequent. This represents an involution in reliability: there is no difficulty range for which errors are improbable, either because the questions are so easy that the model never fails or because they are so difficult that the model always avoids giving an answer.

We next wondered whether it is possible that this lack of reliability may be motivated by some prompts being especially poor or brittle, and whether we could find a secure region for those particular prompts. We analyse prompt sensitivity disaggregating by correctness, avoidance and incorrectness, using the prompts in Supplementary Tables 1 and 2. A direct disaggregation can be found in Supplementary Fig. 1, showing that shaped-up models are, in general, less sensitive to prompt variation. But if we look at the evolution against difficulty, as shown in Extended Data Figs. 3 and 4 for the most representative models of the GPT and LLaMA families, respectively (all models are shown in Supplementary Figs. 12, 13 and 15), we observe a big difference between the raw models (represented by GPT-3 davinci) and other models of the GPT family, whereas the LLaMA family underwent a more timid transformation. The raw GPT and all the LLaMA models are highly sensitive to the prompts, even in the case of highly unambiguous tasks such as ‘addition’. Difficulty does not seem to affect sensitivity very much, and for easy instances, we see that the raw models (particularly, GPT-3 davinci and non-chat LLaMA models) have some capacity that is unlocked only by carefully chosen prompts. Things change substantially for the shaped-up models, the last six GPT models and the last three LLaMA (chat) models, which are more stable, but with pockets of variability across difficulty levels.

Overall, these different levels of prompt sensitivity across difficulty levels have important implications for users, especially as human study S2 shows that supervision is not able to compensate for this unreliability (Fig. 3). Looking at the correct-to-incorrect type of error in Fig. 3 (red), if the user expectations on difficulty were aligned with model results, we should have fewer cases on the left area of the curve (easy instances), and those should be better verified by humans. This would lead to a safe haven or operating area for those instances that are regarded as easy by humans, with low error from the model and low supervision error from the human using the response from the model. However, unfortunately, this happens only for easy additions and for a wider range of anagrams, because verification is generally straightforward for these two datasets.


In the survey (Supplementary Fig. 4), participants have to determine whether the output of a model is correct, avoidant or incorrect (or do not know, represented by the ‘unsure’ option in the questionnaire). Difficulty (x axis) is shown in equal-sized bins. We see very few areas where the dangerous error (incorrect being considered correct by participants) is sufficiently low to consider a safe operating region.

Our observations about GPT and LLaMA also apply to the BLOOM family (Supplementary Note 11). To disentangle the effects of scaling and shaping, we conduct an ablation study using LLaMA and BLOOM models in their shaped-up versions (named chat and z, respectively) and the raw versions, with the advantage that each pair has equal pre-training data and configuration. We also include all other models with known compute, such as the non-instruct GPT models. We take the same data summarized in Fig. 1 (Extended Data Table 1) and perform a scaling analysis using the FLOPs (floating-point operations) column in Table 1. FLOPs information usually captures both data and parameter count if models are well dimensioned40. We separate the trends between raw and shaped-up models. The fact that correctness increases with scale has been systematically shown in the literature of scaling laws1,40. With our data and three-outcome labelling, we can now analyse the unexplored evolution of avoidance and incorrectness (Fig. 4, left).


As evident in Fig. 4, avoidance is clearly much lower for shaped-up models (blue) than for raw models (orange), but incorrectness is much higher. But even if correctness increases with scale, incorrectness does not decrease; for the raw models, it increases considerably. This is surprising, and it becomes more evident when we analyse the percentage of incorrect responses among those that are not correct (i/(a + i) in our notation; Fig. 4 (right)). We see a large increase in the proportion of errors, with models becoming more ultracrepidarian (increasingly giving a non-avoidant answer when they do not know, consequently failing proportionally more).
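The ratio i/(a + i) is easy to misread, so a small worked example may help. The counts below are invented purely for illustration and are not taken from the paper; the function name is likewise hypothetical.

```python
def incorrect_share_of_non_correct(avoidant: int, incorrect: int) -> float:
    """Return i / (a + i): the share of non-correct responses that are wrong answers."""
    if avoidant < 0 or incorrect < 0:
        raise ValueError("counts must be non-negative")
    if avoidant + incorrect == 0:
        raise ValueError("no non-correct responses to analyse")
    return incorrect / (avoidant + incorrect)


# Invented counts out of 100 responses each.
raw_model = incorrect_share_of_non_correct(avoidant=40, incorrect=20)     # 40 correct
shaped_model = incorrect_share_of_non_correct(avoidant=5, incorrect=35)   # 60 correct
print(raw_model, shaped_model)
```

In this made-up case the shaped-up model is correct more often (60 versus 40), yet the share of its failures that are confident wrong answers rises from one third to 0.875, which is the pattern the text calls ultracrepidarian.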

We can now take all these observations and trends into account, in tandem with the expectations of a regular human user (study S1) and the limited human capability for verification and supervision (study S2). This leads to a re-understanding of the reliability evolution of LLMs, organized in groups of two findings for difficulty discordance (F1a and F1b), task avoidance (F2a and F2b) and prompt sensitivity (F3a and F3b):

F1a—human difficulty proxies serve as valuable predictors for LLM correctness. Proxies of human difficulty are negatively correlated with correctness, implying that for a given task, humans themselves can have approximate expectations for the correctness of an instance. Relevance: this predictability is crucial as alternative success estimators when model self-confidence is either not available or markedly weakened (for example, RLHF ruining calibration3,41).

F1b—improvement happens at hard instances as problems with easy instances persist, extending the difficulty discordance. Current LLMs clearly lack easy operating areas with no error. In fact, the latest models of all the families are not securing any reliable operating area. Relevance: this is especially concerning in applications that demand the identification of operating conditions with high reliability.

F2a—scaling and shaping currently exchange avoidance for more incorrectness. The level of avoidance depends on the model version used, and in some cases, it vanishes entirely, with incorrectness taking important proportions of the waning avoidance (that is, ultracrepidarianism). Relevance: this elimination of the buffer of avoidance (intentionally or not) may lead users to initially overtrust tasks they do not command, but may cause them to be let down in the long term.

F2b—avoidance does not increase with difficulty, and rejections by human supervision do not either. Model errors increase with difficulty, but avoidance does not. Users can recognize these high-difficulty instances but still make frequent incorrect-to-correct supervision errors. Relevance: users do not sufficiently use their expectations on difficulty to compensate for increasing error rates in high-difficulty regions, indicating over-reliance.

F3a—scaling up and shaping up may not free users from prompt engineering. Our observations indicate that there is an increase in prompting stability. However, models differ in their levels of prompt sensitivity, and this varies across difficulty levels. Relevance: users may struggle to find prompts that benefit avoidance over incorrect answers. Human supervision does not fix these errors.

F3b—improvement in prompt performance is not monotonic across difficulty levels. Some prompts do not follow the monotonic trend of the average, are less conforming with the difficulty metric and have fewer errors for hard instances. Relevance: this non-monotonicity is problematic because users may be swayed by prompts that work well for difficult instances but simultaneously get more incorrect responses for the easy instances.

As shown in Fig. 1, we can revisit the summarized indicators of the three families. Looking at the two main clusters and the worse results of the shaped-up models on errors and difficulty concordance, we may rush to conclude that all kinds of scaling up and shaping up are inappropriate for ensuring user-driven reliability in the future. However, these effects may well be the result of the specific aspirations for these models: higher correctness rates (to excel in the benchmarks by getting more instances right but not necessarily all the easy ones) and higher instructability (to look diligent by saying something meaningful at the cost of being wrong). For instance, in scaling up, there is a tendency to include larger training corpora42 with more difficult examples, or giving more weight to authoritative sources, which may include more sophisticated examples43, dominating the loss over more straightforward examples. Moreover, shaping up has usually penalized answers that hedge or look uncertain3. That makes us wonder whether this could all be different.


In this paper, we have conducted two human studies. The first investigates perceived and actual difficulty for participants to respond to an input (to determine whether difficulty expectations are correlated with difficulty proxies). The second includes participants supervising or verifying the output of a model (to determine whether humans will take incorrect responses as correct). Maximizing difficulty concordance and reducing possible incorrect-to-correct errors in human verification could be introduced in the loss function when training and shaping up these models. For this, collective efforts are needed to build larger datasets of human difficulty expectations and output supervision. With these data, more qualified than traditional human feedback, AI itself can be used to train supervisors that perform this shaping up, provided the aim is not to eliminate evasiveness as in ref. 21, but to find the right level of avoidance. Specialized language models in medicine and other critical areas may be designed with reject options, or coupled with external AI supervisors, thereby favouring avoidance by teaching the AI models when to refrain from answering37. These interventions should make LLMs exhibit enhanced human-like and human-aligned characteristics that ensure reliability. Until this is done, and given the high penetration of LLM use in the general population, we raise awareness that relying on human oversight for these systems is a hazard, especially for areas for which the truth is critical.

Finally, we include some limitations of our analysis and the future work that emanates from them. The first limitation of our study lies in the recruitment of participants who are mostly non-experts. We have to take this into account when interpreting the calibrated difficulty values, which are usually high for some benchmarks, as there is a high number of questions that cannot be solved by the general population. However, our motivation was to capture the same human population to estimate expected instance difficulties that are comparable across all the datasets. A second limitation is that our sample of ‘natural’ prompts was collected from a diversity of sources, but we did not have access to the frequency in which a prompt may appear in a real scenario. Last, we have only covered a sample of families with specific trajectories, excluding LLMs that delegate tasks to external tools or use sophisticated reasoning techniques, which may show different dynamics. The GPT family has been at the forefront in performance and has been used over a few years, making OpenAI extremely influential in the development of other language models22,23. In fact, the OpenAI application programming interface has the most dependencies when the ecosystems of foundation models are analysed24. LLaMA and BLOOM have a more open and systematic lineup of models, not only allowing for the disentanglement between scaling and shaping but also paving the way for an incremental analysis of their evolution using our methodology and code, in the fast-changing context of LLM development. Highlighting the reliability issues of these families and introducing new abstractions and tools for analysis is of utmost importance, enabling other researchers to explore different pathways for the scaled-up, shaped-up models of the future.


We now explain our choices of benchmarks, prompt templates, difficulty functions, response scoring, general experimental design and the key metrics used to evaluate the models.

Benchmarks and factors of difficulty

For the generality of our analysis, we selected five distinct benchmarks to reduce confounding factors as much as possible: simple numeracy (‘addition’), vocabulary reshuffle (‘anagram’), geographical knowledge (‘locality’), basic and advanced science questions (‘science’) and information-centric transformations (‘transforms’). These represent core skills (numerical, linguistic and knowledge) and more diverse ecologically valid scenarios, with some of them having extremely simple formulations and others requiring deep understanding of the information presented, as well as the integration of data from multiple sources. Closed-ended questions are typical of LLM research3, such as those found in the ‘science’ benchmark, but gradually more open-ended tasks (‘addition’, ‘anagram’, ‘locality’ and ‘transforms’) better represent a wider and more realistic use of LLMs.

Addition. This benchmark involves sums, prompting the LLMs by asking for the result of adding two addends (such as ‘3 + 7 =’). The examples in our analysis range from 1- to 100-digit additions. Because language models can not only memorize small additions but also generalize to cope with any combination of larger digits, this task is appropriate for analysing difficulty trends. With respect to the difficulty of ‘addition’, the number of digits and carry operations affect human performance on addition tasks.

Anagram. The use of anagrams as a way of assessing aspects of problem solving dates back to 1916 (ref. 45), and researchers have been using anagrams to examine a variety of phenomena, such as the cognitive processes involved in problem solving46. An ‘anagram’ task is a word puzzle in which the participant or model is presented with a jumbled string of letters, and the objective is to find a word that can be formed using all the letters given. The examples in our analysis range from 3-letter words to 20-letter words. This task involves letter manipulation and good recall from an extensive vocabulary. One peculiar element of this task is that it is easy to verify. The difficulty of anagrams is mostly influenced by the frequency of the letters and the word, the number of letters and the degree of rearrangement required.

Locality. This benchmark contains questions relating to geographical knowledge, inspired by some cognitive models of distance estimation47. The examples in our analysis ask questions about the location and size of cities in relation to each other, by giving an input city and a randomly generated distance (d, ranging from 1 to 1,000 km). The LLM is asked to identify the most populous city (the target city) in a radius of d km from the input city. This task requires geographical knowledge and reasoning. For this benchmark, potential human difficulty factors could be the city or country popularity, their population and so on.

Science. This benchmark integrates multiple-choice questions from basic science as collected by OpenBookQA, complemented with more advanced science questions from Google-proof Q&A (GPQA). They represent tasks that LLMs are likely to encounter in educational, academic and research settings6,8,48, some of which require considerable time to solve. The included questions are Google-proof49. The ‘science’ benchmark, thus, includes questions of varying levels of difficulty, as determined by human judgement, providing a lens through which to examine their handling of complex, data-rich tasks in specific domains.

Transforms. This benchmark includes a comprehensive set of information-centric transformation tasks based on real-world scenarios. It focuses on domains that are most prevalent in the use of LLMs today50, and ensures that there is a ground truth for evaluation. We integrate not only many data-formatting tasks (a well-studied area in LLMs51) but also new tasks about world knowledge, information retrieval, advertising, administration, coding, scheduling and retailing. The outputs for ‘transforms’ may require extensive elaboration of the input (hundreds of characters) to form a correct answer, which can also be hundreds of characters long. The aim was to simulate, as closely as possible, the complexity and depth of real-world questions in a controlled experimental setting. For task difficulty, given the heterogeneity, the main factors are as general as character and word counts, and the Levenshtein distance between input and output as a proxy of transformation effort.
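To make the last difficulty proxy concrete, the following minimal sketch computes the Levenshtein distance between an input and an output string with the standard dynamic-programming recurrence. The function name is hypothetical and the code is not taken from the paper's repository; it only illustrates what the proxy measures.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `source` into `target`."""
    # previous[j] is the distance between the prefix of `source` processed
    # so far and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]
```

A larger distance between the prompt input and the expected output suggests that more editing is needed, which is why it can serve as a rough measure of transformation effort.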

For the previously described domains, we found intuitive human difficulty proxies, some of which have been developed in the literature. Supplementary Note 4 provides further details on the definition of difficulty metrics and the abilities behind the features used for their definition. Using the results from human study S1, we select the difficulty functions that are most correlated with human expectations (Supplementary Table 5): fcry for ‘addition’, flet for ‘anagram’, fpop for ‘locality’ and fw+l for ‘transforms’. For ‘science’, we blend and calibrate the two original human metrics into one, that is, fhum. For all the benchmarks, we normalize the original difficulty functions using a logistic mapping to a scale ranging from 0 to 100 that corresponds to the probability of human failure as estimated by humans themselves. We need to take into account that these values are an estimate (from the human sample in S1, of their expectations) and are fitted with a two-parameter logistic function; therefore, these values between 0% and 100% have to be interpreted with caution, especially for small differences (see Supplementary Note 8 for details). Nevertheless, having all the difficulty levels on the same human-expectations scale helps with the comparison of the benchmarks.
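The logistic normalization described above can be sketched as follows. This is only an illustration of a two-parameter logistic mapping onto a 0 to 100 scale: the parameter names `slope` and `intercept` are assumptions, and the fitted values used in the paper are not reproduced here.

```python
import math


def calibrated_difficulty(raw: float, slope: float, intercept: float) -> float:
    """Map a raw difficulty value to an estimated probability of human
    failure, expressed as a percentage between 0 and 100.

    `slope` and `intercept` are hypothetical names for the two parameters
    of the logistic function; they would be fitted to the human study data.
    """
    z = slope * raw + intercept
    # Use the numerically stable form of the logistic function for each sign.
    if z >= 0:
        return 100.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return 100.0 * e / (1.0 + e)
```

Because the mapping is monotonic for a positive slope, the ordering of instances by raw difficulty is preserved; only the scale changes.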

Data collection and generation

We first describe how the examples were collected or generated, and then the 15 prompt templates that were used for each of them.

Addition. We randomly generate 5,000 instances, in which each addend is sampled uniformly from 1 to 100 digits. We then remove those instances for which fhrm > 50 to prevent instances with similar or identical numbers of digits in both addends from dominating the upper difficulty bins. This is because, for example, if the difficulty is the harmonic mean, the bins with fhrm > 90 would be dominated by instances in which both addends have very high numbers of digits (that is, at least 82 digits). A similar phenomenon also occurs with other difficulty levels, but with the previous criterion considered, the problem is well mitigated. This results in a final sample of 3,142 instances.
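The generation and filtering step can be sketched as below. The sketch assumes that fhrm is the harmonic mean of the numbers of digits of the two addends (consistent with the 82-digit example above) and takes the stated cutoff of 50 literally; function names, the seed and the record layout are hypothetical, so the number of retained instances will not match the paper's 3,142.

```python
import random


def harmonic_mean(a: int, b: int) -> float:
    """Harmonic mean of two positive numbers."""
    return 2 * a * b / (a + b)


def random_number_with_digits(n_digits: int, rng: random.Random) -> int:
    """Sample an integer with exactly `n_digits` decimal digits."""
    low = 10 ** (n_digits - 1) if n_digits > 1 else 0
    return rng.randint(low, 10 ** n_digits - 1)


def generate_additions(n_instances=5000, max_digits=100, cutoff=50, seed=0):
    """Generate addition instances and drop those whose digit-count
    harmonic mean exceeds `cutoff` (assumed reading of the filter)."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_instances):
        d1 = rng.randint(1, max_digits)  # digits of the first addend
        d2 = rng.randint(1, max_digits)  # digits of the second addend
        if harmonic_mean(d1, d2) > cutoff:
            continue
        x = random_number_with_digits(d1, rng)
        y = random_number_with_digits(d2, rng)
        kept.append({
            "prompt": f"{x} + {y} =",
            "addends": (x, y),
            "digits": (d1, d2),
            "answer": x + y,
        })
    return kept
```

Python integers have arbitrary precision, so 100-digit addends and their sums are exact.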

Anagram. We use the Google Web Trillion Word Corpus52, containing the frequency of more than 300,000 most commonly used single words on the Web in English. From this corpus, we randomly choose up to 100 English words for each word length from 3 to 20 letters, resulting in a total of 1,570 words. There are fewer than 1,800 instances because fewer than 100 English words are available at lengths of 17–20 letters. Then, we shuffle the order of letters randomly to map these words into 1,570 anagrams. We make sure the resultant permutation is not the same as the original word.
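The shuffling step, including the check that the permutation differs from the original word, might look like the following sketch. The function name and the retry limit are assumptions made for illustration.

```python
import random


def make_anagram(word: str, rng: random.Random, max_tries: int = 100) -> str:
    """Return a random permutation of `word` that differs from `word`."""
    letters = list(word)
    # A word made of a single repeated letter has no different permutation.
    if len(set(letters)) < 2:
        raise ValueError("word has no permutation different from itself")
    for _ in range(max_tries):
        rng.shuffle(letters)
        candidate = "".join(letters)
        if candidate != word:
            return candidate
    raise RuntimeError("could not find a different permutation")
```

For example, `make_anagram("planet", random.Random(0))` returns some reordering of the six letters that is guaranteed not to be "planet" itself.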

Locality. We use the World Cities Database53, which provides an up-to-date database of the cities and towns globally. From this database, we first exclude cities with non-unique names across the globe. Next, we remove cities with more than one word or non-standard letters in the 26-character Latin alphabet (for example, Buenos Aires or Chŏngjin) to enhance the quality and ease of the response-scoring method. After the previous selection procedure, we seek to form a final sample that covers instances with different difficulty levels (or bins) as equally as possible. Thus, we perform binning on the difficulty function (fpop) to produce 100 bins in which we extract up to 50 instances from each bin randomly, resulting in a total of 2,341 instances. Again, there are fewer than 5,000 instances because some bins contain fewer than 50 instances.
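The binning and per-bin sampling can be sketched as follows. The sketch assumes equal-width bins over the observed range of the difficulty function, which the text does not state explicitly; all names are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_bin(instances, difficulty, n_bins=100, per_bin=50, seed=0):
    """Draw up to `per_bin` random instances from each difficulty bin.

    `difficulty` is a callable returning a number for an instance.
    Equal-width bins over the observed range are an assumption.
    """
    if not instances:
        return []
    rng = random.Random(seed)
    values = [difficulty(item) for item in instances]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all values are equal
    bins = defaultdict(list)
    for item, value in zip(instances, values):
        index = min(int((value - lo) / width), n_bins - 1)
        bins[index].append(item)
    sample = []
    for index in sorted(bins):
        members = bins[index]
        # Sparse bins contribute everything they have.
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample
```

With 100 bins and 50 instances per bin the upper bound is 5,000 instances, and sparse bins explain why the actual total is smaller.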

Science. This benchmark is built by integrating multiple-choice questions from educational settings: OpenBookQA29 and GPQA49. OpenBookQA is a collection of multiple-choice questions in basic science, based on 1,329 established facts. We randomly sampled 1,000 questions from OpenBookQA. To complement the benchmark with more advanced science questions, we included GPQA49—a dataset containing 546 graduate-level questions written by domain experts that challenge LLMs to demonstrate a deep understanding of biology, physics and chemistry. We exclude two lengthy questions that exceed the context window limit for some of the models that we analyse.

Transforms. This benchmark includes a comprehensive set of information-centric transformation tasks based on real-world scenarios. We integrate many data-formatting questions from a data-wrangling dataset51 and from a ‘natural instructions’ dataset54, manually regenerating or adapting some of them. We also introduce new tasks about world knowledge, information retrieval, advertising, administration, coding, scheduling and retailing, reflecting a wide range of real user interactions with language models. The benchmark integrates 73 different tasks, with 10 instances each, totalling 730 items.

Prompt generation

Notably, ‘addition’, ‘anagram’, ‘locality’ and parts of ‘transforms’ are newly introduced in this work. All five benchmarks are further supplemented with human data (see Supplementary Note 5) for calibrating difficulty levels and supervision, as well as a new variable describing the human-calibrated difficulty for each data instance.

Each example in each benchmark is run through an LLM using 15 different prompts, which are the same for all the examples in the benchmark. The generation of prompt templates aims to fulfil three requirements. First, the prompts should be as natural as possible, because we try to model a situation in which humans interact with LLMs in a similar way to how they would talk to other humans. Second, these prompts should be derived from or inspired by real-world sources, except for minor variations and adaptations. Third, we need to have sufficient coverage for and diversity of prompt templates, to robustly analyse sensitivity, omitting those that are too similar. This process results in 15 natural prompt templates for each benchmark, extracted from or inspired by textbooks, scientific literature, academic exams and the internet. Supplementary Note 2 describes further details about these prompt templates and their sources.

Response scoring

Scoring the validity of the responses of LLMs can be challenging, given that their raw text response can vary in different ways. For example, some responses are highly elaborate, whereas other responses are concise and straight to the point. Some responses are unrelated or digress from the proposed question, or are just excessively verbose, providing the answer in a larger response sequence surrounded by arbitrary information. Because our analysis uses three classes (correct, incorrect and avoidant), the confusion matrices have nine cells, making grading more challenging, and the traditional intuition and terminology of false positives, false negatives, sensitivity, specificity, precision and recall cannot be easily extended to these three-outcome situations. In Supplementary Note 13, we discuss how different groups of cells are named.

Manual scoring becomes infeasible due to the massive amount of answers we collect (approximately 4.2 million). Fortunately, despite the arbitrary responses of the models, they do exhibit a set of common patterns. We succeeded in scoring these responses using simple algorithmic conditions and regular expressions that provide great scoring accuracy (see Supplementary Note 3).
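As an illustration of this kind of rule-based grading, the sketch below scores a response to an ‘addition’ prompt into the three classes. The patterns and rules are hypothetical and much cruder than the conditions documented in Supplementary Note 3; they are shown only to convey the general approach.

```python
import re

# Hypothetical phrases that signal a refusal or a statement of ignorance.
AVOIDANCE_PATTERNS = [
    re.compile(r"\bI (?:do not|don't|cannot|can't) (?:know|answer|help)\b",
               re.IGNORECASE),
    re.compile(r"\bas an AI\b", re.IGNORECASE),
]
# A run of digits, optionally with thousands separators.
NUMBER = re.compile(r"\d[\d,]*")


def score_addition(response: str, expected: int) -> str:
    """Classify a response as 'correct', 'incorrect' or 'avoidant'."""
    numbers = [int(match.replace(",", "")) for match in NUMBER.findall(response)]
    if expected in numbers:
        return "correct"
    refuses = any(pattern.search(response) for pattern in AVOIDANCE_PATTERNS)
    # Assumption: a response that offers no number at all counts as avoidant.
    if refuses or not numbers:
        return "avoidant"
    return "incorrect"
```

A rule this simple would misgrade responses that merely echo an addend equal to the expected sum, which is one reason real grading conditions need to be more careful.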

Experimental setup

The LLMs are described in Table 1. All the models were queried with the temperature parameter set to zero and no system prompt. For local inference, we made use of a shared cluster of six nodes with 8× NVIDIA A40 48 GB graphics processing units. All local inferences were single node, made use of the Hugging Face Transformers and Accelerate libraries, and were without quantization of the models, with the exception of BLOOMz (see below). The total compute for all the experiments (including reruns and discarded results) is estimated to be about 100 compute days on a single 8× A40 node.

GPT: we used ten models from the GPT family (OpenAI)55. The first four models, GPT-3 ada, babbage, curie and davinci, are the original raw models in the family14. The subsequent three are the later and more powerful model variants (the InstructGPT versions of davinci called text-davinci-001, text-davinci-002 and text-davinci-003)5, which are shaped up by fine tuning with human feedback. The last three models are also fine-tuned with human feedback and further include a moderation post-filtering mechanism3. GPT-3.5-turbo was built as ‘gpt-3.5-0301’ (March 2023), and the two GPT-4 models differ in the time of their build (‘gpt-4-0314’ and ‘gpt-4-0613’). All these models were accessed through the public application programming interface (API). We used the ChatCompletion API.


LLaMA: we used four different scales of the first LLaMA version25: 7b, 13b, 30b and 65b. For LLaMA-2 (ref. 26), there is no 30b variant available, but we used all the other sizes (7b, 13b and 70b), including the corresponding chat variants, which incorporate various shaping techniques. All the inferences were run locally, except for LLaMA-65b, for which we used the Hugging Face API, and LLaMA-2 (non-chat), for which we used the Together.AI API.

BLOOM: we used the six different scales (560m to 176b) of the BLOOM27 and BLOOMz28 models, the latter of which was an update that added (multilingual) multitask fine tuning (also known as instruction tuning). As before, all the inferences on the small models were run locally. The biggest variant for BLOOM was run through the Hugging Face API. BLOOMz was run locally, but with NF4 quantization56 to fit into a single node.

The number of tokens was adjusted for the benchmark: ‘addition’ = 256, ‘anagram’ = 72, ‘locality’ = 132, ‘science’-OBQA = 72, ‘science’-GPQA = 384 for all the models, except for GPT-3.5 and GPT-4, which used 1,000 tokens. For ‘transforms’, we used the formula round(max(72,output_length)) × 3/4. All these numbers ensured that we could get long enough responses that include the answers for approximately 99% of instances and substantially reduce the cost. We used the default values for the stopping condition and the rest of the parameters.

Evaluation of models

For each difficulty function, we rank the data examples and separate them into 30 equal-sized bins based on their difficulty values. With this, we calculate bin-wise correctness, incorrectness and avoidance rates. Then, we plot these rates as a stacked bar chart (Fig. 2), for which we calculate the Spearman rank correlation (Supplementary Table 8). Similarly, we illustrate the prompt sensitivity of correctness, incorrectness and avoidance by plotting the performance of each individual prompt template for these dimensions across each model (Supplementary Figs. 12, 13 and 15).
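The binning and correlation steps can be sketched in plain Python as below. Outcomes are encoded as 'c' (correct), 'a' (avoidant) and 'i' (incorrect). The text does not state which two series enter the Spearman correlation, so correlating a bin-level series with the bin order is an assumption; all function names are hypothetical.

```python
from collections import Counter


def binwise_rates(difficulties, outcomes, n_bins=30):
    """Rank instances by difficulty, split them into equal-sized bins and
    return the correct/avoidant/incorrect rates of each bin."""
    order = sorted(range(len(difficulties)), key=lambda k: difficulties[k])
    n = len(order)
    rates = []
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not members:
            continue
        counts = Counter(outcomes[k] for k in members)
        rates.append({label: counts[label] / len(members) for label in "cai"})
    return rates


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation (Pearson correlation of the ranks).
    Undefined, and not handled here, when either series is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

For a model whose correctness falls as difficulty rises, `spearman(range(len(rates)), [r["c"] for r in rates])` is negative.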

Moreover, we delineate six reliability indicators for all the models in GPT (OpenAI), LLaMA (Meta) and BLOOM (BigScience) families (Fig. 1). There are three categories of answers: correct (c), avoidant (a) and incorrect (i). By separating correct from avoidant or incorrect (c vs a + i), the design or evaluation focus is put on accuracy, whatever damage the errors may do, but if correct or avoidant is placed against incorrect (c + a vs i), the focus is put on reliability. Instead of non-incorrect, we use the term prudent to refer to the group of correct or avoidant answers as a whole. Accounting for these groups, we have two versions for each of the following indicators.

Proportion: this measures the percentage of some of the groups of responses. In particular, the correctness proportion is the probability of giving a correct answer, that is, \({\mathbb{P}}({\bf{c}}\langle \,j,p\rangle )\), where j and p refer to an instance and a prompt for that instance, respectively, and c represents correctness. The prudence proportion is the probability of giving a prudent (non-incorrect) answer, that is, \({\mathbb{P}}(\neg {\bf{i}}\langle \,j,p\rangle )\), where i represents incorrectness.

Prompting stability: this is the probability that the answer to an instance remains in the same group after changing the prompt. Let us define \(s_{\bf{c}}\) as \({\mathbb{P}}({\bf{c}}\langle \, j,{p}^{{\prime} }\rangle | {\bf{c}}\langle \,j,p\rangle )\), where j refers to an instance, and p and p′ refer to two prompts for that instance (which are not necessarily different). This is simply the probability that, given an instance–prompt pair that is correct (sampling uniformly from all these positive pairs), we still get a correct answer if we sample another prompt. Similarly, we define \(s_{\neg {\bf{c}}}\) as \({\mathbb{P}}(\neg {\bf{c}}\langle \,j,{p}^{{\prime} }\rangle | \neg {\bf{c}}\langle \,j,p\rangle )\), and \(s_{\bf{i}}\) and \(s_{\neg {\bf{i}}}\) analogously for incorrectness. Finally, we define correctness prompting stability as \(0.5\,(s_{\bf{c}}+s_{\neg {\bf{c}}})\) and prudence prompting stability as \(0.5\,(s_{\bf{i}}+s_{\neg {\bf{i}}})\). It can be shown that these metrics go between 0.5 and 1; we scale them to go from 0 to 100.
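The proportion and prompting-stability indicators can be sketched as follows for a grid of outcomes, with one row per instance and one column per prompt, and labels 'c', 'a' and 'i'. The linear rescaling from [0.5, 1] to [0, 100] and the error raised when a group is empty are assumptions, as are all function names.

```python
def proportion(grid, in_group):
    """Percentage of instance-prompt pairs whose outcome is in the group."""
    cells = [label for row in grid for label in row]
    return 100.0 * sum(in_group(label) for label in cells) / len(cells)


def conditional_stability(grid, in_group):
    """P(outcome(j, p') in group | outcome(j, p) in group), with (j, p)
    drawn uniformly from the in-group pairs and p' drawn uniformly from
    the prompts of instance j (p' may equal p)."""
    numerator, denominator = 0.0, 0
    for row in grid:
        k = sum(in_group(label) for label in row)
        numerator += k * (k / len(row))
        denominator += k
    if denominator == 0:
        raise ValueError("no instance-prompt pair falls in this group")
    return numerator / denominator


def prompting_stability(grid, in_group):
    """Average of the in-group and out-of-group stabilities, rescaled
    linearly from [0.5, 1] to [0, 100] (assumed scaling)."""
    s_in = conditional_stability(grid, in_group)
    s_out = conditional_stability(grid, lambda label: not in_group(label))
    return (0.5 * (s_in + s_out) - 0.5) / 0.5 * 100.0


def is_correct(label):
    return label == "c"


def is_incorrect(label):
    return label == "i"


def is_prudent(label):
    return label != "i"
```

Here `prompting_stability(grid, is_correct)` gives the correctness version and `prompting_stability(grid, is_incorrect)` the prudence version, since the latter averages the stabilities of the incorrect and non-incorrect groups.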

Difficulty concordance: this measures the degree to which higher difficulty implies lower quality of results. We will use the generality metric introduced in ref. 57, as it aligns precisely with the concept of difficulty concordance. Technically, generality is a non-parametric metric that measures how much the mass of success conforms to a step function. If success were distributed like a descending logistic curve, generality would be equal to the maximum slope of a descending curve, that is, the steeper the slope, the higher the generality metric gets, and thus has a higher level of difficulty concordance. A model being good for all instances up to a given difficulty and then bad for more difficult instances would have perfect concordance. Therefore, this is not the same as correlation (see Supplementary Table 8). Again, we define two versions, namely, correctness difficulty concordance (which calculates the generality for the correct answers) and prudence difficulty concordance (which calculates the generality for the prudent (non-incorrect) answers). We transform it with x/(x + 1) × 100 to get a value between 0 and 100. For ‘science’, we discount 25% of non-avoidant responses to account for random guesses.

We propose that researchers use these six reliability metrics for the initial analysis of the reliability of any existing or future LLM. In Fig. 1, we do this by averaging the values procured from the five benchmarks to provide a succinct summary of the reliability fluctuations of the three families (detailed data are shown in Extended Data Table 1).

Following the advice in ref. 58, we strongly recommend that these metrics are always accompanied by a detailed analysis and breakdown of results, as we have done in this paper with the other plots.

Inclusion and ethics

The ethical committee of the Universitat Politècnica de València (UPV) approved the present work. We conducted two human studies in which we recorded the perceived and actual difficulty that participants have when solving some tasks (S1) and scoring the tasks solved by LLMs (S2). The studies were performed using surveys implemented in the Concerto platform. The users were recruited by using the Prolific platform. All participants provided written informed consent on enrolment. They received compensation at a rate of £9 per hour. In this work, we used LLMs, which are trained on very different sources of data and may have important ethical consequences, such as generating incorrect responses that look plausible. The domains used in our experiments and the examples included in the manuscript do not generate any specific ethical issue. We only use examples and prompts in English.

Data availability

All data, including existing and newly created datasets, prompts, model responses, grading (manual and automatic) and the human study data (questions and responses) are available on Zenodo at https://doi.org/10.5281/zenodo.12794511 (ref. 59). To hinder data contamination from automated web scraping, the relevant data files are provided as a password-encrypted zip file, for which the access code is also provided in the repository. Source data are provided with this paper.

Code availability

All code, including for data analysis, human study, plotting, algorithmic grading conditions and interacting with language models, is available on Zenodo at https://doi.org/10.5281/zenodo.12794511 (ref. 59) and on GitHub at https://github.com/wschella/llm-reliability.


1. Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://arxiv.org/abs/2001.08361 (2020).

2. Markov, T. et al. A holistic approach to undesired content detection in the real world. In Proc. AAAI Conference on Artificial Intelligence 15009–15018 (PKP Publishing Services, 2023).

3. OpenAI. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).

4. Chung, H. W. et al. Scaling instruction-finetuned language models. J. Mach. Learn. Res. 25, 1–53 (2024).

5. Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730–27744 (2022).








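25 July 2024. In developing generative AI based on large language models (LLMs), OpenAI CEO Sam Altman abandoned the company's original open-source commitment and turned to a closed-source strategy. Some people in China and abroad (including some highly educated people), either because they do not fully understand what open source means or because they have been influenced by Altman's "closed-source AI", also lean towards developing "closed-source AI".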

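In 2015, to overcome bottlenecks in AI development, the four major AI powerhouses in the United States, namely Google, Microsoft, Facebook (now Meta) and IBM, open-sourced all of the AI frameworks, platforms, engines, tools, algorithms, source code and projects they had developed. Google, for example, open-sourced more than 200 projects comprising 20 million lines of code, including: the TensorFlow framework; the Android operating system, middleware and some important applications; Angular, a JavaScript and web application framework; Bazel, a tool for reproducible builds; Brotli, a compression algorithm; Chromium, a browser engine; and Go, a compiled, concurrent, garbage-collected programming language.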

On 20 July 2016, Jeff Dean, senior vice president and chief AI scientist at Google, answered a question from a Forbes reporter:


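Jeff Dean: Conventional science develops slowly and holds back company innovation. Open source speeds up technological progress, clears development bottlenecks, improves maintenance and stability, facilitates real-time exchange and collaboration with the outside world, and helps to build up and attract volunteer developers and maintainers.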

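Many leading figures in open source and AI explicitly support "open-source AI". Turing Award winner and AI pioneer Yann LeCun said: open-source AI builds an open future. In his talk "Human-Level AI" at the Hudson Forum hosted by IBM, he said: this AI platform must be open source. Meta CEO Mark Zuckerberg said in a speech that Meta is committed to "open-source AI", that Meta's Llama model is the Linux of the AI world, and that "open-source AI" is the path forward for AI and can produce the most powerful models.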

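When OpenAI released the closed-source GPT-4o, Meta persisted in releasing the open-source Llama 3.1 (405B version), which surpassed GPT-4o at the time. Google persisted in releasing the open-source Gemini, which caused a stir in the multimodal field, and launched the Android 15 OS with a built-in AI core. Kyutai, a French start-up supported by Turing Award winner and AI pioneer Yann LeCun, developed the open-source Moshi model to challenge the closed-source GPT-4o, and surpassed GPT-4o after only six months of development.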



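On 2 May 2024, MIT president Sally Kornbluth, in a conversation with Altman, asked why he had adopted a closed-source policy. Altman sidestepped the question with an evasive answer, saying that OpenAI already provides free AI tools (in GPT-3.5).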

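Former Google CEO Eric Schmidt, speaking at Stanford University's computer science school, answered a student's question about the debate between open-source and closed-source AI: "Which do you, or your company, favour?" Schmidt replied: "The debate in our industry between 'open-source AI' and 'closed-source AI' is very intense. My entire career has been built on people being willing to share open source. Everything I have done is connected to open source, and much of the infrastructure at Google, where I used to work, is open source. In developing AI, perhaps because the investment cost is so huge and the software development workload so enormous, adopting open source is indeed a very suitable way for AI to solve problems."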


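Recently, Elon Musk said of OpenAI: Altman and I are both founders of OpenAI, and I was the one who came up with the company's name (which reflects its open-source nature). Later Altman adopted a closed-source strategy and changed the nature of OpenAI. To date, Altman holds only US$1 million in stock, a "little finger". Through his cooperation with Microsoft, OpenAI can only become a subsidiary of Microsoft (editor's note: a cooperation agreement may not yet have been reached, in which case OpenAI has not yet concluded a funding agreement with its backers, Musk or Microsoft).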




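François Chollet, known as the father of Keras (the deep learning framework) and an AI researcher at Google, commented that Altman's closed-source strategy single-handedly changed the rules of the game and caused frontier research on large language models to become fully closed, which is very sad. Previously, all the latest research results were shared; now frontier research is no longer published and has become entirely closed. Altman's approach has set back progress towards artificial general intelligence by several years, possibly five to ten. What Altman is doing now looks more like a detour on the road to artificial general intelligence.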

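Open-source leader Jim Zemlin, executive director of the Linux Foundation, holds that for large language models (AI) to be fairer and safer, LLMs (AI) and every stage of their development must be open and transparent. Open-source leader Brian Behlendorf, a founder of the Apache Software Foundation, said: "Many people around the world, including developers and policymakers, are concerned and worried about the future of AI, and there is much discussion about the potential and risks of AI. People worry that hackers may use AI technology to cause more harm, even though these technologies also bring many benefits. I believe that, globally, only through the joint efforts of the many partners in our open-source community can we respond to potential harms and properly resolve the safety problems that AI may raise."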



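Twenty-one leading AI scientists and experts from around the world jointly signed the "Beijing International Consensus on AI Safety". Professor Stuart Russell of the University of California, Berkeley, holds that, "on the basis of this consensus, and in particular before the development of artificial general intelligence with autonomous systems surpasses humans, humanity should draw red lines that prevent it from escaping human control." COPU's view is: "We need to study further what role open source plays in drawing this red line, and whether applied AI should put safety first, be synchronized globally, and combine technology with governance." As early as March 2023, when Altman adopted the closed-source strategy, COPU sensed that the "four bigs" (big parameter counts, big compute, big energy consumption and big investment) could pose a huge challenge to the development of AI. Which of "open-source AI" and "closed-source AI" would clear this hurdle more easily? After reflection and calculation, we concluded that open source, being open, shared and collaborative, would prove more resilient in clearing it.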


Percy Liang, Rishi Bommasani et al.



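The authors of this article are ten university scholars, including Percy Liang, director of Stanford University's Center for Research on Foundation Models, and researcher Rishi Bommasani. Their paper, "Considerations for Governing Open Foundation Models", was published in Science (11 October 2024, vol. 386, issue 6718).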








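Open foundation models are reminiscent of open-source software, but they differ. Machine learning models depend on datasets as well as code, which makes them fundamentally different from most software. The Open Source Initiative's standard definition of open-source software prohibits restrictions on specific users or use cases, whereas open foundation models often include such restrictions. Meta restricts use of its Llama 3.1 models to entities with fewer than 700 million monthly active users, and other organizations use Open and Responsible AI licenses, which contain use restrictions. These differences have led to claims that leading AI companies are "open-washing": providing model weights while not following the principles of open-source software (4).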











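Foundation models are general-purpose technologies that can significantly increase the pace of innovation. Notably, foundation models boost economic and scientific productivity; Bloomberg Intelligence projects that generative AI will become a US$1.3 trillion market by 2032. Open foundation models are essential for research on several topics, such as interpretability, watermarking, security and efficiency. Overall, open foundation models are more customizable and provide deeper access, which are key factors in fostering greater innovation.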
























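Open text-to-image models appear to pose unique risks related to non-consensual intimate imagery (NCII) and child sexual abuse material (CSAM), because they lower the barrier to generating such content. The safeguards of closed models are more effective in this respect: monitoring closed models can prevent users from generating such images, especially images of real people. An important dataset used to train open text-to-image models has been found to contain a large amount of CSAM, which points to upstream interventions, such as filtering of training data, to mitigate this risk (15). Whether policy interventions targeting downstream platforms (such as CivitAI and social media companies) are more effective against AI-generated NCII and CSAM remains an open question. Organizations responsible for combating NCII and CSAM, such as the National Center for Missing & Exploited Children, may benefit from additional resources and support to deal with AI-generated CSAM.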






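Because the distinction between open and closed foundation models is based on release, policies that penalize certain uses of foundation models may have differential impacts. Some proposals, such as SB 1047 introduced in the California State Senate and the US AI Act framework introduced in the US Senate, impose liability for downstream uses of foundation models, including derivatives obtained by fine-tuning a foundation model. These proposals aim to introduce penalties for releasing unsafe models that, once modified, could catalyse misuse. However, liability for downstream harms could chill the open foundation model ecosystem, because it exposes open foundation model developers to serious liability risk. By contrast, because closed foundation model developers have more control over downstream use, some have already offered liability protection to downstream users (for example, Google indemnifies users of its generative AI products against copyright claims). Although clarifying or increasing liability for downstream use may have benefits, these legislative proposals expose open foundation model developers to a broad and hard-to-control liability surface.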








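Although releasing a foundation model does not require releasing the underlying data used to build it, some developers choose to release both model weights and training data. Of the 10 major foundation model developers assessed in the 2023 Foundation Model Transparency Index, the two that disclosed their data also released their foundation models openly. Many other open foundation model developers tend to disclose their data. However, the public release of data exposes these entities to greater liability risk, as shown by the lawsuits brought against Stability AI over its use of datasets from the non-profit Large-scale Artificial Intelligence Open Network (LAION). Although the legality of training foundation models on copyrighted data remains unclear in many jurisdictions, the status quo creates perverse incentives: model developers who transparently disclose and openly provide their data face greater risk than those who hide the data they use, even when the underlying facts are the same. Given these perverse incentives, government-mandated disclosure of training data may be beneficial in some cases.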






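1. P. Liang, R. Bommasani, K. Creel, R. Reich, "The time is now to develop community norms for the release of foundation models" (Stanford Institute for Human-Centered Artificial Intelligence, 2022).

2. E. Seger et al., "Open-sourcing highly capable foundation models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives" (Centre for the Governance of AI, 2023).

3. I. Solaiman, "The gradient of generative AI release: Methods and considerations", in Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (ACM, 2023), pp. 111–122.

4. M. White et al., https://arxiv.org/abs/2403.13784 (2024).

5. G. Schryen, Commun. ACM 54, 130 (2011).

6. R. Bommasani et al., https://arxiv.org/abs/2310.12941 (2023).

7. S. Kapoor et al., "On the societal impact of open foundation models" (International Conference on Machine Learning, 2024).

8. J. A. Goldstein et al., https://arxiv.org/abs/2301.04246 (2023).

9. B. Paris, J. Donovan, "Deepfakes and cheap fakes: The manipulation of audio and visual evidence" (Data & Society, 2019).

10. E. H. Soice et al., https://arxiv.org/abs/2306.03809 (2023).

11. T. Patwardhan et al., "Building an early warning system for LLM-aided biological threat creation" (OpenAI, 2024).

12. C. A. Mouton, C. Lucas, E. Guest, "The operational risks of AI in large-scale biological attacks: A red-team approach" (RAND Corporation, 2024).

13. M. A. Ferrag et al., https://arxiv.org/abs/2307.06616v1 (2023).

14. J. Hazell, https://arxiv.org/abs/2305.06972 (2023).

15. D. Thiel, "Identifying and eliminating CSAM in generative ML training data and models" (Stanford Digital Repository, 2023).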





The topics for this discussion were raised by Percy Liang's research team at Stanford University. COPU summarizes them as follows:



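Question ① is already addressed in "Considerations for Governing Open Foundation Models", which Percy Liang's research team published in Science (COPU has published excerpts of this article).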

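Question ② was also raised by Percy Liang's team. The principle behind their proposed solution is to focus on finding empirical evidence; if a risk proves real, they recommend formulating effective policies or developing defensive tools.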


Fei-Fei Li: AI regulatory policy must encourage innovation.







Quoting Arthur Mensch, CEO of Mistral AI: open-source models carry no risk; I see only benefits.



Yann LeCun











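So if you train a system to do this, you show it a piece of text and ask it to predict the next word or next token in the text. You can then use the system to predict the next word, shift that word into the input, predict the second word, shift it into the input, predict the third word, and so on. This is autoregressive prediction. It is what LLMs do. It is not a new concept; it has been around since Claude Shannon, going back to the 1950s, which is a long time ago. What has changed is that we now have enormous neural network architectures that we can train on huge amounts of data, and certain properties seem to emerge from this. But autoregressive prediction has some major limitations: there is no real reasoning here in the usual sense. Another limitation is that it only works for data that comes in the form of discrete objects, symbols, tokens, words, things you can discretize.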
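The loop described here can be illustrated with a toy next-token predictor. The sketch below uses bigram counts, which condition only on the last token, whereas an LLM conditions on the whole preceding context with a neural network; the function names and the greedy choice of the most frequent successor are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for every token, which tokens follow it in the training text."""
    table = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        table[current][following] += 1
    return table


def generate(table, prompt, n_new):
    """Autoregressive generation: predict the next token, append it to the
    input, and repeat."""
    sequence = list(prompt)
    for _ in range(n_new):
        candidates = table.get(sequence[-1])
        if not candidates:
            break  # the last token was never seen with a successor
        next_token = candidates.most_common(1)[0][0]  # greedy choice
        sequence.append(next_token)  # shift the prediction into the input
    return sequence
```

Training on `"the cat sat on the mat the cat ran".split()` and calling `generate(table, ["on"], 2)` extends the prompt one token at a time, each prediction depending on what was generated before it.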




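So one reason for this may be the following. An LLM is typically trained on 20 trillion tokens. A token is basically, on average, for a typical language, about three-quarters of a word. So that is 1.5 × 10^13 words. Each token is typically about 3 bytes, so that is 6 × 10^13 bytes. It would take any of us several hundred thousand years to read through this; it is basically the total of all publicly available text on the internet.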
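The arithmetic in this passage can be checked with a few lines. The token count, words per token and bytes per token come from the talk; the reading speed of 250 words per minute and the 8 hours of reading per day are assumptions introduced here only to reproduce the order of magnitude of the reading-time claim.

```python
def corpus_size(tokens=20 * 10**12, bytes_per_token=3):
    """Return (words, bytes) for a corpus, at three-quarters of a word
    per token as stated in the talk."""
    words = tokens * 3 // 4
    total_bytes = tokens * bytes_per_token
    return words, total_bytes


def years_to_read(words, words_per_minute=250, hours_per_day=8):
    """Years needed to read `words` words.
    Reading speed and daily hours are assumed values, not from the talk."""
    minutes = words / words_per_minute
    days = minutes / 60 / hours_per_day
    return days / 365


words, total_bytes = corpus_size()
reading_years = years_to_read(words)
```

Under these assumptions the result lands in the hundreds of thousands of years, in line with the estimate in the talk.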





















Joint Embedding Predictive Architecture (JEPA): a new hope










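There is a second set of methods, called distillation-style methods, which work in mysterious ways. If you really want a clear explanation of why they work, you should ask Sylvain Ghouli, who is sitting right there; he has a paper on this. Personally, I do not understand it, but it does work. It consists of updating only one half of the architecture, not backpropagating gradients through the other half, and then sharing the weights in an interesting way. There are many papers on this, and if you want to train a fully self-supervised system to learn good representations of images, this works as well as any other method. The corruption of the image is done by masking.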



Open-source AI: building an open future








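When artificial intelligence (AI) reaches the stage of artificial general intelligence (AGI), highly autonomous systems will emerge. As AI systems copy and learn from one another, their rapidly growing intelligence may surpass that of humans and may therefore pose an existential threat to humanity. To guard against this, many leading AI scientists have called for research on "safe AI". Professor Stuart Russell of the University of California, Berkeley, has pointed out that safe AI should be built before AGI gives rise to highly autonomous systems. In 2024, 21 leading AI scientists and experts from around the world issued the "Beijing International Consensus on AI Safety", which proposes an AI safety red line: "AI should not autonomously replicate or modify itself to produce autonomous systems without human assistance." Academician Zhang Bo of Tsinghua University has pointed out that AI is created by humans; without human help, the intelligence of AI will always remain below human intelligence, and humans should regulate AI. Yann LeCun's proposal of "human-level AI" is also one way of building safe AI.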

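On 10 September 2024, AI pioneer Yann LeCun gave a talk entitled "Human-Level AI" at the Hudson Forum hosted by IBM. What is human-level AI? This concept is of great significance for whether humans will be able to monitor and control AI in the future.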





