
An Introduction to Med-PaLM 2, Google's Latest Medical AI Large Language Model

Posted 2023-05-30 02:50 | Personal category: Technology | System category: Research notes

Recently, Google Health presented its latest medical AI work at its annual event, The Check Up. This post focuses on one of those results, the large language model Med-PaLM 2: the first LLM to reach "expert" test-taker level on the MedQA dataset of US Medical Licensing Examination (USMLE) questions, with an accuracy above 85%, and the first AI system to reach a passing score, 72.3%, on the MedMCQA dataset, which includes questions from India's AIIMS and NEET medical entrance exams.
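To make these headline numbers concrete: MedQA and MedMCQA are multiple-choice datasets, and the reported score is simply the fraction of questions for which the model picks the gold answer option, compared against the exam's pass mark. Below is a minimal, self-contained Python sketch of that calculation; the `MCQItem` structure, the toy data, and the 60% pass threshold are illustrative assumptions, not Google's evaluation code.

```python
# Illustrative only: how "exam accuracy" is typically computed for
# multiple-choice benchmarks such as MedQA (USMLE-style) or MedMCQA.
# The items and predictions below are made up.

from dataclasses import dataclass

@dataclass
class MCQItem:
    question: str
    options: dict[str, str]   # e.g. {"A": "...", "B": "..."}
    answer: str               # gold option key, e.g. "B"

def accuracy(items: list[MCQItem], predictions: list[str]) -> float:
    """Fraction of questions where the predicted option key matches the gold key."""
    correct = sum(pred == item.answer for item, pred in zip(items, predictions))
    return correct / len(items)

if __name__ == "__main__":
    # Toy data standing in for a real benchmark split.
    items = [
        MCQItem("Q1 ...", {"A": "opt", "B": "opt"}, "A"),
        MCQItem("Q2 ...", {"A": "opt", "B": "opt"}, "B"),
        MCQItem("Q3 ...", {"A": "opt", "B": "opt"}, "B"),
        MCQItem("Q4 ...", {"A": "opt", "B": "opt"}, "A"),
    ]
    preds = ["A", "B", "A", "A"]      # hypothetical model outputs
    acc = accuracy(items, preds)
    PASS_MARK = 0.60                  # USMLE-style pass mark is often around 60%
    print(f"accuracy = {acc:.1%}, pass = {acc >= PASS_MARK}")
```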

HealthX's take: many startups and investors believe that, now that large language models such as ChatGPT have emerged, large models specialized for particular vertical domains will form new moats and become the next wave of opportunity. Weighing technical difficulty, business models, and competitive barriers/moats together, HealthX thinks this is unlikely; OpenAI founder Sam Altman thinks so too. We will share the detailed reasoning at our VIP members' reading club. Here is one of Sam's views for reference:

Transcript of the OpenAI CEO's MIT interview: racing on model scale is hard, but vertical fine-tuning is not the next big opportunity

HealthX would argue that, from one angle, the modern AI revolution was already launched when Turing Award winner Yann LeCun built automatic ZIP-code recognition for US postal envelopes, automating the mail-sorting workflow and helping standardize the MNIST handwritten-digit dataset. (Remember how the genuine need to record license plates and entry times at parking lots once created so many jobs? And now?) Yet people only marvel, "it's here, it's really here!", once a visible threat arrives. While you are still debating GPT's impact on healthcare, medical large models have already arrived. As Google engineers put it in a blog post:

From these examples and more, it is clear that the healthcare industry has moved from testing AI to deploying it to improve workflows, solve business problems, and speed up recovery. With that in mind, we expect interest in and adoption of generative AI technology to grow rapidly. Healthcare organizations are eager to understand generative AI and how it can be used to make a real difference.

https://cloud.google.com/blog/topics/healthcare-life-sciences/sharing-google-med-palm-2-medical-large-language-model



HealthX has gathered two of Google's introductory videos on Med-PaLM 2 below for reference; for more on Med-PaLM 2, see Google's blog post and arXiv preprint linked here. At our June VIP member event we will focus on applications of large models in healthcare, biology, and other health-related fields.

  • https://sites.research.google/med-palm/

  • https://cloud.google.com/blog/topics/healthcare-life-sciences/sharing-google-med-palm-2-medical-large-language-model

  • https://arxiv.org/pdf/2212.13138.pdf





The era of medical AI has arrived: learn it quickly and use it to empower yourself, or be quickly left behind by it.


Dr. Alan Karthikesalingam of Google Health presents the Med-PaLM 2 model

Video transcript (English original)


Alan: Hi, I'm Alan, and I lead a research team exploring AI's potential for improving healthcare. Over the past five years at Google Health, our research has shown that AI can augment a clinician's ability to detect breast cancer. It can help people better understand their skin conditions, and it can help researchers sequence genomes more accurately than ever. Today, we are publishing research in Jama Network Open on how AI can uncover brand new medical knowledge. Here, our AI research revealed a tissue morphology feature that predicts the survival of patients with colorectal cancer. And clinicians in this research used this feature to derive new insights for their patients. Our work in this field has taught us that AI on its own cannot solve all of healthcare's problems. Medicine, after all, is about caring for people. Data and algorithms must be combined with language and interaction, empathy and compassion. What makes us healthy is complicated. It's specific to geography and it's influenced by social drivers. We believe it is imperative to actively work to include diverse experiences, perspectives, and expertise when you're building AI systems. To bring this vision forward, we're exploring how AI models in medicine can use language and interactivity to be more effective, more helpful, and safer. Late last year, we took our first step towards rethinking conversational AI systems in medicine with Med-PaLM, a large language model designed to provide high quality and authoritative answers to medical questions. We built Med-PaLM by instruction prompt tuning PaLM. PaLM is a 540 billion parameter large language model from Google Research. We did this work with a small set of carefully curated medical expert demonstrations. In multiple choice questions used for US medical licensing exams, the pass mark for new doctors is often around 60%. These questions have long been considered a grand challenge for AI systems. They require a clinician to recall medical knowledge and apply logic to identify the correct answer. Despite years of effort from leading AI labs around the world, performance on challenging tasks like this has plateaued at around 50%. Last December, our model, Med-PaLM, was the first AI system to exceed the pass mark. We reached a performance of over 67% on these licensing exam style questions. We also carefully examined how Med-PaLM performed in many other kinds of medical question answering tasks. These ranged from commonly asked internet search questions to complicated questions about medical research. In doing this work, we compared Med-PaLM's answers with answers from real clinicians, and we looked at several aspects like factual accuracy, bias, and the potential for harm. You can see one common question here, and how Med-PaLM answers this question, which is about incontinence. In this case, Med-PaLM's answer is generally sound, but it's not as comprehensive as the answer given by the clinician. The clinician here names multiple specific causes of incontinence, as you can see, where Med-PaLM is less comprehensive. And doctors rating this answer from Med-PaLM agreed. They found it generally accurate and safe, but you can see here that they highlighted that there's room for improvement in the level of detail our system provided. You can see from this sort of work that we're still learning. 
One interesting aspect of this kind of work is that the evaluating physicians' rating of this answer may also change depending on their own clinical expertise and their experience in this subject area. In this next example, Med-PaLM's answer is complementary to the clinician's. Med-PaLM mentioned similar information, but in different ways. And it makes this lovely point about how the severity of symptoms can vary depending on the type of pneumonia and the overall health of the person. Doctors rated this answer as very high quality across our rating framework. We believe it's really important to innovate responsibly by doing this kind of rigorous research in healthcare. Today, we're announcing results from Med-PaLM 2, our new and improved model. Med-PaLM 2 has reached 85% accuracy on the medical exam benchmark in research. This performance is on par with expert test takers. It far exceeds the passing score, and it's an 18% leap over our own state of art results from Med-PaLM. Med-PaLM 2 also performed impressively on Indian medical exams, and it's the first AI system to exceed the passing score on those challenging questions. There are many ways that an AI system like Med-PaLM can be a building block for advanced natural language processing in healthcare, and we'd like to work with researchers and experts to advance this work. The potential here is tremendous. But it's crucial that real world applications are explored in a responsible and ethical manner.
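In the transcript above, Alan mentions that Med-PaLM was built by instruction prompt tuning PaLM with a small set of curated medical expert demonstrations. For readers unfamiliar with prompt tuning, the sketch below illustrates the general mechanism: a handful of continuous "soft prompt" vectors are trained while the underlying model stays frozen. The toy model, dimensions, and random "demonstration" data are illustrative assumptions only; this is not PaLM's or Med-PaLM's actual architecture or training code.

```python
# A minimal, self-contained sketch of the general idea behind prompt tuning:
# only a small set of "soft prompt" vectors is trained while the base language
# model stays frozen. Everything here is a toy stand-in, not Google's setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, PROMPT_LEN = 100, 32, 8

class ToyLM(nn.Module):
    """Stand-in for a large, already-pretrained language model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, input_embeds):
        hidden, _ = self.rnn(input_embeds)
        return self.head(hidden)           # per-position token logits

base_lm = ToyLM()
for p in base_lm.parameters():             # freeze the "pretrained" model
    p.requires_grad_(False)

# Learnable soft prompt: PROMPT_LEN continuous vectors prepended to every input.
soft_prompt = nn.Parameter(torch.randn(PROMPT_LEN, DIM) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-2)

def step(token_ids, target_ids):
    """One prompt-tuning step on a (toy) expert demonstration."""
    tok_embeds = base_lm.embed(token_ids)                       # (B, T, DIM)
    prompt = soft_prompt.unsqueeze(0).expand(token_ids.size(0), -1, -1)
    inputs = torch.cat([prompt, tok_embeds], dim=1)             # prepend prompt
    logits = base_lm(inputs)[:, PROMPT_LEN:, :]                 # drop prompt slots
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), target_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()            # gradients reach only the soft prompt vectors
    optimizer.step()
    return loss.item()

# Toy "demonstration": random token ids standing in for curated expert answers.
x = torch.randint(0, VOCAB, (4, 16))
y = torch.randint(0, VOCAB, (4, 16))
for _ in range(3):
    print("loss:", round(step(x, y), 3))
```

The design choice being illustrated is that the demonstrations adapt the model's behaviour through a tiny number of trained parameters, which is how a small curated expert dataset can steer a very large frozen model.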



Google Research Byte: frontline researchers explain Med-PaLM 2

We believe large language models have the potential to revolutionize healthcare and benefit society. Med-PaLM is a large language model that we've taken and tuned for the medical domain. Medical question answering has been a research grand challenge for several decades, but to date the progress has been kind of slow. But then over the course of the last three to four months, first with Med-PaLM and Med-PaLM 2, we have kind of broken through that barrier. Unlike previous versions, Med-PaLM 2 was able to score 85% on the USMLE medical licensing exam. Yeah, this is immensely exciting because people have been working on medical question answering for over three decades. And finally we are at a stage where we can say with confidence that AI systems can now at least answer USMLE questions as good as experts. So the way we started with Med-PaLM 2 was really to take PaLM 2, which was Google's most advanced language model and then adapt it to the medical domain. To train the Med-PaLM 2 model we worked with a panel of clinicians across the US, the UK, and India. We took a representative set of answers from this panel of clinicians and then tuned the model to produce answers that look more like those answers. And from there we used this panel of clinicians and their judgements to kind of evaluate whether these models were performing better across a set of human values, including things like low likelihood of medical harm, alignment with scientific consensus, precision, and a lack of bias. One limitation of existing work was that there was no standard way to evaluate a large language model tuned for the medical domain. So we introduced MultiMedQA, which is a benchmark for large language models in the medical domain, which spans consumer research questions, medical exam questions, and also consumer medical questions. To better encode a set of ethical principles into the Med-PaLM 2 model, what we've done is a bunch of adversarial testing so that we can take the model, test it in scenarios that it might not have been kind of originally intended for and make sure that its outputs are aligned with our values. We are opening up access to these models through Google Cloud and we hope to gather feedback from our partners and use that to further improve and refine the models. There's also the notion of capabilities that we need to further add to these models. One aspect that we are very excited about is multimodal, where a model is not only able to understand text, but also interpret your medical record, or understand and interpret your medical images, such as your CT scans or maybe even genomics data, or protein sequence data. So I think the potential is immense. And as long as we do this safely and responsibly, but also boldly, I think we are going to have a lot of impact in the world. There is still a lot of research to be done, but I am very optimistic that we can get there.
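The researchers above describe evaluating model answers with a clinician panel along several axes, such as low likelihood of medical harm, alignment with scientific consensus, and lack of bias. Below is a minimal sketch of how such panel ratings might be aggregated per axis; the rating scale, the toy data, and the simple mean aggregation are assumptions for illustration, not Google's actual human-evaluation protocol.

```python
# Illustrative only: aggregating clinician-panel ratings along axes like those
# named in the transcript. Axis names come from the transcript; the 0..1 scale,
# the data, and the aggregation are assumptions made for this sketch.

from collections import defaultdict
from statistics import mean

# Each record: (answer_id, axis, rating), ratings normalized to 0..1,
# where 1 is the desirable end of the axis (e.g. "no harm", "no bias").
panel_ratings = [
    ("a1", "low_likelihood_of_harm", 1.0),
    ("a1", "aligned_with_consensus", 0.8),
    ("a1", "no_evidence_of_bias",    1.0),
    ("a2", "low_likelihood_of_harm", 0.6),
    ("a2", "aligned_with_consensus", 0.9),
    ("a2", "no_evidence_of_bias",    0.7),
]

def aggregate(ratings):
    """Mean rating per evaluation axis across all rated answers."""
    by_axis = defaultdict(list)
    for _answer_id, axis, score in ratings:
        by_axis[axis].append(score)
    return {axis: mean(scores) for axis, scores in by_axis.items()}

if __name__ == "__main__":
    for axis, score in aggregate(panel_ratings).items():
        print(f"{axis}: {score:.2f}")
```

Keeping each axis as its own score, rather than collapsing everything into a single number, mirrors how the transcript describes judging the model "across a set of human values".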




