kelvinryq820430's personal blog: http://blog.sciencenet.cn/u/kelvinryq820430


[Repost] Multiple Endpoints in Clinical Trials - FDA

Posted 2024-4-30 18:38 | Personal category: Clinical Trials | System category: Paper Exchange | Source: Repost

Multiple Endpoints in Clinical Trials


TABLE OF CONTENTS

I. INTRODUCTION

II. BACKGROUND AND SCOPE
A. Demonstrating the Study Objective of Effectiveness
B. Type I Error
C. Multiplicity

III. MULTIPLE ENDPOINTS: GENERAL PRINCIPLES
A. The Hierarchy of Families of Endpoints
1. Primary Endpoint Family
2. Secondary and Exploratory Endpoint Families
3. Selecting and Interpreting the Endpoints in the Primary and Secondary Endpoint Families
B. Type II Error Rate and Sample Size
C. Types of Multiple Endpoints
1. When Demonstration of Treatment Effects on Two or More Distinct Endpoints Is Recommended to Establish Clinical Benefit (Co-Primary Endpoints)
2. When Demonstration of a Treatment Effect on at Least One of Several Primary Endpoints Is Sufficient
3. Composite Endpoints
4. Multi-Component Endpoints
5. Clinically Critical Endpoints Too Infrequent for Use as a Primary Endpoint
D. The Individual Components of Composite and Multi-Component Endpoints
1. Evaluating and Reporting the Results of Composite Endpoints
2. Evaluating and Reporting the Results on Other Multi-Component Endpoints

IV. METHODOLOGICAL CONSIDERATIONS

V. SUMMARY

VI. GENERAL REFERENCES

APPENDIX: STATISTICAL METHODS
1. The Bonferroni Method
2. The Holm Procedure
3. The Hochberg Procedure
4. Prospective Alpha Allocation Scheme
5. The Fixed-Sequence Method
6. Resampling-Based, Multiple-Testing Procedures
7. Gatekeeping Testing Strategies
8. Graphical Approaches Based on Sequentially Rejective Tests
 

Multiple Endpoints in Clinical Trials


 

I.         INTRODUCTION

 

This guidance provides sponsors and review staff with the Agency’s thinking about the problems posed by multiple endpoints in the analysis and interpretation of study results and how these problems can be managed in clinical trials for human drugs, including drugs subject to licensing as biological products. Most clinical trials performed in drug development contain multiple endpoints to assess the effects of the drug and to document the ability of the drug to favorably affect one or more disease characteristics. When more than one endpoint is analyzed in a single trial, the likelihood of making false conclusions about a drug’s effects with respect to one or more of those endpoints could increase if there is no appropriate adjustment for multiplicity. The purpose of this guidance is to describe various strategies for grouping and ordering endpoints for analysis of a drug’s effects and applying some well-recognized statistical methods for managing  multiplicity within a study to control the chance of making erroneous conclusions about a drug’s effects. Basing a conclusion on an analysis where the risk of false conclusions has not been appropriately controlled can lead to false or misleading representations regarding a drug’s   effects.


 

 II.        BACKGROUND AND SCOPE


 

Efficacy endpoints are measures designed to reflect the intended effects of a drug. They include assessments of clinical events (e.g., mortality, stroke, pulmonary exacerbation, venous thromboembolism), symptoms (e.g., pain, dyspnea, symptoms of depression), measures of function (e.g., ability to walk or exercise), or surrogate endpoints that are reasonably likely or expected to predict a clinical benefit.

 

Because most diseases can potentially cause more than one clinical event, symptom, and/or altered function, many trials are designed to examine the effect of a drug on more than one aspect of the disease. In some cases, efficacy cannot be adequately established based on a single disease aspect, and the study should use either an endpoint that incorporates multiple aspects of  the disease into a single endpoint or effects should be demonstrated on multiple endpoints. In other cases, an effect on any of several endpoints could be sufficient to support approval of a marketing application.


 

Failure to account for multiplicity when there are several endpoints evaluated in a study can increase the chance of false conclusions regarding the effects of the drug. The regulatory concern regarding multiplicity arises principally in the evaluation of clinical trials intended to demonstrate effectiveness supporting drug approval and claims in FDA-approved labeling;  however, this issue is important for trials throughout the drug development process. For instance, if safety outcomes are to be assessed via hypothesis testing, they would be subject to the multiplicity considerations described in this guidance. Multiplicity problems for safety analyses that are not part of a prespecified set of hypotheses for formal statistical testing are outside the   scope of this guidance.


 

In the following sections, the issues of multiple endpoints and methods to address them are discussed. The issues of multiplicity and methods that apply to multiple endpoints also generally apply to other sources of multiplicity, including other estimand attributes (e.g., multiple doses, time points, or study population subgroups); however, these other sources of multiplicity will not be specifically addressed in this guidance. Furthermore, there may be different considerations related to multiplicity in certain unique settings, such as the evaluation of multiple different drugs for a single disease in a master protocol, that are not addressed in this guidance. This guidance focuses on the analysis and interpretation of multiple endpoints within a single clinical trial.

 

A.        Demonstrating the Study Objective of Effectiveness


 

A conclusion that a study has demonstrated an intended effect of a drug is critical to meeting the legal standard for substantial evidence of effectiveness required to support approval of a new drug (i.e., “… adequate and well-controlled investigations … on the basis of which it could fairly and responsibly be concluded … that the drug will have the effect it purports … to have …”) (section 505(d) of the FD&C Act). FDA regulations further establish that to be adequate and well controlled, a clinical study of a drug must include, among other things, “an analysis of the results of the study adequate to assess the effects of the drug,” a requirement that furthers the “purpose of conducting clinical investigations of a drug,” which is “to distinguish the effect of a drug from other influences, such as spontaneous change in the course of the disease, placebo effect, or biased observation.” There are also other important factors (e.g., clinical relevance of the endpoint and estimated effect, relevant external information) that are considered in evaluating substantial evidence of effectiveness beyond the results of hypothesis tests in a single trial. A more general discussion of demonstrating substantial evidence of effectiveness can be found in other FDA guidance documents and is outside the scope of this document.

 

Hypothesis testing is commonly used to address the uncertainty in the assessment of a treatment effect on a chosen endpoint. This approach begins with stating the relevant hypotheses for a chosen endpoint. In the simplest situation where the aim is to demonstrate the superiority of a test drug over control, two mutually exclusive hypotheses are specified for the endpoint in advance of conducting a clinical trial:


 

 •   One hypothesis, the null hypothesis, states that there is no treatment effect on the chosen endpoint.


 

•   The other hypothesis is called the alternative hypothesis and posits that there is at least some treatment effect of the test drug.


 

This pair of hypotheses is tested using a prespecified statistical test to determine whether the trial results are sufficiently unlikely under the null hypothesis so that the null hypothesis can be rejected in favor of the alternative hypothesis. Note that if the null hypothesis is not rejected, it does not necessarily mean that the null hypothesis is true. There are many other potential reasons that could lead to a failure to reject the null hypothesis, such as insufficient sample size.

 

Sometimes (e.g., in some vaccine trials), demonstration of an effect of at least some minimum   size is considered critical for approval of a drug. In this case, if formal statistical testing is used for the demonstration, the null hypothesis might be modified to incorporate the smallest clinically meaningful effect that could be accepted.

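The two situations described above (superiority over control, and superiority by at least a minimum effect size) can be written compactly. This is only an illustrative sketch: the symbols $\delta$ (the true treatment effect on the chosen endpoint, with larger values favoring the test drug) and $\delta_{\min}$ (the smallest clinically meaningful effect) are notation assumed here, not notation used in the guidance.

```latex
% Superiority of the test drug over control (one-sided form):
H_0:\ \delta \le 0 \qquad \text{versus} \qquad H_1:\ \delta > 0

% Null hypothesis modified to incorporate a minimum clinically meaningful effect:
H_0:\ \delta \le \delta_{\min} \qquad \text{versus} \qquad H_1:\ \delta > \delta_{\min},
\qquad \delta_{\min} > 0
```

In both cases the prespecified test rejects $H_0$ only if the observed data would be sufficiently unlikely were $H_0$ true.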

 

 This guidance focuses on a statistical framework based on hypothesis testing. Sponsors should  discuss early with FDA plans to use other approaches (e.g., Bayesian approaches) for a specific development program such as for pediatrics.


 

B.        Type I Error

 

The rejection of the null hypothesis supports the study conclusion that there is a difference between treatment groups but does not constitute absolute proof that the null hypothesis is false. There is always some possibility of mistakenly rejecting the null hypothesis when it is, in fact, true. Such an erroneous conclusion is called a Type I error. For an endpoint, the probability of falsely rejecting its null hypothesis and, thus, concluding that there is a treatment effect due to the drug on this endpoint when, in fact, there is none, is called the Type I error probability or Type I error rate for this endpoint. The significance level, denoted as alpha (α), is the threshold below which the Type I error rate should be controlled. Null hypothesis rejection is based on a determination that the probability of observing a result at least as extreme as the result of the study assuming the null hypothesis is true (the p-value) is sufficiently low (usually no larger than α).

 

The alternative hypothesis can be one-sided or two-sided, and statistical tests are performed accordingly. For two-sided hypothesis statistical tests, the Type I error probability refers to the probability of concluding that there is a difference (beneficial or harmful) between the drug and control when there is no difference. For one-sided hypothesis tests, the Type I error probability refers to the probability of concluding specifically that there is a beneficial difference due to the drug when there is not. The most widely used values for α are 0.05 for two-sided tests and 0.025 for one-sided tests. In the case of two-sided tests, an α of 0.05 means that the probability of falsely concluding that the drug differs from the control in either direction (benefit or harm) when no difference exists is no more than 5%, or 1 chance in 20. In the case of one-sided tests, an α of 0.025 means that the probability of falsely concluding a beneficial effect of the drug when none exists is no more than 2.5%, or 1 chance in 40. Use of a two-sided test with an α of 0.05 that allocates the α symmetrically to each side generally also ensures that the probability of falsely concluding benefit when there is none is no more than approximately 2.5% (1 chance in 40). These Type I error rates are correct if the statistical test is appropriate. If there are issues with the statistical test (e.g., the underlying assumptions do not hold), the Type I error rate could be even larger.
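The equivalence between a two-sided α of 0.05 and a one-sided α of 0.025 can be checked numerically. The sketch below assumes a test statistic that follows a standard normal distribution under the null hypothesis (an assumption made for illustration; the guidance does not prescribe a particular test), and the function name `critical_value` is hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def critical_value(alpha: float, two_sided: bool) -> float:
    """Return the z threshold a standard normal test statistic must exceed
    (in the favorable direction) for rejection at significance level alpha."""
    # A symmetric two-sided test places alpha / 2 in each tail.
    tail_probability = alpha / 2 if two_sided else alpha
    return STANDARD_NORMAL.inv_cdf(1 - tail_probability)


z_two_sided = critical_value(0.05, two_sided=True)
z_one_sided = critical_value(0.025, two_sided=False)

# Probability of falsely concluding benefit with the two-sided 0.05 test.
false_benefit_probability = 1 - STANDARD_NORMAL.cdf(z_two_sided)

print(f"two-sided 0.05 threshold: {z_two_sided:.4f}")
print(f"one-sided 0.025 threshold: {z_one_sided:.4f}")
print(f"chance of a false favorable conclusion: {false_benefit_probability:.4f}")
```

Both thresholds coincide, which is why the two conventions give the same protection against a false conclusion of benefit when α is split symmetrically.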

 

FDA’s concern for controlling the Type I error probability is to minimize the chances of a false favorable conclusion for any primary or secondary endpoints (see section III.), regardless of which and how many of these endpoints in the study have no effect. The Type I error probability associated with testing multiple endpoints of a study is called overall Type I error probability.

 

The rationale for controlling this probability is given in the next subsection (section II.C.). When there is more than one primary or secondary endpoint, it is important to ensure that the evaluation of multiple hypotheses will not lead to inflation of the study’s overall Type I error probability (or rate) relative to the planned level. To control the Type I error rate, it is critical that sponsors prospectively specify the following:


 

 •   all endpoints in the primary and secondary families (see section III. for definitions).


 

 •   all data analyses that will be performed to test hypotheses about the prespecified endpoints, regardless of whether they are considered primary or secondary.


 

 For a study with multiple endpoints, the analysis plan should describe the testing procedure for the hypotheses being tested with a proper control of overall Type I error rate.


 

C.        Multiplicity

 

In a clinical trial with a single endpoint tested at two-sided α = 0.05, the probability of finding a   difference between the treatment group and a control group in favor of the treatment group when no difference exists in the population is 0.025 (a 2.5% chance). That is, there is a 97.5% chance   of appropriately not finding a favorable effect if there is no true effect for this endpoint. By contrast, if there are two independent endpoints, each tested at two-sided α = 0.05, and if success on either endpoint by itself would lead to a conclusion of a drug effect, the chance of appropriately not finding a favorable effect on both endpoints together is thus 0.975 * 0.975,  which is approximately 0.95, and so the probability of falsely finding a favorable effect on at least one endpoint is approximately 0.05. Thus, the overall Type I error rate in favor of the drug nearly doubles when two independent endpoints are tested. This higher-than-intended overall Type I error rate when multiple tests are conducted without adjustment is called the multiplicity problem. Thus, without correction for multiplicity, the chance of making a Type I error for this  example study as a whole would rise to approximately as high as 5% in favor of the drug, and,   therefore, the overall Type I error rate would not be adequately controlled. The problem is exacerbated when more than two endpoints are considered. For example, for three independent endpoints, the Type I error rate is 1 - (0.975 * 0.975 * 0.975), which is about 7%. For ten independent endpoints, the Type I error rate is about 22%. If the multiple endpoints are    correlated, the overall Type I error rate is also inflated but potentially by a lesser degree.

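The arithmetic in the paragraph above can be reproduced with a few lines of code. This is a minimal sketch under the same assumptions the text states: the endpoints are independent, none has a true effect, each is tested at two-sided α = 0.05 (so 0.025 in the favorable direction), and success on any one endpoint counts as success. The function name `overall_type_i_error` is hypothetical.

```python
def overall_type_i_error(alpha_per_endpoint: float, n_endpoints: int) -> float:
    """Probability of at least one false favorable finding among
    n_endpoints independent endpoints, each tested without adjustment."""
    # Chance of correctly finding no favorable effect on every endpoint.
    chance_no_false_finding = (1 - alpha_per_endpoint) ** n_endpoints
    return 1 - chance_no_false_finding


# 0.025 is the favorable-direction share of a two-sided alpha of 0.05.
for n_endpoints in (1, 2, 3, 10):
    rate = overall_type_i_error(0.025, n_endpoints)
    print(f"{n_endpoints:>2} endpoint(s): overall Type I error rate = {rate:.3f}")
```

For positively correlated endpoints this formula overstates the inflation; as the text notes, the overall rate is still inflated but potentially to a lesser degree.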

 

 Even when a single outcome variable is being assessed, if multiple facets of that outcome are analyzed (e.g., multiple dose groups, multiple time points, or multiple subject subgroups based  on demographic or other characteristics) and if any one of the analyses is used to conclude that  the drug has been shown to produce a beneficial effect, the multiplicity of analyses may cause    inflation of the Type I error rate. Hence, by inflating the Type I error rate, multiplicity produces uncertainty in interpretation of the study results such that the conclusions about whether effectiveness has been demonstrated in the study become unreliable. There are various approaches that can be planned prospectively and applied to maintain the overall Type I error rate at 2.5% or below.


 

For controlling multiplicity, an important principle is to first prospectively specify all planned endpoints, time points, analysis populations, doses, and analyses; then, once these factors are specified, appropriate adjustments for multiple endpoints and analyses can be selected, prespecified, and applied, as appropriate. Changes in the analytic plan to perform additional analyses can reintroduce a multiplicity problem that can negatively impact the ability to interpret the study’s results unless these changes are made prior to data analysis and appropriate multiplicity adjustments are performed. The statistical analysis plan should not be changed after unmasking of treatment assignments and performing statistical analyses.

 

A focus of this guidance is control of the Type I error rate for the prespecified set of endpoints (i.e., primary and secondary endpoints) of a clinical trial to ensure that the major findings of a clinical trial are well supported, and the effects of the drug have been demonstrated. Analyses that explicate the characteristics of an effect on an endpoint that has been demonstrated—such as time of onset, distribution of effect sizes across the population, effects in subgroups, and effects   on the components of a composite endpoint—are all descriptive to provide a deeper understanding of the nature of that endpoint finding, and do not extend to effects outside of that endpoint. These descriptive analyses can be considered for inclusion in the FDA-approved labeling without presenting p-values.


 

 Of note, there is not always a clear-cut distinction between an analysis closely related to a major finding and one that demonstrates additional effects. Therefore, when definitive conclusions are  to be drawn, such analyses should be prespecified and appropriately included in the prespecified multiple-testing strategy. A descriptive analysis that is not included in the prespecified multiple-testing strategy should not be presented in FDA-approved labeling in ways that imply a statistically rigorous conclusion or convey certainty about the effects that are not supported by that trial. Descriptive analyses are not the subject of this guidance and are not addressed in detail.


 

 III.      MULTIPLE ENDPOINTS: GENERAL PRINCIPLES


 

A.        The Hierarchy of Families of Endpoints


 

 Endpoints in adequate and well-controlled drug trials are usually grouped hierarchically, often according to their clinical importance, but also taking into consideration the expected frequency of the endpoint events and anticipated drug effects. The critical determination for grouping endpoints is whether they are intended to establish effectiveness to support approval or intended to demonstrate additional meaningful effects. Endpoints critical to establish effectiveness for approval are often designated as primary endpoints. Secondary endpoints can provide useful description to support the primary endpoint(s) and/or demonstrate additional clinically important effects. The third category in the hierarchy includes all other endpoints, which are referred to as   exploratory. Exploratory endpoints can include endpoints for research purposes or for new hypotheses generation. Each category in the hierarchy can contain a single endpoint or a family of endpoints.


 

 1.        Primary Endpoint Family


 

 The endpoint(s) that establish the effect(s) of the drug and will be the basis for concluding that  the study meets its objective are designated the primary endpoint family. When there is a single prespecified primary endpoint, there are no multiple-endpoint-related multiplicity issues in the determination that the study achieves its objective.


 

 Multiple primary endpoints occur in three ways, further described in section III.C. The first is   when there are multiple primary endpoints, and each endpoint could be sufficient on its own to establish the drug’s efficacy. These multiple endpoints thus correspond to multiple chances of  success, and in this case, failure to adjust for multiplicity can lead to Type I error rate inflation and a false conclusion that the drug is effective. The second is when the determination of effectiveness depends on success on all primary endpoints, when there are two or more primary   endpoints. In this setting, there are no multiplicity issues related to primary endpoints, as there is  only one path that leads to a successful outcome for the trial and therefore, no concern with Type I error rate inflation. In the third, critical aspects of effectiveness can be combined into a single primary composite or other multicomponent endpoint, thereby avoiding multiple-endpoint-related multiplicity issues. For example, in many cardiovascular studies it is usual to combine several endpoints (e.g., cardiovascular death, heart attack, and stroke) into a single composite endpoint that is primary and to consider death a secondary endpoint (see section III.A.2.).

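The contrast between the first two situations can be illustrated numerically. The sketch below assumes independent endpoints that all have no true effect, each tested at a one-sided level of 0.025; both function names are hypothetical, and the independence assumption is a simplification made for illustration.

```python
def false_success_any(alpha: float, n_endpoints: int) -> float:
    """Chance of a false claim when winning on ANY one endpoint suffices
    (multiple chances of success, no multiplicity adjustment)."""
    return 1 - (1 - alpha) ** n_endpoints


def false_success_all(alpha: float, n_endpoints: int) -> float:
    """Chance of a false claim when the trial must win on ALL endpoints
    (a single path to success)."""
    return alpha ** n_endpoints


alpha = 0.025
for n_endpoints in (2, 3):
    print(
        f"{n_endpoints} endpoints: "
        f"any-one rule = {false_success_any(alpha, n_endpoints):.4f}, "
        f"all-required rule = {false_success_all(alpha, n_endpoints):.6f}"
    )
```

Under the all-required rule the false-success probability can never exceed the per-endpoint α: even if every other endpoint had a real effect and was certain to succeed, the one endpoint with no effect would still have to cross its own threshold. That is why no adjustment is needed in that setting, whereas the any-one rule exceeds α as soon as there are two endpoints.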

 

 2.         Secondary and Exploratory Endpoint Families


 

 When an effect on the primary endpoint is shown, the secondary endpoints can be formally tested. A secondary endpoint could be a clinical effect related to the primary endpoint that extends the understanding of that effect (e.g., an effect on survival when a cardiovascular drug   has shown an effect on the primary endpoint of heart failure-related hospitalizations) or provide evidence of a clinical benefit distinct from the effect shown by the primary endpoint (e.g., a disability endpoint in a multiple sclerosis treatment trial in which relapse rate is the primary endpoint). As a general principle, it is important to include the secondary endpoints that can potentially provide evidence of additional effects of the drug on the disease or condition in the Type I error control plan.


 

In general, it may be desirable to limit the number of secondary endpoints, because if multiplicity adjustments are used, the chance of demonstrating an effect on any secondary endpoint may become increasingly small as the number of secondary endpoints increases, or if a hierarchy is used, the important hypotheses further down the hierarchy might never get tested.

 

Exploratory endpoints do not need multiplicity adjustment because they are generally not used to support conclusions.


 

 3.         Selecting and Interpreting the Endpoints in the Primary and Secondary Endpoint Families


 

Positive results on the secondary endpoints can be interpretable if there is first a demonstration of a treatment effect on the primary endpoint family (O’Neill 1997). The overall Type I error rate should be controlled for the primary and secondary endpoint families all together.

 

Occasionally, there are trials where a clinically important endpoint (e.g., mortality or irreversible morbidity) is expected to have too few events to provide adequate power for the trial, while a different clinically important endpoint occurs more frequently or earlier in the disease process, leading to larger power. In such cases, generally the endpoint with inadequate power for detection is classified as a secondary endpoint, while the endpoint for which larger power is expected is classified as the primary endpoint. For example, in some oncology trials, progression-free survival is selected as the primary endpoint, and overall survival is selected as the secondary endpoint because an effect of treatment on disease progression is clinically important and may be more readily demonstrable, may be detected earlier, and may often be larger because the observed effect on overall survival can be impacted by subsequent treatment post progression.

 

B.        Type II Error Rate and Sample Size


 

FDA is also concerned with the risk of making a Type II error, which is failing to show an effect of a drug where there actually is one. The study power is the probability that the study will be successful if a treatment effect of a specified size is in fact present. The desired power is an important factor in determining the sample size, especially for the primary endpoints.


 

The sample size of a study is generally chosen to provide a reasonably high power to show a treatment effect if an effect of a specified size on the primary endpoint(s) is in fact present. The sample size calculation may need to account for the statistical adjustments to control the Type I error rate for multiplicity. For example, if a lower α level is used for a study endpoint, then the  sample size should be adjusted to provide desired statistical power for this endpoint.

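To see how a lower α level feeds into the sample size, the sketch below uses the standard normal-approximation formula for comparing two group means with equal variances, n per group = 2(z₁₋α/₂ + z₁₋β)² / d². Everything specific here is an assumption for illustration, not something stated in the guidance: the formula choice, the standardized effect size d = 0.5, the 80% power target, and the function name `per_group_sample_size`.

```python
from math import ceil
from statistics import NormalDist


def per_group_sample_size(
    two_sided_alpha: float, power: float, standardized_effect: float
) -> int:
    """Approximate subjects per arm for a two-arm comparison of means
    (normal approximation, equal variances, 1:1 allocation)."""
    z_alpha = NormalDist().inv_cdf(1 - two_sided_alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 / standardized_effect**2
    return ceil(n)


# Single endpoint tested at the full two-sided alpha of 0.05.
n_unadjusted = per_group_sample_size(0.05, 0.80, 0.5)

# Same endpoint tested at 0.025, e.g. after splitting alpha across two endpoints.
n_adjusted = per_group_sample_size(0.025, 0.80, 0.5)

print(f"per-group n at alpha = 0.05:  {n_unadjusted}")
print(f"per-group n at alpha = 0.025: {n_adjusted}")
```

The second figure is larger: keeping the same power at a stricter α requires more subjects, which is the adjustment the paragraph above describes.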

 

 Using two or more endpoints for which demonstration of an effect on each is recommended to support regulatory approval (called co-primary endpoints; see section III.C.1. below) will increase the Type II error rate and decrease study power. For example, assume two endpoints have the same effect size and the study sample size is selected to provide 80% power to show success on each of these two endpoints. If the endpoints are independent, the power to show success on both will be approximately 64% (0.8 x 0.8); i.e., the likelihood of the study failing to support a conclusion of a favorable drug effect when such an effect existed (the Type II error rate) would be 36%. To maintain desired study power, a larger sample size is recommended, and the individual endpoints could be powered at approximately 90% to ensure the probability of success is at least 80%. The calculation would be different if the endpoints were highly positively correlated or the power was not equal for each endpoint.

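The power calculation in the example above can be sketched as follows. It assumes, as the text does, that the co-primary endpoints are independent; the function names `joint_power` and `required_per_endpoint_power` are hypothetical.

```python
from math import prod


def joint_power(per_endpoint_powers: list[float]) -> float:
    """Probability of success on every co-primary endpoint,
    assuming the endpoints are independent."""
    return prod(per_endpoint_powers)


def required_per_endpoint_power(target_joint_power: float, n_endpoints: int) -> float:
    """Equal per-endpoint power needed so that the joint power
    reaches the target, assuming independent endpoints."""
    return target_joint_power ** (1 / n_endpoints)


power_both = joint_power([0.80, 0.80])
type_ii_error_rate = 1 - power_both
needed = required_per_endpoint_power(0.80, 2)

print(f"power to win on both endpoints: {power_both:.2f}")
print(f"Type II error rate:             {type_ii_error_rate:.2f}")
print(f"per-endpoint power needed:      {needed:.3f}")
```

The required per-endpoint power comes out just under 90%, which matches the guidance's suggestion of powering each endpoint at approximately 90%. With positively correlated endpoints the joint power would be higher than the simple product, so this independence-based figure is conservative.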

 

 C.        Types of Multiple Endpoints


 

 Multiple endpoints can be used when demonstration of a drug effect on more than one disease aspect or outcome is critical for determining that the drug confers a clinical benefit. Multiple endpoints can also be used when (1) there are several important aspects of a disease or several ways to assess an important aspect, (2) it may not be known in advance which aspect is more likely to show a drug effect, and (3) an effect on any one endpoint will be sufficient as evidence of effectiveness to support approval. In some cases, multiple aspects of a disease can appropriately be combined into a single endpoint, but subsequent analysis examining each disease aspect or component of this endpoint is generally important for an adequate understanding of the drug’s effect. These circumstances are discussed in more detail below.

当证明药物对一种以上疾病方面或结果的影响对于确定药物是否能带来临床益处至关重要时,可以使用多终点。当出现以下情况时,也可使用多终点:(1) 一种疾病有几个重要方面,或有几种评估一个重要方面的方法;(2) 事先可能不知道哪个方面更有可能显示出药物效应;(3) 对任何一个终点的影响都足以作为支持批准的有效性证据。在某些情况下,一种疾病的多个方面可以适当地合并为一个终点,但随后对该终点的每个疾病方面或组成部分进行分析,通常对于充分了解药物的效果非常重要。下文将详细讨论这些情况。

 

 1.         When Demonstration of Treatment Effects on Two or More Distinct Endpoints Is Recommended to Establish Clinical Benefit (Co-Primary Endpoints)

1.建议对两个或两个以上不同终点的治疗效果进行证明以确定临床获益时(共同主要终点)

 

 For some disorders, there are two or more different features that are so critically important to the disease under study that a drug will not be considered effective without demonstration of a treatment effect on all of these disease features. The term used in this guidance to describe this circumstance of multiple primary endpoints is co-primary endpoints. Multiple primary endpoints become co-primary endpoints when demonstrating an effect on each of the endpoints is critical to concluding that a drug is effective.

对于某些疾病来说,有两种或两种以上不同的特征对所研究的疾病非常重要,如果不能证明对所有这些疾病特征都有治疗效果,就不能认为药物是有效的。本指南中用于描述这种多主要终点情况的术语是共同主要终点。当证明对每个终点的疗效对于断定药物是否有效至关重要时,多个主要终点就成为共同主要终点。

 

 Therapies for the acute treatment of migraine headaches illustrate this circumstance. Although pain is the most prominent feature, migraine headaches are also characterized by the presence of photophobia, phonophobia, and/or nausea, all of which are clinically important. Which of the three is most clinically important varies among individuals. An approach to studying acute treatments for migraine headaches is to consider a drug effective for migraines only if the proportion of subjects with no headache pain at 2 hours after dosing and the proportion of subjects with absence of the most bothersome associated symptom at 2 hours after dosing are both shown to be improved by the drug treatment. Another approach could be to evaluate the drug effect on a response endpoint where response is defined by the absence of both pain and an  individually specified second symptom within an individual subject. This approach would utilize a single multi-component endpoint rather than co-primary endpoints.

偏头痛的急性治疗方法就说明了这种情况。虽然疼痛是最显著的特征,但偏头痛还伴有畏光、畏声和/或恶心,所有这些症状在临床上都很重要。这三种症状中哪一种在临床上最重要因人而异。研究偏头痛急性期治疗方法的一种方法是,只有在服药 2 小时后无头痛的受试者比例和服药 2 小时后无最令人烦恼的相关症状的受试者比例均显示药物治疗有效时,才认为该药物对偏头痛有效。另一种方法是评估药物对反应终点的影响,反应终点的定义是受试者同时没有疼痛和单独指定的第二种症状。这种方法将利用单一的多成分终点,而不是共同的主要终点。

 

Trials of combination vaccines are a situation in which co-primary endpoints are applicable.  These vaccine trials are typically designed and powered for demonstration of a successful outcome on effectiveness endpoints for each pathogen against which the vaccine is intended to provide protection.

联合疫苗试验是一种适用于共同主要终点的情况。这些疫苗试验的设计和动力通常是为了证明疫苗所要保护的每种病原体的有效性终点都能取得成功。

 

As discussed in section III.B., there is no multiplicity problem when the study is designed to demonstrate efficacy on all of the separate endpoints. However, co-primary endpoint testing increases the Type II error rate. In general, unless clinically very important, the use of more than two co-primary endpoints should be carefully considered because of the loss of power.

如第 III.B 节所述,如果研究旨在证明所有独立终点的疗效,则不存在多重性问题。但是,共同主要终点测试会增加 II 类错误率。一般来说,除非在临床上非常重要,否则应慎重考虑使用两个以上的共主要终点,因为这会损失研究力量。

 

There have been suggestions that the statistical testing criteria for each co-primary endpoint could be increased (e.g., testing at an α of 0.06 or 0.07) when the targeted α is 0.05 to accommodate the loss in statistical power arising from the need to show an effect on both endpoints. Increasing α for each co-primary endpoint is not acceptable because doing so may undermine the ability to interpret a treatment effect on each disease aspect considered critical to show that the drug is effective in support of approval.

有人建议,当目标α为0.05时,可提高每个共同主要终点的统计检测标准(例如,在α为0.06或0.07时进行检测),以适应因需要显示对两个终点的影响而造成的统计能力损失。提高每个共同主要终点的α值是不可接受的,因为这样做可能会削弱对每个疾病方面治疗效果的解释能力,而这些方面被认为是证明药物有效以支持批准的关键。

 

2.         When Demonstration of a Treatment Effect on at Least One of Several Primary Endpoints Is Sufficient

2. 当证明对几个主要终点中至少一个终点有治疗效果时就足够了

 

Many diseases have multiple sequelae, and an effect demonstrated on any one of these aspects could support a conclusion of effectiveness. Selection of a single primary endpoint may be difficult, however, if the aspect of a disease that will be responsive to the drug or the evaluation method that will better detect a treatment effect is not known a priori (at the time of trial design). In this circumstance, a study might be designed such that success on any one of several endpoints could support a conclusion of effectiveness. This creates a primary endpoint family.

许多疾病都有多种后遗症,对其中任何一种后遗症的治疗效果都可以支持有效的结论。然而,如果(在设计试验时)事先不知道疾病的哪一方面会对药物产生反应,也不知道哪种评价方法能更好地检测治疗效果,那么选择一个单一的主要终点可能会很困难。在这种情况下,可以设计一项研究,使多个终点中任何一个终点的成功都能支持疗效结论。这就形成了一个主要终点系列。

 

For example, consider a drug for the treatment of burn wounds where it is not known whether the drug will increase the rate of wound closure or reduce scarring, but the demonstration of either effect alone would be considered clinically important. A study in this case might have both wound closure rate and a scarring measure as separate primary endpoints.

例如,考虑一种治疗烧伤伤口的药物,目前尚不清楚该药物是会提高伤口闭合率还是会减少疤痕,但仅凭其中一种效果就可认为具有重要的临床意义。在这种情况下,研究可能会将伤口闭合率和疤痕测量作为单独的主要终点。

 

This use of multiple endpoints creates a multiplicity problem because there are several ways for the study to successfully demonstrate a treatment effect. Control of the Type I error rate for the primary endpoint family is critical. A variety of approaches can be used to address this multiplicity problem; the appendix describes and discusses some of these approaches.

使用多个终点会产生多重性问题,因为研究有多种方法可以成功证明治疗效果。控制主要终点系列的 I 类错误率至关重要。可以使用多种方法来解决多重性问题;附录介绍并讨论了其中一些方法。
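
To see why several routes to success inflate the Type I error rate, the following sketch computes the chance of at least one false positive across m endpoints. It assumes the endpoints are independent and that the drug has no effect on any of them; both are simplifying assumptions made here for illustration, and the function name is hypothetical.

```python
def familywise_error_rate(alpha_per_test, n_endpoints):
    """Chance of at least one false positive among independent tests with no true effects."""
    return 1.0 - (1.0 - alpha_per_test) ** n_endpoints


# Two primary endpoints, each tested at 0.05 with no adjustment.
print(round(familywise_error_rate(0.05, 2), 4))
# The same two endpoints, each tested at 0.05 / 2 (an equal alpha split).
print(round(familywise_error_rate(0.025, 2), 4))
```

Splitting alpha equally brings the overall rate back below 0.05 in this setting; the appendix discusses this and less conservative approaches.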

 

3.         Composite Endpoints

3.综合终点

 

There are some disorders for which more than one clinical outcome in a clinical trial is important, and all outcomes are expected to be affected by the treatment. Rather than using each as a separate primary endpoint (creating multiplicity) or selecting just one to be the primary endpoint and designating the others as secondary endpoints, it could be appropriate to combine those clinical outcomes into a single variable. This is often called a composite endpoint, where an endpoint is defined as the occurrence or realization in a subject of any one of the specified components. A typical example is a composite of major adverse clinical outcome events in cardiovascular trials (e.g., a composite of myocardial infarction, stroke, or death). When the components correspond to distinct events, composite endpoints are often assessed as the time to first occurrence of any one of the components. If a single statistical test is performed on the composite endpoint, no multiplicity problem will occur for this endpoint.

有些疾病在临床试验中会出现不止一种重要的临床结果,而且所有结果都会受到治疗的影响。与其将每种结果都作为单独的主要终点(造成多重性),或只选择一种结果作为主要终点,而将其他结果指定为次要终点,不如将这些临床结果合并为一个变量。这通常被称为复合终点,其中终点被定义为受试者出现或实现任何一个指定的组成部分。一个典型的例子是心血管试验中主要不良临床结果事件的复合终点(如心肌梗死、中风或死亡的复合终点)。当各组成部分对应不同的事件时,复合终点通常以任一组成部分首次发生的时间来评估。如果只对复合终点进行一次统计检验,则该终点不会出现多重性问题。
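
As a minimal sketch of how a time-to-first-event composite can be derived for one subject, the function below takes the earliest observed component event, or treats the subject as censored at the end of follow-up if no component occurred. The component names, the day counts, and the function name are hypothetical illustrations, not data or definitions from the guidance.

```python
def time_to_first_event(component_times, follow_up):
    """Return (time, event_observed) for a composite of several component events.

    component_times maps a component name to its event time, or None if that
    component was never observed. Events after follow_up are ignored.
    """
    observed = [
        t for t in component_times.values()
        if t is not None and t <= follow_up
    ]
    if observed:
        return min(observed), True
    return follow_up, False  # no component event: censored at end of follow-up


# Hypothetical subject: MI on day 120, death on day 300, 365 days of follow-up.
print(time_to_first_event({"mi": 120, "stroke": None, "death": 300}, follow_up=365))
# Hypothetical subject with no component events.
print(time_to_first_event({"mi": None, "stroke": None, "death": None}, follow_up=365))
```

Because each subject contributes a single composite outcome, one statistical test on that outcome raises no multiplicity issue, as the paragraph above notes.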

 

One possible reason for using a composite endpoint is that the incidence of each of the events may be too low to allow a study of reasonable size to have adequate power; the composite endpoint can provide a substantially higher overall event rate that allows a study with a reasonable sample size and study duration to have adequate power. Composite endpoints are often used when the goal of treatment is to prevent or delay occurrence of one of several clinically important and related events (e.g., use of an anti-platelet drug in subjects with coronary artery disease to prevent myocardial infarction, stroke, or death), possibly without knowledge of which event(s) may be affected.

使用复合终点的一个可能原因是,每个事件的发生率可能太低,以至于合理规模的研究无法获得足够的研究动力;而复合终点可以提供一个高得多的总体事件发生率,从而使样本规模和研究持续时间合理的研究获得足够的研究动力。当治疗的目的是预防或延迟几种临床上重要的相关事件中的一种时(例如,在冠心病患者中使用抗血小板药物以预防心肌梗死、中风或死亡),可能不知道哪些事件可能会受到影响,这时通常会使用复合终点。

 

The choice of the components of a composite endpoint should be made carefully. The treatment effect on the composite event rate can be interpreted as characterizing the overall clinical effect when the individual events all have reasonably similar clinical importance. The effect on the composite endpoint, however, will not be a reasonable indicator of the effect on all of the components or an accurate description of the drug’s benefit if the clinical importance of different components is substantially different and the treatment effect is chiefly on the least important event. Furthermore, it is possible that a component with greater importance would be adversely affected by the treatment, even if one or more event types of lesser importance are favorably affected, so that although the overall outcome still has a favorable statistical result, doubt may arise about the treatment’s clinical value. In this case, although the overall statistical analysis indicates the treatment is beneficial, careful examination of the data could call this conclusion into question. For this reason, as well as for a greater depth of understanding of the treatment’s effects, analyses of the components of the composite endpoint are important (see section III.D.) and can influence interpretation of the overall study results. The examination of the components is always necessary, but whether multiplicity adjustment should be made depends on the purpose. If the intent is to better understand the demonstrated effect on the composite, then no adjustment is recommended. In that case, clinical judgment is used to decide whether the benefit is clinically meaningful and exceeds risk, and how it will be described in the FDA-approved labeling. If the intent is to establish additional effects of the drug, then multiplicity adjustment should be made.

应谨慎选择综合终点的组成部分。当单个事件的临床重要性相当接近时,对综合事件发生率的治疗效果可解释为总体临床效果的特征。但是,如果不同成分的临床重要性有很大差异,而治疗效果主要体现在最不重要的事件上,那么对综合终点的影响就不能作为对所有成分影响的合理指标,也不能准确描述药物的益处。此外,即使一个或多个重要性较低的事件类型受到有利的影响,重要性较高的部分也有可能受到治疗的不利影响,因此,尽管总体结果仍具有有利的统计结果,但可能会对治疗的临床价值产生怀疑。在这种情况下,虽然总体统计分析表明治疗是有益的,但仔细研究数据可能会对这一结论提出质疑。因此,以及为了更深入地了解治疗效果,对综合终点各组成部分的分析非常重要(见第 III.D 节),并可能影响对整个研究结果的解释。对各组成部分进行检查总是必要的,但是否应进行多重性调整取决于目的。如果目的是为了更好地理解已证实的对综合结果的影响,则建议不做调整。在这种情况下,应根据临床判断来决定获益是否具有临床意义并超过风险,以及在 FDA 批准的标签中如何描述。如果目的是确定药物的额外效果,则应进行多重性调整。

 

4.        Multi-Component Endpoints

4.多成分终点

 

A multi-component endpoint is a within-subject combination of two or more components. In this endpoint, an individual subject’s evaluation is dependent upon observation of all the specified components in that subject. A single overall rating or status is then determined according to specified rules.

多成分终点是两个或两个以上成分在受试者体内的组合。在这种终点中,对单个受试者的评价取决于对该受试者所有指定成分的观察。然后根据指定规则确定单一的总体评级或状态。

 

A single overall rating can be formed by some kind of average (either weighted or unweighted)   across the individual domain scores. An example of a multi-component endpoint is the Positive   and Negative Syndrome Scale (PANSS) in schizophrenia research. A multi-component endpoint can also be a dichotomous (response) endpoint corresponding to an individual subject achieving  specified criteria on each of the multiple components. For example, the primary endpoint in clinical trials of allogeneic pancreatic islet cells for Type 1 diabetes mellitus can be a response rate in which subjects are considered responders only if they meet two dichotomous response   criteria: normal range of HbA1c and elimination of hypoglycemia.

可以通过对各个领域得分进行某种平均(加权或非加权)来形成单一的总体评分。精神分裂症研究中的阳性与阴性综合征量表(PANSS)就是一个多成分终点的例子。多成分终点也可以是二分法(反应)终点,对应于受试者在多个成分中的每一个达到特定标准。例如,在异体胰岛细胞治疗 1 型糖尿病的临床试验中,主要终点可以是反应率,受试者只有达到两个二分法反应标准(HbA1c 在正常范围内和消除低血糖)才被视为反应者。

 

There are more complex endpoint formulations where several, but not all, different features of a   disease must be positively affected for a subject to be regarded as receiving benefit. For example, a positive response for an individual subject might be defined as a certain degree of improvement in two specific aspects of a disease along with improvement in at least three out of five additional disease features, as in the American College of Rheumatology (ACR) scoring system for rheumatoid arthritis.

还有一些更复杂的终点表述,即必须对疾病的几个不同特征(但不是所有特征)产生积极影响,受试者才能被视为获益。例如,美国风湿病学会(ACR)类风湿性关节炎评分系统中的定义是,个体受试者的阳性反应可能是疾病的两个特定方面得到了一定程度的改善,同时其他五个疾病特征中至少有三个得到了改善。
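
The responder rule described above (improvement in two specific aspects plus improvement in at least three of five additional features) can be sketched as a simple within-subject classification. This is only an illustration of the logical structure: the inputs are assumed to be booleans already derived from prespecified improvement thresholds, and the function is not an implementation of the actual ACR scoring system.

```python
def is_responder(core_improved, additional_improved, min_additional=3):
    """Classify one subject under an 'all core plus k of n additional' rule.

    core_improved: booleans, one per required (core) disease aspect.
    additional_improved: booleans, one per additional disease feature.
    """
    return all(core_improved) and sum(additional_improved) >= min_additional


# Both core aspects improved and 3 of 5 additional features improved: responder.
print(is_responder([True, True], [True, False, True, True, False]))
# Both core aspects improved but only 2 of 5 additional features: not a responder.
print(is_responder([True, True], [True, False, True, False, False]))
# One core aspect not improved: not a responder, whatever the other features show.
print(is_responder([True, False], [True, True, True, True, True]))
```

The trial-level endpoint is then the proportion of responders in each arm, which is a single endpoint and needs a single test.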

 

The use of within-subject multi-component endpoints may be efficient if the treatment effects on the different components are generally trending in the same direction within a subject. Study power can be adversely affected, however, if there is limited concordance among the endpoints. Although multi-component endpoints can provide some gains in efficiency compared to co-primary endpoints, the appropriateness of a particular within-subject multi-component endpoint is generally determined by clinical, rather than statistical, considerations. Similar to the assessment of the component endpoints of a composite endpoint in section III.C.3., evaluation of the components of a multi-component endpoint may be important but should be subject to pre-specification and multiplicity adjustment if the intent is to support specific conclusions on how a treatment affects specific components (see section III.D.).

如果一个受试者体内不同成分的治疗效果总体上趋于一致,那么使用受试者内多成分终点可能会很有效。但是,如果终点之间的一致性有限,研究的有效性就会受到不利影响。虽然与共同的主要终点相比,多组分终点能在一定程度上提高效率,但特定受试者内的多组分终点是否合适,通常是由临床因素而非统计学因素决定的。与第 III.C.3 节中对复合终点中各组成部分终点的评估类似,对多组分终点各组成部分的评估也许很重要,但如果目的是支持关于治疗如何影响特定组成部分的具体结论,则应进行预先规范和多重性调整(见第 III.D 节)。

 

5.   Clinically Critical Endpoints Too Infrequent for Use as a Primary Endpoint

5.临床关键终点太少,无法用作主要终点

 

For many serious diseases, there is an endpoint of such great clinical importance that it is unreasonable not to collect and analyze the endpoint data; the usual example is mortality or major morbidity events (e.g., stroke, fracture, pulmonary exacerbation). Even if relatively few of these events are expected to occur in the trial, they can be included in a composite endpoint (see  section III.C.3.) and also designated as a planned secondary endpoint to potentially support a conclusion regarding effect on that separate endpoint, if the effect of the drug on the composite primary endpoint is demonstrated.

对于许多严重疾病来说,有一个终点具有非常重要的临床意义,不收集和分析该终点数据是不合理的;通常的例子是死亡率或重大发病事件(如中风、骨折、肺部恶化)。即使预计试验中发生的此类事件相对较少,也可将其纳入综合终点(见第 III.C.3 节),如果药物对综合主要终点的影响得到证实,也可将其指定为计划中的次要终点,以支持关于对该单独终点影响的结论。

 

D.        The Individual Components of Composite and Multi-Component Endpoints

D. 综合终点和多组分终点的各个组成部分

 

1.        Evaluating and Reporting the Results of Composite Endpoints

1.评估和报告综合终点的结果

 

For composite endpoints whose components correspond to events, an event is usually defined as the first occurrence of any of the designated component events. Such composites can be analyzed either with comparisons of proportions between study groups at the end of the study or using time-to-event analyses. The time-to-event method of analysis is the more common method when, within the study’s timeframe of observation, the duration of being event-free is clinically meaningful. Although there may be an expectation that the drug will have a favorable effect on all the components of a composite endpoint, that is not a certainty. Results for each component event should therefore be individually examined and should be included in study reports. These analyses will not alter a conclusion about the statistical significance of the composite primary endpoint; however, interpretation of the result of the composite endpoint can be uncertain (see section III.C.3.). If there is an interest in analyzing one or more of the components of a composite endpoint as distinct hypotheses to demonstrate effects of the drug, the hypotheses should be part of the prospectively specified statistical analysis plan that accounts for the multiplicity this analysis will entail, as described above, for mortality. However, testing for individual component endpoints is likely to be underpowered as the sample size or total number of events is usually planned for testing the composite endpoint.

对于成分与事件相对应的复合终点,事件通常被定义为任何指定成分事件的首次发生。此类复合终点既可以在研究结束时比较研究组之间的比例,也可以使用时间到事件分析法进行分析。如果在研究的观察时间范围内,无事件持续时间具有临床意义,则采用时间到事件的分析方法更为常见。虽然人们可能期望药物会对综合终点的所有组成部分产生有利影响,但这并不确定。因此,应单独检查每种成分事件的结果,并将其纳入研究报告中。这些分析不会改变关于综合主要终点的统计学意义的结论;但是,对综合终点结果的解释可能存在不确定性(见第 III.C.3 节)。如果有兴趣将综合终点的一个或多个组成部分作为不同的假设进行分析,以证明药物的效果,则这些假设应成为前瞻性统计分析计划的一部分,该计划应考虑到这一分析将带来的多重性,如上文所述的死亡率分析。然而,由于样本量或事件总数通常是为测试综合终点而规划的,因此对单个组成终点进行测试很可能会削弱其作用力。

 

Decomposition of the first composite event is often presented to depict how the component events constitute the composite event in terms of proportion. For example, in the RENAAL trial (Brenner et al. 2001), the primary efficacy endpoint was the first occurrence of the composite endpoint of doubling of serum creatinine, end-stage renal disease, or death. Based on such decomposition, 52% of the first composite events were doublings of serum creatinine, 19% were end-stage renal disease events, and 29% were deaths. However, subjects may experience more than one event type. For these subjects, events occurring after the first composite event (e.g., end-stage renal disease or death occurring after a doubling of serum creatinine) would not be counted in the decomposition. Therefore, evaluation of the individual event types in analyses that include all events for the event type of interest (even those that occur after events of other event    types) is also important. Such analyses could demonstrate a possible additional effect of the drug  if they are pre-specified, multiplicity is properly accounted for, and the results are interpretable.

通常会对第一个综合事件进行分解,以描述各组成事件在比例上是如何构成综合事件的。例如,在 RENAAL 试验(Brenner 等人,2001 年)中,主要疗效终点是首次出现血清肌酐翻倍、终末期肾病或死亡的复合终点。根据这种分解,52%的首次复合事件是血清肌酐翻倍,19%是终末期肾病事件,29%是死亡。然而,受试者可能会经历不止一种事件类型。对于这些受试者,在第一个综合事件之后发生的事件(如血清肌酐翻倍后发生的终末期肾病或死亡)将不计入分解。因此,在包括相关事件类型的所有事件(即使是发生在其他事件类型之后的事件)的分析中,对单个事件类型进行评估也很重要。如果预先指定了此类分析,并适当考虑了多重性,且结果可解释,则此类分析可证明药物可能产生的额外效应。
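
The difference between a decomposition of first composite events and an analysis of all events of each type can be sketched as follows. The subject records below are invented for illustration and are not RENAAL data; the event names and the tie-breaking behaviour (the first-listed type wins when two events share a day) are assumptions of this sketch.

```python
from collections import Counter

# Hypothetical subjects: event type -> day of occurrence (absent = never occurred).
subjects = [
    {"doubling": 10, "esrd": 20},  # doubling first, ESRD later
    {"death": 15},
    {"esrd": 8, "death": 30},      # ESRD first, death later
    {"doubling": 5},
    {},                            # no composite event
]


def first_event_type(events):
    """Type of the earliest event for one subject, or None if no event occurred."""
    return min(events, key=events.get) if events else None


# Decomposition: which component was the FIRST composite event for each subject.
first_counts = Counter(first_event_type(s) for s in subjects if s)
# All-events view: how many subjects EVER had each component event.
any_counts = Counter(kind for s in subjects for kind in s)

print(dict(first_counts))
print(dict(any_counts))
```

In this toy data set the decomposition counts one ESRD event and one death, while the all-events view counts two of each, because events that follow a subject's first composite event are invisible to the decomposition.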

 

2.        Evaluating and Reporting the Results on Other Multi-Component Endpoints

2.评估和报告其他多成分终点的结果

 

As with composite endpoints, understanding which components of a within-subject multi-component endpoint have contributed most to the overall statistical significance could be important to correctly understanding the clinical effects of the drug. Consequently, analysis of the study results on the individual components is usually important but, as stated previously, if undertaken, should not be presented in FDA-approved labeling in ways that imply a statistically rigorous conclusion or convey certainty about the effects that are not supported by that trial. For many of these multi-component endpoints, the overall score is regarded as comprehensive and clinically interpretable. The individual component scales, however, may or may not be independently clinically interpretable. Analyses of specific components or subdomains of a clinical outcome assessment as explicit endpoints in the primary or secondary endpoint families can be reasonable, contingent on the endpoint being clinically interpretable. Pre-specification of specific components or subdomains as endpoints with appropriate multiplicity control is recommended if the intent is to demonstrate an effect of a drug on one or more of these endpoints in addition to the overall multi-component endpoint.

与综合终点一样,了解受试者内多成分终点的哪些成分对总体统计意义的贡献最大,对于正确理解药物的临床效果可能非常重要。因此,对单个成分的研究结果进行分析通常很重要,但如前所述,如果进行分析,则不应在 FDA 批准的标签中以暗示统计学上严格的结论或传达该试验不支持的疗效确定性的方式进行表述。对于这些多成分终点中的许多终点来说,总分被认为是全面的、临床上可解释的。然而,单个成分的量表在临床上可能是可解释的,也可能是不可解释的。将临床结果评估的特定组成部分或子领域作为主要或次要终点系列中的明确终点进行分析可能是合理的,但前提是终点在临床上是可解释的。如果目的是为了证明药物除了对总体多组分终点有影响外,还对其中一个或多个终点有影响,则建议预先指定特定组分或子域作为终点,并进行适当的多重性控制。

 

IV.      METHODOLOGICAL CONSIDERATIONS

IV.方法论方面的考虑

 

A variety of situations in which multiplicity arises have been discussed in sections II. and III.    When there is a family of endpoints (discussed in section III.A.), the probability of erroneously finding a statistically significant treatment effect in at least one endpoint regardless of the presence or absence of treatment effects in the other endpoints is the overall Type I error rate. This error rate is typically held to 0.05 (or 0.025 for one-sided tests). Statistical methods that   control this error rate at the desired level can permit an effectiveness conclusion on individual endpoints.

第二节和第三节讨论了产生多重性的各种情况。当存在一系列终点时(在第三.A.节中讨论),无论其他终点是否存在治疗效果,至少在一个终点中错误地发现具有统计学意义的治疗效果的概率就是总的 I 类错误率。这一误差率通常被控制在 0.05(单侧试验为 0.025)。将误差率控制在理想水平的统计方法可以得出单个终点的有效性结论。

 

There are many common statistical methods for addressing multiple-endpoint-related multiplicity problems (Hochberg and Tamhane 1987). The appendix presents some of the commonly considered methods. Examples include the Bonferroni, Holm (Holm 1979), and Hochberg (Hochberg 1988) procedures, which do not assume any hierarchy among the tested null hypotheses (i.e., any individual null hypothesis in the family can be rejected regardless of the rejection of other hypotheses). Other viable methods apply a combination of partial alpha allocation and hierarchies, such as graphical methods (Bretz et al. 2009) that are presented in the appendix. If finding a statistically significant treatment effect in any one of the considered endpoints is considered a success, then methods that appropriately adjust for multiplicity across the family of endpoints can be applicable.

有许多常用的统计方法可以解决与多端点相关的多重性问题(Hochberg 和 Tamhane,1987 年)。附录介绍了一些常用的方法。例如 Bonferroni、Holm(Holm,1979 年)和 Hochberg(Hochberg,1988 年)程序,它们不假定所测试的零假设之间有任何等级关系(即无论其他假设是否被拒绝,都可以拒绝族中的任何单个零假设)。其他可行的方法则结合使用了部分阿尔法分配和层次结构,如附录中介绍的图形方法(Bretz 等,2009 年)。如果在任何一个被考虑的终点中发现具有统计学意义的治疗效果被认为是成功的,那么可以采用适当调整整个终点系列多重性的方法。

 

However, if endpoints are ordered based on clinical importance or logically related, then different methods can be recommended (e.g., Pocock et al. 2012). For example, in the simple case where there is one primary and one secondary endpoint, a hierarchical testing approach can be used. Some methodologies have been developed to account for more complex logical/hierarchical relationships among the endpoints such as graphical approaches (e.g., Bretz et al. 2009) and mixture gatekeeping procedures (Dmitrienko et al. 2008). The graphical method has a sequential testing algorithm and makes it possible to visualize the testing process via a graph.

但是,如果终点是根据临床重要性排序的,或者在逻辑上是相关的,则可以推荐使用不同的方法(如 Pocock 等人,2012 年)。例如,在有一个主要终点和一个次要终点的简单情况下,可以使用分层测试方法。目前已开发出一些方法来考虑端点之间更复杂的逻辑/层次关系,如图形方法(如 Bretz 等人,2009 年)和混合物把关程序(Dmitrienko 等人,2008 年)。图形方法具有顺序测试算法,可通过图形直观显示测试过程。
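
One common form of hierarchical testing is a fixed-sequence approach: endpoints are tested in a prespecified order, each at the full alpha, and testing stops at the first non-significant result. The sketch below illustrates that idea under the assumption that the order was fixed before the data were seen; the function name is hypothetical, and the guidance text above does not prescribe this particular implementation.

```python
def fixed_sequence_test(ordered_p_values, alpha=0.05):
    """Test endpoints in their prespecified order, each at the full alpha.

    Returns one boolean per endpoint. Once an endpoint fails, every later
    endpoint in the sequence is declared non-significant without being tested.
    """
    results = []
    still_testing = True
    for p in ordered_p_values:
        significant = still_testing and p < alpha
        results.append(significant)
        still_testing = significant
    return results


# Primary succeeds, so the secondary is tested at the full 0.05 and also succeeds.
print(fixed_sequence_test([0.01, 0.03]))
# Primary fails, so the secondary cannot be claimed even though its p-value is small.
print(fixed_sequence_test([0.20, 0.001]))
```

The second call shows the cost of the approach: a small p-value lower in the hierarchy carries no confirmatory weight once an earlier test has failed.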

 

In some cases, a primary endpoint can be tested for non-inferiority (with a fixed margin), followed by testing it for superiority. If this endpoint is the only endpoint being tested, then non-inferiority and superiority can be tested without multiplicity adjustment because the null hypotheses of non-inferiority and superiority are naturally ordered, and the two tests apply to the one hierarchy considered for this endpoint. However, if at least one more endpoint is included for testing, then multiplicity issues arise, and adjustments should be made to control the overall Type I error probability. For example, the tests could be ordered in a single hierarchy where the additional endpoint(s) are tested after the superiority hypothesis for the primary endpoint. Or, alternatively, testing could proceed to both the superiority hypothesis for the primary endpoint and to the hypotheses for the additional endpoints, with alpha allocation across these multiple hypotheses. To see why such alpha allocation can be applicable, suppose the drug is non-inferior to the active control with respect to the primary endpoint, but the drug is neither superior to the active control for the primary endpoint nor non-inferior to the active control for the secondary endpoint. Thus, a Type I error could occur with either of these hypothesis tests. If both of these were tested at 0.05, the probability of at least one of these leading to a spurious conclusion would be greater than 0.05. Thus, there should be appropriate control in some manner (e.g., test the secondary endpoint only if the primary endpoint superiority is shown or split alpha between the two tests). Additional discussion on this special case and on other methodological considerations is provided in the appendix.

在某些情况下,可以先测试主要终点的非劣效性(有固定的界值),然后再测试其优效性。如果该终点是唯一进行测试的终点,那么就可以进行非劣效性和优效性测试,而无需进行多重性调整,因为非劣效性和优效性的无效假设是自然排序的,这两种测试适用于该终点所考虑的一个层次。但是,如果至少还有一个终点需要测试,就会出现多重性问题,因此应进行调整以控制总体 I 类错误概率。例如,可以按单一层次对测试进行排序,在主要终点的优效性假设之后再对附加终点进行测试。或者,也可以同时对主要终点的优效性假设和附加终点的假设进行测试,并在多个假设之间进行阿尔法分配。为了说明为什么可以采用这种阿尔法分配,假设药物在主要终点方面不劣于活性对照,但在主要终点方面并不优于活性对照,在次要终点方面也并非不劣于活性对照。因此,这两种假设检验中的任何一种都可能出现 I 类错误。如果这两个假设都在 0.05 水平下检验,那么至少其中一个导致虚假结论的概率将大于 0.05。因此,应该以某种方式进行适当的控制(例如,只有当主要终点显示出优效性时,才对次要终点进行测试,或在两个测试之间分配 alpha)。关于这种特殊情况和其他方法学考虑因素的更多讨论见附录。

 

V.        SUMMARY

V. 摘要

 

Making a false positive conclusion about effectiveness (i.e., falsely concluding that a drug has a positive treatment effect when it does not) is a major concern. A common approach is to control the Type I error rate at less than 5% (1 in 20 chance) for a false conclusion that there is a treatment difference or 2.5% (1 in 40 chance) for a false positive conclusion about effectiveness. As the number of endpoints or analyses increases, the Type I error rate can increase well beyond 2.5% due to multiplicity. Multiplicity adjustments, as described in this guidance, provide means  for controlling the Type I error rate when the drug effect is evaluated in multiple endpoints. There are many strategies and methods that can be used, as appropriate, as described in this guidance. Each of these methods has advantages and disadvantages, and the selection of suitable strategies and methods is a challenge that should be addressed at the study-planning stage.

对疗效做出错误的肯定性结论(即错误地认为某种药物具有肯定的治疗效果,而实际上却没有)是一个令人担忧的主要问题。一种常见的方法是控制 I 类错误率,对于存在治疗差异的错误结论,控制在 5%(1/20 的概率)以下;对于疗效的错误阳性结论,控制在 2.5%(1/40 的概率)以下。随着终点或分析数量的增加,由于多重性的存在,I 类错误率可能会增加到远远超过 2.5%。本指南中所述的多重性调整提供了在多个终点中评估药物效果时控制 I 类错误率的方法。如本指南所述,可酌情使用多种策略和方法。这些方法各有利弊,选择合适的策略和方法是一项挑战,应在研究规划阶段加以解决。

 

Statistical expertise should be enlisted to help choose the most appropriate approach. Failing to appropriately control the Type I error rate may increase the risk of a false positive conclusion;   this guidance is intended to clarify when and how multiplicity due to multiple endpoints should be managed to avoid reaching such false conclusions.

应利用统计专业知识来帮助选择最合适的方法。如果不能适当控制 I 类错误率,可能会增加得出假阳性结论的风险;本指南旨在阐明何时以及如何管理多个终点导致的多重性,以避免得出此类错误结论。

 

APPENDIX: STATISTICAL METHODS

附录:统计方法

 

This appendix presents some commonly used statistical methods and approaches for addressing multiplicity problems in controlled clinical trials that evaluate treatment effects on multiple endpoints. The methods listed in this appendix are not intended to be a comprehensive list of methods for controlling multiplicity; other approaches could be appropriate for specific situations. The choice of the method to use for a specific clinical trial will depend on the objectives and the design of the trial, as well as the knowledge of the drug being developed and the clinical setting. The method, however, should be decided upon prospectively. Because the   considerations that go into the choice of multiplicity adjustment method can be complex and specific to individual product development programs, this guidance does not attempt to recommend any one method over another in most cases. Sponsors should consider the variety of methods available and in the prospective analysis plan select the most powerful method that is suitable for the design and objective of the study and maintains Type I error rate control.

本附录介绍了一些常用的统计方法和手段,用于解决对照临床试验中的多重性问题,这些临床试验评估治疗对多个终点的影响。本附录中列出的方法并不是控制多重性的全面方法清单;其他方法可能适用于特定情况。具体临床试验采用哪种方法取决于试验的目标和设计,以及对所开发药物和临床环境的了解。不过,方法应是前瞻性的决定。由于选择多重性调整方法的考虑因素可能很复杂,而且具体到每个产品开发项目,因此本指南并不打算在大多数情况下推荐任何一种方法。申办者应考虑现有的各种方法,并在前瞻性分析计划中选择适合研究设计和目标并能保持 I 类错误率控制的最有效方法。

 

1.         The Bonferroni Method

1.Bonferroni 法

 

The Bonferroni method is a single-step procedure that is commonly used, perhaps because of its simplicity and broad applicability. The drug is considered to have shown effects for each endpoint that succeeds on this test. The Holm and Hochberg methods (see below) are more powerful than the Bonferroni method for primary endpoints and are therefore preferable in many cases. However, sponsors might still wish to use the Bonferroni method for primary endpoints to maximize power for secondary endpoints or because the assumptions of the Hochberg method are not justified.

Bonferroni 法是一种常用的单步程序,也许是因为它简单易行、适用性广。药物被认为对每一个测试成功的终点都有效果。对于主要终点,Holm 和 Hochberg 方法(见下文)比 Bonferroni 方法更有效,因此在许多情况下更可取。然而,申办者可能仍希望对主要终点使用 Bonferroni 方法,以最大限度地提高次要终点的效应,或者因为 Hochberg 方法的假设不成立。

 

The most common form of the Bonferroni method divides the available total α (typically 0.05 two-sided) equally among the chosen endpoints. The method then concludes that a treatment effect is significant at the α level for each one of the m endpoints for which the endpoint’s p-value is less than α/m. Thus, with two endpoints, the critical α for each endpoint is two-sided 0.025. The Bonferroni test can also be performed with different weights assigned to endpoints, with the sum of the relative weights equal to 1.0 (e.g., 0.4, 0.3, 0.2, and 0.1 for four endpoints). These weights should be prespecified in the design of the trial, taking into consideration the clinical importance of the endpoints, the likelihood of success, or other factors.

最常见的 Bonferroni 方法是将可用的总 α(通常为 0.05 双侧)平均分配给所选终点。然后,该方法得出结论:在 m 个终点中,对于 p 值小于 α/m 的每个终点,治疗效果在 α 水平上都是显著的。因此,对于两个终点,每个终点的临界 α 是双侧 0.025。在进行 Bonferroni 检验时,也可以给终点分配不同的权重,相对权重之和等于 1.0(例如,四个终点的权重分别为 0.4、0.3、0.2 和 0.1)。这些权重应在试验设计中预先确定,同时考虑到终点的临床重要性、成功的可能性或其他factors之外的其他因素。
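
A minimal sketch of the equal and weighted Bonferroni rules described above: each endpoint's p-value is compared to its share of the total alpha. The function name and the example p-values are hypothetical; only the thresholds (α/m, or α times a prespecified weight) come from the text.

```python
def bonferroni(p_values, alpha=0.05, weights=None):
    """Return one significance decision per endpoint under (weighted) Bonferroni.

    With weights=None the total alpha is split equally (alpha / m per endpoint).
    Otherwise endpoint i is tested at alpha * weights[i]; weights must sum to 1.
    """
    m = len(p_values)
    if weights is None:
        weights = [1.0 / m] * m
    if len(weights) != m or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must match the endpoints and sum to 1.0")
    return [p < alpha * w for p, w in zip(p_values, weights)]


# Two endpoints, equal split: each is tested at 0.025.
print(bonferroni([0.02, 0.03]))
# Four endpoints with prespecified weights 0.4, 0.3, 0.2, 0.1
# (thresholds 0.02, 0.015, 0.01, 0.005).
print(bonferroni([0.015, 0.012, 0.02, 0.004], weights=[0.4, 0.3, 0.2, 0.1]))
```

Each decision depends only on that endpoint's own p-value, which is what makes the method single-step.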

 

2.         The Holm Procedure

2.霍尔姆程序

 

The Holm procedure is a multi-step step-down procedure; it is useful for endpoints with any degree of correlation. It is less conservative than the Bonferroni method because a success with   the smallest p-value (at the same endpoint-specific alpha as the Bonferroni method) allows other endpoints to be tested at larger endpoint-specific alpha levels than does the Bonferroni method.   The algorithm for performing this test is as follows:

Holm 程序是一个多步骤的递减程序;它适用于任何相关程度的终点。它不如 Bonferroni 方法保守,因为用最小的 p 值(与 Bonferroni 方法相同的终点特异性 alpha 值)取得成功后,就可以用比 Bonferroni 方法更大的终点特异性 alpha 值水平来检验其他终点。进行该检验的算法如下:

 

The endpoint p-values resulting from the completed study are first ordered from the smallest to the largest. Suppose that there are m endpoints to be tested and p(1) represents the smallest p-value, p(2) the next-smallest p-value, p(3) the third-smallest p-value, and so on.

已完成研究得出的终点 p 值首先从小到大排序。假设有 m 个终点需要测试,p(1) 代表最小的 p 值,p(2) 代表次小的 p 值,p(3) 代表第三小的 p 值,以此类推。

 

 i.   The test begins by comparing the smallest p-value, p(1), to α/m, the same threshold used in the equally-weighted Bonferroni correction. If this p(1) is less than α/m, the treatment effect for the endpoint associated with this p-value is considered significant.

i. 检验首先将最小 p 值 p(1) 与 α/m 进行比较,α/m 与等权 Bonferroni 校正中使用的临界值相同。如果 p(1) 小于 α/m,则与该 p 值相关的终点治疗效果被认为是显著的。

 

ii.   The test then compares the next-smallest p-value, p(2), to an endpoint-specific alpha of the total alpha divided by the number of yet-untested endpoints (e.g., α/(m-1) for the second smallest p-value, a somewhat less conservative significance level). If p(2) < α/(m-1), then the treatment effect for the endpoint associated with this p(2) is also considered significant.

ii. 然后,将次小的 p 值 p(2)与终点特异性 alpha 值(总 alpha 值除以尚未检测的终点数)进行比较(例如,第二小的 p 值为 α/(m- 1),这是一个不太保守的显著性水平)。如果 p(2) < α/(m-1),那么与该 p(2) 相关的终点的治疗效果也被认为是显著的。

 

iii.   The test then compares the next ordered p-value, p(3), to α/(m-2), and so on until the last p-value (the largest p-value) is compared to α .

iii. 然后,检验将下一个有序 p 值 p(3) 与 α/(m-2) 进行比较,依此类推,直到最后一个 p 值(最大 p 值)与 α 比较为止。

 

 iv.   The procedure stops, however, whenever a step yields a non-significant result. Once an ordered p-value is not significant, the remaining larger p-values are not evaluated and cannot be considered as statistically significant.

iv. 然而,每当一个步骤得出不显著的结果时,程序就会停止。一旦某个有序 p 值不显著,其余较大的 p 值将不予评估,也不能被视为具有统计意义。

 

There is also a more general weighted version of Holm which allows unequal alpha allocation to the individual null hypotheses.

还有一种更通用的加权霍尔姆版本,允许对各个零假设进行不平等的阿尔法分配。
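The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a validated implementation: the function name `holm` and the example p-values are assumptions made for this sketch, and the strict "less than" comparison follows the wording of steps i and ii.

```python
def holm(p_values, alpha=0.05):
    """Equally weighted Holm step-down test.

    Returns a list of booleans in the original endpoint order:
    True where the endpoint is declared significant.
    """
    m = len(p_values)
    # Indices of the endpoints, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for step, idx in enumerate(order):
        # Thresholds are alpha/m, alpha/(m-1), ..., alpha.
        threshold = alpha / (m - step)
        if p_values[idx] < threshold:
            significant[idx] = True
        else:
            # First non-significant step: stop, larger p-values are not evaluated.
            break
    return significant


# Hypothetical p-values for three endpoints.
print(holm([0.030, 0.012, 0.020]))  # -> [True, True, True]
print(holm([0.001, 0.030, 0.040]))  # -> [True, False, False]
```

In the first call the Bonferroni method (every endpoint at 0.05/3) would reject only the endpoint with p = 0.012, which shows how a success at the first step relaxes the later thresholds. In the second call 0.030 is not below 0.05/2, so testing stops and 0.040 is never compared to 0.05.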

 

 3.         The Hochberg Procedure

3.霍赫伯格程序

 

The Hochberg procedure is a step-up testing procedure. It is more powerful than the Holm procedure (i.e., if a treatment effect is significant under the Holm procedure it will also be significant under the Hochberg procedure, but not necessarily vice versa), but, unlike the Holm procedure, it controls the overall error rate only under certain assumptions. It compares the p-values to the same alpha critical values as the Holm procedure (α/m, α/(m-1), …, α) but works in the opposite direction. Instead of starting with the smallest p-value, the procedure starts with the largest p-value, which is compared to the largest endpoint-specific critical value (α). If this first test does not show statistical significance, testing proceeds to compare the second-largest p-value to the second-largest adjusted alpha value, α/2. Sequential testing continues in this manner until a p-value for an endpoint is statistically significant, whereupon the Hochberg procedure provides a conclusion of statistically significant treatment effects for that endpoint and all endpoints with smaller p-values. For example, when the largest p-value is less than α, the method concludes that there are significant treatment effects for all endpoints. When the largest p-value is not less than α but the second-largest p-value is less than α/2, the method concludes that treatment effects have been demonstrated for all endpoints except the one associated with the largest p-value.

霍奇伯格程序是一种阶跃检验程序。它比 Holm 程序更强大(也就是说,如果在 Holm 程序下处理效果显著,那么在 Hochberg 程序下也会显著,但不一定相反),但与 Holm 程序不同的是,它只在特定假设条件下控制总体误差率。它将 p 值与 α/m、α/(m-1)、......、α 等与 Holm 程序相同的 α 临界值进行比较,但与 Holm 程序不同的是,Hochberg 程序是一个递增程序。该过程不是从最小的 p 值开始,而是从最大的 p 值开始,并与最大的终点临界值 (α)进行比较。此外,与霍尔姆程序基本相反,如果第一次假设检验没有显示出统计学意义,则继续将第二大 p 值与第二大调整后的α值(α/2)进行比较。例如,当最大 p 值小于 α 时,该方法得出的结论是所有终点都有显著的治疗效果。在另一种情况下,当最大 p 值不小于 α,但第二大 p 值小于 α/2,则该方法得出结论:除了与最大 p 值相关的终点外,所有终点的治疗效果均已得到证实。

 

The Bonferroni and the Holm procedures are well known for being assumption-free. The methods can be applied without concern for the endpoint types, their statistical distributions, and the type of correlation structure. The Hochberg procedure, on the other hand, is not assumption-free in this way. The Hochberg procedure is known to provide adequate overall alpha-control for independent endpoint tests or for positively correlated dependent tests with standard test statistics in some cases (e.g., the test statistics are jointly bivariate normal). It is also a valid test procedure when certain conditions are met. Various simulation experiments for the general case (e.g., for more than two endpoints with unequal correlation structures) indicate that the Hochberg procedure usually will, but is not guaranteed to, control the overall Type I error rate for positively correlated endpoints, but fails to do so for some negatively correlated tests (Sarkar et al. 1997, Huque 2016).

众所周知,Bonferroni 和 Holm 程序是无假设的。在应用这些方法时,无需考虑终点类型、统计分布和相关结构类型。另一方面,霍奇伯格程序却不是这样的无假设程序。众所周知,霍赫伯格程序在某些情况下(如测试统计量是共同的双变量正态),可为独立终点测试或具有标准测试统计量的正相关因变测试提供充分的总体α控制。当满足某些条件时,它也是一种有效的测试程序。针对一般情况(如两个以上具有不等相关结构的端点)的各种模拟实验表明,霍赫伯格程序通常可以控制正相关端点的总体 I 类错误率,但并不能保证做到这一点,但对于某些负相关检验,霍赫伯格程序却不能做到这一点(Sarkar 等,1997 年;Huque,2016 年)。
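The step-up logic of the Hochberg procedure can be sketched as follows. This is an illustrative sketch only: the function name `hochberg` and the example p-values are assumptions, and the code does nothing to check the independence or positive-correlation conditions under which the procedure is known to control the Type I error rate.

```python
def hochberg(p_values, alpha=0.05):
    """Hochberg step-up test.

    Returns a list of booleans in the original endpoint order:
    True where the endpoint is declared significant.
    """
    m = len(p_values)
    # Indices of the endpoints, ordered from largest to smallest p-value.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    significant = [False] * m
    for step, idx in enumerate(order):
        # Thresholds are alpha, alpha/2, ..., alpha/m.
        threshold = alpha / (step + 1)
        if p_values[idx] < threshold:
            # This endpoint and every endpoint with a smaller p-value succeed.
            for j in order[step:]:
                significant[j] = True
            break
    return significant


# Hypothetical p-values for two endpoints.
print(hochberg([0.04, 0.03]))  # -> [True, True]
print(hochberg([0.06, 0.02]))  # -> [False, True]
```

The first call shows the gain in power over the Holm procedure: Holm would compare 0.03 to 0.025, fail, and stop with no significant endpoints, whereas Hochberg succeeds at the first step because 0.04 is below 0.05.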

 

 4.        Prospective Alpha Allocation Scheme

4.前瞻性阿尔法分配计划

 

The Prospective Alpha Allocation Scheme (PAAS) (Moye 2000) is a single-step method that has a slight advantage in power over the Bonferroni method. The method allows equal or unequal alpha allocations to all endpoints, but, as with the Bonferroni method, each specific endpoint receives a prospective allocation of a specific amount of the overall alpha. The alpha allocations are required to satisfy the equation:

前瞻性阿尔法分配方案(PAAS)(莫伊,2000 年)是一种单步骤方法,与邦费罗尼法相比,它在功率方面略胜一筹。该方法允许对所有端点进行相等或不相等的 alpha 分配,但与 Bonferroni 方法一样,每个特定端点都会得到总 alpha 中特定数量的预期分配。阿尔法分配必须满足等式:

 

 (1 - α1)(1 - α2) … (1 - αk) … (1 - αm) = (1 - α).

 

Each element in this equation, (1 - αk), is the probability of correctly not rejecting the null hypothesis for the kth endpoint when it is tested at the allocated alpha αk. This procedure is valid when the endpoints are independent or positively correlated, but the Type I error rate may be inflated when the endpoints are negatively correlated. The equation states the requirement that the probability of correctly not rejecting all of the individual null hypotheses, calculated by multiplying these probabilities together, should equal the selected goal (e.g., 0.95). The alpha allocation for any of the individual endpoint tests can be arbitrarily assigned, if desired, but the total group of allocations should always satisfy the above equation. In general, when arbitrary alpha allocations are made for some endpoints, at least the last endpoint's alpha should be calculated in order to satisfy the overall equation. As stated earlier, the Bonferroni method relies upon a similar constraint-defining equation, except that for the Bonferroni method the sum of all the individual alphas should equal the overall alpha.

这个等式中的每个元素(1-αk)都是第 k 个端点在所分配的αk 下进行测试时,正确不拒绝零假设的概率。当端点独立或正相关时,该程序有效,但当端点负相关时,I 类错误率可能会增大。该等式规定,通过将每个概率相乘计算出的正确不拒绝所有单个零假设的概率应等于所选目标(如 0.95)。如果需要,可以任意分配任何一个终点测试的α分配,但总的分配组应始终满足上述等式。一般来说,当对某些端点进行任意α分配时,至少应计算最后一个端点的α,以满足总等式的要求。如前所述,Bonferroni 方法也依赖于类似的约束定义等式,只不过对于 Bonferroni 方法来说,所有单个 alpha 的总和应等于总体 alpha。
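The arithmetic behind the PAAS constraint can be checked with a short calculation. The sketch below solves the product equation for the last endpoint's alpha once the others have been chosen, and also computes the equal allocation; the function names and the example allocations (0.02 and 0.02 out of an overall 0.05) are assumptions made for illustration.

```python
def paas_last_alpha(allocated, overall_alpha=0.05):
    """Solve (1 - a1)...(1 - a_last) = 1 - overall_alpha for a_last."""
    remaining = 1.0 - overall_alpha
    for a in allocated:
        remaining /= (1.0 - a)
    last = 1.0 - remaining
    if not 0.0 < last < 1.0:
        raise ValueError("the chosen allocations already use up the overall alpha")
    return last


def paas_equal_alpha(m, overall_alpha=0.05):
    """Equal allocation: every endpoint gets 1 - (1 - alpha)^(1/m)."""
    return 1.0 - (1.0 - overall_alpha) ** (1.0 / m)


# Two endpoints are assigned 0.02 each; the third is calculated.
last = paas_last_alpha([0.02, 0.02])
print(round(last, 6))                 # slightly more than the Bonferroni remainder of 0.01
print(round(paas_equal_alpha(3), 6))  # slightly more than 0.05 / 3
```

Under the Bonferroni sum constraint the third endpoint would receive 0.05 - 0.02 - 0.02 = 0.01; the product constraint leaves a little more, which is the source of the slight power advantage mentioned above.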

 

 5.         The Fixed-Sequence Method

5. 固定序列法

 

In many studies, testing of the endpoints can be ordered in a specified sequence, often ranking them by clinical relevance or likelihood of success. A fixed-sequence statistical testing procedure tests endpoints in a predefined order, all at the same significance level alpha (e.g., α = 0.05), moving to the next endpoint only after a success on the previous endpoint. Such a testing procedure requires (1) prospective specification of the testing sequence and (2) no further testing once the sequence breaks; that is, further testing stops as soon as there is a failure of an endpoint in the sequence to show significance at level alpha (e.g., α = 0.05).

在许多研究中,终点测试可以按特定顺序排列,通常按临床相关性或成功的可能性排序。固定序列统计测试程序按预先确定的顺序测试终点,所有终点的显著性水平α(如α = 0.05)相同,只有在上一个终点测试成功后,才进入下一个终点测试。这种测试程序要求:(1) 预先确定测试顺序;(2) 一旦顺序中断,就不再进行测试;也就是说,一旦顺序中的一个端点未能显示出α水平(如α=0.05)的显著性,就停止进一步测试。

 

The appeal of the fixed-sequence testing method is that it does not require any alpha adjustment of the individual tests. Its main drawback is that if a hypothesis in the sequence is not rejected, statistical significance cannot be achieved for the endpoints planned for the subsequent hypotheses, even if they have extremely small p-values. Suppose, for example, that in a study, the p-value for the first endpoint test in the sequence is p = 0.59, and the p-value for the second endpoint is p = 0.001; despite the apparent strong finding for the second endpoint, the result is not considered statistically significant. Ignoring the first endpoint's result recreates the multiplicity problem and causes inflation of the overall Type I error rate. For this example, other methods of controlling Type I error, such as the Bonferroni method, would have shown an effect for the second endpoint.

固定序列检验法的优点是不需要对单个检验进行任何阿尔法调整。它的主要缺点是,如果序列中的一个假设未被拒绝,那么为后续假设规划的终点就无法达到统计显著性,即使这些终点的 p 值非常小。例如,假设在一项研究中,序列中第一个终点检验的 p 值为 p = 0.59,而第二个终点的 p 值为 p = 0.001;尽管第二个终点有明显的重大发现,但结果不被认为具有统计学意义。忽略第一个终点的结果会再次出现多重性问题,并导致总体 I 类错误率上升。在这个例子中,其他控制 I 类错误的方法,如 Bonferroni 方法,会显示第二个终点的影响。
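The example with p = 0.59 and p = 0.001 can be reproduced with a small sketch that contrasts the fixed-sequence method with an equally weighted Bonferroni test. The function names are assumptions made for this illustration.

```python
def fixed_sequence(p_values_in_order, alpha=0.05):
    """Test endpoints in the prespecified order, each at the full alpha.

    Testing stops at the first endpoint that is not significant.
    """
    significant = []
    still_testing = True
    for p in p_values_in_order:
        if still_testing and p < alpha:
            significant.append(True)
        else:
            # The sequence is broken: this and all later endpoints fail.
            still_testing = False
            significant.append(False)
    return significant


def bonferroni(p_values, alpha=0.05):
    """Equally weighted Bonferroni: every endpoint is tested at alpha/m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]


p = [0.59, 0.001]  # the two endpoints from the text, in testing order
print(fixed_sequence(p))  # -> [False, False]
print(bonferroni(p))      # -> [False, True]
```

Had the order been reversed, the fixed-sequence method would have rejected the hypothesis with p = 0.001 at the full 0.05 level, which is why the choice of order matters so much.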

 

Thus, for the fixed-sequence method, carefully selecting the ordering of the tests of hypotheses is critical. A test early in the sequence that fails to show statistical significance will render the  remainder of the endpoints not statistically significant. It is often not possible to determine a priori the best order for testing (Hung and Wang 2010), and there are other methods for addressing the multiplicity problem, which are described in the following subsections.

因此,对于固定序列法来说,仔细选择假设检验的顺序至关重要。如果在序列早期的测试未能显示出统计学意义,则其余的终点也不会显示出统计学意义。通常不可能事先确定最佳的测试顺序(Hung 和 Wang,2010 年),因此还有其他方法来解决多重性问题,下文将对此进行介绍。

 

 6.        Resampling-Based, Multiple-Testing Procedures

6.基于重采样的多重测试程序

 

When there is correlation among multiple endpoints, resampling (Westfall and Young 1993) is one general statistical approach that can provide more power than the methods described above to detect a true treatment effect while maintaining control of the overall Type I error rate, and the power increases as the correlation increases. With these methods, a distribution of the possible test-statistic values under the null hypothesis is generated based upon the observed data of the trial. This data-based distribution is then used to find the p-value of the observed study result instead of using a theoretical distribution of the test statistics (e.g., a normal distribution of Z-scores, or a t-distribution for t-scores) as with most other methods.

当多个终点之间存在相关性时,重采样(Westfall 和 Young,1993 年)是一种通用的统计方法,它可以提供比上述方法更强的检测真实治疗效果的能力,同时保持对总体 I 类错误率的控制,而且随着相关性的增加,检测能力也会增加。使用这些方法时,会根据试验观察到的数据生成零假设下可能的试验统计量值分布。然后使用这个基于数据的分布来找出观察到的研究结果的 p 值,而不是像大多数其他方法那样使用检验统计量的理论分布(如 Z 分数的正态分布或 t 分数的 t 分布)。

 

Resampling methods include the bootstrap and permutation approaches for multiple endpoints and require few, albeit important, assumptions about the true distribution of the endpoints. There are, however, some drawbacks to these methods. The important assumptions are generally difficult to verify, particularly for small study sample sizes. These methods, consequently, usually require large study sample sizes (particularly bootstrap methods) and often require simulations to ensure the data-based distribution of the test statistics from the limited trial data is applicable and to ensure adequate Type I error rate control. Inflation of the Type I error rate may occur, for example, if the shape of the data distribution is different between the treatment groups being compared.

重采样方法包括针对多终点的自举法和置换法,这些方法对终点的真实分布要求不高,尽管这些假设很重要。不过,这些方法也有一些缺点。重要的假设通常很难验证,特别是对于研究样本量较小的情况。因此,这些方法通常需要较大的研究样本量(尤其是自举法),而且往往需要进行模拟,以确保基于有限试验数据的测试统计量的数据分布是适用的,并确保有足够的 I 类错误率控制。例如,如果被比较的治疗组之间的数据分布形状不同,就可能出现 I 类错误率膨胀。
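One member of this family, a single-step permutation "max-T" adjustment, can be sketched with the standard library alone. This is an assumption-laden illustration, not the specific method of any cited reference: it assumes two treatment groups, continuous endpoints, and exchangeability of subjects under the null hypothesis, and the function names and simulated data are hypothetical.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def _var(values):
    mu = _mean(values)
    return sum((v - mu) ** 2 for v in values) / (len(values) - 1)


def abs_t(x, y):
    """Absolute two-sample t-statistic with an unequal-variance standard error."""
    se = (_var(x) / len(x) + _var(y) / len(y)) ** 0.5
    return abs(_mean(x) - _mean(y)) / se


def maxt_adjusted_p(treated, control, n_perm=1000, seed=0):
    """Multiplicity-adjusted p-values from the permutation distribution of max |t|.

    treated, control: lists of per-subject tuples, one value per endpoint.
    Whole subjects are shuffled, so the correlation among endpoints is kept.
    """
    rng = random.Random(seed)
    m = len(treated[0])
    n_treated = len(treated)
    pooled = list(treated) + list(control)

    def all_stats(a, b):
        return [abs_t([s[k] for s in a], [s[k] for s in b]) for k in range(m)]

    observed = all_stats(treated, control)
    exceed = [0] * m
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        max_t = max(all_stats(shuffled[:n_treated], shuffled[n_treated:]))
        for k in range(m):
            if max_t >= observed[k]:
                exceed[k] += 1
    # Count the observed labelling as one of the permutations.
    return [(c + 1) / (n_perm + 1) for c in exceed]


# Simulated data: a true effect on endpoint 1 only (hypothetical values).
data_rng = random.Random(1)
treated = [(data_rng.gauss(1.0, 1.0), data_rng.gauss(0.0, 1.0)) for _ in range(20)]
control = [(data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)) for _ in range(20)]
print(maxt_adjusted_p(treated, control, n_perm=500))
```

Each endpoint is declared significant if its adjusted p-value is below the overall alpha. As the text notes, the validity of such a procedure rests on assumptions (here, exchangeability) that should be examined, for example by simulation, before it is relied upon.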

 

7.         Gatekeeping Testing Strategies

7. 把关测试策略

 

Gatekeeping procedures (e.g., Dmitrienko et al. 2008, Dmitrienko and D’Agostino 2013) address the problems of testing hierarchically ordered families of null hypotheses. Families usually correspond to primary and secondary objectives in a clinical trial (see section III.A.). Inferences in each family depend on the acceptance or rejection of null hypotheses in the earlier families consistent with logical relationships that may exist among the null hypotheses. The relationships usually reflect the relevant clinical considerations and are specified using a set of logical restrictions. Different types of logical gatekeeping constraints have been studied including serial gatekeeping, parallel gatekeeping and their generalization referred to as tree-structured gatekeeping.

把关程序(例如,Dmitrienko 等人,2008 年;Dmitrienko 和 D'Agostino,2013 年)可解决分层有序的零假设检验问题。族通常对应于临床试验中的主要目标和次要目标(见第 III.A 节)。每个族中的推论取决于接受或拒绝前一个族中的零假设,这些零假设之间可能存在逻辑关系。这些关系通常反映了相关的临床考虑因素,并通过一系列逻辑限制条件加以明确。已研究过不同类型的逻辑把关限制,包括串行把关、并行把关以及被称为树状结构把关的概括。

 

 A serial strategy can be applied, for example, in the scenario where the endpoints of the primary family are tested as co-primary endpoints (section III.C.). If all endpoints in the primary family   are statistically significant at the alpha level (e.g., α = 0.05), the endpoints in the second family are examined. The endpoints in the second family can be tested at the overall alpha level by any prespecified acceptable method (e.g., Holm procedure, the fixed-sequence method, or others described in this appendix) that controls Type I error rate within the second family. If, however, at least one of the null hypotheses of the primary family fails to be rejected, the primary family  criterion has not been met and the secondary endpoint family is not tested.

例如,在将主族终点作为共主终点进行测试的情况下,就可以采用序列策略(第 III.C 节)。如果主族中的所有终点在α水平(如α=0.05)上都具有统计学意义,则对第二族中的终点进行检验。第二个族中的终点可通过任何预先指定的可接受方法(如霍尔姆程序、固定序列法或本附录中描述的其他方法)在总体α水平上进行检验,该方法可控制第二个族中的I类错误率。但是,如果主要族中至少有一个零假设未被拒绝,则主要族标准未达到,次要终点族不进行测试。
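The serial strategy just described can be sketched as follows, using the Holm procedure as the prespecified method for the second family. The function names and example p-values are assumptions for illustration; any other acceptable method could take the place of Holm in the second family.

```python
def holm(p_values, alpha=0.05):
    """Equally weighted Holm step-down test (see appendix section 2)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] < alpha / (m - step):
            significant[idx] = True
        else:
            break
    return significant


def serial_gatekeeping(primary_p, secondary_p, alpha=0.05):
    """Co-primary endpoints act as a gate for the secondary family.

    Returns (gate_open, secondary_results).
    """
    # Co-primary endpoints: every one must be significant at the full alpha.
    gate_open = all(p < alpha for p in primary_p)
    if gate_open:
        # The second family is tested at the overall alpha with Holm.
        secondary = holm(secondary_p, alpha)
    else:
        # Primary criterion not met: the secondary family is not tested.
        secondary = [False] * len(secondary_p)
    return gate_open, secondary


print(serial_gatekeeping([0.01, 0.03], [0.02, 0.04]))  # -> (True, [True, True])
print(serial_gatekeeping([0.01, 0.08], [0.02, 0.04]))  # -> (False, [False, False])
```

In the second call the secondary p-values are unchanged, but because one co-primary endpoint fails at 0.05 the gate stays closed and no secondary endpoint can be declared significant.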

 

A parallel gatekeeping strategy is applied when the endpoints in the primary family are not all co-primary endpoints, and a separable testing method (e.g., Bonferroni method or Truncated Holm method) is specified for the primary family. In this strategy, the second endpoint family is  examined when at least one of the endpoints in the first family has shown statistical significance.

当主族中的端点并非都是共主端点时,就会采用平行把关策略,并为主族指定一种可分离的检测方法(如 Bonferroni 法或截断 Holm 法)。在这一策略中,当第一个族中至少有一个终点显示出统计学意义时,就会对第二个终点族进行检测。

 

Some multiplicity problems are multidimensional. One dimension may correspond to multiple endpoints, a second to multiple-dose groups (that have each of those endpoints tested), and yet another dimension to multiple hypotheses regarding an endpoint, such as non-inferiority and superiority tests (for each dose and each endpoint). The multiple sources of multiplicity create    the potential for multiple pathways of testing the hypotheses. For example, if the goal of a study is to demonstrate non-inferiority as well as superiority, a single path of sequential tests is preferred. Suppose, however, that one wants to analyze a second endpoint for non-inferiority    after the first endpoint is successfully shown to be non-inferior. The testing path now branches into two paths from this initial test (i.e., testing superiority for the first endpoint and non-inferiority for the second endpoint).

有些多重性问题是多维的。一个维度可能对应多个终点,第二个维度对应多个剂量组(对每个终点都进行测试),另一个维度对应与一个终点有关的多个假设,如非劣效性和优效性测试(针对每个剂量和每个终点)。多重性的多种来源为测试假设提供了多种途径。例如,如果一项研究的目标是证明非劣效性和优效性,那么最好采用单一的顺序测试途径。然而,假设在第一个终点成功证明为非劣效性后,还想对第二个终点进行非劣效性分析。现在,测试路径从最初的测试分支成两条路径(即测试第一个终点的优越性和第二个终点的非劣效性)。

 

 The multi-branched gatekeeping procedure allows for ordering the sequence of testing with the option of testing of more than one endpoint if a preceding test is successful. When there are multiple levels of this sequential hierarchy, and branching is applied at several of the steps, the possible paths of endpoint testing become a complex, multi-branched structure.

多分支把关程序允许对测试顺序进行排序,如果前面的测试成功,可选择测试多个端点。当这种顺序层次结构有多个层次,并在多个步骤中应用分支时,端点测试的可能路径就会成为一个复杂的多分支结构。

 

 As a simple illustration (Figure A1), consider a clinical trial that compares a treatment to control on two primary endpoints (Endpoint 1 and Endpoint 2) to determine first whether the treatment is non-inferior to the control for at least one endpoint. If, for either of the two endpoints, the treatment is found non-inferior to the control, there is also a desire to test whether it is superior to control for that endpoint. The analytic plan for the trial thus sets the following logical restrictions:

举个简单的例子(图 A1),考虑一项临床试验,在两个主要终点(终点 1 和终点 2)上对治疗方法和对照方法进行比较,首先确定治疗方法在至少一个终点上是否不劣于对照方法。如果发现在两个终点中的任何一个终点上,治疗效果不劣于对照组,则还希望测试在该终点上治疗效果是否优于对照组。因此,试验的分析计划设定了以下逻辑限制:

 

i.     Test endpoint two only after non-inferiority for endpoint one is first established.

i.只有首先确定终点一的非劣效性后,才能测试终点二。

 

 ii.    Test for superiority on an endpoint only after non-inferiority for that endpoint is first concluded.

ii.只有在对某一终点得出非劣效性结论后,才能测试该终点的优越性。

 

The following diagram shows the decision structure of the test strategy. In this diagram, each block (or node) states the null hypothesis that it tests.

下图显示了测试策略的决策结构。在该图中,每个区块(或节点)都说明了它要检验的零假设。

 


Figure A1: Example of a flow diagram for non-inferiority and superiority tests for endpoints one and two of a trial with logical restrictions, where α1 + α2 = α. To test for superiority for Endpoint 1 and/or 2, one should first establish non-inferiority for that endpoint.

图 A1:对有逻辑限制的试验的终点 1 和 2 进行非劣效性和优效性检验的流程图示例,其中 α1 + α2 = α。要检验终点 1 和/或 2 的优越性,首先应确定该终点的非劣效性。

 

Thus, the above test strategy has a two-dimensional hierarchical structure, one dimension for the two different endpoints and the other for the non-inferiority and superiority tests, with the logical restrictions as stated above. Note that for this type of procedure, if multiple branches split off from a single node, the alpha should be split across the multiple branches.

因此,上述检验策略具有二维层次结构,一维对应两个不同的终点,另一维对应非劣效性和优效性检验,并具有上述逻辑限制。请注意,对于此类程序,如果从单个节点分出多个分支,则应将 alpha 在多个分支之间进行分配。

 

8.         Graphical Approaches Based on Sequentially Rejective Tests

8. 基于顺序拒绝测试的图形方法

 

The graphical approach (e.g., Bretz et al. 2009) is a means for developing and evaluating multiple analysis strategies for Bonferroni-based sequentially rejective methods. This approach illustrates differences in endpoint importance as well as the relationships among the endpoints by mapping onto a test strategy that ensures control of the Type I error rate and aids in creating and evaluating alternative test strategies.

图解法(如 Bretz 等人,2009 年)是为基于 Bonferroni 的顺序拒绝法制定和评估多重分析策略的一种方法。这种方法通过映射到测试策略来说明终点重要性的差异以及终点之间的关系,从而确保对 I 类错误率的控制,并有助于制定和评估替代测试策略。

 

Graphical displays of complex analysis strategies can aid in describing and assessing the proposed plan by displaying all the logical relationships among endpoint tests of hypotheses.

复杂分析策略的图形显示有助于描述和评估拟议的计划,因为它可以显示假设的终点测试之间的所有逻辑关系。

 

Basics of the Graphical Approach: Use of vertex (node) and path (order or direction)

图形方法的基本原理:顶点(节点)和路径(顺序方向)的使用

 

In the graphical approach, the testing strategy is defined by a figure (graph) that shows each of the hypotheses (H1, H2, …, Hm) located at a vertex (or node, a junction of testing order paths). Each vertex (hypothesis) is allocated an initial amount of alpha, which this document defines as the endpoint-specific alpha (with the understanding that a test of an endpoint is associated with a test of a hypothesis, and vice versa). A key requirement is that the sum of all of the endpoint-specific alpha levels is equal to the total alpha level available for the study (the overall Type I error rate). At each step of the algorithm, endpoints are tested at the endpoint-specific significance levels using the Bonferroni procedure.

在图形方法中,测试策略由一个图(图形)来定义,该图(图形)显示了位于顶点(或节点,测试阶次路径的交点)的每个假设(H1、H2、…、Hm)。每个顶点(假设)都分配了初始的 alpha 值,本文将其定义为终点特定的 alpha 值(对终点的检验与对假设的检验相关联,反之亦然)。一个关键要求是,所有终点特定的 alpha 值之和等于研究可用的总 alpha 值(总体 I 类错误率)。在算法的每一步,均使用 Bonferroni 程序按终点特定的显著性水平检验终点。

 

Another feature of the figure (graph) is a set of directed edges. Each directed edge (or arrow) connects two hypotheses and is assigned a value between 0 and 1, called a weight for that edge and shown above the arrow, which indicates the fraction of the preserved alpha to be shifted along that path to the receiving hypothesis when the hypothesis at the tail end of the path is successful (i.e., is rejected). The sum of the weights across all the paths leaving a vertex should be 1.0, so that the entire preserved alpha is used in testing subsequent hypotheses. All study hypotheses that are intended to potentially provide firm conclusions of efficacy are shown in the graph.

图(图形)的另一个特征是有向边集。每条有向边(或箭头)连接两个假设,并被赋予一个介于 0 和 1 之间的值,称为该边的权重,显示在箭头上方,表示当路径尾端的假设成功(即被拒绝)时,沿该路径转移给接收假设的保留 alpha 的比例。离开顶点的所有路径的权重之和应为 1.0,以便在测试后续假设时使用整个保留的 alpha。图中显示了所有可能提供确切疗效结论的研究假设。

 

Several examples of the graphical method follow to help illustrate the concept, construction, interpretation, and application of these diagrams.

以下是几个图表法的例子,以帮助说明这些图表的概念、构建、解释和应用。

 

Fixed-Sequence Method 固定序列法

 

The fixed-sequence testing strategy (appendix section 5), shown in Figure A2, illustrates a simple case of the graphical method with three hypotheses. In this scheme, the endpoints (hypotheses) are ordered. Testing begins with the first endpoint at the full alpha level and continues through the sequence only until an endpoint is not statistically significant. This diagram shows that the endpoint-specific alpha levels associated with hypotheses H1, H2, and H3 are set in the beginning as α, 0, and 0. For the fixed-sequence method, arrows represent the sequence of testing, and if the test is successful, the full alpha is shifted along to the next test. Consequently, if null hypothesis H1 is successfully rejected, the endpoint-specific alpha level for H2 becomes 0 + 1 × α = α, which allows testing of H2 at level α. However, if the test of H1 is unsuccessful, there is no pre-assigned non-zero alpha for H2 to allow testing of H2, so the testing stops.

图 A2 所示的固定序列测试策略(附录第 5 节)说明了包含三个假设的图形方法的一个简单案例。在此方案中,端点(假设)是有序的。测试从第一个端点的全α水平开始,一直到某个端点在统计上不显著为止。该图显示,与假设 H1、H2 和 H3 相关的特定终点的 α 水平在开始时分别设置为 α、0 和 0。对于固定序列法,箭头代表测试序列,如果测试成功,全 α 水平就会转移到下一个测试。因此,如果成功拒绝了零假设 H1,H2 的终点特定 alpha 水平就会变成 0 + 1 × α = α,从而可以在水平 α 上对 H2 进行检验。但是,如果对 H1 的检验不成功,就没有预先指定的非零α 水平来检验 H2,因此检验也就停止了。

 


Figure A2: Graphical illustration of the fixed-sequence testing with three hypotheses.

图 A2:三个假设的固定序列测试图解。

 

Loop-Back Feature to Indicate Two-Way Potential for Retesting

回环功能可显示复测的双向可能性

 

Another valuable feature of the graphical method occurs when the available alpha level is split   between two or more endpoints into endpoint-specific alpha levels; these diagrams illustrate the potential for loop-back passing of endpoint-specific alpha.

当两个或多个端点之间的可用阿尔法水平被分割成端点特定的阿尔法水平时,图形方法就会出现另一个有价值的特征;这些图表说明了端点特定阿尔法水平回传的可能性。

 

The Holm procedure (appendix section 2) is a specific case of tests for two hypotheses with a loop-back feature where the graphical method enables a simple depiction of the procedure and its rationale. The Holm procedure directs that the first step is to test the smaller p-value at endpoint-specific alpha = α/2 and, only if successful, proceed to test the larger p-value at the level α (e.g., 0.05). Because the Holm procedure splits alpha evenly in half, if the test of hypothesis with the smaller p-value was not significant, it is clear that the test with the larger p-value will also fail to be significant; performing that comparison is unnecessary. The diagram for the Holm procedure (Figure A3) shows two vertices and associated endpoint-specific alpha levels of α1 = 0.025 and α2 = 0.025, respectively, satisfying the requirement for total alpha = 0.05. The two arrows show that alpha might be passed along from H1 to H2, or H2 to H1. If the first test is successful, the endpoint-specific alpha of 0.025 is shifted entirely to the other hypothesis and added to the endpoint-specific alpha already allocated for that hypothesis to provide a net alpha of 0.05. Because either hypothesis might be tested first, the diagram shows a loop-back configuration.

霍尔姆程序(附录第 2 节)是两个假设检验的一个特殊案例,它具有回环功能,图解法可 以简单地描述该程序及其原理。霍尔姆程序规定,第一步是在终点特定的 alpha = α/2 条件下检验较小的 p 值,只有当检验成功时,才在水平 α(如 0.05)上检验较大的 p 值。由于霍尔姆过程将 α 平均分成两半,如果对 p 值较小的假设检验不显著,那么对 p 值较大的假设检验显然也不显著,因此没有必要进行比较。霍尔姆过程图(图 A3)显示了两个顶点和相关的终点特定α水平,分别为 α1 = 0.025 和 α2 = 0.025,满足总α = 0.05 的要求。两个箭头表示α可能从 H1 传递到 H2,或从 H2 传递到 H1。如果第一个测试成功,0.025 的终点特定 alpha 就会完全转移到另一个假设上,并与已经分配给该假设的终点特定 alpha 相加,从而得到 0.05 的净 alpha。由于任何一个假设都可能首先进行测试,因此该图显示了一个回环配置。

 


Figure A3: Graphical illustration of the Holm procedure with two hypotheses.

图 A3:带有两个假设的霍尔姆程序图解。

 

Testing on the diagram can start at any of the vertices that have non-zero alpha in the initial diagram, and all vertices with non-zero alpha can be tested until one is found for which the test is successful (i.e., the hypothesis is rejected). Then, the respective node is removed, and the alpha allocated to the rejected hypothesis propagates to other nodes following the arrows, as directed in the diagram. The final conclusions of which hypotheses were rejected and which were not will be the same irrespective of which vertex was inspected first. The graphical method enables complex alpha-splitting and branching of testing path features to be clearly identified as part of the analysis plan and correctly implemented.

图中的测试可以从初始图中任何一个 alpha 值不为零的顶点开始,所有 alpha 值不为零的顶点都可以被测试,直到找到一个测试成功的顶点(即假设被拒绝)。然后,删除相应的节点,分配给被拒绝假设的 alpha 会按照图中箭头的指示传播到其他节点。无论先检查哪个顶点,哪些假设被否决,哪些没有被否决,最终结论都是一样的。图解法可将复杂的阿尔法分割和测试路径分支特征作为分析计划的一部分加以明确,并正确实施。
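The test-remove-propagate cycle described above can be sketched in Python. The alpha-passing step follows the text directly; the rule used to rewire the remaining arrows after a node is removed is the update rule commonly given for this approach (Bretz et al. 2009) and is stated here as an assumption, since the text does not spell it out. The function name and the dictionary layout are hypothetical.

```python
def graphical_test(p_values, alphas, weights):
    """Sequentially rejective graphical test.

    p_values: {hypothesis: p-value}
    alphas:   {hypothesis: initial endpoint-specific alpha}
    weights:  {tail: {head: edge weight}} for the directed edges
    Returns the sorted list of rejected hypotheses.
    """
    alpha = dict(alphas)
    g = {h: dict(weights.get(h, {})) for h in alpha}
    rejected = []
    while True:
        # Any remaining node whose p-value is below its current alpha can be rejected.
        candidates = [h for h in alpha if p_values[h] < alpha[h]]
        if not candidates:
            break
        j = candidates[0]
        rejected.append(j)
        remaining = [h for h in alpha if h != j]
        # Propagate the alpha of the rejected node along its outgoing arrows.
        new_alpha = {l: alpha[l] + alpha[j] * g[j].get(l, 0.0) for l in remaining}
        # Rewire the arrows so that paths through the removed node are preserved.
        new_g = {}
        for l in remaining:
            new_g[l] = {}
            for k in remaining:
                if k == l:
                    continue
                denom = 1.0 - g[l].get(j, 0.0) * g[j].get(l, 0.0)
                numer = g[l].get(k, 0.0) + g[l].get(j, 0.0) * g[j].get(k, 0.0)
                new_g[l][k] = numer / denom if denom > 0 else 0.0
        alpha, g = new_alpha, new_g
    return sorted(rejected)


# Fixed-sequence graph (Figure A2): alpha = (0.05, 0, 0), arrows H1 -> H2 -> H3.
seq_weights = {"H1": {"H2": 1.0}, "H2": {"H3": 1.0}}
seq_alphas = {"H1": 0.05, "H2": 0.0, "H3": 0.0}
print(graphical_test({"H1": 0.01, "H2": 0.04, "H3": 0.20}, seq_alphas, seq_weights))
# -> ['H1', 'H2']

# Holm graph with two hypotheses (Figure A3): 0.025 each, arrows in both directions.
holm_weights = {"H1": {"H2": 1.0}, "H2": {"H1": 1.0}}
holm_alphas = {"H1": 0.025, "H2": 0.025}
print(graphical_test({"H1": 0.04, "H2": 0.02}, holm_alphas, holm_weights))
# -> ['H1', 'H2']
```

In the Holm example H1 cannot be rejected at 0.025, but once H2 is rejected its 0.025 loops back to H1, which is then tested at 0.05 and rejected. Hypothetical p-values are used throughout.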

 

Progressive Updating of the Diagram When Hypotheses Are Successfully Rejected

假设被成功否决时图表的逐步更新

 

The graphical approach guides the hierarchical testing of multiple hypotheses through continual updating of the initial graph whenever a hypothesis is successfully rejected. The initial graph represents the full testing strategy (with all hypotheses). Each new graph shows the progression of the testing strategy by eliminating hypotheses that have been rejected and retaining those yet to be tested or re-tested.

每当一个假设被成功否定时,图形方法就会通过不断更新初始图形来指导多个假设的分层检验。初始图形代表完整的测试策略(包含所有假设)。每个新图表都通过剔除已被否定的假设、保留那些有待测试或重新测试的假设,来显示测试策略的进展。

When there is a desire to consider analysis strategies with complex division of alpha, the graphical method and progressive updating of the diagram can aid in understanding the implications of the different strategies for a variety of hypothetical scenarios. This progressive updating can aid in choosing which specific strategy to use for the final study statistical analysis plan.

当需要考虑具有复杂阿尔法划分的分析策略时,图形方法和图表的逐步更新有助于理解不同策略对各种不同假设情况的影响。这种逐步更新的方法有助于选择最终研究统计分析计划的具体策略。



https://blog.sciencenet.cn/blog-3426442-1432161.html
