Interim monitoring of survival studies}

First, we import the library gsDesign into R.

    ################################################
    ### Sequential monitoring of clinical trials ###
    ################################################
    library(gsDesign)
## Warning: package 'gsDesign' was built under R version 3.6.3

When designing a survival study, there are several complications even in the fixed design. There are two sources of complications. The fact that the power is a function of the events not the sample size, and the number of events we observe (even from the same number of patients) depends on the duration of the study. The duration of the study in turn, depends on the duration of the patient accrual and the subsequent follow-up of all accrued patients, but even the duration of patient accrual is a function of the accrual rate.
Needless to say, adding a schedule of interim analyses on top of the already complex design only increases the complexity and the level of the challenge for the analyst (that is you!).

In the sequel, we will follow two steps or the design process: We first calculate the number of events required without sequential monitoring (fixed-sample design) and then we add the sequential monitoring in the design.

Fixed-sample design

Consider the situation where you are designing a study comparing the mortality (or, conversely, survival) of patients receiving an experimental treatment and a group receiving the standard treatment. Suppose further that you would like to explore the possibility that there is a 25% reduction in the hazard of mortality in the experimental treatment group (i.e., \(\theta=\lambda_1/\lambda_0=0.75\)) and that you will ultimately carry out the comparison at the two-sided \(\alpha=0.05\) level. You would like to determine the number of patients needed to ensure that, under the above considerations, you will have 80% power (i.e., \(\beta=0.2\)).
As we have mentioned before, we first come up with the number of events necessary. First we have to calculate the hazard rates \(\lambda_0\) and \(\lambda_1\) for the experimental and control groups respectively. Recall from the class notes, that the hazard ratio is \[ HR=\frac{\lambda_1}{\lambda_0}=\frac{m_0}{m_1} \] where \(m_0\) and \(m_1\) are the median survival times in the two arms. This is because \(\lambda_0=\frac{-\log 0.5}{m_0}\) and similarly for \(\lambda_1\).

  # (iv) Using O'Brien-Fleming boundaries with number of events
  # Fixed-sample design
  HR = 0.75
theta = (qnorm(0.975)+qnorm(0.80))

# Number of events without monitoring
D = 4*theta^2/log(HR)^2
D
## [1] 379.3517

In other words, \(D=380\) events will be needed to produce the requisite power.

From events to patients

The duration of the study will depend on many considerations. Let’s suppose here that we want to cap the total duration of the study to \(T=36\) months (this will include both the accrual and follow-up period). We use the function nSurv as follows:

# Hazard for control
m0<-9
lambda0<--log(0.5)/m0
# Hazard for exp. treatment
m1<-12
lambda1<--log(0.5)/m1

# Sample size in the fixed-sample design
nSurv(alpha = 0.05, sided = 2, beta=0.20,
lambdaC = lambda0, hr = lambda1/lambda0, hr0 = 1,
gamma = 30, T=36, ratio = 1)
## Error in KT(lambdaC = lambdaC, hr = hr, hr0 = hr0, etaC = etaC, etaE = etaE, : Enrollment duration insufficient to power trial

Oops! This means that the design as described cannot result in the requisite power given the alpha level of the ultimate comparison. In other words, we don’t have time to enroll enough patients and follow them for a sufficient amount of time in order to observe \(D=380\) events (even at the rate of 30 patients a month). After some trial and error, the following is a viable study design using 12 months accrual duration:

  # Sample size in a viable fixed-sample design
  nSurv(alpha = 0.05, sided = 2, beta=0.20,
        lambdaC = lambda0, hr = lambda1/lambda0, hr0 = 1,
        gamma = 40, R = 12, ratio = 1)
## Fixed design, two-arm trial with time-to-event
## outcome (Lachin and Foulkes, 1986).
## Solving for:  Follow-up duration 
## Hazard ratio                  H1/H0=0.75/1
## Study duration:                   T=29.7532
## Accrual duration:                   12
## Min. end-of-study follow-up: minfup=17.7532
## Expected events (total, H1):        378.0039
## Expected sample size (total):       480
## Accrual rates:
##      Stratum 1
## 0-12        40
## Control event rates (H1):
##       Stratum 1
## 0-Inf     0.077
## Censoring rates:
##       Stratum 1
## 0-Inf         0
## Power:                 100*(1-beta)=80%
## Type I error (1-sided):   100*alpha=5%
## Equal randomization:          ratio=1

In this situation, we end up with a study that has total duration \(T=29.8\) months (about 2.5 years), of which 12 months are devoted to patient accrual (at a rate of \(\gamma=40\) patients/month for a total of \(N=480\) patients) followed by about 17.75 months (\(\approx 1.5\) years) of follow-up after the last patient has been accrued.

Adding the interim analyses

We now want to add an interim analysis schedule involving \(k=3\) total analyses at the \(\tau_1=0.30\), \(\tau_3=0.60\) and \(\tau_3=1.00\) trial fraction. Now, instead of specifying the sample size without monitoring, we specify the number of events. We use the argument n.fix in gsDesign to do so.

  # Sample size is number of events here!
  x = gsDesign(k = 3, test.type = 2, alpha = 0.025,beta = 0.20,
               timing = c(0.3,0.6,1), sfu = "OF")
x
## Symmetric two-sided group sequential design with
## 80 % power and 2.5 % Type I Error.
## Spending computations assume trial stops
## if a bound is crossed.
## 
##            Sample
##             Size 
##   Analysis Ratio*  Z   Nominal p  Spend
##          1  0.304 3.64    0.0001 0.0001
##          2  0.608 2.57    0.0050 0.0050
##          3  1.014 1.99    0.0231 0.0199
##      Total                       0.0250 
## 
## ++ alpha spending:
##  O'Brien-Fleming boundary.
## * Sample size ratio compared to fixed design with no interim
## 
## Boundary crossing probabilities and expected sample size
## assume any cross stops the trial
## 
## Upper boundary (power or Type I Error)
##           Analysis
##    Theta      1      2      3 Total   E{N}
##   0.0000 0.0001 0.0050 0.0199 0.025 1.0095
##   2.8016 0.0182 0.3316 0.4503 0.800 0.8664
## 
## Lower boundary (futility or Type II Error)
##           Analysis
##    Theta     1     2      3 Total
##   0.0000 1e-04 0.005 0.0199 0.025
##   2.8016 0e+00 0.000 0.0000 0.000

Thus, the number of events now become \(D'\)=385, slightly larger than the number of events required without sequential monitoring.

From events to patients

In the previous paragraphs we used the function nSurv, which calculates the required number of subjects and events in the fixed-sample design (i.e., when a single look at the end of the study is to take place). The function gsSurv constitutes an extension of nSurv, allowing for interim analyses during the study (please take a look at the previous labs for nSurv).

Assuming the exponential distribution for the event times, it is easy to obtain the event rates in each group (also see the second lab for such calculations). At this stage, the study design doesn’t involve futility boundaries. We include in R the fact that we want \(k=3\) looks at the \(\tau_1=0.3\), \(\tau_2=0.6\) and \(\tau_3=1.0\) trial fraction we want to produce O’Brien-Fleming boundaries for efficacy. This is done by selecting the option sfu = "OF". Other important parameters include test.type = 2 which, as before, means that we want symmetric upper and lower bounds for a two-sided alpha level (note the option side=2). As before, we include the rate of patient accrual per unit time gamma=40 and accrual duration R=12. The R code follows:

# (v) Sequential monitoring of survival study using gsSurv 
# Hazard for control
x3 = gsSurv( k = 3, test.type = 2, alpha = 0.05, sided = 2,
             lambdaC = lambda0, hr = lambda1/lambda0, hr0 = 1,
             beta = 0.20, sfu = "OF", gamma = 40, R = 12 , 
             ratio = 1, timing=c(0.3, 0.6, 1.0))
x3
## Time to event group sequential design with HR= 0.75 
## Equal randomization:          ratio=1
## Symmetric two-sided group sequential design with
## 80 % power and 2.5 % Type I Error.
## Spending computations assume trial stops
## if a bound is crossed.
## 
##               
##   Analysis  N   Z   Nominal p  Spend
##          1 115 3.64    0.0001 0.0001
##          2 230 2.57    0.0050 0.0050
##          3 384 1.99    0.0231 0.0199
##      Total                    0.0250 
## 
## ++ alpha spending:
##  O'Brien-Fleming boundary.
## 
## Boundary crossing probabilities and expected sample size
## assume any cross stops the trial
## 
## Upper boundary (power or Type I Error)
##           Analysis
##    Theta      1      2      3 Total  E{N}
##   0.0000 0.0001 0.0050 0.0199 0.025 381.6
##   0.1441 0.0182 0.3316 0.4503 0.800 327.5
## 
## Lower boundary (futility or Type II Error)
##           Analysis
##    Theta     1     2      3 Total
##   0.0000 1e-04 0.005 0.0199 0.025
##   0.1441 0e+00 0.000 0.0000 0.000
##              T        n   Events HR futility HR efficacy
## IA 1  10.32165 412.8659 114.9578       1.971       0.507
## IA 2  16.14506 480.0000 229.9157       1.404       0.712
## Final 30.55325 480.0000 383.1929       1.226       0.816
## Accrual rates:
##      Stratum 1
## 0-12        40
## Control event rates (H1):
##       Stratum 1
## 0-Inf      0.08
## Censoring rates:
##       Stratum 1
## 0-Inf         0

From the above output we see that the required number of events is 384 and the maximum study duration is just over 30 months (30.55 months to be exact) or about 2.5 years, of which, the first 12 months are devoted to patient accrual (at an average \(\gamma=40\) patients per month for a total \(N=480\) patients).
The above output

## Upper boundary (power or Type I Error) 
##           Analysis 
##    Theta      1      2      3 Total  E{N} 
##   0.0000 0.0001 0.0050 0.0199 0.025 381.6 
##   0.1441 0.0182 0.3316 0.4503 0.800 327.5

suggests that the expected number of events is 381.6 under the null hypothesis and 327.5 under the alternative. The fact that it is lower than the \(D=384\) total events has to do with the fact that, even under the null hypothesis, there is nonzero probability of stopping early during one of the interim analyses. Finally, consider the code chunk

##              T        n   Events HR futility HR efficacy 
## IA 1  10.32165 412.8659 114.9578       1.971       0.507 
## IA 2  16.14506 480.0000 229.9157       1.404       0.712 
## Final 30.55325 480.0000 383.1929       1.226       0.816

We see that the first interim analysis will occur at \(\tau_1=10.3\) months, when about 115 events have been observed (and about 413 patients have been enrolled in the study). The second interim analysis, assuming the study was not stopped at the first interim analysis, will occur during month 16 when about 230 events have been observed.
Also notice in this same output chunk that the hazards ratios corresponding to the efficacy and futility bounds are given in the output. For example, it would take an observed hazard ratio of \(\theta_1=0.507\) in the first interim analysis to stop the study in favor of the experimental treatment and almost a doubling of the hazard (\(\theta^*_1=1.971\)) to stop the study in favor of the standard therapy. As a final remark, note that the second interim analysis will occur after all the patients have been accrued. Given the high threshold required by the O’Brien-Fleming method, this means that it is very unlikely that we will prevent any of the prospective subjects from being exposed to the experimental drug even if the experimental therapy results in significant elevation of the mortality risk compared to the standard therapy.

Recalibration of survival studies

Many times during the implementation of a survival study we may need to adjust the study design and the study monitoring based on various factors, which include but are not limited to the observed rate of occurrence of the events as well as logistical issues having to do with the timing of the review itself. Alpha spending functions enable us to address all of these issues.

Example: Recalibration of alpha spending

Suppose we have a survival study based on the log-rank test and we are using the Pocock spending function \(\alpha_P(t) = 0.05\log\{1 + (e-1)t\}\), and we estimate that there will be \(D=100\) deaths by the end of the six-year trial. Now suppose that the first analysis happened at the end of the first year and after \(d_1=10\) deaths were observed. Given the Pocock procedure, at this look, we’ve spent \(\alpha_P(1/10) = 0.008\) (using z-score boundaries \(Z(0.1)=\pm 2.655\) from the Pocock spending function).
At the second analysis, at the end of the second year, you figure that you will have at most \(D'=50\) events at the end of 6 years.

Question: What do you do?

You realize that you are underspending your alpha. This is because the information fraction should have been \(\tau_1 = 10/50 = 0.2\) and you should have spent \(\alpha_P(0.20) = 0.015\) but you spent only \(\alpha(0.10) = 0.008\). Alpha spent at the first look is gone. You need to accelerate the rate of alpha spending.
One way to fix it is to interpolate the spending function for \(t\geq 0.20\) by the following formula \[ \alpha'_P(t) = 0.008 +\left (\frac{0.05-0.008}{0.05-0.015}\right ) \left \{\alpha_P(t)- 0.015\right \} \]

t<-seq(0,1,.1)
# Original spending function
apt<-0.05*log(1+(exp(1)-1)*t)
cbind(t, apt)
##         t         apt
##  [1,] 0.0 0.000000000
##  [2,] 0.1 0.007928254
##  [3,] 0.2 0.014769726
##  [4,] 0.3 0.020786761
##  [5,] 0.4 0.026156858
##  [6,] 0.5 0.031005725
##  [7,] 0.6 0.035425653
##  [8,] 0.7 0.039486402
##  [9,] 0.8 0.043241986
## [10,] 0.9 0.046735083
## [11,] 1.0 0.050000000

Figure 1 shows this pictorially.

par(mar=c(4,4,1,1))
plot(t, apt, 
xlab="Trial fraction", ylab="Alpha spent", 
type="l", col="blue")
abline(v=0.2, lty=2)
# Recalibrated spending function
apt1<-apt[2]+((0.05-apt[2])/(0.05-apt[3]))*(apt-apt[3])
lines(t[t>=0.2], apt1[t>=0.2], col="red")
<b>Figure 1.</b>Pocock spending function (blue) and recalibrated spending function (red) to catch up for slow initial spending.

Figure 1.Pocock spending function (blue) and recalibrated spending function (red) to catch up for slow initial spending.

Study monitoring using calendar time

Suppose you had designed the previous study based on the Pocock procedure. The basic design would be like this:

gsDesign(k=6, test.type=2, alpha=0.025, beta=0.1, sfu=sfLDPocock)
## Symmetric two-sided group sequential design with
## 90 % power and 2.5 % Type I Error.
## Spending computations assume trial stops
## if a bound is crossed.
## 
##            Sample
##             Size 
##   Analysis Ratio*  Z   Nominal p  Spend
##          1  0.200 2.50    0.0063 0.0063
##          2  0.401 2.48    0.0066 0.0050
##          3  0.601 2.45    0.0070 0.0042
##          4  0.802 2.44    0.0074 0.0036
##          5  1.002 2.42    0.0077 0.0031
##          6  1.203 2.41    0.0079 0.0028
##      Total                       0.0250 
## 
## ++ alpha spending:
##  Lan-DeMets Pocock approximation spending function.
## * Sample size ratio compared to fixed design with no interim
## 
## Boundary crossing probabilities and expected sample size
## assume any cross stops the trial
## 
## Upper boundary (power or Type I Error)
##           Analysis
##    Theta      1      2      3      4      5      6 Total   E{N}
##   0.0000 0.0063 0.0050 0.0042 0.0036 0.0031 0.0028 0.025 1.1727
##   3.2415 0.1483 0.2154 0.2018 0.1562 0.1084 0.0700 0.900 0.6756
## 
## Lower boundary (futility or Type II Error)
##           Analysis
##    Theta      1     2      3      4      5      6 Total
##   0.0000 0.0063 0.005 0.0042 0.0036 0.0031 0.0028 0.025
##   3.2415 0.0000 0.000 0.0000 0.0000 0.0000 0.0000 0.000

This design anticipates six analyses, including the final analysis, all performed at 1/6 trial fraction increments. Note also, the inflation of the sample size

##                
##   Analysis  N   Z   Nominal p  Spend 
##          1  21 2.50    0.0063 0.0063 
##          2  41 2.48    0.0066 0.0050 
##          3  61 2.45    0.0070 0.0042 
##          4  81 2.44    0.0074 0.0036 
##          5 101 2.42    0.0077 0.0031 
##          6 121 2.41    0.0079 0.0028 
##      Total                    0.0250  
## 

is over 20%.

Now let’s change the monitoring using calendar rather than event time. Since the maximum duration of the study was six years, at the end of the first year we are at \(s = 1/6 = 0.17\) of the total calendar time. Spending functions allow us to change teh monitoring schedule as follows: We could let \(t = s\) and use a spending function \(\alpha^*(s)\) which spends alpha based on \(s\) instead of \(t\). In the example above, at the end of the first year, the alpha to be spent should have been \(\alpha^*(1/6) = 0.013\) (using the Pocock alpha spending function). Then we can proceed identically as before by adjusting for this fraction. This is an attractive way to do monitoring since the total number of events at the end is not necessary to be known. However, there is a big drawback: Events will be likely slow at the beginning, so it may actually be easier to stop the trial early using calendar versus information time. Consequently, you would be overspending your alpha level and thus reducing your power. We saw previously, that designs which spend the alpha level earlier tend to be a bit underpowered (or, conversely, require slightly larger sample sizes to maintain the same power levels).

Interim monitoring with futility bounds

When monitoring a study we may also want to stop the trial if the probability of ever showing a statistically significant difference favoring the experimental arm is very small. This results in the construction of ``futility" bounds. To see how this is done, we must first understand the concept of beta spending. This is similar to alpha spending.
Beta spending is relevant in the construction of futility bounds because any errors there will impact the Type II error of the study (i.e., not rejecting the null hypothesis when it is not true).

One-sided study boundaries for futility

Suppose that we are designing a study with \(k = 5\) analyses, to be carried out at the \(\alpha = 0.05\) and with 90% power. Suppose also that we want to stop the study both if there is sufficient evidence to reject the null hypothesis as well as if there is evidence in favor of the null hypothesis (futility). Using spending functions \(\alpha(t) = \alpha t^\rho\) and \(\beta(t) = \beta t^\rho\) with \(\rho = 3\). Note that this is approximately the O’Brien-Fleming procedure} we obtain the following upper and lower boundaries, shown in Figure 2.

# One-sided with futility bounds
ss1<-plot(gsDesign(k=5, test.type=3, alpha=0.05, beta=0.1, 
sfu=sfPower, sfupar=3, sfl=sfPower, sflpar=3), main="")
ss1+theme(legend.position = "none")+theme(axis.text=element_text(size=10))+
theme(axis.title=element_text(size=15,face="bold"))
<b>Figure 2.</b> One-sided efficacy and futility bounds for the approximate O'Brien-Fleming procedure with k=5 equally spaced analyses.

Figure 2. One-sided efficacy and futility bounds for the approximate O’Brien-Fleming procedure with k=5 equally spaced analyses.

Two-sided futility bounds

We will carry out this analysis with SAS since gsDesign does not do analysis with a two-sided futility bound (inner wedge). The following is the code and resulting boundaries for an O’Brien-Fleming design with four equally spaced analyses at \(\alpha=0.05\) and \(\beta=0.1\). We use the SAS procedure seqdesign, which produces group sequential designs.

ods graphics on;
proc seqdesign 
     boundaryscale=stdz
        plots=boundary(hscale=samplesize);
     TwoSidedOBrienFleming: design method=OBF 
        alt=twosided stop=both nstages=4 alpha=0.05   beta=0.10;
     ods output Boundary=bound_ldl;
run;
ods graphics off;
<b>Figure 3. </b> O'Brien-Fleming design with $k=4$ analyses and inner wedge.

Figure 3. O’Brien-Fleming design with \(k=4\) analyses and inner wedge.

We used options boundaryscale=stdz so that the results are presented in the standardized z scale. The above code produces the results in Figure 3.

The study will continue as long as the standardized \(Z\) statistic falls within the white region and stop for efficacy (either in favor of the experimental treatment or for the standard treatment) in the blue areas (above and below respectively), or futility (i.e., inability to ever show a difference) if the statistic lands in the inner wedge on the right.