Skip to contents

Introduction

In this vignette, we explore how to prepare chronogram data for statistical testing. We take a chronogram and perform filtering, windowing and selecting prior to an illustrative plot and a statistical test.

NOTE

This vignette does not cover when each statistical test is appropriate, nor the R functions (or arguments) for each particular statistical test.

Libraries

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(rstatix) # pipe-friendly stats functions
#> 
#> Attaching package: 'rstatix'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(broom) # tidy handling of lm() output
library(ggrepel) # repelled labels in ggplot2
#> Loading required package: ggplot2
library(survival) # time-to-event analysis
library(survminer) # ggplot2 from time-to-event
#> Loading required package: ggpubr
#> 
#> Attaching package: 'survminer'
#> The following object is masked from 'package:survival':
#> 
#>     myeloma
library(chronogram)

Wilcoxon test

  • Hypothesis

    For individuals without prior exposure, a long interval (8-12w) between doses 1 and 2 of an mRNA vaccine will have higher antibody levels 3-5w after dose 2, than individuals following a short interval (3-4w).

  • response variable

    MSD SARS-CoV-2 Spike antibody (CoV_2S_MSD).

  • categorical predictor

    LONG or SHORT interval (Vaccine_interval in metadata).

  • data

    PITCH, introduced and assembled in the SQL vignette.

data("pitch_chronogram")

## filter, select, window ##
windowed_data <- pitch_chronogram %>%
  filter(Previous_infection == "N") %>%
  cg_window_by_metadata(dose_2_date, 
                        preceding_days = -21, 
                        following_days = 35) %>%
  cg_select_visit(visit_col = Cov_2S_BAU,
                  visit = "latest")
## plot ##
windowed_data %>%
  ggplot(aes(x = Vaccine_interval, y = Cov_2S_MSD)) +
  geom_point(shape = 20, alpha = 0.4,
             position = position_jitter(0.3)) +
  stat_summary(
    geom = "point",
    fun = "median",
    shape = 5,
    size = 2,
    stroke = 1.5,
    col = "dodgerblue") +
  theme_bw()


## stats ##
windowed_data %>%
  wilcox_test(Cov_2S_MSD ~ Vaccine_interval) ## rstatix::wilcox_test
#> # A tibble: 1 × 7
#>   .y.        group1 group2    n1    n2 statistic       p
#> * <chr>      <chr>  <chr>  <int> <int>     <dbl>   <dbl>
#> 1 Cov_2S_MSD LONG   SHORT    124    37      3117 0.00095

Linear regression

  • Hypothesis

    SARS-CoV-2 Spike antibody is dependent on vaccine interval, previous infection, and not on study centre.

  • response variable

    MSD SARS-CoV-2 Spike (CoV_2S_MSD).

  • predictors

    • LONG or SHORT interval (Vaccine_interval in metadata).

    • study centre (Centre_code in metadata).

    • Previous infection (Y/N in Previous_infection in metadata).

  • data

    PITCH, introduced and assembled in the SQL vignette.

data("pitch_chronogram")

## filter, select, window ##
windowed_data <- pitch_chronogram %>%
  cg_window_by_metadata(dose_2_date, 
                        preceding_days = -21, 
                        following_days = 35) %>%
  cg_select_visit(visit_col = Cov_2S_BAU,
                  visit = "latest")
## plot ##
windowed_data %>%
  ggplot(aes(x = Centre_code, y = Cov_2S_MSD)) +
  geom_point(shape = 20, alpha = 0.4,
             position = position_jitter(0.3)) +
  stat_summary(
    geom = "point",
    fun = "median",
    shape = 5,
    size = 2,
    stroke = 1.5,
    col = "dodgerblue") +
  facet_grid(. ~Vaccine_interval + Previous_infection ) +
  theme_bw()


## stats ##
lm_res <- windowed_data %>%
  lm(Cov_2S_MSD ~ 
       Vaccine_interval + 
       Previous_infection + 
       Centre_code, data = .)

## results from lm ##
summary(lm_res)
#> 
#> Call:
#> lm(formula = Cov_2S_MSD ~ Vaccine_interval + Previous_infection + 
#>     Centre_code, data = .)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -435977 -187858  -77071   92963 1410750 
#> 
#> Coefficients:
#>                       Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)             273547      32450   8.430 1.41e-15 ***
#> Vaccine_intervalSHORT   -21186      47969  -0.442    0.659    
#> Previous_infectionY     152471      33969   4.489 1.02e-05 ***
#> Centre_code              21256      12996   1.636    0.103    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 290300 on 304 degrees of freedom
#> Multiple R-squared:  0.08457,    Adjusted R-squared:  0.07554 
#> F-statistic: 9.362 on 3 and 304 DF,  p-value: 6.161e-06

## easier to read, and plot from broom::tidy ##
tidy(lm_res)
#> # A tibble: 4 × 5
#>   term                  estimate std.error statistic  p.value
#>   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)            273547.    32450.     8.43  1.41e-15
#> 2 Vaccine_intervalSHORT  -21186.    47969.    -0.442 6.59e- 1
#> 3 Previous_infectionY    152471.    33969.     4.49  1.02e- 5
#> 4 Centre_code             21256.    12996.     1.64  1.03e- 1

tidy(lm_res) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x=estimate, y = -log10(p.value),
         label = term)) +
  geom_point() +
  geom_text_repel() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = 3)


Linear regression with interaction terms

Setup as for Linear regression example above.

Since nearly all of the SHORT interval data were from a single centre, we add an interaction term between centre and interval.

## stats - lm with interaction terms ##
## note the *
lm_res_interactions <- windowed_data %>%
  lm(Cov_2S_MSD ~ 
       Previous_infection +
       Vaccine_interval * 
       Centre_code, data = .)

## results from lm ##
summary(lm_res_interactions )
#> 
#> Call:
#> lm(formula = Cov_2S_MSD ~ Previous_infection + Vaccine_interval * 
#>     Centre_code, data = .)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -681468 -176802  -69633   98854 1444423 
#> 
#> Coefficients:
#>                                   Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                         314364      33824   9.294  < 2e-16 ***
#> Previous_infectionY                 120937      34477   3.508 0.000520 ***
#> Vaccine_intervalSHORT              -145061      58443  -2.482 0.013603 *  
#> Centre_code                           6937      13365   0.519 0.604099    
#> Vaccine_intervalSHORT:Centre_code   157812      44140   3.575 0.000407 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 284800 on 303 degrees of freedom
#> Multiple R-squared:  0.1216, Adjusted R-squared:   0.11 
#> F-statistic: 10.49 on 4 and 303 DF,  p-value: 5.699e-08

## easier to read, and plot from broom::tidy ##
tidy(lm_res_interactions )
#> # A tibble: 5 × 5
#>   term                              estimate std.error statistic  p.value
#>   <chr>                                <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)                        314364.    33824.     9.29  3.02e-18
#> 2 Previous_infectionY                120937.    34477.     3.51  5.20e- 4
#> 3 Vaccine_intervalSHORT             -145061.    58443.    -2.48  1.36e- 2
#> 4 Centre_code                          6937.    13365.     0.519 6.04e- 1
#> 5 Vaccine_intervalSHORT:Centre_code  157812.    44140.     3.58  4.07e- 4

tidy(lm_res_interactions ) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x=estimate, y = -log10(p.value),
         label = term)) +
  geom_point() +
  geom_text_repel() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = 3)

The model Cov_2S_MSD ~ Previous_infection + Vaccine_interval * Centre_code, with an interaction term between interval and centre performs reasonably, with a neglible effect assigned to centre alone, and larger effects assigned to vaccine interval and previous infection.

By ANOVA, Cov_2S_MSD ~ Previous_infection + Vaccine_interval * Centre_code performs better than Cov_2S_MSD ~ Previous_infection + Vaccine_interval + Centre_code:

anova(lm_res, 
      lm_res_interactions)
#> Analysis of Variance Table
#> 
#> Model 1: Cov_2S_MSD ~ Vaccine_interval + Previous_infection + Centre_code
#> Model 2: Cov_2S_MSD ~ Previous_infection + Vaccine_interval * Centre_code
#>   Res.Df        RSS Df  Sum of Sq      F   Pr(>F)    
#> 1    304 2.5611e+13                                  
#> 2    303 2.4575e+13  1 1.0367e+12 12.783 0.000407 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Either model describes ~10% of the total variance in Spike antibodies:


bind_rows(
  # this is broom::glance() : #
  glance(lm_res),
  glance(lm_res_interactions)
)
#> # A tibble: 2 × 12
#>   r.squared adj.r.squared   sigma statistic     p.value    df logLik   AIC   BIC
#>       <dbl>         <dbl>   <dbl>     <dbl>       <dbl> <dbl>  <dbl> <dbl> <dbl>
#> 1    0.0846        0.0755 290255.      9.36     6.16e-6     3 -4309. 8628. 8647.
#> 2    0.122         0.110  284788.     10.5      5.70e-8     4 -4303. 8618. 8640.
#> # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Time to event analysis

  • Hypothesis

    Within centre 1, dose 2 (“event”) occurs earlier in individuals without prior infection.

  • Start

    date of dose 1

  • Event

    date of dose 2

  • Censoring

    none (all of PITCH have both doses)

  • Filtering

    to centre 0, as that centre has an approximately equal split of long:short interval.

NOTE

Changes have been made to the de-identified public dataset (DOI: 10.17632/fyp26zjgmj.1).

  1. the public data reports dates as MM/YYYY. To build a chronogram we have assigned randomly from 1-28 for DD in DD/MM/YYYY for dose 1, and used the available intervals in days to place the remaining data in date time. For illustration purposes, we need plausible DD/MM/YYYY dates - they are not real.

  2. not all of the public dataset has been used in this example (some assays not included).

For meaningful time to event analysis, real rather than plausible dates are needed. This vignette shows example code, but the results themselves are insecure.

data("pitch_chronogram")

## filter, de-duplicate ##
t_to_event_data <- pitch_chronogram %>%
  filter(Centre_code == 0) %>%
  group_by(ID) %>%
  slice_head() %>%
  ungroup() %>%
  ## all receive dose 2 ##
  mutate(outcome = 1) %>%
  mutate(dose_2_crude = 
            as.numeric(
           dose_2_date - (dose_1_date)))

fit<- survfit(Surv(time = dose_2_crude,
                   # time2 = dose_2_d,
                   outcome) ~ Previous_infection, data = t_to_event_data)

ggsurvplot(fit,
                      conf.int = TRUE,
                      data = t_to_event_data,
                      fun = "event",
                      risk.table = TRUE,
                      pval = TRUE,
                      pval.coord = c(0.1, 0.9),
                      pval.method = TRUE,
                      pval.method.coord = c(0.1, 1))

Left-censored time to event analysis

Here we use left censoring, such that a given individual is only considered at-risk for the event (dose 2) once they have had their first dose.

In the real world, left censoring is useful if the risk of exposure might change over calendar time. In this example dataset, it highlights the use of plausible (DD randomly assigned to MM-YYYY), rather than real dates.

data("pitch_chronogram")

## filter, de-duplicate ##
t2_to_event_data <- pitch_chronogram %>%
  filter(Centre_code == 0) %>%
  group_by(ID) %>%
  slice_head() %>%
  ungroup() %>%
  ## all receive dose 2 ##
  mutate(outcome = 1) %>%
  ## use the earliest date of dose 1 to anchor other times ##
  mutate(dose_1_d = 
            as.numeric(
           dose_1_date - min(dose_1_date))) %>%
  mutate(dose_2_d = 
            as.numeric(
           dose_2_date - min(dose_1_date)))


fit2 <- survfit(Surv(time = dose_1_d,
                    time2 = dose_2_d,
                   outcome) ~ Previous_infection, data = t2_to_event_data)



ggsurvplot(
  fit2,
  conf.int = TRUE,
  data = t2_to_event_data,
  fun = "event",
  risk.table = TRUE,
  ## calcuate Cox PH P value ##
  pval = 
    round(
    summary(
    coxph(Surv(time = dose_1_d,
               time2 = dose_2_d,
               outcome) ~ Previous_infection, 
          data = t2_to_event_data)
    )$coefficients[,"Pr(>|z|)"],
    digits = 3),
  pval.coord = c(0.1, 0.9)
  # pval.method.coord = c(0.1, 1)
)

Note:

  1. now the x axis runs to over 5 months as it covers the ~2m period between the earliest and latest first dose.

  2. the numbers at risk can increase between time points as more individuals become eligible (here, by receiving dose 1).

The difference between strata is now much less apparent. This is likely due to our plausible rather than real dates in the example dataset.


Summary

This vignette shows code to map data from a chronogram to a variety of different statistical tests.


SessionInfo

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] chronogram_1.0.0 survminer_0.5.0  ggpubr_0.6.0     survival_3.7-0  
#> [5] ggrepel_0.9.6    ggplot2_3.5.1    broom_1.0.7      rstatix_0.7.2   
#> [9] dplyr_1.1.4     
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6      xfun_0.49         bslib_0.8.0       lattice_0.22-6   
#>  [5] vctrs_0.6.5       tools_4.4.2       generics_0.1.3    tibble_3.2.1     
#>  [9] fansi_1.0.6       highr_0.11        pkgconfig_2.0.3   Matrix_1.7-1     
#> [13] data.table_1.16.2 desc_1.4.3        lifecycle_1.0.4   stringr_1.5.1    
#> [17] compiler_4.4.2    farver_2.1.2      textshaping_0.4.0 munsell_0.5.1    
#> [21] carData_3.0-5     htmltools_0.5.8.1 sass_0.4.9        yaml_2.3.10      
#> [25] Formula_1.2-5     pillar_1.9.0      pkgdown_2.1.1     car_3.1-3        
#> [29] jquerylib_0.1.4   tidyr_1.3.1       cachem_1.1.0      abind_1.4-8      
#> [33] km.ci_0.5-6       commonmark_1.9.2  tidyselect_1.2.1  digest_0.6.37    
#> [37] stringi_1.8.4     purrr_1.0.2       labeling_0.4.3    splines_4.4.2    
#> [41] fastmap_1.2.0     grid_4.4.2        colorspace_2.1-1  cli_3.6.3        
#> [45] magrittr_2.0.3    utf8_1.2.4        withr_3.0.2       scales_1.3.0     
#> [49] backports_1.5.0   rmarkdown_2.29    ggtext_0.1.2      gridExtra_2.3    
#> [53] ggsignif_0.6.4    ragg_1.3.3        zoo_1.8-12        evaluate_1.0.1   
#> [57] knitr_1.48        KMsurv_0.1-5      markdown_1.13     survMisc_0.5.6   
#> [61] rlang_1.1.4       gridtext_0.1.5    Rcpp_1.0.13-1     xtable_1.8-4     
#> [65] glue_1.8.0        xml2_1.3.6        jsonlite_1.8.9    R6_2.5.1         
#> [69] systemfonts_1.1.0 fs_1.6.5