Chronogram assembly from SQL • chronogram

Introduction

Here we build a chronogram from an SQL database. We use an example derived from the de-identified PITCH dataset available here, under a CC-BY-4.0 licence: https://data.mendeley.com/datasets/fyp26zjgmj/1 Published: 3 November 2021, DOI: 10.17632/fyp26zjgmj.1

Contributors: Rebecca Payne, Susan Hopkins, Victoria Hall, Christina Dold, Christopher Duncan, Alex Richter, Miles Carroll, Gavin Screaton, Thushan de Silva, Lance Turtle, Paul Klenerman, Susanna Dunachie, PITCH Consortium authors. The dataset supports Payne et al. Immunogenicity of standard and extended dosing intervals of BNT162b2 mRNA vaccine Cell 2021

NOTE

Changes have been made to the de-identified public dataset (DOI: 10.17632/fyp26zjgmj.1).

the public data reports dates as MM/YYYY. To build a chronogram we have assigned randomly from 1-28 for DD in DD/MM/YYYY for dose 1, and used the available intervals in days to place the remaining data in date time. For illustration purposes, we need plausible DD/MM/YYYY dates - they are not real.

not all of the public dataset has been used in this example (some assays not included).

If your motivation is to re-analyse Payne et al. for scientific reasons, do not use the example dataset in this package. Instead consult the de-identified Mendeley data (DOI: 10.17632/fyp26zjgmj.1), or the Cell manuscript’s data and code availability statement.

library(chronogram)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
library(DBI)
library(RSQLite)
library(ggplot2)

Approach

Use dbplyr to extract the relevant tibbles from an existing SQL database. These extracted tibbles are assembled with the chronogram constructor functions.

Reminder

There are five required inputs:

a list of participant or study IDs

a start date

an end date

some metadata, stored against the identifier in #1.

Experimental data, stored against the identifier in #1 and a date of sample.

1-3 are straightforward, so we start with 4

Connect to SQL db

## Connect to the sql database ##
## here we use the shipped example database
db_file <- system.file(
  "extdata", "pitch-database-output.sqlite",
  package = "chronogram"
)

## R covers most types of SQL. Here, we use SQLite.
pitch_db <- DBI::dbConnect(RSQLite::SQLite(), db_file)

dbplyr::src_dbi(pitch_db)
#> src:  sqlite 3.46.0 [/home/runner/work/_temp/Library/chronogram/extdata/pitch-database-output.sqlite]
#> tbls: metadata, MSD_data, MSD.data, nAb_data, nAb.data, sqlite_stat1,
#>   sqlite_stat4

Get metadata

metadata <- tbl(
  pitch_db,
  ## give the name of the SQL table below ##
  "metadata"
)

head(metadata)
#> # Source:   SQL [6 x 6]
#> # Database: sqlite 3.46.0 [/home/runner/work/_temp/Library/chronogram/extdata/pitch-database-output.sqlite]
#>      ID Centre_code Previous_infection Vaccine_interval dose_1_date dose_2_date
#>   <int>       <int> <chr>              <chr>                  <dbl>       <dbl>
#> 1  4001           4 Y                  LONG                   18615       18699
#> 2  4002           4 Y                  SHORT                  18639       18670
#> 3  4003           4 Y                  LONG                   18597       18667
#> 4  4004           4 Y                  LONG                   18605       18676
#> 5  4005           4 Y                  LONG                   18602       18707
#> 6  4006           4 N                  LONG                   18628       18677

The above leaves the data itself in the SQL db, but routes it through R to print the first few rows. For very large databases, where significant up-front filtering can take place, then most tidyverse verbs work here. For example, let’s restrict the metadata participants from a single centre.

metadata_filt <- tbl(
  pitch_db,
  ## give the name of the SQL table below ##
  "metadata"
) %>%
  filter(Centre_code == 1)

head(metadata_filt)
#> # Source:   SQL [6 x 6]
#> # Database: sqlite 3.46.0 [/home/runner/work/_temp/Library/chronogram/extdata/pitch-database-output.sqlite]
#>      ID Centre_code Previous_infection Vaccine_interval dose_1_date dose_2_date
#>   <int>       <int> <chr>              <chr>                  <dbl>       <dbl>
#> 1  1001           1 Y                  LONG                   18639       18704
#> 2  1002           1 Y                  LONG                   18652       18723
#> 3  1003           1 Y                  LONG                   18616       18672
#> 4  1004           1 Y                  LONG                   18651       18722
#> 5  1005           1 Y                  LONG                   18634       18697
#> 6  1007           1 Y                  LONG                   18653       18716

See the dbplyr documentation for further info: https://dbplyr.tidyverse.org/index.html .

As our example is small, we want to move all the data into R, to do this use dbplyr::collect().

metadata <- tbl(
  pitch_db,
  ## give the name of the SQL table below ##
  "metadata"
) %>%
  collect()

metadata
#> # A tibble: 587 × 6
#>       ID Centre_code Previous_infection Vaccine_interval dose_1_date dose_2_date
#>    <int>       <int> <chr>              <chr>                  <dbl>       <dbl>
#>  1  4001           4 Y                  LONG                   18615       18699
#>  2  4002           4 Y                  SHORT                  18639       18670
#>  3  4003           4 Y                  LONG                   18597       18667
#>  4  4004           4 Y                  LONG                   18605       18676
#>  5  4005           4 Y                  LONG                   18602       18707
#>  6  4006           4 N                  LONG                   18628       18677
#>  7  4007           4 Y                  LONG                   18611       18682
#>  8  4008           4 Y                  LONG                   18602       18703
#>  9  4009           4 Y                  LONG                   18609       18680
#> 10  4010           4 Y                  LONG                   18614       18685
#> # ℹ 577 more rows

Clean metadata

The dose_1_date and dose_2_date columns have lost their DD-MM-YYYY formatting when the SQL db was constructed. Fix:

metadata <- metadata %>%
  mutate(across(ends_with("date"), ~
    as.Date(.x, origin = "1970-01-01")))

## show the relevant columns ##
metadata %>%
  select(ID, contains("date")) %>%
  head()
#> # A tibble: 6 × 3
#>      ID dose_1_date dose_2_date
#>   <int> <date>      <date>     
#> 1  4001 2020-12-19  2021-03-13 
#> 2  4002 2021-01-12  2021-02-12 
#> 3  4003 2020-12-01  2021-02-09 
#> 4  4004 2020-12-09  2021-02-18 
#> 5  4005 2020-12-06  2021-03-21 
#> 6  4006 2021-01-01  2021-02-19

Precisely what cleaning is needed will be study specific.

List of participant IDs

Extract this from the SQL’s master table (in our example, that’s metadata table).

studyId <- metadata$ID

Start and end date

We can use 1st Jan 2020 (i.e. pre-pandemic) as the start date, and the date of publication as the end date.

## provide dates as character vectors DDMMYYYY ##
start_date <- "01012020"
end_date <- "03112021"

Experimental data

As we have a small study, and want to import all into a chronogram format, we can `collect()` the relevant SQL table, fix formatting and then be ready for chronogram assembly.

For a large study where you only want a subset of the available SQL data, consider filtering to a smaller selection of participants, using a similar approach as for `metadata_filt` above.

For extremely large studies, several stages of chronogram assembly and annotation can be computationally expensive. Building a chronogram from massive studies (>40,000 participants, >2 years) fails on “lab issued” laptops. However, enough metadata can be pulled into R to select the participants relevant to a given analysis, so an analysis-relevant chronogram can be build. If the entirety of a massive study is required, the first option is to swap to a high-performance compute cluster (or a high end desktop). The second option, is to break the task into smaller components, and this is best performed by building and annotating chronograms for ~5000 individuals (so total number of rows < ~ 1,000,000), and writing each to disk with subsequent windowing, filtering, and selecting the chronograms in parallel. A third approach is to take the spirit of this package and push the computational work to the SQL server. This would likely require a solution specific to that study’s SQL flavour, assumes the server has compute capacity (and the user permissions to utilise), and supporting SQL to that degree (for example, tidyverse verbs are variably translated to SQL flavours) is currently beyond the scope of this package and our team. We welcome hearing from users about their approaches with larger studies.

Neutralisation data

## Collect the experimental data from SQL ##
nAb_data <- tbl(
  pitch_db,
  ## give the name of the SQL table below ##
  "nAb_data"
) %>%
  collect()


## Again, the date needs re-formatting ##
nAb_data %>%
  head()
#> # A tibble: 6 × 6
#>      ID calendar_date Victoria  Beta Gamma Delta
#>   <int>         <dbl>    <int> <int> <int> <chr>
#> 1   561         18810       10    10    46 10   
#> 2   568         18753       10    10    10 10   
#> 3   447         18680       92    10   132 10   
#> 4   557            NA      115    10   290 10   
#> 5   564         18754       10    10    24 10   
#> 6   581         18715       46    10    43 10

nAb_data <- nAb_data %>%
  mutate(across(ends_with("date"), ~
    as.Date(.x, origin = "1970-01-01")))

MSD data

## Collect the experimental data from SQL ##
MSD_data <- tbl(
  pitch_db,
  ## give the name of the SQL table below ##
  "MSD_data"
) %>%
  collect()


## Again, the date needs re-formatting ##
MSD_data %>%
  head()
#> # A tibble: 6 × 14
#>      ID Cov2_RBD_MSD Cov2_RBD_BAU Cov_2N_MSD Cov_2N_BAU Cov_2S_MSD Cov_2S_BAU
#>   <int>        <dbl>        <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#> 1     2       78191.        2127.      170.       0.402    164133.      1480.
#> 2     3        6694.         182.       52.7      0.124     14510.       131.
#> 3     5      559384.       15215.     4692.      11.1      912009.      8221.
#> 4     5       11600.         316.     5261.      12.4       28710.       259.
#> 5     5      310782.        8453.     2072.       4.89     531463.      4791.
#> 6     8       88388.        2404.     7323.      17.3      217099.      1957.
#> # ℹ 7 more variables: HKU1_S <dbl>, MERS_S <dbl>, NL63_S <dbl>, OC43_S <dbl>,
#> #   X229E_S <dbl>, Cov_1S <dbl>, calendar_date <dbl>

MSD_data <- MSD_data %>%
  mutate(across(ends_with("date"), ~
    as.Date(.x, origin = "1970-01-01")))

Assemble chronogram

Now we have the relevant objects in memory and cleaned, we can assemble a chronogram.

pitch_chronogram <- cg_assemble(
  start_date = start_date,
  end_date = end_date,
  metadata =  metadata,
  metadata_ids_col = ID,
  calendar_date_col = calendar_date)
#> Checking input parameters...
#> -- checking start date 01012020
#> -- checking end date 03112021
#> -- checking end date later than start date
#> -- checking metadata
#> --no experiment data provided. Add later: cg_add_experiment()
#> Input checks completed
#> Chronogram assembling...
#> -- chrongram_skeleton built
#> -- chrongram built with metadata
#> -- no experiment data provided
#> 
#> Assembly finished

pitch_chronogram <- cg_add_experiment(
  pitch_chronogram,
  nAb_data
)

pitch_chronogram <- cg_add_experiment(
  pitch_chronogram,
  MSD_data
)

pitch_chronogram
#> # A tibble:     395,051 × 23
#> # A chronogram: try summary()
#>    calendar_date ID    Centre_code Previous_infection Vaccine_interval
#>  * <date>        <fct>       <int> <chr>              <chr>           
#>  1 2020-01-01    100             0 Y                  LONG            
#>  2 2020-01-02    100             0 Y                  LONG            
#>  3 2020-01-03    100             0 Y                  LONG            
#>  4 2020-01-04    100             0 Y                  LONG            
#>  5 2020-01-05    100             0 Y                  LONG            
#>  6 2020-01-06    100             0 Y                  LONG            
#>  7 2020-01-07    100             0 Y                  LONG            
#>  8 2020-01-08    100             0 Y                  LONG            
#>  9 2020-01-09    100             0 Y                  LONG            
#> 10 2020-01-10    100             0 Y                  LONG            
#> # ℹ 395,041 more rows
#> # ℹ 18 more variables: dose_1_date <date>, dose_2_date <date>, Victoria <int>,
#> #   Beta <int>, Gamma <int>, Delta <chr>, Cov2_RBD_MSD <dbl>,
#> #   Cov2_RBD_BAU <dbl>, Cov_2N_MSD <dbl>, Cov_2N_BAU <dbl>, Cov_2S_MSD <dbl>,
#> #   Cov_2S_BAU <dbl>, HKU1_S <dbl>, MERS_S <dbl>, NL63_S <dbl>, OC43_S <dbl>,
#> #   X229E_S <dbl>, Cov_1S <dbl>
#> # ★ Dates: calendar_date      ★ IDs: ID
#> # ★ metadata: Centre_code, Previous_infection, Vaccine_interval, dose_1_date, dose_2_date

Annotate chronogram and analyse

Now that assembly is complete the next step is to annotate the chronogram (find and label infection episodes, count the number of exposures from vaccination, infection and in total). Check out the annotation vignettes for usage examples.

Let’s plot some results from the un-annotated PITCH dataset.

pitch_chronogram %>%
  filter(!is.na(Cov_2S_MSD)) %>%
  ggplot(aes(x = calendar_date, y = Cov_2S_MSD)) +
  geom_point(shape = 20, alpha = 0.4) +
  scale_y_log10() +
  facet_grid(. ~ Previous_infection + Vaccine_interval,
    labeller = "label_both"
  ) +
  theme_bw() +
  labs(main = "SARS-CoV-2 Spike binding Ab")


## -----
pitch_chronogram %>%
  filter(!is.na(OC43_S)) %>%
  ggplot(aes(x = calendar_date, y = OC43_S)) +
  geom_point(shape = 20, alpha = 0.4) +
  scale_y_log10() +
  facet_grid(. ~ Previous_infection + Vaccine_interval,
    labeller = "label_both"
  ) +
  theme_bw() +
  labs(main = "OC43 [a common coronavirus] Spike binding Ab")

As expected - SARS-CoV-2 binding IgG is boosted by Covid vaccination, whereas binding IgG against OC43’s spike is not boosted by Covid vaccination.

NOTE

As highlighted earlier, changes have been made to the de-identified public dataset (DOI: 10.17632/fyp26zjgmj.1).

If your motivation is to re-analyse Payne et al. for scientific reasons, do not use the example dataset in this package. Instead consult the de-identified Mendeley data (DOI: 10.17632/fyp26zjgmj.1), and the data and code availability statement of Payne et al. Immunogenicity of standard and extended dosing intervals of BNT162b2 mRNA vaccine Cell 2021.

Summary

This vignette shows a stepwise approach to move data from SQL to chronogram. Whilst the clean-up code is study/database specific, it is re-usable throughout that study. We suggest combining the equivalent code for your study into an Rscript - or series of Rscripts - to run non-interactively, for example whilst you’re off for coffee or as a scheduled task.

SessionInfo

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.5.1    RSQLite_2.3.7    DBI_1.2.3        dbplyr_2.5.0    
#> [5] dplyr_1.1.4      chronogram_1.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyr_1.3.1       sass_0.4.9        utf8_1.2.4        generics_0.1.3   
#>  [5] digest_0.6.37     magrittr_2.0.3    timechange_0.3.0  evaluate_1.0.1   
#>  [9] grid_4.4.2        fastmap_1.2.0     blob_1.2.4        jsonlite_1.8.9   
#> [13] purrr_1.0.2       fansi_1.0.6       scales_1.3.0      textshaping_0.4.0
#> [17] jquerylib_0.1.4   cli_3.6.3         rlang_1.1.4       bit64_4.5.2      
#> [21] munsell_0.5.1     withr_3.0.2       cachem_1.1.0      yaml_2.3.10      
#> [25] tools_4.4.2       memoise_2.0.1     colorspace_2.1-1  vctrs_0.6.5      
#> [29] R6_2.5.1          lifecycle_1.0.4   lubridate_1.9.3   fs_1.6.5         
#> [33] bit_4.5.0         ragg_1.3.3        pkgconfig_2.0.3   desc_1.4.3       
#> [37] pkgdown_2.1.1     pillar_1.9.0      bslib_0.8.0       gtable_0.3.6     
#> [41] glue_1.8.0        systemfonts_1.1.0 highr_0.11        xfun_0.49        
#> [45] tibble_3.2.1      tidyselect_1.2.1  knitr_1.48        farver_2.1.2     
#> [49] htmltools_0.5.8.1 rmarkdown_2.29    compiler_4.4.2