Introducing the chronogram class

The chronogram class

The chronogram class extends the tibble class and works with dplyr verbs. The chronogram class:

asserts that each combination of date and participant ID appears in at most one row.
uses attribute slots to store the names of the columns containing the index dates, participant IDs, and metadata. Other chronogram functions read these column names from the attributes.
uses the `pillar` package to customise the printing of a chronogram.
has a related grouped chronogram class, which allows group_by() and subsequent tidyverse verbs to work as you would expect.
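
The first assertion can be illustrated without the package. The following is a minimal base R sketch, not the package's implementation: the helper `has_unique_date_id()` and the toy data frames are hypothetical, and the column names are borrowed from the example used later in this vignette.

```r
## Hypothetical helper (not part of chronogram): TRUE when no
## date x participant ID combination occupies more than one row.
has_unique_date_id <- function(df,
                               date_col = "calendar_date",
                               id_col = "elig_study_id") {
  anyDuplicated(df[, c(date_col, id_col)]) == 0
}

## toy data: 2 participants x 3 days, one row per combination
toy <- expand.grid(
  calendar_date = seq(as.Date("2020-01-01"), by = "day", length.out = 3),
  elig_study_id = factor(1:2)
)

has_unique_date_id(toy)
#> [1] TRUE

## appending a copy of the first row breaks the rule
toy_broken <- rbind(toy, toy[1, ])
has_unique_date_id(toy_broken)
#> [1] FALSE
```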

library(chronogram)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(tidyr)
library(knitr)

Differences between chronogram and tibble classes

We will use a simple example dataset to illustrate these differences.


## a 3-month window for this particular chronogram ##
start_date <- "01012020"
end_date <- "01042020"

## load example metadata ##
data("smallstudy")
metadata <- smallstudy$small_study_metadata

knitr::kable(metadata)

elig_study_id	age	sex	dose_1	date_dose_1	dose_2	date_dose_2
1	40	F	AZD1222	2021-01-05	AZD1222	2021-02-05
2	45	F	BNT162b2	2021-01-05	BNT162b2	2021-02-05
3	35	M	BNT162b2	2021-01-10	BNT162b2	2021-03-10

We next create a chronogram (class chronogram), and a chronogram-like object (class tibble).

Using the chronogram assembly:

cg <- chronogram::cg_assemble(
  start_date = start_date,
  end_date = end_date,
  metadata = metadata,
  metadata_ids_col = elig_study_id,
  calendar_date_col = calendar_date
)
#> Checking input parameters...
#> -- checking start date 01012020
#> -- checking end date 01042020
#> -- checking end date later than start date
#> -- checking metadata
#> --no experiment data provided. Add later: cg_add_experiment()
#> Input checks completed
#> Chronogram assembling...
#> -- chrongram_skeleton built
#> -- chrongram built with metadata
#> -- no experiment data provided
#> 
#> Assembly finished

cg
#> # A tibble:     276 × 8
#> # A chronogram: try summary()
#>    calendar_date elig_study_id   age sex   dose_1 date_dose_1 dose_2 date_dose_2
#>  * <date>        <fct>         <dbl> <fct> <fct>  <date>      <fct>  <date>     
#>  1 2020-01-01    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  2 2020-01-02    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  3 2020-01-03    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  4 2020-01-04    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  5 2020-01-05    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  6 2020-01-06    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  7 2020-01-07    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  8 2020-01-08    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  9 2020-01-09    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#> 10 2020-01-10    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#> # ℹ 266 more rows
#> # ★ Dates: calendar_date      ★ IDs: elig_study_id
#> # ★ metadata: age, sex, dose_1, date_dose_1, dose_2, date_dose_2

Using tibble-based assembly:


cg_tibble <- 
  tidyr::crossing(
    calendar_date = seq.Date(
      lubridate::dmy(start_date),
      lubridate::dmy(end_date),
      by = 1),
    elig_study_id = metadata$elig_study_id) %>%
  left_join(metadata)
#> Joining with `by = join_by(elig_study_id)`

cg_tibble
#> # A tibble: 276 × 8
#>    calendar_date elig_study_id   age sex   dose_1 date_dose_1 dose_2 date_dose_2
#>    <date>                <dbl> <dbl> <fct> <fct>  <date>      <fct>  <date>     
#>  1 2020-01-01                1    40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  2 2020-01-01                2    45 F     BNT16… 2021-01-05  BNT16… 2021-02-05 
#>  3 2020-01-01                3    35 M     BNT16… 2021-01-10  BNT16… 2021-03-10 
#>  4 2020-01-02                1    40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  5 2020-01-02                2    45 F     BNT16… 2021-01-05  BNT16… 2021-02-05 
#>  6 2020-01-02                3    35 M     BNT16… 2021-01-10  BNT16… 2021-03-10 
#>  7 2020-01-03                1    40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  8 2020-01-03                2    45 F     BNT16… 2021-01-05  BNT16… 2021-02-05 
#>  9 2020-01-03                3    35 M     BNT16… 2021-01-10  BNT16… 2021-03-10 
#> 10 2020-01-04                1    40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#> # ℹ 266 more rows

Whilst these classes are different, the data they contain are identical once arranged equivalently:


all(
  cg_tibble %>% 
  group_by(elig_study_id) %>%
  arrange(calendar_date, .by_group = TRUE) ==
  
    cg %>% as_tibble()
)
#> [1] TRUE

Although the setup code is very similar for both approaches, only the chronogram assembly checks the input data.

## create example metadata, with a duplicated row ##
metadata_duplicated_row <-
  dplyr::bind_rows(
    metadata,
    metadata %>% slice_tail()
  )

knitr::kable(metadata_duplicated_row)

elig_study_id	age	sex	dose_1	date_dose_1	dose_2	date_dose_2
1	40	F	AZD1222	2021-01-05	AZD1222	2021-02-05
2	45	F	BNT162b2	2021-01-05	BNT162b2	2021-02-05
3	35	M	BNT162b2	2021-01-10	BNT162b2	2021-03-10
3	35	M	BNT162b2	2021-01-10	BNT162b2	2021-03-10

Using the chronogram assembly:

cg_fail <- try(
  chronogram::cg_assemble(
  start_date = start_date,
  end_date = end_date,
  ## use the new metadata ##
  metadata = metadata_duplicated_row,
  metadata_ids_col = elig_study_id,
  calendar_date_col = calendar_date
)
)
#> Checking input parameters...
#> -- checking start date 01012020
#> -- checking end date 01042020
#> -- checking end date later than start date
#> -- checking metadata
#> --no experiment data provided. Add later: cg_add_experiment()
#> Input checks completed
#> Chronogram assembling...
#> Warning in chronogram_skeleton(start_date = start_date, end_date = end_date, : Duplicate IDs found :  3 Check your input ID vector...
#>       provide a unique identifier for
#>       each individual
#> Returning a de-duplicated chronogram_skeleton...
#> -- chrongram_skeleton built
#> Error in check_metadata(x = metadata, cg_skeleton = cg_skeleton) : 
#>   { .... is not TRUE

cg_fail
#> [1] "Error in check_metadata(x = metadata, cg_skeleton = cg_skeleton) : \n  { .... is not TRUE\n"
#> attr(,"class")
#> [1] "try-error"
#> attr(,"condition")
#> <simpleError in check_metadata(x = metadata, cg_skeleton = cg_skeleton): { .... is not TRUE>

Using tibble-based assembly:


cg_tibble <- 
  tidyr::crossing(
    calendar_date = seq.Date(
      lubridate::dmy(start_date),
      lubridate::dmy(end_date),
      by = 1),
    elig_study_id = metadata_duplicated_row$elig_study_id) %>%
  left_join(metadata_duplicated_row)
#> Joining with `by = join_by(elig_study_id)`
#> Warning in left_join(., metadata_duplicated_row): Detected an unexpected many-to-many relationship between `x` and `y`.
#> ℹ Row 3 of `x` matches multiple rows in `y`.
#> ℹ Row 1 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set `relationship =
#>   "many-to-many"` to silence this warning.

cg_tibble
#> # A tibble: 368 × 8
#>    calendar_date elig_study_id   age sex   dose_1 date_dose_1 dose_2 date_dose_2
#>    <date>                <dbl> <dbl> <fct> <fct>  <date>      <fct>  <date>     
#>  1 2020-01-01                1    40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  2 2020-01-01                2    45 F     BNT16… 2021-01-05  BNT16… 2021-02-05 
#>  3 2020-01-01                3    35 M     BNT16… 2021-01-10  BNT16… 2021-03-10 
#>  4 2020-01-01                3    35 M     BNT16… 2021-01-10  BNT16… 2021-03-10 
#>  5 2020-01-02                1    40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  6 2020-01-02                2    45 F     BNT16… 2021-01-05  BNT16… 2021-02-05 
#>  7 2020-01-02                3    35 M     BNT16… 2021-01-10  BNT16… 2021-03-10 
#>  8 2020-01-02                3    35 M     BNT16… 2021-01-10  BNT16… 2021-03-10 
#>  9 2020-01-03                1    40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#> 10 2020-01-03                2    45 F     BNT16… 2021-01-05  BNT16… 2021-02-05 
#> # ℹ 358 more rows
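
The tibble-based route gained 92 rows (368 rather than 276) and signalled only a warning. A small guard before the join makes the problem explicit. This is a base R sketch under stated assumptions: `check_unique_ids()` is a hypothetical helper, not a chronogram function, and the toy metadata mirrors only the ID and age columns of the example above.

```r
## Hypothetical helper: stop if any participant ID is repeated.
check_unique_ids <- function(metadata, id_col = "elig_study_id") {
  ids <- metadata[[id_col]]
  dup_ids <- unique(ids[duplicated(ids)])
  if (length(dup_ids) > 0) {
    stop("Duplicated IDs in metadata: ", paste(dup_ids, collapse = ", "))
  }
  invisible(TRUE)
}

toy_metadata <- data.frame(elig_study_id = c(1, 2, 3), age = c(40, 45, 35))
toy_metadata_dup <- rbind(toy_metadata, toy_metadata[3, ])

## unique IDs: passes silently
check_unique_ids(toy_metadata)

## duplicated ID 3: the helper raises an error
res <- try(check_unique_ids(toy_metadata_dup), silent = TRUE)
inherits(res, "try-error")
#> [1] TRUE
```

With dplyr 1.1.0 or later, passing `relationship = "many-to-one"` to `left_join()` should be another way to turn the unexpected match into an error rather than a warning.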

Validating a chronogram

chronogram::validate_chronogram() checks that a chronogram is valid.

As expected, the tibble version fails:


try(
  validate_chronogram(cg_tibble) ## fails
)
#> Error in validate_chronogram(cg_tibble) : 
#>   Invalid chronogram: Wrong class. Create with 'chronogram()'

validate_chronogram(cg) ## returns TRUE
#> [1] TRUE

We can intentionally break the row “rule” that each combination of date and participant ID appears once, or not at all.

Example 1

## break "rule" one way ##------------------------------------------------------------
not_a_cg <- bind_rows(cg, 
                      cg)

try(
  validate_chronogram(not_a_cg)
)
#> Error in validate_chronogram(not_a_cg) : 
#>   Invalid chronogram: Dates are duplicated for IDs

Example 2


## break "rule" a different way ##----------------------------------------------------
## this extra data will create 2 rows for ID==1 on 1st Jan 2020 ##
data_to_join <- tibble::tribble(
  ~calendar_date, ~elig_study_id, ~new_info,
  "01012020",     1,              "a",
  "01012020",     1,              "b"
)

data_to_join <- data_to_join %>%
  mutate(calendar_date = lubridate::dmy(calendar_date)) %>%
  mutate(elig_study_id = factor(elig_study_id))

## using the provided method, cg_add_experiment()
## errors, with an appropriate message
try(
  cg_add_experiment(cg, data_to_join)
)
#> Error in check_experiment(x = experiment, cg = cg) : 
#>   Invalid experiment: date:ids are duplicated

## we can use a dplyr::join ##
still_not_a_cg <- right_join(data_to_join, cg)
#> Joining with `by = join_by(calendar_date, elig_study_id)`
still_not_a_cg_left <- left_join(cg, data_to_join)
#> Joining with `by = join_by(calendar_date, elig_study_id)`
## each row of data_to_join matches only one row of cg, so the join is
## not many-to-many and dplyr emits no warning;
## an extra row is gained:
nrow(still_not_a_cg)
#> [1] 277
nrow(still_not_a_cg_left)
#> [1] 277
nrow(cg)
#> [1] 276

try(
  validate_chronogram(still_not_a_cg)
)
#> Error in validate_chronogram(still_not_a_cg) : 
#>   Invalid chronogram: Wrong class. Create with 'chronogram()'
class(still_not_a_cg)
#> [1] "tbl_df"     "tbl"        "data.frame"

try(
  validate_chronogram(still_not_a_cg_left)
)
#> Error in validate_chronogram(still_not_a_cg_left) : 
#>   Invalid chronogram: Dates are duplicated for IDs
class(still_not_a_cg_left) # left_join inherits class from first argument
#> [1] "cg_tbl"     "tbl_df"     "tbl"        "data.frame"

Chronogram assembly and annotation functions use validate_chronogram() to maintain the row “rule” where each combination of date x participant ID can be present once, or absent.
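
The silent row gain in Example 2 can also be caught by comparing row counts before and after a join. The following base R sketch uses `merge()` in place of the dplyr joins above; `toy_cg` and `toy_extra` are hypothetical stand-ins for `cg` and `data_to_join`.

```r
## toy stand-in for a chronogram: 1 participant x 3 days
toy_cg <- data.frame(
  calendar_date = as.Date("2020-01-01") + 0:2,
  elig_study_id = factor(1)
)

## two rows for the same date x ID, as in Example 2
toy_extra <- data.frame(
  calendar_date = as.Date(c("2020-01-01", "2020-01-01")),
  elig_study_id = factor(c(1, 1)),
  new_info = c("a", "b")
)

## all.x = TRUE keeps every row of toy_cg (a left join);
## merge() joins on the two shared columns
joined <- merge(toy_cg, toy_extra, all.x = TRUE)

## one extra row has appeared
nrow(joined) - nrow(toy_cg)
#> [1] 1

## locate the repeated date x ID pairs in the data to be joined
key_cols <- c("calendar_date", "elig_study_id")
toy_extra[duplicated(toy_extra[, key_cols]), key_cols]
```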

Chronogram attributes

To spare the user from repeatedly specifying the names of the columns that store participant IDs and calendar dates, these names are stored in attribute slots. Chronogram functions read these slots to find the relevant columns for per-date and per-individual operations. The user can explore these attributes:


attribs <- attributes(cg)

## attribs is a named list:
class(attribs)
#> [1] "list"
names(attribs)
#> [1] "names"             "row.names"         "class"            
#> [4] "col_calendar_date" "col_ids"           "cg_pkg_version"   
#> [7] "windowed"          "cols_metadata"

## print those attributes with names starting "col"
attribs[grepl("^col", names(attribs))]
#> $col_calendar_date
#> [1] "calendar_date"
#> 
#> $col_ids
#> [1] "elig_study_id"
#> 
#> $cols_metadata
#> [1] "age"         "sex"         "dose_1"      "date_dose_1" "dose_2"     
#> [6] "date_dose_2"

Or, more simply:


summary(cg)
#> A chronogram:
#>  Dates column:  calendar_date 
#>  IDs column:    elig_study_id 
#>  Starts on:     2020-01-01 
#>  Ends on:       2020-04-01 
#>  Contains:      3  unique participant IDs
#>  Windowed:      FALSE 
#>  Spanning:      92 - 92 days [min-max per participant]
#>  Metadata:      age, sex, dose_1, date_dose_1, dose_2, date_dose_2 
#>  Size:          17.82 kB 
#>  Pkg_version:   1.0.0 [used to build this cg]

print(cg)
#> # A tibble:     276 × 8
#> # A chronogram: try summary()
#>    calendar_date elig_study_id   age sex   dose_1 date_dose_1 dose_2 date_dose_2
#>  * <date>        <fct>         <dbl> <fct> <fct>  <date>      <fct>  <date>     
#>  1 2020-01-01    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  2 2020-01-02    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  3 2020-01-03    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  4 2020-01-04    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  5 2020-01-05    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  6 2020-01-06    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  7 2020-01-07    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  8 2020-01-08    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#>  9 2020-01-09    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#> 10 2020-01-10    1                40 F     AZD12… 2021-01-05  AZD12… 2021-02-05 
#> # ℹ 266 more rows
#> # ★ Dates: calendar_date      ★ IDs: elig_study_id
#> # ★ metadata: age, sex, dose_1, date_dose_1, dose_2, date_dose_2

glimpse(cg) ## no experiment data added to this chronogram
#> Glimpse: chronogram 
#>  Dates column:  calendar_date 
#>  IDs column:    elig_study_id 
#> 
#> Metadata
#> Rows: 276
#> Columns: 6
#> $ age         <dbl> 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40…
#> $ sex         <fct> F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F…
#> $ dose_1      <fct> AZD1222, AZD1222, AZD1222, AZD1222, AZD1222, AZD1222, AZD1…
#> $ date_dose_1 <date> 2021-01-05, 2021-01-05, 2021-01-05, 2021-01-05, 2021-01-0…
#> $ dose_2      <fct> AZD1222, AZD1222, AZD1222, AZD1222, AZD1222, AZD1222, AZD1…
#> $ date_dose_2 <date> 2021-02-05, 2021-02-05, 2021-02-05, 2021-02-05, 2021-02-0…
#> 
#> Experiment data & annotations
#> Rows: 276
#> Columns: 0
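
The mechanism behind these slots can be sketched in base R. This is an illustration only: the attribute names `col_calendar_date` and `col_ids` are taken from the output above, while `count_days_per_id()` and the toy data frame are hypothetical and not part of chronogram.

```r
## toy data frame: 2 participants x 3 days
toy <- expand.grid(
  calendar_date = as.Date("2020-01-01") + 0:2,
  elig_study_id = factor(1:2)
)

## record which columns hold the dates and the IDs
attr(toy, "col_calendar_date") <- "calendar_date"
attr(toy, "col_ids") <- "elig_study_id"

## Hypothetical function: looks up the column names from the
## attributes, so the caller never has to supply them.
count_days_per_id <- function(x) {
  ids <- x[[attr(x, "col_ids")]]
  dates <- x[[attr(x, "col_calendar_date")]]
  tapply(dates, ids, function(d) length(unique(d)))
}

## number of distinct dates for each participant ID
count_days_per_id(toy)
```

Here each participant has three distinct dates, so the function returns 3 for both IDs without being told which columns to use.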

Summary

The chronogram class is a subclass of tibble that allows additional checks on its validity. We found that assembling data by sequential joins, such that a row is present for each {ID x date} combination, is prone to duplicating rows. During chronogram development, we therefore opted to assert that each {ID x date} combination can be present once, or absent. Because we wanted to avoid altering the handling of “conventional” tibbles, we implemented chronogram as a subclass.