Introducing the chronogram class
The chronogram class
The chronogram class extends tibble and works with dplyr verbs. The chronogram class:
asserts that each combination of date and participant ID appears in at most one row.
uses attribute slots to store the names of the columns containing the index dates, participant IDs, and metadata. The chronogram package provides these column names to other chronogram functions.
uses the `pillar` package to customise the printing of a chronogram.
has a related grouped chronogram class, which allows group_by() and subsequent tidyverse verbs to work as you would expect.
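As a rough illustration of the first point, the sketch below counts rows per date and ID combination in a toy tibble, using dplyr only. The toy data and the helper name `count_duplicate_date_ids()` are assumptions made for this example and are not part of the chronogram package.

```r
library(dplyr)

## toy data: participant 1 appears twice on 1st Jan 2020 ##
toy <- tibble::tibble(
  calendar_date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02")),
  elig_study_id = c(1, 1, 1)
)

## hypothetical helper: number of date x ID combinations seen more than once ##
count_duplicate_date_ids <- function(x) {
  x %>%
    count(calendar_date, elig_study_id) %>%
    filter(n > 1) %>%
    nrow()
}

count_duplicate_date_ids(toy)             # 1: the rule is broken
count_duplicate_date_ids(distinct(toy))   # 0: the rule holds
```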
library(chronogram)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(tidyr)
library(knitr)
Differences between chronogram and tibble classes
We will use a simple example dataset to illustrate these differences.
## a 3m window for this particular chronogram ##
start_date <- "01012020"
end_date <- "01042020"
## load example metadata ##
data("smallstudy")
metadata <- smallstudy$small_study_metadata
knitr::kable(metadata)
elig_study_id | age | sex | dose_1 | date_dose_1 | dose_2 | date_dose_2 |
---|---|---|---|---|---|---|
1 | 40 | F | AZD1222 | 2021-01-05 | AZD1222 | 2021-02-05 |
2 | 45 | F | BNT162b2 | 2021-01-05 | BNT162b2 | 2021-02-05 |
3 | 35 | M | BNT162b2 | 2021-01-10 | BNT162b2 | 2021-03-10 |
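The window strings are in day-month-year order, matching the lubridate::dmy() calls used later in this vignette. The short check below, which uses only lubridate and base R, shows how many rows to expect for three participants over this window; the variable names are local to this sketch.

```r
library(lubridate)

## parse the day-month-year strings used for the window ##
window_start <- dmy("01012020")
window_end <- dmy("01042020")

## one row per day, inclusive of both ends ##
days_in_window <- length(seq.Date(window_start, window_end, by = "day"))
days_in_window       # 92 (2020 is a leap year)

## three participants, one row per participant per day ##
days_in_window * 3   # 276
```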
We next create a chronogram (class chronogram), and a chronogram-like object (class tibble).
Using the chronogram assembly:
cg <- chronogram::cg_assemble(
start_date = start_date,
end_date = end_date,
metadata = metadata,
metadata_ids_col = elig_study_id,
calendar_date_col = calendar_date
)
#> Checking input parameters...
#> -- checking start date 01012020
#> -- checking end date 01042020
#> -- checking end date later than start date
#> -- checking metadata
#> --no experiment data provided. Add later: cg_add_experiment()
#> Input checks completed
#> Chronogram assembling...
#> -- chrongram_skeleton built
#> -- chrongram built with metadata
#> -- no experiment data provided
#>
#> Assembly finished
cg
#> # A tibble: 276 × 8
#> # A chronogram: try summary()
#> calendar_date elig_study_id age sex dose_1 date_dose_1 dose_2 date_dose_2
#> * <date> <fct> <dbl> <fct> <fct> <date> <fct> <date>
#> 1 2020-01-01 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 2 2020-01-02 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 3 2020-01-03 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 4 2020-01-04 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 5 2020-01-05 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 6 2020-01-06 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 7 2020-01-07 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 8 2020-01-08 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 9 2020-01-09 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 10 2020-01-10 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> # ℹ 266 more rows
#> # ★ Dates: calendar_date ★ IDs: elig_study_id
#> # ★ metadata: age, sex, dose_1, date_dose_1, dose_2, date_dose_2
Using tibble-based assembly:
cg_tibble <-
tidyr::crossing(
calendar_date = seq.Date(
lubridate::dmy(start_date),
lubridate::dmy(end_date),
by = 1),
elig_study_id = metadata$elig_study_id) %>%
left_join(metadata)
#> Joining with `by = join_by(elig_study_id)`
cg_tibble
#> # A tibble: 276 × 8
#> calendar_date elig_study_id age sex dose_1 date_dose_1 dose_2 date_dose_2
#> <date> <dbl> <dbl> <fct> <fct> <date> <fct> <date>
#> 1 2020-01-01 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 2 2020-01-01 2 45 F BNT16… 2021-01-05 BNT16… 2021-02-05
#> 3 2020-01-01 3 35 M BNT16… 2021-01-10 BNT16… 2021-03-10
#> 4 2020-01-02 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 5 2020-01-02 2 45 F BNT16… 2021-01-05 BNT16… 2021-02-05
#> 6 2020-01-02 3 35 M BNT16… 2021-01-10 BNT16… 2021-03-10
#> 7 2020-01-03 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 8 2020-01-03 2 45 F BNT16… 2021-01-05 BNT16… 2021-02-05
#> 9 2020-01-03 3 35 M BNT16… 2021-01-10 BNT16… 2021-03-10
#> 10 2020-01-04 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> # ℹ 266 more rows
Although these classes differ, the data they contain are identical once arranged equivalently:
all(
cg_tibble %>%
group_by(elig_study_id) %>%
arrange(calendar_date, .by_group = TRUE) ==
cg %>% as_tibble()
)
#> [1] TRUE
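The arrange step matters because == compares the two objects position by position. A toy example with dplyr, using made-up id and day columns, shows the effect of row order:

```r
library(dplyr)

## same four rows, stored in different orders ##
by_id <- tibble::tibble(id = c(1, 1, 2, 2), day = c(1, 2, 1, 2))
by_day <- tibble::tibble(id = c(1, 2, 1, 2), day = c(1, 1, 2, 2))

## element-wise comparison is sensitive to row order ##
all(by_id == by_day)                     # FALSE

## after sorting both the same way, the contents match ##
all(by_id == arrange(by_day, id, day))   # TRUE
```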
Although the setup code feels very similar for either approach, only the chronogram assembly checks the input data.
## create example metadata, with a duplicated row ##
metadata_duplicated_row <-
dplyr::bind_rows(
metadata,
metadata %>% slice_tail() )
knitr::kable(metadata_duplicated_row)
elig_study_id | age | sex | dose_1 | date_dose_1 | dose_2 | date_dose_2 |
---|---|---|---|---|---|---|
1 | 40 | F | AZD1222 | 2021-01-05 | AZD1222 | 2021-02-05 |
2 | 45 | F | BNT162b2 | 2021-01-05 | BNT162b2 | 2021-02-05 |
3 | 35 | M | BNT162b2 | 2021-01-10 | BNT162b2 | 2021-03-10 |
3 | 35 | M | BNT162b2 | 2021-01-10 | BNT162b2 | 2021-03-10 |
Using the chronogram assembly:
cg_fail <- try(
chronogram::cg_assemble(
start_date = start_date,
end_date = end_date,
## use the new metadata ##
metadata = metadata_duplicated_row,
metadata_ids_col = elig_study_id,
calendar_date_col = calendar_date
)
)
#> Checking input parameters...
#> -- checking start date 01012020
#> -- checking end date 01042020
#> -- checking end date later than start date
#> -- checking metadata
#> --no experiment data provided. Add later: cg_add_experiment()
#> Input checks completed
#> Chronogram assembling...
#> Warning in chronogram_skeleton(start_date = start_date, end_date = end_date, : Duplicate IDs found : 3 Check your input ID vector...
#> provide a unique identifier for
#> each individual
#> Returning a de-duplicated chronogram_skeleton...
#> -- chrongram_skeleton built
#> Error in check_metadata(x = metadata, cg_skeleton = cg_skeleton) :
#> { .... is not TRUE
cg_fail
#> [1] "Error in check_metadata(x = metadata, cg_skeleton = cg_skeleton) : \n { .... is not TRUE\n"
#> attr(,"class")
#> [1] "try-error"
#> attr(,"condition")
#> <simpleError in check_metadata(x = metadata, cg_skeleton = cg_skeleton): { .... is not TRUE>
Using tibble-based assembly:
cg_tibble <-
tidyr::crossing(
calendar_date = seq.Date(
lubridate::dmy(start_date),
lubridate::dmy(end_date),
by = 1),
elig_study_id = metadata_duplicated_row$elig_study_id) %>%
left_join(metadata_duplicated_row)
#> Joining with `by = join_by(elig_study_id)`
#> Warning in left_join(., metadata_duplicated_row): Detected an unexpected many-to-many relationship between `x` and `y`.
#> ℹ Row 3 of `x` matches multiple rows in `y`.
#> ℹ Row 1 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set `relationship =
#> "many-to-many"` to silence this warning.
cg_tibble
#> # A tibble: 368 × 8
#> calendar_date elig_study_id age sex dose_1 date_dose_1 dose_2 date_dose_2
#> <date> <dbl> <dbl> <fct> <fct> <date> <fct> <date>
#> 1 2020-01-01 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 2 2020-01-01 2 45 F BNT16… 2021-01-05 BNT16… 2021-02-05
#> 3 2020-01-01 3 35 M BNT16… 2021-01-10 BNT16… 2021-03-10
#> 4 2020-01-01 3 35 M BNT16… 2021-01-10 BNT16… 2021-03-10
#> 5 2020-01-02 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 6 2020-01-02 2 45 F BNT16… 2021-01-05 BNT16… 2021-02-05
#> 7 2020-01-02 3 35 M BNT16… 2021-01-10 BNT16… 2021-03-10
#> 8 2020-01-02 3 35 M BNT16… 2021-01-10 BNT16… 2021-03-10
#> 9 2020-01-03 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 10 2020-01-03 2 45 F BNT16… 2021-01-05 BNT16… 2021-02-05
#> # ℹ 358 more rows
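With the tibble-based approach, the duplicated ID has to be caught by hand. Two possible guards are sketched below on toy data: a base R check before joining, and the relationship argument of left_join(), which this sketch assumes is available (dplyr 1.1.0 or later). Neither is part of the chronogram package, and the object names are invented for the example.

```r
library(dplyr)

## toy metadata with ID 3 entered twice ##
metadata_toy <- tibble::tibble(
  elig_study_id = c(1, 2, 3, 3),
  age = c(40, 45, 35, 35)
)

## guard 1: look for repeated IDs before joining ##
ids <- metadata_toy$elig_study_id
dup_ids <- unique(ids[duplicated(ids)])
dup_ids

## guard 2: declare that each skeleton row should match at most one metadata row ##
skeleton_toy <- tibble::tibble(
  day = c(1, 1, 1, 2, 2, 2),
  elig_study_id = c(1, 2, 3, 1, 2, 3)
)
joined <- try(
  left_join(skeleton_toy, metadata_toy,
            by = "elig_study_id",
            relationship = "many-to-one"),
  silent = TRUE
)
inherits(joined, "try-error")   # TRUE: the join refuses to duplicate rows
```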
Validating a chronogram
chronogram::validate_chronogram() can be used to check that a chronogram is valid.
As expected, the tibble version fails:
try(
validate_chronogram(cg_tibble) ## fails
)
#> Error in validate_chronogram(cg_tibble) :
#> Invalid chronogram: Wrong class. Create with 'chronogram()'
validate_chronogram(cg) ## returns TRUE
#> [1] TRUE
We can intentionally break the row “rule” where each combination of date x participant ID can appear once, or not at all.
Example 1
## break "rule" one way ##------------------------------------------------------------
not_a_cg <- bind_rows(cg,
cg)
try(
validate_chronogram(not_a_cg)
)
#> Error in validate_chronogram(not_a_cg) :
#> Invalid chronogram: Dates are duplicated for IDs
Example 2
## break "rule" a different way ##----------------------------------------------------
## this extra data will create 2 rows for ID==1 on 1st Jan 2020 ##
data_to_join <- tibble::tribble(
~calendar_date, ~elig_study_id, ~new_info,
"01012020", 1, "a",
"01012020", 1, "b"
)
data_to_join <- data_to_join %>%
mutate(calendar_date = lubridate::dmy(calendar_date)) %>%
mutate(elig_study_id = factor(elig_study_id))
## using the provided method, cg_add_experiment()
## errors, with an appropriate message
try(
cg_add_experiment(cg, data_to_join)
)
#> Error in check_experiment(x = experiment, cg = cg) :
#> Invalid experiment: date:ids are duplicated
## we can use a dplyr::join ##
still_not_a_cg <- right_join(data_to_join, cg)
#> Joining with `by = join_by(calendar_date, elig_study_id)`
still_not_a_cg_left <- left_join(cg, data_to_join)
#> Joining with `by = join_by(calendar_date, elig_study_id)`
## only data_to_join has duplicated keys, so the join is not many-to-many and no warning is emitted
## an extra row is gained:
nrow(still_not_a_cg)
#> [1] 277
nrow(still_not_a_cg_left)
#> [1] 277
nrow(cg)
#> [1] 276
try(
validate_chronogram(still_not_a_cg)
)
#> Error in validate_chronogram(still_not_a_cg) :
#> Invalid chronogram: Wrong class. Create with 'chronogram()'
class(still_not_a_cg)
#> [1] "tbl_df" "tbl" "data.frame"
try(
validate_chronogram(still_not_a_cg_left)
)
#> Error in validate_chronogram(still_not_a_cg_left) :
#> Invalid chronogram: Dates are duplicated for IDs
class(still_not_a_cg_left) # left_join inherits class from first argument
#> [1] "cg_tbl" "tbl_df" "tbl" "data.frame"
Chronogram assembly and annotation functions use validate_chronogram() to maintain the row “rule” where each combination of date x participant ID can be present once, or absent.
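The pattern of a data frame sub-class with a validator can be sketched in a few lines of S3. This illustrates the general idea only and is not the package's implementation: the class name toy_cg and the functions new_toy_cg() and validate_toy_cg() are hypothetical.

```r
## hypothetical constructor: add a class on top of a data frame ##
new_toy_cg <- function(x) {
  stopifnot(is.data.frame(x))
  class(x) <- c("toy_cg", class(x))
  x
}

## hypothetical validator: check the class, then the row "rule" ##
validate_toy_cg <- function(x) {
  if (!inherits(x, "toy_cg")) {
    stop("Invalid toy_cg: wrong class")
  }
  keys <- as.data.frame(x)[, c("calendar_date", "elig_study_id")]
  if (anyDuplicated(keys) > 0) {
    stop("Invalid toy_cg: date x ID combinations are duplicated")
  }
  TRUE
}

rows <- data.frame(
  calendar_date = as.Date(c("2020-01-01", "2020-01-02")),
  elig_study_id = c(1, 1)
)

## unique date x ID rows with the right class pass ##
validate_toy_cg(new_toy_cg(rows))

## right class, but every row repeated: fails the row "rule" ##
doubled <- new_toy_cg(rbind(rows, rows))
inherits(try(validate_toy_cg(doubled), silent = TRUE), "try-error")

## plain data frame: fails the class check ##
inherits(try(validate_toy_cg(rows), silent = TRUE), "try-error")
```

Checking the class first means that an object which has lost its sub-class is rejected before its rows are inspected.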
Chronogram attributes
To spare the user from repeatedly specifying the names of the columns that store participant IDs or calendar dates, these names are stored in attribute slots. Chronogram functions read these slots to find the relevant columns for per-date and per-individual operations. The user can explore these attributes:
attribs <- attributes(cg)
## attribs is a named list:
class(attribs)
#> [1] "list"
names(attribs)
#> [1] "names" "row.names" "class"
#> [4] "col_calendar_date" "col_ids" "cg_pkg_version"
#> [7] "windowed" "cols_metadata"
## print those attributes with names starting "col"
attribs[grepl("^col", names(attribs))]
#> $col_calendar_date
#> [1] "calendar_date"
#>
#> $col_ids
#> [1] "elig_study_id"
#>
#> $cols_metadata
#> [1] "age" "sex" "dose_1" "date_dose_1" "dose_2"
#> [6] "date_dose_2"
Or, more simply:
summary(cg)
#> A chronogram:
#> Dates column: calendar_date
#> IDs column: elig_study_id
#> Starts on: 2020-01-01
#> Ends on: 2020-04-01
#> Contains: 3 unique participant IDs
#> Windowed: FALSE
#> Spanning: 92 - 92 days [min-max per participant]
#> Metadata: age, sex, dose_1, date_dose_1, dose_2, date_dose_2
#> Size: 17.82 kB
#> Pkg_version: 1.0.0 [used to build this cg]
print(cg)
#> # A tibble: 276 × 8
#> # A chronogram: try summary()
#> calendar_date elig_study_id age sex dose_1 date_dose_1 dose_2 date_dose_2
#> * <date> <fct> <dbl> <fct> <fct> <date> <fct> <date>
#> 1 2020-01-01 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 2 2020-01-02 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 3 2020-01-03 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 4 2020-01-04 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 5 2020-01-05 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 6 2020-01-06 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 7 2020-01-07 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 8 2020-01-08 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 9 2020-01-09 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> 10 2020-01-10 1 40 F AZD12… 2021-01-05 AZD12… 2021-02-05
#> # ℹ 266 more rows
#> # ★ Dates: calendar_date ★ IDs: elig_study_id
#> # ★ metadata: age, sex, dose_1, date_dose_1, dose_2, date_dose_2
glimpse(cg) ## no experiment data added to this chronogram
#> Glimpse: chronogram
#> Dates column: calendar_date
#> IDs column: elig_study_id
#>
#> Metadata
#> Rows: 276
#> Columns: 6
#> $ age <dbl> 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40…
#> $ sex <fct> F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F…
#> $ dose_1 <fct> AZD1222, AZD1222, AZD1222, AZD1222, AZD1222, AZD1222, AZD1…
#> $ date_dose_1 <date> 2021-01-05, 2021-01-05, 2021-01-05, 2021-01-05, 2021-01-0…
#> $ dose_2 <fct> AZD1222, AZD1222, AZD1222, AZD1222, AZD1222, AZD1222, AZD1…
#> $ date_dose_2 <date> 2021-02-05, 2021-02-05, 2021-02-05, 2021-02-05, 2021-02-0…
#>
#> Experiment data & annotations
#> Rows: 276
#> Columns: 0
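To illustrate how column names stored as attributes can drive later functions, here is a base R sketch. The attribute name col_ids mirrors the one listed above, but the data frame and the helper n_participants() are invented for this example and are not chronogram functions.

```r
## toy data frame with an arbitrarily named ID column ##
toy <- data.frame(
  my_id = c("a", "a", "b"),
  value = c(1, 2, 3)
)

## record which column holds the participant IDs ##
attr(toy, "col_ids") <- "my_id"

## hypothetical helper: looks up the ID column from the attribute ##
n_participants <- function(x) {
  ids_col <- attr(x, "col_ids")
  length(unique(x[[ids_col]]))
}

n_participants(toy)   # 2
```

Because the helper reads the column name from the object itself, the caller does not need to pass it on every call.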
Summary
Chronogram is a subclass of tibble, and allows additional checks on its validity. We found that assembling data by sequential joins, such that a row is present for each {ID x date} combination, is prone to duplicating rows. During chronogram development, we opted to assert the condition that each {ID x date} combination can be present once, or absent. As we wanted to avoid altering the handling of “conventional” tibbles, we implemented chronogram as a sub-class.