We load in all the necessary additional R packages, and set up some initial parameters.
Table 1: Context

| Setting | Value |
| --- | --- |
| section | Exploratory analysis |
| res_dir | results/v9.9.9 |
| VERSION | v9.9.9 |
| TAG | _v9.9.9-00eb166 |
| staging_dir | staging |
| file_col | ID |
| name_col | sample_name |
| metadata | extdata/metadata_GRCh38.csv |
| counts | extdata/genes.results/GRCh38/ |
| alignment | GRCh38 |
| spec | analyse |
| specname | analyse |
| script | staging/01_exploratory_analyse_GRCh38.qmd |
Note: Setting analysis parameter
The value of rowNoun is gene
The value of RowNoun is Gene
We also have the following defaults set. Whenever they are used or changed in the analysis, that will be highlighted in the text in a table identified with an info icon.
Table 2: Analysis Settings

| Option | Value |
| --- | --- |
| alpha | 0.05 |
| lfcThreshold | 0 |
| baseMeanMin | 0 |
| top_n_variable | 500 |
| showCategory | 25 |
| seed | 1 |
| filterFun | NULL |
| gene_clust | bluster::HclustParam(cut.params = list(k = 12)) |
| sample_clust | bluster::HclustParam(metric = "pearson") |
| stringsAsFactors | FALSE |
| normalise | NULL |
| impute | NULL |
| baseline_heuristic | "min" |
| pc_weight | "equal weight" |
| LRT_effect | "default" |
Note: Setting analysis parameter
A random seed of 1 is used to ensure reproducibility
Using analysis plan ‘analyse’.
Using alignment settings ‘GRCh38’
Use the ‘“min”’ heuristic to centre the colour-scale
Use the ‘“default”’ summary of an LRT ‘effect size’
Discard transcripts with fewer average counts per sample than 0
Table 3: Feature Filtering

| dataset | filter | success | condition |
| --- | --- | --- | --- |
| D1 | universal | 100.0% | TRUE |
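The filtering step reported above can be sketched in base R as follows. This is a minimal illustration rather than the report's own code: the count matrix is invented, and the use of `>=` against `baseMeanMin` is an assumption about how the threshold is applied.

```r
# Toy count matrix (invented values): 3 genes x 4 samples
counts <- matrix(
  c(0, 0, 0, 0,
    5, 3, 8, 2,
    100, 120, 90, 110),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("S", 1:4))
)

baseMeanMin <- 0  # value taken from the analysis settings table

# Assumed rule: keep a gene when its mean count per sample is at least the threshold
keep <- rowMeans(counts) >= baseMeanMin
filtered <- counts[keep, , drop = FALSE]

# Percentage of genes passing the filter
pct_success <- 100 * mean(keep)

# For comparison, a stricter (hypothetical) threshold of 10
keep_strict <- rowMeans(counts) >= 10
```

With a threshold of 0 every gene passes, which is consistent with the 100.0% success reported for the ‘universal’ filter.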
2 Input Summary
The sample annotations are as follows:
Table 4: Sample Annotation (column groups: Metadata; D1)

| ID | cellLine | treatment | D1 | .influential | .involved |
| --- | --- | --- | --- | --- | --- |
| SRR1039508 | N61311 | Untreated | ✓ | ✓ | TRUE |
| SRR1039509 | N61311 | Dexamethasone | ✓ | ✓ | TRUE |
| SRR1039512 | N052611 | Untreated | ✓ | ✓ | TRUE |
| SRR1039513 | N052611 | Dexamethasone | ✓ | ✓ | TRUE |
| SRR1039516 | N080611 | Untreated | ✓ | ✓ | TRUE |
| SRR1039517 | N080611 | Dexamethasone | ✓ | ✓ | TRUE |
| SRR1039520 | N061011 | Untreated | ✓ | ✓ | TRUE |
| SRR1039521 | N061011 | Dexamethasone | ✓ | ✓ | TRUE |
3 All Datasets and Models
We may examine the samples in different combinations, and leave out certain samples (either ignoring them entirely, or visualising them without letting them influence the analysis in any way).
In the above table, the columns under the ‘In Subset’ group list these combinations and indicate which samples each one includes. Below, we describe the datasets in terms of the predicates which determine which samples are included in them, and also the ‘influential’ samples that determine: how the genes are clustered; how the principal components are calculated; and what samples are used in subsequent differential analysis.
Usually we will include all samples and they will all be ‘influential’, but depending on the analysis question, it might be useful to ignore some samples or experimental groups entirely, and from the remainder select an influential subset that drives the analysis (the non-influential samples being retained in the visualisations and analysis for cross-reference, without actively driving the analysis).
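As a rough sketch of how such predicates might be evaluated against the sample annotation, the base R snippet below builds the two logical columns. The inclusion predicate `TRUE` matches dataset D1; the influential predicate `treatment == "Untreated"` is purely hypothetical (in this report all eight samples are influential), and the variable names are illustrative.

```r
# Sample annotation as listed in the sample annotation table
meta <- data.frame(
  ID = c("SRR1039508", "SRR1039509", "SRR1039512", "SRR1039513",
         "SRR1039516", "SRR1039517", "SRR1039520", "SRR1039521"),
  cellLine = rep(c("N61311", "N052611", "N080611", "N061011"), each = 2),
  treatment = rep(c("Untreated", "Dexamethasone"), times = 4)
)

# Predicates are stored as unevaluated expressions over the metadata columns
involved_pred    <- quote(TRUE)                      # as for dataset D1
influential_pred <- quote(treatment == "Untreated")  # hypothetical example only

# Evaluate a predicate in the context of the metadata, recycling scalars
eval_pred <- function(pred, data) {
  rep_len(as.logical(eval(pred, data)), nrow(data))
}

meta$.involved <- eval_pred(involved_pred, meta)
# Influential samples are chosen from those already selected for inclusion
meta$.influential <- meta$.involved & eval_pred(influential_pred, meta)
```

Under this hypothetical predicate the treated samples would still appear in plots, but only the untreated ones would drive gene selection, clustering and principal components.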
Each of those datasets may be inferentially analysed in potentially several ways, as formulated below. In simple ‘A vs B’ experiments there may only be one inferential approach, but as soon as we have a more complex structure, different questions may demand a diversity of approaches, optionally: accounting for various covariates; stratifying the analysis by experimental factors; looking for the ways that one factor modifies the effect of another upon expression;… We use mathematical formulae to specify these approaches:
3.1 Dataset D1: All
All samples included
Samples for inclusion in any analysis: TRUE (8)
Samples used, out of those already selected for inclusion, to actively inform analysis (differential, heatmap’s top genes, principal components): All included samples (8)
Model M1: Simple
Including treatment and line, so that we can look at either one of those effects while accounting for any systematic changes in the other. But there is no interaction term, so any modifying effect (of treatment type on the response to line, or vice versa) will be unaccounted for, and genes exhibiting this behaviour will tend not to be selected. There is no replication (i.e. no line x treatment combination has more than one sample), so we have to restrict our model to at most this complexity.
Expression ~ treatment + cellLine
Model M2: Line-only
Just including a line effect, and totally ignoring treatment. So any systematic treatment effect will not be accounted for, and genes exhibiting a change due to treatment will tend not to be selected.
Expression ~ cellLine
Model M3: Treatment-only
Just including a treatment effect, and totally ignoring line. So any systematic differences between lines will not be accounted for, and genes exhibiting a dependency on line will tend not to be selected.
Expression ~ treatment
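To make the ‘no replication’ argument concrete, the sketch below counts the coefficients each formula needs and the residual degrees of freedom left over from eight samples. It uses only base R `model.matrix()`; the right-hand sides are those of M1 to M3, and the interaction model is included only to show why it cannot be fitted with a residual. The object names are illustrative.

```r
# Experimental design from the sample annotation table
meta <- data.frame(
  cellLine = factor(rep(c("N61311", "N052611", "N080611", "N061011"), each = 2)),
  treatment = factor(rep(c("Untreated", "Dexamethasone"), times = 4),
                     levels = c("Untreated", "Dexamethasone"))
)

# Right-hand sides of the three models
designs <- list(
  M1 = ~ treatment + cellLine,
  M2 = ~ cellLine,
  M3 = ~ treatment
)

# Number of coefficients per model, and residual degrees of freedom
n_coef <- sapply(designs, function(f) ncol(model.matrix(f, meta)))
resid_df <- nrow(meta) - n_coef

# A model with interaction needs one coefficient per line x treatment cell
n_coef_interaction <- ncol(model.matrix(~ treatment * cellLine, meta))
resid_df_interaction <- nrow(meta) - n_coef_interaction
```

M1 uses 5 coefficients and leaves 3 residual degrees of freedom, whereas the interaction model uses all 8, leaving nothing from which to estimate variability.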
4 Description of Visualisations
There are different ways of configuring the plots. These roughly divide into two choices:
How to illustrate the covariates (treatments, batches, …) in the visualisations: in multi-factorial experiments, we may want to represent one covariate in terms of colouring the point that represents a sample. It might be obvious what these ‘aesthetic’ choices should be, but conversely we may produce multiple ‘plot configurations’ to illustrate different experimental factors.
What to use as the ‘outcome’: typically we’ll use the normalised readout in heatmaps, or to generate principal components, but there are situations where we might wish to ‘normalise out’ the effect of, say, batch so that the true biology can come into focus. Or we might wish to ‘recentre’ the data so that one particular experimental group is seen as a baseline.
At the beginning of each ‘plot config’ section we list the respective choice made for the plots in that section: the ‘aesthetic’ mapping describes how each aspect of the plot (colour, x-coordinate, shape, …) reflects an aspect of the experimental design (e.g. treatment); and the ‘reconstruction’ (if present) states which experimental factors have been normalised out (e.g. ‘. - batch’ indicates that the batch effect has been regressed out of the visualisation).
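One way to picture a ‘reconstruction’ is as a linear-model adjustment: fit the full model per gene, then subtract the fitted contribution of the factor being normalised out. The sketch below does this in base R on simulated data, using cellLine as a stand-in for batch (an assumption, since this dataset has no batch factor). It illustrates the general idea, not necessarily the report's implementation.

```r
set.seed(1)

cellLine <- factor(rep(c("N61311", "N052611", "N080611", "N061011"), each = 2))
treatment <- factor(rep(c("Untreated", "Dexamethasone"), times = 4),
                    levels = c("Untreated", "Dexamethasone"))

# Simulated expression (genes x samples) with an artificial cell-line offset
n_genes <- 5
expr <- matrix(rnorm(n_genes * 8), nrow = n_genes)
expr <- sweep(expr, 2, 2 * as.integer(cellLine), "+")

# Fit treatment + cellLine to every gene at once (samples in rows)
X <- model.matrix(~ treatment + cellLine)
beta <- lm.fit(X, t(expr))$coefficients  # coefficients x genes

# Subtract only the fitted cell-line contribution, i.e. '. - cellLine'
line_cols <- grep("^cellLine", colnames(X))
line_effect <- X[, line_cols, drop = FALSE] %*% beta[line_cols, , drop = FALSE]
adjusted <- expr - t(line_effect)

# Refitting shows the cell-line terms are gone and treatment is untouched
beta_adj <- lm.fit(X, t(adjusted))$coefficients
```

The adjusted matrix would then be passed to the heatmap or principal component code in place of the original values.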
Now we give a brief introduction to each of the plot-types we show for each configuration.
4.1 Overall Heatmaps
These visualisations are carried out blind to the experimental design. For the heatmap in Figure 2, we select a subset of genes, removing the (assumed uninformative) genes with flat expression levels across the whole sample set. We’d expect samples from the same experimental group to cluster together, in the sense that they are on the same branch of the tree. But this is a transcriptome-wide picture, and even if the clustering is not perfect, there will still be genes that are consistent with the expected expression profiles, and they will be revealed in the differential analysis.
The values of the covariates relevant to the dataset and model will be illustrated in the colour-bar above the heatmap. Influential samples (often defined as all the samples) will be identified similarly in the colour-bar - the choice of “most variable genes” will be based on their behaviour in the influential samples.
In case that heatmap doesn’t cluster exactly according to the experimental design, in Figure 3 we provide the samples ordered according to one particular interpretation of the experimental design. In simple (one-way) designs where there is only one covariate, this is a fairly straightforward way of gathering samples in a set of columns, but in more complex designs it isn’t obvious whether to sort first by genotype and then treatment, say, or perhaps the reverse, so for this ‘supervised’ clustering we chose the (arbitrary) order in which the terms of the ‘model’ were specified above.
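The gene selection and sample clustering described above can be approximated in base R as shown below. The data are simulated, and `hclust()` on a `1 - cor()` distance is used here as a stand-in for the `bluster::HclustParam(metric = "pearson")` setting; it is a sketch of the idea rather than the report's code.

```r
set.seed(1)

# Simulated normalised expression: 1000 genes x 8 samples
n_genes <- 1000
n_samples <- 8
expr <- matrix(
  rnorm(n_genes * n_samples), nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("S", seq_len(n_samples)))
)

influential <- rep(TRUE, n_samples)  # all samples influential, as in D1
top_n_variable <- 500                # value from the analysis settings table

# Rank genes by variance across the influential samples only
gene_var <- apply(expr[, influential, drop = FALSE], 1, var)
top_genes <- names(sort(gene_var, decreasing = TRUE))[seq_len(top_n_variable)]
sub <- expr[top_genes, , drop = FALSE]

# Cluster samples using a Pearson-correlation distance
sample_dist <- as.dist(1 - cor(sub, method = "pearson"))
sample_tree <- hclust(sample_dist)
sample_order <- sample_tree$labels[sample_tree$order]
```

`sample_order` gives the left-to-right column order of an unsupervised heatmap; the ‘supervised’ version would instead sort the columns by the model covariates.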
4.2 Clustering centroids
Based on the genes’ expression in the influential samples, the dendrogram from the heatmap will be split into clusters of similarly behaving features. We can aggregate the expression across those features and produce dot plots for each of the clusters. See, for example, Figure 4.
4.3 PC-Covariate association
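A minimal base R sketch of the centroid calculation follows, on simulated data. `cutree(k = 12)` mirrors the `cut.params = list(k = 12)` setting; centring and scaling each gene beforehand, and the Euclidean distance, are assumptions made for illustration.

```r
set.seed(1)

# Simulated expression for the selected genes: 500 genes x 8 samples
expr <- matrix(
  rnorm(500 * 8), nrow = 500,
  dimnames = list(paste0("gene", 1:500), paste0("S", 1:8))
)
influential <- rep(TRUE, 8)

# Centre and scale each gene so clusters reflect shape rather than level (assumption)
scaled <- t(scale(t(expr)))

# Cluster genes on the influential samples and cut the tree into 12 groups
gene_tree <- hclust(dist(scaled[, influential, drop = FALSE]))
cluster_id <- cutree(gene_tree, k = 12)

# Centroid = equally weighted mean over the genes in each cluster, per sample
cluster_size <- as.vector(table(cluster_id))
centroids <- rowsum(scaled, cluster_id) / cluster_size
```

Each row of `centroids` supplies one value per sample, which is what a cluster dot plot would show against the covariates.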
Principal Components are a way of finding a number of meta-genes - proxies that best reconstruct the multi-dimensional expression signal if we had to choose 1, 2, … individual numbers to represent a sample. In ?@fig-covar-1 the columns explore the lower-order principal components (which, although not providing a universally strong signal transcriptome-wide, might still detect biology that is affecting a smaller subset of genes, or a reasonably large set of genes to a lesser extent than happens in the higher-order PCs). The rows represent the experimental factors we have recorded, and the depth of colour of the cell records the degree of ‘association’ that factor has with the given PC (grey indicates that there was no statistically significant association). This being an automated report, we’ve had to make various compromises so please don’t read too much into this chart: it primarily guides which other plots we should generate to look at how samples associate with each other.
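One simple way to score PC-covariate association is a one-way ANOVA of each PC's sample scores against each factor. The sketch below does this in base R on simulated data with an artificial treatment effect. The report's actual association measure is not stated here, so the ANOVA p-value is an assumption chosen for illustration.

```r
set.seed(1)

meta <- data.frame(
  cellLine = factor(rep(c("N61311", "N052611", "N080611", "N061011"), each = 2)),
  treatment = factor(rep(c("Untreated", "Dexamethasone"), times = 4),
                     levels = c("Untreated", "Dexamethasone"))
)

# Simulated expression: 200 genes x 8 samples, first 50 genes respond to treatment
expr <- matrix(rnorm(200 * 8), nrow = 200)
treated <- meta$treatment == "Dexamethasone"
expr[1:50, treated] <- expr[1:50, treated] + 5

# PCA with samples as observations
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# ANOVA p-value for each covariate (rows) against each of the first PCs (columns)
n_pc <- 4
assoc_p <- sapply(seq_len(n_pc), function(k) {
  sapply(names(meta), function(v) {
    fit <- lm(pca$x[, k] ~ meta[[v]])
    anova(fit)[["Pr(>F)"]][1]
  })
})
colnames(assoc_p) <- paste0("PC", seq_len(n_pc))

# Cells that would be shown in grey at alpha = 0.05
not_significant <- assoc_p >= 0.05
```

In this simulation the treatment row is strongly associated with PC1, because the artificial effect dominates the variance.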
4.4 Principal Component Scatterplots
Then come scatter plots of the first two Principal Components - a way of summarising all the genes down to just two ‘metagenes’ that attempt to encapsulate the variability across the samples in as faithful a way as possible. Each plot (e.g. Figure 5) takes the PCs from the unadjusted data, and colours them according to the plot configuration under consideration.
There’s a slight complexity here in that we may need to produce two such plots to reveal all the information requested in a plot configuration: the configuration might request the use of the horizontal position to represent an aspect of the experimental design, but this is needed to represent the first PC, so it may be necessary to generate a second plot representing the missing aspect as a colour in the chart.
These plots are interactive - hovering over a point will bring up a tooltip listing all the experimental factors pertaining to that sample. Hovering will also select all other samples that share an attribute with that sample - the nominated attribute will be the first one listed in the tooltip.
4.5 Principal Component Dot-Plots
We can choose to look closer at certain principal components that e.g. ?@fig-covar-1 leads us to believe are relevant to the experimental design. Just like the cluster dot-plots, we can take a single value as a representation of the sample and plot it, as in ?@fig-onepc-1, against the covariates according to the chosen plot configuration. Whereas in the cluster dot-plot we equally weighted various subsets of similar genes, in the PC dot-plots we aggregate all the genes but choose a weighting scheme that captures as much of the multi-dimensionality as possible.
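The contrast between the two weighting schemes can be sketched in base R: a PC score is a weighted sum over all genes, with the weights given by that component's loadings, while a cluster value is an equal-weight mean over a subset. The data are simulated, and the ‘cluster’ of ten genes is a hypothetical stand-in.

```r
set.seed(1)

# Simulated expression: 100 genes x 8 samples
expr <- matrix(
  rnorm(100 * 8), nrow = 100,
  dimnames = list(paste0("gene", 1:100), paste0("S", 1:8))
)

pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Centre each gene across samples, as prcomp does internally
centred <- sweep(expr, 1, rowMeans(expr))

# PC1 score per sample: every gene contributes, weighted by its loading
pc1_weights <- pca$rotation[, 1]
pc1_score <- as.vector(t(centred) %*% pc1_weights)

# Cluster-style score: equal weights over a (hypothetical) subset of genes
cluster_genes <- paste0("gene", 1:10)
equal_weight_score <- colMeans(centred[cluster_genes, , drop = FALSE])
```

`pc1_score` reproduces the first column of `pca$x`, so it is exactly the per-sample value that a PC dot-plot would place against the covariates.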
Note: Setting analysis parameter
Only use 500 features for unsupervised clustering
The value of gene_clust is bluster::HclustParam(cut.params = list(k = 12))
The value of sample_clust is bluster::HclustParam(metric = "pearson")
The value of pc_x is 1
The value of pc_y is 2
Use the ‘“equal weight”’ heuristic to choose strength of PC-covariate association
5 Dataset D1: All
All samples included
Samples for inclusion in any analysis: TRUE (8)
Samples used, out of those already selected for inclusion, to actively inform analysis (differential, heatmap’s top genes, principal components): All included samples (8)
5.1 Model M1: Simple
Including treatment and line, so that we can look at either one of those effects while accounting for any systematic changes in the other. But there is no interaction term, so any modifying effect (of treatment type on the response to line, or vice versa) will be unaccounted for, and genes exhibiting this behaviour will tend not to be selected. There is no replication (i.e. no line x treatment combination has more than one sample), so we have to restrict our model to at most this complexity.
Figure 1: Metadata colour-codings for dataset ‘D1’
The Crick has a publication policy and we expect to be included on publications, regardless of funding arrangements. Any use of these results in publication must be discussed with BABS regarding authorship. If not authorship then the BABS analyst must receive a named acknowledgement. Please also cite the following sources which have enabled the analysis to be carried out.