Exploratory analysis of GRCh38-aligned data according to plan ‘analyse’

Authors	Affiliations
Gavin Kelly [Analyst]	Francis Crick Institute
BABS [Development]	Francis Crick Institute
Gavin Kelly [Developer, Statistical Methodology]	Francis Crick Institute

1 Preface

We load in all the necessary additional R packages, and set up some initial parameters.

Table 1: Context

Setting	Value
section	Exploratory analysis
res_dir	results/v9.9.9
VERSION	v9.9.9
TAG	_v9.9.9-00eb166
staging_dir	staging
file_col	ID
name_col	sample_name
metadata	extdata/metadata_GRCh38.csv
counts	extdata/genes.results/GRCh38/
alignment	GRCh38
spec	analyse
specname	analyse
script	staging/01_exploratory_analyse_GRCh38.qmd

Setting analysis parameter

The value of rowNoun is gene

Setting analysis parameter

The value of RowNoun is Gene

We also have the following defaults set. Whenever they are used or changed in the analysis, that will be highlighted in the text in a table identified with an info icon.

Table 2: Analysis Settings

Option	Value
alpha	0.05
lfcThreshold	0
baseMeanMin	0
top_n_variable	500
showCategory	25
seed	1
filterFun	NULL
gene_clust	bluster::HclustParam(cut.params = list(k = 12))
sample_clust	bluster::HclustParam(metric = "pearson")
stringsAsFactors	FALSE
normalise	NULL
impute	NULL
baseline_heuristic	"min"
pc_weight	"equal weight"
LRT_effect	"default"

Setting analysis parameter

A random seed of 1 is used to ensure reproducibility
Using analysis plan ‘analyse’.
Using alignment settings ‘GRCh38’
Use the ‘“min”’ heuristic to centre the colour-scale
Use the ‘“default”’ summary of an LRT ‘effect size’
Discard transcripts with fewer average counts per sample than 0

Table 3: Feature Filtering

dataset	filter	success	condition
D1	universal	100.0%	TRUE
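
As an illustration of the kind of filter summarised in Table 3, here is a minimal Python sketch (not the report's own R code). It assumes the filter keeps a feature when its mean count across samples is at least the threshold; the function name, the data layout and the toy counts are all hypothetical.

```python
from statistics import mean

def filter_by_mean_count(counts, base_mean_min=0):
    """Keep features whose mean count across samples is at least base_mean_min.

    `counts` maps a feature name to its per-sample counts (hypothetical layout).
    """
    return {gene: row for gene, row in counts.items() if mean(row) >= base_mean_min}

# Toy counts for three made-up genes across four samples.
toy_counts = {
    "geneA": [0, 0, 0, 0],
    "geneB": [5, 7, 6, 4],
    "geneC": [120, 98, 143, 110],
}

# With a threshold of 0 nothing can fail, which matches a 100% success rate.
kept_default = filter_by_mean_count(toy_counts, base_mean_min=0)
# A stricter (hypothetical) threshold removes the low-count features.
kept_strict = filter_by_mean_count(toy_counts, base_mean_min=10)

success_rate = 100.0 * len(kept_default) / len(toy_counts)
print(f"{success_rate:.1f}% of features pass with base_mean_min=0")
print("features passing base_mean_min=10:", list(kept_strict))
```

With the default threshold of 0 every feature passes, so a non-zero baseMeanMin would be needed for this filter to have any effect.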

2 Input Summary

The sample annotations are as follows:

Table 4: Sample Annotation

Metadata			D1
ID	cellLine	treatment	D1	.influential	.involved
SRR1039508	N61311	Untreated	✓	✓	TRUE
SRR1039509	N61311	Dexamethasone	✓	✓	TRUE
SRR1039512	N052611	Untreated	✓	✓	TRUE
SRR1039513	N052611	Dexamethasone	✓	✓	TRUE
SRR1039516	N080611	Untreated	✓	✓	TRUE
SRR1039517	N080611	Dexamethasone	✓	✓	TRUE
SRR1039520	N061011	Untreated	✓	✓	TRUE
SRR1039521	N061011	Dexamethasone	✓	✓	TRUE

3 All Datasets and Models

We may examine the samples in different combinations, and leave out certain samples (either entirely ignore them, or visualise them but not let them influence the analysis in any way.)

In the above table, the columns under the ‘In Subset’ group list these combinations and indicate which samples each one includes. Below, we describe the datasets in terms of the predicates that determine which samples are included in them, and also the ‘influential’ samples that determine: how the genes are clustered; how the principal components are calculated; and which samples are used in subsequent differential analysis.

Usually we will include all samples and they will all be ‘influential’, but depending on the analysis question, it might be useful to ignore some samples or experimental groups entirely and, from those that remain, select an influential subset that drives the analysis (the non-influential samples are retained in the visualisations and analysis for cross-reference, but won’t actively drive the analysis).

Each of those datasets may be inferentially analysed in potentially several ways, as formulated below. In simple ‘A vs B’ experiments there may only be one inferential approach, but as soon as we have a more complex structure different questions may demand a diversity of approaches, optionally: accounting for various covariates; stratifying the analysis by experimental factors; looking for the ways that one factor modifies the effect of another upon expression;… We use mathematical formulae to specify these approaches:

3.1 Dataset D1: All

All samples included

Samples for inclusion in any analysis: TRUE (8)
Samples used, out of those already selected for inclusion, to actively inform analysis (differential, heatmap’s top genes, principal components): All included samples (8)
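
The two-stage selection described above (inclusion first, then influence among the included samples) can be sketched in Python as follows. The sample annotation is copied from Table 4; the `partition` helper and the alternative predicates in the second call are hypothetical and only illustrate how a narrower dataset could be defined.

```python
# Sample annotation copied from Table 4.
samples = [
    {"ID": "SRR1039508", "cellLine": "N61311", "treatment": "Untreated"},
    {"ID": "SRR1039509", "cellLine": "N61311", "treatment": "Dexamethasone"},
    {"ID": "SRR1039512", "cellLine": "N052611", "treatment": "Untreated"},
    {"ID": "SRR1039513", "cellLine": "N052611", "treatment": "Dexamethasone"},
    {"ID": "SRR1039516", "cellLine": "N080611", "treatment": "Untreated"},
    {"ID": "SRR1039517", "cellLine": "N080611", "treatment": "Dexamethasone"},
    {"ID": "SRR1039520", "cellLine": "N061011", "treatment": "Untreated"},
    {"ID": "SRR1039521", "cellLine": "N061011", "treatment": "Dexamethasone"},
]

def partition(samples, include=lambda s: True, influential=lambda s: True):
    """Apply the inclusion predicate, then pick influential samples among those."""
    included = [s for s in samples if include(s)]
    driving = [s for s in included if influential(s)]
    return included, driving

# Dataset D1: the inclusion predicate is TRUE and every included sample is influential.
d1_included, d1_influential = partition(samples)

# A hypothetical alternative: drop one cell line entirely, and let only the
# untreated samples drive clustering and principal components.
alt_included, alt_influential = partition(
    samples,
    include=lambda s: s["cellLine"] != "N080611",
    influential=lambda s: s["treatment"] == "Untreated",
)

print(len(d1_included), len(d1_influential))
print(len(alt_included), len(alt_influential))
```

In the alternative, the treated samples would still be drawn in the plots but would not affect which genes are selected or how the components are computed.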

Model M1: Simple

Including treatment and line, so that we can look at either one of those effects while accounting for any systematic changes in the other. There is no interaction term, so any modifying effect (of treatment type on the response to line, or vice versa) will be unaccounted for, and genes exhibiting this behaviour will tend not to be selected. There is no replication (i.e. no line x treatment combination has more than one sample), so we have to restrict our model to at most this complexity.

Expression ~ treatment + cellLine
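
To see why the lack of replication caps the model at this complexity, the following Python sketch counts parameters for the additive design using the sample annotation from Table 4. The dummy coding and the choice of baseline levels are assumptions made for illustration; the report's R modelling code may code the factors differently, but the parameter counts do not depend on that choice.

```python
# Sample annotation copied from Table 4: (cellLine, treatment) per sample.
samples = [
    ("N61311", "Untreated"), ("N61311", "Dexamethasone"),
    ("N052611", "Untreated"), ("N052611", "Dexamethasone"),
    ("N080611", "Untreated"), ("N080611", "Dexamethasone"),
    ("N061011", "Untreated"), ("N061011", "Dexamethasone"),
]

def dummy_code(values, baseline):
    """Return the non-baseline levels and one row of 0/1 indicators per value."""
    levels = [lvl for lvl in sorted(set(values)) if lvl != baseline]
    return levels, [[int(v == lvl) for lvl in levels] for v in values]

cell_lines = [line for line, _ in samples]
treatments = [trt for _, trt in samples]

# Baselines are an arbitrary choice made for this sketch.
trt_levels, trt_cols = dummy_code(treatments, baseline="Untreated")
line_levels, line_cols = dummy_code(cell_lines, baseline="N61311")

# Additive design: intercept + treatment indicators + cell-line indicators.
design = [[1] + t + l for t, l in zip(trt_cols, line_cols)]

n_samples = len(design)
n_additive = len(design[0])
# An interaction would add one parameter per pair of non-baseline levels.
n_interaction = len(trt_levels) * len(line_levels)

residual_df_additive = n_samples - n_additive
residual_df_with_interaction = n_samples - (n_additive + n_interaction)

print("additive parameters:", n_additive, "residual df:", residual_df_additive)
print("with interaction:", n_additive + n_interaction,
      "residual df:", residual_df_with_interaction)
```

With eight samples, the additive model leaves residual degrees of freedom for estimating variability, whereas adding the interaction would use one parameter per sample and leave none.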

Model M2: Line-only

Just including a line effect, and totally ignoring treatment. So any systematic treatment effect will not be accounted for, and genes exhibiting a change due to treatment will tend not to be selected.

Expression ~ cellLine

Model M3: Treatment-only

Just including a treatment effect, and totally ignoring line. So any systematic differences between lines will not be accounted for, and genes exhibiting a dependency on line will tend not to be selected.

Expression ~ treatment

4 Description of Visualisations

There are different ways of configuring the plots. These roughly divide into two choices:

How to illustrate the covariates (treatments, batches, …) in the visualisations: in multi-factorial experiments, we may want to represent one covariate in terms of colouring the point that represents a sample. It might be obvious what these ‘aesthetic’ choices should be, but conversely we may produce multiple ‘plot configurations’ to illustrate different experimental factors.
What to use as the ‘outcome’: typically we’ll use the normalised readout in heatmaps, or to generate principal components, but there are situations where we might wish to ‘normalise out’ the effect of, say, batch so that the true biology can come into focus. Or we might wish to ‘recentre’ the data so that one particular experimental group is seen as a baseline.

At the beginning of each ‘plot config’ section we list the respective choice made for the plots in that section: the ‘aesthetic’ mapping describes how each aspect of the plot (colour, x-coordinate, shape, …) reflects an aspect of the experimental design (e.g. treatment); and the ‘reconstruction’ (if present) states which experimental factors have been normalised out (e.g. ‘. - batch’ indicates that the batch effect has been regressed out of the visualisation).
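
As a rough illustration of what ‘normalising out’ a factor means, the Python sketch below removes a single nuisance factor from one gene by subtracting each level's mean and adding back the grand mean. This is a deliberate simplification for a balanced design with made-up values and a hypothetical `batch` factor; the report's own reconstruction is model-based and is not reproduced here.

```python
from statistics import mean

def remove_factor_effect(values, factor):
    """Subtract each level's mean offset so that all levels share the grand mean."""
    grand = mean(values)
    level_mean = {
        lvl: mean([v for v, f in zip(values, factor) if f == lvl])
        for lvl in set(factor)
    }
    return [v - level_mean[f] + grand for v, f in zip(values, factor)]

# Made-up expression for one gene in four samples: batch 'b' sits 10 units higher,
# and within each batch the treated sample is 2 units above the untreated one.
expression = [10.0, 12.0, 20.0, 22.0]
batch = ["a", "a", "b", "b"]
treatment = ["Untreated", "Treated", "Untreated", "Treated"]

adjusted = remove_factor_effect(expression, batch)
print(adjusted)
```

After the adjustment the batch offset has gone but the within-batch treatment difference is preserved, which is the sense in which the remaining biology can ‘come into focus’.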

Now we give a brief introduction to each of the plot-types we show for each configuration.

4.1 Overall Heatmaps

These visualisations are carried out blind to the experimental design. For the heatmap Figure 2, we select a subset of genes removing the (assumed uninformative) genes with flat expression levels across the whole sample set. We’d expect samples from the same experimental group to cluster together, in the sense that they are on the same branch of the tree. But this is a transcriptome-wide picture, and even if the clustering is not perfect, there will still be genes that are consistent with the expected expression profiles, and they will be revealed in the differential analysis.

The values of the covariates relevant to the dataset and model will be illustrated in the colour-bar above the heatmap. Influential samples (often defined as all the samples) will be identified similarly in the colour-bar; the choice of “most variable genes” will be based on their behaviour in the influential samples.
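
A minimal Python sketch of this gene selection follows. It assumes that ‘most variable’ means highest sample variance across the influential samples and that ‘centred scaled’ means subtracting each gene's mean and dividing by its standard deviation; the function names and toy values are hypothetical, and the report's R implementation may differ in detail.

```python
from statistics import mean, stdev, variance

def top_variable(expr, influential, n):
    """Rank genes by variance across the influential sample columns; keep the top n."""
    def influential_variance(gene):
        return variance([expr[gene][i] for i in influential])
    return sorted(expr, key=influential_variance, reverse=True)[:n]

def centre_and_scale(row):
    """Centre a gene on its mean and scale by its standard deviation."""
    m, s = mean(row), stdev(row)
    return [(x - m) / s if s > 0 else 0.0 for x in row]

# Made-up expression values: one flat gene and two genes that vary.
expr = {
    "flat": [5.0, 5.0, 5.0, 5.0],
    "geneB": [1.0, 9.0, 1.0, 9.0],
    "geneC": [4.0, 6.0, 4.0, 6.0],
}
influential = [0, 1, 2, 3]  # column indices of the influential samples

selected = top_variable(expr, influential, n=2)
scaled = {gene: centre_and_scale(expr[gene]) for gene in selected}
print(selected)
```

The flat gene is ranked last and dropped, in line with removing the (assumed uninformative) genes before clustering.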

In case that heatmap doesn’t cluster exactly according to the experimental design, in Figure 3 we provide the samples ordered according to one particular interpretation of the experimental design. In simple (one-way) designs where there is only one covariate, this is a fairly straightforward way of gathering samples in a set of columns, but in more complex designs it isn’t obvious whether to sort first by genotype and then treatment, say, or perhaps the reverse, so for this ‘supervised’ clustering we chose the (arbitrary) order in which the terms of the ‘model’ were specified above.

4.2 Clustering centroids

Based on the genes’ expression in the influential samples, the dendrogram from the heatmap will be split into clusters of similarly behaving features. We can aggregate the expression across the features in each cluster and produce dot plots for each of the clusters. See, for example, Figure 4.
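
The aggregation step can be sketched in Python as below, assuming the aggregate is an equally weighted mean over the genes in each cluster. The gene-to-cluster mapping and the expression values are made up for illustration; in the report the mapping comes from cutting the heatmap dendrogram.

```python
from collections import defaultdict
from statistics import mean

def cluster_centroids(expr, cluster_of):
    """Average scaled expression over the genes in each cluster, sample by sample."""
    members = defaultdict(list)
    for gene, row in expr.items():
        members[cluster_of[gene]].append(row)
    # zip(*rows) walks the samples; each column holds one value per member gene.
    return {c: [mean(col) for col in zip(*rows)] for c, rows in members.items()}

# Made-up scaled expression for three genes in two samples.
expr = {
    "g1": [1.0, -1.0],
    "g2": [3.0, -3.0],
    "g3": [-2.0, 2.0],
}
# Hypothetical cluster labels.
cluster_of = {"g1": 1, "g2": 1, "g3": 2}

centroids = cluster_centroids(expr, cluster_of)
print(centroids)
```

Each cluster is thereby reduced to one value per sample, which is what a cluster dot plot displays against the covariates.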

4.3 PC-Covariate association

Principal Components are a way of finding a number of meta-genes: proxies that best reconstruct the multi-dimensional expression signal if we had to choose 1, 2, … individual numbers to represent a sample. In ?@fig-covar-1 the columns explore the lower-order principal components (which, although not providing a universally strong signal transcriptome-wide, might still detect biology that is affecting a smaller subset of genes, or a reasonably large set of genes to a lesser extent than happens in the higher-order PCs). The rows represent the experimental factors we have recorded, and the depth of colour of the cell records the degree of ‘association’ that factor has with the given PC (grey indicates that there was no statistically significant association). This being an automated report, we’ve had to make various compromises so please don’t read too much into this chart: it primarily guides which other plots we should generate to look at how samples associate with each other.
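
The report does not state how the ‘association’ is measured, so the Python sketch below uses one common choice purely as an illustration: eta-squared, the fraction of a PC's variance that lies between the levels of a categorical factor. The PC scores are made up, and both the measure and the function name are assumptions rather than the report's method.

```python
from statistics import mean

def eta_squared(scores, groups):
    """Fraction of the variance in `scores` explained by differences between groups."""
    grand = mean(scores)
    total = sum((s - grand) ** 2 for s in scores)
    between = 0.0
    for level in set(groups):
        members = [s for s, g in zip(scores, groups) if g == level]
        between += len(members) * (mean(members) - grand) ** 2
    return between / total if total > 0 else 0.0

# Made-up scores on one PC for eight samples laid out as in Table 4.
pc_scores = [-3.0, 3.0, -2.0, 2.0, -4.0, 4.0, -3.0, 3.0]
treatment = ["Untreated", "Dexamethasone"] * 4
cell_line = ["N61311", "N61311", "N052611", "N052611",
             "N080611", "N080611", "N061011", "N061011"]

eta_treatment = eta_squared(pc_scores, treatment)
eta_line = eta_squared(pc_scores, cell_line)
print(f"treatment: {eta_treatment:.2f}, cellLine: {eta_line:.2f}")
```

In this toy case the PC is strongly associated with treatment and not at all with cell line, which would correspond to a deeply coloured cell and a pale or grey cell respectively.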

4.4 Principal Component Scatterplots

Then come scatter plots of the first two Principal Components, a way of summarising all the genes down to just two ‘metagenes’ that attempt to encapsulate the variability across the samples in as faithful a way as possible. Each plot (e.g. Figure 5) takes the PCs from the unadjusted data, and colours them according to the plot configuration under consideration.
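
To make the idea of two ‘metagenes’ concrete, here is a small pure-Python sketch that extracts the first two principal component scores by power iteration on the sample-by-sample Gram matrix of the centred data. It is illustrative only: the data are made up, the helper names are hypothetical, and a real analysis would use an established PCA routine rather than this hand-rolled one.

```python
import math

def centre_columns(rows):
    """Subtract each gene's (column's) mean; rows are samples."""
    n = len(rows)
    col_means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, col_means)] for row in rows]

def top_eigenpair(matrix, iterations=500):
    """Leading eigenvalue and eigenvector of a symmetric matrix by power iteration."""
    vec = [1.0 / (i + 1) for i in range(len(matrix))]  # fixed, asymmetric start
    for _ in range(iterations):
        nxt = [sum(a * v for a, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            return 0.0, vec
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the (unit-length) converged vector.
    value = sum(v * sum(a * w for a, w in zip(row, vec)) for v, row in zip(vec, matrix))
    return value, vec

def pc_scores(rows, n_components=2):
    """Per-sample scores on the leading principal components."""
    centred = centre_columns(rows)
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centred] for r1 in centred]
    scores = []
    for _ in range(n_components):
        value, vec = top_eigenpair(gram)
        scores.append([math.sqrt(max(value, 0.0)) * v for v in vec])
        # Deflate: remove the component just found before looking for the next.
        gram = [[g - value * vi * vj for g, vj in zip(row, vec)]
                for row, vi in zip(gram, vec)]
    return scores, centred

# Made-up expression: four samples (rows) by three genes (columns);
# samples 0 and 2 are untreated, samples 1 and 3 are treated.
data = [
    [1.0, 2.0, 0.0],
    [5.0, 6.0, 0.5],
    [1.5, 2.0, 1.0],
    [5.5, 6.0, 1.5],
]

(pc1, pc2), centred = pc_scores(data)
total_ss = sum(x * x for row in centred for x in row)
explained_pc1 = sum(s * s for s in pc1) / total_ss
print("PC1 scores:", [round(s, 2) for s in pc1])
print("share of variance on PC1:", round(explained_pc1, 3))
```

The sign of each component is arbitrary, so only the relative positions of the samples are meaningful; here the untreated and treated samples fall on opposite sides of the first component.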

There’s a slight complexity here in that we may need to produce two such plots to reveal all the information requested in a plot configuration: the configuration might request the use of the horizontal position to represent an aspect of the experimental design, but this is needed to represent the first PC, so it may be necessary to generate a second plot representing the missing aspect as a colour in the chart.

These plots are interactive - hovering over a point will bring up a tooltip listing all the experimental factors pertaining to that sample. Hovering will also select all other samples that share an attribute with that sample - the nominated attribute will be the first one listed in the tooltip.

4.5 Principal Component Dot-Plots

We can choose to look closer at certain principal components that e.g. ?@fig-covar-1 leads us to believe are relevant to the experimental design. Just like the cluster dot-plots, we can take a single value as a representation of the sample and plot it, as in ?@fig-onepc-1, against the covariates according to the chosen plot configuration. Whereas in the cluster dot-plots we equally weighted various subsets of similar genes, in the PC dot-plots we aggregate all the genes but choose a weighting scheme that captures as much of the multi-dimensionality as possible.

Setting analysis parameter

Only use 500 features for unsupervised clustering
The value of gene_clust is bluster::HclustParam(cut.params = list(k = 12))
The value of sample_clust is bluster::HclustParam(metric = "pearson")
The value of pc_x is 1
The value of pc_y is 2
Use the ‘“equal weight”’ heuristic to choose strength of PC-covariate association

5 Dataset D1: All

All samples included

Samples for inclusion in any analysis: TRUE (8)
Samples used, out of those already selected for inclusion, to actively inform analysis (differential, heatmap’s top genes, principal components): All included samples (8)

5.1 Model M1: Simple

Figure 1: Metadata colour-codings for dataset ‘D1’

5.1.1 Plot config P1

Aesthetics:

X = cellLine
Colour = treatment
Grouping variables = treatment

Figure 2: Centred Scaled Expression vst in dataset ‘D1’

Figure 3: Sorted Centred Scaled Expression vst in dataset ‘D1’

Figure 4: Cluster profiles annotated with colour~treatment in dataset ‘D1’

Figure 5: Annotated by colour~treatment in dataset ‘D1’

Figure 6: Annotated by colour~cellLine in dataset ‘D1’

Error in (weights <- weights) <- rep(1, nrow(mat)): could not find function "<-<-"

6 Downloads

counts matrix for D1

vst matrix for D1

norm matrix for D1

6.1 Experimental Group Averages

Expected vst modeled as M1 in D1

6.2 Cluster Composition

Gene-Cluster mappings

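// Convert the column-oriented plot_df into an array of row objects.
plot_rows = transpose(plot_df);

// One drop-down per field, listing that field's distinct values.
viewof dataset = Inputs.select([...new Set(plot_rows.map(d => d.dataset))]);
viewof model = Inputs.select([...new Set(plot_rows.map(d => d.model))]);
viewof label = Inputs.select([...new Set(plot_rows.map(d => d.label))]);

// Keep only the rows that match the current selections.
filtered = plot_rows.filter(row =>
  row.dataset === dataset && row.model === model && row.label === label);

// Show the first matching image, scaled to fit the page width.
html`<img src="${filtered[0].png}" style="max-width: 100%; height: auto;">`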

7 Terms Of Use

The Crick has a publication policy and we expect to be included on publications, regardless of funding arrangements. Any use of these results in publication must be discussed with BABS regarding authorship. If not authorship then the BABS analyst must receive a named acknowledgement. Please also cite the following sources which have enabled the analysis to be carried out.