Exploratory analysis GRCh38 analyse

Exploratory analysis of GRCh38-aligned data according to plan ‘analyse’
Authors
Affiliations

Gavin Kelly [Analyst]

BABS [Development]

Gavin Kelly [Developer, Statistical Methodology]

1 Preface

We load in all the necessary additional R packages, and set up some initial parameters.

Table 1: Context
Setting Value
section Exploratory analysis
res_dir results/v9.9.9
VERSION v9.9.9
TAG _v9.9.9-00eb166
staging_dir staging
file_col ID
name_col sample_name
metadata extdata/metadata_GRCh38.csv
counts extdata/genes.results/GRCh38/
alignment GRCh38
spec analyse
specname analyse
script staging/01_exploratory_analyse_GRCh38.qmd
Note: Setting analysis parameter
  • The value of rowNoun is gene
Note: Setting analysis parameter
  • The value of RowNoun is Gene

We also have the following defaults set. Whenever they are used or changed in the analysis, that will be highlighted in the text in a table identified with an info icon.

Table 2: Analysis Settings
Option Value
alpha 0.05
lfcThreshold 0
baseMeanMin 0
top_n_variable 500
showCategory 25
seed 1
filterFun NULL
gene_clust bluster::HclustParam(cut.params = list(k = 12))
sample_clust bluster::HclustParam(metric = "pearson")
stringsAsFactors FALSE
normalise NULL
impute NULL
baseline_heuristic "min"
pc_weight "equal weight"
LRT_effect "default"
Note: Setting analysis parameter
  • A random seed of 1 is used to ensure reproducibility
  • Using analysis plan ‘analyse’
  • Using alignment settings ‘GRCh38’
  • Use the ‘min’ heuristic to centre the colour-scale
  • Use the ‘default’ summary of an LRT ‘effect size’
  • Discard transcripts with fewer average counts per sample than 0
Table 3: Feature Filtering
dataset filter success condition
D1 universal 100.0% TRUE
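
As an illustration of the count filter described in the note above, here is a minimal Python sketch (not the pipeline's own R code). The counts and the function name `filter_features` are hypothetical, and it assumes that ‘success’ in Table 3 means the percentage of features retained.

```python
from statistics import mean

def filter_features(counts, base_mean_min=0):
    """Keep features whose average count per sample is at least base_mean_min."""
    return {gene: row for gene, row in counts.items() if mean(row) >= base_mean_min}

# Hypothetical counts: gene -> counts across four samples
counts = {
    "geneA": [120, 98, 143, 110],
    "geneB": [0, 0, 1, 0],
    "geneC": [0, 0, 0, 0],
}

# With the default threshold of 0 nothing can fall below it, so every feature is kept
kept_default = filter_features(counts, base_mean_min=0)
# A stricter (hypothetical) threshold of 1 drops the two near-empty features
kept_strict = filter_features(counts, base_mean_min=1)

# Percentage of features that pass the default filter
success = 100.0 * len(kept_default) / len(counts)
```

With a threshold of 0 the filter is effectively switched off, which is consistent with every feature passing in Table 3.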

2 Input Summary

The sample annotations are as follows:

Table 4: Sample Annotation
Metadata
D1
ID cellLine treatment D1 .influential .involved
SRR1039508 N61311 Untreated TRUE
SRR1039509 N61311 Dexamethasone TRUE
SRR1039512 N052611 Untreated TRUE
SRR1039513 N052611 Dexamethasone TRUE
SRR1039516 N080611 Untreated TRUE
SRR1039517 N080611 Dexamethasone TRUE
SRR1039520 N061011 Untreated TRUE
SRR1039521 N061011 Dexamethasone TRUE

3 All Datasets and Models

We may examine the samples in different combinations, and leave out certain samples (either ignoring them entirely, or visualising them without letting them influence the analysis in any way).

In the above table, the columns under the ‘In Subset’ group list these combinations and indicate which samples each one includes. Below, we describe the datasets in terms of the predicates that determine which samples are included in them, and also the ‘influential’ samples that determine: how the genes are clustered; how the principal components are calculated; and which samples are used in subsequent differential analysis.

Usually we will include all samples and they will all be ‘influential’. Depending on the analysis question, though, it might be useful to ignore some samples or experimental groups entirely, and from those that remain select an influential subset that drives the analysis (the non-influential samples are retained in the visualisations and analysis for cross-reference, but won’t actively drive the analysis).

Each of those datasets may be inferentially analysed in potentially several ways, as formulated below. In simple ‘A vs B’ experiments there may only be one inferential approach, but as soon as we have a more complex structure different questions may demand a diversity of approaches, optionally: accounting for various covariates; stratifying the analysis by experimental factors; looking for the ways that one factor modifies the effect of another upon expression;… We use mathematical formulae to specify these approaches:

3.1 Dataset D1: All

All samples included

  • Samples for inclusion in any analysis: TRUE (8)
  • Samples used, out of those already selected for inclusion, to actively inform analysis (differential, heatmap’s top genes, principal components): All included samples (8)
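
The two-stage selection (inclusion, then influence) can be sketched in Python as follows. This is an illustration rather than the pipeline's R implementation: the helper `build_dataset` and the alternative predicates at the end are hypothetical, while the sample annotations are taken from Table 4.

```python
# Sample annotations as listed in Table 4
samples = [
    {"ID": "SRR1039508", "cellLine": "N61311", "treatment": "Untreated"},
    {"ID": "SRR1039509", "cellLine": "N61311", "treatment": "Dexamethasone"},
    {"ID": "SRR1039512", "cellLine": "N052611", "treatment": "Untreated"},
    {"ID": "SRR1039513", "cellLine": "N052611", "treatment": "Dexamethasone"},
    {"ID": "SRR1039516", "cellLine": "N080611", "treatment": "Untreated"},
    {"ID": "SRR1039517", "cellLine": "N080611", "treatment": "Dexamethasone"},
    {"ID": "SRR1039520", "cellLine": "N061011", "treatment": "Untreated"},
    {"ID": "SRR1039521", "cellLine": "N061011", "treatment": "Dexamethasone"},
]

def build_dataset(samples, include, influential):
    """Apply the inclusion predicate first, then pick influential samples from those."""
    included = [s for s in samples if include(s)]
    driving = [s for s in included if influential(s)]
    return included, driving

# Dataset D1: the inclusion predicate is TRUE and every included sample is influential
included, driving = build_dataset(
    samples, include=lambda s: True, influential=lambda s: True
)

# A hypothetical alternative: drop one cell line entirely, and let only the
# untreated samples drive clustering and principal components
inc_alt, drv_alt = build_dataset(
    samples,
    include=lambda s: s["cellLine"] != "N61311",
    influential=lambda s: s["treatment"] == "Untreated",
)
```

In the hypothetical alternative, the three treated samples from the remaining lines would still appear in plots but would not drive gene selection or the principal components.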

Model M1: Simple

Including treatment and line, so that we can look at either one of those effects while accounting for any systematic changes in the other. There is no interaction term, so any modifying effect (of treatment type on the response to line, or vice versa) will be unaccounted for, and genes exhibiting this behaviour will tend not to be selected. There is no replication (i.e. no line x treatment combination has more than one sample), so we have to restrict our model to at most this complexity.

Expression ~ treatment + cellLine
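
To make the no-replication argument concrete, the following Python sketch counts the coefficients each model would need. It is an illustration only: `design_columns` is a hypothetical helper, the coding is assumed to be the usual treatment (baseline) coding, and the baseline level is simply whichever sorts first.

```python
cell_lines = ["N61311", "N052611", "N080611", "N061011"]
treatments = ["Untreated", "Dexamethasone"]
# One sample per line x treatment combination, as in Table 4
samples = [{"cellLine": c, "treatment": t} for c in cell_lines for t in treatments]

def design_columns(samples, factors, interaction=False):
    """Column names of a treatment-coded design matrix for the given factors."""
    cols = ["(Intercept)"]
    levels = {f: sorted({s[f] for s in samples}) for f in factors}
    for f in factors:
        # The first level acts as the baseline, so it gets no column of its own
        cols += [f"{f}{lvl}" for lvl in levels[f][1:]]
    if interaction and len(factors) == 2:
        a, b = factors
        cols += [f"{a}{x}:{b}{y}" for x in levels[a][1:] for y in levels[b][1:]]
    return cols

additive = design_columns(samples, ["treatment", "cellLine"])
with_interaction = design_columns(samples, ["treatment", "cellLine"], interaction=True)

# Residual degrees of freedom = samples - coefficients
df_additive = len(samples) - len(additive)                   # 8 - 5 = 3
df_with_interaction = len(samples) - len(with_interaction)   # 8 - 8 = 0
```

The additive model leaves three residual degrees of freedom for estimating variability, whereas adding the interaction would use up all eight samples and leave none.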

Model M2: Line-only

Just including a line effect, and totally ignoring treatment. So any systematic treatment effect will not be accounted for, and genes exhibiting a change due to treatment will tend not to be selected.

Expression ~ cellLine

Model M3: Treatment-only

Just including a treatment effect, and totally ignoring line. So any systematic differences between lines will not be accounted for, and genes exhibiting a dependency on line will tend not to be selected.

Expression ~ treatment

4 Description of Visualisations

There are different ways of configuring the plots. These roughly divide into two choices:

  • How to illustrate the covariates (treatments, batches, …) in the visualisations: in multi-factorial experiments, we may want to represent one covariate in terms of colouring the point that represents a sample. It might be obvious what these ‘aesthetic’ choices should be, but conversely we may produce multiple ‘plot configurations’ to illustrate different experimental factors.

  • What to use as the ‘outcome’: typically we’ll use the normalised readout in heatmaps, or to generate principal components, but there are situations where we might wish to ‘normalise out’ the effect of, say, batch so that the true biology can come into focus. Or we might wish to ‘recentre’ the data so that one particular experimental group is seen as a baseline.

At the beginning of each ‘plot config’ section we list the respective choice made for the plots in that section: the ‘aesthetic’ mapping describes how each aspect of the plot (colour, x-coordinate, shape, …) reflects an aspect of the experimental design (e.g. treatment); and the ‘reconstruction’ (if present) states which experimental factors have been normalised out (e.g. ‘. - batch’ indicates that the batch effect has been regressed out of the visualisation).
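
As a rough illustration of what ‘normalising out’ a factor means, the Python sketch below removes a batch offset from one gene's values. It assumes the simplest possible case, where batch is the only term being removed and each batch's offset is estimated by its mean; the function `remove_factor` and the data are hypothetical, and the pipeline's actual reconstruction may differ.

```python
from statistics import mean

def remove_factor(values, labels):
    """Subtract each group's mean and add back the overall mean."""
    grand = mean(values)
    group_mean = {
        g: mean(v for v, lab in zip(values, labels) if lab == g)
        for g in set(labels)
    }
    return [v - group_mean[lab] + grand for v, lab in zip(values, labels)]

# Hypothetical normalised expression of one gene in four samples from two batches
expr = [5.0, 7.0, 9.0, 11.0]
batch = ["b1", "b1", "b2", "b2"]

adjusted = remove_factor(expr, batch)
# Batch means are 6 and 10 and the overall mean is 8, so both batches are
# shifted onto a common level: [7.0, 9.0, 7.0, 9.0]
```

After adjustment the within-batch differences are untouched, while the systematic gap between batches has gone.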

Now we give a brief introduction to each of the plot-types we show for each configuration.

4.1 Overall Heatmaps

These visualisations are carried out blind to the experimental design. For the heatmap in Figure 2, we select a subset of genes, removing the (assumed uninformative) genes with flat expression levels across the whole sample set. We’d expect samples from the same experimental group to cluster together, in the sense that they are on the same branch of the tree. But this is a transcriptome-wide picture, and even if the clustering is not perfect, there will still be genes that are consistent with the expected expression profiles, and they will be revealed in the differential analysis.

The values of the covariates relevant to the dataset and model will be illustrated in the colour-bar above the heatmap. Influential samples (often defined as all the samples) will be identified similarly in the colour-bar - the choice of “most variable genes” will be based on their behaviour in the influential samples.

In case that heatmap doesn’t cluster exactly according to the experimental design, in Figure 3 we provide the samples ordered according to one particular interpretation of the experimental design. In simple (one-way) designs where there is only one covariate, this is a fairly straightforward way of gathering samples in a set of columns. In more complex designs it isn’t obvious whether to sort first by genotype and then treatment, say, or perhaps the reverse, so for this ‘supervised’ ordering we use the (arbitrary) order in which the terms of the ‘model’ were specified above.
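
The gene selection and scaling behind such a heatmap can be sketched in Python as below. This is an illustration under assumptions: variance is taken as the measure of ‘most variable’, rows are centred and scaled with the sample standard deviation, and the function names and data are hypothetical rather than taken from the pipeline.

```python
from statistics import mean, stdev, variance

def top_variable(expr, influential, n):
    """Names of the n genes with the highest variance across the influential samples."""
    var = {g: variance([row[i] for i in influential]) for g, row in expr.items()}
    return sorted(var, key=var.get, reverse=True)[:n]

def centre_scale(row):
    """Centre a gene's values on zero and scale them to unit standard deviation."""
    m, s = mean(row), stdev(row)
    return [(x - m) / s for x in row]

# Hypothetical transformed expression: gene -> values in four samples
expr = {
    "flat": [5.0, 5.0, 5.0, 5.0],
    "up": [1.0, 1.0, 9.0, 9.0],
    "noisy": [4.0, 5.0, 4.0, 5.0],
}
influential = [0, 1, 2, 3]   # column indices of the influential samples

selected = top_variable(expr, influential, n=2)   # the flat gene is left out
scaled = {g: centre_scale(expr[g]) for g in selected}
```

Dropping the flat genes first also avoids dividing by a zero standard deviation when scaling.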

4.2 Clustering centroids

Based on the genes’ expression in the influential samples, the dendrogram from the heatmap will be split into groups of similarly behaving features. We can aggregate the expression across those features and produce dot plots for each of the clusters. See, for example, Figure 4.

4.3 PC-Covariate association
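
A minimal Python sketch of the aggregation step follows, assuming that each cluster is summarised by the plain average of its genes' scaled expression in each sample. The gene-to-cluster mapping, the data and the function `cluster_centroids` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def cluster_centroids(scaled, cluster_of):
    """Average the scaled expression of each cluster's genes, sample by sample."""
    members = defaultdict(list)
    for gene, row in scaled.items():
        members[cluster_of[gene]].append(row)
    # zip(*rows) walks across samples, giving one column of gene values per sample
    return {c: [mean(col) for col in zip(*rows)] for c, rows in members.items()}

# Hypothetical scaled expression (gene -> four samples) and cluster labels
scaled = {
    "g1": [-1.0, -1.0, 1.0, 1.0],
    "g2": [-0.5, -1.5, 0.5, 1.5],
    "g3": [1.0, 1.0, -1.0, -1.0],
}
cluster_of = {"g1": 1, "g2": 1, "g3": 2}

centroids = cluster_centroids(scaled, cluster_of)
# Cluster 1 rises across the samples, cluster 2 falls
```

Each centroid is then a single value per sample, which is what a cluster dot plot displays against the covariates.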

Principal Components are a way of finding a number of meta-genes - proxies that best reconstruct the multi-dimensional expression signal if we had to choose 1, 2, … individual numbers to represent a sample. In ?@fig-covar-1 the columns explore the lower-order principal components (which, although not providing a universally strong signal transcriptome-wide, might still detect biology that is affecting a smaller subset of genes, or a reasonably large set of genes to a lesser extent than happens in the higher-order PCs). The rows represent the experimental factors we have recorded, and the depth of colour of the cell records the degree of ‘association’ that factor has with the given PC (grey indicates that there was no statistically significant association). This being an automated report, we’ve had to make various compromises, so please don’t read too much into this chart: it primarily guides which other plots we should generate to look at how samples associate with each other.
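
One simple way to quantify how strongly a categorical factor is associated with a PC is the proportion of the PC's variance explained by the factor's groups (eta-squared). The Python sketch below uses that measure purely as an assumption for illustration; the report does not state which association measure or significance test the pipeline applies, and the PC scores here are invented.

```python
from statistics import mean

def eta_squared(scores, labels):
    """Fraction of the variance in scores that lies between the label groups."""
    grand = mean(scores)
    total = sum((x - grand) ** 2 for x in scores)
    between = 0.0
    for g in set(labels):
        grp = [x for x, lab in zip(scores, labels) if lab == g]
        between += len(grp) * (mean(grp) - grand) ** 2
    return between / total

# Hypothetical scores of eight samples on one principal component
pc_scores = [-3.0, 3.0, -2.0, 2.0, -4.0, 4.0, -1.0, 1.0]
treatment = ["Untreated", "Dexamethasone"] * 4
cell_line = ["N61311"] * 2 + ["N052611"] * 2 + ["N080611"] * 2 + ["N061011"] * 2

assoc_treatment = eta_squared(pc_scores, treatment)   # most of the variance
assoc_cell_line = eta_squared(pc_scores, cell_line)   # none of the variance
```

In this invented example the PC would be shaded deeply in the treatment row and left pale or grey in the cell-line row.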

4.4 Principal Component Scatterplots

Then come scatter plots of the first two Principal Components - a way of summarising all the genes down to just two ‘metagenes’ that attempt to encapsulate the variability across the samples in as faithful a way as possible. Each plot (e.g. Figure 5) takes the PCs from the unadjusted data, and colours them according to the plot configuration under consideration.

There’s a slight complexity here in that we may need to produce two such plots to reveal all the information requested in a plot configuration: the configuration might request the use of the horizontal position to represent an aspect of the experimental design, but this is needed to represent the first PC, so it may be necessary to generate a second plot representing the missing aspect as a colour in the chart.
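
For readers who want to see how two ‘metagenes’ arise from the data, here is a small self-contained Python sketch that finds the first two principal components by power iteration with deflation. It is a teaching illustration on invented data, not the routine the pipeline uses, and it assumes the centred data has at least two non-zero components.

```python
import math

def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def leading_eigen(A, iters=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix, by power iteration."""
    v = [float(i + 1) for i in range(len(A))]   # arbitrary non-zero start
    for _ in range(iters):
        w = matvec(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(a * b for a, b in zip(v, matvec(A, v)))
    return lam, v

def first_two_pcs(X):
    """Scores of each sample (row of X) on the first two principal components."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    Xc = [[x - m for x, m in zip(row, means)] for row in X]   # centre each gene
    cols = list(zip(*Xc))
    C = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    lam1, v1 = leading_eigen(C)
    # Deflation: remove the first component so the second becomes the leading one
    k = len(C)
    C2 = [[C[i][j] - lam1 * v1[i] * v1[j] for j in range(k)] for i in range(k)]
    lam2, v2 = leading_eigen(C2)
    return matvec(Xc, v1), matvec(Xc, v2), lam1, lam2

# Hypothetical data: four samples (rows) by three genes (columns)
X = [
    [1.0, 10.0, 5.0],
    [2.0, 10.0, 7.0],
    [9.0, 11.0, 5.0],
    [10.0, 11.0, 7.0],
]
pc1, pc2, lam1, lam2 = first_two_pcs(X)
```

Plotting `pc1` against `pc2` gives the kind of scatter plot described above: here the first component separates the first two samples from the last two, and the second picks up the remaining, weaker pattern.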

These plots are interactive - hovering over a point will bring up a tooltip listing all the experimental factors pertaining to that sample. Hovering will also select all other samples that share an attribute with that sample - the nominated attribute will be the first one listed in the tooltip.

4.5 Principal Component Dot-Plots

We can choose to look more closely at certain principal components that e.g. ?@fig-covar-1 leads us to believe are relevant to the experimental design. Just like the cluster dot-plots, we can take a single value as a representation of the sample and plot it, as in ?@fig-onepc-1, against the covariates according to the chosen plot configuration. Whereas in the cluster dot-plot we equally weighted various subsets of similar genes, in the PC dot-plots we aggregate all the genes but choose a weighting scheme that captures as much of the multi-dimensionality as possible.

Note: Setting analysis parameter
  • Only use 500 features for unsupervised clustering
  • The value of gene_clust is bluster::HclustParam(cut.params = list(k = 12))
  • The value of sample_clust is bluster::HclustParam(metric = "pearson")
  • The value of pc_x is 1
  • The value of pc_y is 2
  • Use the ‘equal weight’ heuristic to choose strength of PC-covariate association

5 Dataset D1: All

All samples included

  • Samples for inclusion in any analysis: TRUE (8)
  • Samples used, out of those already selected for inclusion, to actively inform analysis (differential, heatmap’s top genes, principal components): All included samples (8)

5.1 Model M1: Simple

Including treatment and line, so that we can look at either one of those effects while accounting for any systematic changes in the other. There is no interaction term, so any modifying effect (of treatment type on the response to line, or vice versa) will be unaccounted for, and genes exhibiting this behaviour will tend not to be selected. There is no replication (i.e. no line x treatment combination has more than one sample), so we have to restrict our model to at most this complexity.

Figure 1: Metadata colour-codings for dataset ‘D1’

5.1.1 Plot config P1

Aesthetics:

  • X = cellLine
  • Colour = treatment
  • Grouping variables = treatment
Figure 2: Centred Scaled Expression vst in dataset ‘D1’

Figure 3: Sorted Centred Scaled Expression vst in dataset ‘D1’

Figure 4: Cluster profiles annotated with colour~treatment in dataset ‘D1’

Figure 5: Annotated by colour~treatment in dataset ‘D1’
Figure 6: Annotated by colour~cellLine in dataset ‘D1’
Error in (weights <- weights) <- rep(1, nrow(mat)): could not find function "<-<-"

6 Downloads

counts matrix for D1

vst matrix for D1

norm matrix for D1

6.1 Experimental Group Averages

Expected vst modeled as M1 in D1

6.2 Cluster Composition

Gene-Cluster mappings

7 Terms Of Use

The Crick has a publication policy and we expect to be included on publications, regardless of funding arrangements. Any use of these results in publication must be discussed with BABS regarding authorship. If not authorship then the BABS analyst must receive a named acknowledgement. Please also cite the following sources which have enabled the analysis to be carried out.