Analysis Philosophy
This is a wrapper around DESeq2 which standardises the process of
generating DESeq2::results objects for a specifiable set of
comparisons. Each comparison is done in the context of a model (a
specification of which covariates need to be accounted for in predicting
the expression of any transcript), and each model is formulated in a
sample-set (which is usually all the samples, but can be changed to drop
samples, or to look at a stratum in isolation).
There are three classes of input to the analysis pipeline: the raw
count files; the experiment table; and an analysis specification. The
count files should be stored in a folder
extdata/alignmentID/genes.results/ relative to the analysis
home directory, where alignmentID is a freeform name that
identifies the alignment settings and choices (so could just be
‘nfcore’, or a species name, or a version number of a reference …).
The experiment table is called
extdata/metadata_alignmentID.csv (to match with the choice
of the folder name above) and the first column should be called
ID and identifies the sample that a row is describing, so
if extdata/nfcore/genes.results/A1.genes.results is sample
A1 that is a control, then extdata/metadata_nfcore.csv
should start
ID,treatment,...
A1,control,...
Other columns should be used to entirely describe the experimental
state of the sample that might have an influence on expression. There is
a convention that the first column after ID which
takes on unduplicated values throughout the rows is used as the
sample label in plots (e.g. a samplename in column 2). If
there is no such column, the ID column will be used. In
either case, the label will be shorn of any common prefix, so if the IDs
are SAMP999A1 … SAMP999A10, and no other unduplicated column is stored,
then the samples will be labelled 1…10.
The plan as to what analyses are to be carried out is stored in a
.spec file, and there can be as many of these as you want -
so you could have an overview.spec file that tried out a
handful of approaches, and a publication.spec that recorded
the chosen approach. Each .spec file is in fact an R script
(and so should be treated with as much caution, security-wise, as any
other script), but is expected to be of a certain form; the first
criterion is that it should start and end as follows:
specification(
  ...
)
Sample sets
The highest level of any standardised analysis is a choice as to what samples are to be analysed together. Often this is trivial, and the only sensible analysis is to look at all the samples together. But even in the simplest experiment, it might be observed for example that one sample is an outlier and we’d like to examine the effect of omitting or retaining this sample. We can accomplish this easily:
specification(
  sample_sets = list(
    all = sample_set(
      subset = TRUE,
      name = "All Samples", # an optional short name
      description = "All samples included in the analysis" # an optional longer description
    ),
    no_outlier = sample_set(
      subset = sample_id != "id_of_bad_sample"
    )
  )
)
So a collection of samples is called a sample_set, and
we have two (a list) of them here. The subset
expression of a sample_set is evaluated in the context of the
colData, so the first one always evaluates to
TRUE, denoting that we include all samples; the second one
evaluates to FALSE for a particular sample (assuming we
have a sample_id column in our colData), dropping that
sample.
Another reason we may want to subset our samples into different configurations: if there are two very different cohorts of samples, transcripts are likely to have substantially different noise profiles in each, and we want to avoid contaminating one set of samples with the dispersion estimates that hold for the other set. Of course it then becomes impossible to statistically test between the two cohorts, but we can have a third set that still combines them:
specification(
  sample_sets = list(
    all = sample_set(
      subset = TRUE
    ),
    tumour = sample_set(
      subset = condition == "tumour"
    ),
    normal = sample_set(
      subset = condition == "normal"
    )
  )
)
It’s possible for sample_sets to inherit subsets from
the specification level. We discuss this concept later on, but a simple example should suffice to
illustrate how we can refine a ‘global’ subset:
specification(
  subset = ! sample_id %in% c("bad1", "bad2"), # get rid of two samples that failed qc
  sample_sets = list(
    all_good = sample_set(
      ... # No need for a subset here
    ),
    tumour = sample_set(
      subset = condition == "tumour" # this will get 'AND'ed with the top-level `subset`
    ),
    normal = sample_set(
      subset = condition == "normal" # as will this.
    )
  )
)
There’s also an influential_samples field that takes the
same form. This is to allow situations where you want to focus the
statistical testing on a subset of samples but not limit the
visualisations of the differential results to those samples. So in the
above example, the differential heatmaps of the tumour samples wouldn’t
show the behaviour in the normals. Setting subset = TRUE
and supplying the restriction of the test as
influential_samples = condition == "tumour" would visualise
just the differential genes in the tumour subset (same number of
genes/rows), but show their expression in all samples (larger number of
samples/columns). Samples excluded by an
influential_samples predicate will not
play a role in any of the following:
- Contributing to the calculation of gene variance when selecting variable genes for exploratory heatmaps and cluster centroids
- Calculating the gene:gene distances for the heatmaps
- Calculating the direction of the principal components
- Any part of the DESeq2 processing
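As a minimal sketch (reusing the condition column from the cohort example above, with a hypothetical set name), a sample_set that restricts the testing to the tumour samples but keeps all samples in the visualisations might read:

```r
specification(
  sample_sets = list(
    tumour_centric = sample_set(
      subset = TRUE,                               # every sample appears in heatmaps/PCA
      influential_samples = condition == "tumour"  # only tumour samples drive the testing,
                                                   # variance selection and distances
    )
  )
)
```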
The conditions on which we subset must be defined in the
experiment table. It is possible to transform some of the predictors
(columns of colData) using the other main component of a
sample_set, the transform:
Transform
Each specification can have a single transform which is
a dplyr::mutate clause that will operate on the sample
metadata (but with no need for the initial data object):
specification(
  transform = mutate(
    treatment = relevel(factor(treatment), "control"),
    patient_id = factor(patient_id),
    number_of_visits = table(patient_id)[as.character(patient_id)],
    .include = number_of_visits > 1
  ),
  ...
Notice that we change/create various covariates within a single
mutate - having multiple transforms is a
mistake as only one will be applied. As mentioned above,
subset only operates on covariates prior to the transform
being applied - if we want to utilise transformed
covariates in the subset, the special covariate .include
should be generated in the transform, and the colData will
be post-processed to only include samples for which that variable is
TRUE.
Also, each sample_set can have its own
transform. In this case, the individual
sample_set’s version will take complete precedence: the
global transform will not be applied at all to
sample_sets with their own transform, and it
will simply be used in sample_sets which don’t have their
own transform.
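To illustrate the precedence rule with a minimal hypothetical sketch: here tumour applies only its own transform, while all falls back to the global one:

```r
specification(
  transform = mutate(
    treatment = relevel(factor(treatment), "control")  # global transform
  ),
  sample_sets = list(
    all = sample_set(subset = TRUE),  # no transform of its own: the global one applies
    tumour = sample_set(
      subset = condition == "tumour",
      transform = mutate(batch = factor(batch))  # replaces, rather than extends,
                                                 # the global transform
    )
  )
)
```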
Extra Assays
In some situations it is necessary to add extra assays that represent
additional pre-processing steps - for instance in strongly paired data
(for example where each biological sample is split into, say,
cytoplasmic and nuclear subsamples) it might be preferable to look at
one of the pair normalised against its pair-mate. This is purely for
visualisation purposes: for differential analysis it is always worth
sticking with the primary assay (counts). This can be achieved by
setting extra_assays in a dataset as a list of
extra_assay calls:
subset = ...,
extra_assays = list(
  to_nuclear = extra_assay(
    method = "normalise", from = "vst",
    compartment = "nuclear",
    design = ~compartment:treatment:replicate,
    recentre = FALSE, rescale = FALSE)
),
visualise = compartment != "nuclear",
...
Currently there are only two methods implemented:
normalise and identity. normalise
takes:
- a design argument that identifies the groups within which to normalise
- an assay from to use as the base assay
- options to recentre and rescale the data (these default to TRUE for our visualisations)
- other arguments, which set the baseline value against which everything in a group (defined by the design) will be normalised
This clearly leaves, in this case, the nuclear samples
normalised to themselves (so identically zero across all genes). To
avoid this disrupting the visual picture, we have a third
dataset-specific predicate visualise (alongside
subset and influential_samples) that will
remove these samples from plots (this is optional, but if you dislike
seeing a primarily white set of columns in your heatmaps, this is the
way to eradicate them.)
The method="identity" allows us to create assays
identical to the from assay, but allows us to switch off the
centring and/or scaling.
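For example, a hypothetical extra assay that copies the vst values while switching off both centring and scaling might be declared as:

```r
extra_assays = list(
  uncentred_vst = extra_assay(
    method = "identity", from = "vst",  # same values as the vst assay ...
    recentre = FALSE, rescale = FALSE   # ... but neither centred nor scaled
  )
)
```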
We’ll see how to use the extra assays in the plotting section. But a reminder: these extra assays are ignored in the differential testing, as the inferences and estimations are always done on the primary assay. You’ll need to model the pairing explicitly.
The extra assays accumulate, so you can have a later
to_T0 that could be generated
from="to_nuclear", for instance.
Models
DESeq2 uses R’s standard one-sided formulas to inform the analysis as
to which predictors to use in fitting a negative-binomial model of
expression counts. The colData slot should contain all the
potential predictors, and the design slot of a DESeq2
object encodes a considered choice of which are relevant to the
biological question. Obviously, if the question is to find which genes
are differentially expressed between two treatments, then a predictor
annotating which treatment a sample is subject to is necessary. But
quite often there will be other covariates that need to be accounted for,
such as batch; or perhaps there’s a subtler question of finding genes
that respond differently to treatment depending on the genotype of the
sample.
We may want to consider alternative models, for instance whether or not to include a batch effect: if the batch effect is negligible we won’t want to use up degrees of freedom (ie power) estimating it, but if it’s substantial then we want to make our estimates more precise by accounting for it. So we allow, in any given sample_set, for there to be multiple models:
specification(
  sample_sets = list(
    all = sample_set(
      subset = TRUE,
      models = list(
        accurate = model(
          design = ~treatment + batch
        ),
        powerful = model(
          design = ~treatment,
          name = "Just Treatment",
          description = "Account for treatment, ignore batch - if there is a batch effect it'll increase the noise, though"
        )
      )
    )
  )
)
So once again the plural ‘models’ is a list of individual
model objects, which for the moment we’ve just specified by
their design, exactly as we would for a DESeq
object. Once we have a model, we are then in a position to test the
significance of contrasts (e.g. individual log fold-changes between
conditions) and terms in the model (e.g. whether a batch effect is
necessary or not):
Universal models
Sometimes all/some models should apply to all
sample_sets in a spec file. To save typing, it’s possible
to elevate the models to the top-level within a
specification, so for example:
specification(
  models = list(
    powerful = model(
      design = ~treatment
    )
  ),
  sample_sets = list(
    all = sample_set(
      subset = TRUE,
      models = list(
        accurate = model(
          design = ~treatment + batch
        )
      )
    ),
    batch1 = sample_set(
      subset = batch == "B1"
    )
  )
)
will result in the all dataset having two models (the
universal one from the top of the spec file, and a specific one),
whereas the batch1 dataset will have only the universal
one.
There are other similar patterns for other parts of the spec file, where certain properties ‘cascade’ down to other parts of the hierarchy - see the ‘Cascading properties of a spec file’ section below for a list of behaviours.
Comparisons
DESeq2 has multiple ways of specifying a contrast. All of these are supported, and again as ‘comparisons’ is plural, we can have a list of them, and have the pipeline loop through all of them for the parent model:
specification(
  sample_sets = list(
    all = sample_set(
      subset = TRUE,
      models = list(
        accurate = model(
          design = ~treatment + batch,
          comparisons = list(
            comp1 = "treat_vs_ctrl",
            comp2 = c("treatment", "treat", "ctrl"),
            comp3 = list(c("treat_vs_ctrl"), c("vehicle_vs_ctrl"), listValues = c(1, -1)),
            comp4 = c(1, 0, -0.5, -0.5),
            comp5 = ~batch,
            comp6 = comparison(c("treatment", "treat", "ctrl"), name = "Treat-control", description = "Treatment vs Control")
          )
        )
      )
    )
  )
)
So all of the traditional ways of specifying a contrast are represented, respectively:
- a single character, traditionally done in DESeq2 with results(..., name="treat_vs_ctrl")
- a character triplet
- a two-component list of characters indicating the numerator and denominator (with possible listValues set)
- a numeric vector
- a formula specifying the reduced term for carrying out an ANOVA via a likelihood ratio test
You are referred to the DESeq2 reference manual for further details.
They can also be wrapped in a comparison call, to add extra
information about the comparison, that might get used in the report. But
one advantage of our wrapper is that it allows a simple enumeration of
complex comparisons using machinery from the emmeans
package:
Automatic comparison enumeration
We often find ourselves copying a contrast multiple times, with slight changes to elicit comparisons between different conditions. For example if our samples have been subject to three different treatment regimes, and a control, so treatment={control, vehicle, standard, novel_treatment}, and we wanted to compare everything to control, then we’d need to hand-write three comparisons. Instead we can now write
specification(
  sample_sets = list(
    all = sample_set(
      subset = TRUE,
      models = list(
        accurate = model(
          design = ~treatment + batch,
          comparisons = list(
            mult_comp(trt.vs.ctrl ~ treatment, ref = "control")
          )
        )
      )
    )
  )
)
where the trt.vs.ctrl is a ‘keyword’ instructing as to
which comparisons to carry out - in this case every value of
treatment against the ones labelled “control”. There are a
number of other helpful keywords:
- revpairwise is the most ‘verbose’, in that it will generate every pair of comparisons - in this case there are six combinations, and the number can go up rapidly. Due to a slightly annoying convention, pairwise chooses an unintuitive ordering that would result in comparisons such as ‘control vs novel_treatment’ (where a positive fold change would mean higher expression in the control than the novel treatment), whereas revpairwise has the more conventional ordering of ‘novel_treatment vs control’, and its more intuitive positive log fold-change indicating higher expression in the treated than the control. This assumes the levels of the factor are given in the conventional order of the previous paragraph.
- consec takes adjacent pairs along the levels, so again quite natural: ‘vehicle vs control’, ‘standard vs vehicle’ and finally ‘novel_treatment vs standard’.
- trt.vs.ctrl chooses one baseline (which we can select with an additional ref argument, rather than having to relevel the data) and compares everything else against it.
- mean_chg looks to see if splitting the levels at any particular point reveals a change between levels before vs after that breakpoint, so amongst other things it would test if the average of the vehicle and control was different from the average of the standard and novel treatment.
- eff tests whether each individual level is different from the average of the remaining levels.
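Each keyword is simply dropped into the left hand side of the mult_comp formula, so for the four-level treatment example the variants above would be requested as follows (the comparison names are arbitrary):

```r
comparisons = list(
  all_pairs   = mult_comp(revpairwise ~ treatment),                  # all six ordered pairs
  adjacent    = mult_comp(consec ~ treatment),                       # vehicle vs control, standard vs vehicle, ...
  vs_baseline = mult_comp(trt.vs.ctrl ~ treatment, ref = "control"), # everything vs control
  breakpoints = mult_comp(mean_chg ~ treatment)                      # before-vs-after each possible split
)
```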
These latter two still estimate the standard error in the
individual experimental groups, so the significance is still assessed
relative to replicate variability. This is different from ‘relabelling’
the samples (via a transform) so that e.g. there was one
group called “both controls” and another called “both treatments”, where
the standard error would include not only the ‘true’ replicate
variability but also the discrepancy between the former experimental
groups that have now been pooled.
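For contrast, the ‘relabelling’ approach would look something like the following (the pooled column is hypothetical), and carries the inflated standard errors just described:

```r
transform = mutate(
  pooled = ifelse(treatment %in% c("control", "vehicle"),
                  "both_controls", "both_treatments")  # pooled groups: replicate noise now
                                                       # includes between-group discrepancy
)
```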
When we don’t have an interaction term in our model, this is everything we’d ever need, as for example the difference between ‘novel_treatment’ and ‘control’ is by construction (of the model formula) independent of which batch we’re in.
The keywords are inherited from the emmeans package, so that is a good place for extra background material on their meaning.
One thing to note is that emmeans has a different way of
handling continuous covariates, so to specify a test of difference of
slopes, it’s necessary to request that functionality by providing a
var argument to the mult_comp call, e.g.:
...
# design = ~ Infection * time + Participant_in_infection:infection
mult_comp(revpairwise ~ Infection, var="time")
is the correct formulation within the mult_comp
framework to estimate the per-participant rate of change of expression
over time, and then see if that differs between pairs of
groups of patients that are distinguished by their
Infection.
Interaction terms
When our model includes an interaction between two main effects, say between treatment and genotype, then the magnitude of difference between two treatments can depend on the genotype. This leads to two types of follow-up question: stratifying the data by one factor, and then examining the effect of the other factor in each individual stratum, or; examining the interaction itself, so looking to see if the magnitude of effect is different in one stratum than another.
To make this more concrete for the genotype x treatment case, we could ask: which genes are responding to treatment in the KO; which genes are responding to treatment in the WT; which genes are differential between WT and KO in the untreated. These are all questions of the first type, and we can automatically generate their contrasts by:
specification(
  sample_sets = list(
    all = sample_set(
      subset = TRUE,
      models = list(
        accurate = model(
          design = ~treatment * genotype,
          comparisons = list(
            mult_comp(revpairwise ~ treatment | genotype),
            mult_comp(revpairwise ~ genotype | treatment)
          )
        )
      )
    )
  )
)
So the | symbol is to be read as “stratified by” (or
“conditioned on”). The first mult_comp is stratifying the
genotype, and will generate the pairwise comparisons within treatment,
separately for each different genotype. The second is the dual question:
For each treatment, find the genes that are ‘markers of genotype’ within
that treatment group.
The second type of question, where we want to investigate a ‘difference of differences’, is achieved by the following grammar:
mult_comp(revpairwise ~ treatment + genotype, interaction=TRUE)
The terms on the right hand side of the formula enumerate the
predictors we’re concerned with, and the left hand side specifies the
contrasts that are going to be considered (so any of the keywords from
the previous section) - there needs to be just one, applied to all
factors. If we omit the interaction=TRUE argument then it
will enumerate all possible combinations of treatment and genotype, and
then contrast every combination against every other combination -
probably not what is wanted (e.g. treated KO vs untreated WT); keeping
the interaction=TRUE term is more usual, and builds a
contrast that represents the differences of differences, e.g. treatment
effect in KO - treatment effect in WT (where treatment effect in KO is
treated KO - untreated KO.)
The above is probably quite intimidating, so please ask for help on this - I’ll try to write something a bit more friendly.
Caveat
With the ease of testing so many hypotheses comes a risk of fishing the data, resulting in spurious statistical associations. Strictly speaking, anything beyond one comparison on the data requires a more conservative approach than is achieved even with the traditional control for FDR to account for the multiplicity of genes.
The ideal situation is the tests are prespecified. If that is the case, then we should really run an anova to confirm that a change exists somewhere, and then do at most n-1 comparisons (where n is the number of levels in the factor of interest, so at most three comparisons for our four-treatment example) and choose transcripts where both the anova and contrast are significant. There’s a further technical constraint that the comparisons should be independent but that is often ignored. If there are more comparisons of interest, then the p-values won’t be controlled at the correct rate.
Often, the ideal situation is not achieved, and an exploratory approach is requested. Being strict one should immediately correct for the abundance of hypotheses in this situation. Pragmatically and traditionally we neither adjust in this situation nor consider possible adjusting in the pre-specificied case - this means that the p-values aren’t correctly calibrated, and one of the reasons why I discourage presenting them in any quantitative form as part of the results.
Also remember the “Table Two Fallacy” - any inference made about non-randomly-allocated groups is an observational finding, and doesn’t carry the same rigour as inferences where treatment was randomly allocated. Observations of associations can easily be spurious and non-causal.
Rank deficiency
One of the most common hurdles in complex designs is the rank-deficiency problem: the way the experiment has been described can leave the analysis unable to identify an estimate for a particular experimental condition. This happens for one of three reasons:
- The design has a fatal flaw. If all your treated samples are in one batch, and all your control samples in another batch, you cannot estimate the extent to which treatment and batch individually influence expression.
- Some conditions have no samples. For example, in our genotype x treatment experiment, we might not have a ‘vehicle’ treatment in one of our genotypes.
- There is a nesting indicative of repeated measurements. Individuals might have submitted multiple samples (perhaps a time-course), and also be members of larger groups (their disease status, which is assumed constant throughout the time-course).
The latter two are remediable through some tweaks to the inputs of DESeq2; the first will need the advice of a statistician to see if anything is recoverable, and at what compromise.
The ‘missing condition’ case is easily remedied by supplying the
option drop_unsupported_combinations=TRUE to any
model you want this to apply to. The ‘nesting’ case normally
requires a tweak to the way factors are coded, and we have made it easy
to achieve this through a transform option to any
sample_set. Both options are illustrated below. Note, it is
important that there be only one
transform=mutate component per sample_set (if you want to
introduce/transform multiple derived covariates, then separate them by
commas within the mutate expression):
specification(
  sample_sets = list(
    all = sample_set(
      subset = TRUE,
      transform = mutate(unit = recode_within(unit, genotype),
                         time = as.numeric(sub("^t", "", timepoint))),
      models = list(
        nested = model(
          design = ~timepoint * genotype + person:genotype,
          comparisons = list(
            mult_comp(revpairwise ~ timepoint + genotype, interaction=TRUE)
          ),
          drop_unsupported_combinations = TRUE
        )
      )
    )
  )
)
The ‘missing condition’ case is achieved by just the one line; the
nesting is more complex - we firstly have to have a ‘unit’ column in our
colData that identifies which biological unit (person,
animal, plate) a sample belongs to. Then the transform line
calls a recode_within function that tells us that unit is
nested within genotype here. And we alter the design in
accordance with the DESeq2 vignette on within- and between-group
comparisons.
The transform statement can be used to adjust any of the
colData columns in-line. It’s a question of personal
preference whether ‘mistakes’ in the original experiment table should be
fixed at source (which allows the scientist to review changes manually),
or via a transformation specified in the spec file (which keeps an
entirely faithful record of what is done).
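As a small hypothetical sketch of the second approach, a typo in the original experiment table could be patched via the spec file with dplyr::recode, keeping the csv untouched:

```r
transform = mutate(
  treatment = recode(treatment, "controll" = "control"),  # fix a misspelled level in-line
  batch = factor(batch)                                   # and tidy the types while we're at it
)
```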
Other functionality
Profile formulae
In the exploratory plots, we present PCA plots and dendrograms. Primarily these are of the normalised counts without any further processing, but there is a mechanism for removing the effect of known covariates prior to the calculation of variable genes (for heatmaps) and rotations (for PCA). This is useful in cases where the dominant transcriptional variation might be driven by “nuisance” covariates, such as batch or patient, so that the simply normalised plots focus on experimentally uninteresting features.
This is not the same as looking at an unadjusted PCA plot, noticing that the main clusters appear to relate to batch, then noticing that, say, timepoint rather than treatment is determining the subclusters within the large clusters, and from this concluding that timepoint is more “important” than treatment. That is a fallacy, as the original dimension reduction is looking in the “wrong” direction. Recalculating the PCs having removed the nuisance effect is more valid.
By default, no additional plots are generated - this is a change from
previous qc_formulae versions, because we were starting to
see more complex designs where the default option (producing all colours
x covariate plots) was becoming unmanageably verbose. Now it is
necessary to add descriptions of the profiles to be plotted
explicitly:
all = sample_set(
  subset = TRUE,
  models = list(
    nested = model(
      design = ~treatment * time + batch,
      profile_plots = list(
        aes(x=time, colour=batch, shape=treatment) ~ treatment + time + timepoint * time + batch,
        aes(x=time, colour=treatment) ~ treatment + time + timepoint * time,
        aes(y=vs_t0, x=time, colour=treatment) ~ . - batch
      )
    )
  )
)
Everything on the LHS is used as an aesthetic mapping for any given plot, and the model on the right says which covariates’ effects are kept in the ‘normalised’ data: anything that is omitted is effectively normalised/regressed-out.
In the first element of the profile_plots list, the RHS
of the formula is algebraically identical to the design, so
no effects will be regressed out. But plots will be drawn so that time
is on the horizontal axis and expression on the vertical axis - and
samples will be coloured by batch, and shaped according to which
treatment group they were in.
The second element will produce a different set of plots, this time
we’ve regressed out (corrected-for) batch, so the PCA and heatmaps will
be different from those on raw vst data. Note, for the two-dimensional
PC plots, we need both the x and y aesthetics
to represent the projection, so the time aspect of this
data would not be represented in the visualisation: to get round this,
we automatically generate a plot where we use the colour
aesthetic to represent the overwritten x.
The third example illustrates the standard use of the .
placeholder in R formulae to represent the ‘existing’ model. Here, the
. represents the design, and the - is the
standard Wilkinson notation to remove an effect, so the third
example is a different way of phrasing the second instance: regress out
the effect of batch.
The third example also illustrates how to access an
extra_assay, as defined in the Extra Assays section -
simply setting the y aesthetic to map to the name of the
extra assay will replace the default assay (vst-transformed normalised
counts) in the plots.
Clustering
We use the bluster package to cluster samples and
features in our heatmaps, via the gene_clust and
sample_clust settings. The defaults are:
settings( ...
  gene_clust = bluster::HclustParam(cut.params = list(k = 12)), ## When we need to choose clusters of genes, how many?
  sample_clust = bluster::HclustParam(metric = "pearson"), ...
)
The k there determines the number of feature clusters
that a heatmap is broken into (which are then used in the profile-plots,
so we’ll be plotting e.g. 12 centroids). The metric can be any
dist metric, or cor method.
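So, as a sketch, asking for fewer gene clusters and a different sample linkage (both standard HclustParam arguments from bluster) might look like:

```r
settings(
  gene_clust = bluster::HclustParam(cut.params = list(k = 6)),  # 6 centroids in the profile plots
  sample_clust = bluster::HclustParam(metric = "pearson", method = "complete")
)
```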
Sample swaps
If a pair of samples have been swapped, this can be corrected in the spec file.
The correction will swap the metadata of the sample with ID A
with that of the sample with ID B. Every column of
the metadata will be exchanged: this is generally what is required
(remember that ID is what is used to connect the metadata
to the correct quantified counts file):
| ID | Name | Treatment | Batch |
|---|---|---|---|
| A | T1 | Treated | B1 |
| B | C1 | Control | B1 |

correctly becomes

| ID | Name | Treatment | Batch |
|---|---|---|---|
| A | C1 | Control | B1 |
| B | T1 | Treated | B1 |
Telling us that the contents of the nfcore file
A.genes.results are from a control sample from batch B1.
But if somehow the swap had not affected the Name, and the
situation that needed correcting was
| ID | Name | Treatment | Batch |
|---|---|---|---|
| A | C1 | Treated | B1 |
| B | T1 | Control | B1 |
then you might want to consider fixing the anomaly manually (perhaps
via a mutate) as swapping around that table will result in
the somewhat confusingly “Named”
| ID | Name | Treatment | Batch |
|---|---|---|---|
| A | T1 | Control | B1 |
| B | C1 | Treated | B1 |
(the correct association between A and “Control Batch 1” has been made, but the name is now “off” - unimportant for the analysis, but counter-intuitive).
Technical Replicates
Sometimes a scientist has done purely technical replicates. Ideally these should be concatenated at the nfcore level, but sometimes they get kept separate. We can correct for this in the spec file, for example:
all = sample_set(
  subset = TRUE,
  collapse = ~treatment + biological_replicate + timepoint + batch,
  ...
The collapse formula should identify the distinct experimental
groups, so in the above formula we might expect that there are multiple
samples (rows in the experiment_table) with the same values for each of
those four variables; so there may be three rows with
treatment=control, biological_replicate=3,
timepoint=7hr and batch=1 corresponding to
three technical replicates. Ensure you have enough terms in your model
to distinguish all experimental conditions and true replicates.
Keeping technical replicates in (without a collapse) is
almost always bad, as the standard errors are badly miscalculated.
Normalisation strategies
We can fine-tune which genes are analysed and which genes are used to calculate size factors:
all = sample_set(
  subset = TRUE,
  controlGenes = expression(grepl("^ercc", ID)),
  subsetGenes = expression(symbol != ""),
  ...
controlGenes and subsetGenes both take an R
expression that should evaluate to a boolean on the featureData of the
dataset (traditionally, symbol and entrez get
set, and ID is my convention for accessing the row.names
(probably the Ensembl ID)). The size factors are calculated on all rows
that pass the controlGenes filter, and then the subset filter is applied
to all the original rows (so one can calculate on the spike-ins, and
then drop them out).
For SummarizedExperiment objects (e.g. proteomics or
metabolomics analysis, but not yet DESeq2 RNASeq analyses), we
have the capability to stratify the normalisation process: adding a
strata = ~ batch to a sample_set definition
will separately calculate the imputation and vst transformations for
each stratum defined by the formula.
External genelists
If there are lists of a priori interesting genes, the analysis can be informed of these via an additional csv file, the location of which is supplied with a line in the spec-file:
external_list="extdata/lists.csv"
This can either be an element of the settings, or the
specification or any individual sample_set,
and is expected to be of the form:
| ID | Comparison | Cluster |
|---|---|---|
| S1 | Pos Control | C1 |
| S2 | Pos Control | C2 |
| S3 | Pos Control | C3 |
| S4 | Public | C4 |
| S5 | Public | C4 |
ID needs to match up with the row.names of
your assay, Comparison will determine which genes are
grouped together in a plot (given the same colour in scatter plots;
given their own plot in a heatmap or profile-plot), and the
Cluster is only used in the profile plots: genes with the
same Cluster within a Comparison will be averaged, to provide a single
trace - if Cluster is omitted, each symbol will be plotted as its own
singleton cluster.
Imputation
For limma-like analyses (see this vignette on running just the ‘differential’
phase of the analysis), it is possible to define different
imputation strategies:
sample_set(...,
normalise = "vsn",
impute = list(method="MinProb", q=0.01),
...
)

We will document this more fully shortly.
Settings
Other than the overall modelling strategy, it’s important to be able
to tune the analysis by choosing the parameters behind the algorithms.
It’s possible to set any parameter that the core analysis script has
made available to you through the top-level settings
option:
specification(
sample_sets =list(
...
),
settings=settings( ## analysis parameters
alpha = 0.05, ## p-value cutoff
lfcThreshold = 0, ## abs lfc threshold
baseMeanMin = 0, ## discard transcripts with average normalised counts lower than this
top_n_variable = 500, ## For PCA
showCategory = 25, ## For enrichment analyses
seed = 1, ## random seed gets set at start of script, just in case.
filterFun = IHW::ihw, ## NULL for standard DESeq2 results, otherwise a function
gene_clust = bluster::HclustParam(cut.params = list(k=12))
  )
)

Cascading properties of a spec file
To save repeating oneself, there are various aspects of a spec file
that will propagate down the
specification > sample_sets > models > comparisons
hierarchy. We list the expected behaviour of those:
- models can be set at the top specification level, and will get prepended to any sample-set specific models.
- subset and influential_samples can be set at the top specification level, and will get boolean-ANDed with sample-set specific subsets.
- transform can be set at the top specification level, and will only be used when a sample-set doesn’t specify its own transform.
- drop_unsupported_combinations can be set at the specification and/or sample_set level, and will only be used when a model doesn’t specify it directly.
- drop_incomplete can be set at the specification and/or sample_set level, and will only be used when a model doesn’t specify it directly.
- profile_plots can be set at the specification and/or sample_set level, and will be added to any model-specific plot profiles.
- extra_assays can be set at the specification level, and will be added to any sample_set-specific ones.
These are intended to ‘do what I mean’ for the most typical situations.
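As a sketch of the cascade (the field values and subset expressions are illustrative), a subset and transform set at the specification level combine with, or are overridden by, each sample_set's own:

```r
specification(
  subset = batch != "failed_run",  ## boolean-ANDed with each sample_set's subset
  transform = "vst",               ## used only where a sample_set has no transform
  sample_sets = list(
    S1 = sample_set(subset = treatment %in% c("control", "drug")),
    S2 = sample_set(transform = "rlog")  ## overrides the specification-level transform
  )
)
```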
Less verbose specification
Sometimes the spec file can seem somewhat long-winded:
specification(
sample_sets =list(
S1=sample_set(
models=list(
M1=model (
comparisons=list(
c1 = comparison(
...

We can cut this down in three ways:
Shortcuts for parent-list > child
There is a set of functions sample_sets,
models and comparisons which replace the whole
plurals=list(singular... construction. Using R’s
meta-programming capabilities, these expand to the same syntax as the
original above, so there’s often no need for the explicit
list.
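For instance (a sketch: the structure is carried over from the verbose example above, with the plural helper functions replacing each plural=list( pair):

```r
specification(
  sample_sets(
    S1 = sample_set(
      models(
        M1 = model(
          comparisons(
            c1 = comparison(...)
          )
        )
      )
    )
  )
)
```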
Automatic wrapping of objects
A further level of abbreviation allows us to say that any singular
object (model, sample_set, comparison) that’s not in a corresponding
list (models, samples_sets, comparisons) - in either of the above
syntaxes - should get wrapped in the corresponding ‘plural list’
wrapper. For example, here we have a singular model that
isn’t explicitly container in a models list, but the
pipeline will now automically adjust for that, so the following is
valid:
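A sketch of such an implicitly wrapped singular model (the design and comparison arguments follow the conventions used elsewhere in this vignette; treat the values as illustrative):

```r
specification(
  sample_sets(
    S1 = sample_set(
      ## a bare model(), not wrapped in models = list(...) or models(...):
      ## the pipeline wraps it in a models list automatically
      model(
        design = ~ treatment,
        comparisons(c1 = comparison(mult_comp(revpairwise ~ treatment)))
      )
    )
  )
)
```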
We can have multiple singletons, and give them short IDs to aid identification.
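For example (a guess at the naming syntax, based on the surrounding examples; the subset expressions are illustrative), two bare sample_set objects with short names get wrapped into a sample_sets list automatically:

```r
specification(
  all     = sample_set(subset = TRUE),
  trimmed = sample_set(subset = !(ID %in% c("outlier1", "outlier2"))),
  model(design = ~ treatment)
)
```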
Automatic nesting
Sometimes we’ll want, say, the same model fitted to each subset. If a model is defined at the top of the specification (i.e. alongside, rather than within, a sample_set), then it will be added to all the sample_sets it sits alongside.
specification(
  sample_set(subset = TRUE),
  sample_set(subset = !(ID %in% c("outlier1", "outlier2"))),
  model(design = ~ treatment + batch)
)

will result in two sample_sets, each fitted with the same design.
Similarly a comparison that is not within a model will be applied across the elements it is defined alongside:
specification(
  sample_set(subset = TRUE),
  sample_set(subset = !(ID %in% c("outlier1", "outlier2"))),
  model(design = ~ treatment + batch),
  model(design = ~ treatment),
  comparison(mult_comp(revpairwise ~ treatment))
)

will result in two sample_sets, each fitted with two models, and the comparison between treatments applied to these four settings separately. This is the same as the cascading idea mentioned previously, but works particularly well in combination with the two other tricks.
Availability of these helper functions
The plural( shortcut notation for
plural=list( is available for the following objects:
- sample_sets
- models
- comparisons
- missingness
- feature_filters
- settings
- strata
- profile_plots
- extra_assays
basically, anywhere one used to type plural=list(!
The automatic wrapping of singleton objects in their corresponding plurals is limited to the primary spec-file objects:

- sample_set
- model
- comparison
- extra_assay
- profile_plot