Enrichment analysis GRCh38 analyse

Enrichment and over-representation analysis of GRCh38-aligned data according to plan ‘analyse’

Author	Role	Affiliation
Gavin Kelly	Analyst	Francis Crick Institute
BABS	Development	Francis Crick Institute
Gavin Kelly	Developer, Statistical Methodology	Francis Crick Institute

1 Preface

We load in all the necessary additional R packages, and set up some initial parameters.

Setting	Value
section	Enrichment analysis
res_dir	results/v9.9.9
VERSION	v9.9.9
TAG	_v9.9.9-00eb166
staging_dir	staging
file_col	ID
name_col	sample_name
metadata	extdata/metadata_GRCh38.csv
counts	extdata/genes.results/GRCh38/
alignment	GRCh38
spec	analyse
specname	analyse
script	staging/03_enrichment_analyse_GRCh38.qmd

Setting analysis parameter

A random seed of 1 is used to ensure reproducibility.
Using analysis plan ‘analyse’.
Using alignment settings ‘GRCh38’.

2 Introduction to Functional Analysis

In contrast to the differential analysis, functional analyses (which attempt to attach biological meaning to the differential gene lists) make a lot of subtle assumptions. We provide a ‘default’ set of analyses to help make sense of the lists we have provided so far, but please treat them as a starting point, and don’t read too much into the presence or absence of any particular pathway in this report.

Functional analysis has three components: a comparison of expression between experimental states; a database of gene-sets; and an algorithm that determines the association between the first two. Let us explore each of these components a bit more:

2.1 Expression comparisons

The differential report that this page is derived from returned a number of comparisons, so those should already be familiar. We will analyse each of them separately in what follows, but combine them into natural groups for presentation. One subtlety is that those reports may contain two classes of test:

a contrast, which estimates a single effect size such as fold-change, and tests whether it is compatible with a null value
a “goodness of fit” model comparison which doesn’t have a single representative “effect size”, as it is testing a null of whether e.g. expression is flat over multiple time points, or many treatments (against an alternative that there exists some difference amongst the many possible pairs of contrasts).

This difference will affect which algorithm we use. We have traditionally attempted to fake an effect size for the latter (using the largest pairwise difference to represent the size), but this is somewhat contentious, so please check with us when interpreting it.
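The ‘largest pairwise difference’ stand-in described above can be sketched as follows. This is a minimal illustration in Python, not the code used to build this report; the function name, the group labels and the expression values are all hypothetical.

```python
from itertools import combinations


def largest_pairwise_difference(group_means):
    """Return (difference, group_a, group_b) for the pair of groups whose
    mean log-expression differs most in absolute value.

    The difference is computed as group_b minus group_a, with groups
    taken in sorted label order (an arbitrary, illustrative convention).
    """
    best = None
    for a, b in combinations(sorted(group_means), 2):
        diff = group_means[b] - group_means[a]
        if best is None or abs(diff) > abs(best[0]):
            best = (diff, a, b)
    return best


# Hypothetical mean log2 expression of one gene at three time points.
means = {"t0": 5.0, "t1": 6.5, "t2": 4.0}
proxy = largest_pairwise_difference(means)
print(proxy)
```

Note that the sign of the result depends on the arbitrary ordering of the groups, and that different genes may be represented by different pairs of groups. Both are reasons to treat this stand-in with caution.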

2.2 Databases of annotated genes

We traditionally look at Reactome pathways and/or GO molecular functions. It is possible for us to restrict or generalise these to suit the specific question (even if we don’t actually update the text that gets automatically included in every report), so if there are other sources of gene-sets that you might want to examine in the context of your expression results, do let us know. Essentially, any collection of gene-sets (i.e. gene IDs that we can map back to transcripts found in your sequencing) can be tested for association.
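To make ‘any collection of gene-sets’ concrete, here is a minimal Python sketch that represents a collection as a mapping from set name to gene IDs and keeps only the genes that were actually measured. The function name, the `min_size` cut-off and all identifiers are hypothetical, chosen for illustration rather than taken from this report’s pipeline.

```python
def restrict_gene_sets(gene_sets, measured_genes, min_size=2):
    """Keep only measured genes in each set, and drop sets that become
    too small to test. min_size is an illustrative threshold."""
    measured = set(measured_genes)
    restricted = {}
    for name, genes in gene_sets.items():
        kept = set(genes) & measured
        if len(kept) >= min_size:
            restricted[name] = kept
    return restricted


# Hypothetical gene-set collection and list of measured gene IDs.
gene_sets = {
    "pathway_A": {"g1", "g2", "g3"},
    "pathway_B": {"g4", "g9"},
}
measured = ["g1", "g2", "g4", "g5"]
usable = restrict_gene_sets(gene_sets, measured)
print(usable)
```

Here `pathway_B` is dropped because only one of its genes maps back to the measured data.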

2.3 Enrichment vs Over-representation algorithms

There are two broad approaches, which differ in which output of the expression comparison they use. Each expression contrast (e.g. WT vs KO) gives, for every gene:

a significance flag indicating whether the gene is called as differentially expressed while controlling the false discovery rate, and
an effect size (though see the ‘goodness of fit’ comment above).
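As an illustration of how a per-gene significance flag can be derived, the sketch below applies the Benjamini-Hochberg adjustment to a handful of p-values. This is an assumption for the sake of example: the report does not state which false-discovery-rate procedure or threshold was used, and the p-values and the 0.05 cut-off are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
p = [0.001, 0.04, 0.03, 0.5]
padj = benjamini_hochberg(p)
flags = [q < 0.05 for q in padj]  # illustrative FDR threshold
print(padj, flags)
```

Only the first gene is flagged here, even though three raw p-values are below 0.05, which shows how much the binary flag depends on the adjustment.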

2.3.1 Over-representation

An over-representation analysis takes the first of these as the ‘truth’: a very binary picture that can be very powerful if one believes the significance calls, and that has the advantage of being easily and compactly interpretable. You have two subsets of the whole transcriptome, your differential gene list and one particular pathway’s genes, and over-representation tests whether they have more in common than two random subsets of the same sizes would. For pathways where this is the case, we report back two useful measures:

Relative Effect - how many times larger the overlap is than would be expected by chance. This is exactly the statistic we test to decide whether a pathway:genelist combination is over-represented. The ‘null’ value would be 1, when the number of statistically differential genes in a pathway is exactly as expected from the sizes of the pathway and the differential genelist.
Proportion of Differential - To give some ‘scientific significance’ context, we also report the fraction of the pathway that is actually differential. Theoretically it’s possible for a pathway with only one differential gene to be flagged as significantly associated, so to guide sensible selection of candidates to explore, we recommend you also look at this statistic.
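The two measures above can be sketched from four counts. The Python below is a minimal illustration, not this report’s implementation: it assumes a one-sided hypergeometric test for the p-value (the report does not name the test it uses), and the function name and all counts are hypothetical.

```python
from math import comb


def over_representation(universe_size, pathway_size, n_differential, overlap):
    """Return (relative_effect, proportion_differential, p_value)."""
    # Overlap expected by chance from the two subset sizes alone.
    expected = pathway_size * n_differential / universe_size
    relative_effect = overlap / expected
    proportion_differential = overlap / pathway_size
    # One-sided hypergeometric tail: P(overlap at least as large as observed).
    upper = min(pathway_size, n_differential)
    favourable = sum(
        comb(pathway_size, k)
        * comb(universe_size - pathway_size, n_differential - k)
        for k in range(overlap, upper + 1)
    )
    p_value = favourable / comb(universe_size, n_differential)
    return relative_effect, proportion_differential, p_value


# Hypothetical counts: 1000 measured genes, a 50-gene pathway,
# 100 differential genes, 15 of which fall in the pathway.
rel, prop, pval = over_representation(1000, 50, 100, 15)
print(rel, prop, pval)
```

With these counts, 5 genes would be expected in the overlap by chance, so the Relative Effect is 3 and the Proportion of Differential is 0.3. An overlap of exactly 5 would give the null Relative Effect of 1.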

2.3.2 Enrichment analysis (aka GSEA)

If we have an estimate of effect-size (change in expression) for each gene, we might be interested in the value of this statistic for the genes in our pathway, regardless of whether they passed our criterion for statistical significance: if they’re all showing the same degree of change, then we may assert that the pathway is of interest.

We present the standard statistics for GSEA. The enrichment score tracks how much the distribution of effect-sizes in the pathway differs from the distribution of effect-sizes of genes not in the pathway, and the normalised enrichment score (NES) attempts to make this comparable amongst differently sized pathways. But the interpretation is subtle, so we recommend looking at plots of the ‘Running Enrichment Score’ for individual cases to make sure you’re interpreting the summary statistics correctly. These plots would take up a lot of space in what is already a large report, so they are best delivered as and when needed.
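To give a feel for what a ‘Running Enrichment Score’ is, the sketch below follows the usual weighted running-sum construction: genes are ranked by effect size, the score steps up at each pathway gene (in proportion to its effect size) and steps down at every other gene, and the enrichment score is the largest deviation from zero. This is a simplified Python illustration with hypothetical gene names and effect sizes; the software used for this report may differ in details such as weighting and tie handling.

```python
def running_enrichment_score(effects, pathway, weight=1.0):
    """Return (ranked_genes, running_scores, enrichment_score)."""
    # Rank genes from most up-regulated to most down-regulated.
    ranked = sorted(effects, key=effects.get, reverse=True)
    hits = [g for g in ranked if g in pathway]
    if not hits or len(hits) == len(ranked):
        raise ValueError("pathway must contain some, but not all, ranked genes")
    hit_total = sum(abs(effects[g]) ** weight for g in hits)
    if hit_total == 0:
        raise ValueError("pathway genes all have zero effect size")
    miss_step = 1.0 / (len(ranked) - len(hits))
    running, score = [], 0.0
    for gene in ranked:
        if gene in pathway:
            score += abs(effects[gene]) ** weight / hit_total
        else:
            score -= miss_step
        running.append(score)
    # The enrichment score is the running value furthest from zero.
    return ranked, running, max(running, key=abs)


# Hypothetical log fold-changes; the pathway holds the two most
# up-regulated genes, so the score peaks early in the ranking.
effects = {"g1": 2.0, "g2": 1.5, "g3": 0.5, "g4": -0.5, "g5": -1.0, "g6": -2.0}
ranked, running, es = running_enrichment_score(effects, {"g1", "g2"})
print(ranked, running, es)
```

The sketch stops at the raw enrichment score. The NES additionally requires a null distribution of scores (typically from permutations) to rescale against, which is omitted here.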

This is a somewhat complementary view to over-representation. If the two paint very different pictures, then it is likely to be worthwhile examining closely what one means by “pathway regulation”. The landscape of “enrichment phenotypes” is a very broad one (are you looking for pathways whose genes are all up-regulated, or ones whose changes are of broadly similar magnitude but in opposite directions?).

2.4 GO Enrichment

Table 1

GO Enrichment - D1 M1

Table 2

GO Enrichment - D1 M2

Table 3

GO Enrichment - D1 M3

2.5 GO Over-representation

Table 4

GO Over-representation - D1 M1

Table 5

GO Over-representation - D1 M2

Table 6

GO Over-representation - D1 M3

2.6 Reactome Enrichment

Table 7

Reactome Enrichment - D1 M1

Table 8

Reactome Enrichment - D1 M2

2.7 Reactome Over-representation

Table 9

Reactome Over-representation - D1 M1

Table 10

Reactome Over-representation - D1 M2

Table 11

Reactome Over-representation - D1 M3

3 Terms Of Use

The Crick has a publication policy and we expect to be included on publications, regardless of funding arrangements. Any use of these results in publication must be discussed with BABS regarding authorship. If not authorship then the BABS analyst must receive a named acknowledgement. Please also cite the following sources which have enabled the analysis to be carried out.