Installation
It’s necessary to configure git so that the author of any analyses you create is correctly assigned; this is a one-off process if you haven’t done it before:
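For example (a minimal sketch - substitute your own name and email address):
git config --global user.name "Firstname Surname"
git config --global user.email "firstname.surname@example.org"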
Installation into a project
TLDR:
ts clone hash=ab123
(you can do ts -n clone hash=ab123 to see what it’s
going to do). This does everything discussed in the rest of this
section, but directly into a directory that conforms to BABS’ usual
project structure based on the ticket ID. It only works if you have
BABS’ ts command ‘installed’ (it’s just an alias, alias ts='make -f /nemo/stp/babs/working/time/makefile', which is probably obsolescent).
The whole pipeline release is stored at
/nemo/stp/babs/working/bioinformatics/templates/rnaseq.tar.gz,
and on the github
release page. This archive contains a babs folder
that holds the pipeline logic, but it also includes a .github folder
and a .gitignore file, so watch out if you want to preserve
your existing copies of those. The expectation is that the
babs folder from the archive will live alongside the
.babs file; the only other thing the pipeline needs is
the docs folder from the project’s repo. You can
either git clone that (from nemo or from github), or you
can create a git worktree from the nemo copy which allows
other users to see you have a branch.
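As a sketch (the repository location, destination and branch name below are placeholders, not real paths):
# option 1: clone the project repo (from nemo or from github)
git clone /path/to/project/repo my-project
# option 2: add a worktree from the nemo copy on a new branch,
# so other users can see you have a branch
cd /path/to/nemo/copy/of/project
git worktree add -b my-analysis-branch /path/to/my/working/copy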
Some project repositories used to come with an old version of the
pipeline ‘installed’. If that applies, I recommend removing it: delete
everything under ./babs/ except the docs folder.
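One way to do that (a sketch - run the bare find first to check what would be deleted):
cd babs
find . -mindepth 1 -maxdepth 1 ! -name docs          # list what will go
find . -mindepth 1 -maxdepth 1 ! -name docs -exec rm -rf {} +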
Once the project has a .babs file and a
babs/docs subfolder, one can simply run
tar -xzf
/nemo/stp/babs/working/bioinformatics/templates/rnaseq.tar.gz
from the project’s top level to get the pipeline.
Prerequisites in docs
The pipeline expects the following prerequisites in the
babs/docs folder. Most of them should have been prepared
during the design discussions. (There is also a way of running just the
statistical part of the pipeline described
here starting from the quantified raw counts/abundances.)
experiment_table.csv
The experiment table is a CSV file containing the sample-level experimental metadata.
- ⚠️ This file must be in CSV format.
- ⚠️ Please check the first column, which must be named ID. It is left blank at the design stage, but must be filled in with labels that correspond to the IDs used in the fastq files.
- ⚠️ If there’s another column in the table that has unduplicated text values, that column is picked up as a convenient sample label; otherwise the ID column will be used as the label.
It is best practice to create a draft of this file using the information provided in the proposal document, prior to the design meeting. The table can then be used to frame the experimental design discussion with the scientist, where required comparisons and potential confounding covariates can be checked.
This table will be cross-referenced against the samples passing QC on submission to the ASF, to ensure the experimental aims can be achieved with the samples being sequenced.
In the analysis, this file will be the sole source of truth for the experimental metadata. Columns that only contain numeric values will be treated as such in the analysis, so if they are used in a design without first casting them to a character vector, it will be assumed that you are interested in genes that follow a strict linear gradient. Using a numeric “Replicate” column is particularly fraught - actually the intent underlying any use of a column to denote replicate should always be questioned.
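For illustration only, a minimal (entirely hypothetical) table might look like the following, with ID values matching the fastq IDs and Replicate kept as text:
ID,Genotype,Replicate
KO_1,KO,r1
KO_2,KO,r2
WT_1,WT,r1
WT_2,WT,r2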
*.spec
These denote the statistical analysis plan(s) that will be carried out. Usually there will be just one, but you can have more, which will create multiple results-websites. There is a whole article on spec files.
*.config
These define the extra parameters needed to specify the alignment process. Most of them are in direct correspondence to the nfcore parameters - we require something along the following lines:
genome_fasta=/path/to/fasta.fa
gtf=/path/to/genome.gtf
org.db=org.Hs.eg.db
nfcore=-profile crick -resume -r 3.10.1
It’s recommended to name the config file after the genome being used (e.g. GRCh38), as this label will be propagated through to subsequent file names. Again, you can have multiple config files (nfcore will be called as many times, and each spec file will be applied to each set of results). So for dual genomes, you can analyse them separately, or include an additional fasta option in a single config file. We use star_rsem as the default aligner - if you want to use Salmon, then an additional line:
aligner=star_salmon
in the config will suffice.
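For example, a hypothetical mouse config named GRCm39.config that switches to Salmon might look like this (the paths are placeholders):
genome_fasta=/path/to/GRCm39.fa
gtf=/path/to/GRCm39.gtf
org.db=org.Mm.eg.db
nfcore=-profile crick -resume -r 3.10.1
aligner=star_salmon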
There are extra options if
you are joining the pipeline at the statistical stage (where we only
need the org.db line, the other three only being used by
the nfcore parts of the pipeline.)
Usage
Whole pipeline
TLDR: make all in the babs folder will run
the whole pipeline. make all sbatch_args= will do the same
but run the final differential stage over slurm.
cd into the ./babs folder, and there should
be a makefile there, along with folders for each ‘module’ of
functionality. docs is the source of all truth;
ingress will copy everything from docs (a
trivial operation, but if we ever change the ‘API’ this layer will let us
adapt reasonably easily). nfcore is the first non-trivial module -
it quantifies the counts. And differential is the main body.
Each module follows the same pattern: make run will perform
a pre-flight check to ensure it has all the prerequisites from upstream
modules, and then launch into generating the required outputs (reports,
intermediate files, processed data objects).
make all from the top level babs directory
is hard-wired to go to the endpoint module (currently the
differential module) and do make run from
there, so the two approaches are equivalent.
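In other words, the following are equivalent (a sketch, assuming the standard layout):
# from the top-level babs directory
make all
# ...is the same as...
cd differential && make run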
The sbatch_args option signals that the
differential part of the analysis should be submitted to the slurm
scheduler. So, if you want to run the job under slurm:
make all sbatch_args=
make all sbatch_args=--time=01-00
make all sbatch_args="--time=01-00 --job-name='Hello World'"
The sbatch_args= can be empty (purely to flag to make that sbatch is
required, but with default parameters), or contain one or more sbatch
command-line options.
Just the differential part
If you’re using the full bulk RNASeq pipeline, everything should be
set up for you. There will be a folder called differential.
To get the three main inputs, metadata needs to be transferred from the
docs folder (the original source of the experiment table and
spec files that describe the analysis to be carried out), and data (the
aligned counts) need to be transferred from the nfcore
module. Achieving that should just be a matter of running
make run (or make run sbatch_args=) in the
differential folder. However:
If you’re taking counts etc from different sources, then it’s
possible to join the pipeline just at this differential
stage by manually ‘simulating’ the ‘gathering’ stage that
make run would do in an end-to-end context. Set an
environment variable that stores a tag (alphanumeric please) that
identifies how you quantified your data: I’m going to use the tag
GRCh38 as a toy value. Then you’ll need to achieve the
following to enable a manual analysis:
cd differential
export alignment=GRCh38 # change `GRCh38` to a meaningful (but arbitrary) tag for your quantification
mkdir -p extdata/genes.results/${alignment}
# change `path-to` to the right location!
cp /path-to/*.genes.results extdata/genes.results/${alignment}/
# If there exists an 'nfcore' directory, rename it so that it doesn't get launched
[ ! -d ../nfcore ] || mv ../nfcore ../unused-nfcore
## If you have a ../docs file with `*.spec`, `experiment_table.csv` and
## `${alignment}.config` correctly set up, you can skip all remaining steps
## and do a `make run` now. Otherwise:
# First column is `ID` and its values match the extensionless basenames of *.genes.results
cp /path-to/experiment_table.csv extdata/metadata.csv
cp /path-to/experiment_table.csv extdata/metadata_${alignment}.csv
cp /path-to/*.spec .
# make sure XXX is the correct annotation package
echo "${alignment}.org.db=org.XXX.eg.db" > extdata/${alignment}.config
That should replace everything that would have been automatically
retrieved into the differential folder from outside if one
were running the whole pipeline.
Following on from that, the folder will locally contain everything
needed for future analyses to be entirely self-contained: changes to the
local copies of the spec files and the contents of extdata
are sufficient (you can copy changes back to their original
source modules for reference, but if the ‘originals’ remain older then
they won’t overwrite these local copies even if another build is carried
out.)
This will not work if there are sibling folders from the pipeline
present, as the pipeline may still consult those directories for the
prerequisites - if you experience problems, please move any of the
folders docs/ingress/nfcore so
that they’re not next to the differential folder.
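For example, something along these lines (a sketch) parks them out of the way:
cd differential
for d in docs ingress nfcore; do
  [ -d ../$d ] && mv ../$d ../unused-$d
done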
Complete isolation: starting from pre-processed data
There is a separate vignette on how to get a subset of the pipeline that starts from a single DESeqDataSet object (or SummarizedExperiment or CSV file) alongside a spec file, and produces the reports from there.
Differential analysis
The vast majority of work is carried out by:
# cd babs/differential
make run
# or to auto-submit via slurm
# make run sbatch_args="--time=04:00:00"
which will discover all spec files in the current
directory, and analyse the data according to the analysis plan.
The rough steps this goes through are:
- Set up a singularity environment that isolates a version of R and Quarto for reproducibility
- Set up the R packages that are necessary to run the analysis
- ‘Join’ the quarto scripts in the resources folder together with the spec files and count files, so that the staging folder contains scripts for all the analyses required.
- Render the quarto ‘project’ that encompasses all these staging reports in the singularity container.
More often than not, the spec file from the initial design isn’t
correct/sufficient. Editing it in place is absolutely fine, and a repeat
make run will take a fresh attempt. There’s no harm in a
rm -rf staging to clear out old scripts. Similarly the
results directory can get populated with old files -
manually cleaning these out is fine from a pipeline perspective. But if
you want to keep separate generations of results, it is worth performing
a git tag to bump the version of the analysis as this will
put results in a separate subfolder.
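For example (the tag name here is illustrative), tag before re-running:
git tag -a v2 -m "second generation of results after spec revisions"
make run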
The quarto report will be an html website. The individual quarto
scripts get translated from their ‘raw template’ form in
resources, to their realised form in the
staging directory. These also produce objects in the
data folder for further bespoke analysis.
Output
R object
The bioinformatician’s output is contained in the
data/counts_alignmentID.rda object (where
alignmentID is the label described in the very first part of
this document). If you have R running with the working directory at the top
level of the project, you will be able to access it with
R> data(counts_alignmentID)
That is the pure DESeq2 data object prior to processing. We also
store the analysed data in data/specID_x_alignmentID (where
specID identifies the name of the spec file used). This is a
triply-nested list, called dds_model_comp. The first level
corresponds to the datasets, so for example
R> data(analysis_x_nfcore)
R> names(dds_model_comp)
[1] "all" "tumour" "normal"
corresponds to our situation where we analysed the tumour/normal cohorts both together and individually. The next level of the list corresponds to the models that were run on that dataset:
R> names(dds_model_comp$all)
[1] "accurate" "powerful"
might represent the two different approaches we used to account for a potential batch effect. And finally the third level corresponds to the individual comparisons:
R> names(dds_model_comp$all$accurate)
[1] "comp1" "comp2" "comp3" "comp4"
R> class(dds_model_comp$all$accurate$comp1)
[1] "DESeqDataSet"
R> mcols(dds_model_comp$all$accurate$comp1)$results
DataFrame with 19319 rows and 11 columns
baseMean log2FoldChange lfcSE stat pvalue padj weight symbol entrez class shrunkLFC
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <character> <character> <character> <numeric>
So we have the results stored in the mcols
of each DESeqDataSet that’s stored in the third level of the list. You
can use purrr::map_depth(dds_model_comp, 3, ~ mcols(.x)$results) as one way
of extracting these, or loops, or …