Installation
It’s necessary to configure git so that the author of any analyses you create is correctly assigned; this is a one-off process if you haven’t done it before:
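For example (a minimal sketch - substitute your own name and email address):
git config --global user.name "Firstname Surname"
git config --global user.email "firstname.surname@example.org"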
Installation into a project
TLDR:
ts clone hash=ab123
(you can do ts -n clone hash=ab123 to see what it’s
going to do). This does everything discussed in the rest of this
section, but directly into a directory that conforms to BABS’ usual
project structure based on the ticket ID. It only works if you have
BABS’ ts command ‘installed’ (it’s just an alias, alias ts='make -f /nemo/stp/babs/working/time/makefile', which is probably obsolescent).
The whole pipeline release is stored at
/nemo/stp/babs/working/bioinformatics/templates/rnaseq.tar.gz,
and on the github
release page. This archive contains a babs folder
that holds the pipeline logic, but it also includes a .github folder
and a .gitignore file, so watch out if you want to preserve
your existing copies of those. The expectation is that the
babs folder from the archive will live alongside the
.babs file; the only other thing the pipeline needs is
the docs folder from the project’s repo. You can
either git clone that (from nemo or from github), or you
can create a git worktree from the nemo copy which allows
other users to see you have a branch.
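As a sketch (the repository location, destination and branch name below are placeholders, not real paths):
# option 1: clone the project repo (from nemo or from github)
git clone /path/to/project/repo my-project
# option 2: add a worktree from the nemo copy on a new branch,
# so other users can see you have a branch
cd /path/to/nemo/copy/of/project
git worktree add -b my-analysis-branch /path/to/my/working/copy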
Some project repositories used to come with an old version of the
pipeline ‘installed’. If that applies, I recommend removing it: delete
everything under ./babs/ except the docs folder.
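One way to do that (a sketch - run the bare find first to check what would be deleted):
cd babs
find . -mindepth 1 -maxdepth 1 ! -name docs          # list what will go
find . -mindepth 1 -maxdepth 1 ! -name docs -exec rm -rf {} +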
Once the project has a .babs file and a
babs/docs subfolder, one can simply run
tar -xzf
/nemo/stp/babs/working/bioinformatics/templates/rnaseq.tar.gz
from the project’s top level to get the pipeline.
Prerequisites in docs
The pipeline expects the following prerequisites in the
babs/docs folder. Most of them should have been prepared
during the design discussions. (There is also a way of running just the
statistical part of the pipeline described
here starting from the quantified raw counts/abundances.)
experiment_table.csv
The experiment table is a CSV file containing the sample-level experimental metadata.
- ⚠️ This file must be in CSV format.
- ⚠️ Please check the first column, which must be named ID. It is left blank at the design stage, but must be filled in with labels that correspond to the IDs used in the fastq files.
- ⚠️ If there’s another column in the table that has unduplicated text values, that column is picked up as a convenient sample label; otherwise the ID column will be used as the label.
It is best practice to create a draft of this file using the information provided in the proposal document, prior to the design meeting. The table can then be used to frame the experimental design discussion with the scientist, where required comparisons and potential confounding covariates can be checked.
This table will be cross-referenced against the samples passing QC on submission to the ASF, to ensure the experimental aims can be achieved with the samples being sequenced.
In the analysis, this file will be the sole source of truth for the experimental metadata. Columns that only contain numeric values will be treated as such in the analysis, so if they are used in a design without first casting them to a character vector, it will be assumed that you are interested in genes that follow a strict linear gradient. Using a numeric “Replicate” column is particularly fraught - actually the intent underlying any use of a column to denote replicate should always be questioned.
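For illustration only, a minimal (entirely hypothetical) table might look like the following, with ID values matching the fastq IDs and Replicate kept as text:
ID,Genotype,Replicate
KO_1,KO,r1
KO_2,KO,r2
WT_1,WT,r1
WT_2,WT,r2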
*.spec
These denote the statistical analysis plan(s) that will be carried out. Usually there will be just one, but you can have more, which will create multiple results-websites. There is a whole article on spec files.
*.config
These define the extra parameters needed to specify the alignment process. Most of them are in direct correspondence to the nfcore parameters - we require something along the following lines:
genome_fasta=/path/to/fasta.fa
gtf=/path/to/genome.gtf
org.db=org.Hs.eg.db
nfcore=-profile crick -resume -r 3.10.1
It’s recommended to name the config file after the genome being used (e.g. GRCh38), as this label will be propagated through to subsequent file names. Again, you can have multiple config files (nfcore will be called as many times, and each spec file will be applied to each set of results). So for dual genomes, you can analyse them separately, or include an additional fasta option in a single config file. We use star_rsem as the default aligner - if you want to use Salmon, then an additional line:
aligner=star_salmon
in the config will suffice.
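For example, a hypothetical mouse config named GRCm39.config that switches to Salmon might look like this (the paths are placeholders):
genome_fasta=/path/to/GRCm39.fa
gtf=/path/to/GRCm39.gtf
org.db=org.Mm.eg.db
nfcore=-profile crick -resume -r 3.10.1
aligner=star_salmon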
There are extra options if
you are joining the pipeline at the statistical stage (where we only
need the org.db line, the other three only being used by
the nfcore parts of the pipeline.)
Usage
Whole pipeline
TLDR: make all in the babs folder will run
the whole pipeline. make all sbatch_args= will do the same
but run the final differential stage over slurm.
cd into the ./babs folder, and there should
be a makefile there, along with folders for each ‘module’ of
functionality. docs is the source of all truth;
ingress will copy everything from docs (a
trivial operation, but if we ever change the ‘API’ this layer will let us
adapt reasonably easily). nfcore is the first non-trivial module -
it quantifies the counts. And differential is the main body.
Each module follows the same pattern: make run will perform
a pre-flight check to ensure it has all the prerequisites from upstream
modules, and then launch into generating the required outputs (reports,
intermediate files, processed data objects).
make all from the top level babs directory
is hard-wired to go to the endpoint module (currently the
differential module) and do make run from
there, so the two approaches are equivalent.
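In other words, the following are equivalent (a sketch, assuming the standard layout):
# from the top-level babs directory
make all
# ...is the same as...
cd differential && make run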
The sbatch_args option signals that the
differential part of the analysis should be submitted to the slurm
scheduler. So, if you want to run the job under slurm:
make all sbatch_args=
make all sbatch_args=--time=01-00
make all sbatch_args="--time=01-00 --job-name='Hello World'"
The sbatch_args= can be empty (purely to flag to make that sbatch is
required, but with default parameters), or contain one or more sbatch
command-line options.
Just the differential part
If you’re using the full bulk RNASeq pipeline, everything should be
set up for you. There will be a folder called differential.
To get the three main inputs, metadata needs to be transferred from the
docs folder (the original source of the experiment table and
spec files that describe the analysis to be carried out), and data (the
aligned counts) need to be transferred from the nfcore
module. Achieving that should just be a matter of running
make run (or make run sbatch_args=) in the
differential folder. However:
If you’re taking counts etc from different sources, then it’s
possible to join the pipeline just at this differential
stage by manually ‘simulating’ the ‘gathering’ stage that
make run would do in an end-to-end context. Set an
environment variable that stores a tag (alphanumeric please) that
identifies how you quantified your data: I’m going to use the tag
GRCh38 as a toy value. Then you’ll need to achieve the
following to enable a manual analysis:
cd differential
export alignment=GRCh38 # change `GRCh38` to a meaningful (but arbitrary) tag for your quantification
mkdir -p extdata/genes.results/${alignment}
# change `path-to` to the right location!
cp /path-to/*.genes.results extdata/genes.results/${alignment}/
# If there exists an 'nfcore' directory, rename it so that it doesn't get launched
[ ! -d ../nfcore ] || mv ../nfcore ../unused-nfcore
## If you have a ../docs file with `*.spec`, `experiment_table.csv` and
## `${alignment}.config` correctly set up, you can skip all remaining steps
## and do a `make run` now. Otherwise:
# First column is `ID` and its values match the extensionless basenames of *.genes.results
cp /path-to/experiment_table.csv extdata/metadata.csv
cp /path-to/experiment_table.csv extdata/metadata_${alignment}.csv
cp /path-to/*.spec .
# make sure XXX is the correct annotation package
echo "${alignment}.org.db=org.XXX.eg.db" > extdata/${alignment}.config
That should replace everything that would have been automatically
retrieved into the differential folder from outside if one
were running the whole pipeline.
Following on from that, the folder will locally contain everything
needed for future analyses to be entirely self-contained: changes to the
local copies of the spec files and the contents of extdata
are sufficient (you can copy changes back to their original
source modules for reference, but if the ‘originals’ remain older then
they won’t overwrite these local copies even if another build is carried
out.)
This will not work if there are sibling folders from the pipeline
present, as the pipeline may still consult those directories for the
prerequisites - if you experience problems, please move any of the
folders docs/ingress/nfcore so
that they’re not next to the differential folder.
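For example, something along these lines (a sketch) parks them out of the way:
cd differential
for d in docs ingress nfcore; do
  [ -d ../$d ] && mv ../$d ../unused-$d
done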
Complete isolation: starting from pre-processed data
There is a separate vignette on how to get a subset of the pipeline that starts from a single DESeqDataSet object (or SummarizedExperiment or CSV file) alongside a spec file, and produces the reports from there.
Differential analysis
The vast majority of work is carried out by:
# cd babs/differential
make run
# or to auto-submit via slurm
# make run sbatch_args="--time=04:00:00"
which will discover all spec files in the current
directory, and analyse the data according to the analysis plan.
The rough steps this goes through are:
- Set up a singularity environment that isolates a version of R and Quarto for reproducibility
- Set up the R packages that are necessary to run the analysis
- ‘Join’ the quarto scripts in the resources folder together with the spec files and count files, so that the staging folder contains scripts for all the analyses required.
- Render the quarto ‘project’ that encompasses all these staging reports in the singularity container.
More often than not, the spec file from the initial design isn’t
correct/sufficient. Editing it in place is absolutely fine, and a repeat
make run will take a fresh attempt. There’s no harm in a
rm -rf staging to clear out old scripts. Similarly the
results directory can get populated with old files -
manually cleaning these out is fine from a pipeline perspective. But if
you want to keep separate generations of results, it is worth performing
a git tag to bump the version of the analysis as this will
put results in a separate subfolder.
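For example (the tag name here is illustrative), tag before re-running:
git tag -a v2 -m "second generation of results after spec revisions"
make run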
The quarto report will be an html website. The individual quarto
scripts get translated from their ‘raw template’ form in
resources, to their realised form in the
staging directory. These also produce objects in the
data folder for further bespoke analysis.
Output
R object
The bioinformatician’s output is contained in the
data/counts_alignmentID.rda object (where
alignmentID is the label described in the very first part of
this document). If you have R running with the working directory at the top
level of the project, you will be able to access it with
R> data(counts_alignmentID)
That is the pure DESeq2 data object prior to processing. We also
store the analysed data in data/specID_x_alignmentID (where
specID identifies the name of the spec file used). This is a
triply-nested list, called dds_model_comp. The first level
corresponds to the datasets, so for example
R> data(analysis_x_nfcore)
R> names(dds_model_comp)
[1] "all" "tumour" "normal"
corresponds to our situation where we analysed the tumour/normal cohorts both together and individually. The next level of the list corresponds to the models that were run on that dataset:
R> names(dds_model_comp$all)
[1] "accurate" "powerful"
might represent the two different approaches we used to account for a potential batch effect. And finally the third level corresponds to the individual comparisons:
R> names(dds_model_comp$all$accurate)
[1] "comp1" "comp2" "comp3" "comp4"
R> class(dds_model_comp$all$accurate$comp1)
[1] "DESeqDataSet"
R> mcols(dds_model_comp$all$accurate$comp1)$results
DataFrame with 19319 rows and 11 columns
baseMean log2FoldChange lfcSE stat pvalue padj weight symbol entrez class shrunkLFC
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <character> <character> <character> <numeric>
So we have the results stored in the mcols
of each DESeqDataSet that’s stored in the third level of the list. You
can use purrr::map_depth(dds_model_comp, 3, ~ mcols(.x)$results) as one way
of extracting these, or loops, or …