Starting from preprocessed data

Occasionally we may want to run just the differential part of the analysis for a pre-processed dataset that already exists as an R object. Here, we describe the easiest way to this, assuming we have a spec-file and either an rds or rda R object that contains the raw counts and metadata as a DESeq2::DESeqDataSet object.

The objective is to create a folder containing the scripts and resources usually provided by the differential folder of the whole pipeline. Navigate to the directory in which you want to carry out the analysis: probably initially empty to avoid over-writing any previous preexisting content. Then, in that directory, download just the differential part of the pipeline:

tar -xzf /nemo/stp/babs/working/bioinformatics/templates/rnaseq.tar.gz --strip=2 babs/differential
make check-isolated

The make command there will progressively step you through the prerequisites for an isolated analysis. If make check-isolated completes without any errors, then you are ready to try make run, otherwise carefully examine and act on what it reports - if it finds something to be missing, it will generate an example for you, but it won’t fit the specifics of your design. Once you have customised it to fit your requirements, run make check-isolated again to see what the next missing piece might be.

The non-negotiable is a .config file:

Config file

The first run of that command will create a config called extdata/example.config if you haven’t already got a config file. The exmple contains empty versions of all possible entries:

# Name of bioconductor annotation package
example.org.db=org.Hs.eg.db

# Name of rda or rds file in extdata that will supply a preprocessed object.
# Will default to the name of the config file - so delete lines if not used
example.rda=example
example.rds=example

# Name of CSV file in extdata that will supply assay data. Delete line if not used.
# You can set example.missing to this value _instead_, and a binary 'missingness'
# analysis will be carried out instead.
example.csv=example
# Name of a column in a CSV that contains the gene symbols.
# Defaults to first column if you delete this line.
example.symbolColumn=Symbol

# What word to use in reports to refer to a feature (protein, peptide, metabolite, gene...)
# Defaults to 'gene'
example.rowNoun=gene

# Comma-separated list of names of spec files that will _not_ be combined with this config settings.
example.nospec=

Base your actual config file on that - it is important to prefix each line with the name of the config file: if you rename the above to my.config but still have example.org.db=org.Hs.eg.db that won’t be recognised. If you have a working config file in another location, copy that to the extdata folder and remove the example before continuing.

The above already gives some guidance on ways to get the data into the pipeline: a path to an rda file, an rds file or a csv file - plus the fallback of the output of a Nextflow run if the extdata doesn’t find any of the others. Whichever, it needs to be directly named after the config file, so I will always use e.g. extdata/example.csv here, but change that to your own preference.

Assay Data

make check-isolated will not create example assay data for you, but will report the lack of compatible data.

RDS/RDA

For these binary files, it expects either a fully-fledged DESeqDataSet object with populated colData etc. If it is a SummarizedExperiment object, the pipeline will run in limma-mode.

CSV

This will also require a extdata/metadata.csv file to be present, that conforms to the usual requirements describing the experimental conditions. If metadata.csv and example.csv are both present, they will be combined into a SummarizedExperiment object, and analysed in limma-mode. example.csv must have columns named exactly the same as the entries in the first column of metadata.csv - they will act as the assay data. Additional columns in the example.csv will be imported as rowData.

Spec file(s)

Then into extdata/ (which gets created by the script), you need to place a *.spec file. `make check-isolated’ will put an example spec file in if you don’t have one.

Running the pipeline

Once you’ve done that (got x.config, x.{rda,rds,csv} and y.spec all in ./extdata) a make run should do everything required. It’s all a lot easier than comes across writing it down long-hand. You can also do

make run sbatch_args=
make run sbatch_args=--time=01-00
make run sbatch_args="--time=01-00 --job-name='Hello World'"

if you want to run the job under slurm. The sbatch_args= can be empty (purely to flag to make that sbatch is required, but with default parameters), or contain one or more sbatch command-line options

Limma-mode

If you provide a SummarizedExperiment object (or a CSV file), the analysis will be carried out using limma - using Gaussian assumptions rather than the Gamma-Poisson DESeq2 approach. The spec-file format remains identical.