Introduction
Whatever directory you are in, typing make at the
command line is equivalent to make help and will give a
reminder of the subcommands that are available to you.
The above phases are stages of the analysis. Most of the
interesting ones would be fixed for a specific pipeline (rnaseq,
chipseq…) and their corresponding directory should need no changes by an
analyst: the vast majority of the time an analysis will be entirely
determined by what is in docs.
docs should (start to) be populated during the proposal
stage, and contain reference material about what analysis is to be done,
on what data, and how. All the other phases flow from that. There is a
readme giving further details of the
required inputs.
ingress is where the files from docs are
corralled into a form that downstream phases can use. For instance, an nfcore
samplesheet could be generated in here based on a query to LIMS and
contents of docs, or a public dataset could be downloaded
and prepared for analysis.
The downstream phases should be obvious: nfcore to run
the nfcore RNASeq pipeline, differential to build the
DESeq2 objects and a static report.
Recommended structure of individual phases
So far differential is the most
feature-rich phase, so I’ll focus on using that as an illustrative
example. The initial state of the phase contains: an empty skeleton of
an R project (split into an R folder which contains
function definitions, and a resources folder which contains
template markdown documents) and environment; a settings file
module.mk that contains e.g. what the R executable is; and
a makefile which serves two purposes.
The first time this makefile is activated with make run
in the differential folder for a project, the differential folder will
‘realise’ that it needs a spec-file from the docs folder (in the
terminology of make, it has a prerequisite on a file in the
docs folder), and so copy it. It also realises it needs a counts file
from the nfcore folder, so it will attempt to copy that too. On a first
run the counts don’t exist yet, so the copy has to wait while the nfcore
folder builds them. The nfcore folder in turn detects that it needs
information from the ingress folder, which itself consults the docs folder…
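The chain of prerequisites described above can be sketched as a handful of make rules. This is a minimal sketch, not the project's actual makefile: the file names analysis.spec and counts.tsv, the relative paths, and the echo placeholder are assumptions made for illustration.

```make
# Sketch of differential/makefile. File names and paths are hypothetical.
DOCS   := ../docs
NFCORE := ../nfcore

.PHONY: run
run: analysis.spec counts.tsv
	@echo "prerequisites gathered; the analysis itself would start here"

# Copy the spec file from docs; re-copied whenever the docs version is newer.
analysis.spec: $(DOCS)/analysis.spec
	cp $< $@

# Copy the counts from the nfcore phase.
counts.tsv: $(NFCORE)/counts.tsv
	cp $< $@

# If the counts do not exist yet, ask the nfcore phase to build them first.
# That phase would then pull what it needs from ingress in the same way.
$(NFCORE)/counts.tsv:
	$(MAKE) -C $(NFCORE) run
```

Because make compares timestamps, the same copy rules cover both the first run and later runs: a local file is refreshed only when its upstream source is newer. In this simplified form the last rule fires only when the counts are missing altogether, and it assumes the nfcore phase writes counts.tsv at that path.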
On subsequent make runs, it should realise it has
everything it needs, but will scan back up the dependencies, so if a
spec file in the docs folder has changed (and is more recent than the
one in the differential folder) that will propagate into the
differential folder. An important aspect is that after everything has
run through successfully, it should be possible to share just, say, the
differential folder and for everything to work.
As long as things run through once, we are subsequently free to
either analyse with all the component directories in place (in case we
make a change that e.g. means nfcore needs to be re-run), or we can chop
off the final stage of the analysis as an entirely self-contained
directory (the only criterion being that the directory is not located
alongside other directories that share a name with one of the original
siblings, such as docs, as that will confuse things!).
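One way a phase could keep working after being cut away from its siblings is to make the upstream prerequisite optional. The rule below is a hedged sketch of that idea using GNU make's wildcard function. The file name analysis.spec is hypothetical, and the real makefiles may achieve the same effect differently.

```make
# Hypothetical variant of a spec-file rule in a phase's makefile.
# $(wildcard ...) expands to nothing when ../docs/analysis.spec is absent,
# so a cleaved-off folder simply keeps the copy it already has.
analysis.spec: $(wildcard ../docs/analysis.spec)
	cp $< $@
```

With the docs folder present, the rule behaves like an ordinary copy-if-newer rule. With it absent, the target has no prerequisites and the existing local copy counts as up to date. Under this sketch, any sibling directory that happens to be called docs would be picked up by the wildcard.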
This illustrates the general flow: there’s a linear progression of phases, each of which bootstraps itself by gathering all its prerequisites into its own scope, and then generates all its output in a controlled manner so that downstream phases can gather it for their own needs.
nfcore follows the same flow, but is much
simpler. make run here will first gather the samplesheet
and alignment configuration files from docs (via
ingress) and make them nfcore-compliant. It will then
proceed to run the nfcore RNASeq pipeline based on the information it
has gathered.
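As a rough illustration of that flow, the nfcore phase could be laid out along the following lines. Everything specific here is an assumption: the file name samplesheet.csv, the results directory, the .done marker file, and the docker profile are placeholders, and the reference genome options that the pipeline also needs are left out.

```make
# Sketch of nfcore/makefile. All names and paths are hypothetical.
.PHONY: run
run: results/.done

# Gather the samplesheet prepared by the ingress phase.
samplesheet.csv: ../ingress/samplesheet.csv
	cp $< $@

# Run the pipeline once the inputs are in place, then record success
# with a marker file so that make can tell the step is up to date.
results/.done: samplesheet.csv
	nextflow run nf-core/rnaseq -profile docker --input $< --outdir results
	touch $@
```

The marker file gives make a single timestamp to compare against the samplesheet, so the pipeline is only re-run when its inputs change.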
Orchestration of phases
At the top ‘babs’ level of the hierarchy, there is another makefile.
When a make all is executed at that level, it will descend
into the final (i.e. differential) folder and do a
make run. In this way, the whole pre-specified analysis
should be carried out. But if you only want the results of the nfcore
phase, then make nfcore in the babs folder will
suffice (so make all at this top level is directly
equivalent to make differential).
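A top-level makefile with this behaviour could look roughly like the sketch below. The phase names come from the text above; the layout of the rules is an assumption rather than a copy of the real file.

```make
# Sketch of the top-level makefile.
PHASES := nfcore differential

# The targets share their names with directories, so they must be phony.
.PHONY: all $(PHASES)

# make all is the same as make differential.
all: differential

# make nfcore / make differential: descend into that phase and run it.
$(PHASES):
	$(MAKE) -C $@ run
```

No ordering between phases is declared here, because each phase is expected to pull in whatever it needs from the phases upstream of it.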
A *.err file at this top level will store the commentary
of what happened, and on successful completion of a phase it will get
turned into a corresponding phase.log file. Within each
phase, there is likely to be a logs subdirectory which
contains individual log files.
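The err-to-log convention can be sketched with two recipe lines. This is an illustration only: the exact file names (differential.err, differential.log) are assumed from the pattern described above.

```make
# Sketch: capture a phase's commentary, and rename it only on success.
.PHONY: differential
differential:
	$(MAKE) -C $@ run > $@.err 2>&1
	mv $@.err $@.log
```

Make stops at the first recipe line that fails, so the rename only happens when the phase has completed successfully. A leftover .err file therefore marks a phase that failed or is still running.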
Also at this top level are two makefiles that will be loaded into
every phase automatically. shared.mk will be under version
control, and includes generic recipes and variables that are likely to
be used by more than one phase. It also loads secret.mk
which contains site-specific settings that you don’t want to put in a
public repository, as they may contain file paths etc. Both of these files
get automatically duplicated into each phase that uses them, to maintain
the ability to cleave a folder away from the whole pipeline once it has
been run.
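The duplication could be implemented by letting each phase treat its local copies as ordinary make targets. The following is a sketch under that assumption, relying on GNU make's documented behaviour of rebuilding included makefiles. It also assumes that shared.mk itself contains an include line for secret.mk, and the real mechanism may differ.

```make
# Sketch for a phase's makefile. Place these rules below the phase's
# default target, or set .DEFAULT_GOAL, so they do not become the default.

# Refresh the local copies whenever the top-level files exist and are newer.
shared.mk: $(wildcard ../shared.mk)
	cp $< $@

secret.mk: $(wildcard ../secret.mk)
	cp $< $@

# GNU make rebuilds an included makefile that is missing or out of date
# (using the rules above) and then re-reads it.
include shared.mk
```

Once the folder has been cut away from the pipeline, the wildcard expands to nothing and the existing local copies are used as they are.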