danrlu / Nextflow_cheatsheet

Licence: MIT license
Tips for Nextflow and cheatsheet for channel operation

Tips and cheatsheet for Nextflow

These are notes for myself gathered through using Nextflow, and hopefully useful for others. Error reports and suggestions welcome!

Some resources

The working directory

Understanding the working directory was the hardest part of learning Nextflow for me, and it turned out to be the key to knowing where files are and how to debug errors, because often all the files and logs you need are in the working directory.

  • Each execution of a process happens in its own temporary working directory.
  • Specify the location of the parent working directory with workDir = '/path_to_tmp/' in nextflow.config, or with the -w option when running nextflow main.nf.
  • Each execution of a process creates one folder in the working directory. This folder starts off with files only from the input channel (usually in the form of symlinks, see below), so it's fairly isolated from the rest of the file system.
  • As the process runs, this folder will also contain all intermediate files, logs, and output files (unless specifically directed elsewhere), and only those specified in the output channels and publishDir will be moved or copied to the publishDir.
    • Anything you want to specify in publishDir needs to be in an output channel.
    • Note that with publishDir "path", mode: 'move', the output file will be moved away from the working directory and Nextflow will not be able to use it as input for another process, so only use this option when there is not a following process that uses the output file.
    • Be mindful that if the """ (script section) """ involves changing directory, such as cd or rmarkdown::render( knit_root_dir = "folder/" ), Nextflow will still only search the working directory for output files because the execution is in the working directory. tl;dr: this gets tricky, so try to let Nextflow handle folder navigation as much as possible.
  • To find the location of the working directory: it is the folder named like /path_to_tmp/4d9c3b333734a5b63d66f0bc0cfcdc that Nextflow points you to when there is an error in execution. This folder usually already contains all the files needed to reproduce the error, and the Nextflow error message gives clear directions on how to reproduce it. One can also find the folder path in .nextflow.log or in report.html.
  • Run nextflow clean -f in the execution folder to clean up the working directories, which often grow large unnoticed.
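
A minimal sketch of how these pieces fit together. The process name `count_lines` and the files `A.txt` and `summary.txt` are hypothetical, not from the original notes; the parent working directory would be set separately (for example workDir = '/path_to_tmp/' in nextflow.config).

```groovy
// main.nf (illustrative only; process and file names are hypothetical)
process count_lines {
    // Only files declared under `output:` can be published.
    // mode: 'copy' leaves the original in the working directory,
    // so a downstream process could still use it as input.
    publishDir "${workflow.projectDir}/output", mode: 'copy'

    input:
    path(infile)          // staged into the task folder, usually as a symlink

    output:
    path("summary.txt")   // must be written inside the task folder

    script:
    """
    wc -l ${infile} > summary.txt
    """
}

workflow {
    count_lines(Channel.fromPath("A.txt"))
}
```

After a run, the task folder under the working directory should hold the symlink to A.txt, summary.txt, and hidden files such as .command.sh and .command.log, which are useful when reproducing an error by hand.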

Where am I?

Actual data usually lives somewhere other than the Nextflow scripts, and being able to specify relative file paths makes the code more portable. The options below are much more reliable than $PWD or $pwd.

  • In Nextflow scripts (.nf files), one can use
    • ${workflow.projectDir} to refer to where the project is located (usually the folder of main.nf). For example: publishDir "${workflow.projectDir}/output", mode: 'copy' or Rscript ${workflow.projectDir}/bin/task.R.
    • ${workflow.launchDir} to refer to where the script is called from, aka the current folder in Terminal when running nextflow main.nf.
  • $baseDir usually refers to the same folder as ${workflow.projectDir} but it can also be used in the config file, where ${workflow.projectDir} and ${workflow.launchDir} are not accessible.

Print - debugger's best friend

The hardest errors to debug (assuming one is familiar with bioinformatics tools) are often about channel structure TnT

  • To print a channel, use .view(). It's especially useful to resolve WARN: Input tuple does not match input set cardinality declared by process. (Don't forget to remove .view() after debugging)
  channel_vcf
    .combine(channel_index)
    .combine(channel_chr)
    .view()
  • To print from the script section inside the processes, add echo true. This is very useful to check whether a channel has passed desired information in correct format to the process.
  process test {
    echo true    // this will print the stdout from the script section on Terminal
    input: path(file)
    """
    head $file
    """
  }
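
A small end-to-end sketch combining both kinds of printing. It assumes a file `A.txt` exists in the launch directory (a hypothetical input), and it uses `debug true`, which, as far as I know, is the newer name for the `echo` directive in recent Nextflow releases; on older versions keep `echo true`.

```groovy
// Illustrative only: A.txt is a hypothetical input file
process test {
    debug true    // prints the script's stdout to the terminal (older versions: echo true)

    input:
    path(infile)

    script:
    """
    head ${infile}
    """
}

workflow {
    // Print each item with a label before it enters the process
    ch = Channel.fromPath("A.txt")
        .view { f -> "going into test: ${f}" }

    test(ch)
}
```

Labelling the `.view` output with a closure makes it easier to tell several printed channels apart when more than one is being inspected.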

Channel.from and Channel.fromPath what's the difference?

As biologists, we turn over every rock.

  • Channel.from( "A.txt" ) will put A.txt as is into the channel
  • Channel.fromPath( "A.txt" ) will add a full path (usually current directory) and put /path/A.txt into the channel.
  • Channel.fromPath( "folder/A.txt" ) will add a full path (usually current directory) and put /path/folder/A.txt into the channel.
  • Channel.fromPath( "/path/A.txt" ) will put /path/A.txt into the channel.
  • In other words, Channel.fromPath will only add a full path if there isn't already one and ensure there is always a full path in the resulting channel.
  • This goes hand in hand with input: path("A.txt") inside the process, where Nextflow actually creates a symlink named A.txt (note the path from first / to last / is stripped) linking to /path/A.txt in the working directory, so it can be accessed within the working directory by the script cat A.txt without specifying a path.
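
To see the difference directly, a sketch like the one below prints what each factory puts into the channel. It assumes `A.txt` sits in the folder where `nextflow run` is launched, and it uses `Channel.of`, which to my knowledge is the current replacement for the older `Channel.from`.

```groovy
// Illustrative only: A.txt is a hypothetical file in the launch directory
workflow {
    // Emits the string exactly as written, with no path resolution
    Channel.of("A.txt")
        .view { v -> "of:       ${v}" }

    // Emits a path object resolved to an absolute location
    Channel.fromPath("A.txt")
        .view { p -> "fromPath: ${p}" }
}
```

Adding `checkIfExists: true` to `Channel.fromPath` makes Nextflow fail early when the file is missing, which can save a confusing error later in the run.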

input: path("A.txt") in the process section

  • With input: path("A.txt") one can refer to the file in the script as A.txt. Side note: A.txt doesn't have to be the same name as in channel creation. It can be anything: input: path("B.txt"), input: path("n"), etc.
  • With input: path(A) one can refer to the file in the script as $A, and the value of $A will be the original file name (without path, see section above).
  • input: path("A.txt") and input: path "A.txt" generally both work. I occasionally had errors that required the following (tip from @danielecook):
    • If not in a tuple, use input: path "A.txt"
    • If in a tuple, use input: tuple path("A.txt"), path("B.txt")
    • This goes the same for output.
  • From pditommaso: path(A) is almost the same as file(A); however, the first interprets a value of type string as the input file path (i.e. the location in the file system where it's stored), while the latter interprets a value of type string and materialises it to a temporary file. Using path is recommended since it's less ambiguous and fits better in most use cases.
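
A short sketch of the tuple form, under the assumption that two hypothetical files `first.txt` and `second.txt` exist in the launch directory. Whatever the files are called on disk, they are staged in the task folder under the names given in the input declaration.

```groovy
// Illustrative only: process and file names are hypothetical
process join_pair {
    input:
    tuple path("A.txt"), path("B.txt")   // staged as A.txt and B.txt, regardless of original names

    output:
    path("joined.txt")

    script:
    """
    cat A.txt B.txt > joined.txt
    """
}

workflow {
    // combine() pairs the two files into one tuple: [first.txt, second.txt]
    pairs = Channel.fromPath("first.txt")
        .combine(Channel.fromPath("second.txt"))

    join_pair(pairs)
}
```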

DSL2

This is a little outdated. Is anyone still DSL1-ing??

  • Moving to DSL2 is a one-way street. It's so intuitive with clean and readable code.
  • In DSL1, each queue channel can only be used once.
  • In DSL2, a channel can be fed into multiple processes.
  • In DSL2, each process can only be called once. The solution is either to .concat() the input channels so they run as parallel processes, or to put the process in a module and import it multiple times from the module. (One may be able to call a process in different workflows; I haven't tested this yet.)
  • DSL2 also enforces that all inputs need to be combined into 1 channel before they go into a process. See the cheatsheet for useful operators.
  • Simple steps to convert from original syntax to DSL2
  • Deprecated operators.
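
The module route might look like the sketch below. It assumes a hypothetical file `modules/align.nf` that defines a process named `align` taking a single file input; none of these names come from the original notes.

```groovy
// main.nf (illustrative only; module path, process name and inputs are hypothetical)
// Import the same process twice under different aliases so it can be called twice.
include { align as align_sample  } from './modules/align'
include { align as align_control } from './modules/align'

workflow {
    align_sample(Channel.fromPath("sample.fq"))
    align_control(Channel.fromPath("control.fq"))
}
```

Each alias behaves as its own process, so the two calls get separate task names in the logs and reports.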

Run reports

Beautiful graphics especially useful for performance monitoring.

  • nextflow main.nf -with-report -with-timeline -with-dag
  • -with-report Nextflow html report contains resource usage for each process, and details (most useful being the status and working directory) for each process.
  • -with-timeline How much wait time and run time each process took for the run. Very useful reference for optimizing resource allocation and improving run time.
  • -with-dag Make a flowchart to show the relationship of channels and processes.
  • Software dependencies to use these features. Note the differences on Mac and Linux.
  • How to set them up in the nextflow.config so they are automatically generated for each run. Credit danielecook
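
One possible nextflow.config fragment for generating the three reports on every run. The `reports/` output folder and file names are assumptions chosen for illustration.

```groovy
// nextflow.config (fragment; output paths are hypothetical)
report {
    enabled = true
    file    = "reports/report.html"
}

timeline {
    enabled = true
    file    = "reports/timeline.html"
}

dag {
    enabled = true
    file    = "reports/dag.html"
}
```

With this in place, a plain nextflow main.nf should produce the same files as passing -with-report, -with-timeline and -with-dag on the command line.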

Require users to specify a parameter value

  • There are 2 types of parameters: (a) those with no actual value and (b) those with actual values.
  • (a) If a parameter is specified but no value is given, it is implicitly considered true. For example, one can use this to run debug mode nextflow main.nf --debug
    if (params.debug) {
        ... (set parameters for debug mode)
    } else {
        ... (set parameters for normal use)
    }
  • or to print help message nextflow main.nf --help
    if (params.help) {
        println """
        ... (help msg here)
        """
        exit 0
    }
  • (b) For parameters that need to contain a value, Nextflow recommends setting a default and letting users overwrite it as needed. However, if you want to require it to be specified by the user:
    params.reference = null   // no quotes. this line is optional, since without initialising the parameter it will default to null. 
    if (params.reference == null) error "Please specify a reference genome with --reference"
  • The check below works as long as the user always appends a value: --reference=something. It will not print the error message with nextflow main.nf --reference (without specifying a value), because this sets params.reference to true (see point (a)) and !params.reference will be false.
    if (!params.reference) error "Please specify a reference genome with --reference"
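
If the bare-flag case also needs to be caught, one option is to reject a Boolean value as well as a missing one. This is a sketch that relies on the behaviour described in point (a), where a flag given without a value becomes true; it is not an official Nextflow recipe.

```groovy
// Illustrative only: treats both "not given" and "given without a value" as errors
params.reference = null

if (params.reference == null || params.reference instanceof Boolean) {
    // null    -> --reference was never passed
    // Boolean -> --reference was passed without a value (it became true)
    error "Please specify a reference genome with --reference <value>"
}
```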

Acknowledgement

  • danielecook for offering lots of help and advice.
  • The last function .collect{ it[1] } in the cheatsheet came from a post in Nextflow Gitter (now replaced by Nextflow Slack) by Juke34
  • pditommaso for suggesting edits.