poreCov | SARS-CoV-2 Workflow for nanopore sequencing data

Citation:

poreCov - an easy to use, fast, and robust workflow for SARS-CoV-2 genome reconstruction via nanopore sequencing
Christian Brandt, Sebastian Krautwurst, Riccardo Spott, Mara Lohde, Mateusz Jundzill, Mike Marquet, Martin Hölzer
https://www.frontiersin.org/articles/10.3389/fgene.2021.711437/full

What is this Repo?

poreCov is a SARS-CoV-2 analysis workflow for nanopore data (via the ARTIC protocol) or SARS-CoV-2 genomes (fasta)
the workflow is pre-configured to simplify data analysis

Table of Contents
- What is this Repo?
1. Quick Setup (Ubuntu)
2. Run poreCov
3. Quality Metrics (default)
4. Workflow
5. Literature / References to cite
6. Troubleshooting
- Singularity
7. Time to results
8. Credits

1. Quick Setup (Ubuntu)

1.1 Nextflow (the workflow manager)

poreCov needs Nextflow and a Java runtime (default-jre)
- install the Java runtime via: sudo apt install -y default-jre
- install Nextflow via: curl -s https://get.nextflow.io | bash && sudo mv nextflow /bin && sudo chmod 770 /bin/nextflow

1.2 Container (choose one - they manage all the tools)

Docker

install Docker following the official installation instructions (recommended), alternatively via: sudo apt install -y docker.io
add your user to the docker group: sudo usermod -a -G docker $USER

Singularity

Singularity installation: see the official Singularity installation instructions
use Singularity if you can't use Docker

Note that with Singularity the following environment variables are automatically passed to the container to ensure execution on HPCs: HTTPS_PROXY, HTTP_PROXY, http_proxy, https_proxy, FTP_PROXY and ftp_proxy.

Conda (not recommended)

Conda installation: see the official Conda installation instructions
install Nextflow and Singularity via conda (not cluster compatible) and use the singularity profile

1.3 Basecalling (optional)

only needed if you want the workflow to do basecalling on a GPU:
- local Guppy installation (see the Oxford Nanopore installation guide)
- or: install the NVIDIA Docker toolkit
- or: Singularity (with --nv support)

2. Run poreCov

2.1 Test run

validate your installation via test data:

# for a Docker installation
nextflow run replikation/poreCov -profile test_fastq,local,docker -r 1.1.0 --update

# or for Singularity or conda installation
nextflow run replikation/poreCov -profile test_fastq,local,singularity -r 1.1.0 --update

2.2 Quick run examples

poreCov with basecalling and Docker
- --update tries to force the most recent pangolin lineage and nextclade release version (optional)
- -r 1.1.0 specifies the workflow release (see the poreCov releases page)
- --primerV specifies the primer sets that were used, see --help to see what is supported
  - alternatively provide a primer bed file on your own

nextflow run replikation/poreCov --fast5 fast5/ -r 1.1.0 \
    --cores 6 -profile local,docker --update --primerV V4

poreCov with a basecalled fastq directory and custom primer bed file

nextflow run replikation/poreCov --fastq_pass 'fastq_pass/' -r 1.1.0 \
    --cores 32  -profile local,docker --update --primerV primers.bed

poreCov with basecalling and renaming of barcodes based on sample_names.csv

# rename barcodes automatically by providing an input file, also using another primer scheme
nextflow run replikation/poreCov --fast5 fast5_dir/ --samples sample_names.csv \
   --primerV V1200 --output results -profile local,docker --update

2.3 Extended Usage

see also nextflow run replikation/poreCov --help -r 1.1.0

Version control

poreCov supports version control via -r; this way, you can run everything reproducibly (e.g. -r 1.1.0)
- moreover, only releases are extensively tested and validated
poreCov releases are listed on the repository's releases page
add -r <version> to a poreCov run to activate this
run nextflow pull replikation/poreCov to install updates
- if you have issues during an update, try rm -rf ~/.nextflow and then nextflow pull replikation/poreCov
- this removes old files and downloads everything anew

Important input flags (choose one)

these are the flags to get "data" into the workflow
- --fast5 fast5_dir/ for fast5 directory input
- --fastq_pass fastq_dir/ directory with basecalled data (contains "barcode01" etc. directories)
- --fastq "sample*.fastq.gz" alternative fastq input (one sample per file)
- --fasta "*genomes.fasta" SARS-CoV-2 genomes as fasta (.gz allowed)

Custom primer bed files

poreCov supports the input of primer.bed files via --primerV instead of selecting a preexisting primer version like --primerV V4
- for an example see 2.2 Quick run examples
- feature available for poreCov version 1.1.0 or greater
the main issue with primer bed files is that they need the correct columns and text to be recognized by artic
the following rules apply to the bed file (see also the example below)
- each column is separated via one tab or \t
- column 1 is the fasta reference, and it should be MN908947.3 (poreCov replaces that automatically)
- column 2 is the primer start
- column 3 is the primer end
- column 4 is the primer name, and it has to end with _RIGHT or _LEFT
- column 5 is the pool and it should be named nCoV-2019_1 or nCoV-2019_2
- column 6 defines the strand orientation with either - or +

MN908947.3	30	54	nCoV-2019_1_LEFT	nCoV-2019_1	+
MN908947.3	1183	1205	nCoV-2019_1_RIGHT	nCoV-2019_1	-
MN908947.3	1100	1128	nCoV-2019_2_LEFT	nCoV-2019_2	+
MN908947.3	2244	2266	nCoV-2019_2_RIGHT	nCoV-2019_2	-
MN908947.3	2153	2179	nCoV-2019_3_LEFT	nCoV-2019_1	+
MN908947.3	3235	3257	nCoV-2019_3_RIGHT	nCoV-2019_1	-
MN908947.3	3144	3166	nCoV-2019_4_LEFT	nCoV-2019_2	+
MN908947.3	4240	4262	nCoV-2019_4_RIGHT	nCoV-2019_2	-
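The column rules above can be checked before a bed file is passed to --primerV. The following is a minimal sketch under stated assumptions, not the check that artic or poreCov performs: the function name validate_primer_bed is hypothetical, column 1 is not checked because poreCov replaces it, and the requirement that the start is smaller than the end is an added assumption based on the example.

```python
PRIMER_SUFFIXES = ("_LEFT", "_RIGHT")     # rule for column 4
POOLS = ("nCoV-2019_1", "nCoV-2019_2")    # rule for column 5
STRANDS = ("+", "-")                      # rule for column 6


def validate_primer_bed(lines):
    """Return a list of problems found in primer bed lines (empty if none)."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 6:
            problems.append(f"line {number}: expected 6 tab-separated columns, found {len(fields)}")
            continue
        _reference, start, end, name, pool, strand = fields
        if not (start.isdigit() and end.isdigit()):
            problems.append(f"line {number}: primer start and end must be integers")
        elif int(start) >= int(end):
            # Assumption: start < end, as in every line of the example.
            problems.append(f"line {number}: primer start must be smaller than primer end")
        if not name.endswith(PRIMER_SUFFIXES):
            problems.append(f"line {number}: primer name must end with _LEFT or _RIGHT")
        if pool not in POOLS:
            problems.append(f"line {number}: pool must be nCoV-2019_1 or nCoV-2019_2")
        if strand not in STRANDS:
            problems.append(f"line {number}: strand must be + or -")
    return problems


if __name__ == "__main__":
    example = [
        "MN908947.3\t30\t54\tnCoV-2019_1_LEFT\tnCoV-2019_1\t+",
        "MN908947.3\t1183\t1205\tnCoV-2019_1_RIGHT\tnCoV-2019_1\t-",
    ]
    for problem in validate_primer_bed(example):
        print(problem)
```

To check a file, pass the open file handle, for example validate_primer_bed(open("primers.bed")).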

Sample sheet

barcodes can be automatically renamed via --samples sample_names.csv
required columns:
- _id = sample name
- Status = barcode number which should be renamed
optional column:
- Description = description column to be included in the output report and tables

Example comma separated file (don't replace the header):

_id,Status,Description
Sample_2021,barcode01,good
2ndSample,BC02,bad
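A sample sheet can be checked for the required columns before a run. The following is a minimal sketch, not the parser poreCov uses: the function name read_sample_sheet is hypothetical, and rejecting duplicate barcodes is an added assumption.

```python
import csv
import io

REQUIRED_COLUMNS = ("_id", "Status")


def read_sample_sheet(text):
    """Parse sample sheet text and return a {barcode: sample name} mapping."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")
    mapping = {}
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        barcode = (row["Status"] or "").strip()
        sample = (row["_id"] or "").strip()
        if not barcode or not sample:
            raise ValueError(f"row {row_number}: _id and Status must not be empty")
        if barcode in mapping:
            # Assumption: each barcode is renamed at most once.
            raise ValueError(f"row {row_number}: barcode {barcode} is listed twice")
        mapping[barcode] = sample
    return mapping


if __name__ == "__main__":
    example = "_id,Status,Description\nSample_2021,barcode01,good\n2ndSample,BC02,bad\n"
    for barcode, sample in read_sample_sheet(example).items():
        print(barcode, "->", sample)
```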

Pangolin Lineage definitions

lineage definitions change quickly in response to the pandemic
to avoid using out-of-date lineage schemes, a --update flag can be added to each poreCov run to get the most recent version-controlled pangolin container
version-controlled pangolin containers are currently built automatically twice a week
- instead of --update, it is also possible to use this flag: --pangolindocker 'nanozoo/pangolin:3.1.1--2021-06-14'
- this way you can use another container or version, but beware that some containers might not be compatible with poreCov

3. Quality Metrics (default)

Regions with coverage of 20 or less are masked ("N")
Genome quality is compared to NC_045512.2
- Genome quality assessment is based on RKIBioinformaticsPipelines/president
  - also prepares csv and fasta for upload via DESH portal
Pangolin lineages are determined
Nextstrain clades are determined, including mutation information
reads are classified as human or SARS-CoV-2 to check for possible contamination and sample prep issues
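The masking rule (coverage of 20 or less becomes "N") can be illustrated on a toy sequence. The following is a minimal sketch of the rule only, not the code poreCov or the ARTIC pipeline runs: the function name mask_low_coverage and the per-base depth list are assumptions made for illustration.

```python
def mask_low_coverage(sequence, depths, min_depth=20):
    """Replace every base whose read depth is min_depth or less with 'N'.

    `depths` holds one read depth per base of `sequence` (hypothetical input).
    """
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    return "".join(
        "N" if depth <= min_depth else base
        for base, depth in zip(sequence, depths)
    )


if __name__ == "__main__":
    # Depth 20 is masked as well, because the rule is "20 or less".
    print(mask_low_coverage("ACGTAC", [5, 20, 21, 100, 0, 30]))
```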

4. Workflow

poreCov was coded with "easy to use" in mind, while staying flexible
therefore we provide a few input types which adjust the workflow automatically
- fast5 raw data, fastq files (one sample per file), fastq_pass (the basecalling output) or fasta (supports multifastas)
primer schemes for ARTIC can be V1, V2, V3 (default), V4, V4.1 or V1200 (the 1200bp amplicon ones)

5. Literature / References to cite

If you are using poreCov, please also check the software it uses and cite it in your work.

6. Troubleshooting

A collection of helpful information

Singularity

Singularity needs additional option flags to run, such as --userns
- Solution on how to pass Singularity commands to poreCov

7. Time to results

Table 1: Execution speed of poreCov on different Ubuntu 20 Systems using a single sample file with 167,929 reads. Command used: nextflow run replikation/poreCov -r 0.9.4 -profile test_fastq,local,docker.

Hardware	First time with download (DB+container)¹	Default settings
2 CPUs 4 GB RAM	1 h 2 min	32 min 30 s ²
2 CPUs 8 GB RAM	46 min	21 min 20 s
4 CPUs 16 GB RAM	40 min	12 min 48 s
8 CPUs 32 GB RAM	35 min	11 min 39 s
16 CPUs 64 GB RAM	30 min	9 min 39 s

¹ time depends mostly on available internet speed
² was not able to execute read classification due to limited hardware, but generated and classified SARS-CoV-2 genomes

Table 2: Execution speed of poreCov on different Ubuntu 20 Systems using 24 fastq samples. Command used: nextflow run replikation/poreCov -r 0.9.4 --fastq "*.fastq.gz" --primerV V1200 --samples samplenames.csv -profile local,docker. Time measured from the start of the workflow.

Hardware	Default settings
2 CPUs 4 GB RAM	13 h 33 min ¹
2 CPUs 8 GB RAM	7 h 56 min ¹
4 CPUs 16 GB RAM	4 h 10 min
8 CPUs 32 GB RAM	2 h 15 min
16 CPUs 64 GB RAM	1 h 25 min

¹ was not able to execute read classification due to limited hardware, but generated and classified SARS-CoV-2 genomes

8. Credits

The key steps of poreCov are carried out using the ARTIC Network field bioinformatics pipeline. Kudos to all amazing developers for your incredible efforts during this pandemic! Many thanks to all others who have helped out and contributed to poreCov as well.
