Tailseeker 3.1

(CAUTION) The current version, tailseeker 3, fails to process many HiSeq- and MiSeq-derived datasets because of a problem in the normalization parameter estimation. It is highly recommended to check whether the length measurements were done correctly using the QC outputs in qcplots/*.pdf. If there are apparent errors, try an older version instead, which is less sensitive to this issue than tailseeker 3. The issue originates from a patterned imbalance of fluorescence signals among the channels. Fixing it would require an additional locally adaptive normalization step before the processing. The authors are not currently working on an implementation because they have no recent project that uses tailseeker. Please consider opening a pull request if you write one.

Tailseeker is the official pipeline for TAIL-seq, which measures poly(A) tail lengths and 3′-end modifications with Illumina SBS sequencers.

Analysis levels
Running with Conda
Running with Docker
Non-Docker installations
- Installing tailseeker
  - Installing from a full binary bundle
  - Installing from a source package
    - Prerequisite software
    - Installation
- Running the pipeline
  - Generating genome reference databases
  - Running the pipeline
Pre-built genome resource packages
Data outputs
- Read name format (analysis level 1 only)
Software licenses

Analysis levels

Users can choose how much of the analysis Tailseeker performs, from almost everything down to minimal tail length measurement. The options and the supported genomes are as follows:

Level	Genomes	Features
1	Any	Poly(A) length measurement (≥ 5 nt); non-A additions to poly(A) tails; PCR duplicate removal; quality check for poly(A) length measurement
2	BDGP6 (D. melanogaster); JGIxl91 (Xenopus laevis)	All features from level 1; poly(A) length refinement based on genome sequence; non-templated 3′-end tails; alignments to genome (BAM)
3	GRCh38 (Homo sapiens); GRCm38 (Mus musculus); GRCz10 (Danio rerio); WBcel235 (C. elegans); Rnor_6.0 (Rattus norvegicus)	All features from level 2; gene-level statistics for poly(A) length and non-templated additions; gene-level quantifications
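The table above can also be read as a lookup from genome identifier to the highest analysis level listed for it. The following Python sketch is only an illustration: the names MAX_LEVEL_BY_GENOME and max_analysis_level are hypothetical and not part of tailseeker, and it assumes that a genome listed at level 3 can also be analyzed at the lower levels.

```python
# Highest analysis level listed for each genome in the table above.
# The dictionary and function names are illustrative, not tailseeker's API.
MAX_LEVEL_BY_GENOME = {
    "BDGP6": 2,
    "JGIxl91": 2,
    "GRCh38": 3,
    "GRCm38": 3,
    "GRCz10": 3,
    "WBcel235": 3,
    "Rnor_6.0": 3,
}


def max_analysis_level(genome=None):
    """Return the highest level listed for `genome`.

    Level 1 works with any genome, so unknown or missing
    identifiers fall back to 1.
    """
    if genome is None:
        return 1
    return MAX_LEVEL_BY_GENOME.get(genome, 1)


if __name__ == "__main__":
    for name in ("GRCz10", "BDGP6", "some_other_genome"):
        print(name, max_analysis_level(name))
```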

Running with Conda

Conda is the most convenient way to install tailseeker. The current tailseeker version depends on old versions of several programs. You can install them without a significant effort in an isolated environment. Try this command:

conda create -n tailseeker -c conda-forge -c bioconda -c qbio tailseeker

As soon as the installation finishes you can use it like this:

conda activate tailseeker

tseek

You need a reference annotation database for level 2 or 3 analyses. Tailseeker picks up downloaded databases if you specify their path in TAILSEEKER_REFDIR.
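Before starting a level 2 or 3 run, it can help to confirm that TAILSEEKER_REFDIR points somewhere sensible. The sketch below assumes TAILSEEKER_REFDIR is read as an environment variable and only checks that it names an existing directory; it makes no assumption about the files tailseeker expects inside. The function name reference_dir is hypothetical.

```python
import os


def reference_dir(environ=None):
    """Return the directory named by TAILSEEKER_REFDIR, or raise RuntimeError.

    Assumption: the variable is an ordinary environment variable.
    Only the existence of the directory is checked, not its contents.
    """
    if environ is None:
        environ = os.environ
    path = environ.get("TAILSEEKER_REFDIR")
    if not path:
        raise RuntimeError("TAILSEEKER_REFDIR is not set")
    if not os.path.isdir(path):
        raise RuntimeError("TAILSEEKER_REFDIR is not a directory: " + path)
    return path


if __name__ == "__main__":
    try:
        print("Reference databases in:", reference_dir())
    except RuntimeError as exc:
        print("Problem:", exc)
```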

Running with Docker

If you have a host running Docker, you can run the tailseeker pipeline without installing anything else. For Apple macOS or Microsoft Windows users, this is the only way to run tailseeker without extensive effort. The current image is not ready to run on multi-node HPC clusters. For those environments, you're encouraged to install the software in the conventional way as described later.

Download the image and a wrapper script:

docker pull hyeshik/tailseeker:latest

curl -L http://bit.ly/tseek-docker > tseek
chmod 755 tseek

Prepare a project configuration on this page. Enter /data in the “Data dir.” field instead of the original path.

Set the environment variables up:

# Point to the directory holding the raw data from an Illumina sequencer
export TAILSEEKER_DATADIR=/storage/150922_M01178_0123_00000000-ACB72

# Create an empty directory for new temporary and output files
mkdir myproject    # replace myproject with your favorite name
cd myproject
cat > tailseeker.yaml
# and paste the content generated from the settings web page.
# Press Ctrl-D.

Run the pipeline:

../tseek -j

Then, the results will be located in the current directory.

When you run an analysis that refers to the genome (levels 2 and 3), you need to extend the Docker image with a genome reference database. Build the Docker image from an empty directory like this:

curl -L http://bit.ly/Dockerfile-withref > Dockerfile
docker build -t tailseeker:GRCz10 --build-arg genome=GRCz10 .

Then, you'll need to define an environment variable before running tseek to use your own Docker image.

export TAILSEEKER_IMAGE=tailseeker:GRCz10

Non-Docker installations

Installing tailseeker

You can install tailseeker from either a source distribution or a binary package. The binary package includes many pre-compiled external programs that were built on an x64 Linux box with Ubuntu 16.04. For other environments, it is recommended to install from the source package.

Installing from a full binary bundle

Download a tarball from the download section. Extract the files into an appropriate place inside your filesystem.

wget {the download URL}
tar -xzf tailseeker-3.x.x-bundle-ubuntu_xenial.tar.gz
cd tailseeker-3.x.x-bundle-ubuntu_xenial

Install Python modules that are used in tailseeker using pip.

pip3 install --user --upgrade --requirement install/requirements.txt

Add the bin/ subdirectory of the tailseeker top directory to your PATH. To continue using tailseeker later, you will need to add this to a shell startup script such as .bashrc or .zshrc according to your login shell.

export PATH="{PATH_TO}/tailseeker-3.x.x-bundle-ubuntu_xenial/bin:$PATH"

Now you can invoke the tailseeker pipeline with the tseek command from anywhere. Proceed to generate the genome reference database.

Installing from a source package

Prerequisite software

Here is the list of software that must be installed before using tailseeker.

Essential dependencies
- Python 3.3 or higher
- pkg-config
- bash
- wget
- make and a C compilation toolchain
- whiptail
- htslib – htslib depends on zlib 1.2.4 or later. If you are using an old system released before 2010, you may need to upgrade zlib first.
- Python packages that can be easily installed using pip (see below)
  - Snakemake - 3.5 or higher
  - colormath
  - matplotlib
  - NumPy
  - SciPy
  - pandas
  - PyYAML
Required only for optional gene-level statistics
- STAR
- samtools
- bedtools
- seqtk
- GNU parallel
- feather
- Python lzma module
- XlsxWriter
Optional for more sensitive analysis
- All Your Bases - requires my patch to work with the recent Illumina sequencers.
- GSNAP

The toolchains and generic command line utilities can be installed if you're an administrator on a Debian or Ubuntu system:

sudo apt install whiptail pkg-config gcc wget make

You can install the Python modules in the list with this command from the top source directory:

pip3 install --user -r install/requirements.txt

Installation

A script in the top directory will check the paths of prerequisite tools and guide you to set configurations correctly. Please run:

./setup.sh

Proceed to generate the genome reference database.

Running the pipeline

Generating genome reference databases

First of all, build the reference databases unless you're going to run tailseeker in genome-independent mode (the level 1 analysis).

cd {tailseeker home}/refdb/level2 && snakemake -j -- {genome}
cd {tailseeker home}/refdb/level3 && snakemake -j -- {genome}

Type the identifier of the genome to be used in place of {genome}. The list of available genomes is shown in the first section of this tutorial.

Running the pipeline

Copy the full output hierarchy from MiSeq or HiSeq to somewhere on your machine.
Create an empty work directory. It stores the final result files and the intermediate files, which you may want to look into when something goes wrong.
Prepare a settings file on this page. Paste the content into a new file tailseeker.yaml inside the work directory.

Run the pipeline with one of these commands:

# If you have access to a job queuing system on a cluster, change 150 to the
# maximum number of jobs that you can put into the queue at a time.
tseek -c qsub -j 150

# If you have a single multi-core machine:
tseek -j

All Snakemake options can be used in tseek, too.

Take a look at qcplots/ in the work directory. The plots there show how accurate the poly(A) length calling was.
Perform the downstream analyses using the output files.

Pre-built genome resource packages

Instead of building a resource database yourself, you can download one of the pre-built packages, which are updated from time to time. Here are the pointers to those files.

Date	Species	Genome
Dec 5, 2016	Danio rerio	GRCz10
Dec 5, 2016	Drosophila melanogaster	BDGP6
Dec 5, 2016	Caenorhabditis elegans	WBcel235
Dec 15, 2016	Mus musculus	GRCm38
Dec 15, 2016	Homo sapiens	GRCh38
Dec 16, 2016	Xenopus laevis	JGIxl91

Data outputs

Read name format (analysis level 1 only)

In analysis level 1 output, FASTQ files contain nucleotide sequences and quality scores, with poly(A) tail information encoded in the read names. For the higher analysis levels, read names include only minimal identifiers. In that case, use refined-taginfo/*.txt.gz for the tailing status of each read.

In the level 1 analysis, FASTQ files are located in fastq/. The directory contains _R5.fastq.gz and _R3.fastq.gz files for each sample. _R5 holds the sequences from the 5′-end of the RNA fragments, which is generally sequenced by read 1. _R3 is from the other end.
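As a rough illustration of this layout, the following Python sketch pairs the two files of each sample. It assumes that everything before the _R5.fastq.gz or _R3.fastq.gz suffix is the sample name, which the text above does not state explicitly; paired_fastq_files is a hypothetical helper, not part of tailseeker.

```python
from pathlib import Path

R5_SUFFIX = "_R5.fastq.gz"
R3_SUFFIX = "_R3.fastq.gz"


def paired_fastq_files(fastq_dir="fastq"):
    """Map sample name -> (R5 path, R3 path) for samples that have both files.

    Assumption: the part of the file name before the suffix is the sample name.
    """
    fastq_dir = Path(fastq_dir)
    pairs = {}
    for r5 in sorted(fastq_dir.glob("*" + R5_SUFFIX)):
        sample = r5.name[: -len(R5_SUFFIX)]
        r3 = fastq_dir / (sample + R3_SUFFIX)
        if r3.exists():
            pairs[sample] = (r5, r3)
    return pairs


if __name__ == "__main__":
    for sample, (r5, r3) in paired_fastq_files().items():
        print(sample, r5, r3)
```

Samples with only one of the two files are skipped rather than reported, which keeps the sketch short; a stricter script might raise an error instead.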

Each sequence entry has an identifier with the following structure:

```
(1)   (2)      (3) (4) (5)(6)
a1101:00003863:0012:17:10:TT

(1) Tile number with an internal lane identifier.
(2) Serial number of the sequence, which is unique in the tile.
(3) Flags in hexadecimal representing data processing procedure of the read.
(4) Length of poly(A) tail.
(5) Length of additional modifications to poly(A).
(6) Post-poly(A) nucleotide additions.
```
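The six colon-separated fields can be split apart with a few lines of Python. This is a minimal sketch rather than tailseeker's own parser: the names TailRead and parse_read_name are hypothetical, and it assumes that fields (4) and (5) are decimal integers, that a leading "@" from the FASTQ header may be present, and that any text after whitespace can be ignored. The serial number is kept as a string because its base is not stated.

```python
from collections import namedtuple

# Field names are illustrative; they follow the numbered legend (1) to (6).
TailRead = namedtuple(
    "TailRead", "tile serial flags polya_len mod_len mods"
)


def parse_read_name(name):
    """Split a level 1 read name into its six fields.

    Flags are hexadecimal per the legend; the two lengths are
    assumed to be decimal.
    """
    core = name.lstrip("@").split()[0]
    tile, serial, flags, polya_len, mod_len, mods = core.split(":")
    return TailRead(
        tile=tile,
        serial=serial,
        flags=int(flags, 16),
        polya_len=int(polya_len),
        mod_len=int(mod_len),
        mods=mods,
    )


if __name__ == "__main__":
    read = parse_read_name("a1101:00003863:0012:17:10:TT")
    print(read.tile, read.polya_len, hex(read.flags), read.mods)
```

A name with an unexpected number of fields raises ValueError from the tuple unpacking, which is usually what you want when a file from a higher analysis level is passed in by mistake.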

Flags in the third field are encoded by combinations of the following bits:

Bit (decimal)	Bit (hexadecimal)	Description
1	0x0001	A poly(A) tail is detected
2	0x0002	Delimiter sequence is matched with one or more mismatches
4	0x0004	Has a post-poly(A) modification
8	0x0008	Poly(A) length is measured using fluorescence signal
16	0x0010	Index sequence is matched to a sample with one or more mismatches
32	0x0020	Delimiter sequence is found at a shifted position
64	0x0040	One or more cycles in 3′-read are dark (no fluorescence signal)
128	0x0080	Delimiter sequence is not found
256	0x0100	Basecalling quality of balancer region is bad
512	0x0200	Nucleotide composition of balancer region is biased
1024	0x0400	Fluorescence signal in balancer region is irregular or too dark
2048	0x0800	Number of dark cycles in read 2 exceeds the threshold
4096	0x1000	(level 2) 5′-read and 3′-read are aligned to two very distant positions in the genome
8192	0x2000	(level 2) 3′-read is aligned to a position adjacent to an expected polyadenylation site
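Because the flag field is a bitwise combination, a value is decoded by testing each bit in turn. The Python sketch below illustrates this with descriptions taken from the table above, lightly shortened; FLAG_BITS and decode_flags are hypothetical names, not part of tailseeker.

```python
# Bit values and descriptions follow the flag table; names are illustrative.
FLAG_BITS = {
    0x0001: "A poly(A) tail is detected",
    0x0002: "Delimiter sequence is matched with one or more mismatches",
    0x0004: "Has a post-poly(A) modification",
    0x0008: "Poly(A) length is measured using fluorescence signal",
    0x0010: "Index sequence is matched to a sample with one or more mismatches",
    0x0020: "Delimiter sequence is found at a shifted position",
    0x0040: "One or more cycles in 3'-read are dark",
    0x0080: "Delimiter sequence is not found",
    0x0100: "Basecalling quality of balancer region is bad",
    0x0200: "Nucleotide composition of balancer region is biased",
    0x0400: "Fluorescence signal in balancer region is irregular or too dark",
    0x0800: "Number of dark cycles in read 2 exceeds the threshold",
    0x1000: "(level 2) 5'-read and 3'-read align to very distant positions",
    0x2000: "(level 2) 3'-read aligns next to an expected polyadenylation site",
}


def decode_flags(flags):
    """Return the descriptions of all bits set in `flags`.

    `flags` may be an int or a hexadecimal string such as "0012",
    as it appears in the third field of a read name.
    """
    value = int(flags, 16) if isinstance(flags, str) else flags
    return [text for bit, text in sorted(FLAG_BITS.items()) if value & bit]


if __name__ == "__main__":
    # "0012" is the flag field of the example read name; bits 0x0002 and 0x0010 are set.
    for line in decode_flags("0012"):
        print(line)
```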

Software licenses

The tailseeker suite

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

`strstr` implementation from the FreeBSD libc (`src/contrib/my_strstr.c`)

This code is derived from software contributed to Berkeley by Chris Torek.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

SIMD Smith-Waterman alignment library (`src/contrib/ssw.c` and `src/contrib/ssw.h`)

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

INIH configuration file parser (`src/contrib/ini.c` and `src/contrib/ini.h`)

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of Ben Hoyt nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY BEN HOYT ''AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL BEN HOYT BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
