SemiBin: Semi-supervised Metagenomic Binning Using Siamese Neural Networks

Command-line tool for metagenomic binning with semi-supervised deep learning, using information from reference genomes. It runs on Linux and macOS.

License: MIT

CONTACT US: This tool is still in development. You are welcome to try it out and feedback is appreciated, but expect some bugs/rapid changes until it stabilizes. Please use GitHub issues for bug reports and the SemiBin users mailing-list for more open-ended discussions or questions.

If you use this software in a publication please cite:

SemiBin: Incorporating information from reference genomes with semi-supervised deep learning leads to better metagenomic assembled genomes (MAGs) Shaojun Pan, Chengkai Zhu, Xing-Ming Zhao, Luis Pedro Coelho bioRxiv 2021.08.16.456517; doi: https://doi.org/10.1101/2021.08.16.456517

Basic usage of SemiBin

A tutorial on running SemiBin from scratch can be found in the SemiBin tutorial.

Installation:

conda create -n SemiBin
conda activate SemiBin
conda install -c conda-forge -c bioconda semibin

The inputs to SemiBin are contigs (assembled from the reads) and BAM files (reads mapped to the contigs). The docs describe how to generate these inputs starting from a metagenome.

Running with single-sample binning (for example: human gut samples):

SemiBin single_easy_bin -i contig.fa -b *.bam -o output --environment human_gut

Running with multi-sample binning:

SemiBin multi_easy_bin -i contig_whole.fa -b *.bam -o output -s :

The output includes the bins in the output_recluster_bins directory (including the bin.*.fa and recluster.*.fa).

Please find more options and details below and read the docs.

Advanced Installation

SemiBin runs on Python 3.7-3.9.

Bioconda

The simplest installation is shown above. However, if you want to run SemiBin on a GPU (which is faster if you have one available), you also need to install PyTorch with GPU support:

conda create -n SemiBin
conda activate SemiBin
conda install -c conda-forge -c bioconda semibin
conda install -c pytorch-lts pytorch torchvision torchaudio cudatoolkit=10.2

MacOS note: you can only install the CPU version of PyTorch in MacOS with conda and you need to install from source to take advantage of a GPU (see #72). For more information on how to install PyTorch, see their documentation.

Source

You will need the dependencies named in the commands below. The easiest way to install them is with conda:

conda install -c conda-forge -c bioconda mmseqs2=13.45111 # (for GTDB support)
conda install -c bioconda bedtools hmmer prodigal
conda install -c bioconda fraggenescan

Once the dependencies are installed, you can install SemiBin by running:

python setup.py install

Examples of binning

NOTE: The SemiBin API is a work-in-progress. The examples refer to version 0.6, but this may change in the near future (after the release of version 1.0, we expect to freeze the API for at least 5 years). We are very happy to hear any feedback on API design, though.

SemiBin supports single-sample, co-assembly, and multi-sample binning. Here we show the easy modes as examples. For details and examples of every SemiBin subcommand, please read the docs.

Easy single/co-assembly binning mode

Single sample and co-assembly are handled the same way by SemiBin.

You will need the following inputs:

  1. A contig file (contig.fa in the example below)
  2. BAM file(s) from mapping short reads to the contigs (mapped_reads.bam in the example below)

The single_easy_bin command can be used to produce results in a single step.

For example:

SemiBin \
    single_easy_bin \
    --input-fasta contig.fa \
    --input-bam mapped_reads.bam \
    --environment human_gut \
    --output output

Alternatively, you can train a new model for that sample, by not passing in the --environment flag:

SemiBin \
    single_easy_bin \
    --input-fasta contig.fa \
    --input-bam mapped_reads.bam \
    --output output

The following environments are supported:

  • human_gut
  • dog_gut
  • ocean
  • soil
  • cat_gut
  • human_oral
  • mouse_gut
  • pig_gut
  • built_environment
  • wastewater
  • global

The global environment can be used if none of the others is appropriate. Note that training a new model can take a lot of time and disk space. Some patience will be required. If you have a lot of samples from the same environment, you can also train a new model from them and reuse it.
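The choice between a pretrained environment and training a new model can be sketched as a small Python wrapper that assembles the command line. This is only an illustration: `build_single_easy_bin_command` is a hypothetical helper, not part of SemiBin, and it uses only the flags and environment names shown above.

```python
import shlex

# Environments listed in this README as supported.
SUPPORTED_ENVIRONMENTS = {
    "human_gut", "dog_gut", "ocean", "soil", "cat_gut", "human_oral",
    "mouse_gut", "pig_gut", "built_environment", "wastewater", "global",
}


def build_single_easy_bin_command(contigs, bams, output, environment=None):
    """Return the argument list for a `SemiBin single_easy_bin` call.

    Hypothetical helper (not part of SemiBin). Passing environment=None
    leaves out --environment, which the README describes as training a
    new model for the sample.
    """
    if environment is not None and environment not in SUPPORTED_ENVIRONMENTS:
        raise ValueError(f"unsupported environment: {environment!r}")
    command = ["SemiBin", "single_easy_bin",
               "--input-fasta", contigs,
               "--input-bam", *bams]
    if environment is not None:
        command += ["--environment", environment]
    command += ["--output", output]
    return command


if __name__ == "__main__":
    cmd = build_single_easy_bin_command(
        "contig.fa", ["mapped_reads.bam"], "output", environment="human_gut")
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once SemiBin is installed in the active environment.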

Easy multi-sample binning mode

The multi_easy_bin command runs multi-sample binning in a single step.

You will need the following inputs:

  1. A combined contig file
  2. BAM files from mapping

Every contig name must have the format <sample_name>:<contig_name>, where : is the default separator (it can be changed with the --separator argument). NOTE: Make sure that the sample names are unique and that the separator cannot cause ambiguity when the names are split. For example:

>S1:Contig_1
AGATAATAAAGATAATAATA
>S1:Contig_2
CGAATTTATCTCAAGAACAAGAAAA
>S1:Contig_3
AAAAAGAGAAAATTCAGAATTAGCCAATAAAATA
>S2:Contig_1
AATGATATAATACTTAATA
>S2:Contig_2
AAAATATTAAAGAAATAATGAAAGAAA
>S3:Contig_1
ATAAAGACGATAAAATAATAAAAGCCAAATCCGACAAAGAAAGAACGG
>S3:Contig_2
AATATTTTAGAGAAAGACATAAACAATAAGAAAAGTATT
>S3:Contig_3
CAAATACGAATGATTCTTTATTAGATTATCTTAATAAGAATATC
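Before running the binning, it can help to check that every header in the combined file follows this naming scheme. The sketch below is a hypothetical check, not part of SemiBin. As a conservative assumption of this sketch, it treats any name in which the separator does not appear exactly once as ambiguous.

```python
from collections import Counter


def count_contigs_per_sample(fasta_lines, separator=":"):
    """Count contigs per sample from the headers of a combined FASTA.

    Raises ValueError for a header that does not look like
    <sample_name><separator><contig_name>. Requiring exactly one
    separator is a choice made by this sketch, not a SemiBin rule.
    """
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        header = line[1:].strip()
        # Only the first whitespace-delimited token is the contig name.
        name = header.split()[0] if header else ""
        if name.count(separator) != 1:
            raise ValueError(f"ambiguous or missing separator in header: {line.strip()!r}")
        sample, contig = name.split(separator)
        if not sample or not contig:
            raise ValueError(f"empty sample or contig name in header: {line.strip()!r}")
        counts[sample] += 1
    return counts


if __name__ == "__main__":
    example = [
        ">S1:Contig_1", "AGATAATAAAGATAATAATA",
        ">S1:Contig_2", "CGAATTTATCTCAAGAACAAGAAAA",
        ">S2:Contig_1", "AATGATATAATACTTAATA",
    ]
    print(dict(count_contigs_per_sample(example)))
```

For a real file, pass an open file handle instead of a list, for example `count_contigs_per_sample(open("concatenated.fa"))`.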

You can generate the combined contig file with the concatenate_fasta command:

SemiBin concatenate_fasta -i contig*.fa -o output
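To make the renaming concrete, the following sketch shows one way per-sample contigs could be combined by hand under the <sample_name>:<contig_name> scheme. It is not SemiBin's implementation, and how concatenate_fasta derives sample names is not covered here. The functions `prefix_contig_names` and `combine_samples` are hypothetical, and sample names are passed in explicitly.

```python
def prefix_contig_names(sample_name, fasta_lines, separator=":"):
    """Yield FASTA lines with each header renamed to <sample><separator><contig>."""
    if separator in sample_name:
        raise ValueError("sample name must not contain the separator")
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield f">{sample_name}{separator}{line[1:]}"
        else:
            yield line  # sequence lines are passed through unchanged


def combine_samples(samples, separator=":"):
    """Combine a {sample_name: fasta_lines} mapping into one list of lines."""
    combined = []
    for sample_name, fasta_lines in samples.items():
        combined.extend(prefix_contig_names(sample_name, fasta_lines, separator))
    return combined


if __name__ == "__main__":
    samples = {
        "S1": [">Contig_1", "AGATAATAAAGATAATAATA"],
        "S2": [">Contig_1", "AATGATATAATACTTAATA"],
    }
    print("\n".join(combine_samples(samples)))
```

In practice the concatenate_fasta command above is the simpler route; the sketch only illustrates what the combined names look like.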

You can then run the binning with a single command:

SemiBin multi_easy_bin -i concatenated.fa -b *.bam -o output

Output

The output folder will contain:

  1. Datasets used for training and clustering
  2. Saved semi-supervised deep learning model
  3. Output bins
  4. Some intermediate files

By default, the reconstructed bins are in the output_recluster_bins directory.

For more details about the output, read the docs.
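As a quick sanity check after a run, the bins can be summarized with a few lines of Python. This is a minimal sketch with two assumptions: that output_recluster_bins sits directly inside the folder passed as the output, and that the bins are uncompressed FASTA files ending in .fa. `list_bins` is a hypothetical helper, not part of SemiBin.

```python
from pathlib import Path


def list_bins(output_dir):
    """Return (file name, number of contigs, total bases) for each bin.

    Assumes bins are uncompressed *.fa files in
    <output_dir>/output_recluster_bins (an assumption of this sketch).
    """
    bins_dir = Path(output_dir) / "output_recluster_bins"
    summary = []
    for fasta in sorted(bins_dir.glob("*.fa")):
        n_contigs = 0
        n_bases = 0
        with fasta.open() as handle:
            for line in handle:
                if line.startswith(">"):
                    n_contigs += 1  # one header per contig
                else:
                    n_bases += len(line.strip())
        summary.append((fasta.name, n_contigs, n_bases))
    return summary


if __name__ == "__main__":
    for name, n_contigs, n_bases in list_bins("output"):
        print(f"{name}\t{n_contigs} contigs\t{n_bases} bp")
```

If the directory does not exist, the function returns an empty list, so an empty result may mean either no bins or a wrong path.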
