
sacdallago / Bio_embeddings

License: MIT
Get protein embeddings from protein sequences


Projects that are alternatives to or similar to Bio_embeddings

pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Stars: ✭ 132 (+53.49%)
Mutual labels:  pipeline, language-model
Ngseasy
Dockerised Next Generation Sequencing Pipeline (QC, Align, Calling, Annotation)
Stars: ✭ 80 (-6.98%)
Mutual labels:  pipeline
Gpt2
PyTorch Implementation of OpenAI GPT-2
Stars: ✭ 64 (-25.58%)
Mutual labels:  language-model
Atacseq
ATAC-seq peak-calling, QC and differential analysis pipeline
Stars: ✭ 72 (-16.28%)
Mutual labels:  pipeline
Stetl
Stetl, Streaming ETL, is a lightweight geospatial processing and ETL framework written in Python.
Stars: ✭ 64 (-25.58%)
Mutual labels:  pipeline
Mlbox
MLBox is a powerful Automated Machine Learning python library.
Stars: ✭ 1,199 (+1294.19%)
Mutual labels:  pipeline
Scrapy S3pipeline
Scrapy pipeline to store chunked items into Amazon S3 or Google Cloud Storage bucket.
Stars: ✭ 57 (-33.72%)
Mutual labels:  pipeline
Terraform Aws Ecs Codepipeline
Terraform Module for CI/CD with AWS Code Pipeline and Code Build for ECS https://cloudposse.com/
Stars: ✭ 85 (-1.16%)
Mutual labels:  pipeline
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (-8.14%)
Mutual labels:  pipeline
Full stack transformer
Pytorch library for end-to-end transformer models training, inference and serving
Stars: ✭ 71 (-17.44%)
Mutual labels:  language-model
Nezha chinese pytorch
NEZHA: Neural Contextualized Representation for Chinese Language Understanding
Stars: ✭ 65 (-24.42%)
Mutual labels:  language-model
Irap
integrated RNA-seq Analysis Pipeline
Stars: ✭ 65 (-24.42%)
Mutual labels:  pipeline
Gitlab Branch Source Plugin
Jenkins-Plugin to create a multi-branch-project from gitlab
Stars: ✭ 76 (-11.63%)
Mutual labels:  pipeline
Indonesian Language Models
Indonesian Language Models and its Usage
Stars: ✭ 64 (-25.58%)
Mutual labels:  language-model
Hookah
A cross-platform tool for data pipelines.
Stars: ✭ 83 (-3.49%)
Mutual labels:  pipeline
Char rnn lm zh
Language model in Chinese, implemented following the official PyTorch documentation
Stars: ✭ 57 (-33.72%)
Mutual labels:  language-model
Cross Domain ner
Cross-domain NER using cross-domain language modeling, code for ACL 2019 paper
Stars: ✭ 67 (-22.09%)
Mutual labels:  language-model
Flowr
Robust and efficient workflows using a simple language agnostic approach
Stars: ✭ 73 (-15.12%)
Mutual labels:  pipeline
Clusterflow
A pipelining tool to automate and standardise bioinformatics analyses on cluster environments.
Stars: ✭ 85 (-1.16%)
Mutual labels:  pipeline
Greek Bert
A Greek edition of BERT pre-trained language model
Stars: ✭ 84 (-2.33%)
Mutual labels:  language-model

Bio Embeddings

Resources to learn about bio_embeddings:

Project aims:

  • Facilitate the use of language-model-based biological sequence representations for transfer learning by providing a single, consistent, close-to-zero-friction interface
  • Reproducible workflows
  • Depth of representation (different models from different labs, trained on different datasets for different purposes)
  • Extensive examples, complexity handled for users (e.g. CUDA OOM abstraction), and well-documented warnings and error messages

The project includes:

  • General-purpose python embedders based on open models trained on biological sequences (SeqVec, ProtTrans, UniRep, ...)
  • A pipeline which:
    • embeds sequences into matrix representations (per-amino-acid) or vector representations (per-sequence) that can be used to train machine learning models or for analytical purposes
    • projects per-sequence embeddings into lower-dimensional representations using UMAP or t-SNE (for lightweight data handling and visualization)
    • visualizes low-dimensional sets of per-sequence embeddings in 2D and 3D interactive plots (with and without annotations)
    • extracts annotations from per-sequence and per-amino-acid embeddings using supervised (when available) and unsupervised approaches (e.g. by network analysis)
  • A webserver that wraps the pipeline into a distributed API for scalable and consistent workflows
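The projection stage above reduces per-sequence embeddings to two or three dimensions for plotting. A minimal sketch of what that stage computes, using scikit-learn's t-SNE on random stand-ins for the high-dimensional vectors a real embedder would produce:

```python
# Sketch: project per-sequence embeddings to 2D, as the pipeline's
# projection stage does. The embeddings here are random stand-ins
# for real 1024-dimensional protein embeddings.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
per_sequence_embeddings = rng.normal(size=(20, 1024))  # 20 proteins

# perplexity must be smaller than the number of samples
projection = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(
    per_sequence_embeddings
)
print(projection.shape)  # one 2D point per protein: (20, 2)
```

The resulting 2D coordinates are what the visualization stage then renders as interactive scatter plots.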

Installation

You can install bio_embeddings via pip or use it via docker.

Pip

Install the pipeline like so:

pip install bio-embeddings[all]

To get the latest features, please install the pipeline like so:

pip install -U "bio-embeddings[all] @ git+https://github.com/sacdallago/bio_embeddings.git"

Docker

We provide a docker image at ghcr.io/bioembeddings/bio_embeddings. Simple usage example:

docker run --rm --gpus all \
    -v "$(pwd)/examples/docker":/mnt \
    -v bio_embeddings_weights_cache:/root/.cache/bio_embeddings \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    ghcr.io/bioembeddings/bio_embeddings:v0.1.6 /mnt/config.yml

See the docker example in the examples folder for instructions. You can also use ghcr.io/bioembeddings/bio_embeddings:latest which is built from the latest commit.

Installation notes:

bio_embeddings was developed for unix machines with GPU capabilities and CUDA installed. If your setup diverges from this, you may encounter some inconsistencies (e.g. speed is significantly affected by the absence of a GPU and CUDA). For Windows users, we strongly recommend the use of Windows Subsystem for Linux.

What model is right for you?

Each model has its strengths and weaknesses (speed, specificity, memory footprint, ...). There is no one-size-fits-all model, and we encourage you to try at least two different models when attempting a new exploratory project.

The models prottrans_bert_bfd, prottrans_albert_bfd, seqvec and prottrans_xlnet_uniref100 were all trained with the goal of systematic predictions. From this pool, we believe the optimal model to be prottrans_bert_bfd, followed by seqvec, which has been established for longer and uses a different principle (LSTM vs Transformer).

Usage and examples

We highly recommend checking out the examples folder for pipeline examples, and the notebooks folder for post-processing pipeline runs and general-purpose use of the embedders.

After having installed the package, you can:

  1. Use the pipeline like:

    bio_embeddings config.yml
    

    A blueprint of the configuration file, and an example setup can be found in the examples directory of this repository.

  2. Use the general purpose embedder objects via python, e.g.:

    from bio_embeddings.embed import SeqVecEmbedder
    
    embedder = SeqVecEmbedder()
    
    embedding = embedder.embed("SEQVENCE")
    

    More examples can be found in the notebooks folder of this repository.
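The `embed` call above yields a per-amino-acid (per-residue) representation; per-sequence vectors are obtained by pooling over the residue axis. A minimal numpy sketch of that reduction, assuming mean-pooling (the library's embedders expose comparable reduction helpers):

```python
# Sketch: reduce a per-amino-acid embedding (one vector per residue)
# to a single per-sequence vector by mean-pooling over residues.
import numpy as np

def reduce_per_protein(per_residue: np.ndarray) -> np.ndarray:
    """Average over the residue axis: (L, d) -> (d,)."""
    return per_residue.mean(axis=0)

# Stand-in for a real embedding of an 8-residue sequence
per_residue = np.random.default_rng(0).normal(size=(8, 1024))
per_protein = reduce_per_protein(per_residue)
print(per_protein.shape)  # (1024,)
```

Per-sequence vectors like this are what the projection and visualization stages of the pipeline consume.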

Cite

While we are working on a proper publication, if you are already using this tool, we would appreciate if you could cite the following poster:

Dallago C, Schütze K, Heinzinger M et al. bio_embeddings: python pipeline for fast visualization of protein features extracted by language models [version 1; not peer reviewed]. F1000Research 2020, 9(ISCB Comm J):876 (poster) (doi: 10.7490/f1000research.1118163.1)

Contributors

  • Christian Dallago (lead)
  • Konstantin Schütze
  • Tobias Olenyi
  • Michael Heinzinger

Development status

Pipeline stages
Web server (unpublished)
  • [x] SeqVec supervised predictions
  • [x] Bert supervised predictions
  • [ ] SeqVec unsupervised predictions for GO: CC, BP,..
  • [ ] Bert unsupervised predictions for GO: CC, BP,..
  • [ ] SeqVec unsupervised predictions for SwissProt (just a link to the 1st-k-nn)
  • [ ] Bert unsupervised predictions for SwissProt (just a link to the 1st-k-nn)
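The SwissProt items above transfer annotations from the first nearest neighbour in embedding space. A minimal sketch of such a 1-NN lookup, using Euclidean distance on illustrative toy data (a real run would use SwissProt embeddings and their annotations):

```python
# Sketch: unsupervised annotation transfer via 1-nearest-neighbour
# lookup in embedding space. All data here is illustrative.
import numpy as np

def transfer_annotation(query, reference_embeddings, reference_labels):
    """Return the label of the reference embedding closest (Euclidean) to query."""
    distances = np.linalg.norm(reference_embeddings - query, axis=1)
    return reference_labels[int(np.argmin(distances))]

reference = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = ["membrane", "cytoplasm"]
print(transfer_annotation(np.array([0.5, -0.2]), reference, labels))  # membrane
```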
General purpose embedders