pnpnpn / Dna2vec

Licence: mit
dna2vec: Consistent vector representations of variable-length k-mers

Programming Languages

python

Projects that are alternatives of or similar to Dna2vec

lda2vec
Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec from this paper https://arxiv.org/abs/1605.02019
Stars: ✭ 27 (-76.92%)
Mutual labels:  word2vec, word-embeddings, embeddings
word2vec-tsne
Google News and Leo Tolstoy: Visualizing Word2Vec Word Embeddings using t-SNE.
Stars: ✭ 59 (-49.57%)
Mutual labels:  word2vec, word-embeddings, embeddings
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+1091.45%)
Mutual labels:  word2vec, embeddings, word-embeddings
Dict2vec
Dict2vec is a framework to learn word embeddings using lexical dictionaries.
Stars: ✭ 91 (-22.22%)
Mutual labels:  word2vec, embeddings, word-embeddings
sentiment-analysis-of-tweets-in-russian
Sentiment analysis of tweets in Russian using Convolutional Neural Networks (CNN) with Word2Vec embeddings.
Stars: ✭ 51 (-56.41%)
Mutual labels:  word2vec, word-embeddings, embeddings
Text2vec
Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
Stars: ✭ 715 (+511.11%)
Mutual labels:  word2vec, word-embeddings
Philo2vec
An implementation of word2vec applied to [stanford philosophy encyclopedia](http://plato.stanford.edu/)
Stars: ✭ 33 (-71.79%)
Mutual labels:  word2vec, embeddings
Embeddingsviz
Visualize word embeddings of a vocabulary in TensorBoard, including the neighbors
Stars: ✭ 40 (-65.81%)
Mutual labels:  embeddings, word-embeddings
Chinese Word Vectors
100+ Chinese Word Vectors 上百种预训练中文词向量
Stars: ✭ 9,548 (+8060.68%)
Mutual labels:  embeddings, word-embeddings
Vectorhub
Vector Hub - Library for easy discovery, and consumption of State-of-the-art models to turn data into vectors. (text2vec, image2vec, video2vec, graph2vec, bert, inception, etc)
Stars: ✭ 317 (+170.94%)
Mutual labels:  word2vec, embeddings
Word2vec Win32
A word2vec port for Windows.
Stars: ✭ 41 (-64.96%)
Mutual labels:  word2vec, word-embeddings
Russian news corpus
Russian mass media stemmed texts corpus / Корпус лемматизированных (морфологически нормализованных) текстов российских СМИ
Stars: ✭ 76 (-35.04%)
Mutual labels:  ml, word2vec
Deep learning nlp
Keras, PyTorch, and NumPy Implementations of Deep Learning Architectures for NLP
Stars: ✭ 407 (+247.86%)
Mutual labels:  word2vec, word-embeddings
Lmdb Embeddings
Fast word vectors with little memory usage in Python
Stars: ✭ 404 (+245.3%)
Mutual labels:  word2vec, embeddings
Finalfusion Rust
finalfusion embeddings in Rust
Stars: ✭ 35 (-70.09%)
Mutual labels:  word2vec, embeddings
Wego
Word Embeddings (e.g. Word2Vec) in Go!
Stars: ✭ 336 (+187.18%)
Mutual labels:  word2vec, word-embeddings
Deeplearning Nlp Models
A small, interpretable codebase containing the re-implementation of a few "deep" NLP models in PyTorch. Colab notebooks to run with GPUs. Models: word2vec, CNNs, transformer, gpt.
Stars: ✭ 64 (-45.3%)
Mutual labels:  word2vec, embeddings
Postgres Word2vec
utils to use word embedding like word2vec vectors in a postgres database
Stars: ✭ 96 (-17.95%)
Mutual labels:  word2vec, word-embeddings
Text Summarizer
Python Framework for Extractive Text Summarization
Stars: ✭ 96 (-17.95%)
Mutual labels:  word2vec, word-embeddings
Hub
A library for transfer learning by reusing parts of TensorFlow models.
Stars: ✭ 3,007 (+2470.09%)
Mutual labels:  ml, embeddings

dna2vec

Dna2vec is an open-source library to train distributed representations of variable-length k-mers.

For more information, please refer to the paper: dna2vec: Consistent vector representations of variable-length k-mers
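To make "variable-length k-mers" concrete, here is a minimal sketch that slides a window over a DNA string and collects every k-mer for each k in a range. The function names and the default range of 3 to 8 (taken from the pretrained file name) are illustrative assumptions; this is not how dna2vec itself fragments sequences during training.

```python
def sliding_kmers(sequence, k):
    """Return every length-k substring of `sequence`, moving one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def variable_length_kmers(sequence, k_low=3, k_high=8):
    """Map each k in [k_low, k_high] to the k-mers found in `sequence`.

    Hypothetical helper for illustration only; not part of the dna2vec API.
    """
    return {k: sliding_kmers(sequence, k) for k in range(k_low, k_high + 1)}


if __name__ == "__main__":
    for k, kmers in variable_length_kmers("ACGTACGT", 3, 5).items():
        print(k, kmers)
```

A sequence shorter than k simply yields an empty list for that k.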

Installation

Note that this implementation has only been tested on Python 3.5.3, but we welcome contributions and bug reports that make it more widely usable.

  1. Clone the dna2vec repository: git clone https://github.com/pnpnpn/dna2vec
  2. Install Python dependencies: pip3 install -r requirements.txt
  3. Test the installation: python3 ./scripts/train_dna2vec.py -c configs/small_example.yml

Training dna2vec embeddings

  1. Download hg38 from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz. This will take a while as it's 938MB.
  2. Untar with tar -zxvf hg38.chromFa.tar.gz. You should see FASTA files for chromosomes 1 to 22: chr1.fa, chr2.fa, ..., chr22.fa.
  3. Move the 22 FASTA files to folder inputs/hg38/
  4. Start the training with: python3 ./scripts/train_dna2vec.py -c configs/hg38-20161219-0153.yml
  5. Wait for a couple of days ...
  6. Once the training is done, there should be a dna2vec-<ID>.w2v and a corresponding dna2vec-<ID>.txt file in your results/ directory.
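Since training runs for days, it can be worth confirming that all 22 FASTA files are in place before starting. The following is a minimal sketch of such a check; the helper name is hypothetical and not part of dna2vec, and it assumes the files are named chr1.fa through chr22.fa and live in inputs/hg38/ as described in the steps above.

```python
from pathlib import Path


def missing_chromosome_files(folder, chromosomes=range(1, 23)):
    """Return the expected chrN.fa file names that are absent from `folder`.

    Hypothetical pre-flight check; not provided by dna2vec.
    """
    folder = Path(folder)
    expected = ["chr{}.fa".format(n) for n in chromosomes]
    return [name for name in expected if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_chromosome_files("inputs/hg38")
    if missing:
        print("Missing FASTA files:", ", ".join(missing))
    else:
        print("All 22 FASTA files are present.")
```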

Reading pretrained dna2vec

You can read the pretrained dna2vec vectors in pretrained/dna2vec-*.w2v using the MultiKModel class in dna2vec/multi_k_model.py. For example:

from dna2vec.multi_k_model import MultiKModel

filepath = 'pretrained/dna2vec-20161219-0153-k3to8-100d-10c-29320Mbp-sliding-Xat.w2v'
mk_model = MultiKModel(filepath)

You can fetch the vector representation of AAA with:

>>> mk_model.vector('AAA')
array([ 0.023137  ,  0.156295 , ...

Compute the cosine distance between two k-mers via dna2vec:

>>> mk_model.cosine_distance('AAA', 'GCT')
0.14546435594464155
>>> mk_model.cosine_distance('AAA', 'AAAA')
0.89000147450211231
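For readers who want to see what a cosine measure between two embedding vectors involves, here is a minimal pure-Python sketch. The function names are illustrative assumptions. The term "cosine distance" is used in two ways in practice (the cosine of the angle itself, or one minus that value), and this README does not state which one mk_model.cosine_distance returns, so check dna2vec/multi_k_model.py before relying on either reading.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_minus_cosine(u, v):
    """The other common convention: 0 for parallel vectors, 1 for orthogonal ones."""
    return 1.0 - cosine_similarity(u, v)


if __name__ == "__main__":
    u = [1.0, 2.0, 3.0]
    v = [2.0, 4.0, 6.0]   # same direction as u
    w = [3.0, 0.0, -1.0]  # orthogonal to u
    print(cosine_similarity(u, v), one_minus_cosine(u, v))
    print(cosine_similarity(u, w), one_minus_cosine(u, w))
```

Because both measures depend only on direction, vectors for k-mers of different lengths can be compared as long as they share the same dimensionality.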

FAQ

Does the pre-trained dna2vec data (w2v file) cover all k-mers?

The pre-trained data should cover all k-mers for 3 ≤ k ≤ 8:

>>> [len(mk_model.model(k).vocab) for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
>>> [4**k for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
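Matching counts strongly suggest full coverage, and a direct check is to enumerate all 4**k possible k-mers and look for any that the vocabulary lacks. The sketch below does this with itertools; the helper names are hypothetical, and passing mk_model.model(k).vocab as the vocabulary assumes that iterating over it yields k-mer strings, which this README does not confirm.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Enumerate every possible k-mer over the alphabet, in lexicographic order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def missing_kmers(vocab, k):
    """Return the k-mers of length k that do not appear in `vocab`.

    `vocab` is any iterable of k-mer strings. Hypothetical usage with the
    model above: missing_kmers(mk_model.model(k).vocab, k)
    """
    present = set(vocab)
    return [kmer for kmer in all_kmers(k) if kmer not in present]


if __name__ == "__main__":
    print(len(all_kmers(3)))                          # number of possible 3-mers
    print(missing_kmers(["AAA", "AAC", "AAG"], 3)[:5])
```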

Contribute

I would love for you to fork this project and send me a pull request. Please contribute.

License

This software is licensed under the MIT license.