
Vector Hub - Library for easy discovery, and consumption of State-of-the-art models to turn data into vectors. (text2vec, image2vec, video2vec, graph2vec, bert, inception, etc)

Stars: ✭ 317 (+170.94%)

Mutual labels: word2vec, embeddings

Word2vec Win32

A word2vec port for Windows.

Stars: ✭ 41 (-64.96%)

Mutual labels: word2vec, word-embeddings

Russian news corpus

Russian mass media stemmed texts corpus (a corpus of lemmatized, i.e. morphologically normalized, texts from Russian mass media)

Stars: ✭ 76 (-35.04%)

Mutual labels: ml, word2vec

Deep learning nlp

Keras, PyTorch, and NumPy Implementations of Deep Learning Architectures for NLP

Stars: ✭ 407 (+247.86%)

Mutual labels: word2vec, word-embeddings

Lmdb Embeddings

Fast word vectors with little memory usage in Python

Stars: ✭ 404 (+245.3%)

Mutual labels: word2vec, embeddings

Finalfusion Rust

finalfusion embeddings in Rust

Stars: ✭ 35 (-70.09%)

Mutual labels: word2vec, embeddings

Wego

Word Embeddings (e.g. Word2Vec) in Go!

Stars: ✭ 336 (+187.18%)

Mutual labels: word2vec, word-embeddings

Deeplearning Nlp Models

A small, interpretable codebase containing the re-implementation of a few "deep" NLP models in PyTorch. Colab notebooks to run with GPUs. Models: word2vec, CNNs, transformer, gpt.

Stars: ✭ 64 (-45.3%)

Mutual labels: word2vec, embeddings

Postgres Word2vec

utils to use word embedding like word2vec vectors in a postgres database

Stars: ✭ 96 (-17.95%)

Mutual labels: word2vec, word-embeddings

Text Summarizer

Python Framework for Extractive Text Summarization

Stars: ✭ 96 (-17.95%)

Mutual labels: word2vec, word-embeddings

Hub

A library for transfer learning by reusing parts of TensorFlow models.

Stars: ✭ 3,007 (+2470.09%)

Mutual labels: ml, embeddings


dna2vec

Dna2vec is an open-source library to train distributed representations of variable-length k-mers.

For more information, please refer to the paper: dna2vec: Consistent vector representations of variable-length k-mers

Installation

Note that this implementation has only been tested on Python 3.5.3, but we welcome any contributions or bug reports to make it more accessible.

Clone the dna2vec repository: git clone https://github.com/pnpnpn/dna2vec
Install Python dependencies: pip3 install -r requirements.txt
Test the installation: python3 ./scripts/train_dna2vec.py -c configs/small_example.yml

Training dna2vec embeddings

Download hg38 from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz. This will take a while as it's 938MB.
Untar with tar -zxvf hg38.chromFa.tar.gz. You should see FASTA files for chromosome 1 to 22: chr1.fa, chr2.fa, ..., chr22.fa.
Move the 22 FASTA files to folder inputs/hg38/
Start the training with: python3 ./scripts/train_dna2vec.py -c configs/hg38-20161219-0153.yml
Wait for a couple of days ...
Once the training is done, there should be a dna2vec-<ID>.w2v and a corresponding dna2vec-<ID>.txt file in your results/ directory.
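
Since training runs for days, it can be worth confirming that the input folder is complete before starting. The following is a minimal sketch, not part of dna2vec: the helper name missing_chromosome_files is hypothetical, and it only assumes the chr1.fa to chr22.fa naming and the inputs/hg38/ folder described in the steps above.

```python
from pathlib import Path


def missing_chromosome_files(folder):
    """Return the expected chr1.fa .. chr22.fa names that are absent from folder.

    Hypothetical helper for illustration; dna2vec does not ship this function.
    """
    folder = Path(folder)
    # str.format is used instead of f-strings so the sketch also runs on Python 3.5.
    expected = ["chr{}.fa".format(i) for i in range(1, 23)]
    return [name for name in expected if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_chromosome_files("inputs/hg38")
    if missing:
        print("Missing FASTA files:", ", ".join(missing))
    else:
        print("All 22 chromosome FASTA files are present.")
```

An empty list means all 22 files are in place. The check looks only at file names, not at file contents.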

Reading pretrained dna2vec

You can read the pretrained dna2vec vectors (pretrained/dna2vec-*.w2v) using the MultiKModel class in dna2vec/multi_k_model.py. For example:

from dna2vec.multi_k_model import MultiKModel

filepath = 'pretrained/dna2vec-20161219-0153-k3to8-100d-10c-29320Mbp-sliding-Xat.w2v'
mk_model = MultiKModel(filepath)

You can fetch the vector representation of AAA with:

>>> mk_model.vector('AAA')
array([ 0.023137  ,  0.156295 , ...

Compute the cosine distance between two k-mers via dna2vec:

>>> mk_model.cosine_distance('AAA', 'GCT')
0.14546435594464155
>>> mk_model.cosine_distance('AAA', 'AAAA')
0.89000147450211231
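
As a rough illustration of the quantity involved, the sketch below computes the standard cosine similarity between two vectors in plain Python. It is not taken from the dna2vec source: the function name cosine_similarity is hypothetical, and this README does not state whether mk_model.cosine_distance returns the similarity itself or one minus it, so check the library before comparing numbers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors.

    Hypothetical helper for illustration; accepts any sequences of floats.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance_standard(u, v):
    """The textbook cosine distance, defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)
```

Because the measure depends only on the angle between the vectors, rescaling either vector leaves the result unchanged.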

FAQ

Does the pre-trained dna2vec data (`w2v` file) cover all k-mers?

The pre-trained data should cover all k-mers for 3 ≤ k ≤ 8: for each k, the vocabulary size equals 4**k, the number of possible k-mers over four nucleotides.

>>> [len(mk_model.model(k).vocab) for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
>>> [4**k for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
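
The 4**k figure can be reproduced by enumerating the k-mers directly. This is a minimal sketch that assumes the four-letter alphabet A, C, G, T; the helper name all_kmers is hypothetical and not part of dna2vec.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Return every length-k string over the alphabet, in lexicographic order.

    Hypothetical helper for illustration; dna2vec does not ship this function.
    """
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


# One count per k in 3..8; each equals 4**k.
counts = [len(all_kmers(k)) for k in range(3, 9)]
```

Comparing such a list against the keys of mk_model.model(k).vocab would show which k-mers, if any, are missing from a given model. That comparison is a suggestion, not something the project documents.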

Contribute

I would love for you to fork this project and send me a pull request. Please contribute.

License

This software is licensed under the MIT license.


Stars: ✭ 117

pnpnpn / Dna2vec
