sonvx / word2vecVN

License: Apache-2.0
Pre-trained Word2Vec models for Vietnamese

I. word2vecVN

Word2Vec models for Vietnamese

Download models:

  • Model trained on Vietnamese Wiki: click here.
  • Model trained on Le et al.'s data (window-size 5, 400 dims): click here.
  • Model trained on Le et al.'s data (window-size 2, 300 dims): click here.
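Once a model is downloaded, a typical first step is to load the vectors and query nearest neighbours. The sketch below is only an illustration under stated assumptions: it assumes the vectors are available in the plain-text word2vec format (a `vocab_size dims` header followed by one `word v1 v2 ...` line per word), which this README does not confirm, and it uses a made-up three-word sample in place of a real model file. The helper names (`load_word2vec_text`, `cosine`, `most_similar`) and the underscore-joined tokens are hypothetical and not part of this repository.

```python
import io
import math


def load_word2vec_text(stream):
    """Parse word2vec text format from an open text stream.

    Assumed layout: a 'vocab_size dims' header line, then one
    'word v1 v2 ...' line per vocabulary entry.
    """
    header = stream.readline().split()
    dims = int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dims + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dims


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, word, topn=3):
    """Return the topn (word, similarity) pairs closest to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy data standing in for a downloaded model file (values are made up).
SAMPLE = "3 2\nhà_nội 1.0 0.0\nthủ_đô 0.9 0.1\nchuối 0.0 1.0\n"
vectors, dims = load_word2vec_text(io.StringIO(SAMPLE))
print(most_similar(vectors, "hà_nội", topn=1))
```

For a real file, replace the `io.StringIO` stream with `open(path, encoding="utf-8")`. A pure-Python loop like this is slow on large vocabularies; if the files turn out to be in word2vec binary or text format, gensim's `KeyedVectors.load_word2vec_format` is the usual tool for loading them.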

Visualization:

  • word2vec-visualization (using TensorBoard):
    • Download tf_files: TBA
    • Run $ tensorboard --logdir=./ --port=10001
  • word2vec-simple-visualization: working. See the readme file inside that folder for how to test the model.

Note:

  • This model is trained using data of Le et al. http://mim.hus.vnu.edu.vn/phuonglh/node/72
    • Data information: 7.1G of text with 1,675,819 unique words, from a corpus of 974,393,244 raw words and 97,440 documents. Note that all words are tokenized.
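The vocabulary size above gives a rough upper bound on how much memory the embedding matrix needs once loaded. The sketch below assumes every one of the 1,675,819 unique words is kept in the model and that each value is stored as a 32-bit float. Neither assumption is confirmed by this README: training usually drops rare words, so the real models are likely smaller. The function name `embedding_bytes` is hypothetical.

```python
UNIQUE_WORDS = 1_675_819   # unique words reported for the Le et al. corpus
BYTES_PER_FLOAT32 = 4      # assumption: vectors stored as 32-bit floats


def embedding_bytes(vocab_size, dims, bytes_per_value=BYTES_PER_FLOAT32):
    """Size in bytes of a dense vocab_size x dims embedding matrix."""
    return vocab_size * dims * bytes_per_value


# The two dimensionalities listed for the Le et al. models.
for dims in (400, 300):
    gib = embedding_bytes(UNIQUE_WORDS, dims) / 2**30
    print(f"{dims} dims: about {gib:.2f} GiB")
```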

II. How do I cite?

Please cite the arXiv paper whenever ETNLP (or the pre-trained embeddings) is used to produce published results or is incorporated into other software:

@article{vu:2019n,
  title={ETNLP: A Visual-Aided Systematic Approach to Select Pre-Trained Embeddings for a Downstream Task},
  author={Xuan-Son Vu and Thanh Vu and Son N. Tran and Lili Jiang},
  journal={Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP)},
  year={2019}
}

@misc{word2vecvn_2016,
  author = {Xuan-Son Vu},
  title = {Pre-trained Word2Vec models for Vietnamese},
  year = {2016},
  howpublished = {\url{https://github.com/sonvx/word2vecVN}},
  note = {commit xxxxxxx}
}

Cited in papers:

  1. Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras and Mark Johnson. 2018. VnCoreNLP: A Vietnamese Natural Language Processing Toolkit. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, NAACL 2018, pages 56-60.

III. Some screenshots:

[Figure: Screenshot of word2vec]

[Figure: Screenshot of spacy-n-fastext]
