
Hironsan / Ja.text8

Japanese text8 corpus for word embedding.

Programming Languages

python

Projects that are alternatives of or similar to Ja.text8

Russian news corpus
Russian mass media stemmed texts corpus (a corpus of lemmatized, morphologically normalized texts from Russian mass media)
Stars: ✭ 76 (-3.8%)
Mutual labels:  corpus, word2vec
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+900%)
Mutual labels:  natural-language-processing, word2vec
Cs224n
CS224n: Natural Language Processing with Deep Learning Assignments Winter, 2017
Stars: ✭ 656 (+730.38%)
Mutual labels:  natural-language-processing, word2vec
Awesome Persian Nlp Ir
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Stars: ✭ 460 (+482.28%)
Mutual labels:  corpus, natural-language-processing
Coarij
Corpus of Annual Reports in Japan
Stars: ✭ 55 (-30.38%)
Mutual labels:  corpus, natural-language-processing
Weixin public corpus
WeChat public account corpus
Stars: ✭ 465 (+488.61%)
Mutual labels:  corpus, natural-language-processing
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+8325.32%)
Mutual labels:  corpus, word2vec
wordfish-python
extract relationships from standardized terms from corpus of interest with deep learning 🐟
Stars: ✭ 19 (-75.95%)
Mutual labels:  word2vec, corpus
Pujangga
Pujangga - Indonesian Natural Language Processing Tool with REST API, an Interface for InaNLP and Deeplearning4j's Word2Vec
Stars: ✭ 47 (-40.51%)
Mutual labels:  natural-language-processing, word2vec
Typing Assistant
Typing Assistant provides the ability to autocomplete words and suggests predictions for the next word. This makes typing faster, more intelligent and reduces effort.
Stars: ✭ 32 (-59.49%)
Mutual labels:  corpus, natural-language-processing
Natural Language Processing
Programming Assignments and Lectures for Stanford's CS 224: Natural Language Processing with Deep Learning
Stars: ✭ 377 (+377.22%)
Mutual labels:  natural-language-processing, word2vec
Kor2vec
Library for Korean morpheme and word vector representation
Stars: ✭ 64 (-18.99%)
Mutual labels:  natural-language-processing, word2vec
Languagecrunch
LanguageCrunch NLP server docker image
Stars: ✭ 281 (+255.7%)
Mutual labels:  natural-language-processing, word2vec
Quanteda
An R package for the Quantitative Analysis of Textual Data
Stars: ✭ 647 (+718.99%)
Mutual labels:  corpus, natural-language-processing
Fakenewscorpus
A dataset of millions of news articles scraped from a curated list of data sources.
Stars: ✭ 255 (+222.78%)
Mutual labels:  corpus, natural-language-processing
Text2vec
Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
Stars: ✭ 715 (+805.06%)
Mutual labels:  natural-language-processing, word2vec
Nlvr
Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
Stars: ✭ 192 (+143.04%)
Mutual labels:  corpus, natural-language-processing
text-classification-cn
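Chinese text classification in practice, based on the Sogou news corpus, using traditional machine learning methods as well as pre-trained models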
Stars: ✭ 81 (+2.53%)
Mutual labels:  word2vec, corpus
Insuranceqa Corpus Zh
🚁 Insurance industry corpus, chatbot
Stars: ✭ 821 (+939.24%)
Mutual labels:  corpus, natural-language-processing
Repo 2017
Python codes in Machine Learning, NLP, Deep Learning and Reinforcement Learning with Keras and Theano
Stars: ✭ 1,123 (+1321.52%)
Mutual labels:  natural-language-processing, word2vec

ja.text8

ja.text8 is a small (100 MB) text corpus built from Japanese Wikipedia.

You can download the ja.text8 corpus from the following link:

Usage

You can train a word2vec model on ja.text8. After downloading ja.text8, run the following code. Training takes about 2 minutes:

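import logging
from gensim.models import word2vec

# Log training progress to the console.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Text8Corpus streams the whitespace-separated corpus in fixed-size chunks.
sentences = word2vec.Text8Corpus('ja.text8')
# gensim >= 4.0 names this parameter vector_size; older releases used size.
model = word2vec.Word2Vec(sentences, vector_size=200)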
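
To make the role of Text8Corpus more concrete, here is a minimal standard-library sketch of the same idea: read a file of whitespace-separated tokens and yield them in fixed-size chunks that word2vec can treat as sentences. It is an illustration only, not gensim's implementation. The function name `iter_text8_chunks`, the `read_size` parameter, and the tiny demo file are assumptions made for this example.

```python
import os
import tempfile


def iter_text8_chunks(path, max_tokens=10000, read_size=65536):
    """Yield lists of at most max_tokens tokens from a whitespace-separated file.

    Hypothetical helper for illustration; gensim's Text8Corpus is the real reader.
    """
    chunk = []
    leftover = ""  # a token that may have been cut in half by the read boundary
    with open(path, encoding="utf-8") as f:
        while True:
            block = f.read(read_size)
            if not block:
                break
            text = leftover + block
            words = text.split()
            # If the text does not end in whitespace, the last word may be incomplete.
            if words and not text[-1].isspace():
                leftover = words.pop()
            else:
                leftover = ""
            for word in words:
                chunk.append(word)
                if len(chunk) == max_tokens:
                    yield chunk
                    chunk = []
    if leftover:
        chunk.append(leftover)
    if chunk:
        yield chunk


# Demo on a tiny invented file (not the real corpus).
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("日本 の 首都 は 東京 です")
    demo_path = tmp.name

chunks = list(iter_text8_chunks(demo_path, max_tokens=4, read_size=4))
os.remove(demo_path)
print(chunks)
```

The small `read_size` in the demo is only there to exercise the case where a read ends in the middle of a token. In gensim itself, the chunk size is controlled by the `max_sentence_length` argument of Text8Corpus.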

After training, you can test the model as follows:

>>> model.wv.most_similar(['日本'])
[('中国', 0.598496675491333),
 ('韓国', 0.5914819240570068),
 ('アメリカ', 0.5286925435066223),
 ('英国', 0.5090063810348511),
 ('台湾', 0.4761126637458801),
 ('米国', 0.45954638719558716),
 ('アメリカ合衆国', 0.45181626081466675),
 ('イギリス', 0.44740626215934753),
 ('ソ連', 0.43657147884368896),
 ('海外', 0.4325913190841675)]

Great!
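
The scores in the output above are cosine similarities between word vectors. The following sketch shows how such a ranking can be computed by hand. The helper names and the 2-dimensional toy vectors are invented for illustration; they are not taken from the trained model, and gensim's own implementation is vectorized and differs in detail.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors, invented for this example (a real model here would use 200 dimensions).
toy_vectors = {
    "日本": [1.0, 0.0],
    "中国": [0.8, 0.6],
    "海外": [0.0, 1.0],
}

ranking = most_similar("日本", toy_vectors)
print(ranking)
```

A score close to 1 means the two vectors point in nearly the same direction, and a score near 0 means they are close to orthogonal.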

Requirements

  • Python 3.x
  • MeCab
  • virtualenv

Build the corpus yourself

You can download ja.text8, but you can also build the corpus yourself.

Simply run:

$ ./setup.sh
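
The contents of setup.sh are not shown here, so the following is only a rough sketch of what a corpus-building step could look like, not a description of the script. It assumes the goal is a single line of space-separated tokens capped at a byte budget (100 MB for ja.text8). The function `build_corpus` and its parameters are hypothetical. The tokenizer is passed in as a callable so that the example runs without MeCab installed.

```python
def build_corpus(lines, tokenize, max_bytes):
    """Join tokenized lines into one space-separated string within a UTF-8 byte budget.

    Hypothetical helper: stops before the first token that would exceed max_bytes.
    """
    tokens = []
    used = 0
    for line in lines:
        for token in tokenize(line):
            # One extra byte for the separating space, except before the first token.
            cost = len(token.encode("utf-8")) + (1 if tokens else 0)
            if used + cost > max_bytes:
                return " ".join(tokens)
            tokens.append(token)
            used += cost
    return " ".join(tokens)


# Demo with pre-segmented text and str.split as a stand-in tokenizer.
# With the mecab-python3 binding installed (an assumption), one could instead pass
#   tagger = MeCab.Tagger("-Owakati"); tokenize = lambda s: tagger.parse(s).split()
demo_lines = ["日本 の 首都", "は 東京 です"]
corpus = build_corpus(demo_lines, str.split, max_bytes=20)
print(corpus)
```

Cutting at a token boundary rather than at an exact byte offset avoids ending the file with a truncated word or a broken multi-byte character.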

License

CC-BY-SA
