
Hironsan / Ja.text8

Japanese text8 corpus for word embedding.

Programming Languages

python

Labels

deep-learning machine-learning natural-language-processing word2vec corpus

Projects that are alternatives to or similar to Ja.text8

Russian news corpus

Corpus of lemmatized (morphologically normalized) texts from Russian mass media

Stars: ✭ 76 (-3.8%)

Mutual labels: corpus, word2vec

Nlp In Practice

Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.

Stars: ✭ 790 (+900%)

Mutual labels: natural-language-processing, word2vec

Cs224n

CS224n: Natural Language Processing with Deep Learning Assignments Winter, 2017

Stars: ✭ 656 (+730.38%)

Mutual labels: natural-language-processing, word2vec

Awesome Persian Nlp Ir

Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources

Stars: ✭ 460 (+482.28%)

Mutual labels: corpus, natural-language-processing

Coarij

Corpus of Annual Reports in Japan

Stars: ✭ 55 (-30.38%)

Mutual labels: corpus, natural-language-processing

Weixin public corpus

WeChat public account corpus

Stars: ✭ 465 (+488.61%)

Mutual labels: corpus, natural-language-processing

Nlp chinese corpus

Large Scale Chinese Corpus for NLP

Stars: ✭ 6,656 (+8325.32%)

Mutual labels: corpus, word2vec

wordfish-python

extract relationships from standardized terms from corpus of interest with deep learning 🐟

Stars: ✭ 19 (-75.95%)

Mutual labels: word2vec, corpus

Pujangga

Pujangga - Indonesian Natural Language Processing Tool with REST API, an Interface for InaNLP and Deeplearning4j's Word2Vec

Stars: ✭ 47 (-40.51%)

Mutual labels: natural-language-processing, word2vec

Typing Assistant

Typing Assistant provides the ability to autocomplete words and suggests predictions for the next word. This makes typing faster, more intelligent and reduces effort.

Stars: ✭ 32 (-59.49%)

Mutual labels: corpus, natural-language-processing

Natural Language Processing

Programming Assignments and Lectures for Stanford's CS 224: Natural Language Processing with Deep Learning

Stars: ✭ 377 (+377.22%)

Mutual labels: natural-language-processing, word2vec

Kor2vec

Library for Korean morpheme and word vector representation

Stars: ✭ 64 (-18.99%)

Mutual labels: natural-language-processing, word2vec

Languagecrunch

LanguageCrunch NLP server docker image

Stars: ✭ 281 (+255.7%)

Mutual labels: natural-language-processing, word2vec

Quanteda

An R package for the Quantitative Analysis of Textual Data

Stars: ✭ 647 (+718.99%)

Mutual labels: corpus, natural-language-processing

Fakenewscorpus

A dataset of millions of news articles scraped from a curated list of data sources.

Stars: ✭ 255 (+222.78%)

Mutual labels: corpus, natural-language-processing

Text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.

Stars: ✭ 715 (+805.06%)

Mutual labels: natural-language-processing, word2vec

Nlvr

Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.

Stars: ✭ 192 (+143.04%)

Mutual labels: corpus, natural-language-processing

text-classification-cn

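Chinese text classification in practice, based on the Sogou news corpus, using traditional machine learning methods, pre-trained models, and other approaches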

Stars: ✭ 81 (+2.53%)

Mutual labels: word2vec, corpus

Insuranceqa Corpus Zh

🚁 Insurance industry corpus, chatbot

Stars: ✭ 821 (+939.24%)

Mutual labels: corpus, natural-language-processing

Repo 2017

Python codes in Machine Learning, NLP, Deep Learning and Reinforcement Learning with Keras and Theano

Stars: ✭ 1,123 (+1321.52%)

Mutual labels: natural-language-processing, word2vec

ja.text8

ja.text8 is a small (100MB) text corpus built from Japanese Wikipedia.

You can download the ja.text8 corpus from the following link:

ja.text8.zip
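
The download is a zip archive, so it has to be unpacked before training. A minimal sketch using only the standard library, assuming ja.text8.zip has already been saved to the current directory; the helper name `extract_corpus` is illustrative and not part of the project:

```python
import zipfile
from pathlib import Path


def extract_corpus(archive_path, dest_dir="."):
    """Extract every member of the archive into dest_dir and return their paths."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
        return [dest / name for name in archive.namelist()]


if __name__ == "__main__":
    # Assumption: the archive was downloaded manually into the current directory.
    if Path("ja.text8.zip").exists():
        for path in extract_corpus("ja.text8.zip"):
            print(path, path.stat().st_size, "bytes")
```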

Usage

You can train a word2vec model on ja.text8. After downloading ja.text8, run the following code. Training takes about 2 minutes:

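import logging

from gensim.models import word2vec

# Log training progress to the console.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Read the corpus as a stream of whitespace-separated tokens.
sentences = word2vec.Text8Corpus('ja.text8')
# Train 200-dimensional vectors. gensim 4.0 renamed `size` to `vector_size`;
# on gensim 3.x, pass size=200 instead.
model = word2vec.Word2Vec(sentences, vector_size=200)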
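
Text8Corpus is meant for text8-style files: plain text whose tokens are separated by whitespace, with no sentence boundaries, which it cuts into fixed-length runs of tokens for training. The following pure-Python sketch illustrates that idea under the assumption that ja.text8 follows the same layout. It is not gensim's implementation, and `iter_chunks`, `chunk_len`, and `block_size` are illustrative names and values:

```python
def iter_chunks(path, chunk_len=1000, block_size=8192, encoding="utf-8"):
    """Yield lists of at most chunk_len whitespace-separated tokens from path.

    The file is read in blocks of block_size characters, so a corpus stored
    as one very long line never has to be loaded into memory at once.
    """
    chunk, rest = [], ""
    with open(path, encoding=encoding) as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            text = rest + block
            tokens = text.split()
            # A token may be cut in half by the block boundary; carry the
            # last one over unless the text ended on whitespace.
            if not text[-1].isspace():
                rest = tokens.pop()
            else:
                rest = ""
            for token in tokens:
                chunk.append(token)
                if len(chunk) == chunk_len:
                    yield chunk
                    chunk = []
    if rest:
        chunk.append(rest)
    if chunk:
        yield chunk
```

Each yielded list plays the role of one "sentence" for the trainer, for example `for chunk in iter_chunks('ja.text8'): ...`.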

After the training, you can test the model as follows:

>>> model.wv.most_similar(['日本'])
[('中国', 0.598496675491333),
 ('韓国', 0.5914819240570068),
 ('アメリカ', 0.5286925435066223),
 ('英国', 0.5090063810348511),
 ('台湾', 0.4761126637458801),
 ('米国', 0.45954638719558716),
 ('アメリカ合衆国', 0.45181626081466675),
 ('イギリス', 0.44740626215934753),
 ('ソ連', 0.43657147884368896),
 ('海外', 0.4325913190841675)]

Great!
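
The score next to each word is the cosine similarity between its vector and the query word's vector. A minimal sketch of that ranking on made-up 3-dimensional vectors; the toy vectors and the function names are illustrative and are not taken from the trained model or from gensim:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors, topn=10):
    """Return up to topn (word, score) pairs, highest cosine similarity first."""
    scores = [(word, cosine_similarity(query, vec)) for word, vec in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors chosen by hand so the ordering is easy to verify.
toy_vectors = {
    "same_direction": [2.0, 0.0, 0.0],
    "diagonal": [1.0, 1.0, 0.0],
    "orthogonal": [0.0, 1.0, 0.0],
}
print(rank_by_similarity([1.0, 0.0, 0.0], toy_vectors))
```

Because cosine similarity ignores vector length, `same_direction` ranks first even though its magnitude differs from the query's.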

Requirements

Python 3.x
MeCab
virtualenv

Build the corpus yourself

Instead of downloading ja.text8, you can build the corpus yourself.

Simply run:

$ ./setup.sh
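
The contents of setup.sh are not reproduced here. As a rough illustration of why MeCab appears in the requirements: Japanese text has no spaces between words, so it must be segmented before word2vec can treat it as a token stream. The sketch below is hypothetical and is not the project's actual pipeline. It assumes the mecab-python3 binding with a dictionary installed, and the names `mecab_tokenizer` and `build_corpus`, as well as the reading of "100MB" as 100,000,000 bytes, are assumptions:

```python
def mecab_tokenizer():
    """Return a function that inserts spaces between Japanese words.

    Assumption: the mecab-python3 package and a MeCab dictionary are installed.
    """
    import MeCab  # imported lazily so build_corpus can be used without it

    tagger = MeCab.Tagger("-Owakati")  # wakati mode: space-separated output
    return lambda text: tagger.parse(text).strip()


def build_corpus(lines, tokenize, out_path, max_bytes=100 * 1000 * 1000):
    """Write tokenized lines to out_path as one space-separated stream.

    Stops before a line that would push the UTF-8 size past max_bytes.
    Returns the number of bytes written.
    """
    written = 0
    with open(out_path, "wb") as out:
        for line in lines:
            tokens = tokenize(line).split()
            if not tokens:
                continue  # skip blank lines
            data = (" ".join(tokens) + " ").encode("utf-8")
            if written + len(data) > max_bytes:
                break
            out.write(data)
            written += len(data)
    return written
```

With plain text extracted from a Wikipedia dump in a hypothetical file wiki.txt, usage might look like `build_corpus(open('wiki.txt', encoding='utf-8'), mecab_tokenizer(), 'ja.text8')`.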

License

CC-BY-SA

Stars: ✭ 79