Kuromoji

Kuromoji is an easy-to-use, self-contained Japanese morphological analyzer that does:

Word segmentation. Segmenting text into words (or morphemes)
Part-of-speech tagging. Assigning word categories (nouns, verbs, particles, adjectives, etc.)
Lemmatization. Getting dictionary forms for inflected verbs and adjectives
Readings. Extracting readings for kanji

Several other features are supported. Please consult each dictionary's Token class for details.

Using Kuromoji

The example below shows how to use the Kuromoji morphological analyzer in its simplest form: to segment text into tokens and output the features of each token.

package com.atilika.kuromoji.example;

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.List;

public class KuromojiExample {
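    public static void main(String[] args) {
        // Create a tokenizer backed by the IPADIC dictionary
        Tokenizer tokenizer = new Tokenizer();
        // Segment the sentence into tokens (morphemes)
        List<Token> tokens = tokenizer.tokenize("お寿司が食べたい。");
        for (Token token : tokens) {
            // Print the surface form and the comma-separated features of each token
            System.out.println(token.getSurface() + "\t" + token.getAllFeatures());
        }
    }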
    }
}

Make sure you add the dependency below to your pom.xml before building your project.

<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-ipadic</artifactId>
  <version>0.9.0</version>
</dependency>

When running the above program, you will get the following output:

お　　　接頭詞,名詞接続,*,*,*,*,お,オ,オ
寿司　　名詞,一般,*,*,*,*,寿司,スシ,スシ
が　　　助詞,格助詞,一般,*,*,*,が,ガ,ガ
食べ　　動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい　　助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
。　　　記号,句点,*,*,*,*,。,。,。

See the documentation for the com.atilika.kuromoji.ipadic.Token class for more information on the per-token features available.
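
The feature string printed for each token above is positional. As a rough illustration of how to read it, the sketch below splits such a string into named fields using only the Java standard library. The class `IpadicFeatures` and its field names are hypothetical and not part of Kuromoji; the field order is an assumption read off the sample output, and the plain comma split assumes that no feature value itself contains a comma. In real code the getters on the Token class are likely the better choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Kuromoji): names the positional IPADIC features. */
final class IpadicFeatures {
    // Assumed field order, read off the sample output above.
    private static final String[] FIELD_NAMES = {
        "partOfSpeechLevel1", "partOfSpeechLevel2", "partOfSpeechLevel3", "partOfSpeechLevel4",
        "conjugationType", "conjugationForm", "baseForm", "reading", "pronunciation"
    };

    /** Splits a feature string, such as one returned by getAllFeatures(), into named fields. */
    static Map<String, String> parse(String allFeatures) {
        // Limit -1 keeps trailing empty values instead of silently dropping them.
        String[] values = allFeatures.split(",", -1);
        Map<String, String> fields = new LinkedHashMap<>();
        // Strings with fewer values than expected simply yield fewer fields.
        for (int i = 0; i < FIELD_NAMES.length && i < values.length; i++) {
            fields.put(FIELD_NAMES[i], values[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Feature string of the token 食べ from the sample output above.
        Map<String, String> fields = parse("動詞,自立,*,*,一段,連用形,食べる,タベ,タベ");
        System.out.println(fields.get("partOfSpeechLevel1")); // 動詞
        System.out.println(fields.get("baseForm"));           // 食べる
        System.out.println(fields.get("reading"));            // タベ
    }
}
```

Here the base form 食べる is the lemmatized dictionary form of the inflected surface 食べ, and an asterisk marks a feature that does not apply to the token.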

Supported dictionaries

Kuromoji currently supports the following dictionaries:

IPADIC (2.7.0-20070801)
IPADIC NEologd (2.7.0-20070801-neologd-20171113)
JUMANDIC (7.0-20130310)
NAIST jdic (0.6.3b-20111013)
UniDic (2.1.2)
UniDic Kana Accent (2.1.2)
UniDic NEologd (2.1.2-neologd-20171002)

Question: So which of these dictionaries should I use?

Answer: That depends on your application. Yes, we know - it's a boring answer... :)

If you are not sure about which dictionary you should use, kuromoji-ipadic is a good starting point for many applications.

See the getters in the per-dictionary Token classes for some more information on available token features - or consult the technical dictionary documentation elsewhere. (We plan on adding better guidance on choosing a dictionary.)

Maven coordinates and user classes

Each dictionary has its own Maven coordinates, along with a Tokenizer class and a Token class similar to those in the example above. These classes live in a dedicated package named after the dictionary type.

The sections below list fully qualified class names and the Maven coordinates for each dictionary supported.

kuromoji-ipadic

com.atilika.kuromoji.ipadic.Tokenizer
com.atilika.kuromoji.ipadic.Token

<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-ipadic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-ipadic-neologd

com.atilika.kuromoji.ipadic.neologd.Tokenizer
com.atilika.kuromoji.ipadic.neologd.Token

This dictionary will be available from Maven Central in a future version.

kuromoji-jumandic

com.atilika.kuromoji.jumandic.Tokenizer
com.atilika.kuromoji.jumandic.Token

<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-jumandic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-naist-jdic

com.atilika.kuromoji.naist.jdic.Tokenizer
com.atilika.kuromoji.naist.jdic.Token

<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-naist-jdic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-unidic

com.atilika.kuromoji.unidic.Tokenizer
com.atilika.kuromoji.unidic.Token

<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-unidic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-unidic-kanaaccent

com.atilika.kuromoji.unidic.kanaaccent.Tokenizer
com.atilika.kuromoji.unidic.kanaaccent.Token

<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-unidic-kanaaccent</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-unidic-neologd

com.atilika.kuromoji.unidic.neologd.Tokenizer
com.atilika.kuromoji.unidic.neologd.Token

This dictionary will be available from Maven Central in a future version.
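
The listings above follow a visible pattern: the package name is the group ID followed by the artifact ID with its `kuromoji-` prefix removed and hyphens turned into dots. The sketch below captures that pattern and checks whether a given dictionary's Tokenizer class can be loaded. The class `KuromojiCoordinates` and its methods are hypothetical helpers, not part of Kuromoji, and the naming rule is an assumption inferred from the seven dictionaries listed here rather than a documented guarantee.

```java
/** Hypothetical helper (not part of Kuromoji): maps a Maven artifactId to its user classes. */
final class KuromojiCoordinates {
    private static final String GROUP_ID = "com.atilika.kuromoji";
    private static final String ARTIFACT_PREFIX = "kuromoji-";

    /** Example: "kuromoji-naist-jdic" becomes "com.atilika.kuromoji.naist.jdic". */
    static String packageName(String artifactId) {
        if (!artifactId.startsWith(ARTIFACT_PREFIX)) {
            throw new IllegalArgumentException("Not a Kuromoji artifact: " + artifactId);
        }
        String dictionary = artifactId.substring(ARTIFACT_PREFIX.length());
        return GROUP_ID + "." + dictionary.replace('-', '.');
    }

    static String tokenizerClassName(String artifactId) {
        return packageName(artifactId) + ".Tokenizer";
    }

    static String tokenClassName(String artifactId) {
        return packageName(artifactId) + ".Token";
    }

    /** Returns true if the named class can be loaded, i.e. its jar is on the classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers.
            Class.forName(className, false, KuromojiCoordinates.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] artifactIds = {
            "kuromoji-ipadic", "kuromoji-ipadic-neologd", "kuromoji-jumandic",
            "kuromoji-naist-jdic", "kuromoji-unidic", "kuromoji-unidic-kanaaccent",
            "kuromoji-unidic-neologd"
        };
        for (String artifactId : artifactIds) {
            String tokenizer = tokenizerClassName(artifactId);
            String status = isOnClasspath(tokenizer) ? "available" : "not on classpath";
            System.out.println(artifactId + " -> " + tokenizer + " (" + status + ")");
        }
    }
}
```

Running `main` in a project that depends only on kuromoji-ipadic should report that one Tokenizer as available and the rest as not on the classpath, which can be a quick way to confirm that the intended dependency was actually resolved.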

Building Kuromoji from source code

Released versions of Kuromoji are available from Maven Central.

If you want to build Kuromoji from source code, run the following command:

$ mvn clean package

This will download all source dictionary data and build Kuromoji with all dictionaries. The following jars will then be available:

kuromoji-core/target/kuromoji-core-1.0-SNAPSHOT.jar
kuromoji-ipadic/target/kuromoji-ipadic-1.0-SNAPSHOT.jar
kuromoji-ipadic-neologd/target/kuromoji-ipadic-neologd-1.0-SNAPSHOT.jar
kuromoji-jumandic/target/kuromoji-jumandic-1.0-SNAPSHOT.jar
kuromoji-naist-jdic/target/kuromoji-naist-jdic-1.0-SNAPSHOT.jar
kuromoji-unidic/target/kuromoji-unidic-1.0-SNAPSHOT.jar
kuromoji-unidic-kanaaccent/target/kuromoji-unidic-kanaaccent-1.0-SNAPSHOT.jar
kuromoji-unidic-neologd/target/kuromoji-unidic-neologd-1.0-SNAPSHOT.jar

The following additional build options are available:

-DskipCompileDictionary Do not recompile the dictionaries
-DskipDownloadDictionary Do not download source dictionaries
-DbenchmarkTokenizers Profile each tokenizer during the package phase using content from Japanese Wikipedia
-DskipDownloadWikipedia Do not download the compressed version of the Japanese Wikipedia (~765 MB) during profiling, e.g. because it has already been downloaded

License

Kuromoji is licensed under the Apache License, Version 2.0. See LICENSE.md for details.

This software also includes a binary and/or source version of data from various 3rd party dictionaries. See NOTICE.md for these details.

Contributing

Please open an issue if you have a feature request. We also welcome contributions through pull requests.

You will retain copyright to your own contributions, but you need to license them using the Apache License, Version 2.0. All contributors will be mentioned in the CONTRIBUTORS.md file.

About us

We are a small team of experienced software engineers based in Tokyo that offers technologies and good advice in the fields of search, natural language processing and big data analytics.

Please feel free to contact us at [email protected] if you have any questions or need help.
