go-ego / Gse

Licence: apache-2.0

Efficient multilingual NLP and text segmentation in Go; supports English, Chinese, Japanese, and other languages.

Projects that are alternatives of or similar to Gse

A fast Text-to-Speech (TTS) model. Works well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (tested so far).

Stars: ✭ 154 (-90.91%)

Mutual labels: japanese, english, chinese

tudien

Vietnamese dictionary for Kindle

Stars: ✭ 38 (-97.76%)

Mutual labels: english, chinese

syng

A free, open source, cross-platform, Chinese-To-English dictionary for desktops.

Stars: ✭ 108 (-93.63%)

Mutual labels: english, chinese

Chinese Text Classification

Chinese-Text-Classification: Chinese text classification implemented with a TensorFlow CNN (convolutional neural network). QQ group: 522785813; WeChat group QR code: http://www.tensorflownews.com/

Stars: ✭ 284 (-83.24%)

Mutual labels: chinese, jieba

unihandecode

unihandecode is a transliteration library that converts all characters/words in Unicode into the ASCII alphabet and is aware of language preference priorities

Stars: ✭ 71 (-95.81%)

Mutual labels: japanese, chinese

mchmm

Markov Chains and Hidden Markov Models in Python

Stars: ✭ 89 (-94.75%)

Mutual labels: hmm, hmm-viterbi-algorithm

TALPCo

TUFS Asian Language Parallel Corpus

Stars: ✭ 32 (-98.11%)

Mutual labels: japanese, english

say-it

TTS in command line -- Pronounce the Chinese and English words you typed in.

Stars: ✭ 19 (-98.88%)

Mutual labels: english, chinese

Jiebar

Chinese text segmentation with R. (Documentation updated 🎉: https://qinwenfeng.com/jiebaR/)

Stars: ✭ 302 (-82.18%)

Mutual labels: chinese, jieba

Most Frequent Technology English Words

English vocabulary that programmers commonly encounter at work

Stars: ✭ 4,711 (+177.94%)

Mutual labels: chinese, english

Chrome Extension Udemy Translate

Translate Udemy's subtitles into Chinese, English, etc. (Disney+, Netflix, Udemy, Lynda, Hulu, HBO Now, Prime Video)

Stars: ✭ 553 (-67.37%)

Mutual labels: chinese, english

Google Ime Dictionary

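An add-on IME dictionary for Japanese-to-English conversion and English abbreviation expansion 📙 An IME extension dictionary that enables Japanese-to-English conversion and expansion of English abbreviations in Google Japanese Input, ATOK, and similar input methods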

Stars: ✭ 30 (-98.23%)

Mutual labels: japanese, english

BSD

The Business Scene Dialogue corpus

Stars: ✭ 51 (-96.99%)

Mutual labels: japanese, english

jiten

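jiten - Japanese android/cli/web dictionary based on jmdict/kanjidic (Japanese dictionary, Japanese-English dictionary, kanji-English dictionary, Japanese-German dictionary, Japanese-Dutch dictionary)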

Stars: ✭ 64 (-96.22%)

Mutual labels: japanese, english

next-qrcode

React hooks for generating QRCode for your next React apps.

Stars: ✭ 87 (-94.87%)

Mutual labels: japanese, chinese

OpenGNT

Open Greek New Testament Project; NA28 / NA27 Equivalent Text & Resources

Stars: ✭ 55 (-96.76%)

Mutual labels: english, chinese

Opencc4j

🇨🇳 Open Chinese Convert is an open source project for conversion between Traditional Chinese and Simplified Chinese (Java).

Stars: ✭ 187 (-88.97%)

Mutual labels: chinese, trie

ark-pixel-font

Open source Pan-CJK pixel font

Stars: ✭ 1,767 (+4.25%)

Mutual labels: japanese, chinese

Borgert Cms

Borgert is a CMS Open Source created with Laravel Framework 5.6

Stars: ✭ 298 (-82.42%)

Mutual labels: chinese, english

Mouse Dictionary

📘A super fast dictionary for Chrome/Firefox

Stars: ✭ 670 (-60.47%)

Mutual labels: japanese, english

gse

Efficient multilingual NLP and text segmentation in Go; supports English, Chinese, Japanese, and other languages. It also works with Elasticsearch and Bleve.

Gse implements jieba in Go and aims to add NLP support and more features.

Feature:

Supports multiple word segmentation modes: common, search engine, full, precise, and HMM
Supports user and embedded dictionaries, part-of-speech (POS) tagging, segment info analysis, and stop-word and trim handling
Supports multiple languages: English, Chinese, Japanese, and others
Supports Traditional Chinese
Supports HMM text cutting using the Viterbi algorithm
Supports NLP via TensorFlow (in progress)
Named Entity Recognition (in progress)
Supports Elasticsearch and Bleve
Runs a JSON RPC service
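
To make the HMM mode more concrete, here is a minimal, self-contained sketch of Viterbi decoding over the usual B/M/E/S (begin, middle, end, single) character tags. It is not gse's implementation: the start, transition, and emission scores are made-up values chosen only so the toy input segments sensibly, and the names `viterbi`, `cut`, `emit`, and `likely` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Character tags used by HMM-based segmenters.
const (
	B = iota // first rune of a multi-rune word
	M        // middle rune
	E        // last rune
	S        // single-rune word
	numStates
)

var negInf = math.Inf(-1)

// All scores are invented log-probabilities, for illustration only.
var start = [numStates]float64{B: -0.3, M: negInf, E: negInf, S: -1.4}

var trans = [numStates][numStates]float64{
	B: {B: negInf, M: -1.6, E: -0.2, S: negInf},
	M: {B: negInf, M: -1.2, E: -0.4, S: negInf},
	E: {B: -0.6, M: negInf, E: negInf, S: -0.8},
	S: {B: -0.7, M: negInf, E: negInf, S: -0.7},
}

// likely maps a rune to the tag it favours in this toy model.
var likely = map[rune]int{'你': B, '好': E, '世': B, '界': E}

// emit returns a toy emission log-probability for rune r under tag s.
func emit(s int, r rune) float64 {
	if fav, ok := likely[r]; ok && fav == s {
		return -1.0
	}
	return -5.0
}

// viterbi returns the highest-scoring tag sequence for runes.
func viterbi(runes []rune) []int {
	n := len(runes)
	if n == 0 {
		return nil
	}
	score := make([][numStates]float64, n)
	back := make([][numStates]int, n)
	for s := 0; s < numStates; s++ {
		score[0][s] = start[s] + emit(s, runes[0])
	}
	for t := 1; t < n; t++ {
		for s := 0; s < numStates; s++ {
			best, arg := negInf, 0
			for p := 0; p < numStates; p++ {
				if v := score[t-1][p] + trans[p][s]; v > best {
					best, arg = v, p
				}
			}
			score[t][s] = best + emit(s, runes[t])
			back[t][s] = arg
		}
	}
	// A valid tagging must finish a word, so the last tag is E or S.
	last := E
	if score[n-1][S] > score[n-1][E] {
		last = S
	}
	path := make([]int, n)
	path[n-1] = last
	for t := n - 1; t > 0; t-- {
		path[t-1] = back[t][path[t]]
	}
	return path
}

// cut splits text after every rune tagged E or S.
func cut(text string) []string {
	runes := []rune(text)
	var words []string
	begin := 0
	for i, st := range viterbi(runes) {
		if st == E || st == S {
			words = append(words, string(runes[begin:i+1]))
			begin = i + 1
		}
	}
	return words
}

func main() {
	fmt.Println(cut("你好世界"))
}
```

A real model learns these scores from a tagged corpus; only the decoding loop carries over.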

Algorithm:

The dictionary is implemented with a double-array trie.
The segmenter uses a shortest-path algorithm (based on word frequency and dynamic programming), together with DAG-based and HMM-based word segmentation.
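
The shortest-path idea can be sketched as follows, assuming a tiny hypothetical frequency table (`freq`) and a plain map lookup in place of the double-array trie. Each dictionary word is an edge whose cost is the negative log of its relative frequency, and dynamic programming picks the cheapest path through the text. This is an illustration of the technique, not gse's code.

```go
package main

import (
	"fmt"
	"math"
)

// freq is a hypothetical toy dictionary: word -> frequency.
var freq = map[string]float64{
	"你好": 50, "世界": 40, "你": 10, "好": 10, "世": 2, "界": 2,
}

// segment returns the split of text with the lowest total cost,
// where a word costs -log(frequency / total frequency).
func segment(text string) []string {
	runes := []rune(text)
	n := len(runes)
	total := 0.0
	for _, f := range freq {
		total += f
	}
	// cost[i] is the cheapest cost of segmenting runes[:i];
	// prev[i] is the start index of the last word on that path.
	cost := make([]float64, n+1)
	prev := make([]int, n+1)
	for i := 1; i <= n; i++ {
		cost[i] = math.Inf(1)
		for j := 0; j < i; j++ {
			f, ok := freq[string(runes[j:i])]
			if !ok {
				if i-j > 1 {
					continue // unknown multi-rune spans are not edges
				}
				f = 0.5 // assumed smoothing for an unknown single rune
			}
			if c := cost[j] - math.Log(f/total); c < cost[i] {
				cost[i], prev[i] = c, j
			}
		}
	}
	// Walk the back-pointers from the end to recover the words.
	var words []string
	for i := n; i > 0; i = prev[i] {
		words = append([]string{string(runes[prev[i]:i])}, words...)
	}
	return words
}

func main() {
	fmt.Println(segment("你好世界"))
}
```

Because every single rune is always an edge, a path to the end of the text always exists, even for input the dictionary has never seen.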

Text Segmentation speed:

Single thread: 9.2 MB/s
Concurrent goroutines: 26.8 MB/s
HMM text segmentation, single thread: 3.2 MB/s (2-core, 4-thread MacBook Pro)

Binding:

gse-bind: bindings for JavaScript and other languages.

Install / update

With Go module support (Go 1.11+), just import:

import "github.com/go-ego/gse"

Otherwise, to install the gse package, run the command:

go get -u github.com/go-ego/gse

Use

package main

import (
	"fmt"
	"regexp"

	"github.com/go-ego/gse"
	"github.com/go-ego/gse/hmm/pos"
)

var (
	text = "Hello world, Helloworld. Winter is coming! 你好世界."

	new, _ = gse.New("zh,testdata/test_dict3.txt", "alpha")

	seg gse.Segmenter
	posSeg pos.Segmenter
)

func main() {
	// Loading the default dictionary
	seg.LoadDict()
	// Loading the default dictionary with embed
	// seg.LoadDictEmbed()
	// 
	// Loading the simplified Chinese dictionary
	// seg.LoadDict("zh_s")
	// seg.LoadDictEmbed("zh_s")
	//
	// Loading the traditional chinese dictionary
	// seg.LoadDict("zh_t")
	// 
	// Loading the japanese dictionary
	// seg.LoadDict("jp")
	// 
	// Load the dictionary
	// seg.LoadDict("your gopath"+"/src/github.com/go-ego/gse/data/dict/dictionary.txt")

	cut()

	segCut()
}

func cut() {
	hmm := new.Cut(text, true)
	fmt.Println("cut use hmm: ", hmm)

	hmm = new.CutSearch(text, true)
	fmt.Println("cut search use hmm: ", hmm)
	fmt.Println("analyze: ", new.Analyze(hmm, text))

	hmm = new.CutAll(text)
	fmt.Println("cut all: ", hmm)

	reg := regexp.MustCompile(`(\d+年|\d+月|\d+日|[\p{Latin}]+|[\p{Hangul}]+|\d+\.\d+|[a-zA-Z0-9]+)`)
	text1 := `헬로월드 헬로 서울, 2021年09月10日, 3.14`
	hmm = seg.CutDAG(text1, reg)
	fmt.Println("Cut with hmm and regexp: ", hmm, hmm[0], hmm[6])
}

func analyzeAndTrim(cut []string) {
	a := seg.Analyze(cut, "")
	fmt.Println("analyze the segment: ", a)

	cut = seg.Trim(cut)
	fmt.Println("trim: ", cut)

	fmt.Println(seg.String(text, true))
	fmt.Println(seg.Slice(text, true))
}

func cutPos() {
	po := seg.Pos(text, true)
	fmt.Println("pos: ", po)
	po = seg.TrimPos(po)
	fmt.Println("trim pos: ", po)

	pos.WithGse(seg)
	po = posSeg.Cut(text, true)
	fmt.Println("pos: ", po)

	po = posSeg.TrimWithPos(po, "zg")
	fmt.Println("trim pos: ", po)
}

func segCut() {
	// Text Segmentation
	tb := []byte(text)
	fmt.Println(seg.String(text, true))

	segments := seg.Segment(tb)
	// Handle word segmentation results, search mode
	fmt.Println(gse.ToString(segments, true))
}

Look at a custom dictionary example

package main

import (
	"fmt"
	_ "embed"

	"github.com/go-ego/gse"
)

//go:embed test_dict3.txt
var testDict string

func main() {
	// var seg gse.Segmenter
	// seg.LoadDict("zh, testdata/test_dict.txt, testdata/test_dict1.txt")
	// seg.LoadStop()
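	seg, err := gse.NewEmbed("zh, word 20 n"+testDict, "en")
	if err != nil {
		// Without this check, err is declared but unused and the file does not compile.
		fmt.Println("load embedded dictionary: ", err)
		return
	}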
	// seg.LoadDictEmbed()
	seg.LoadStopEmbed()

	text1 := "你好世界, Hello world"
	fmt.Println(seg.Cut(text1, true))
	fmt.Println(seg.String(text1, true))

	segments := seg.Segment([]byte(text1))
	fmt.Println(gse.ToString(segments))
}
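
The inline entry `word 20 n` in the example above suggests that a dictionary line is a token followed by a frequency and a part-of-speech tag. Under that assumption, a line could be parsed roughly as below. `entry` and `parseLine` are hypothetical helpers written for illustration, not part of gse, and the default frequency of 1 for a missing field is also an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// entry is a hypothetical in-memory form of one dictionary line.
type entry struct {
	Text string
	Freq int
	Pos  string
}

// parseLine reads "text [frequency [pos]]" separated by whitespace.
func parseLine(line string) (entry, error) {
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return entry{}, fmt.Errorf("empty dictionary line")
	}
	e := entry{Text: fields[0], Freq: 1} // assumed default frequency
	if len(fields) > 1 {
		f, err := strconv.Atoi(fields[1])
		if err != nil {
			return entry{}, fmt.Errorf("bad frequency %q: %w", fields[1], err)
		}
		e.Freq = f
	}
	if len(fields) > 2 {
		e.Pos = fields[2]
	}
	return e, nil
}

func main() {
	e, err := parseLine("word 20 n")
	fmt.Println(e, err)
}
```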

Look at a Chinese example

Look at a Japanese example

Elasticsearch

How to use it with elasticsearch?

go-gse-elastic

Authors

License

Gse is primarily distributed under the terms of "both the MIT license and the Apache License (Version 2.0)". See LICENSE-APACHE, LICENSE-MIT.

Thanks to sego and jieba (jiebago).
