
go-ego / Gse

Licence: apache-2.0
Efficient multilingual NLP and text segmentation in Go; supports English, Chinese, Japanese, and other languages.

Programming Languages

go
31211 projects - #10 most used programming language
HTML
75241 projects

Projects that are alternatives of or similar to Gse

FCH-TTS
A fast Text-to-Speech (TTS) model. Works well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (the languages tested so far).
Stars: ✭ 154 (-90.91%)
Mutual labels:  japanese, english, chinese
tudien
Vietnamese dictionary for Kindle
Stars: ✭ 38 (-97.76%)
Mutual labels:  english, chinese
syng
A free, open source, cross-platform, Chinese-To-English dictionary for desktops.
Stars: ✭ 108 (-93.63%)
Mutual labels:  english, chinese
Chinese Text Classification
Chinese-Text-Classification: Chinese text classification implemented with a TensorFlow CNN (convolutional neural network). QQ group: 522785813; WeChat group QR code: http://www.tensorflownews.com/
Stars: ✭ 284 (-83.24%)
Mutual labels:  chinese, jieba
unihandecode
unihandecode is a transliteration library that converts Unicode characters and words into the ASCII alphabet, taking language preference priorities into account
Stars: ✭ 71 (-95.81%)
Mutual labels:  japanese, chinese
mchmm
Markov Chains and Hidden Markov Models in Python
Stars: ✭ 89 (-94.75%)
Mutual labels:  hmm, hmm-viterbi-algorithm
TALPCo
TUFS Asian Language Parallel Corpus
Stars: ✭ 32 (-98.11%)
Mutual labels:  japanese, english
say-it
TTS in command line -- Pronounce the Chinese and English words you typed in.
Stars: ✭ 19 (-98.88%)
Mutual labels:  english, chinese
Jiebar
Chinese text segmentation with R. (Documentation updated 🎉: https://qinwenfeng.com/jiebaR/ )
Stars: ✭ 302 (-82.18%)
Mutual labels:  chinese, jieba
Most Frequent Technology English Words
English vocabulary that programmers commonly encounter at work
Stars: ✭ 4,711 (+177.94%)
Mutual labels:  chinese, english
Chrome Extension Udemy Translate
Translate Udemy's subtitles into Chinese, English, etc. (Disneyplus + netflix + udemy + lynda + hulu + hbo now + primevideo)
Stars: ✭ 553 (-67.37%)
Mutual labels:  chinese, english
Google Ime Dictionary
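Add-on IME dictionary for Japanese-to-English conversion and English abbreviation expansion 📙 An IME extension dictionary that enables Japanese-to-English conversion and expansion of English abbreviations in Google Japanese Input, ATOK, and similar input methods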
Stars: ✭ 30 (-98.23%)
Mutual labels:  japanese, english
BSD
The Business Scene Dialogue corpus
Stars: ✭ 51 (-96.99%)
Mutual labels:  japanese, english
jiten
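jiten - Japanese android/cli/web dictionary based on jmdict/kanjidic (Japanese dictionary, Japanese-English dictionary, kanji-English dictionary, Japanese-German dictionary, Japanese-Dutch dictionary)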
Stars: ✭ 64 (-96.22%)
Mutual labels:  japanese, english
next-qrcode
React hooks for generating QRCode for your next React apps.
Stars: ✭ 87 (-94.87%)
Mutual labels:  japanese, chinese
OpenGNT
Open Greek New Testament Project; NA28 / NA27 Equivalent Text & Resources
Stars: ✭ 55 (-96.76%)
Mutual labels:  english, chinese
Opencc4j
🇨🇳 Open Chinese Convert is an open source project for conversion between Traditional Chinese and Simplified Chinese. (Java conversion between Traditional and Simplified Chinese)
Stars: ✭ 187 (-88.97%)
Mutual labels:  chinese, trie
ark-pixel-font
Open source Pan-CJK pixel font
Stars: ✭ 1,767 (+4.25%)
Mutual labels:  japanese, chinese
Borgert Cms
Borgert is a CMS Open Source created with Laravel Framework 5.6
Stars: ✭ 298 (-82.42%)
Mutual labels:  chinese, english
Mouse Dictionary
📘A super fast dictionary for Chrome/Firefox
Stars: ✭ 670 (-60.47%)
Mutual labels:  japanese, english

gse

Efficient multilingual NLP and text segmentation in Go; supports English, Chinese, Japanese, and other languages. It also integrates with elasticsearch and bleve.

Gse is a Go implementation of jieba that also aims to add NLP support and more features.

Feature:

  • Supports multiple word segmentation modes: common, search engine, full, precise, and HMM
  • Supports user and embedded dictionaries, part-of-speech (POS) tagging, segment info analysis, and stop-word filtering and trimming
  • Supports multiple languages: English, Chinese, Japanese, and others
  • Supports Traditional Chinese
  • Supports HMM-based text cutting using the Viterbi algorithm
  • NLP support via TensorFlow (in progress)
  • Named Entity Recognition (in progress)
  • Works with elasticsearch and bleve
  • Can run as a JSON RPC service

Algorithm:

  • The dictionary is implemented with a double-array trie
  • The segmenter uses a shortest-path algorithm (based on word frequency and dynamic programming), plus DAG-based and HMM-based word segmentation

Text Segmentation speed:

Binding:

gse-bind: bindings for JavaScript and other languages, so gse can be used from more languages.

Install / update

With Go module support (Go 1.11+), just import:

import "github.com/go-ego/gse"

Otherwise, to install the gse package, run the command:

go get -u github.com/go-ego/gse

Use

package main

import (
	"fmt"
	"regexp"

	"github.com/go-ego/gse"
	"github.com/go-ego/gse/hmm/pos"
)

var (
	text = "Hello world, Helloworld. Winter is coming! 你好世界."

	new, _ = gse.New("zh,testdata/test_dict3.txt", "alpha")

	seg gse.Segmenter
	posSeg pos.Segmenter
)

func main() {
	// Load the default dictionary
	seg.LoadDict()
	// Load the default dictionary from the embedded data
	// seg.LoadDictEmbed()
	//
	// Load the simplified Chinese dictionary
	// seg.LoadDict("zh_s")
	// seg.LoadDictEmbed("zh_s")
	//
	// Load the traditional Chinese dictionary
	// seg.LoadDict("zh_t")
	//
	// Load the Japanese dictionary
	// seg.LoadDict("jp")
	//
	// Load a dictionary from a file path
	// seg.LoadDict("your gopath"+"/src/github.com/go-ego/gse/data/dict/dictionary.txt")

	cut()

	segCut()
}

func cut() {
	hmm := new.Cut(text, true)
	fmt.Println("cut use hmm: ", hmm)

	hmm = new.CutSearch(text, true)
	fmt.Println("cut search use hmm: ", hmm)
	fmt.Println("analyze: ", new.Analyze(hmm, text))

	hmm = new.CutAll(text)
	fmt.Println("cut all: ", hmm)

	reg := regexp.MustCompile(`(\d+年|\d+月|\d+日|[\p{Latin}]+|[\p{Hangul}]+|\d+\.\d+|[a-zA-Z0-9]+)`)
	text1 := `헬로월드 헬로 서울, 2021年09月10日, 3.14`
	hmm = seg.CutDAG(text1, reg)
	fmt.Println("Cut with hmm and regexp: ", hmm, hmm[0], hmm[6])
}

func analyzeAndTrim(cut []string) {
	a := seg.Analyze(cut, "")
	fmt.Println("analyze the segment: ", a)

	// Trim removes stop words and blank segments from the cut result
	cut = seg.Trim(cut)
	fmt.Println("trimmed cut: ", cut)

	fmt.Println(seg.String(text, true))
	fmt.Println(seg.Slice(text, true))
}

func cutPos() {
	po := seg.Pos(text, true)
	fmt.Println("pos: ", po)
	po = seg.TrimPos(po)
	fmt.Println("trim pos: ", po)

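	// Register the gse segmenter on the POS segmenter value (WithGse is a method, not a package function)
	posSeg.WithGse(seg)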
	po = posSeg.Cut(text, true)
	fmt.Println("pos: ", po)

	po = posSeg.TrimWithPos(po, "zg")
	fmt.Println("trim pos: ", po)
}

func segCut() {
	// Text Segmentation
	tb := []byte(text)
	fmt.Println(seg.String(text, true))

	segments := seg.Segment(tb)
	// Handle word segmentation results, search mode
	fmt.Println(gse.ToString(segments, true))
}

Look at a custom dictionary example

package main

import (
	"fmt"
	_ "embed"

	"github.com/go-ego/gse"
)

//go:embed test_dict3.txt
var testDict string

func main() {
	// var seg gse.Segmenter
	// seg.LoadDict("zh, testdata/test_dict.txt, testdata/test_dict1.txt")
	// seg.LoadStop()
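	seg, err := gse.NewEmbed("zh, word 20 n"+testDict, "en")
	if err != nil {
		// Stop early if the embedded dictionary fails to load
		fmt.Println("load dictionary error: ", err)
		return
	}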
	// seg.LoadDictEmbed()
	seg.LoadStopEmbed()

	text1 := "你好世界, Hello world"
	fmt.Println(seg.Cut(text1, true))
	fmt.Println(seg.String(text1, true))

	segments := seg.Segment([]byte(text1))
	fmt.Println(gse.ToString(segments))
}

Look at a Chinese example

Look at a Japanese example

Elasticsearch

How to use it with elasticsearch?

go-gse-elastic

Authors

License

Gse is primarily distributed under the terms of "both the MIT license and the Apache License (Version 2.0)". See LICENSE-APACHE, LICENSE-MIT.

Thanks to sego and jieba (jiebago).
