chrisport / Go Lang Detector

Licence: apache-2.0
A small Go library that detects the language of a text (text categorization).

Projects that are alternatives of or similar to Go Lang Detector

Node Language Detect
🇫🇷 NodeJS language detection library using n-gram
Stars: ✭ 309 (+130.6%)
Mutual labels:  language-detection
Cadscenario personalisation
This is a end to end Personalisation business scenario
Stars: ✭ 10 (-92.54%)
Mutual labels:  language-detection
React Native Localize
🌍 A toolbox for your React Native app localization
Stars: ✭ 1,682 (+1155.22%)
Mutual labels:  language-detection
Franc
Natural language detection
Stars: ✭ 3,605 (+2590.3%)
Mutual labels:  language-detection
Language Detection
A language detection library for PHP. Detects the language from a given text string.
Stars: ✭ 665 (+396.27%)
Mutual labels:  language-detection
Google Translate Php
🌐 Free Google Translate API PHP Package. Translates totally free of charge.
Stars: ✭ 1,131 (+744.03%)
Mutual labels:  language-detection
laravel-nlp
Laravel wrapper for common NLP tasks
Stars: ✭ 41 (-69.4%)
Mutual labels:  language-detection
Fasttext.js
FastText for Node.js
Stars: ✭ 127 (-5.22%)
Mutual labels:  language-detection
Geomate
GeoMate is a friend in need for all things geolocation. IP to geo lookup, automatic redirects (based on country, continent, language, etc), site switcher... You name it.
Stars: ✭ 19 (-85.82%)
Mutual labels:  language-detection
Spacy Cld
Language detection extension for spaCy 2.0+
Stars: ✭ 103 (-23.13%)
Mutual labels:  language-detection
Yii2 Localeurls
Automatic locale/language management for URLs
Stars: ✭ 384 (+186.57%)
Mutual labels:  language-detection
Awesome Persian Nlp Ir
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Stars: ✭ 460 (+243.28%)
Mutual labels:  language-detection
Guess Language.el
Emacs minor mode that detects the language you're typing in. Automatically switches spell checker. Supports multiple languages per document.
Stars: ✭ 78 (-41.79%)
Mutual labels:  language-detection
Lingua
👄 The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike
Stars: ✭ 341 (+154.48%)
Mutual labels:  language-detection
Nlp Models Tensorflow
Gathers machine learning and Tensorflow deep learning models for NLP problems, 1.13 < Tensorflow < 2.0
Stars: ✭ 1,603 (+1096.27%)
Mutual labels:  language-detection
Lingua Rs
👄 The most accurate natural language detection library in the Rust ecosystem, suitable for long and short text alike
Stars: ✭ 260 (+94.03%)
Mutual labels:  language-detection
Cld2
R Wrapper for Google's Compact Language Detector 2
Stars: ✭ 34 (-74.63%)
Mutual labels:  language-detection
Whatthelang
Lightning Fast Language Prediction 🚀
Stars: ✭ 130 (-2.99%)
Mutual labels:  language-detection
Padatious
A neural network intent parser
Stars: ✭ 124 (-7.46%)
Mutual labels:  language-detection
Paasaa
Natural language detection for Elixir
Stars: ✭ 86 (-35.82%)
Mutual labels:  language-detection

Breaking changes in v0.2: please see the "Migration to v0.2" section below. The previous version is available as release v0.1: https://github.com/chrisport/go-lang-detector/releases/tag/v0.1

Language Detector

This golang library provides functionality to analyze and recognize language based on text.

The implementation is based on the following paper:
N-Gram-Based Text Categorization
William B. Cavnar and John M. Trenkle
Environmental Research Institute of Michigan P.O. Box 134001
Ann Arbor MI 48113-4001

Detection by Language profile

A language profile is a map[string]int that maps n-gram tokens to their occurrence rank. So for the most frequent token 'X' of the analyzed text, map['X'] will be 1.
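
To make the rank-based profile concrete, here is a minimal sketch that builds such a map from a text and compares two profiles with the "out-of-place" measure described in the Cavnar and Trenkle paper. The function names buildProfile and outOfPlace, the n-gram lengths and the penalty value are assumptions made for illustration; this is not the library's actual implementation, and it skips details such as word-boundary padding.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildProfile counts all character n-grams of length 1..maxN inside each word
// of text and returns a map from n-gram to its frequency rank (1 = most frequent).
// Ties are broken alphabetically so the result is deterministic.
func buildProfile(text string, maxN int) map[string]int {
	counts := map[string]int{}
	for _, word := range strings.Fields(strings.ToLower(text)) {
		runes := []rune(word)
		for n := 1; n <= maxN; n++ {
			for i := 0; i+n <= len(runes); i++ {
				counts[string(runes[i:i+n])]++
			}
		}
	}

	tokens := make([]string, 0, len(counts))
	for t := range counts {
		tokens = append(tokens, t)
	}
	sort.Slice(tokens, func(a, b int) bool {
		if counts[tokens[a]] != counts[tokens[b]] {
			return counts[tokens[a]] > counts[tokens[b]]
		}
		return tokens[a] < tokens[b]
	})

	profile := make(map[string]int, len(tokens))
	for i, t := range tokens {
		profile[t] = i + 1
	}
	return profile
}

// outOfPlace sums, over every n-gram of the document profile, the distance
// between its rank in the document and its rank in the language profile.
// N-grams missing from the language profile cost maxPenalty.
// A smaller total means the two profiles are more alike.
func outOfPlace(doc, lang map[string]int, maxPenalty int) int {
	total := 0
	for token, docRank := range doc {
		langRank, ok := lang[token]
		if !ok {
			total += maxPenalty
			continue
		}
		d := docRank - langRank
		if d < 0 {
			d = -d
		}
		total += d
	}
	return total
}

func main() {
	// Tiny hypothetical samples; real profiles need far more text.
	english := buildProfile("the quick brown fox jumps over the lazy dog", 3)
	french := buildProfile("le renard brun saute par dessus le chien paresseux", 3)
	doc := buildProfile("the dog jumps over the fox", 3)

	fmt.Println("distance to english:", outOfPlace(doc, english, 100))
	fmt.Println("distance to french: ", outOfPlace(doc, french, 100))
}
```

The language whose profile yields the smallest distance is the closest match.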

Detection by unicode range

A second way to detect the language is by the Unicode range used in the text. Go has a set of predefined Unicode ranges in package unicode, which can be used easily, for example to detect Chinese/Japanese/Korean:

var CHINESE_JAPANESE_KOREAN = &langdet.UnicodeRangeLanguageComparator{"CJK", unicode.Han}

Usage

Detect

Get the closest language:

The default detector supports the following languages: Arabic, English, French, German, Hebrew, Russian, Turkish

    detector := langdetdef.NewWithDefaultLanguages()
    testString := "do not care about quantity"
    result := detector.GetClosestLanguage(testString)
    fmt.Println(result)

output:
    english

Setting langdet.MinimumConfidence (a value between 0 and 1) controls the accepted confidence level. For example, with 0.7 the detector returns a language only if it is at least 70% sure that it matches; otherwise it returns 'undefined'.
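
The effect of such a threshold can be sketched without the library. The function closest and the map of scores below are assumptions for illustration; they mimic the described behaviour and do not reproduce the library's internals.

```go
package main

import "fmt"

// closest returns the name with the highest confidence, or "undefined" when
// even the best candidate stays below minConfidence.
// Confidences are expected to lie between 0 and 1.
func closest(confidences map[string]float64, minConfidence float64) string {
	best, bestScore := "undefined", -1.0
	for name, score := range confidences {
		if score > bestScore {
			best, bestScore = name, score
		}
	}
	if bestScore < minConfidence {
		return "undefined"
	}
	return best
}

func main() {
	// Hypothetical scores for a single input snippet.
	scores := map[string]float64{"french": 0.86, "english": 0.79, "german": 0.71}

	fmt.Println(closest(scores, 0.7)) // the best score clears the threshold
	fmt.Println(closest(scores, 0.9)) // no score clears the threshold
}
```

A higher threshold trades coverage for precision: short or ambiguous inputs are more likely to come back as 'undefined'.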

Get Language Probabilities

GetClosestLanguage returns the language that most probably matches. To get the results for all analyzed languages, use GetLanguages, which returns every analyzed language together with the percentage to which it matches the input snippet:

testString := "ont permis d'identifier"
GetLanguages returns:
    french 86 %
    english 79 %
    german 71 %
    turkish 54 %
    hebrew 39 %
    arabic 8 %
    russian 5 %


Analyze new language

Random Wikipedia articles in the target language are ideal samples for analyzing a new language. The result is a Language object containing the specified name and the profile, for example:

    language := langdet.Analyze(text_sample, "french")
    language.Profile // language profile in form of map[string]int as defined above
    language.Name // the name that was given as parameter

Add more languages

New languages can directly be analyzed and added to a detector by providing a text sample:

    text_sample := GetTextFromFile("samples/polish.txt")
    detector.AddLanguageFrom(text_sample, "polish")

The text sample should be larger than 200 KB and can be "dirty" (special chars, lists, etc.), but the language should not change for long parts.

Alternatively, Analyze can be used and the resulting language can be added using the AddLanguage method:

    text_sample := GetTextFromFile("samples/french.txt")
    french := langdet.Analyze(text_sample, "french")

    // the language can be added selectively to detectors
    detectorA.AddLanguage(french)
    detectorC.AddLanguage(french)

Migration to v0.2

The library's API has been reworked to be more convenient and more idiomatic.

  • Default languages are provided in Go code, so there is no longer any need to add the JSON file.
  • All code related to defaults has been moved to package langdetdef.
  • Default languages can be added using the provided interfaces:

    // detector with default languages
    detector := langdetdef.NewWithDefaultLanguages()

    // add all to existing detector
    defaults := langdetdef.DefaultLanguages()
    detector.AddLanguageComparators(defaults...)

    // add selectively
    detector.AddLanguageComparators(langdetdef.CHINESE_JAPANESE_KOREAN, langdetdef.ENGLISH)

  • InitWithDefaultFromXY has been removed. Custom default languages can be unmarshaled manually and added to a detector through the AddLanguage interface:

    detector := langdet.NewDetector()
    customLanguages := []langdet.Language{}

    _ = json.Unmarshal(bytesFromFile, &customLanguages)
    detector.AddLanguage(customLanguages...)
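
The manual unmarshaling step above discards the error returned by json.Unmarshal. A slightly more defensive variant is sketched below. The Language struct is a local stand-in that carries only the Name and Profile fields mentioned in this README, and the JSON field names are assumptions; the real langdet.Language and the layout of a stored profile file may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Language is a local stand-in for langdet.Language (assumption: the real
// type and its JSON field names may differ).
type Language struct {
	Name    string         `json:"name"`
	Profile map[string]int `json:"profile"`
}

// parseLanguages decodes a JSON array of language profiles and reports
// malformed input instead of silently ignoring it.
func parseLanguages(data []byte) ([]Language, error) {
	var languages []Language
	if err := json.Unmarshal(data, &languages); err != nil {
		return nil, fmt.Errorf("parsing language profiles: %w", err)
	}
	return languages, nil
}

func main() {
	// Hypothetical file content with a single, very small profile.
	data := []byte(`[{"name":"polish","profile":{"ie":1,"a":2}}]`)

	languages, err := parseLanguages(data)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, l := range languages {
		fmt.Println(l.Name, len(l.Profile))
	}
}
```

Checking the error before calling AddLanguage avoids registering an empty set of languages when the file is missing or corrupted.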

Contribution

Suggestions and bug reports can be made through GitHub issues. Contributions are welcome. There is currently no need to open an issue before contributing, but please follow the code style and include descriptive tests written with GoConvey.

License

Licensed under Apache 2.0.