DanielJDufour / language-detector

License: Apache-2.0
Detect the language of text

language-detector

language-detector detects the language of text

Installation

pip install language-detector

Python Version

Works with both Python 2 and 3

Use

from language_detector import detect_language

text = "I arrived in that city on January 4, 1937"
language = detect_language(text)
print(language)  # prints English

Features

Languages Supported
Arabic
English
Farsi
French
German
Khmer
Kurmanci (Kurdish)
Mandarin
Russian
Sorani (Kurdish)
Spanish
Turkish

Testing

To test the package run

python -m unittest language_detector.tests.test

Comparison

The following test compares how well language-detector and langid identify languages in the data sources.

package                       language-detector   langid
test-duration (in seconds)    0.10                3.83
accuracy                      96.77%              67.74%
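
A comparison of this kind can be reproduced with a small timing harness. The sketch below is not the author's benchmark script: the `evaluate` helper and the sample data are hypothetical, and it only assumes that each detector can be wrapped as a function that takes text and returns a language name.

```python
import time


def evaluate(detect, samples):
    """Return (duration_in_seconds, accuracy_in_percent) for a detector.

    detect  -- hypothetical callable: text -> language name
    samples -- list of (text, expected_language) pairs
    """
    start = time.time()
    correct = sum(1 for text, expected in samples if detect(text) == expected)
    duration = time.time() - start
    accuracy = 100.0 * correct / len(samples) if samples else 0.0
    return duration, accuracy


if __name__ == "__main__":
    # Stand-in detector that always answers "English"; swap in a real one.
    samples = [("Good morning", "English"), ("Bonjour", "French")]
    duration, accuracy = evaluate(lambda text: "English", samples)
    print("test-duration (in seconds): %.2f" % duration)
    print("accuracy: %.2f%%" % accuracy)
```

Running the same `samples` through each package's detector yields directly comparable duration and accuracy figures.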

Excluding Languages

If you don't want language-detector to look for certain languages, you can monkey-patch the code. For example, in order to exclude English:

import language_detector
# drop every entry whose language field (index 1) is English
language_detector.char_language = [cl for cl in language_detector.char_language if cl[1] != "English"]

# proceed as normal

Datasets

The following is a list of datasets used for each language:

Language              Datasets
Arabic                UN Corpora
English               UN Corpora
Farsi                 BBC News Persian
French                UN Corpora
German                Deutsche Welle
Khmer                 Cambodia Daily
Kurmanci (Kurdish)    Rudaw
Mandarin              UN Corpora
Russian               UN Corpora
Sorani (Kurdish)      Rudaw
Spanish               UN Corpora
Turkish               BBC News Türkçe

How Does It Work?

When training the model, we scan all the data sources and compute how frequently each character appears in each language. We also compute how frequently each character appears across the data sources for all languages combined. For each language, we then calculate a score for each character as frequency_in_language / frequency_in_all_languages, and save the ten highest-scoring characters for that language.
When detecting a language, we iterate through the saved characters (ten for each language) and add each character's score as a weighted vote for its language. Whichever language has the highest total score is selected as the winner.
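
The two steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the package's actual code: the `train` and `detect` names, the dictionary-based model, the choice to skip whitespace, and weighting each vote by how often the character occurs in the input are all assumptions made for the example.

```python
from collections import Counter


def char_frequencies(text):
    """Relative frequency of each non-whitespace character in text."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = float(sum(counts.values()))
    return {ch: n / total for ch, n in counts.items()}


def train(corpora, top_n=10):
    """corpora maps language name -> training text (hypothetical input shape).

    Returns a model mapping language -> list of (character, score) pairs,
    keeping the top_n highest-scoring characters per language.
    """
    overall = char_frequencies("".join(corpora.values()))
    model = {}
    for language, text in corpora.items():
        in_language = char_frequencies(text)
        # score = frequency_in_language / frequency_in_all_languages
        scores = {ch: freq / overall[ch] for ch, freq in in_language.items()}
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        model[language] = ranked[:top_n]
    return model


def detect(text, model):
    """Return the language with the highest weighted vote, or None."""
    votes = Counter()
    for language, scored_chars in model.items():
        for ch, score in scored_chars:
            # Assumption: each occurrence of a saved character adds its score.
            votes[language] += score * text.count(ch)
    if not votes or max(votes.values()) == 0:
        return None
    return max(votes, key=votes.get)


if __name__ == "__main__":
    # Tiny made-up corpora, far too small for real use.
    corpora = {
        "English": "the quick brown fox jumps over the lazy dog",
        "Russian": "съешь ещё этих мягких французских булок",
    }
    model = train(corpora)
    print(detect("hello there", model))
    print(detect("привет как дела", model))
```

Because a character that appears in only one language gets the highest possible ratio, script-specific characters dominate the saved lists, which is why so few characters per language can be enough to separate languages written in different scripts.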

Contributing

If you'd like to contribute a new language, please consult CONTRIBUTING.md

Support

Contact the package author, Daniel J. Dufour, at [email protected]
