
bsolomon1124 / pycld3

License: Apache-2.0
Python3 bindings for the Compact Language Detector v3 (CLD3)

Programming Languages

C++

Projects that are alternatives of or similar to pycld3

Geomate
GeoMate is a friend in need for all things geolocation. IP to geo lookup, automatic redirects (based on country, continent, language, etc), site switcher... You name it.
Stars: ✭ 19 (-84.43%)
Mutual labels:  language-detection
Nlp Models Tensorflow
Gathers machine learning and Tensorflow deep learning models for NLP problems, 1.13 < Tensorflow < 2.0
Stars: ✭ 1,603 (+1213.93%)
Mutual labels:  language-detection
L10n Swift
Localization of the application with ability to change language "on the fly" and support for plural form in any language.
Stars: ✭ 177 (+45.08%)
Mutual labels:  language-detection
Cld2
R Wrapper for Google's Compact Language Detector 2
Stars: ✭ 34 (-72.13%)
Mutual labels:  language-detection
Spacy Cld
Language detection extension for spaCy 2.0+
Stars: ✭ 103 (-15.57%)
Mutual labels:  language-detection
Fasttext.js
FastText for Node.js
Stars: ✭ 127 (+4.1%)
Mutual labels:  language-detection
Awesome Persian Nlp Ir
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Stars: ✭ 460 (+277.05%)
Mutual labels:  language-detection
cnn-ld-tf
Convolutional Neural Network for Language Detection in Tensorflow
Stars: ✭ 12 (-90.16%)
Mutual labels:  language-detection
React Native Localize
🌍 A toolbox for your React Native app localization
Stars: ✭ 1,682 (+1278.69%)
Mutual labels:  language-detection
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+1963.93%)
Mutual labels:  language-detection
Google Translate Php
🌐 Free Google Translate API PHP Package. Translates totally free of charge.
Stars: ✭ 1,131 (+827.05%)
Mutual labels:  language-detection
Paasaa
Natural language detection for Elixir
Stars: ✭ 86 (-29.51%)
Mutual labels:  language-detection
Whatthelang
Lightning Fast Language Prediction 🚀
Stars: ✭ 130 (+6.56%)
Mutual labels:  language-detection
Cadscenario personalisation
This is a end to end Personalisation business scenario
Stars: ✭ 10 (-91.8%)
Mutual labels:  language-detection
Hms Ml Demo
HMS ML Demo provides an example of integrating Huawei ML Kit service into applications. This example demonstrates how to integrate services provided by ML Kit, such as face detection, text recognition, image segmentation, asr, and tts.
Stars: ✭ 187 (+53.28%)
Mutual labels:  language-detection
Language Detection
A language detection library for PHP. Detects the language from a given text string.
Stars: ✭ 665 (+445.08%)
Mutual labels:  language-detection
Padatious
A neural network intent parser
Stars: ✭ 124 (+1.64%)
Mutual labels:  language-detection
cld3-kotlin
Bindings to Google's Compact Language Detector 3 to JVM Based Languages
Stars: ✭ 20 (-83.61%)
Mutual labels:  cld3
Malaya
Natural Language Toolkit for bahasa Malaysia, https://malaya.readthedocs.io/
Stars: ✭ 239 (+95.9%)
Mutual labels:  language-detection
Go Lang Detector
A small library in golang, that detects the language of a text. (text categorization)
Stars: ✭ 134 (+9.84%)
Mutual labels:  language-detection

pycld3

Python bindings to the Compact Language Detector v3 (CLD3).


Newer Alternative: gcld3

Note: Since the original publication of pycld3, Google's cld3 authors have published the Python package gcld3, which provides official Python bindings built with pybind. Please check that project out, as it is part of the canonical cld3 repository and will likely stay in closer lockstep with cld3 changes over time.

Overview

This package contains Python bindings (via Cython) to Google's CLD3 library.

>>> import cld3
>>> cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)

The library outputs BCP-47-style language codes. For some languages, output is differentiated by script. Language and script names come from Unicode CLDR. The library supports over 100 languages/scripts. See the full list of supported languages/scripts in Google's CLD3 documentation.
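
Because some codes carry a script suffix, code that only cares about the base language may want to separate the two parts. The following is a minimal sketch, assuming codes take the form `lang` or `lang-Script` (for example `bg-Latn`, which the CLD3 documentation lists for Bulgarian in Latin script). `split_language_code` is a hypothetical helper and not part of cld3:

```python
def split_language_code(code):
    """Split a code such as 'bg-Latn' into ('bg', 'Latn').

    The script part is None when the code has no suffix.
    """
    language, _, script = code.partition("-")
    return language, script or None


print(split_language_code("zh"))       # ('zh', None)
print(split_language_code("bg-Latn"))  # ('bg', 'Latn')
```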

Installing with Wheels: Supported Versions and Platforms

This project supports CPython versions 3.6 through 3.9.

We publish wheels for the following matrix:

  • MacOS: CPython 3.6 through 3.9
  • Linux: CPython 3.6 through 3.9 (manylinux1)

The wheels for both MacOS and manylinux1 include the external protobuf library copied into the wheel itself via auditwheel or delocate so that you won't need to install any extra non-PyPI dependencies.

If you are installing on one of the variants listed above, you should not need to have protoc or libprotobuf installed:

python -m pip install -U pycld3
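
To check quickly whether the current interpreter falls inside the wheel matrix above, something like the following sketch may help. It only encodes the operating systems and CPython versions listed in this section. It does not check CPU architecture or the C library in use, so it can still return True on hosts that have to build from source (for example i686 or Alpine Linux). `wheel_likely_available` is a hypothetical helper name:

```python
import platform
import sys


def wheel_likely_available(system, implementation, version):
    """Return True if (OS, interpreter, version) is in the published wheel matrix."""
    return (
        system in ("Darwin", "Linux")               # platform.system() values for MacOS and Linux
        and implementation == "CPython"
        and (3, 6) <= tuple(version[:2]) <= (3, 9)  # CPython 3.6 through 3.9
    )


print(
    wheel_likely_available(
        platform.system(),
        platform.python_implementation(),
        sys.version_info,
    )
)
```

If this prints False, the source-install prerequisites described in the next section apply.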

Installing from Source: Prerequisites

If you are not on a platform variant that is eligible to use a wheel, you may still be able to use pycld3 via its source distribution (tar.gz), but a bit more work is required to install. Namely, you'll also need:

  • the Protobuf compiler (the protoc executable)
  • the Protobuf development headers and libprotoc library
  • a compiler, preferably g++

Please consult the official protobuf repository for information on installing Protobuf. The project contains an Installation README that covers installation on Windows and Unix.

If for whatever reason you are on a Unix host but unable to use the wheels (for instance, if you have an i686 architecture), here is a quick-and-dirty guide to installing.

Debian/Ubuntu

sudo apt-get update -y
sudo apt-get install -y --no-install-recommends \
    g++ \
    protobuf-compiler \
    libprotobuf-dev
python -m pip install -U pycld3

Alpine Linux

Note: Alpine Linux does not support PyPI wheels as of April 2020. The steps below are mandatory on Alpine Linux because you will need to install from the source distribution. If the situation permits, using a Debian distro should be much easier (and faster).

apk --update add g++ protobuf protobuf-dev
python -m pip install -U pycld3

CentOS/RHEL

Install from source, as root/UID 0:

sudo su -
set -ex
pushd /opt
PROTOBUF_VERSION='3.11.4'
yum update -y
yum install -y autoconf automake gcc-c++ glibc-headers gzip libtool make python3-devel zlib-devel
curl -Lo /opt/protobuf.tar.gz \
    "https://github.com/protocolbuffers/protobuf/releases/download/v${PROTOBUF_VERSION}/protobuf-cpp-${PROTOBUF_VERSION}.tar.gz"
tar -xzvf protobuf.tar.gz
rm -f protobuf.tar.gz
pushd "protobuf-${PROTOBUF_VERSION}"
./configure --with-zlib --disable-debug && make && make install && ldconfig --verbose
popd && rm -rf "protobuf-${PROTOBUF_VERSION}" && popd && set +ex

python -m pip install -U pycld3

Note: the steps above are for CentOS 8. For earlier versions, you may need to replace:

  • gcc-c++ with g++
  • python3-devel with python-devel

MacOS/Homebrew

brew update
brew upgrade protobuf || brew install -v protobuf
python -m pip install -U pycld3

Windows

Please consult Protobuf's C++ Installation - Windows section for help with installing Protobuf on Windows.

If you would like to help contribute Windows wheels (preferably as a job within the project's CI/CD pipelines), please file an issue.

Usage

cld3 exports two module-level functions, get_language() and get_frequent_languages():

>>> import cld3

>>> cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)

>>> cld3.get_language("This is a test")
LanguagePrediction(language='en', probability=0.9999980926513672, is_reliable=True, proportion=1.0)

>>> for lang in cld3.get_frequent_languages(
...     "This piece of text is in English. Този текст е на Български.",
...     num_langs=3
... ):
...     print(lang)
...
LanguagePrediction(language='bg', probability=0.9173890948295593, is_reliable=True, proportion=0.5853658318519592)
LanguagePrediction(language='en', probability=0.9999790191650391, is_reliable=True, proportion=0.4146341383457184)
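
The results can be post-processed like ordinary Python objects. The sketch below keeps only predictions flagged as reliable and maps each language code to its share of the text. It assumes the results expose the fields shown in the output above (`language`, `probability`, `is_reliable`, `proportion`) as attributes. `Prediction` is a stand-in type so the example runs without cld3 installed, and `reliable_proportions` is a hypothetical helper:

```python
from collections import namedtuple

# Stand-in with the same fields as the LanguagePrediction results shown above,
# so this sketch runs without cld3 installed.
Prediction = namedtuple("Prediction", "language probability is_reliable proportion")


def reliable_proportions(predictions):
    """Map language code -> proportion, keeping only reliable predictions."""
    return {p.language: p.proportion for p in predictions if p.is_reliable}


sample = [
    Prediction("bg", 0.9173890948295593, True, 0.5853658318519592),
    Prediction("en", 0.9999790191650391, True, 0.4146341383457184),
]
print(reliable_proportions(sample))
```

With cld3 installed, the same helper should accept the result of `cld3.get_frequent_languages(text, num_langs=3)` in place of `sample`.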

FAQ

cld3 incorrectly detects my input. How can I fix this?

A first resort is to preprocess (clean) your input text based on conditions specific to your program.

A salient example is to remove URLs and email addresses from the input. CLD3 (unlike CLD2) does almost none of this cleaning for you, in the spirit of not penalizing other users with overhead that they may not need.

Here's such an example using a simplified URL regex from Regular Expressions Cookbook, 2nd ed.:

>>> import re
>>> import cld3

# cld3 does not ignore the URL components by default
>>> s = "Je veux que: https://site.english.com/this/is/a/url/path/component#fragment"
>>> cld3.get_language(s)
LanguagePrediction(language='en', probability=0.5319557189941406, is_reliable=False, proportion=1.0)

>>> url_re = r"\b(?:https?://|www\.)[a-z0-9-]+(\.[a-z0-9-]+)+(?:[/?].*)?"
>>> new_s = re.sub(url_re, "", s)
>>> new_s
'Je veux que: '
>>> cld3.get_language(new_s)
LanguagePrediction(language='fr', probability=0.9799421429634094, is_reliable=True, proportion=1.0)

Note: This URL regex aims for simplicity. It requires a domain name, and doesn't allow a username or password; it allows the scheme (http or https) to be omitted if it can be inferred from the subdomain (www). Source: Regular Expressions Cookbook, 2nd ed. - Goyvaerts & Levithan.
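
The cleaning step can be wrapped in a small reusable function. The sketch below is one possible variation, not the only way to do it. It changes the tail of the URL pattern from `.*` to `\S*` so that text following a URL on the same line is kept, and it adds a deliberately loose email pattern that is an assumption of this example rather than something taken from the cookbook. `clean_for_detection` is a hypothetical helper name:

```python
import re

# URL pattern adapted from the regex above; \S* stops at the first whitespace
# so that words after the URL are preserved.
URL_RE = re.compile(
    r"\b(?:https?://|www\.)[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:[/?]\S*)?",
    re.IGNORECASE,
)
# Loose email pattern (an assumption for illustration, not a full validator).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def clean_for_detection(text):
    """Remove email addresses and URLs, then collapse runs of whitespace."""
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())


s = "Je veux que: https://site.english.com/this/is/a/url/path/component#fragment merci"
print(clean_for_detection(s))  # Je veux que: merci
```

The cleaned string can then be passed to `cld3.get_language()` as in the example above.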

In some other cases, you cannot fix the incorrect detection. Language detection algorithms in general may perform poorly with very short inputs. Rarely should you trust the output of something like detect("hi"). Keep this limitation in mind regardless of what library you are using.
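
One way to act on this limitation is to decline to answer for very short or unreliable inputs. The sketch below takes the detector as a parameter so that it runs without cld3 installed. The 20-character threshold is an arbitrary assumption to tune for your own data, and `detect_or_none` and `fake_detect` are hypothetical names:

```python
from collections import namedtuple


def detect_or_none(text, detect, min_chars=20):
    """Return a language code, or None if the input is too short or unreliable."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return None
    prediction = detect(stripped)
    return prediction.language if prediction.is_reliable else None


# Stand-in detector for demonstration only; it always answers 'en'.
FakePrediction = namedtuple("FakePrediction", "language is_reliable")


def fake_detect(text):
    return FakePrediction("en", True)


print(detect_or_none("hi", fake_detect))                                       # None
print(detect_or_none("This is a considerably longer sentence.", fake_detect))  # en
```

In real use, `cld3.get_language` would be passed as the `detect` argument.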

Please remember that, at the end of the day, this project is just a Python wrapper around the CLD3 C++ library, which does the actual heavy lifting.

I'm seeing an error during pip installation. How can I fix this?

First, please make sure that you have read the installation section and that you have installed Protobuf if necessary.

If that doesn't help, please file an issue in this repository. The build process for this project is somewhat complex because it involves both Cython and Protobuf, but I do my best to make it work everywhere possible.

Protobuf is installed, but I'm still seeing "cannot open shared object file"

You may have installed Protobuf but still see an error such as:

ImportError: libprotobuf.so.22: cannot open shared object file: No such file or directory

This likely means that Python is not finding the libprotobuf shared object, possibly because ldconfig didn't do what it was supposed to. You may need to tell it where to look.

You can find where the library sits by locating libprotoc.so, which is normally installed in the same directory as libprotobuf.so:

$ find /usr -name 'libprotoc.so' \( -type l -o -type f \)
/usr/local/lib/libprotoc.so

Then, you can add the directory containing this file to LD_LIBRARY_PATH:

export LD_LIBRARY_PATH="$(dirname $(find /usr -name 'libprotoc.so' \( -type l -o -type f \))):$LD_LIBRARY_PATH"

You can quickly test that this worked:

$ python -c 'import cld3; print(cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度"))'
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)
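
As a complementary diagnostic, Python's standard library can report whether it is able to locate a Protobuf shared library at all. This is a rough sketch using `ctypes.util.find_library`. Its search strategy differs from the dynamic loader's in some cases, so treat the result as a hint rather than proof. `protobuf_lookup_message` is a hypothetical helper name:

```python
import ctypes.util


def protobuf_lookup_message():
    """Describe whether ctypes can locate a libprotobuf shared library."""
    name = ctypes.util.find_library("protobuf")
    if name is None:
        return "libprotobuf not found; check ldconfig or LD_LIBRARY_PATH"
    return "libprotobuf found: " + name


print(protobuf_lookup_message())
```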

Authors

This repository contains a fork of google/cld3 at commit 06f695f. The license for google/cld3 can be found at LICENSES/CLD3_LICENSE.

This repository is a combination of changes introduced by various forks of google/cld3 by the following people:
