MartinThoma / lidtk

License: MIT
Language Identification Toolkit

Programming Languages

Python
Makefile

Projects that are alternatives to or similar to lidtk

lingua-go
👄 The most accurate natural language detection library for Go, suitable for long and short text alike
Stars: ✭ 684 (+3923.53%)
Mutual labels:  nlp-machine-learning, language-identification
Naive-Bayes-Evening-Workshop
Companion code for Introduction to Python for Data Science: Coding the Naive Bayes Algorithm evening workshop
Stars: ✭ 23 (+35.29%)
Mutual labels:  nlp-machine-learning
Engine
Centrifuge processes, filters, and saves the relevant documents as recommendations to the relevant users
Stars: ✭ 20 (+17.65%)
Mutual labels:  nlp-machine-learning
Winter
Winter is a 2D game engine for Pharo Smalltalk
Stars: ✭ 43 (+152.94%)
Mutual labels:  mit-license
powerslaves
Taking PowerSaves as a slave to your will.
Stars: ✭ 28 (+64.71%)
Mutual labels:  mit-license
seqtolang
Multi-Language Identification
Stars: ✭ 26 (+52.94%)
Mutual labels:  language-identification
vr-streaming-overlay
SteamVR overlay for streamers on Linux/Windows
Stars: ✭ 29 (+70.59%)
Mutual labels:  mit-license
ShortText-Fasttext
ShortText classification
Stars: ✭ 12 (-29.41%)
Mutual labels:  nlp-machine-learning
brand-sentiment-analysis
Scripts utilizing Heartex platform to build brand sentiment analysis from the news
Stars: ✭ 21 (+23.53%)
Mutual labels:  nlp-machine-learning
Very-deep-cnn-tensorflow
Very deep CNN for text classification
Stars: ✭ 18 (+5.88%)
Mutual labels:  nlp-machine-learning
Quora QuestionPairs DL
Kaggle Competition: Using deep learning to solve quora's question pairs problem
Stars: ✭ 54 (+217.65%)
Mutual labels:  nlp-machine-learning
fight-for-artistic-creativity
What we can do to keep Twitter from becoming a dystopia.
Stars: ✭ 19 (+11.76%)
Mutual labels:  mit-license
nn-segmentation-for-lar
Neural networks to segment some type of biomedical images
Stars: ✭ 21 (+23.53%)
Mutual labels:  mit-license
AI-Sentiment-Analysis-on-IMDB-Dataset
Sentiment Analysis using Stochastic Gradient Descent on 50,000 Movie Reviews Compiled from the IMDB Dataset
Stars: ✭ 55 (+223.53%)
Mutual labels:  nlp-machine-learning
kex
Kex is a python library for unsupervised keyword extraction from a document, providing an easy interface and benchmarks on 15 public datasets.
Stars: ✭ 46 (+170.59%)
Mutual labels:  nlp-machine-learning
Deception-Detection-on-Amazon-reviews-dataset
A SVM model that classifies the reviews as real or fake. Used both the review text and the additional features contained in the data set to build a model that predicted with over 85% accuracy without using any deep learning techniques.
Stars: ✭ 42 (+147.06%)
Mutual labels:  nlp-machine-learning
Inventus
Inventus is a spider designed to find subdomains of a specific domain by crawling it and any subdomains it discovers.
Stars: ✭ 80 (+370.59%)
Mutual labels:  mit-license
anuvada
Interpretable Models for NLP using PyTorch
Stars: ✭ 102 (+500%)
Mutual labels:  nlp-machine-learning
fswatch
File/Directory Watcher for Modern C++
Stars: ✭ 56 (+229.41%)
Mutual labels:  mit-license
Willow
The Web Interaction Library that eases the burden of creating AJAX-based web applications
Stars: ✭ 41 (+141.18%)
Mutual labels:  mit-license

lidtk

lidtk, the language identification toolkit, was written to investigate the current state of language identification performance.

Installation

The recommended way to install lidtk is:

$ pip install lidtk --user

If you want the latest version:

$ git clone https://github.com/MartinThoma/lidtk.git; cd lidtk
$ pip install -e . --user

I recommend getting the WiLI-2018 dataset.

Usage

$ lidtk --help

Usage: lidtk [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  analyze-data           Utility function for the languages...
  analyze-unicode-block  Analyze how important a Unicode block is for...
  char-distrib           Use the character distribution language...
  cld2                   Use the CLD-2 language classifier.
  create-dataset         Create sharable dataset from downloaded...
  download               Download 1000 documents of each language.
  google-cloud           Use the CLD-2 language classifier.
  langdetect             Use the langdetect language classifier.
  langid                 Use the langid language classifier.
  map                    Map predictions to something known by WiLI
  nn                     Use a neural network classifier.
  textcat                Use the CLD-2 language classifier.
  tfidf_nn               Use the TfidfNNClassifier classifier.

For example:

$ lidtk cld2 predict --text 'This is a test.'
eng
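
If you want to call the command above from a Python script, one option is to shell out to the CLI. The following is a minimal sketch, not part of lidtk: the helper names `build_predict_command` and `predict_language` are hypothetical, and the only lidtk interface it assumes is the `lidtk <classifier> predict --text <text>` invocation shown above.

```python
import shutil
import subprocess
from typing import List, Optional


def build_predict_command(classifier: str, text: str) -> List[str]:
    """Build the argument list for `lidtk <classifier> predict --text <text>`."""
    return ["lidtk", classifier, "predict", "--text", text]


def predict_language(classifier: str, text: str) -> Optional[str]:
    """Run the lidtk CLI and return its stripped stdout.

    Returns None if no `lidtk` executable is found on PATH.
    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    if shutil.which("lidtk") is None:
        return None
    result = subprocess.run(
        build_predict_command(classifier, text),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

With lidtk and the CLD-2 bindings installed, `predict_language("cld2", "This is a test.")` should return whatever the CLI prints for that text.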

The usual order is:

  1. lidtk download: Please use WiLI-2018 instead of downloading the dataset on your own.
  2. lidtk create-dataset: This step can be skipped if you use WiLI-2018
  3. lidtk analyze-unicode-block --start 0 --end 128
  4. lidtk tfidf_nn train vectorizer --config lidtk/classifiers/config/tfidf_nn.yaml
  5. lidtk tfidf_nn train vectorizer --config lidtk/classifiers/config/tfidf_nn.yaml
  6. lidtk tfidf_nn wili --config lidtk/classifiers/config/tfidf_nn.yaml
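
To give an idea of what step 3 looks at, here is a small sketch that measures how much of a text falls into a range of Unicode code points. It is an illustration only and not lidtk's implementation: the function names are hypothetical, and treating `--start 0 --end 128` as the half-open range [0, 128) is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def block_share(text: str, start: int, end: int) -> float:
    """Fraction of characters whose code point lies in [start, end)."""
    if not text:
        return 0.0
    in_block = sum(1 for ch in text if start <= ord(ch) < end)
    return in_block / len(text)


def block_share_by_language(
    samples: Iterable[Tuple[str, str]], start: int, end: int
) -> Dict[str, float]:
    """Aggregate the block share per language over (language, text) pairs."""
    in_block: Counter = Counter()
    total: Counter = Counter()
    for lang, text in samples:
        total[lang] += len(text)
        in_block[lang] += sum(1 for ch in text if start <= ord(ch) < end)
    # Skip languages that only had empty texts to avoid division by zero.
    return {lang: in_block[lang] / total[lang] for lang in total if total[lang] > 0}


if __name__ == "__main__":
    samples = [
        ("eng", "This is a test."),
        ("deu", "Gr\u00fc\u00dfe"),  # German word with two non-ASCII letters
        ("rus", "\u041f\u0440\u0438\u0432\u0435\u0442"),  # Cyrillic only
    ]
    for lang, share in block_share_by_language(samples, 0, 128).items():
        print(lang, round(share, 2))
```

A block that covers nearly all characters of some languages and almost none of others is a strong signal on its own, which is why inspecting the range 0 to 128 first is a reasonable starting point.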

Or use a classifier directly:

$ lidtk cld2 predict --text 'This text is written in some language.'

eng

Development

Run the tests with tox.
