bakwc / Jamspell

Licence: mit
Modern spell checking library - accurate, fast, multi-language

Programming Languages

python, java, ruby, cpp, csharp

Projects that are alternatives of or similar to Jamspell

spell
Spelling correction and string segmentation written in Go
Stars: ✭ 24 (-93.51%)
Mutual labels:  spellcheck
contextualSpellCheck
✔️Contextual word checker for better suggestions
Stars: ✭ 274 (-25.95%)
Mutual labels:  spellcheck
cyberdic
An auxiliary spellcheck dictionary that corresponds with the Bishop Fox Cybersecurity Style Guide
Stars: ✭ 63 (-82.97%)
Mutual labels:  spellcheck
WordSegmentationDP
Word Segmentation with Dynamic Programming
Stars: ✭ 18 (-95.14%)
Mutual labels:  spellcheck
check-spelling
Spelling checker action
Stars: ✭ 139 (-62.43%)
Mutual labels:  spellcheck
ispell-lt
Lithuanian spellchecking dictionary
Stars: ✭ 26 (-92.97%)
Mutual labels:  spellcheck
Nodehun
The Hunspell binding for NodeJS that exposes as much of Hunspell as possible and also adds new features. Hunspell is a first class spellcheck library used by Google, Apple, and Mozilla.
Stars: ✭ 229 (-38.11%)
Mutual labels:  spellcheck
Nlprule
A fast, low-resource Natural Language Processing and Text Correction library written in Rust.
Stars: ✭ 309 (-16.49%)
Mutual labels:  spellcheck
identypo
identypo is a Go static analysis tool to find typos in identifiers (functions, function calls, variables, constants, type declarations, packages, labels).
Stars: ✭ 26 (-92.97%)
Mutual labels:  spellcheck
SymSpellCppPy
Fast SymSpell written in c++ and exposes to python via pybind11
Stars: ✭ 28 (-92.43%)
Mutual labels:  spellcheck
voikko-rs
Rust bindings for the Voikko library
Stars: ✭ 16 (-95.68%)
Mutual labels:  spellcheck
LinSpell
Fast approximate strings search & spelling correction
Stars: ✭ 52 (-85.95%)
Mutual labels:  spellcheck
Emacs-LanguageTool.el
LanguageTool suggestions integrated within Emacs
Stars: ✭ 44 (-88.11%)
Mutual labels:  spellcheck
neuspell
NeuSpell: A Neural Spelling Correction Toolkit
Stars: ✭ 524 (+41.62%)
Mutual labels:  spellcheck
hanspell
A Korean spell checker that uses the web services of Daum Corp. and of the Pusan National University AI Lab / Nara Infotech Co., Ltd.
Stars: ✭ 72 (-80.54%)
Mutual labels:  spellcheck
flake8-spellcheck
❄️ Spellcheck variables, classnames, comments, docstrings etc
Stars: ✭ 71 (-80.81%)
Mutual labels:  spellcheck
yaspeller-ci
Fast spelling check for Travis CI
Stars: ✭ 60 (-83.78%)
Mutual labels:  spellcheck
Pyspellchecker
Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
Stars: ✭ 336 (-9.19%)
Mutual labels:  spellcheck
Proofreader
Simple text proofreader based on 'write-good' (hemingway-app-like suggestions) and 'nodehun' (spelling).
Stars: ✭ 285 (-22.97%)
Mutual labels:  spellcheck
wellspell.addin
R Package - Quick Spellcheck Addin for RStudio
Stars: ✭ 22 (-94.05%)
Mutual labels:  spellcheck

JamSpell

JamSpell is a spell checking library with the following features:

  • accurate - it considers each word's surroundings (context) for better correction
  • fast - nearly 5K words per second
  • multi-language - it is written in C++ and available for many languages through SWIG bindings

Colab example

JamSpellPro

jamspell.com - check out the new JamSpell version with the following features:

  • Improved accuracy (catboost gradient boosted decision trees candidates ranking model)
  • Splits merged words
  • Pre-trained models for many languages (small, medium, large) for:
    en, ru, de, fr, it, es, tr, uk, pl, nl, pt, hi, no
  • Ability to add words / sentences at runtime
  • Fine-tuning / additional training
  • Memory optimization for training large models
  • Static dictionary support
  • Built-in Java, C#, Ruby support
  • Windows support

Benchmarks

            Errors   Top 7 Errors   Fix Rate   Top 7 Fix Rate   Broken   Speed (words/second)
JamSpell    3.25%    1.27%          79.53%     84.10%           0.64%    4854
Norvig      7.62%    5.00%          46.58%     66.51%           0.69%    395
Hunspell    13.10%   10.33%         47.52%     68.56%           7.14%    163
Dummy       13.14%   13.14%         0.00%      0.00%            0.00%    -

The model was trained on 300K Wikipedia sentences + 300K news sentences (English). 95% of the data was used for training and 5% for evaluation. An error model was used to generate text with errors from the original text. The JamSpell corrector was compared with Norvig's corrector, Hunspell, and a dummy corrector (no corrections).
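The 95% / 5% split described above can be sketched as follows. This is only an illustration: the function name `split_train_eval`, the shuffling step and the fixed seed are assumptions, not the procedure the authors actually used.

```python
import random


def split_train_eval(sentences, eval_fraction=0.05, seed=0):
    """Shuffle sentences and hold out `eval_fraction` of them for evaluation.

    Hypothetical helper: shuffling with a fixed seed is an assumption made
    here so that the split is reproducible.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    # The first n_eval shuffled sentences are held out, the rest is for training
    return shuffled[n_eval:], shuffled[:n_eval]


sentences = ["sentence %d" % i for i in range(100)]
train, evaluation = split_train_eval(sentences)
print(len(train), len(evaluation))
```

Keeping the held-out sentences out of the training file is what makes the evaluation numbers meaningful.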

We used the following metrics:

  • Errors - percentage of words that still contain errors after the spell checker has processed the text
  • Top 7 Errors - percentage of words whose correct form is missing from the top 7 candidates
  • Fix Rate - percentage of errored words fixed by the spell checker
  • Top 7 Fix Rate - percentage of errored words for which one of the top 7 candidates is the correct fix
  • Broken - percentage of non-errored words broken by the spell checker
  • Speed - number of words processed per second
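As a rough illustration of the definitions above, the sketch below computes Errors, Fix Rate and Broken from three aligned word lists: the original text, the text with generated errors, and the spell checker output. The function name and the word-by-word alignment are assumptions; the project's own evaluate/evaluate.py script may compute these differently, and the Top 7 variants are left out because they need the candidate lists.

```python
def spelling_metrics(original, errored, fixed):
    """Return Errors, Fix Rate and Broken as percentages.

    All three arguments are word lists of equal length, aligned by position
    (an assumption of this sketch).
    """
    if not (len(original) == len(errored) == len(fixed)):
        raise ValueError("word lists must be aligned and of equal length")

    total = len(original)
    # Words that are still wrong after the spell checker ran
    errors_after = sum(1 for o, f in zip(original, fixed) if o != f)
    # Positions where the error model changed the word, and where it did not
    errored_idx = [i for i in range(total) if errored[i] != original[i]]
    clean_idx = [i for i in range(total) if errored[i] == original[i]]
    fixed_count = sum(1 for i in errored_idx if fixed[i] == original[i])
    broken_count = sum(1 for i in clean_idx if fixed[i] != original[i])

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "errors": pct(errors_after, total),
        "fix_rate": pct(fixed_count, len(errored_idx)),
        "broken": pct(broken_count, len(clean_idx)),
    }


original = ["i", "am", "the", "best", "spell", "checker"]
errored = ["i", "am", "the", "begt", "spell", "cherken"]
fixed = ["i", "an", "the", "best", "spell", "chicken"]
print(spelling_metrics(original, errored, fixed))
```

In this toy input one of the two errored words is fixed, and one of the four clean words is broken.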

To check that our model is not overfitted to the Wikipedia + news data, we also evaluated it on the text of "The Adventures of Sherlock Holmes":

            Errors   Top 7 Errors   Fix Rate   Top 7 Fix Rate   Broken   Speed (words/second)
JamSpell    3.56%    1.27%          72.03%     79.73%           0.50%    5524
Norvig      7.60%    5.30%          35.43%     56.06%           0.45%    647
Hunspell    9.36%    6.44%          39.61%     65.77%           2.95%    284
Dummy       11.16%   11.16%         0.00%      0.00%            0.00%    -

More details on reproducing these results are available in the "Train" section.

Usage

Python

  1. Install swig3 (it is usually available from your distro's package manager)

  2. Install jamspell:

pip install jamspell
  3. Download or train a language model

  4. Use it:

import jamspell

# Create a corrector and load a downloaded or self-trained language model
corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')

# Correct a whole text fragment
corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'

# List candidates for the word at index 3 ('begt')
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )

# List candidates for the word at index 5 ('cherken')
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)
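
Since GetCandidates takes the word list plus the index of the word to correct, candidates for every word can be collected in a loop. The sketch below is a hypothetical helper, not part of JamSpell: it only relies on a `GetCandidates(words, position)` method returning a sequence of strings, as the example above suggests. A small stub stands in for the real corrector so the code runs without a model file.

```python
def candidates_by_word(corrector, words, limit=3):
    """Map each word position to (word, first `limit` candidates)."""
    result = {}
    for position, word in enumerate(words):
        candidates = list(corrector.GetCandidates(words, position))
        result[position] = (word, candidates[:limit])
    return result


class StubCorrector:
    """Stand-in for jamspell.TSpellCorrector (hypothetical, for illustration).

    Returns canned candidates for two misspellings and echoes other words.
    """

    _TABLE = {
        "begt": ("best", "beat", "belt", "bet", "bent"),
        "cherken": ("checker", "chicken", "checked", "wherein", "coherent"),
    }

    def GetCandidates(self, words, position):
        word = words[position]
        return self._TABLE.get(word, (word,))


words = ["i", "am", "the", "begt", "spell", "cherken"]
for position, (word, candidates) in candidates_by_word(StubCorrector(), words).items():
    print(position, word, candidates)
```

With the real library, pass a `jamspell.TSpellCorrector` with a loaded model instead of the stub.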

C++

  1. Add jamspell and contrib dirs to your project

  2. Use it:

#include <jamspell/spell_corrector.hpp>

int main(int argc, const char** argv) {

    NJamSpell::TSpellCorrector corrector;
    corrector.LoadLangModel("model.bin");

    // Correct a whole text fragment
    corrector.FixFragment(L"I am the begt spell cherken!");
    // "I am the best spell checker!"

    // List candidates for the word at index 3 ("begt")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
    // "best", "beat", "belt", "bet", "bent", ...

    // List candidates for the word at index 5 ("cherken")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
    // "checker", "chicken", "checked", "wherein", "coherent", ...
    return 0;
}

Other languages

You can generate extensions for other languages by following the SWIG tutorial. The SWIG interface file is jamspell.i. Pull requests with build scripts are welcome.

HTTP API

  • Install cmake

  • Clone and build jamspell (it includes http server):

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make
./web_server/web_server en.bin localhost 8080
  • GET Request example:
$ curl "http://localhost:8080/fix?text=I am the begt spell cherken"
I am the best spell checker
  • POST Request example:
$ curl -d "I am the begt spell cherken" http://localhost:8080/fix
I am the best spell checker
  • Candidates request example:
curl "http://localhost:8080/candidates?text=I am the begt spell cherken"
# or
curl -d "I am the begt spell cherken" http://localhost:8080/candidates
{
    "results": [
        {
            "candidates": [
                "best",
                "beat",
                "belt",
                "bet",
                "bent",
                "beet",
                "beit"
            ],
            "len": 4,
            "pos_from": 9
        },
        {
            "candidates": [
                "checker",
                "chicken",
                "checked",
                "wherein",
                "coherent",
                "cheered",
                "cherokee"
            ],
            "len": 7,
            "pos_from": 20
        }
    ]
}

Here pos_from is the position of the first letter of the misspelled word, and len is the length of the misspelled word.
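
A client can use pos_from and len to patch the original text. The sketch below is a minimal illustration with hypothetical helper names. It assumes pos_from and len are character offsets into the submitted text (true for the ASCII example above) and that the server percent-decodes the query string; the response is abridged from the example above.

```python
from urllib.parse import quote, urlencode


def build_url(base_url, endpoint, text):
    """Build a GET URL with the text percent-encoded (spaces become %20)."""
    query = urlencode({"text": text}, quote_via=quote)
    return "%s/%s?%s" % (base_url.rstrip("/"), endpoint, query)


def apply_top_candidates(text, response):
    """Replace each misspelled span with its first candidate."""
    # Work from right to left so earlier offsets stay valid after a replacement
    results = sorted(response["results"], key=lambda r: r["pos_from"], reverse=True)
    for item in results:
        if not item["candidates"]:
            continue
        start = item["pos_from"]
        end = start + item["len"]
        text = text[:start] + item["candidates"][0] + text[end:]
    return text


# Abridged copy of the /candidates response shown above
response = {
    "results": [
        {"candidates": ["best", "beat", "belt"], "len": 4, "pos_from": 9},
        {"candidates": ["checker", "chicken", "checked"], "len": 7, "pos_from": 20},
    ]
}

text = "I am the begt spell cherken"
print(build_url("http://localhost:8080", "candidates", text))
print(apply_top_candidates(text, response))
```

Applying replacements from the end of the text backwards avoids having to shift the offsets when a candidate is longer or shorter than the word it replaces.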

Train

To train a custom model:

  1. Install cmake

  2. Clone and build jamspell:

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make
  3. Prepare a UTF-8 text file with sentences to train on (e.g. sherlockholmes.txt) and another file with the language alphabet (e.g. alphabet_en.txt)

  4. Train the model:

./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin
  5. To evaluate the spell checker, you can use the evaluate/evaluate.py script:
python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt
  6. You can use evaluate/generate_dataset.py to generate your train/test data. It supports txt files, the Leipzig Corpora Collection format and fb2 books.
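
If training is part of a larger pipeline, the train command above can be driven from Python. This is a small sketch with hypothetical helper names. It only wraps the `jamspell train <alphabet> <corpus> <model>` invocation shown in step 4 and assumes it is run from the build directory.

```python
import subprocess


def build_train_command(alphabet_path, corpus_path, model_path,
                        binary="./main/jamspell"):
    """Return the argument list for the train command shown in step 4."""
    return [binary, "train", alphabet_path, corpus_path, model_path]


def train_model(alphabet_path, corpus_path, model_path,
                binary="./main/jamspell"):
    """Run training and raise CalledProcessError if the command fails."""
    command = build_train_command(alphabet_path, corpus_path, model_path, binary)
    subprocess.run(command, check=True)


print(build_train_command(
    "../test_data/alphabet_en.txt",
    "../test_data/sherlockholmes.txt",
    "model_sherlock.bin",
))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.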

Download models

Here are a few simple models. They were trained on 300K news + 300K Wikipedia sentences. We strongly recommend training your own model on at least a few million sentences to achieve better quality. See the Train section above.
