R1j1t / contextualSpellCheck

Licence: MIT license
✔️Contextual word checker for better suggestions

spellCheck

Contextual word checker for better suggestions

Types of spelling mistakes

It is essential to understand that identifying whether a candidate is a spelling error is a big task.

Spelling errors are broadly classified as non-word errors (NWE) and real word errors (RWE). If the misspelt string is a valid word in the language, then it is called an RWE, else it is an NWE.

-- Monojit Choudhury et al. (2007)

This package currently focuses on correcting Out of Vocabulary (OOV) words, i.e. non-word errors (NWE), using a BERT model. BERT was chosen so that the surrounding context informs each OOV correction. Planned improvements include extending the package to identify RWE, optimising its performance, and improving the documentation.
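The NWE/RWE distinction quoted above can be made concrete with a small sketch. This is only an illustration of the definition, not how the package detects errors: the `classify` helper, the `ErrorType` enum and the toy `VOCAB` set are all hypothetical names introduced here, and the helper assumes the intended word is known, which a real spell checker has to infer.

```python
from enum import Enum
from typing import Set


class ErrorType(Enum):
    NONE = "no error"
    NWE = "non-word error"
    RWE = "real word error"


def classify(typed: str, intended: str, vocabulary: Set[str]) -> ErrorType:
    """Label a typed word, given the word the writer meant (hypothetical helper)."""
    if typed == intended:
        return ErrorType.NONE
    # A misspelt string that is still a valid word is a real word error;
    # otherwise it is a non-word error (out of vocabulary).
    if typed.lower() in vocabulary:
        return ErrorType.RWE
    return ErrorType.NWE


# Toy vocabulary, assumed purely for illustration.
VOCAB = {"income", "was", "million", "their", "there"}

print(classify("milion", "million", VOCAB).value)
print(classify("their", "there", VOCAB).value)
```

An NWE can be spotted with a vocabulary lookup alone, whereas an RWE passes that lookup and only the context reveals it, which is why RWE support is listed as a larger task.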

Install

The package can be installed using pip. It requires Python 3.6 or later.

pip install contextualSpellCheck

Usage

Note: For use with other languages, see the examples folder.

How to load the package in a spaCy pipeline

>>> import contextualSpellCheck
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm") 
>>> 
>>> ## NER is required to identify whether a token is a PERSON;
>>> ## the parser is also required because `Token.sent` is used for context
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
>>> contextualSpellCheck.add_to_pipe(nlp)
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker']
>>> 
>>> doc = nlp('Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>> doc._.outcome_spellCheck
'Income was $9.4 million compared to the prior year of $2.7 million.'

Alternatively, you can add the component to the spaCy pipeline manually:

>>> import spacy
>>> import contextualSpellCheck
>>> 
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
>>> # Optional parameters can be passed to contextualSpellCheck through `config`,
>>> # e.g. to set the maximum edit distance use config={"max_edit_dist": 3}
>>> nlp.add_pipe("contextual spellchecker")
<contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck object at 0x1049f82b0>
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker']
>>> 
>>> doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.
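To give a feel for what a maximum edit distance such as `max_edit_dist` restricts, here is a minimal sketch of Levenshtein distance and a candidate filter. It is an assumption-laden illustration: `edit_distance` and `within_edit_distance` are hypothetical helpers written for this example, and the package's own candidate ranking may be implemented differently.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def within_edit_distance(word: str, candidates: List[str], max_edit_dist: int) -> List[str]:
    """Keep only the candidates no further than max_edit_dist from the word."""
    return [c for c in candidates if edit_distance(word, c) <= max_edit_dist]


candidates = ["million", "billion", "trillion", "annually"]
print(within_edit_distance("milion", candidates, 2))  # ['million', 'billion']
```

A smaller limit keeps suggestions close to what was typed; a larger one lets the context model propose words that look less like the original.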

After adding the contextual spellchecker to the pipeline, use the pipeline as usual. The spell check suggestions and other data can be accessed through custom extensions.

Using the pipeline

>>> doc = nlp(u'Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>> 
>>> # Doc Extension
>>> print(doc._.contextual_spellCheck)
True
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.suggestions_spellCheck)
{milion: 'million', milion: 'million'}
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.
>>> print(doc._.score_spellCheck)
{milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], milion: [('billion', 0.65934), ('million', 0.26185), ('trillion', 0.05391), ('##M', 0.0051), ('Million', 0.00425), ('##B', 0.00268), ('USD', 0.00153), ('##b', 0.00077), ('millions', 0.00059), ('%', 0.00041)]}
>>> 
>>> # Token Extension
>>> print(doc[4]._.get_require_spellCheck)
True
>>> print(doc[4]._.get_suggestion_spellCheck)
'million'
>>> print(doc[4]._.score_spellCheck)
[('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)]
>>> 
>>> # Span Extension
>>> print(doc[2:6]._.get_has_spellCheck)
True
>>> print(doc[2:6]._.score_spellCheck)
{$: [], 9.4: [], milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], compared: []}
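The score lists above mix whole words with punctuation and word-piece fragments such as '##M'. If only whole-word candidates are of interest, a post-processing step along these lines could be applied. This is a sketch: `whole_word_candidates` is a hypothetical helper rather than part of the package, and the list is copied from the output shown above.

```python
from typing import List, Tuple

Scored = List[Tuple[str, float]]

# Score list for the first "milion", copied from the output above.
scores: Scored = [
    ("million", 0.59422), ("billion", 0.24349), (",", 0.08809),
    ("trillion", 0.01835), ("Million", 0.00826), ("%", 0.00672),
    ("##M", 0.00591), ("annually", 0.0038), ("##B", 0.00205), ("USD", 0.00113),
]


def whole_word_candidates(scored: Scored, min_prob: float = 0.0) -> Scored:
    """Drop punctuation and '##' word-piece fragments, then apply a probability floor.

    str.isalpha() is a deliberately crude test: it would also drop
    hyphenated words or words containing digits.
    """
    return [(word, prob) for word, prob in scored if word.isalpha() and prob >= min_prob]


print(whole_word_candidates(scores, min_prob=0.01))
```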

Extensions

To make usage easy, contextual spellchecker provides custom spaCy extensions which your code can consume to get the desired data. contextualSpellCheck provides extensions at the doc, span and token level. The tables below summarise the extensions.

spaCy.Doc level extensions

| Extension | Type | Description | Default |
|---|---|---|---|
| doc._.contextual_spellCheck | Boolean | Whether contextualSpellCheck is added as an extension | True |
| doc._.performed_spellCheck | Boolean | Whether contextualSpellCheck identified any misspellings and performed correction | False |
| doc._.suggestions_spellCheck | {spaCy.Token: str} | If corrections are performed, the mapping of each misspelt token (spaCy.Token) to its suggested word (str) | {} |
| doc._.outcome_spellCheck | str | The corrected sentence (str) as output | "" |
| doc._.score_spellCheck | {spaCy.Token: List(str, float)} | If corrections are identified, the mapping of each misspelt token (spaCy.Token) to its suggested words (str) and the probability of each correction | None |

spaCy.Span level extensions

| Extension | Type | Description | Default |
|---|---|---|---|
| span._.get_has_spellCheck | Boolean | Whether contextualSpellCheck identified any misspellings and performed correction in this span | False |
| span._.score_spellCheck | {spaCy.Token: List(str, float)} | If corrections are identified, the mapping of each misspelt token (spaCy.Token) to its suggested words (str) and the probability of each correction, for tokens in this span | {spaCy.Token: []} |

spaCy.Token level extensions

| Extension | Type | Description | Default |
|---|---|---|---|
| token._.get_require_spellCheck | Boolean | Whether contextualSpellCheck identified a misspelling and performed correction on this token | False |
| token._.get_suggestion_spellCheck | str | If corrections are performed, the suggested word (str) | "" |
| token._.score_spellCheck | [(str, float)] | If corrections are identified, the suggested words (str) and the probability (float) of each correction | [] |
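As one way of consuming the token-level extensions, the sketch below gathers every flagged token together with its suggestion. The `collect_corrections` function is a hypothetical helper, and the stub tokens built with `SimpleNamespace` only stand in for a processed spaCy Doc so that the snippet runs without downloading a model; with the real pipeline you would pass the `doc` returned by `nlp(...)` instead.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def collect_corrections(doc: Iterable) -> List[Tuple[int, str, str]]:
    """Return (position, original text, suggestion) for each token flagged for correction."""
    corrections = []
    for index, token in enumerate(doc):
        if token._.get_require_spellCheck:
            corrections.append((index, token.text, token._.get_suggestion_spellCheck))
    return corrections


def stub_token(text: str, suggestion: str = "") -> SimpleNamespace:
    """Stand-in for a spaCy Token exposing the extensions listed above (illustration only)."""
    extensions = SimpleNamespace(
        get_require_spellCheck=bool(suggestion),
        get_suggestion_spellCheck=suggestion,
        score_spellCheck=[],
    )
    return SimpleNamespace(text=text, _=extensions)


# Mirrors the first tokens of the example sentence; the token at index 4 is the misspelling.
stub_doc = [
    stub_token("Income"), stub_token("was"), stub_token("$"),
    stub_token("9.4"), stub_token("milion", "million"), stub_token("compared"),
]

print(collect_corrections(stub_doc))  # [(4, 'milion', 'million')]
```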

API

At present, there is a simple GET API to get you started. You can run the app locally and experiment with it.

Query: Use the endpoint http://127.0.0.1:5000/?query=YOUR-QUERY
Note: Your browser handles the URL encoding of the query text.
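When calling the endpoint from a script rather than a browser, the query text has to be percent-encoded by the caller. A minimal sketch using only the standard library follows; `build_query_url` is a hypothetical helper, and the base URL assumes the app is running locally on port 5000 as described above.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE_URL = "http://127.0.0.1:5000/"  # assumes the app is running locally


def build_query_url(text: str, base_url: str = BASE_URL) -> str:
    """Build the GET URL, encoding spaces as %20 instead of '+'."""
    return base_url + "?" + urlencode({"query": text}, quote_via=quote)


sentence = "Income was $9.4 milion compared to the prior year of $2.7 milion."
url = build_query_url(sentence)
print(url)

# Decoding the query string recovers the original sentence.
print(parse_qs(urlsplit(url).query)["query"][0] == sentence)
```

Here `$` is also percent-encoded (as `%24`), which decodes to the same character on the server side.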

GET Request: http://localhost:5000/?query=Income%20was%20$9.4%20milion%20compared%20to%20the%20prior%20year%20of%20$2.7%20milion.

Response:

{
    "success": true,
    "input": "Income was $9.4 milion compared to the prior year of $2.7 milion.",
    "corrected": "Income was $9.4 million compared to the prior year of $2.7 million.",
    "suggestion_score": {
        "milion": [
            [
                "million",
                0.59422
            ],
            [
                "billion",
                0.24349
            ],
            ...
        ],
        "milion:1": [
            [
                "billion",
                0.65934
            ],
            [
                "million",
                0.26185
            ],
            ...
        ]
    }
}
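A client could read the `suggestion_score` field of such a response as sketched below. `SAMPLE_RESPONSE` is a trimmed copy of the response above that keeps only the candidates actually shown, and `top_suggestions` is a hypothetical helper that picks the highest-probability candidate for each key.

```python
import json
from typing import Dict

# Trimmed copy of the response above; elided candidates are simply left out.
SAMPLE_RESPONSE = """
{
    "success": true,
    "input": "Income was $9.4 milion compared to the prior year of $2.7 milion.",
    "suggestion_score": {
        "milion": [["million", 0.59422], ["billion", 0.24349]],
        "milion:1": [["billion", 0.65934], ["million", 0.26185]]
    }
}
"""


def top_suggestions(payload: dict) -> Dict[str, str]:
    """Map each misspelt key to its highest-probability candidate."""
    return {
        key: max(candidates, key=lambda pair: pair[1])[0]
        for key, candidates in payload["suggestion_score"].items()
        if candidates
    }


payload = json.loads(SAMPLE_RESPONSE)
print(top_suggestions(payload))  # {'milion': 'million', 'milion:1': 'billion'}
```

Note that the highest-probability candidate is not necessarily the word the package ends up suggesting: for the second "milion" the top-scored candidate is "billion", while the suggestion shown in the usage section is "million". Treat the scores as raw model output rather than the final ranking.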

Task List

  • Use Cython for part of the code to improve performance (#39)
  • Improve metric for candidate selection (#40)
  • Add examples for other languages (#41)
  • Update the logic of misspell identification (OOV) (#44)
  • Better candidate generation (solved by #44?)
  • Add metric by testing on datasets
  • Improve documentation
  • Improve logging in code
  • Add support for Real Word Error (RWE) (Big Task)
  • Add multi mask out capability

Completed Task

  • Specify maximum edit distance for candidateRanking
  • Allow user to specify BERT model
  • Include transformers deTokenizer to get better suggestions
  • Dependency version in setup.py (#38)

Support and contribution

If you like the project, please ⭑ it and show your support! Also, if you feel the current behaviour is not as expected, please feel free to raise an issue. If you can help with any of the above tasks, please open a PR with the necessary changes to documentation and tests.

Cite

If you use contextualSpellCheck in your academic work, please consider citing the library using the BibTeX entry below:

@misc{Goel_Contextual_Spell_Check_2021,
author = {Goel, Rajat},
doi = {10.5281/zenodo.4642379},
month = {3},
title = {{Contextual Spell Check}},
url = {https://github.com/R1j1t/contextualSpellCheck},
year = {2021}
}

Reference

Below are some of the projects/work I referred to while developing this package

  1. Explosion AI. Architecture. May 2020. url: https://spacy.io/api.
  2. Monojit Choudhury et al. “How difficult is it to develop a perfect spell-checker? A cross-linguistic analysis through complex network approach”. In: arXiv preprint physics/0703198 (2007).
  3. Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. arXiv: 1810.04805 [cs.CL].
  4. Hugging Face. Fast Coreference Resolution in spaCy with Neural Networks. May 2020. url: https://github.com/huggingface/neuralcoref.
  5. Ines. Chapter 3: Processing Pipelines. May 2020. url: https://course.spacy.io/en/chapter3.
  6. Eric Mays, Fred J Damerau, and Robert L Mercer. “Context based spelling correction”. In: Information Processing & Management 27.5 (1991), pp. 517–522.
  7. Peter Norvig. How to Write a Spelling Corrector. May 2020. url: http://norvig.com/spell-correct.html.
  8. Yifu Sun and Haoming Jiang. Contextual Text Denoising with Masked Language Models. 2019. arXiv: 1910.14080 [cs.CL].
  9. Thomas Wolf et al. “Transformers: State-of-the-Art Natural Language Processing”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, Oct. 2020, pp. 38–45. url: https://www.aclweb.org/anthology/2020.emnlp-demos.6.