wolfgarbe / WordSegmentationDP

Licence: MIT license
Word Segmentation with Dynamic Programming

Word Segmentation using a Dynamic Programming approach.

For faster word segmentation using a Triangular Matrix approach, have a look at WordSegmentationTM.

For word segmentation with spelling correction, use WordSegmentation and LookupCompound of the SymSpell library.

Examples

- thequickbrownfoxjumpsoverthelazydog
+ the quick brown fox jumps over the lazy dog

- itwasabrightcolddayinaprilandtheclockswerestrikingthirteen
+ it was a bright cold day in april and the clocks were striking thirteen

- itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness
+ it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness 
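
The dynamic programming idea behind examples like these can be illustrated with a short sketch. This is not the library's C# implementation: it is a minimal Python illustration that assumes a unigram word-frequency model and a length-based penalty for unknown substrings. The names `segment`, `frequencies` and `max_word_length` are hypothetical and chosen only for this example.

```python
import math


def segment(text, frequencies, max_word_length=20):
    """Split `text` into the most probable word sequence (unigram model)."""
    total = sum(frequencies.values())

    def log_prob(word):
        count = frequencies.get(word)
        if count:
            return math.log10(count / total)
        # Assumption: unknown substrings get a penalty that grows with
        # their length, so long unknown chunks lose against known words.
        return math.log10(10.0 / (total * 10.0 ** len(word)))

    n = len(text)
    # best[i] = (score, start) for the best segmentation of text[:i];
    # `start` is the index where the last word of that segmentation begins.
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_length), end):
            score = best[start][0] + log_prob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)

    # Follow the stored start indices backwards to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return " ".join(reversed(words))


# Toy frequencies, invented for this example only.
toy_frequencies = {"the": 1000, "he": 200, "own": 60, "quick": 50,
                   "brown": 40, "fox": 30, "row": 10}
print(segment("thequickbrownfox", toy_frequencies))  # the quick brown fox
```

Each prefix of the input is solved once and reused, so the work grows with the input length times `max_word_length` rather than with the number of possible splits.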

Applications

  • Word Segmentation for CJK languages for Indexing, Spelling correction, Machine translation, Language understanding, Sentiment analysis
  • Normalizing English compound nouns for search & indexing (e.g. ice box = ice-box = icebox; pig sty = pig-sty = pigsty)
  • Word segmentation for compounds if both the original word and the split word parts should be indexed.
  • Correction of missing spaces caused by Typing errors.
  • Correction of Conversion errors: spaces between words may get lost, e.g. when removing line breaks.
  • Correction of OCR errors: poor quality of the original documents or handwritten text may prevent some spaces from being recognized.
  • Correction of Transmission errors: during transmission over noisy channels, spaces can get lost or spelling errors can be introduced.
  • Keyword extraction from URL addresses, domain names, table column descriptions or programming variables written without spaces.
  • For password analysis, the extraction of terms from passwords can be required.
  • For Speech recognition, if spaces between words are not properly recognized in spoken language.
  • Automatic CamelCasing of programming variables.
  • Applications beyond Natural Language processing, e.g. segmenting a DNA sequence into words.
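
As one illustration of the applications above, automatic CamelCasing can be layered on top of any segmenter's output. The sketch below is in Python and assumes the segmenter returns a space-separated string; `to_camel_case` is a hypothetical helper, not part of the library.

```python
def to_camel_case(segmented):
    """Turn a space-separated segmentation result into a camelCase name."""
    words = segmented.split()
    if not words:
        return ""
    # First word stays lower case; each following word is capitalized.
    return words[0].lower() + "".join(w.capitalize() for w in words[1:])


print(to_camel_case("max segmentation word length"))  # maxSegmentationWordLength
```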

Performance

8 milliseconds for segmenting a 185-character string into 53 words (single core on a 2012 MacBook Pro).

Blog Posts: Algorithm, Benchmarks, Applications

Fast Word Segmentation for noisy text
Sub-millisecond compound aware automatic spelling correction
SymSpell vs. BK-tree: 100x faster fuzzy string search & spell checking

Usage of WordSegmentationDP Library


How to use WordSegmentationDP in your project:

WordSegmentationDP targets .NET Standard v2.0 and can be used in:

  1. .NET Framework (Windows Forms, WPF, ASP.NET),
  2. .NET Core (UWP, ASP.NET Core, Windows, OS X, Linux),
  3. Xamarin (iOS, OS X, Android) projects.

The SymSpell, Demo, DemoCompound and Benchmark projects can be built with the free Visual Studio Code, which runs on Windows, macOS and Linux.


Frequency dictionary

Dictionary quality is paramount for word segmentation quality. To achieve this, two data sources were combined by intersection: Google Books Ngram data, which provides representative word frequencies (but contains many entries with spelling errors), and SCOWL (Spell Checker Oriented Word Lists), which ensures genuine English vocabulary (but contains no word frequencies, which are required for ranking of suggestions within the same edit distance).

The frequency_dictionary_en_82_765.txt was created by intersecting the two lists mentioned above. Through this reciprocal filtering, only words that appear in both lists are used. Additional filters were applied, and the resulting list was truncated to the ≈ 80,000 most frequent words.
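
The intersection step can be sketched in a few lines of Python. This is an illustration only, not the script used to build the shipped dictionary: `intersect_frequency_list`, its parameters and the sample data are hypothetical, and the unspecified "additional filters" are not modelled.

```python
def intersect_frequency_list(ngram_counts, valid_words, max_entries=80_000):
    """Keep counts only for words found in both sources, most frequent first."""
    # Words with a frequency but missing from the vetted word list are dropped
    # (e.g. misspellings); vetted words without a frequency never enter at all.
    kept = {word: count for word, count in ngram_counts.items()
            if word in valid_words}
    # Sort by descending count; ties are broken alphabetically.
    ranked = sorted(kept.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:max_entries]


# Invented sample data for illustration.
counts = {"the": 100, "teh": 5, "fox": 20}
vetted = {"the", "fox", "zebra"}
print(intersect_frequency_list(counts, vetted))  # [('the', 100), ('fox', 20)]
```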

Dictionary file format

  • Plain text file in UTF-8 encoding.
  • Word and word frequency are separated by a space or tab. By default, the word is expected in the first column and the frequency in the second column. With the termIndex and countIndex parameters of LoadDictionary(), the position and order of the values can be changed, and they can be selected from a row with more than two values. This makes it possible to augment the dictionary with additional information or to adapt to existing dictionaries without reformatting.
  • Every word-frequency pair is on a separate line. A line is defined as a sequence of characters followed by a line feed ("\n"), a carriage return ("\r"), or a carriage return immediately followed by a line feed ("\r\n").
  • Both dictionary terms and input term are expected to be in lower case.
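
To make the format concrete, here is a minimal Python sketch of a parser for it. It is not the library's LoadDictionary() implementation: `parse_dictionary` is a hypothetical function whose `term_index` and `count_index` parameters only mirror the roles of termIndex and countIndex described above, and skipping malformed rows is an assumption of this sketch.

```python
def parse_dictionary(text, term_index=0, count_index=1):
    """Parse 'word frequency' rows into a dict, skipping malformed rows."""
    frequencies = {}
    # splitlines() handles "\n", "\r" and "\r\n" (and a few rarer separators).
    for line in text.splitlines():
        fields = line.split()  # splits on runs of spaces or tabs
        if len(fields) <= max(term_index, count_index):
            continue  # not enough columns in this row
        try:
            count = int(fields[count_index])
        except ValueError:
            continue  # frequency column is not an integer
        frequencies[fields[term_index]] = count
    return frequencies


sample = "the 100\nfox\t20\r\nbad line\rdog 7"
print(parse_dictionary(sample))  # {'the': 100, 'fox': 20, 'dog': 7}
```

Passing other indexes selects the columns from wider rows, for example `parse_dictionary("1 the 100", term_index=1, count_index=2)`.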

You can build your own frequency dictionary for your language or your specialized technical domain. Languages with non-Latin characters are supported, e.g. Cyrillic, Chinese or Georgian.


Changes


WordSegmentationDP is contributed by SeekStorm - the high performance Search as a Service & search API