bbalet / Stopwords

Licence: other
Removes the most frequent words (stop words) from text content. Based on a curated list of language statistics.

Projects that are alternatives of or similar to Stopwords

Stringmetric
🎯 String metrics and phonetic algorithms for Scala (e.g. Dice/Sorensen, Hamming, Jaccard, Jaro, Jaro-Winkler, Levenshtein, Metaphone, N-Gram, NYSIIS, Overlap, Ratcliff/Obershelp, Refined NYSIIS, Refined Soundex, Soundex, Weighted Levenshtein).
Stars: ✭ 481 (+479.52%)
Mutual labels:  distance, levenshtein
simetric
String similarity metrics for Elixir
Stars: ✭ 59 (-28.92%)
Mutual labels:  distance, levenshtein
Textdistance
Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.
Stars: ✭ 2,575 (+3002.41%)
Mutual labels:  distance, levenshtein
stringosim
String similarity functions, String distance's, Jaccard, Levenshtein, Hamming, Jaro-Winkler, Q-grams, N-grams, LCS - Longest Common Subsequence, Cosine similarity...
Stars: ✭ 47 (-43.37%)
Mutual labels:  distance, levenshtein
levenshtein-edit-distance
Levenshtein edit distance
Stars: ✭ 59 (-28.92%)
Mutual labels:  distance, levenshtein
similar-english-words
Give me a word and I’ll give you an array of words that differ by a single letter.
Stars: ✭ 25 (-69.88%)
Mutual labels:  distance, levenshtein
Whereami
Uses WiFi signals 📶 and machine learning to predict where you are
Stars: ✭ 4,878 (+5777.11%)
Mutual labels:  distance
Node Damerau Levenshtein
Damerau - Levenstein distance function for node
Stars: ✭ 27 (-67.47%)
Mutual labels:  levenshtein
Pyemd
Fast EMD for Python: a wrapper for Pele and Werman's C++ implementation of the Earth Mover's Distance metric
Stars: ✭ 361 (+334.94%)
Mutual labels:  distance
Geolib
Zero dependency library to provide some basic geo functions
Stars: ✭ 3,675 (+4327.71%)
Mutual labels:  distance
Str metrics
Ruby gem (native extension in Rust) providing implementations of various string metrics
Stars: ✭ 68 (-18.07%)
Mutual labels:  levenshtein
Levenshtein
Levenshtein distance and similarity metrics with customizable edit costs and Winkler-like bonus for common prefix.
Stars: ✭ 57 (-31.33%)
Mutual labels:  levenshtein
Distance.js
🚗 Small Library for calculating distances between points.
Stars: ✭ 10 (-87.95%)
Mutual labels:  distance
Nlp xiaojiang
自然语言处理(nlp),小姜机器人(闲聊检索式chatbot),BERT句向量-相似度(Sentence Similarity),XLNET句向量-相似度(text xlnet embedding),文本分类(Text classification), 实体提取(ner,bert+bilstm+crf),数据增强(text augment, data enhance),同义句同义词生成,句子主干提取(mainpart),中文汉语短文本相似度,文本特征工程,keras-http-service调用
Stars: ✭ 954 (+1049.4%)
Mutual labels:  distance
Symspellpy
Python port of SymSpell
Stars: ✭ 420 (+406.02%)
Mutual labels:  levenshtein
Symspellcompound
SymSpellCompound: compound aware automatic spelling correction
Stars: ✭ 61 (-26.51%)
Mutual labels:  levenshtein
Closestmatch
Golang library for fuzzy matching within a set of strings 📃
Stars: ✭ 353 (+325.3%)
Mutual labels:  levenshtein
Pas Coogeo
Pas-CooGeo is coordinate geometry library for Pascal.
Stars: ✭ 25 (-69.88%)
Mutual labels:  distance
Deepdiff
Deep Difference and search of any Python object/data.
Stars: ✭ 985 (+1086.75%)
Mutual labels:  distance
Rapidfuzz
Rapid fuzzy string matching in Python using the Levenshtein Distance
Stars: ✭ 809 (+874.7%)
Mutual labels:  levenshtein

stopwords is a Go package that removes stop words from text content. If instructed to do so, it also removes HTML tags and parses HTML entities. The objective is to prepare text for use by natural language processing algorithms or text comparison algorithms such as SimHash.

It uses a curated list of the most frequent words used in these languages:

  • Arabic
  • Bulgarian
  • Czech
  • Danish
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Italian
  • Japanese
  • Khmer
  • Latvian
  • Norwegian
  • Persian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Slovak
  • Spanish
  • Swedish
  • Thai
  • Turkish

If a function is called with an unsupported language, it does not fail; it applies the English filter to the content instead.
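The fallback behaviour can be pictured with a small standalone sketch. It does not use the package itself: the stopLists map, its handful of words, and the removeStopWords helper are all hypothetical stand-ins for the curated lists, and splitting on whitespace is a simplification.

```go
package main

import (
	"fmt"
	"strings"
)

// stopLists is a tiny, hypothetical stand-in for the package's curated lists.
var stopLists = map[string]map[string]bool{
	"en": {"the": true, "a": true, "of": true, "and": true},
	"fr": {"la": true, "un": true, "le": true, "et": true},
}

// removeStopWords drops the words found in the list for langCode.
// Unknown language codes fall back to the English list instead of failing.
func removeStopWords(text, langCode string) string {
	list, ok := stopLists[langCode]
	if !ok {
		list = stopLists["en"]
	}
	var kept []string
	for _, word := range strings.Fields(strings.ToLower(text)) {
		if !list[word] {
			kept = append(kept, word)
		}
	}
	return strings.Join(kept, " ")
}

func main() {
	fmt.Println(removeStopWords("la fin et le jour", "fr"))
	// "xx" is not a known code here, so the English list is applied.
	fmt.Println(removeStopWords("the end of a day", "xx"))
}
```

Falling back rather than returning an error keeps the call sites simple, at the cost of silently filtering with the wrong list when a language code is mistyped.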

How to use this package?

You can find an example at https://github.com/bbalet/gorelated, where the stopwords package is used in conjunction with the SimHash algorithm to find a list of related content for a static website generator:

package main

import (
	"fmt"

	"github.com/bbalet/stopwords"
)

func main() {
	// Example with 2 strings containing <p> HTML tags.
	// "la", "un", etc. are stop words without lexical value in French.
	string1 := []byte("<p>la fin d'un bel après-midi d'été</p>")
	string2 := []byte("<p>cet été, nous avons eu un bel après-midi</p>")

	// Return a string from which HTML tags and French stop words have been removed.
	// CleanString takes a string, so the byte slice is converted first.
	cleanContent := stopwords.CleanString(string(string1), "fr", true)
	fmt.Println(cleanContent)

	// Get two SimHash values representing the content of each string.
	hash1 := stopwords.Simhash(string1, "fr", true)
	hash2 := stopwords.Simhash(string2, "fr", true)

	// Hamming distance between the two hashes (difference between contents).
	distance := stopwords.CompareSimhash(hash1, hash2)
	fmt.Println(distance)

	// Clean the content of string1 and string2, then compute the Levenshtein distance.
	fmt.Println(stopwords.LevenshteinDistance(string1, string2, "fr", true))
}

Where fr is the ISO 639-1 code for French (it accepts a BCP 47 tag as well). https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
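For readers unfamiliar with SimHash, here is a minimal standalone sketch of the idea behind comparing two fingerprints by Hamming distance. It is not the package's implementation: the FNV-1a token hash, the whitespace tokenization, and the names simhash and hammingDistance are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// simhash builds a 64-bit fingerprint from whitespace-separated tokens.
// Each token votes +1 or -1 on every bit position; the sign of the
// total decides the final bit. FNV-1a is an assumed choice of hash.
func simhash(text string) uint64 {
	var votes [64]int
	for _, token := range strings.Fields(text) {
		h := fnv.New64a()
		h.Write([]byte(token))
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}
	var fingerprint uint64
	for i, v := range votes {
		if v > 0 {
			fingerprint |= 1 << uint(i)
		}
	}
	return fingerprint
}

// hammingDistance counts the bit positions where the two fingerprints differ.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	a := simhash("fin bel après-midi été")
	b := simhash("été bel après-midi")
	fmt.Println(hammingDistance(a, b))
}
```

Texts that share most of their tokens tend to produce fingerprints that differ in few bits, which is why removing stop words first helps: frequent words with no lexical value would otherwise add the same votes to every document.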

How to load a custom list of stop words from a file/string?

This package comes with a predefined list of stopwords. However, two functions allow you to use your own list of words:

stopwords.LoadStopWordsFromFile(filePath, langCode, separator)
stopwords.LoadStopWordsFromString(wordsList, langCode, separator)

They overwrite the predefined words for the given language. An example list is provided in the file stopwords.txt.
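As a rough picture of what loading a list from a string involves, the standalone sketch below splits the input on the separator and replaces the list registered for the language code. The customLists map and the loadFromString helper are hypothetical, and trimming and lower-casing each word are choices made for this sketch, not documented behaviour of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// customLists maps a language code to its set of stop words (hypothetical).
var customLists = map[string]map[string]bool{}

// loadFromString splits wordsList on separator and replaces whatever list
// was registered for langCode. Empty entries are skipped.
func loadFromString(wordsList, langCode, separator string) {
	set := make(map[string]bool)
	for _, word := range strings.Split(wordsList, separator) {
		word = strings.ToLower(strings.TrimSpace(word))
		if word != "" {
			set[word] = true
		}
	}
	customLists[langCode] = set
}

func main() {
	loadFromString("foo, bar ,BAZ,,", "en", ",")
	fmt.Println(len(customLists["en"]), customLists["en"]["baz"])
}
```

Because the new set replaces the old one rather than merging with it, a custom list has to contain every word that should be filtered for that language.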

How to overwrite the word segmenter?

If you don't want to strip Unicode characters of the 'Number, Decimal Digit' category, call the function DontStripDigits before using the package:

stopwords.DontStripDigits()

If you want to use your own segmenter, you can overwrite the regular expression:

stopwords.OverwriteWordSegmenter(`[\pL]+`)
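To see what a segmenter expression such as `[\pL]+` selects, the standalone sketch below runs it through Go's regexp package. The second pattern, which also keeps decimal digits, is an illustration written for this example and is not the package's own expression; the segment helper is likewise hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// lettersOnly matches runs of Unicode letters, so digits and punctuation
// act as separators and are dropped.
var lettersOnly = regexp.MustCompile(`[\pL]+`)

// lettersAndDigits also keeps Unicode decimal digits (category Nd).
// This pattern is an illustration, not the package's own expression.
var lettersAndDigits = regexp.MustCompile(`[\pL\p{Nd}]+`)

// segment returns every match of the segmenter in text.
func segment(segmenter *regexp.Regexp, text string) []string {
	return segmenter.FindAllString(text, -1)
}

func main() {
	text := "route 66 in 1926"
	fmt.Println(segment(lettersOnly, text))
	fmt.Println(segment(lettersAndDigits, text))
}
```

Note that a letters-only expression also splits on hyphens and apostrophes, so a word like "après-midi" becomes two tokens.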

Limitations

Please note that this library doesn't break words. If you want to break words prior to using stopwords, you need to use another library that provides a binding to the ICU library.

These curated lists contain the most used words across various topics; they were not built from a corpus limited to any given specialized topic.

Credits

Most of the lists were built by IR Multilingual Resources at UniNE http://members.unine.ch/jacques.savoy/clef/index.html

License

stopwords is released under the BSD license.