igorbrigadir / Stopwords

Default English stopword lists from many different sources

Default English Stop Words from Different Sources:

Stopword filtering is a common step in preprocessing text for various purposes. This is a list of several different stopword lists extracted from various search engines, libraries, and articles. There's a surprising number of different lists.

At the moment it's just English stopwords.
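As a rough illustration of the filtering step described above, here is a minimal sketch in Python. The tiny `STOPWORDS` set, the `remove_stopwords` helper, and the regex tokenizer are all assumptions made for this example; they are not part of this repository, and a real pipeline would load one of the lists in the table below.

```python
import re

# Hypothetical miniature stopword set, for illustration only.
# A real run would load one of the lists from this repository.
STOPWORDS = {"a", "an", "the", "of", "in", "is", "and"}


def remove_stopwords(text, stopwords):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    # Simple tokenizer: runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]


print(remove_stopwords("The size of a list is a choice", STOPWORDS))
# prints ['size', 'list', 'choice']
```

Which words survive depends entirely on the list chosen, which is why the differences between the lists below matter.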

File	Size (words)	Source	Description
None	0	⇱	No stop word removal.
Sphinx	0	⇱	Sphinx is an open source search server. The top Google search result for "sphinx stopwords" also leads to two manually compiled lists, http://astellar.com/2011/12/stopwords-for-sphinx-search/, which are based on the blog author's posts.
EBSCOhost	24	⇱	The stop words used in EBSCOhost medical databases MEDLINE and CINAHL
CoreNLP (Hardcoded)	28	⇱	Hardcoded in src/edu/stanford/nlp/coref/data/WordLists.java and the same in src/edu/stanford/nlp/dcoref/Dictionaries.java
Ranks NL (Google)	32	⇱	The short stopwords list below is based on what we believed to be Google stopwords a decade ago, based on words that were ignored if you would search for them in combination with another word. (ie. as in the phrase "a keyword").
Lucene, Solr, Elasticsearch	33	⇱	(NOTE: Some config files have extra 's' and 't' as stopwords.) An unmodifiable set containing some common English words that are not usually useful for searching.
MySQL (InnoDB)	36	⇱	A word that is used by default as a stopword for FULLTEXT indexes on InnoDB tables. Not used if you override the default stopword processing with either the innodb_ft_server_stopword_table or the innodb_ft_user_stopword_table option.
Ovid (Medical information services)	39	⇱	Words of little intrinsic meaning that occur too frequently to be useful in searching text are known as "stopwords." You cannot search for the following stopwords by themselves, but you can include them within phrases.
Bow (libbow, rainbow, arrow, crossbow)	48	⇱	Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering. Short list hardcoded. Also includes 524 SMART derived list, same as MALLET. See http://www.cs.cmu.edu/~mccallum/bow/rainbow/
LingPipe	76	⇱	An EnglishStopTokenizerFactory applies an English stop list to a contained base tokenizer factory
Vowpal Wabbit (doc2lda)	83	⇱	Stopwords used in LDA example
Text Analytics 101	85	⇱	Minimal list compiled by Kavita Ganesan consisting of determiners, coordinating conjunctions and prepositions http://text-analytics101.rxnlp.com/2014/10/all-about-stop-words-for-text-mining.html
LexisNexis®	100	⇱	“The following are 'noise words' and are never searchable: EVER HARDLY HENCE INTO NOR WERE VIZ. Others are 'noisy keywords' and are searchable by enclosing them in quotes.”
Okapi (gsl.cacm)	108	⇱	Cacm specific stoplist from Okapi
TextFixer	119	⇱	From textfixer.com Linked from Wiki page on Stop words.
DKPro	127	⇱	Postgresql (Snowball derived)
Postgres	127	⇱	“Stop words are words that are very common, appear in almost every document, and have no discrimination value.”
PubMed Help	133	⇱	Listed in PubMed Help pages.
CoreNLP (Acronym)	150	⇱	A set of words that should be considered stopwords for the acronym matcher
NLTK	153	⇱	According to an email: Van Rijsbergen (1979) "Information Retrieval" (Butterworths, London). It's slightly expanded from the Postgres postgresql.txt, which was presumably borrowed from Snowball.
Spark ML lib	153	⇱	(Note: Same as NLTK) “They were obtained from postgres. The English list has been augmented.”
MongoDB	174	⇱	Commit says 'Changed stop words files to the snowball stop lists'
Quanteda	174	⇱	Has SMART and Snowball Default Lists. Source
Ranks NL (Default)	174	⇱	(Note: Same as Default Snowball Stoplist, but RanksNL frequently cited as source) “This list is used in [Ranks NL] Page Analyzer and Article Analyzer for English text, when you let it use the default stopwords list.”
Snowball (Original)	174	⇱	Default Snowball Stoplist.
Xapian	174	⇱	(Note: uses Snowball Stopwords) “It has been traditional in setting up IR systems to discard the very commonest words of a language - the stopwords - during indexing.”
R `tm`	174	⇱	R `tm` package uses snowball list and also has SMART.
99webTools	183	⇱	“Stop Words are words which do not contain important significance to be used in Search Queries. Most search engine filters these words from search query before performing search, this improves performance.”
Deeplearning4J	194	⇱	DL4J stopwords are in 2 places: stopwords and stopwords.txt. Probably derived from Snowball. Some unusual entries, e.g. `----s`.
Reuters Web of Science™	211	⇱	“Stopwords are common, frequently used words such as articles (a, an, the), prepositions (of, in, for, through), and pronouns (it, their, his) that cannot be searched as individual words in the Topic and Title fields. If you include a stopword in a phrase, the stopword is interpreted as a word placeholder.”
Function Words (Cook 1988)	221	⇱	“This list of 225 items was compiled for practical purposes some time ago as data for a computer parser for student English.” Paper
Okapi (gsl.sample)	222	⇱	This Okapi is the BM25 Okapi. (Note: Included stopword text file is from all “F” “H” terms, as defined by defs.h) The GSL file contains terms that are to be dealt with in a special way by the indexing process. Each type is defined by a class code.
Snowball (Expanded)	227	⇱	NOTE: This Includes the extra words mentioned in comments “An English stop word list. Many of the forms below are quite rare (e.g. 'yourselves') but included for completeness.”
DataScienceDojo	250	⇱	Used in a real-time sentiment AzureML demo for a meetup
CoreNLP (stopwords.txt)	257	⇱	Note: "a", "an", "the", "and", "or", "but", "nor" hardcoded in StopList.java also includes punctuation (!!, -lrb- …)
OkapiFramework	262	⇱	This is NOT the Okapi of BM25 (at least I don't think so). This list is used in the Okapi Framework, which is the localization and translation Okapi.
Azure Gallery	310	⇱	Slightly modified Glasgow list.
ATIRE (NCBI Medline)	313	⇱	NCBI wrd_stop stop word list of 313 terms extracted from Medline. Its use is unrestricted. The list can be downloaded from here
Go	317	⇱	Go stopwords library. This is the Glasgow list without 'computer', 'i', and 'thick'; it has 'thickv'.
scikit-learn	318	⇱	Uses Glasgow list, but without the word “computer”
Glasgow IR	319	⇱	Linguistic resources from the Glasgow Information Retrieval group. There are lots of copies and edits of this one. For example, xpo6 has mistakes: it has a quote character instead of 'lf' (herse" instead of herself), yet it comes up as one of the top results in a Google search.
xpo6	319	⇱	Used in the Humboldt Digital Library and Network and documented in a blog post. Likely derived from the Glasgow list.
spaCy	326	⇱	Improved list from Stone, Denis, Kwantes (2010) Paper
Gensim	337	⇱	Same as spaCy (Improved list from Stone, Denis, Kwantes (2010))
Okapi (Expanded gsl.cacm)	339	⇱	Expanded cacm list from Okapi
C99 and TextTiling	371	⇱	UIMA wrapper for the java implementations of the segmentation algorithms C99 and TextTiling, written by Freddy Choi
Galago (inquery)	418	⇱	The core/src/main/resources/stopwords/inquery list is same as Indri default.
Indri	418	⇱	Part of Lemur Project
Onix & Lextek	429	⇱	This stopword list is probably the most widely used stopword list. It covers a wide number of stopwords without getting too aggressive and including too many words which a user might search upon. This wordlist contains 429 words.
GATE (Keyphrase Extraction)	452	⇱	Stopwords used in GATE Keyphrase Extraction Algorithm
Zettair	469	⇱	Zettair is a compact and fast text search engine designed and written by the Search Engine Group at RMIT University. It was once known as Lucy.
Okapi (Expanded gsl.sample)	474	⇱	Same as okapi_sample.txt but with “I” terms (not default Okapi behaviour! but may be useful)
Taporware	485	⇱	TAPoRware Project, McMaster University - modified Glasgow list – includes numbers 0 to 100, and 1990 to 2020 (for dates presumably) also punctuation
Voyant (Taporware)	488	⇱	Voyant uses taporware list by default, includes extra thou, thee, thy – presumably for Shakespeare corpus. Trombone repo also has Glasgow and SMART in resources.
MALLET	524	⇱	Default MALLET stopword list. (Based on SMART I think) See Docs
Weka	526	⇱	Like Bow (Rainbow, which is SMART) but with extra 'll' and 've' added to avoid words like you'll, I've, etc. Almost exactly the same as mallet.txt
MySQL (MyISAM)	543	⇱	MyISAM and InnoDB use different stoplists. Taken from SMART but modified
Galago (rmstop)	565	⇱	Includes some punctuation, utf8 characters, www, http, org, net, youtube, wikipedia
Kevin Bougé	571	⇱	Multilang lists compiled by Kevin Bougé. English is SMART.
SMART	571	⇱	SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System is an information retrieval system developed at Cornell University in the 1960s.
ROUGE	598	⇱	Extended SMART list used in ROUGE 1.5.5 Summary Evaluation Toolkit – includes extra words: reuters, ap, news, tech, index, 3 letter days of the week and months.
tonybsk_1.txt	635	⇱	Unknown origin - I lost the reference.
Sphinx Search Ultimate	665	⇱	An extension for Sphinx has this list.
Ranks NL (Large)	667	⇱	A very long list from ranks.nl
tonybsk_6.txt	671	⇱	Unknown origin - I lost the reference.
Terrier	733	⇱	Terrier Retrieval Engine “Stopword list to load can be loaded from the stopwords.filename property.”
ATIRE (Puurula)	988	⇱	Included in ATIRE See Paper
Alir3z4	1298	⇱	List of common stop words in various languages. The English list looks like it was merged from several sources.

Notes:

File format: 1 word per line. Unix newlines \n, end with a blank line. utf8 encoded.
Case & Punctuation was preserved as presented, except when all UPPERCASE - these were lowercased.
Where multiple versions of lists exist in code, the latest stable version was used.
Exact duplicates included (eg: Snowball, MongoDB, Quanteda).
Source URL is where the word list came from. Sometimes listed words do not match what's in the software.
Description includes a note or how the page, help manuals, or code comments describe stopwords.
There are way too many other blog posts and pages that list English stopwords, and many more are hardcoded in different implementations. I tried finding the most prominent ones (well known tools, or linked from Wiki, or first result on Google, or from IR / NLP researchers).
build.py generates this file with table above from en_stopwords.csv
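
Given the file format described in these notes (one word per line, `\n` newlines, UTF-8), reading a list into a set might look like the sketch below. The `load_stopwords` helper and the `example.txt` file name are hypothetical, chosen for this example; they are not taken from build.py. Case is left untouched, matching the note that case was preserved as presented.

```python
import tempfile
from pathlib import Path


def load_stopwords(path):
    """Read a UTF-8 file with one word per line into a set.

    Blank lines (including the trailing one) are skipped; case is kept as-is.
    """
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.split("\n") if line.strip()}


# Demonstration with a temporary file, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "example.txt"  # hypothetical file name
    sample.write_text("a\nan\nthe\n\n", encoding="utf-8")
    words = load_stopwords(sample)

print(sorted(words))
# prints ['a', 'an', 'the']
```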

TODO:

Visualise differences and overlaps
Find and cite original papers that introduced specific lists
Influence on retrieval: How much can be attributed to just stopwords? Is it significant? Let's find out.
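
One possible starting point for the first item is a pairwise comparison of the lists as sets. The sketch below is only a suggestion: it uses Jaccard similarity as the overlap measure (an assumption, since no measure is chosen here), and the `toy` lists and the helper names are hypothetical stand-ins for the real files.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def compare_lists(lists):
    """Yield (name_a, name_b, similarity, only_in_a, only_in_b) for each pair."""
    named = sorted(lists.items(), key=lambda item: item[0])
    for (name_a, a), (name_b, b) in combinations(named, 2):
        yield name_a, name_b, jaccard(a, b), a - b, b - a


# Hypothetical toy lists; real input would be the word sets loaded from files.
toy = {
    "list_a": {"a", "an", "the", "of"},
    "list_b": {"a", "an", "the", "computer"},
}

for name_a, name_b, sim, only_a, only_b in compare_lists(toy):
    print(f"{name_a} vs {name_b}: jaccard={sim:.2f}")
    print(f"  only in {name_a}: {sorted(only_a)}")
    print(f"  only in {name_b}: {sorted(only_b)}")
```

The per-pair set differences make small edits between near-duplicate lists easy to spot, such as a single word missing from a derived copy.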

See Also:

https://en.wikipedia.org/wiki/Stop_words
http://members.unine.ch/jacques.savoy/clef/
http://research.nii.ac.jp/ntcir/tools/tools-en.html
http://www.cs.uml.edu/~haim/teaching/iws/tirsaa/sources/text_utilities.html
http://text-analytics101.rxnlp.com/2014/10/all-about-stop-words-for-text-mining.html
https://github.com/lintool/IR-Reproducibility/tree/master/systems
http://www.umiacs.umd.edu/~oard/teaching/734/fall15/software.html
Galago also has a "stop phrase" list: https://sourceforge.net/p/lemur/galago/ci/default/tree/core/src/main/resources/stopwords/stopStructure
SMART FTP Mirror: http://ftp.gnome.org/mirror/archive/ftp.sunet.se/pub/databases/full-text/smart/
Multiple language stopwords (EN already one of the above in table): https://sites.google.com/site/kevinbouge/stopwords-lists
More for multiple languages (EN already one of the above in table): https://code.google.com/archive/p/stop-words/
Stopwords for 50 languages in json (EN is SMART): https://github.com/6/stopwords-json

Contributing:

Have you got a favourite stopword list that's different to what's here? Send a pull request with your list as a text file, 1 word per line in en/ folder and a new row in en_stopwords.csv
