
quanteda / Stopwords

Licence: other
Multilingual Stopword Lists in R

Projects that are alternatives to or similar to Stopwords

Text mining resources
Resources for learning about Text Mining and Natural Language Processing
Stars: ✭ 358 (+302.25%)
Mutual labels:  text-analysis
Homer
Homer, a text analyser in Python, can help make your text more clear, simple and useful for your readers.
Stars: ✭ 607 (+582.02%)
Mutual labels:  text-analysis
Javascript Text Expander
Expands texts as you type, naturally
Stars: ✭ 58 (-34.83%)
Mutual labels:  text-analysis
Open Semantic Search
Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)
Stars: ✭ 386 (+333.71%)
Mutual labels:  text-analysis
Awesome Sentiment Analysis
Repository with everything necessary for sentiment analysis and related areas
Stars: ✭ 459 (+415.73%)
Mutual labels:  text-analysis
Rezonator
Rezonator: Dynamics of human engagement
Stars: ✭ 25 (-71.91%)
Mutual labels:  text-analysis
Giveme5w1h
Extraction of the journalistic five W and one H questions (5W1H) from news articles: who did what, when, where, why, and how?
Stars: ✭ 316 (+255.06%)
Mutual labels:  text-analysis
Orange3 Text
🍊 📄 Text Mining add-on for Orange3
Stars: ✭ 83 (-6.74%)
Mutual labels:  text-analysis
Meta
A Modern C++ Data Sciences Toolkit
Stars: ✭ 600 (+574.16%)
Mutual labels:  text-analysis
Ore
An R interface to the Onigmo regular expression library
Stars: ✭ 54 (-39.33%)
Mutual labels:  text-analysis
Jekyll
Jekyll-based static site for The Programming Historian
Stars: ✭ 387 (+334.83%)
Mutual labels:  text-analysis
Php Text Analysis
PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language
Stars: ✭ 410 (+360.67%)
Mutual labels:  text-analysis
Doctopics
Various examples of topic modeling and other text analysis
Stars: ✭ 32 (-64.04%)
Mutual labels:  text-analysis
Python Course
Tutorial and introduction into programming with Python for the humanities and social sciences
Stars: ✭ 370 (+315.73%)
Mutual labels:  text-analysis
Lexisnexistools
📰 Working with newspaper data from 'LexisNexis'
Stars: ✭ 59 (-33.71%)
Mutual labels:  text-analysis
Artificial Adversary
🗣️ Tool to generate adversarial text examples and test machine learning models against them
Stars: ✭ 348 (+291.01%)
Mutual labels:  text-analysis
Articleparse
Heuristic text extraction from news sites in Python3
Stars: ✭ 6 (-93.26%)
Mutual labels:  text-analysis
R Text Data
List of textual data sources to be used for text mining in R
Stars: ✭ 85 (-4.49%)
Mutual labels:  text-analysis
Awesome Customer Analytics
A curated list of awesome customer analytics content
Stars: ✭ 79 (-11.24%)
Mutual labels:  text-analysis
Biomedicus
Code for the old version of BioMedICUS; for the new version, see the biomedicus3 repository.
Stars: ✭ 45 (-49.44%)
Mutual labels:  text-analysis

stopwords: the R package

(Badges: CRAN version, R build status, codecov, downloads, total downloads)

R package providing “one-stop shopping” (or should that be “one-shop stopping”?) for stopword lists in R, for multiple languages and sources. No longer should text analysis or NLP packages bake in their own stopword lists or functions, since this package can accommodate them all, and is easily extended.

Created by David Muhr, and extended in cooperation with Kenneth Benoit and Kohei Watanabe.

Installation

# from CRAN
install.packages("stopwords")

# Or get the development version from GitHub:
# install.packages("devtools")
devtools::install_github("quanteda/stopwords")

Usage

head(stopwords::stopwords("de", source = "snowball"), 20)
##  [1] "aber"    "alle"    "allem"   "allen"   "aller"   "alles"   "als"    
##  [8] "also"    "am"      "an"      "ander"   "andere"  "anderem" "anderen"
## [15] "anderer" "anderes" "anderm"  "andern"  "anderr"  "anders"

head(stopwords::stopwords("ja", source = "marimo"), 20)
##  [1] "私"       "僕"       "自分"     "自身"     "我々"     "私達"    
##  [7] "あなた"   "彼"       "彼女"     "彼ら"     "彼女ら"   "あれ"    
## [13] "それ"     "これ"     "あれら"   "あれらの" "それら"   "それらの"
## [19] "これら"   "これらの"

For compatibility with the former quanteda::stopwords():

head(stopwords::stopwords("german"), 20)
##  [1] "aber"    "alle"    "allem"   "allen"   "aller"   "alles"   "als"    
##  [8] "also"    "am"      "an"      "ander"   "andere"  "anderem" "anderen"
## [15] "anderer" "anderes" "anderm"  "andern"  "anderr"  "anders"

Explore sources and languages:

# list all sources
stopwords::stopwords_getsources()
## [1] "snowball"      "stopwords-iso" "misc"          "smart"        
## [5] "marimo"        "ancient"       "nltk"          "perseus"

# list languages for a specific source
stopwords::stopwords_getlanguages("snowball")
##  [1] "da" "de" "en" "es" "fi" "fr" "hu" "ir" "it" "nl" "no" "pt" "ro" "ru" "sv"

Languages available

The following coverage of languages is currently available, by source. Note that the inclusiveness of the stopword lists varies by source, and a source that covers more languages is not necessarily better than one with more limited coverage. (There may be many reasons to prefer the default “snowball” source over the “stopwords-iso” source, for instance.)

The following languages are currently available:

Language           Code   Other sources
Afrikaans          af
Arabic             ar     misc
Armenian           hy
Azerbaijani        az
Basque             eu
Bengali            bn
Breton             br
Bulgarian          bg
Catalan            ca     misc
Chinese            zh     misc
Croatian           hr
Czech              cs
Danish             da
Dutch              nl
English            en     smart
Esperanto          eo
Estonian           et
Finnish            fi
French             fr
Galician           gl
German             de
Greek              el     misc
Greek (ancient)    grc    ancient, perseus
Gujarati           gu     misc
Hausa              ha
Hebrew             he
Hindi              hi
Hungarian          hu
Indonesian         id
Irish              ga
Italian            it
Japanese           ja
Kazakh             kk
Korean             ko
Kurdish            ku
Latin              la     ancient, perseus
Lithuanian         lt
Latvian            lv
Malay              ms
Marathi            mr
Nepali             ne
Norwegian          no
Persian            fa
Polish             pl
Portuguese         pt
Romanian           ro
Russian            ru
Slovak             sk
Slovenian          sl
Somali             so
Southern Sotho     st
Spanish            es
Swahili            sw
Swedish            sv
Thai               th
Tagalog            tl
Tajik              tg
Turkish            tr
Ukrainian          uk
Urdu               ur
Vietnamese         vi
Yoruba             yo
Zulu               zu

The exact per-language coverage of the snowball, marimo, nltk, and stopwords-iso sources can be listed with stopwords_getlanguages(), as shown above.

head(stopwords::stopwords("de", source = "stopwords-iso"), 20)
##  [1] "a"           "ab"          "aber"        "ach"         "acht"       
##  [6] "achte"       "achten"      "achter"      "achtes"      "ag"         
## [11] "alle"        "allein"      "allem"       "allen"       "aller"      
## [16] "allerdings"  "alles"       "allgemeinen" "als"         "also"
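
A rough sense of how much the inclusiveness differs can be had by comparing the lengths of the two lists directly (a quick check; the exact counts depend on the installed version):

# the "stopwords-iso" German list is considerably longer than the "snowball" one
length(stopwords::stopwords("de", source = "snowball"))
length(stopwords::stopwords("de", source = "stopwords-iso"))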

Modifying stopword lists

It is now possible to edit your own stopword lists using the interactive editor, with functions from the quanteda package (>= v2.0.2). For instance, to edit the English stopword list for the “snowball” source:

# edit the English stopwords
my_stopwords <- quanteda::char_edit(stopwords("en", source = "snowball"))

To edit stopwords whose underlying structure is a list, such as the “marimo” source, we can use the list_edit() function:

# edit the English stopwords
my_stopwordlist <- quanteda::list_edit(stopwords("en", source = "marimo", simplify = FALSE))

Finally, it’s possible to remove stopwords using pattern matching. The default is the easy-to-use “glob” style matching, which is equivalent to fixed matching when no wildcard characters are used. So to remove personal pronouns from the English Snowball word list, for instance, this would work:

library("quanteda", warn.conflicts = FALSE)
## Package version: 2.9.9000
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
posspronouns <- stopwords::data_stopwords_marimo$en$pronoun$possessive
posspronouns
## [1] "my"    "our"   "your"  "his"   "her"   "its"   "their"

stopwords("en", source = "snowball") %>%
  head(n = 10)
##  [1] "i"         "me"        "my"        "myself"    "we"        "our"      
##  [7] "ours"      "ourselves" "you"       "your"

See the difference when we remove them – “my”, “our”, and “your” are gone:

stopwords("en", source = "snowball") %>%
  head(n = 10) %>%
  char_remove(pattern = posspronouns)
## [1] "i"         "me"        "myself"    "we"        "ours"      "ourselves"
## [7] "you"

There is no char_add(), since it’s just as easy to use c() for this, but there is a char_keep() for positive selection rather than removal.
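
For instance (a small sketch reusing the objects from above), char_keep() performs the positive selection, and glob wildcards work with either function:

# keep only the possessive pronouns, rather than removing them
stopwords("en", source = "snowball") %>%
  char_keep(pattern = posspronouns)

# glob wildcards: drop every "wh"-word from the English list
stopwords("en", source = "snowball") %>%
  char_remove(pattern = "wh*")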

Adding stopwords to your own package

In v2.2, we’ve removed the function use_stopwords() because the dependency on usethis added too many downstream package dependencies, and stopwords is meant to be a lightweight package.

However, it is very easy to add a re-export for stopwords() to your package by adding the following as a file named stopwords.R:

#' Stopwords
#'
#' @description
#' Return a character vector of stopwords.
#' See \code{stopwords::\link[stopwords:stopwords]{stopwords()}} for details.
#' @usage stopwords(language = "en", source = "snowball")
#' @name stopwords
#' @importFrom stopwords stopwords
#' @export
NULL

and add stopwords to the list of Imports: in your DESCRIPTION file.
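
In the DESCRIPTION file, that amounts to an entry along these lines (a minimal excerpt; the rest of the file is omitted):

Imports:
    stopwords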

Contributing

Additional sources can be defined and contributed by adding new data objects, as follows:

  1. Data object. Create a named list of character vectors, in UTF-8 encoding, consisting of the stopwords for each language. The ISO-639-1 language code forms the name of each list element, and the value of each element is the character vector of stopwords for literal matches. The data object should follow the package naming convention and be called data_stopwords_newsource, where newsource is replaced by the name of the new source (see the sketch after this list).

  2. Documentation. The new source should be clearly documented, especially the source from which it was taken.
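
A minimal sketch of such a data object, where “newsource” and the word lists are placeholders for illustration only:

# hypothetical new source: each element is named by its ISO-639-1 code and
# holds that language's stopwords as a UTF-8 character vector
data_stopwords_newsource <- list(
  en = c("a", "an", "the"),
  de = c("der", "die", "das")
)

# during development, the object can then be saved as package data, e.g.:
# usethis::use_data(data_stopwords_newsource, overwrite = TRUE)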

License

This package as well as the source repositories are licensed under MIT.
