dfalbel / Ptstem

Licence: other

Stemming Algorithms for the Portuguese Language

ptstem

This package wraps three stemming algorithms for the Portuguese language that are available in R. It unifies the API for the stemmers and provides easy stem completion.

Installing

You can install directly from GitHub using:

devtools::install_github("dfalbel/ptstem")

or from CRAN using:

install.packages("ptstem")

Using

Consider the following text, extracted from the Wikipedia article on stemming:

text <- "Em morfologia linguística e recuperação de informação a stemização (do inglês, stemming) é
o processo de reduzir palavras flexionadas (ou às vezes derivadas) ao seu tronco (stem), base ou
raiz, geralmente uma forma da palavra escrita. O tronco não precisa ser idêntico à raiz morfológica
da palavra; ele geralmente é suficiente que palavras relacionadas sejam mapeadas para o mesmo
tronco, mesmo se este tronco não for ele próprio uma raiz válida. O estudo de algoritmos para
stemização tem sido realizado em ciência da computação desde a década de 60. Vários motores de
buscas tratam palavras com o mesmo tronco como sinônimos como um tipo de expansão de consulta, em
um processo de combinação."

This will use the rslp algorithm to stem the text.

library(ptstem)
ptstem(text, algorithm = "rslp", complete = FALSE)
#> [1] "Em morfolog linguis e recuper de inform a stemiz (do ingl, stemming) é\no process de reduz palavr flexion (ou às vez deriv) ao seu tronc (st), bas ou\nraiz, geral uma form da palavr escrit. O tronc nao precis ser ident à raiz morfolog\nda palavr; ele geral é sufici que palavr relacion sej mape par o mesm\ntronc, mesm se est tronc nao for ele propri uma raiz val. O estud de algoritm par\nstemiz tem sid realiz em cienc da comput desd a dec de 60. Vari motor de\nbusc trat palavr com o mesm tronc com sinon com um tip de expans de consult, em\num process de combin."

You can complete stemmed words, that is, map each stem back to a full word, by setting the argument complete = TRUE.

ptstem(text, algorithm = "rslp", complete = TRUE)
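To give a rough idea of what stem completion means, here is a minimal sketch in base R. It assumes that completion replaces each stem with the most frequent original word that shares it. This is only an illustration of the idea, not necessarily how ptstem implements it. Both toy_stem and complete_stems are hypothetical helpers written for this example, and toy_stem is not the RSLP algorithm.

```r
# Hypothetical toy stemmer (NOT RSLP): strips a few common Portuguese endings.
toy_stem <- function(words) {
  sub("(as|os|a|o|s)$", "", words)
}

# Hypothetical helper: map every stem to the most frequent original word
# that produced it (ties go to the alphabetically first word).
complete_stems <- function(words, stems) {
  vapply(stems, function(s) {
    candidates <- table(words[stems == s])
    names(candidates)[which.max(candidates)]
  }, character(1), USE.NAMES = FALSE)
}

words <- c("palavras", "palavra", "palavras", "tronco", "troncos", "tronco")

stems <- toy_stem(words)
# "palavr" "palavr" "palavr" "tronc" "tronc" "tronc"

completed <- complete_stems(words, stems)
# "palavras" "palavras" "palavras" "tronco" "tronco" "tronco"
```

Under this assumption, every inflected form ends up represented by a single readable word instead of a truncated stem.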

The other implemented algorithms are:

hunspell: the same algorithm used by the OpenOffice spell checker (available via the hunspell package).
porter: available via the SnowballC package.

To stem with one of these algorithms, change the algorithm argument of the ptstem function.

library(ptstem)
ptstem(text, algorithm = "hunspell")
#> [1] "Em morfologia linguística e recuperação de informação a stemização (do inglês, stemização) é\no processo de reduzir palavras flexionadas (ou às vezes derivadas) ao seu tronco (stemização), base ou\nraiz, geralmente uma forma da palavras escrita. O tronco não precisa ser idêntico à raiz morfologia\nda palavras; ele geralmente é suficiente que palavras relacionadas ser mapeadas para o mesmo\ntronco, mesmo se este tronco não for ele próprio uma raiz válida. O estudo de algoritmos para\nstemização tem ser realizado em ciência da computação desde a década de 60. Vários motores de\nbuscas tratam palavras com o mesmo tronco como sinônimos como um tipo de expansão de consulta, em\num processo de combinação."
ptstem(text, algorithm = "porter")
#> [1] "Em morfologia linguística e recuperação de informação a stemização (do inglês, stemming) é\no processo de reduzir palavras flexionadas (ou às vezes derivadas) ao seu tronco (stem), base ou\nraiz, geralmente uma forma da palavras escrita. O tronco não precisa ser idêntico à raiz morfológica\nda palavras; ele geralmente é suficiente que palavras relacionadas sejam mapeadas para o mesmo\ntronco, mesmo se este tronco não for ele próprio uma raiz válida. O estudo de algoritmos para\nstemização tem sido realizado em ciência da computação desde a década de 60. Vários motores de\nbuscas tratam palavras com o mesmo tronco com sinônimos com um tipo de expansão de consulta, em\num processo de combinação."

Performance

The goal of stemming algorithms is to group related words and to separate unrelated words. With this in mind, you can talk about two kinds of possible errors when stemming:

Understemming: Related words were not grouped because you didn't stem enough.
Overstemming: Unrelated words were grouped because you removed a large part of the word when stemming.

To measure these errors, the package provides the performance function. It returns a data.frame with three columns: the name of the stemmer and two metrics:

UI: the understemming index. It's the proportion of related words that were not grouped.
OI: the overstemming index. It's the proportion of unrelated words that were grouped.

Remember that OI is 0 if you don't stem at all. So I think the true objective of a stemming algorithm is to reduce UI without increasing OI too much.
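The trade-off can be made concrete with a small base R sketch. It assumes a pairwise reading of the two definitions: UI is the share of related word pairs that end up with different stems, and OI is the share of unrelated word pairs that end up with the same stem. The performance function may compute its metrics differently. The words, the group labels, the stem_errors helper and the toy suffix-stripping stemmer are all hypothetical.

```r
# Hypothetical grouped words: equal 'group' values mark truly related words.
words <- c("palavra", "palavras", "tronco", "troncos", "casa", "caso")
group <- c(1, 1, 2, 2, 3, 4)

# Hypothetical helper: pairwise understemming (UI) and overstemming (OI).
stem_errors <- function(stems, group) {
  pairs     <- combn(length(stems), 2)   # all unordered pairs of word indices
  related   <- group[pairs[1, ]] == group[pairs[2, ]]
  same_stem <- stems[pairs[1, ]] == stems[pairs[2, ]]
  c(UI = mean(!same_stem[related]),      # related pairs left ungrouped
    OI = mean(same_stem[!related]))      # unrelated pairs grouped together
}

# No stemming: nothing is grouped, so OI is 0 but UI is at its maximum.
no_stem <- stem_errors(words, group)
# UI = 1, OI = 0

# Toy stemmer (NOT one of the package's algorithms): strips common endings.
toy <- stem_errors(sub("(as|os|a|o|s)$", "", words), group)
# UI = 0, OI = 1/13 (about 0.077), because "casa" and "caso" collapse to "cas"
```

In this toy data, stripping suffixes removes all understemming at the cost of merging one unrelated pair, which is the kind of balance the two indices are meant to expose.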

The ptstem package provides a dataset of grouped words for the Portuguese language (found in this link). The performance function calculates the metrics described above on this dataset.

See results:

performance()
#>                 .id         UI         OI
#> 1              rslp 0.08540752 0.04929234
#> 2          hunspell 0.12835530 0.03221083
#> 3            porter 0.13958028 0.03221083
#> 4 modified-hunspell 0.05466081 0.06295754

This is not the only approach for measuring the performance of these algorithms. The article "Assessing the impact of Stemming Accuracy on Information Retrieval – A multilingual perspective" describes various ways to analyse stemming performance.
