LeonieWeissweiler / CISTEM

Licence: MIT License
Stemmer for German

CISTEM

CISTEM is a stemming algorithm for the German language, developed by Leonie Weißweiler and Alexander Fraser. This repository contains official implementations in a variety of programming languages. At the moment, the following languages are available:

  • Python
  • Java
  • C++
  • C
  • JavaScript
  • Go
  • Haskell
  • Perl
  • Swift

The code for each language includes a method for stemming and a method for segmentation. The segmentation method returns the stripped suffix in addition to the stem.
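To make the difference between the two kinds of method concrete, here is a minimal Python sketch. It is not the CISTEM algorithm: the names `toy_segment` and `toy_stem`, the suffix list, and the minimum stem length are all hypothetical and chosen only for illustration. It shows one way a segmentation function can return both parts of a word while a stemming function returns the stem alone.

```python
# Hypothetical suffix list for illustration only; CISTEM's actual rules differ.
TOY_SUFFIXES = ("ern", "em", "er", "en", "es", "e", "s", "n")


def toy_segment(word, min_stem_length=3):
    """Split a word into (stem, suffix) by removing at most one listed suffix."""
    lowered = word.lower()
    for suffix in TOY_SUFFIXES:
        # Only strip when the remaining stem keeps a minimum length (assumed value).
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_length:
            return lowered[:-len(suffix)], suffix
    return lowered, ""


def toy_stem(word):
    """Return only the stem, discarding the stripped suffix."""
    return toy_segment(word)[0]


print(toy_segment("Hauses"))  # ('haus', 'es')
print(toy_stem("Kinder"))     # kind
```

Building the stemming function on top of the segmentation function keeps the two results consistent with each other. Whether the official implementations are structured this way is not stated here.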

Performance

We performed a comparative analysis of six publicly available German stemmers, in which CISTEM achieved the best F-measure and state-of-the-art runtime.

Gold standards

The gold_standards folder contains the two gold standards we used for evaluation. Each file is a UTF-8 text file in which each line contains all the stems of one cluster, separated by single spaces. Note that we do not supply a reference stem for each cluster: we measure stemming performance as the ability to group words with the same meaning, which is more relevant for information retrieval than the absolute stem. If you use these gold standards in your own research, please cite our paper.
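The file format described above can be read with a few lines of Python. The sketch below also scores a stemmer by how well its output regroups the clusters, using pairwise precision, recall, and F-measure. This pairwise scoring is an assumption chosen for illustration and is not necessarily the metric used in the paper. The function names, the sample data, and the placeholder stemmer are all made up.

```python
from collections import defaultdict
from itertools import combinations


def parse_gold_standard(text):
    """Parse gold-standard text: one cluster per line, entries separated by spaces."""
    return [line.split() for line in text.splitlines() if line.strip()]


def pairs_within(groups):
    """Return the set of unordered word pairs that share a group."""
    pairs = set()
    for group in groups:
        pairs.update(combinations(sorted(set(group)), 2))
    return pairs


def pairwise_scores(clusters, stem):
    """Return (precision, recall, f_measure) for grouping words by their stem."""
    gold_pairs = pairs_within(clusters)
    by_stem = defaultdict(list)
    for cluster in clusters:
        for word in cluster:
            by_stem[stem(word)].append(word)
    predicted_pairs = pairs_within(by_stem.values())
    correct = len(gold_pairs & predicted_pairs)
    precision = correct / len(predicted_pairs) if predicted_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up sample in the described format, not taken from the real gold standards.
sample = "haus hauses häuser\nlaufen läuft lief\n"
clusters = parse_gold_standard(sample)

# Placeholder stemmer (first four characters); substitute a real stemmer here.
precision, recall, f_measure = pairwise_scores(clusters, lambda w: w[:4])
print(precision, recall, f_measure)
```

To use a real file, read it with `open(path, encoding="utf-8")` and pass its contents to `parse_gold_standard`. No reference stem is needed, because only the grouping of words is compared.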

More information on how we evaluated runtimes and stemming quality can be found in our paper:

Leonie Weißweiler, Alexander Fraser (2017). Developing a Stemmer for German Based on a Comparative Analysis of Publicly Available Stemmers. In Proceedings of the German Society for Computational Linguistics and Language Technology (GSCL), to appear.
