udpipe - R package for Tokenization, Tagging, Lemmatization and Dependency Parsing Based on UDPipe

This repository contains an R package which is an Rcpp wrapper around the UDPipe C++ library (http://ufal.mff.cuni.cz/udpipe, https://github.com/ufal/udpipe).

  • UDPipe provides language-agnostic tokenization, tagging, lemmatization and dependency parsing of raw text, which is an essential part of natural language processing.
  • The techniques used are explained in detail in the paper: "Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe", available at http://ufal.mff.cuni.cz/~straka/papers/2017-conll_udpipe.pdf. In that paper, you'll also find accuracies on different languages and process flow speed (measured in words per second).

General

The udpipe R package was designed with the following things in mind when building the Rcpp wrapper around the UDPipe C++ library:

  • Give R users simple access in order to easily tokenize, tag, lemmatize or perform dependency parsing on text in any language
  • Provide easy access to pre-trained annotation models
  • Allow R users to easily construct their own annotation model based on data in CONLL-U format, as provided in more than 100 treebanks available at http://universaldependencies.org
  • Don't rely on Python or Java so that R users can easily install this package without configuration hassle
  • No external R package dependencies except the strictly necessary ones (Rcpp and data.table, no tidyverse)

Installation & License

The package is available under the Mozilla Public License Version 2.0. Installation can be done as follows. Please visit the package documentation at https://bnosac.github.io/udpipe/en and look at the R package vignettes for further details.

install.packages("udpipe")
vignette("udpipe-tryitout", package = "udpipe")
vignette("udpipe-annotation", package = "udpipe")
vignette("udpipe-universe", package = "udpipe")
vignette("udpipe-usecase-postagging-lemmatisation", package = "udpipe")
# An overview of keyword extraction techniques: https://bnosac.github.io/udpipe/docs/doc7.html
vignette("udpipe-usecase-topicmodelling", package = "udpipe")
vignette("udpipe-parallel", package = "udpipe")
vignette("udpipe-train", package = "udpipe")

For installing the development version of this package: remotes::install_github("bnosac/udpipe", build_vignettes = TRUE)

Example

Currently the package allows you to do tokenization, tagging, lemmatization and dependency parsing with one convenient function called udpipe.

library(udpipe)
udmodel <- udpipe_download_model(language = "dutch")
udmodel

    language                                                                             file_model
dutch-alpino C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dutch-alpino-ud-2.5-191206.udpipe

x <- udpipe(x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.",
            object = udmodel)
x
 doc_id paragraph_id sentence_id start end term_id token_id     token     lemma  upos                                        xpos                               feats head_token_id      dep_rel            misc
   doc1            1           1     1   2       1        1        Ik        ik  PRON                VNW|pers|pron|nomin|vol|1|ev      Case=Nom|Person=1|PronType=Prs             2        nsubj            <NA>
   doc1            1           1     4   7       2        2      ging      gaan  VERB                               WW|pv|verl|ev Number=Sing|Tense=Past|VerbForm=Fin             0         root            <NA>
   doc1            1           1     9  10       3        3        op        op   ADP                                     VZ|init                                <NA>             4         case            <NA>
   doc1            1           1    12  15       4        4      reis      reis  NOUN                  N|soort|ev|basis|zijd|stan              Gender=Com|Number=Sing             2          obl            <NA>
   doc1            1           1    17  18       5        5        en        en CCONJ                                    VG|neven                                <NA>             7           cc            <NA>
   doc1            1           1    20  21       6        6        ik        ik  PRON                VNW|pers|pron|nomin|vol|1|ev      Case=Nom|Person=1|PronType=Prs             7        nsubj            <NA>
   doc1            1           1    23  25       7        7       nam     nemen  VERB                               WW|pv|verl|ev Number=Sing|Tense=Past|VerbForm=Fin             2         conj            <NA>
   doc1            1           1    27  29       8        8       mee       mee   ADP                                      VZ|fin                                <NA>             7 compound:prt   SpaceAfter=No
   doc1            1           1    30  30       9        9         :         : PUNCT                                         LET                                <NA>             7        punct            <NA>
...
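
Because the result is a plain data.frame with one row per token, it can be summarised with base R. The following is a minimal sketch rather than part of the package: it re-creates the first nine rows of the output above by hand (keeping only the token, lemma and upos columns) so that it runs without downloading a model, then counts part-of-speech tags and the lemmas of nouns and verbs.

```r
# Rows copied from the first nine tokens of the output above
# (only three of the columns are kept for this sketch).
x <- data.frame(
  token = c("Ik", "ging", "op", "reis", "en", "ik", "nam", "mee", ":"),
  lemma = c("ik", "gaan", "op", "reis", "en", "ik", "nemen", "mee", ":"),
  upos  = c("PRON", "VERB", "ADP", "NOUN", "CCONJ", "PRON", "VERB", "ADP", "PUNCT"),
  stringsAsFactors = FALSE
)

# How often does each universal part-of-speech tag occur?
upos_counts <- table(x$upos)
upos_counts

# Keep only nouns and verbs and count their lemmas.
content <- subset(x, upos %in% c("NOUN", "VERB"))
lemma_counts <- sort(table(content$lemma), decreasing = TRUE)
lemma_counts
```

The same calls should work unchanged on the full data.frame returned by udpipe, since it contains these columns.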

Pre-trained models

Pre-trained models built on Universal Dependencies treebanks are available for more than 65 languages, based on 101 treebanks, namely:

afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, chinese-gsdsimp, classical_chinese-kyoto, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, estonian-ewt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, german-hdt, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-twittiro, italian-vit, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-alksnis, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, old_russian-torot, persian-seraji, polish-lfg, polish-pdb, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, scottish_gaelic-arcosg, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb, wolof-wtb.

Users of the package can download any of these models with udpipe_download_model.
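
As a sketch of how a specific treebank model might be fetched and reused: the example below passes one of the names from the list above as the language argument, loads the downloaded file once with udpipe_load_model, and annotates a sentence with udpipe_annotate. The choice of "dutch-lassysmall", the use of tempdir() as the download directory and the sample sentence are assumptions made for illustration. An internet connection is required.

```r
library(udpipe)

# Download one specific treebank model into a directory of our choosing.
# tempdir() is an arbitrary choice for this sketch; any writable directory works.
dl <- udpipe_download_model(language = "dutch-lassysmall", model_dir = tempdir())

# Load the model from disk once, then reuse it for every annotation call.
model <- udpipe_load_model(file = dl$file_model)

# Annotate a sentence and convert the result to a data.frame.
anno <- udpipe_annotate(model, x = "Ik ging op reis en ik nam mee: mijn laptop.")
anno <- as.data.frame(anno)
anno[, c("token", "lemma", "upos")]
```

Loading the model explicitly avoids re-reading the model file when many texts are annotated in a loop.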

How good are these models?

Train your own models based on CONLL-U data

The package also allows you to build your own annotation model. For this, you need to provide data in CONLL-U format. These are provided for many languages at https://universaldependencies.org, mostly under the CC-BY-SA license. How this is done is detailed in the package vignette.

vignette("udpipe-train", package = "udpipe")
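
To give an idea of what CONLL-U data looks like, here is a small base R sketch. CONLL-U is a plain-text format with one token per line and ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), where "_" marks an empty field, lines starting with "#" are comments, and a blank line ends a sentence. The four tokens are copied from the example output earlier in this README and shortened into a toy sentence. The file name toy.conllu and the lower-case column names are choices made for this sketch only. A real training set is a full treebank, not four tokens.

```r
# Helper for this sketch: join ten fields with tabs into one CONLL-U token line.
tok <- function(...) paste(c(...), collapse = "\t")

conllu_lines <- c(
  "# sent_id = 1",
  "# text = Ik ging op reis",
  tok("1", "Ik",   "ik",   "PRON", "VNW|pers|pron|nomin|vol|1|ev", "Case=Nom|Person=1|PronType=Prs",      "2", "nsubj", "_", "_"),
  tok("2", "ging", "gaan", "VERB", "WW|pv|verl|ev",                "Number=Sing|Tense=Past|VerbForm=Fin", "0", "root",  "_", "_"),
  tok("3", "op",   "op",   "ADP",  "VZ|init",                      "_",                                   "4", "case",  "_", "_"),
  tok("4", "reis", "reis", "NOUN", "N|soort|ev|basis|zijd|stan",   "Gender=Com|Number=Sing",              "2", "obl",   "_", "_"),
  ""
)
path <- file.path(tempdir(), "toy.conllu")
writeLines(conllu_lines, path)

# Read the file back: drop comment and blank lines, then split on tabs.
raw <- readLines(path)
tokens <- raw[nzchar(raw) & !startsWith(raw, "#")]
fields <- do.call(rbind, strsplit(tokens, "\t", fixed = TRUE))
conllu <- as.data.frame(fields, stringsAsFactors = FALSE)
names(conllu) <- c("id", "form", "lemma", "upos", "xpos",
                   "feats", "head", "deprel", "deps", "misc")
conllu[, c("id", "form", "lemma", "upos", "head", "deprel")]
```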

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be
