nlp-uoregon / Trankit

License: Apache-2.0
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Programming Languages

python
139,335 projects - #7 most used programming language

Projects that are alternatives of or similar to Trankit

Spago
Self-contained Machine Learning and Natural Language Processing library in Go
Stars: ✭ 854 (+174.6%)
Mutual labels:  artificial-intelligence, natural-language-processing, deeplearning, language-model
Learn Data Science For Free
This repository is a combination of different resources lying scattered all over the internet. The reason for making such a repository is to combine all the valuable resources in a sequential manner, so that it helps every beginner who is in search of a free and structured learning resource for Data Science. For constant updates, follow me on …
Stars: ✭ 4,757 (+1429.58%)
Mutual labels:  artificial-intelligence, natural-language-processing, deeplearning
Coursera Natural Language Processing Specialization
Programming assignments from all courses in the Coursera Natural Language Processing Specialization offered by deeplearning.ai.
Stars: ✭ 39 (-87.46%)
Mutual labels:  artificial-intelligence, natural-language-processing, deeplearning
Lazynlp
Library to scrape and clean web pages to create massive datasets.
Stars: ✭ 1,985 (+538.26%)
Mutual labels:  artificial-intelligence, natural-language-processing, language-model
Ai Series
📚 [.md & .ipynb] Series on Artificial Intelligence & Deep Learning, including Mathematics Fundamentals, Python Practices, NLP Applications, etc. 💫 Hands-on AI and deep learning: mathematical statistics | machine learning | deep learning | natural language processing | tooling practice with Scikit, TensorFlow & PyTorch | industry applications & course notes
Stars: ✭ 702 (+125.72%)
Mutual labels:  artificial-intelligence, natural-language-processing, deeplearning
Fixy
Our aim is to build an open-source spelling assistant/checker that solves many different problems in the Turkish NLP literature at once, proposes unique approaches, and remedies the shortcomings of existing work. It resolves spelling errors in users' texts with a deep-learning approach and, by also performing semantic analysis on the texts, detects and corrects the errors that arise in that context.
Stars: ✭ 165 (-46.95%)
Mutual labels:  artificial-intelligence, natural-language-processing, deeplearning
Graphbrain
Language, Knowledge, Cognition
Stars: ✭ 294 (-5.47%)
Mutual labels:  artificial-intelligence, natural-language-processing
Catalyst
🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
Stars: ✭ 224 (-27.97%)
Mutual labels:  artificial-intelligence, natural-language-processing
Datascience
Curated list of Python resources for data science.
Stars: ✭ 3,051 (+881.03%)
Mutual labels:  artificial-intelligence, deeplearning
Articutapi
API of Articut, a Chinese word segmenter with semantic part-of-speech tagging. Word segmentation (斷詞, also called 分詞) is the foundation of Chinese text processing. Articut uses no machine learning and needs no data model; using only the grammar rules of modern vernacular Chinese, it achieves an F1-measure above 94% and a recall above 96% on SIGHAN 2005.
Stars: ✭ 252 (-18.97%)
Mutual labels:  artificial-intelligence, natural-language-processing
Text Classification
Text Classification through CNN, RNN & HAN using Keras
Stars: ✭ 216 (-30.55%)
Mutual labels:  artificial-intelligence, deeplearning
query completion
Personalized Query Completion
Stars: ✭ 24 (-92.28%)
Mutual labels:  deeplearning, language-model
Fakenewscorpus
A dataset of millions of news articles scraped from a curated list of data sources.
Stars: ✭ 255 (-18.01%)
Mutual labels:  artificial-intelligence, natural-language-processing
My Awesome Ai Bookmarks
Curated list of my reads, implementations, and core concepts of Artificial Intelligence, Deep Learning, and Machine Learning by the best folks in the world.
Stars: ✭ 223 (-28.3%)
Mutual labels:  artificial-intelligence, deeplearning
Ai Job Resume
Resume template for AI algorithm-engineer positions.
Stars: ✭ 219 (-29.58%)
Mutual labels:  artificial-intelligence, natural-language-processing
Prodigy Recipes
🍳 Recipes for Prodigy, our fully scriptable annotation tool
Stars: ✭ 229 (-26.37%)
Mutual labels:  artificial-intelligence, natural-language-processing
Aidl kb
A Knowledge Base for the FB Group Artificial Intelligence and Deep Learning (AIDL)
Stars: ✭ 219 (-29.58%)
Mutual labels:  artificial-intelligence, natural-language-processing
few-shot-lm
The source code of "Language Models are Few-shot Multilingual Learners" (MRL @ EMNLP 2021)
Stars: ✭ 32 (-89.71%)
Mutual labels:  multilingual, language-model
Awesome Ai Awesomeness
A curated list of awesome awesomeness about artificial intelligence
Stars: ✭ 268 (-13.83%)
Mutual labels:  artificial-intelligence, natural-language-processing
Lda
LDA topic modeling for node.js
Stars: ✭ 262 (-15.76%)
Mutual labels:  artificial-intelligence, natural-language-processing

Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Trankit is a light-weight Transformer-based Python Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 downloadable pretrained pipelines for 56 languages.

Trankit outperforms the current state-of-the-art multilingual toolkit Stanza (StanfordNLP) in many tasks over 90 Universal Dependencies v2.5 treebanks of 56 different languages while still being efficient in memory usage and speed, making it usable for general users.

In particular, for English, Trankit is significantly better than Stanza on sentence segmentation (+7.22%) and dependency parsing (+3.92% for UAS and +4.37% for LAS). For Arabic, our toolkit substantially improves sentence segmentation performance by 16.16%, while for Chinese it improves UAS and LAS for dependency parsing by 12.31% and 12.72%, respectively. A detailed comparison between Trankit, Stanza, and other popular NLP toolkits (e.g., spaCy, UDPipe) for other languages can be found on our documentation page.

We also created a Demo Website for Trankit, which is hosted at: http://nlp.uoregon.edu/trankit

Our technical paper for Trankit will be presented at the EACL 2021 conference as a demonstration. Please cite the paper if you use Trankit in your research.

@inproceedings{nguyen2021trankit,
      title={Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing},
      author={Minh Van Nguyen and Viet Lai and Amir Pouran Ben Veyseh and Thien Huu Nguyen},
      booktitle={Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations},
      year={2021}
}

Installation

Trankit can be easily installed via one of the following methods:

Using pip

pip install trankit

This command installs Trankit and all required packages automatically. Note that, due to a known issue relating to adapter-transformers, which is an extension of the transformers library, users may need to uninstall transformers before installing trankit to avoid potential conflicts.
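For example, a clean installation following that advice might look like this (a minimal sketch; adjust to your own environment):

# remove any previously installed transformers to avoid the conflict,
# then install trankit with its own dependencies
pip uninstall -y transformers
pip install trankit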

From source

git clone https://github.com/nlp-uoregon/trankit.git
cd trankit
pip install -e .

These commands clone our GitHub repo and install Trankit in editable mode.

Usage

Trankit can process inputs that are either untokenized (raw) strings or pretokenized lists of words, at both the sentence and document level. Currently, Trankit supports the following tasks (a sketch of running tasks individually follows the list):

  • Sentence segmentation.
  • Tokenization.
  • Multi-word token expansion.
  • Part-of-speech tagging.
  • Morphological feature tagging.
  • Dependency parsing.
  • Named entity recognition.
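Besides running the full pipeline, individual tasks can also be performed through dedicated methods. Below is a minimal sketch; pipeline initialization is explained in the next section, .tokenize() and .ner() appear later in this README, and .posdep() is an assumption following the same naming pattern (see our documentation page for the exact API):

from trankit import Pipeline

p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

text = 'Hello! This is Trankit.'

# sentence segmentation and tokenization only
tokens = p.tokenize(text)

# part-of-speech tags, morphological features, and dependency parses
# (.posdep() is an assumption based on Trankit's naming pattern)
tagged = p.posdep(text)

# named entity recognition
entities = p.ner(text)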

Initialize a pretrained pipeline

The following code shows how to initialize a pretrained pipeline for English; it is instructed to run on a GPU, automatically download pretrained models, and store them in the specified cache directory. Trankit will not download pretrained models if they already exist.

from trankit import Pipeline

# initialize a pretrained pipeline for English
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

Perform all tasks on the input

After initializing a pretrained pipeline, it can be used to process the input on all tasks as shown below. If the input is a single sentence, the flag is_sent must be set to True.

from trankit import Pipeline

p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

######## document-level processing ########
untokenized_doc = '''Hello! This is Trankit.'''
pretokenized_doc = [['Hello', '!'], ['This', 'is', 'Trankit', '.']]

# perform all tasks on the input
processed_doc1 = p(untokenized_doc)
processed_doc2 = p(pretokenized_doc)

######## sentence-level processing ####### 
untokenized_sent = '''This is Trankit.'''
pretokenized_sent = ['This', 'is', 'Trankit', '.']

# perform all tasks on the input
processed_sent1 = p(untokenized_sent, is_sent=True)
processed_sent2 = p(pretokenized_sent, is_sent=True)
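Each returned object is a Python dictionary. As a minimal sketch of inspecting the result, assuming Trankit's documented output format with 'sentences' and 'tokens' keys (the exact key names are assumptions here, not guaranteed by this README):

# iterate over sentences and tokens of the processed document;
# the keys 'sentences', 'tokens', 'text', 'upos', and 'deprel'
# are assumptions based on Trankit's documented output format
for sentence in processed_doc1['sentences']:
    for token in sentence['tokens']:
        print(token['text'], token.get('upos'), token.get('deprel'))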

Note that, although pretokenized inputs can always be processed, using them for languages that require multi-word token expansion, such as Arabic or French, may not produce correct results. Please check the column Requires MWT expansion? of this table to see whether a particular language requires multi-word token expansion.
For more detailed examples, please check out our documentation page.

Multilingual usage

To process inputs in different languages, we first need to initialize a multilingual pipeline.

from trankit import Pipeline

# initialize a multilingual pipeline, starting from English
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

langs = ['arabic', 'chinese', 'dutch']
for lang in langs:
    p.add(lang)

# tokenize an English input
p.set_active('english')
en = p.tokenize('Rich was here before the scheduled time.')

# get ner tags for an Arabic input
p.set_active('arabic')
ar = p.ner('وكان كنعان قبل ذلك رئيس جهاز الامن والاستطلاع للقوات السورية العاملة في لبنان.')

In this example, .set_active() is used to switch between languages.
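The active language can be switched any number of times. For instance, the Dutch pipeline added above could be used next (the example sentence is ours, for illustration only):

# switch to the Dutch pipeline added earlier
p.set_active('dutch')
nl = p.tokenize('Dit is Trankit.')  # Dutch for 'This is Trankit.'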

Building a customized pipeline

Training a customized pipeline is easy with Trankit via the TPipeline class. Below we show how to train a joint token and sentence splitter on customized data.

from trankit import TPipeline

tp = TPipeline(training_config={
    'task': 'tokenize',
    'save_dir': './saved_model',
    'train_txt_fpath': './train.txt',
    'train_conllu_fpath': './train.conllu',
    'dev_txt_fpath': './dev.txt',
    'dev_conllu_fpath': './dev.conllu'
    }
)

tp.train()
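Once training finishes, the saved models can be loaded back as a regular pipeline. A hedged sketch follows; trankit.verify_customized_pipeline and the 'customized' language name are assumptions based on Trankit's documentation, not something this README guarantees:

import trankit
from trankit import Pipeline

# register the trained models so they can be loaded as a pipeline
# (verify_customized_pipeline is an assumption based on Trankit's docs)
trankit.verify_customized_pipeline(
    category='customized',
    save_dir='./saved_model'
)

p = Pipeline(lang='customized', cache_dir='./saved_model')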

Detailed guidelines for training and loading a customized pipeline can be found here.

To-do list

  • Language Identification

Acknowledgements

We use XLM-RoBERTa with Adapters as our shared multilingual encoder for different tasks and languages. AdapterHub is used to implement our plug-and-play mechanism with Adapters. To speed up the development process, the implementations of the MWT expander and the lemmatizer are adapted from Stanza.
