nlp-uoregon / Trankit

License: Apache-2.0
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Programming Languages

python
139,335 projects - #7 most used programming language

Projects that are alternatives of or similar to Trankit

Spago
Self-contained Machine Learning and Natural Language Processing library in Go
Stars: ✭ 854 (+174.6%)
Mutual labels:  artificial-intelligence, natural-language-processing, deeplearning, language-model
Learn Data Science For Free
This repository is a combination of different resources lying scattered all over the internet. The reason for making such a repository is to combine all the valuable resources in a sequential manner, so that it helps every beginner who is in search of a free and structured learning resource for Data Science. For constant updates, follow me on …
Stars: ✭ 4,757 (+1429.58%)
Mutual labels:  artificial-intelligence, natural-language-processing, deeplearning
Coursera Natural Language Processing Specialization
Programming assignments from all courses in the Coursera Natural Language Processing Specialization offered by deeplearning.ai.
Stars: ✭ 39 (-87.46%)
Mutual labels:  artificial-intelligence, natural-language-processing, deeplearning
Lazynlp
Library to scrape and clean web pages to create massive datasets.
Stars: ✭ 1,985 (+538.26%)
Mutual labels:  artificial-intelligence, natural-language-processing, language-model
Ai Series
📚 [.md & .ipynb] Series on Artificial Intelligence & Deep Learning, including Mathematics Fundamentals, Python Practices, NLP Applications, etc. 💫 Hands-on AI and deep learning: mathematical statistics | machine learning | deep learning | natural language processing | tooling practice with Scikit, TensorFlow & PyTorch | industry applications & course notes
Stars: ✭ 702 (+125.72%)
Mutual labels:  artificial-intelligence, natural-language-processing, deeplearning
Fixy
Our aim is to build an open-source spelling assistant/checker that solves many different problems in the Turkish NLP literature at once, proposes unique approaches, and remedies the shortcomings of existing work. It resolves spelling errors in users' texts with a deep-learning approach and, by also performing semantic analysis on the texts, detects and corrects the errors that arise in that context.
Stars: ✭ 165 (-46.95%)
Mutual labels:  artificial-intelligence, natural-language-processing, deeplearning
Graphbrain
Language, Knowledge, Cognition
Stars: ✭ 294 (-5.47%)
Mutual labels:  artificial-intelligence, natural-language-processing
Catalyst
🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
Stars: ✭ 224 (-27.97%)
Mutual labels:  artificial-intelligence, natural-language-processing
Datascience
Curated list of Python resources for data science.
Stars: ✭ 3,051 (+881.03%)
Mutual labels:  artificial-intelligence, deeplearning
Articutapi
API of Articut, a Chinese word segmenter with semantic part-of-speech tagging. Word segmentation (斷詞, also called 分詞) is the foundation of Chinese text processing. Articut uses no machine learning and needs no data model; using only the grammar rules of modern vernacular Chinese, it achieves an F1-measure above 94% and a recall above 96% on SIGHAN 2005.
Stars: ✭ 252 (-18.97%)
Mutual labels:  artificial-intelligence, natural-language-processing
Text Classification
Text Classification through CNN, RNN & HAN using Keras
Stars: ✭ 216 (-30.55%)
Mutual labels:  artificial-intelligence, deeplearning
query completion
Personalized Query Completion
Stars: ✭ 24 (-92.28%)
Mutual labels:  deeplearning, language-model
Fakenewscorpus
A dataset of millions of news articles scraped from a curated list of data sources.
Stars: ✭ 255 (-18.01%)
Mutual labels:  artificial-intelligence, natural-language-processing
My Awesome Ai Bookmarks
Curated list of my reads, implementations, and core concepts of Artificial Intelligence, Deep Learning, and Machine Learning by the best folks in the world.
Stars: ✭ 223 (-28.3%)
Mutual labels:  artificial-intelligence, deeplearning
Ai Job Resume
Resume template for AI algorithm-engineer positions.
Stars: ✭ 219 (-29.58%)
Mutual labels:  artificial-intelligence, natural-language-processing
Prodigy Recipes
🍳 Recipes for Prodigy, our fully scriptable annotation tool
Stars: ✭ 229 (-26.37%)
Mutual labels:  artificial-intelligence, natural-language-processing
Aidl kb
A Knowledge Base for the FB Group Artificial Intelligence and Deep Learning (AIDL)
Stars: ✭ 219 (-29.58%)
Mutual labels:  artificial-intelligence, natural-language-processing
few-shot-lm
The source code of "Language Models are Few-shot Multilingual Learners" (MRL @ EMNLP 2021)
Stars: ✭ 32 (-89.71%)
Mutual labels:  multilingual, language-model
Awesome Ai Awesomeness
A curated list of awesome awesomeness about artificial intelligence
Stars: ✭ 268 (-13.83%)
Mutual labels:  artificial-intelligence, natural-language-processing
Lda
LDA topic modeling for node.js
Stars: ✭ 262 (-15.76%)
Mutual labels:  artificial-intelligence, natural-language-processing

Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Trankit is a light-weight Transformer-based Python Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 downloadable pretrained pipelines for 56 languages.

Trankit outperforms the current state-of-the-art multilingual toolkit Stanza (StanfordNLP) in many tasks over 90 Universal Dependencies v2.5 treebanks of 56 different languages while still being efficient in memory usage and speed, making it usable for general users.

In particular, for English, Trankit is significantly better than Stanza on sentence segmentation (+7.22%) and dependency parsing (+3.92% for UAS and +4.37% for LAS). For Arabic, our toolkit substantially improves sentence segmentation performance by 16.16%, while for Chinese it improves UAS and LAS for dependency parsing by 12.31% and 12.72%, respectively. A detailed comparison between Trankit, Stanza, and other popular NLP toolkits (e.g., spaCy, UDPipe) for other languages can be found on our documentation page.

We also created a Demo Website for Trankit, which is hosted at: http://nlp.uoregon.edu/trankit

Our technical paper for Trankit will be presented at the EACL 2021 conference as a demonstration. Please cite the paper if you use Trankit in your research.

@inproceedings{nguyen2021trankit,
      title={Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing},
      author={Minh Van Nguyen and Viet Lai and Amir Pouran Ben Veyseh and Thien Huu Nguyen},
      booktitle={Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations},
      year={2021}
}

Installation

Trankit can be easily installed via one of the following methods:

Using pip

pip install trankit

This command installs Trankit and all required packages automatically. Note that, due to a known issue relating to adapter-transformers, which is an extension of the transformers library, users may need to uninstall transformers before installing trankit to avoid potential conflicts.
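For example, a clean installation following that advice might look like this (a minimal sketch; adjust to your own environment):

# remove any previously installed transformers to avoid the conflict,
# then install trankit with its own dependencies
pip uninstall -y transformers
pip install trankit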

From source

git clone https://github.com/nlp-uoregon/trankit.git
cd trankit
pip install -e .

These commands clone our GitHub repo and install Trankit in editable mode.

Usage

Trankit can process inputs that are either untokenized (raw) strings or pretokenized lists of words, at both the sentence and document level. Currently, Trankit supports the following tasks (a sketch of running tasks individually follows the list):

  • Sentence segmentation.
  • Tokenization.
  • Multi-word token expansion.
  • Part-of-speech tagging.
  • Morphological feature tagging.
  • Dependency parsing.
  • Named entity recognition.
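Besides running the full pipeline, individual tasks can also be performed through dedicated methods. Below is a minimal sketch; pipeline initialization is explained in the next section, .tokenize() and .ner() appear later in this README, and .posdep() is an assumption following the same naming pattern (see our documentation page for the exact API):

from trankit import Pipeline

p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

text = 'Hello! This is Trankit.'

# sentence segmentation and tokenization only
tokens = p.tokenize(text)

# part-of-speech tags, morphological features, and dependency parses
# (.posdep() is an assumption based on Trankit's naming pattern)
tagged = p.posdep(text)

# named entity recognition
entities = p.ner(text)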

Initialize a pretrained pipeline

The following code shows how to initialize a pretrained pipeline for English; it is instructed to run on a GPU, automatically download pretrained models, and store them in the specified cache directory. Trankit will not download pretrained models if they already exist.

from trankit import Pipeline

# initialize a pretrained pipeline for English
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

Perform all tasks on the input

After initializing a pretrained pipeline, it can be used to process the input on all tasks as shown below. If the input is a single sentence, the flag is_sent must be set to True.

from trankit import Pipeline

p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

######## document-level processing ########
untokenized_doc = '''Hello! This is Trankit.'''
pretokenized_doc = [['Hello', '!'], ['This', 'is', 'Trankit', '.']]

# perform all tasks on the input
processed_doc1 = p(untokenized_doc)
processed_doc2 = p(pretokenized_doc)

######## sentence-level processing ####### 
untokenized_sent = '''This is Trankit.'''
pretokenized_sent = ['This', 'is', 'Trankit', '.']

# perform all tasks on the input
processed_sent1 = p(untokenized_sent, is_sent=True)
processed_sent2 = p(pretokenized_sent, is_sent=True)
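Each returned object is a Python dictionary. As a minimal sketch of inspecting the result, assuming Trankit's documented output format with 'sentences' and 'tokens' keys (the exact key names are assumptions here, not guaranteed by this README):

# iterate over sentences and tokens of the processed document;
# the keys 'sentences', 'tokens', 'text', 'upos', and 'deprel'
# are assumptions based on Trankit's documented output format
for sentence in processed_doc1['sentences']:
    for token in sentence['tokens']:
        print(token['text'], token.get('upos'), token.get('deprel'))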

Note that, although pretokenized inputs can always be processed, using them for languages that require multi-word token expansion, such as Arabic or French, may not produce correct results. Please check the column Requires MWT expansion? of this table to see whether a particular language requires multi-word token expansion.
For more detailed examples, please check out our documentation page.

Multilingual usage

To process inputs in different languages, we first need to initialize a multilingual pipeline.

from trankit import Pipeline

# initialize a multilingual pipeline, starting from English
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

langs = ['arabic', 'chinese', 'dutch']
for lang in langs:
    p.add(lang)

# tokenize an English input
p.set_active('english')
en = p.tokenize('Rich was here before the scheduled time.')

# get ner tags for an Arabic input
p.set_active('arabic')
ar = p.ner('وكان كنعان قبل ذلك رئيس جهاز الامن والاستطلاع للقوات السورية العاملة في لبنان.')

In this example, .set_active() is used to switch between languages.
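The active language can be switched any number of times. For instance, the Dutch pipeline added above could be used next (the example sentence is ours, for illustration only):

# switch to the Dutch pipeline added earlier
p.set_active('dutch')
nl = p.tokenize('Dit is Trankit.')  # Dutch for 'This is Trankit.'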

Building a customized pipeline

Training a customized pipeline is easy with Trankit via the TPipeline class. Below we show how to train a joint token and sentence splitter on customized data.

from trankit import TPipeline

tp = TPipeline(training_config={
    'task': 'tokenize',
    'save_dir': './saved_model',
    'train_txt_fpath': './train.txt',
    'train_conllu_fpath': './train.conllu',
    'dev_txt_fpath': './dev.txt',
    'dev_conllu_fpath': './dev.conllu'
    }
)

tp.train()
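Once training finishes, the saved models can be loaded back as a regular pipeline. A hedged sketch follows; trankit.verify_customized_pipeline and the 'customized' language name are assumptions based on Trankit's documentation, not something this README guarantees:

import trankit
from trankit import Pipeline

# register the trained models so they can be loaded as a pipeline
# (verify_customized_pipeline is an assumption based on Trankit's docs)
trankit.verify_customized_pipeline(
    category='customized',
    save_dir='./saved_model'
)

p = Pipeline(lang='customized', cache_dir='./saved_model')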

Detailed guidelines for training and loading a customized pipeline can be found here.

To-do list

  • Language Identification

Acknowledgements

We use XLM-RoBERTa with Adapters as our shared multilingual encoder for different tasks and languages. AdapterHub is used to implement our plug-and-play mechanism with Adapters. To speed up the development process, the implementations of the MWT expander and the lemmatizer are adapted from Stanza.
