
megagonlabs / ginza-transformers

License: MIT
Use custom tokenizers in spacy-transformers

Programming Languages

python

Projects that are alternatives to or similar to ginza-transformers

nlp workshop odsc europe20
Extensive tutorials for the Advanced NLP Workshop in Open Data Science Conference Europe 2020. We will leverage machine learning, deep learning and deep transfer learning to learn and solve popular tasks using NLP including NER, Classification, Recommendation \ Information Retrieval, Summarization, Classification, Language Translation, Q&A and T…
Stars: ✭ 127 (+746.67%)
Mutual labels:  transformers, spacy
anonymisation
Anonymization of legal cases (Fr) based on Flair embeddings
Stars: ✭ 85 (+466.67%)
Mutual labels:  transformers, spacy
converse
Conversational text Analysis using various NLP techniques
Stars: ✭ 147 (+880%)
Mutual labels:  transformers, spacy
jax-models
Unofficial JAX implementations of deep learning research papers
Stars: ✭ 108 (+620%)
Mutual labels:  transformers
Transformer-in-PyTorch
Transformer/Transformer-XL/R-Transformer examples and explanations
Stars: ✭ 21 (+40%)
Mutual labels:  transformers
uniformer-pytorch
Implementation of Uniformer, a simple attention and 3d convolutional net that achieved SOTA in a number of video classification tasks, debuted in ICLR 2022
Stars: ✭ 90 (+500%)
Mutual labels:  transformers
question generator
An NLP system for generating reading comprehension questions
Stars: ✭ 188 (+1153.33%)
Mutual labels:  transformers
spacymoji
💙 Emoji handling and meta data for spaCy with custom extension attributes
Stars: ✭ 174 (+1060%)
Mutual labels:  spacy
danish transformers
A collection of Danish Transformers
Stars: ✭ 30 (+100%)
Mutual labels:  transformers
clip-italian
CLIP (Contrastive Language–Image Pre-training) for Italian
Stars: ✭ 113 (+653.33%)
Mutual labels:  transformers
Basic-UI-for-GPT-J-6B-with-low-vram
A repository to run gpt-j-6b on low vram machines (4.2 gb minimum vram for 2000 token context, 3.5 gb for 1000 token context). Model loading takes 12gb free ram.
Stars: ✭ 90 (+500%)
Mutual labels:  transformers
text
Using Transformers from HuggingFace in R
Stars: ✭ 66 (+340%)
Mutual labels:  transformers
spacy conll
Pipeline component for spaCy (and other spaCy-wrapped parsers such as spacy-stanza and spacy-udpipe) that adds CoNLL-U properties to a Doc and its sentences and tokens. Can also be used as a command-line tool.
Stars: ✭ 60 (+300%)
Mutual labels:  spacy
NER-and-Linking-of-Ancient-and-Historic-Places
An NER tool for ancient place names based on Pleiades and Spacy.
Stars: ✭ 26 (+73.33%)
Mutual labels:  spacy
Quora QuestionPairs DL
Kaggle Competition: Using deep learning to solve quora's question pairs problem
Stars: ✭ 54 (+260%)
Mutual labels:  spacy
NLP Quickbook
NLP in Python with Deep Learning
Stars: ✭ 516 (+3340%)
Mutual labels:  spacy
TransQuest
Transformer based translation quality estimation
Stars: ✭ 85 (+466.67%)
Mutual labels:  transformers
topic modelling financial news
Topic modelling on financial news with Natural Language Processing
Stars: ✭ 51 (+240%)
Mutual labels:  spacy
KnowledgeEditor
Code for Editing Factual Knowledge in Language Models
Stars: ✭ 86 (+473.33%)
Mutual labels:  transformers
oreilly-bert-nlp
This repository contains code for the O'Reilly Live Online Training for BERT
Stars: ✭ 19 (+26.67%)
Mutual labels:  transformers

ginza-transformers: Use custom tokenizers in spacy-transformers

ginza-transformers is a simple extension of spacy-transformers that makes it possible to use custom tokenizers (defined outside of huggingface/transformers) in the transformer pipeline component of spaCy v3. ginza-transformers can also download models from the Hugging Face Hub automatically at run time.

Fallback mechanisms

There are two fallback mechanisms in ginza-transformers.

Custom tokenizer fallback

ginza-transformers loads a custom tokenizer specified in the components.transformer.model.tokenizer_config.tokenizer_class attribute of a spaCy language model package's config.cfg as follows (see the sketch after this list).

  • ginza-transformers first tries to import the tokenizer class in the standard huggingface/transformers manner (via AutoTokenizer.from_pretrained())
  • If AutoTokenizer.from_pretrained() raises a ValueError, the fallback logic of ginza-transformers imports the class via importlib.import_module, using the tokenizer_class value
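
A minimal sketch of this two-step lookup is shown below. The helper name and signature are illustrative, not the actual ginza-transformers API; the tokenizer_class string follows the training config example further down.

import importlib

from transformers import AutoTokenizer

def load_custom_tokenizer(name_or_path, tokenizer_class, **kwargs):
    # Hypothetical helper illustrating the fallback described above.
    try:
        # Standard huggingface/transformers route
        return AutoTokenizer.from_pretrained(name_or_path, **kwargs)
    except ValueError:
        # Fallback: treat tokenizer_class as "package.module.ClassName" and import it directly,
        # e.g. "sudachitra.tokenization_electra_sudachipy.ElectraSudachipyTokenizer"
        module_name, class_name = tokenizer_class.rsplit(".", 1)
        cls = getattr(importlib.import_module(module_name), class_name)
        return cls.from_pretrained(name_or_path, **kwargs)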

Model loading at run time

ginza-transformers downloads model files published on the Hugging Face Hub at run time as follows (see the sketch after this list).

  • ginza-transformers first tries to load the local model directory (i.e. /${local_spacy_model_dir}/transformer/model/)
  • If an OSError is raised, the first fallback passes the model name specified in the components.transformer.model.name attribute of config.cfg to AutoModel.from_pretrained() with the local_files_only=True option, so only the local cache is consulted and the Hugging Face Hub is not contacted at this point
  • If the first fallback also raises an OSError, the second fallback executes AutoModel.from_pretrained() without the local_files_only option, which searches for the specified model name on the Hugging Face Hub
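
The cascade can be pictured roughly as follows. This is a sketch only; the function name is hypothetical and error handling is simplified.

from transformers import AutoModel

def load_transformer_model(local_model_dir, model_name):
    try:
        # 1. Local model directory bundled inside the spaCy model package
        return AutoModel.from_pretrained(local_model_dir)
    except OSError:
        pass
    try:
        # 2. Local Hugging Face cache only; the Hub is not contacted at this point
        return AutoModel.from_pretrained(model_name, local_files_only=True)
    except OSError:
        # 3. Finally, search for and download the model from the Hugging Face Hub
        return AutoModel.from_pretrained(model_name)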

How to use

Before executing the spacy train command, make sure that spaCy is working with CUDA support, and then install this package:

pip install -U ginza-transformers
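
A quick way to confirm the CUDA setup before training (an optional check, not part of ginza-transformers itself):

import spacy

# Try to activate the GPU; returns False when spaCy falls back to CPU
if not spacy.prefer_gpu():
    raise SystemExit("spaCy could not activate the GPU; check your CUDA installation")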

When performing analysis, you need to use a config.cfg with different settings from the one used for spacy train.

Settings for the training phase

Here is an example of spaCy's config.cfg for the training phase. With this config, ginza-transformers employs SudachiTra as the transformer tokenizer and uses megagonlabs/transformers-ud-japanese-electra-base-discriminator as the pretrained transformer model. The attributes of the training phase that differ from the spacy-transformers defaults are as follows:

[components.transformer.model]
@architectures = "ginza-transformers.TransformerModel.v1"
name = "megagonlabs/transformers-ud-japanese-electra-base-discriminator"

[components.transformer.model.tokenizer_config]
use_fast = false
tokenizer_class = "sudachitra.tokenization_electra_sudachipy.ElectraSudachipyTokenizer"
do_lower_case = false
do_word_tokenize = true
do_subword_tokenize = true
word_tokenizer_type = "sudachipy"
subword_tokenizer_type = "wordpiece"
word_form_type = "dictionary_and_surface"

[components.transformer.model.tokenizer_config.sudachipy_kwargs]
split_mode = "A"
dict_type = "core"
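
With that config in place, training runs through the standard spaCy CLI; the output directory and GPU id below are placeholders (add --paths.* overrides as your config requires):

python -m spacy train config.cfg --output ./output --gpu-id 0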

Settings for the analysis phase

Here is an example of config.cfg for the analysis phase. This config references megagonlabs/transformers-ud-japanese-electra-base-ginza. The transformer model specified in components.transformer.model.name is downloaded from the Hugging Face Hub at run time. The attributes of the analysis phase that differ from the training phase are as follows:

[components.transformer]
factory = "transformer_custom"

[components.transformer.model]
name = "megagonlabs/transformers-ud-japanese-electra-base-ginza"