
megagonlabs / ginza-transformers

License: MIT
Use custom tokenizers in spacy-transformers

Programming Languages

python

Projects that are alternatives to or similar to ginza-transformers

nlp workshop odsc europe20
Extensive tutorials for the Advanced NLP Workshop in Open Data Science Conference Europe 2020. We will leverage machine learning, deep learning and deep transfer learning to learn and solve popular tasks using NLP including NER, Classification, Recommendation \ Information Retrieval, Summarization, Classification, Language Translation, Q&A and T…
Stars: ✭ 127 (+746.67%)
Mutual labels:  transformers, spacy
anonymisation
Anonymization of legal cases (Fr) based on Flair embeddings
Stars: ✭ 85 (+466.67%)
Mutual labels:  transformers, spacy
converse
Conversational text Analysis using various NLP techniques
Stars: ✭ 147 (+880%)
Mutual labels:  transformers, spacy
jax-models
Unofficial JAX implementations of deep learning research papers
Stars: ✭ 108 (+620%)
Mutual labels:  transformers
Transformer-in-PyTorch
Transformer/Transformer-XL/R-Transformer examples and explanations
Stars: ✭ 21 (+40%)
Mutual labels:  transformers
uniformer-pytorch
Implementation of Uniformer, a simple attention and 3d convolutional net that achieved SOTA in a number of video classification tasks, debuted in ICLR 2022
Stars: ✭ 90 (+500%)
Mutual labels:  transformers
question generator
An NLP system for generating reading comprehension questions
Stars: ✭ 188 (+1153.33%)
Mutual labels:  transformers
spacymoji
💙 Emoji handling and meta data for spaCy with custom extension attributes
Stars: ✭ 174 (+1060%)
Mutual labels:  spacy
danish transformers
A collection of Danish Transformers
Stars: ✭ 30 (+100%)
Mutual labels:  transformers
clip-italian
CLIP (Contrastive Language–Image Pre-training) for Italian
Stars: ✭ 113 (+653.33%)
Mutual labels:  transformers
Basic-UI-for-GPT-J-6B-with-low-vram
A repository to run gpt-j-6b on low vram machines (4.2 gb minimum vram for 2000 token context, 3.5 gb for 1000 token context). Model loading takes 12gb free ram.
Stars: ✭ 90 (+500%)
Mutual labels:  transformers
text
Using Transformers from HuggingFace in R
Stars: ✭ 66 (+340%)
Mutual labels:  transformers
spacy conll
Pipeline component for spaCy (and other spaCy-wrapped parsers such as spacy-stanza and spacy-udpipe) that adds CoNLL-U properties to a Doc and its sentences and tokens. Can also be used as a command-line tool.
Stars: ✭ 60 (+300%)
Mutual labels:  spacy
NER-and-Linking-of-Ancient-and-Historic-Places
An NER tool for ancient place names based on Pleiades and Spacy.
Stars: ✭ 26 (+73.33%)
Mutual labels:  spacy
Quora QuestionPairs DL
Kaggle Competition: Using deep learning to solve quora's question pairs problem
Stars: ✭ 54 (+260%)
Mutual labels:  spacy
NLP Quickbook
NLP in Python with Deep Learning
Stars: ✭ 516 (+3340%)
Mutual labels:  spacy
TransQuest
Transformer based translation quality estimation
Stars: ✭ 85 (+466.67%)
Mutual labels:  transformers
topic modelling financial news
Topic modelling on financial news with Natural Language Processing
Stars: ✭ 51 (+240%)
Mutual labels:  spacy
KnowledgeEditor
Code for Editing Factual Knowledge in Language Models
Stars: ✭ 86 (+473.33%)
Mutual labels:  transformers
oreilly-bert-nlp
This repository contains code for the O'Reilly Live Online Training for BERT
Stars: ✭ 19 (+26.67%)
Mutual labels:  transformers

ginza-transformers: Use custom tokenizers in spacy-transformers

ginza-transformers is a simple extension of spacy-transformers that makes it possible to use custom tokenizers (defined outside of huggingface/transformers) in the transformer pipeline component of spaCy v3. ginza-transformers can also download models from the Hugging Face Hub automatically at run time.

Fallback mechanisms

There are two fallback mechanisms in ginza-transformers.

Custom tokenizer fallback

ginza-transformers loads a custom tokenizer specified in the components.transformer.model.tokenizer_config.tokenizer_class attribute of a spaCy language model package's config.cfg as follows (see the sketch after this list).

  • ginza-transformers first tries to import the tokenizer class in the standard huggingface/transformers manner (via AutoTokenizer.from_pretrained())
  • If AutoTokenizer.from_pretrained() raises a ValueError, the fallback logic of ginza-transformers imports the class via importlib.import_module, using the tokenizer_class value
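
A minimal sketch of this two-step lookup is shown below. The helper name and signature are illustrative, not the actual ginza-transformers API; the tokenizer_class string follows the training config example further down.

import importlib

from transformers import AutoTokenizer

def load_custom_tokenizer(name_or_path, tokenizer_class, **kwargs):
    # Hypothetical helper illustrating the fallback described above.
    try:
        # Standard huggingface/transformers route
        return AutoTokenizer.from_pretrained(name_or_path, **kwargs)
    except ValueError:
        # Fallback: treat tokenizer_class as "package.module.ClassName" and import it directly,
        # e.g. "sudachitra.tokenization_electra_sudachipy.ElectraSudachipyTokenizer"
        module_name, class_name = tokenizer_class.rsplit(".", 1)
        cls = getattr(importlib.import_module(module_name), class_name)
        return cls.from_pretrained(name_or_path, **kwargs)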

Model loading at run time

ginza-transformers downloads model files published on the Hugging Face Hub at run time as follows (see the sketch after this list).

  • ginza-transformers first tries to load the local model directory (i.e. /${local_spacy_model_dir}/transformer/model/)
  • If an OSError is raised, the first fallback passes the model name specified in the components.transformer.model.name attribute of config.cfg to AutoModel.from_pretrained() with the local_files_only=True option, so only the local cache is consulted and the Hugging Face Hub is not contacted at this point
  • If the first fallback also raises an OSError, the second fallback executes AutoModel.from_pretrained() without the local_files_only option, which searches for the specified model name on the Hugging Face Hub
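
The cascade can be pictured roughly as follows. This is a sketch only; the function name is hypothetical and error handling is simplified.

from transformers import AutoModel

def load_transformer_model(local_model_dir, model_name):
    try:
        # 1. Local model directory bundled inside the spaCy model package
        return AutoModel.from_pretrained(local_model_dir)
    except OSError:
        pass
    try:
        # 2. Local Hugging Face cache only; the Hub is not contacted at this point
        return AutoModel.from_pretrained(model_name, local_files_only=True)
    except OSError:
        # 3. Finally, search for and download the model from the Hugging Face Hub
        return AutoModel.from_pretrained(model_name)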

How to use

Before executing the spacy train command, make sure that spaCy is working with CUDA support, and then install this package:

pip install -U ginza-transformers
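
A quick way to confirm the CUDA setup before training (an optional check, not part of ginza-transformers itself):

import spacy

# Try to activate the GPU; returns False when spaCy falls back to CPU
if not spacy.prefer_gpu():
    raise SystemExit("spaCy could not activate the GPU; check your CUDA installation")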

When performing analysis, you need to use a config.cfg with different settings from the one used for spacy train.

Settings for the training phase

Here is an example of spaCy's config.cfg for the training phase. With this config, ginza-transformers employs SudachiTra as the transformer tokenizer and uses megagonlabs/transformers-ud-japanese-electra-base-discriminator as the pretrained transformer model. The attributes of the training phase that differ from the spacy-transformers defaults are as follows:

[components.transformer.model]
@architectures = "ginza-transformers.TransformerModel.v1"
name = "megagonlabs/transformers-ud-japanese-electra-base-discriminator"

[components.transformer.model.tokenizer_config]
use_fast = false
tokenizer_class = "sudachitra.tokenization_electra_sudachipy.ElectraSudachipyTokenizer"
do_lower_case = false
do_word_tokenize = true
do_subword_tokenize = true
word_tokenizer_type = "sudachipy"
subword_tokenizer_type = "wordpiece"
word_form_type = "dictionary_and_surface"

[components.transformer.model.tokenizer_config.sudachipy_kwargs]
split_mode = "A"
dict_type = "core"
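
With that config in place, training runs through the standard spaCy CLI; the output directory and GPU id below are placeholders (add --paths.* overrides as your config requires):

python -m spacy train config.cfg --output ./output --gpu-id 0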

Settings for the analysis phase

Here is an example of config.cfg for the analysis phase. This config references megagonlabs/transformers-ud-japanese-electra-base-ginza. The transformer model specified in components.transformer.model.name is downloaded from the Hugging Face Hub at run time. The attributes of the analysis phase that differ from the training phase are as follows:

[components.transformer]
factory = "transformer_custom"

[components.transformer.model]
name = "megagonlabs/transformers-ud-japanese-electra-base-ginza"