
kwonmha / Bert Vocab Builder

Builds a wordpiece (subword) vocabulary compatible with Google Research's BERT

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Bert Vocab Builder

Fastnlp
fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
Stars: ✭ 2,441 (+1205.35%)
Mutual labels:  natural-language-processing
Sentence Similarity
This repository contains various ways to calculate sentence vector similarity using NLP models
Stars: ✭ 182 (-2.67%)
Mutual labels:  natural-language-processing
Glad
Global-Locally Self-Attentive Dialogue State Tracker
Stars: ✭ 185 (-1.07%)
Mutual labels:  natural-language-processing
Cs224n 2019
My completed implementation solutions for CS224N 2019
Stars: ✭ 178 (-4.81%)
Mutual labels:  natural-language-processing
Deeptoxic
Top 1% solution to the toxic comment classification challenge on Kaggle.
Stars: ✭ 180 (-3.74%)
Mutual labels:  natural-language-processing
Recurrent Convolutional Neural Network Text Classifier
My (slightly modified) Keras implementation of the Recurrent Convolutional Neural Network (RCNN) described here: http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9745.
Stars: ✭ 182 (-2.67%)
Mutual labels:  natural-language-processing
Web Database Analytics
Web scraping and related analytics using Python tools
Stars: ✭ 175 (-6.42%)
Mutual labels:  natural-language-processing
Deep Generative Models For Natural Language Processing
DGMs for NLP. A roadmap.
Stars: ✭ 185 (-1.07%)
Mutual labels:  natural-language-processing
Kb Infobot
A dialogue bot for information access
Stars: ✭ 181 (-3.21%)
Mutual labels:  natural-language-processing
Dkpro Core
Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
Stars: ✭ 184 (-1.6%)
Mutual labels:  natural-language-processing
Cookiecutter Spacy Fastapi
Cookiecutter API for creating Custom Skills for Azure Search using Python and Docker
Stars: ✭ 179 (-4.28%)
Mutual labels:  natural-language-processing
Nlp profiler
A simple NLP library that allows profiling datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.
Stars: ✭ 181 (-3.21%)
Mutual labels:  natural-language-processing
Texar
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
Stars: ✭ 2,236 (+1095.72%)
Mutual labels:  natural-language-processing
Nel
Entity linking framework
Stars: ✭ 176 (-5.88%)
Mutual labels:  natural-language-processing
Id Nlp Resource
A list of Indonesian NLP resources.
Stars: ✭ 185 (-1.07%)
Mutual labels:  natural-language-processing
Cleannlp
R package providing annotators and a normalized data model for natural language processing
Stars: ✭ 174 (-6.95%)
Mutual labels:  natural-language-processing
Bert Sklearn
A sklearn wrapper for Google's BERT model
Stars: ✭ 182 (-2.67%)
Mutual labels:  natural-language-processing
Deepinterests
Fun with deep learning (深度有趣)
Stars: ✭ 2,232 (+1093.58%)
Mutual labels:  natural-language-processing
Neuralqa
NeuralQA: A Usable Library for Question Answering on Large Datasets with BERT
Stars: ✭ 185 (-1.07%)
Mutual labels:  natural-language-processing
Hntitlenator
Test your HN title against a neural network
Stars: ✭ 184 (-1.6%)
Mutual labels:  natural-language-processing

Vocabulary builder for BERT

A modified, simplified version of text_encoder_build_subword.py and its dependencies from the tensor2tensor library, changed so that its output fits Google Research's open-sourced BERT project.


Although Google released pre-trained BERT models and training scripts, they did not open-source the code used to generate a wordpiece (subword) vocabulary matching the vocab.txt shipped with the released models.
And, as they themselves note, the libraries they suggested are not compatible with BERT's tokenization.py.
So I modified text_encoder_build_subword.py from the tensor2tensor library, one of the libraries Google suggested, to generate a compatible wordpiece vocabulary.
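
To see why the vocabulary format matters, here is a minimal sketch of the greedy longest-match wordpiece algorithm that BERT's tokenization.py applies (an illustration written for this README, not code from this repository): a word is only segmented correctly if every continuation piece appears in the vocabulary with a "##" prefix.

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    # Greedy longest-match-first segmentation, as in BERT's WordpieceTokenizer.
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # non-initial pieces carry the "##" marker
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk_token]  # the word cannot be segmented with this vocab
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece_tokenize("unaffable", {"un", "##aff", "##able"}))
# ['un', '##aff', '##able']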

Modifications

  • The original SubwordTextEncoder appends "_" to subwords that appear at the first position of words. I changed it to prepend "_" to subwords that follow other subwords, using the _my_escape_token() function, and later substituted "_" with "##" (see the sketch after this list).

  • The generated vocabulary contains all characters, plus all characters prefixed with "##". For example, both a and ##a.

  • Standard special characters (punctuation) and the special tokens used for BERT, e.g. [SEP], [CLS], [MASK], [UNK], are always added to the vocabulary.

  • Removed irrelevant classes from text_encoder.py, commented out unused functions (some of which appear to exist for decoding), and removed the mlperf_log module, making this project independent of the tensor2tensor library.
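
A rough sketch of the first three modifications above; _my_escape_token() is the function named in the list, but the code below is a hypothetical reconstruction for illustration, not the repository's actual implementation:

BERT_SPECIAL_TOKENS = ["[UNK]", "[CLS]", "[SEP]", "[MASK]"]  # order illustrative

def to_bert_style(subwords):
    # Replace the leading "_" marker on non-initial subwords with "##".
    return [("##" + s[1:]) if s.startswith("_") else s for s in subwords]

def finalize_vocab(subwords, alphabet):
    # Special tokens first, then every character in plain and "##" form,
    # then the learned subwords (skipping duplicates).
    vocab = list(BERT_SPECIAL_TOKENS)
    vocab += sorted(alphabet) + ["##" + c for c in sorted(alphabet)]
    vocab += [s for s in to_bert_style(subwords) if s not in vocab]
    return vocab

print(to_bert_style(["un", "_aff", "_able"]))   # ['un', '##aff', '##able']
print(finalize_vocab(["un", "_aff"], {"a", "u"})[:6])
# ['[UNK]', '[CLS]', '[SEP]', '[MASK]', 'a', 'u']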

Requirement

The environment I built this project in consists of:

  • Python 3.6
  • TensorFlow 1.11

Basic usage

python subword_builder.py \
--corpus_filepattern "{corpus_for_vocab}" \
--output_filename {name_of_vocab} \
--min_count {minimum_subtoken_counts}
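
For example, with a set of plain-text corpus files (the paths and values below are placeholders, not files shipped with this project):

python subword_builder.py \
--corpus_filepattern "data/corpus-*.txt" \
--output_filename vocab.txt \
--min_count 5

The resulting file lists one token per line, which is the format BERT's tokenization.py loads its vocabulary from.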