
kwonmha / Bert Vocab Builder

Builds a wordpiece (subword) vocabulary compatible with Google Research's BERT

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Bert Vocab Builder

Fastnlp
fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
Stars: ✭ 2,441 (+1205.35%)
Mutual labels:  natural-language-processing
Sentence Similarity
This repository contains various ways to calculate sentence vector similarity using NLP models
Stars: ✭ 182 (-2.67%)
Mutual labels:  natural-language-processing
Glad
Global-Locally Self-Attentive Dialogue State Tracker
Stars: ✭ 185 (-1.07%)
Mutual labels:  natural-language-processing
Cs224n 2019
My completed implementation solutions for CS224N 2019
Stars: ✭ 178 (-4.81%)
Mutual labels:  natural-language-processing
Deeptoxic
Top 1% solution to the toxic comment classification challenge on Kaggle.
Stars: ✭ 180 (-3.74%)
Mutual labels:  natural-language-processing
Recurrent Convolutional Neural Network Text Classifier
My (slightly modified) Keras implementation of the Recurrent Convolutional Neural Network (RCNN) described here: http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9745.
Stars: ✭ 182 (-2.67%)
Mutual labels:  natural-language-processing
Web Database Analytics
Web scraping and related analytics using Python tools
Stars: ✭ 175 (-6.42%)
Mutual labels:  natural-language-processing
Deep Generative Models For Natural Language Processing
DGMs for NLP. A roadmap.
Stars: ✭ 185 (-1.07%)
Mutual labels:  natural-language-processing
Kb Infobot
A dialogue bot for information access
Stars: ✭ 181 (-3.21%)
Mutual labels:  natural-language-processing
Dkpro Core
Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
Stars: ✭ 184 (-1.6%)
Mutual labels:  natural-language-processing
Cookiecutter Spacy Fastapi
Cookiecutter API for creating Custom Skills for Azure Search using Python and Docker
Stars: ✭ 179 (-4.28%)
Mutual labels:  natural-language-processing
Nlp profiler
A simple NLP library that allows profiling datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.
Stars: ✭ 181 (-3.21%)
Mutual labels:  natural-language-processing
Texar
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
Stars: ✭ 2,236 (+1095.72%)
Mutual labels:  natural-language-processing
Nel
Entity linking framework
Stars: ✭ 176 (-5.88%)
Mutual labels:  natural-language-processing
Id Nlp Resource
A list of Indonesian NLP resources.
Stars: ✭ 185 (-1.07%)
Mutual labels:  natural-language-processing
Cleannlp
R package providing annotators and a normalized data model for natural language processing
Stars: ✭ 174 (-6.95%)
Mutual labels:  natural-language-processing
Bert Sklearn
A sklearn wrapper for Google's BERT model
Stars: ✭ 182 (-2.67%)
Mutual labels:  natural-language-processing
Deepinterests
Fun with deep learning (深度有趣)
Stars: ✭ 2,232 (+1093.58%)
Mutual labels:  natural-language-processing
Neuralqa
NeuralQA: A Usable Library for Question Answering on Large Datasets with BERT
Stars: ✭ 185 (-1.07%)
Mutual labels:  natural-language-processing
Hntitlenator
Test your HN title against a neural network
Stars: ✭ 184 (-1.6%)
Mutual labels:  natural-language-processing

Vocabulary builder for BERT

A modified, simplified version of text_encoder_build_subword.py and its dependencies from the tensor2tensor library, changed so that its output fits Google Research's open-sourced BERT project.


Although Google released pre-trained BERT models and training scripts, they did not open-source the code used to generate a wordpiece (subword) vocabulary matching the vocab.txt shipped with the released models.
And, as they themselves note, the libraries they suggested are not compatible with BERT's tokenization.py.
So I modified text_encoder_build_subword.py from the tensor2tensor library, one of the libraries Google suggested, to generate a compatible wordpiece vocabulary.
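
To see why the vocabulary format matters, here is a minimal sketch of the greedy longest-match wordpiece algorithm that BERT's tokenization.py applies (an illustration written for this README, not code from this repository): a word is only segmented correctly if every continuation piece appears in the vocabulary with a "##" prefix.

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    # Greedy longest-match-first segmentation, as in BERT's WordpieceTokenizer.
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # non-initial pieces carry the "##" marker
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk_token]  # the word cannot be segmented with this vocab
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece_tokenize("unaffable", {"un", "##aff", "##able"}))
# ['un', '##aff', '##able']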

Modifications

  • The original SubwordTextEncoder appends "_" to subwords that appear at the first position of words. I changed it to prepend "_" to subwords that follow other subwords, using the _my_escape_token() function, and later substituted "_" with "##" (see the sketch after this list).

  • The generated vocabulary contains all characters, plus all characters prefixed with "##". For example, both a and ##a.

  • Standard special characters (punctuation) and the special tokens used for BERT, e.g. [SEP], [CLS], [MASK], [UNK], are always added to the vocabulary.

  • Removed irrelevant classes from text_encoder.py, commented out unused functions (some of which appear to exist for decoding), and removed the mlperf_log module, making this project independent of the tensor2tensor library.
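
A rough sketch of the first three modifications above; _my_escape_token() is the function named in the list, but the code below is a hypothetical reconstruction for illustration, not the repository's actual implementation:

BERT_SPECIAL_TOKENS = ["[UNK]", "[CLS]", "[SEP]", "[MASK]"]  # order illustrative

def to_bert_style(subwords):
    # Replace the leading "_" marker on non-initial subwords with "##".
    return [("##" + s[1:]) if s.startswith("_") else s for s in subwords]

def finalize_vocab(subwords, alphabet):
    # Special tokens first, then every character in plain and "##" form,
    # then the learned subwords (skipping duplicates).
    vocab = list(BERT_SPECIAL_TOKENS)
    vocab += sorted(alphabet) + ["##" + c for c in sorted(alphabet)]
    vocab += [s for s in to_bert_style(subwords) if s not in vocab]
    return vocab

print(to_bert_style(["un", "_aff", "_able"]))   # ['un', '##aff', '##able']
print(finalize_vocab(["un", "_aff"], {"a", "u"})[:6])
# ['[UNK]', '[CLS]', '[SEP]', '[MASK]', 'a', 'u']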

Requirement

The environment I built this project in consists of:

  • Python 3.6
  • TensorFlow 1.11

Basic usage

python subword_builder.py \
--corpus_filepattern "{corpus_for_vocab}" \
--output_filename {name_of_vocab} \
--min_count {minimum_subtoken_counts}
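
For example, with a set of plain-text corpus files (the paths and values below are placeholders, not files shipped with this project):

python subword_builder.py \
--corpus_filepattern "data/corpus-*.txt" \
--output_filename vocab.txt \
--min_count 5

The resulting file lists one token per line, which is the format BERT's tokenization.py loads its vocabulary from.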