XufengXufengXufeng / Electra_with_tensorflow

Licence: other
This is an implementation of ELECTRA as described in the paper "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators".

Projects that are alternatives to or similar to Electra_with_tensorflow

Thulac Python
An Efficient Lexical Analyzer for Chinese
Stars: ✭ 1,619 (+12353.85%)
Mutual labels:  chinese-nlp
Weatherbot
A Rasa-based Chinese weather-inquiry chatbot with a Web UI
Stars: ✭ 186 (+1330.77%)
Mutual labels:  chinese-nlp
ChineseNounPhraseExtraction
Extracts noun phrases from Chinese corpora using part-of-speech templates
Stars: ✭ 18 (+38.46%)
Mutual labels:  chinese-nlp
Gossiping Chinese Corpus
A Chinese question-answer corpus from the PTT Gossiping board
Stars: ✭ 137 (+953.85%)
Mutual labels:  chinese-nlp
Fastnlp
fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
Stars: ✭ 2,441 (+18676.92%)
Mutual labels:  chinese-nlp
Nlp4han
A Chinese NLP toolkit (sentence splitting / word segmentation / POS tagging / chunking / syntactic parsing / semantic analysis / NER / n-gram language models / HMM / pronoun resolution / sentiment analysis / spell checking)
Stars: ✭ 206 (+1484.62%)
Mutual labels:  chinese-nlp
Zhopenie
Chinese Open Information Extraction (Tree-based Triple Relation Extraction Module)
Stars: ✭ 98 (+653.85%)
Mutual labels:  chinese-nlp
facenet-pytorch-glint360k
A PyTorch implementation of the 'FaceNet' paper for training a facial recognition model with Triplet Loss using the glint360k dataset. A pre-trained model using Triplet Loss is available for download.
Stars: ✭ 186 (+1330.77%)
Mutual labels:  pretrained-model
Thuctc
An Efficient Chinese Text Classifier
Stars: ✭ 179 (+1276.92%)
Mutual labels:  chinese-nlp
Fengshenbang-LM
Fengshenbang-LM is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center at IDEA, aiming to serve as infrastructure for Chinese AIGC and cognitive intelligence.
Stars: ✭ 1,813 (+13846.15%)
Mutual labels:  chinese-nlp
Segmentit
A Chinese word segmentation package that works in any JS environment, forked from leizongmin/node-segment
Stars: ✭ 139 (+969.23%)
Mutual labels:  chinese-nlp
G2pc
g2pC: A Context-aware Grapheme-to-Phoneme Conversion module for Chinese
Stars: ✭ 155 (+1092.31%)
Mutual labels:  chinese-nlp
Fancy Nlp
NLP for human. A fast and easy-to-use natural language processing (NLP) toolkit, satisfying your imagination about NLP.
Stars: ✭ 233 (+1692.31%)
Mutual labels:  chinese-nlp
Chinese Chatbot
A Chinese chatbot trained on 100,000 dialogue pairs with an attention mechanism; it generates a meaningful reply to most ordinary questions. The trained model is uploaded and can be run directly; if it doesn't run, the author will livestream eating a keyboard.
Stars: ✭ 124 (+853.85%)
Mutual labels:  chinese-nlp
trt pose hand
Real-time hand pose estimation and gesture classification using TensorRT
Stars: ✭ 137 (+953.85%)
Mutual labels:  pretrained-model
Chinese nlu by using rasa nlu
Use RASA NLU to build a Chinese Natural Language Understanding (NLU) system
Stars: ✭ 99 (+661.54%)
Mutual labels:  chinese-nlp
Lac
Baidu NLP: Chinese word segmentation, part-of-speech tagging, named entity recognition, word importance
Stars: ✭ 2,792 (+21376.92%)
Mutual labels:  chinese-nlp
Chinese-Minority-PLM
CINO: Pre-trained Language Models for Chinese Minority Languages
Stars: ✭ 133 (+923.08%)
Mutual labels:  chinese-nlp
VideoTransformer-pytorch
PyTorch implementation of a collections of scalable Video Transformer Benchmarks.
Stars: ✭ 159 (+1123.08%)
Mutual labels:  pretrained-model
esapp
An unsupervised Chinese word segmentation tool.
Stars: ✭ 13 (+0%)
Mutual labels:  chinese-nlp

Electra_with_tensorflow

This is an implementation of ELECTRA as described in the paper "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators".

Things to know before you read the project:

  1. This is a very raw project, too rough to use in production. It is not well organized or tested, so it is not well suited for research either. It may just provide some ideas when you want to implement ELECTRA yourself.

  2. There are some differences between my implementation and the original ELECTRA paper:

  2.1 I don't have any powerful computing resources, so I haven't used matrix multiplication for masking. Simply put, for each batch I use a random sample size, and every sequence in that batch has the same number of tokens masked (see the sketch after this list).

  3. As you can probably tell, this project is not polished. There may be some errors that I haven't found. I haven't used the datasets from the paper; I used a Chinese dataset instead, so there is no reference for how well the model should perform on it. All in all, this is not a well-tested project, and I suggest you don't dive too deep into it if you want to build a production-ready application.

  4. My TensorFlow code is not written in a standard style, as you may notice. I use a lot of functional programming, partly because I didn't read the TensorFlow user guide thoroughly and partly because I feel comfortable writing functions. I find that data types more complex than the primitive types are mutable, and TensorFlow layers feel more complex than dictionaries, so I just write functions with no tests at all. That's probably why errors may occur when running this project.
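
To make 2.1 a bit more concrete, here is a minimal NumPy sketch of that per-batch masking idea. The helper name, the mask_id argument, and the max_mask default are hypothetical illustrations, not code from this repository.

import numpy as np

# Hypothetical sketch: draw one random mask count for the whole batch,
# then mask that many randomly chosen positions in every sequence,
# so each sequence in the batch has the same number of masked tokens.
def mask_batch(token_ids, mask_id, max_mask=20):
    batch_size, seq_len = token_ids.shape
    n_mask = np.random.randint(1, max_mask + 1)      # one random sample size per batch
    masked = token_ids.copy()
    positions = np.zeros((batch_size, n_mask), dtype=np.int64)
    for i in range(batch_size):
        pos = np.random.choice(seq_len, size=n_mask, replace=False)
        masked[i, pos] = mask_id
        positions[i] = pos
    return masked, positions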

how to run this project

the environment

I use the official TensorFlow image for version 1.14; with Docker, just run

docker pull tensorflow/tensorflow:1.14.0-gpu-py3-jupyter
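
To actually work inside that image, something like the following should do. This is only a sketch: the --gpus flag assumes Docker 19.03+ with the NVIDIA container toolkit installed, and mounting the repository at /tf (the image's notebook directory) is just one reasonable choice.

docker run --gpus all -it -p 8888:8888 -v "$PWD":/tf tensorflow/tensorflow:1.14.0-gpu-py3-jupyter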

the data

the entrance to the program is the data. I don't want to be cruel, but you really have to write your own functions to format your data into a txt file with one tokenized sentence per line (token ids separated by commas). That's the train.txt format.
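
As an illustration only, a formatting script could look like the sketch below; the function, the example sentences, and the unk_id fallback are hypothetical and assume the char2id.json vocabulary file (see configure.yml below) already exists.

import json

# Hypothetical sketch: map each character of a sentence to an id and write
# one comma-separated line of token ids per sentence -- the train.txt format.
def write_train_file(sentences, char2id_loc, train_data_loc, unk_id=1):
    with open(char2id_loc, encoding="utf-8") as f:
        char2id = json.load(f)
    with open(train_data_loc, "w", encoding="utf-8") as out:
        for sent in sentences:
            ids = [str(char2id.get(ch, unk_id)) for ch in sent]
            out.write(",".join(ids) + "\n")

write_train_file(["今天天气不错", "这是一个例子"],
                 "data/processed_data/char2id.json",
                 "data/processed_data/train.txt")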

the configure.yml

datafiles:
  - "data/baike_qa_valid.json"  # my raw valid data location. Remember you should use your own data formatter functions.
  - "data/baike_qa_train.json"  # my raw train data location.
char2id_loc: "data/processed_data/char2id.json"  # the char2id file produced by formatting (data processing); this could be word2id, depending on how you tokenize your raw data.
id2char_loc: "data/processed_data/id2char.json"  # the id2char file produced by formatting.
train_data_loc: "data/processed_data/train.txt"  # the formatted train data. From here you can tell how rough this project is and how lazy I am, as I don't even produce the valid data.
embedding_size: 100  # the embedding size.
generator_size: 50  # the generator hidden size, which is also the discriminator hidden size.
gn_blocks: 1  # the number of generator transformer blocks.
seq_length: 512  # the max sequence length.
gn_heads: 4  # the generator head count.
gff_filter_size: 150  # the generator feed-forward filter size.
g_dev: "/CPU:0"  # the device I use. I once had a GPU, but later I lost it.
dn_blocks: 3  # the number of discriminator transformer blocks.
dn_heads: 6  # the discriminator head count.
dff_filter_size: 300  # the discriminator feed-forward filter size.
d_dev: "/CPU:0"  # the same GPU-loss story.
d_factor: 50  # the factor used to amplify the discriminator loss.
learning_rate: 1e-3  # the learning rate.
max_len: 512  # the max sequence length again. This duplication is a result of my laziness, not a well-thought-out decision.
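
For reference, the file can be loaded with PyYAML; this is just a sketch of how such a config is typically read, not code copied from this repository.

import yaml  # PyYAML

# Load configure.yml into a plain dict and read a couple of fields.
with open("configure.yml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

print(config["datafiles"])   # ['data/baike_qa_valid.json', 'data/baike_qa_train.json']
print(config["seq_length"])  # 512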
