
wss1996 / Name-disambiguation

License: Apache-2.0
An engineering solution for disambiguating papers by same-name authors (based on the first-place solution of the 2019 BAAI–AMiner name disambiguation competition).

Programming Languages: Jupyter Notebook, Python

Projects that are alternatives to or similar to Name-disambiguation

word-embeddings-from-scratch
Creating word embeddings from scratch and visualizing them on TensorBoard. Using trained embeddings in Keras.
Stars: ✭ 22 (+29.41%)
Mutual labels:  word2vec
GE-FSG
Graph Embedding via Frequent Subgraphs
Stars: ✭ 39 (+129.41%)
Mutual labels:  word2vec
stackoverflow-semantic-search
Word2Vec encodings based search engine for Stackoverflow questions
Stars: ✭ 23 (+35.29%)
Mutual labels:  word2vec
Word2VecAndTsne
Scripts demo-ing how to train a Word2Vec model and reduce its vector space
Stars: ✭ 45 (+164.71%)
Mutual labels:  word2vec
yap
Yet Another (natural language) Parser
Stars: ✭ 40 (+135.29%)
Mutual labels:  disambiguation
word2vec-movies
Bag of words meets bags of popcorn in Python 3 (Chinese tutorial)
Stars: ✭ 54 (+217.65%)
Mutual labels:  word2vec
Simple-Sentence-Similarity
Exploring the simple sentence similarity measurements using word embeddings
Stars: ✭ 99 (+482.35%)
Mutual labels:  word2vec
wordmap
Visualize large text collections with WebGL
Stars: ✭ 23 (+35.29%)
Mutual labels:  word2vec
skip-gram-Chinese
skip-gram for Chinese word2vec based on TensorFlow
Stars: ✭ 20 (+17.65%)
Mutual labels:  word2vec
Word-Embeddings-and-Document-Vectors
An evaluation of word-embeddings for classification
Stars: ✭ 32 (+88.24%)
Mutual labels:  word2vec
Word2Vec-iOS
Word2Vec iOS port
Stars: ✭ 23 (+35.29%)
Mutual labels:  word2vec
two-stream-cnn
A two-stream convolutional neural network for learning arbitrary similarity functions over two sets of training data
Stars: ✭ 24 (+41.18%)
Mutual labels:  word2vec
asm2vec
An unofficial implementation of asm2vec as a standalone python package
Stars: ✭ 127 (+647.06%)
Mutual labels:  word2vec
Vaaku2Vec
Language Modeling and Text Classification in Malayalam Language using ULMFiT
Stars: ✭ 68 (+300%)
Mutual labels:  word2vec
acl2017 document clustering
code for "Determining Gains Acquired from Word Embedding Quantitatively Using Discrete Distribution Clustering" ACL 2017
Stars: ✭ 21 (+23.53%)
Mutual labels:  word2vec
grad-cam-text
Implementation of Grad-CAM for text.
Stars: ✭ 37 (+117.65%)
Mutual labels:  word2vec
hyperstar
Hyperstar: Negative Sampling Improves Hypernymy Extraction Based on Projection Learning.
Stars: ✭ 24 (+41.18%)
Mutual labels:  word2vec
Emotion-recognition-from-tweets
A comprehensive approach on recognizing emotion (sentiment) from a certain tweet. Supervised machine learning.
Stars: ✭ 17 (+0%)
Mutual labels:  word2vec
receiptdID
Receipt.ID is a multi-label, multi-class, hierarchical classification system implemented in a two layer feed forward network.
Stars: ✭ 22 (+29.41%)
Mutual labels:  word2vec
doc2vec-api
document embedding and machine learning script for beginners
Stars: ✭ 92 (+441.18%)
Mutual labels:  word2vec

SCI author name disambiguation: an engineering solution (no official evaluation)

Author name disambiguation based on paper-relation embeddings and IDF-weighted semantic embeddings of papers. The approach performs well on test data: it reaches a pairwise F1 of 0.7049 on the 100-name test set from Jie Tang et al.'s paper, better than most current open-source solutions, and an F1 of 0.859 on a self-built dataset.
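
The core of the semantic representation is simple: train word vectors on the paper corpus, then average each paper's token vectors weighted by IDF. Below is a minimal sketch of that idea; the function names and the gensim 4.x parameter names are illustrative assumptions, and the project's actual logic lives in embedding.py and preprocessing.py.

    # Minimal sketch: IDF-weighted average of word2vec vectors as a paper embedding.
    # Function names and parameters are illustrative, not the project's actual API.
    import math
    from collections import Counter

    import numpy as np
    from gensim.models import Word2Vec

    def train_word2vec(tokenized_papers, dim=100):
        # tokenized_papers: list of token lists, one per paper (title, keywords, etc.)
        return Word2Vec(sentences=tokenized_papers, vector_size=dim, min_count=1)

    def compute_idf(tokenized_papers):
        n_docs = len(tokenized_papers)
        df = Counter()
        for tokens in tokenized_papers:
            df.update(set(tokens))
        return {w: math.log(n_docs / (1.0 + c)) for w, c in df.items()}

    def paper_embedding(tokens, model, idf, dim=100):
        # IDF-weighted average of the word vectors of a paper's tokens.
        vec, weight_sum = np.zeros(dim), 0.0
        for w in tokens:
            if w in model.wv:
                weight = idf.get(w, 1.0)
                vec += weight * model.wv[w]
                weight_sum += weight
        return vec / weight_sum if weight_sum > 0 else vec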

Main dependencies

  • Python 3.6
  • gensim
  • lmdb
  • sklearn.cluster.DBSCAN
  • nltk
  • A Linux server with 16 CPU cores and 64 GB of RAM
  • The full list of dependencies is in requirements.txt:
pip install -r requirements.txt

Note: running this project consumes more than 150 GB of disk space (mainly for intermediate data, including saved features and paper embeddings). The full pipeline takes 2-3 working days. Running the project on a Linux server is recommended.

How to run

Utils (utilities)

We provide several utility modules, mainly including:

  • cache.py (CRUD operations for the LMDB database)
  • data_utils.py (loading and saving files in json, pkl, and similar formats)
  • embedding.py (word-vector training and paper embedding)
  • macro_pairwise_f1.py (pairwise macro-F1 evaluation; see the sketch after this list)
  • settings.py (creates the required directories)
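
For reference, here is a minimal sketch of how a pairwise macro-F1 metric is commonly computed; this is an assumption about the semantics of macro_pairwise_f1.py, not its actual code. For each ambiguous name, paper pairs placed in the same predicted cluster are compared against pairs in the same ground-truth cluster, and the per-name F1 scores are averaged.

    # Minimal sketch of pairwise macro-F1 (assumed semantics, illustrative code).
    from itertools import combinations

    def pairwise_f1(true_labels, pred_labels):
        # Labels are per-paper cluster ids for one ambiguous name.
        pairs = list(combinations(range(len(true_labels)), 2))
        same_true = {p for p in pairs if true_labels[p[0]] == true_labels[p[1]]}
        same_pred = {p for p in pairs if pred_labels[p[0]] == pred_labels[p[1]]}
        tp = len(same_true & same_pred)  # pairs correctly placed together
        if tp == 0:
            return 0.0
        precision = tp / len(same_pred)
        recall = tp / len(same_true)
        return 2 * precision * recall / (precision + recall)

    def macro_pairwise_f1(results):
        # results: {name: (true_labels, pred_labels)}; average F1 over names.
        scores = [pairwise_f1(t, p) for t, p in results.values()]
        return sum(scores) / len(scores) if scores else 0.0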

data (data sources)

The data sources are the required SCI paper data exported from Hive, together with tables recording the relation between paper ids and authors. (These tables are obtained through a series of SQL queries.) The table names are:

  • jingxinwei.t_018_sci_disamb_string_precess (metadata for more than 18 million papers, mainly the id, title, keyword, author, and source fields)

  • jingxinwei.t_018_uid_name_org_preccess_50 (author-name/paper-id records for authors with more than 50 papers: over 30 million rows)

  • jingxinwei.t_018_uid_name_org_preccess_10_50 (author-name/paper-id records for authors with more than 10 and fewer than 50 papers: over 30 million rows)

Running the code (main pipeline)

  • name_author_precess_sci.ipynb (open and run this notebook; it converts the author-name/paper-id records into dictionaries of papers grouped by identical name)
  • python preprocessing.py (preprocesses the 18+ million SCI papers, trains word vectors, builds IDF-weighted semantic embeddings of the papers, and loads the results into LMDB)
  • python sci_disamb_bynetwork.py (the main multi-process disambiguation program; see the clustering sketch after this list)
  • json_tocsv.ipynb (open and run this notebook; it assigns a unique author_id to each disambiguation result and converts the output to CSV relation records)
  • Evaluation_sci/evaluation_sci_10.ipynb (evaluates the results on the 10 self-built SCI datasets; the evaluation F1 score is 0.859)
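
The dependency list names sklearn.cluster.DBSCAN, so the disambiguation step presumably clusters each name's paper embeddings and treats every cluster as one distinct author. Below is a minimal sketch under that assumption; eps, min_samples, and the cosine metric are illustrative choices, not the project's tuned values.

    # Minimal sketch of the per-name clustering step (assumed, illustrative).
    import numpy as np
    from sklearn.cluster import DBSCAN

    def disambiguate_one_name(paper_ids, embeddings, eps=0.2, min_samples=2):
        # embeddings: (n_papers, dim) array aligned with paper_ids.
        X = np.asarray(embeddings)
        # Cosine distance is a reasonable choice for IDF-weighted word2vec embeddings.
        labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(X)
        clusters = {}
        for pid, label in zip(paper_ids, labels):
            # DBSCAN marks outliers with -1; here each outlier becomes its own author.
            key = f"outlier_{pid}" if label == -1 else int(label)
            clusters.setdefault(key, []).append(pid)
        return clusters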

Other scripts

  • sci_paper_tojson.ipynb (builds a single JSON file of SCI papers grouped by identical name; no longer needed now that LMDB provides multi-process access)
  • sci_train_word2vec_alldata.ipynb (word-vector training on the 18 million SCI papers; this step is also included in preprocessing.py)
  • sci_to_lmdb.ipynb (loads the information, preprocessed features, and paper embeddings of the 18 million papers into LMDB; see the sketch after this list)
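
Since cache.py is described as LMDB CRUD and sci_to_lmdb.ipynb loads features and embeddings into LMDB, a minimal sketch of such a cache might look like the following. The class and method names are illustrative, and pickle is just one possible serialization.

    # Minimal sketch of an LMDB-backed cache (assumed interface, illustrative).
    import pickle

    import lmdb

    class LMDBCache:
        def __init__(self, path, map_size=150 * 1024 ** 3):
            # map_size reserves address space for the database (here up to ~150 GB).
            self.env = lmdb.open(path, map_size=map_size)

        def put(self, key, value):
            # Store a pickled value under a string key (e.g. a paper id).
            with self.env.begin(write=True) as txn:
                txn.put(key.encode("utf-8"), pickle.dumps(value))

        def get(self, key):
            # Read back and unpickle a value; returns None if the key is missing.
            with self.env.begin() as txn:
                raw = txn.get(key.encode("utf-8"))
                return None if raw is None else pickle.loads(raw)

    # Usage: cache = LMDBCache("paper_embeddings.lmdb"); cache.put(paper_id, vec)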

Reference code links
