iesl / stance

License: Apache-2.0
Learned string similarity for entity names using optimal transport.


Projects that are alternatives to or similar to stance

strutil
Golang metrics for calculating string similarity and other string utility functions
Stars: ✭ 114 (+322.22%)
Mutual labels:  string-distance, string-matching, string-similarity
beda
Beda is a golang library for detecting how similar two strings are
Stars: ✭ 34 (+25.93%)
Mutual labels:  string-distance, string-matching, string-similarity
record-linkage-resources
Resources for tackling record linkage / deduplication / data matching problems
Stars: ✭ 67 (+148.15%)
Mutual labels:  record-linkage, entity-resolution
strsim
string similarity based on Dice's coefficient in go
Stars: ✭ 39 (+44.44%)
Mutual labels:  string-matching, string-similarity
splink
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
Stars: ✭ 181 (+570.37%)
Mutual labels:  record-linkage, entity-resolution
fuzzywuzzy
Fuzzy string matching for PHP
Stars: ✭ 60 (+122.22%)
Mutual labels:  string-distance, string-matching
Merge-Machine
Merge Dirty Data with Clean Reference Tables
Stars: ✭ 35 (+29.63%)
Mutual labels:  record-linkage, entity-resolution
Dedupe
🆔 A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
Stars: ✭ 3,241 (+11903.7%)
Mutual labels:  record-linkage, entity-resolution
entity-embed
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
Stars: ✭ 96 (+255.56%)
Mutual labels:  record-linkage, entity-resolution
Levenshtein
The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
Stars: ✭ 38 (+40.74%)
Mutual labels:  string-matching, string-similarity
TeamReference
Team reference for Competitive Programming. Implementations of algorithms commonly used in ACM-ICPC contests. LaTeX template to build your own team reference.
Stars: ✭ 29 (+7.41%)
Mutual labels:  string-matching
string-similarity-js
Lightweight string similarity function for javascript
Stars: ✭ 29 (+7.41%)
Mutual labels:  string-similarity
seqalign
Collection of sequence alignment algorithms.
Stars: ✭ 20 (-25.93%)
Mutual labels:  string-distance
FastFuzzyStringMatcherDotNet
A BK tree implementation for fast fuzzy string matching
Stars: ✭ 23 (-14.81%)
Mutual labels:  string-matching
wildmatch
Simple string matching with question-mark and star wildcard operators
Stars: ✭ 37 (+37.04%)
Mutual labels:  string-matching
whatis
WhatIs.this: simple entity resolution through Wikipedia
Stars: ✭ 18 (-33.33%)
Mutual labels:  entity-resolution
anonaddy
Mobile app for AnonAddy.com.
Stars: ✭ 50 (+85.19%)
Mutual labels:  aliases
conciliator
OpenRefine reconciliation services for VIAF, ORCID, and Open Library + framework for creating more.
Stars: ✭ 95 (+251.85%)
Mutual labels:  entity-resolution
stringbench
String matching algorithm benchmark
Stars: ✭ 31 (+14.81%)
Mutual labels:  string-matching
AwesomeStanceLearning
The page lists recent research developments in the area of Stance Learning.
Stars: ✭ 42 (+55.56%)
Mutual labels:  stance

Stance

Similarity of Transport-Aligned Neural Character Encodings

Optimal Transport-based Alignment of Learned Character Representations for String Similarity. Derek Tam, Nicholas Monath, Ari Kobren, Aaron Traylor, Rajarshi Das, Andrew McCallum. Association for Computational Linguistics (ACL), 2019.
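The core idea is to soft-align two strings' learned character encodings with optimal transport. As a rough, hypothetical illustration only (not this repository's implementation), an entropy-regularized Sinkhorn alignment over a pairwise cost matrix looks like:

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iters=100):
    """Entropy-regularized optimal transport between uniform marginals.
    `cost` is an (m, n) matrix of pairwise distances between the two
    strings' character encodings; the returned plan soft-aligns them."""
    m, n = cost.shape
    a = np.full(m, 1.0 / m)          # uniform weight per character
    b = np.full(n, 1.0 / n)
    K = np.exp(-cost / reg)          # Gibbs kernel
    v = np.ones(n)
    for _ in range(n_iters):         # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy check: with identical one-hot "encodings", mass stays on the diagonal.
enc = np.eye(3)                      # 3 characters, one-hot encodings
cost = 1.0 - enc @ enc.T             # cosine-style cost matrix
plan = sinkhorn_plan(cost)
alignment_cost = float((plan * cost).sum())  # low cost = similar strings
```

The paper's model learns the character encodings (and hence the cost matrix) end to end; the uniform marginals and cosine-style cost above are simplifying assumptions for the sketch.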

Dependencies

Python 3.6
PyTorch 0.4
numpy 1.13.3
scikit-learn 0.21.1
cython
nose

Dataset

The datasets are available at this Google Drive link [Updated 9/7/19]; the data directory should be placed under the top-level stance directory.

Training files are of the form query \t positive \t negative. For example,

William Paget, 1st Baron Paget \t William Lord Paget \t William George Stevens 
William Paget, 1st Baron Paget \t William Lord Paget \t William Tighe  
William Paget, 1st Baron Paget \t William Lord Paget \t Edward Paget    

Dev and test files are of the form query \t candidate \t label, where label is 1 (if candidate is an alias of query) or 0 (if it is not). For example,

peace agreement \t peace negotiation \t 1
peace agreement \t interim peace treaty \t 1
peace agreement \t Peace Accord \t 1
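Both formats are plain tab-separated text, so they are easy to load; a minimal reader sketch (the function names here are illustrative, not part of this repository):

```python
import csv
import io

def read_triplets(f):
    """Yield (query, positive, negative) rows from a training file."""
    for row in csv.reader(f, delimiter="\t"):
        if len(row) >= 3:
            yield row[0].strip(), row[1].strip(), row[2].strip()

def read_labeled_pairs(f):
    """Yield (query, candidate, label) rows from a dev/test file."""
    for row in csv.reader(f, delimiter="\t"):
        if len(row) >= 3:
            yield row[0].strip(), row[1].strip(), int(row[2])

# Example with an in-memory dev-style line:
dev = io.StringIO("peace agreement\tpeace negotiation\t1\n")
pairs = list(read_labeled_pairs(dev))
```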

Setup

First, install the baselines by running source bin/install_baseline.sh (which installs soft-DTW from https://github.com/mblondel/soft-dtw).

For each session, run source bin/setup.sh to set environment variables.

If running on your own dataset, create the vocab for the dataset by running bin/make_vocab.sh with the training file, vocab file name, tokenizer, and minimum count as arguments. For example, sh bin/make_vocab.sh data/artist/artist.train data/artist/artist.vocab Char 5. Vocab files are provided for the datasets we released.

* Note: creating the vocab only has to be done once per dataset.
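Conceptually, building a character-level vocab amounts to counting characters over the training file's name fields and keeping those at or above the minimum count. A hypothetical sketch of that step (not the repo's implementation, which lives behind bin/make_vocab.sh):

```python
from collections import Counter

def build_char_vocab(lines, min_count=5):
    """Count characters across all tab-separated name fields and keep
    those appearing at least `min_count` times, most frequent first."""
    counts = Counter()
    for line in lines:
        for field in line.rstrip("\n").split("\t"):
            counts.update(field)
    return [ch for ch, c in counts.most_common() if c >= min_count]

# Tiny example: 'a' appears 10 times, 'b' 3 times, 'c' once.
vocab = build_char_vocab(["aaaaa\tab\tba", "aaa\tb\tc"], min_count=2)
```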

Training Models

First create a config JSON file (sample file at config/artist/STANCE.json).

Then, train the model by running bin/run/train_model.sh with the config JSON file as an argument. For example, sh bin/run/train_model.sh config/artist/STANCE.json

See below for how to train models with a grid search.

Evaluating Models

There are two options:

  1. Evaluate the model on the entire test file (this can take a long time).

    • Run bin/run/eval_model.sh, passing in the experiment directory as the argument. For example, sh bin/run/eval_model.sh exp_out/artist/Stance/Char/2019-05-30-10-36-55/.

  2. Shard the test file and evaluate the model on the shards in parallel.

    • First, shard the test file by running bin/shard_test.sh, passing in the test file and the number of shards as arguments. For example, sh bin/shard_test.sh data/disease/disease.test 10 0.

      * This only has to be done once per dataset.

    • Then, set up a script that evaluates the model on each shard in parallel by running src/main/eval/setup_parallel_test.py, passing in the experiment directory, number of shards, and GPU type as arguments. When using grid search, the experiment directory has to be the configuration directory containing the best model. For example, python src/main/eval/setup_parallel_test.py -e exp_out/artist/Stance/Char/2019-05-30-10-36-55 -n 10 -g 1080ti-short

      * The script assumes a slurm manager.

    • Next, run the generated script, which will be at exp_out/{dataset}/{model}/{tokenizer}/{timestamp}/parallel_test.sh. For example, sh exp_out/artist/Stance/Char/2019-05-30-10-36-55/parallel_test.sh.

    • Finally, compute the score over the shards by running src/main/eval/score_shards.py with the same experiment directory that was passed to setup_parallel_test.py. For example, python src/main/eval/score_shards.py -e exp_out/artist/Stance/Char/2019-05-30-10-36-55. The test scores will appear in exp_out/{dataset}/{model}/{tokenizer}/{timestamp}/test_scores.json.
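Sharding just splits the test file's lines across N files so each shard can be evaluated independently; a rough Python equivalent of that idea (the contiguous-split choice is an assumption, and bin/shard_test.sh may split differently):

```python
def shard_lines(lines, n_shards):
    """Split lines into n_shards near-equal contiguous chunks."""
    per_shard = -(-len(lines) // n_shards)   # ceiling division
    return [lines[i * per_shard:(i + 1) * per_shard]
            for i in range(n_shards)]

# 10 test lines split across 3 shards -> sizes 4, 4, 2.
shards = shard_lines([f"line{i}" for i in range(10)], 3)
```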

Grid Search Train Models

First, create a grid search config JSON file (sample file at config/artist/grid_search_STANCE.json)

Then, create a script to train each model configuration in parallel by running src/main/setup/setup_grid_search_train.py with the grid search config file and gpu type as arguments. For example, python src/main/setup/setup_grid_search_train.py -c config/artist/grid_search_STANCE.json -g gpu.

* The script assumes a slurm manager

Finally, run the script, which will be at exp_out/{dataset}/{model}/{tokenizer}/{timestamp}/grid_search_config.sh. For example, sh exp_out/artist/Stance/Char/2019-05-30-15-08-47/grid_search_config.sh.
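Conceptually, a grid-search config maps each hyperparameter to a list of candidate values, and the setup script expands the cross product into one run per combination. A hedged sketch of that expansion (the parameter names below are illustrative, not the repo's config schema):

```python
from itertools import product

def expand_grid(grid):
    """Expand {param: [values, ...]} into one config dict per combination."""
    keys = sorted(grid)
    return [dict(zip(keys, combo))
            for combo in product(*(grid[k] for k in keys))]

# 2 learning rates x 2 hidden sizes -> 4 training runs.
configs = expand_grid({"lr": [1e-3, 1e-4], "hidden_dim": [64, 128]})
```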

Citing

Please cite:

@inproceedings{tam2019optimal,
    title = "Optimal Transport-based Alignment of Learned Character Representations for String Similarity",
    author = "Tam, Derek  and
      Monath, Nicholas  and
      Kobren, Ari  and
      Traylor, Aaron  and
      Das, Rajarshi  and
      McCallum, Andrew",
    booktitle = "Association for Computational Linguistics (ACL)",
    year = "2019"
}