
Garrafao / LSCDetection

License: GPL-3.0
Data Sets and Models for Evaluation of Lexical Semantic Change Detection


Projects that are alternatives of or similar to LSCDetection

cade
Compass-aligned Distributional Embeddings. Align embeddings from different corpora
Stars: ✭ 29 (+70.59%)
Mutual labels:  embeddings, lexical-semantics, semantic-change
deep-char-cnn-lstm
Deep Character CNN LSTM Encoder with Classification and Similarity Models
Stars: ✭ 20 (+17.65%)
Mutual labels:  embeddings, semantic-similarity
shinTB
Textboxes : Image Text Detection Model : python package (tensorflow)
Stars: ✭ 90 (+429.41%)
Mutual labels:  detection
deep-scite
🚣 A simple recommendation engine (by way of convolutions and embeddings) written in TensorFlow
Stars: ✭ 20 (+17.65%)
Mutual labels:  embeddings
watsor
Object detection for video surveillance
Stars: ✭ 203 (+1094.12%)
Mutual labels:  detection
Object-Detection-And-Tracking
Target detection in the first frame and Tracking target by SiamRPN.
Stars: ✭ 33 (+94.12%)
Mutual labels:  detection
codesnippetsearch
Neural bag of words code search implementation using PyTorch and data from the CodeSearchNet project.
Stars: ✭ 67 (+294.12%)
Mutual labels:  embeddings
cflow-ad
Official PyTorch code for WACV 2022 paper "CFLOW-AD: Real-Time Unsupervised Anomaly Detection with Localization via Conditional Normalizing Flows"
Stars: ✭ 138 (+711.76%)
Mutual labels:  detection
camera.ui
NVR like user Interface for RTSP capable cameras
Stars: ✭ 99 (+482.35%)
Mutual labels:  detection
Archived-SANSA-ML
SANSA Machine Learning Layer
Stars: ✭ 39 (+129.41%)
Mutual labels:  embeddings
towhee
Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.
Stars: ✭ 821 (+4729.41%)
Mutual labels:  embeddings
navec
Compact high quality word embeddings for Russian language
Stars: ✭ 118 (+594.12%)
Mutual labels:  embeddings
SpatiallyAdaptiveInference-Detection
Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation, ECCV 2020 Oral
Stars: ✭ 55 (+223.53%)
Mutual labels:  detection
Deep-Learning-Experiments-implemented-using-Google-Colab
Colab Compatible FastAI notebooks for NLP and Computer Vision Datasets
Stars: ✭ 16 (-5.88%)
Mutual labels:  embeddings
yolov5-deepsort-tensorrt
A c++ implementation of yolov5 and deepsort
Stars: ✭ 207 (+1117.65%)
Mutual labels:  detection
etiketai
Etiketai is an online tool designed to label images, useful for training AI models
Stars: ✭ 63 (+270.59%)
Mutual labels:  detection
UBA
UEBA Solution for Insider Security. This repo is archived. Thanks!
Stars: ✭ 36 (+111.76%)
Mutual labels:  detection
datastories-semeval2017-task6
Deep-learning model presented in "DataStories at SemEval-2017 Task 6: Siamese LSTM with Attention for Humorous Text Comparison".
Stars: ✭ 20 (+17.65%)
Mutual labels:  embeddings
entity-network
Tensorflow implementation of "Tracking the World State with Recurrent Entity Networks" [https://arxiv.org/abs/1612.03969] by Henaff, Weston, Szlam, Bordes, and LeCun.
Stars: ✭ 58 (+241.18%)
Mutual labels:  embeddings
eewids
Easily Expandable Wireless Intrusion Detection System
Stars: ✭ 25 (+47.06%)
Mutual labels:  detection

LSCDetection

General

Data Sets and Models for Evaluation of Lexical Semantic Change Detection.

If you use this software for academic research, please cite the papers listed under BibTex below.

Also make sure to give appropriate credit to the software this repository depends on, mentioned below.

Parts of the code rely on DISSECT, gensim, numpy, scikit-learn, scipy, VecMap.

Usage

The scripts should be run directly from the main directory. If you run them from elsewhere, you may have to adjust the path passed to sys.path.append('./modules/') in the scripts. All scripts can be run directly from the command line:

python3 representations/count.py <corpDir> <outPath> <windowSize>

e.g.

python3 representations/count.py corpora/test/corpus1/ test_matrix1 1

The usage of each script can be displayed by running it with the help option -h, e.g.:

python3 representations/count.py -h

We recommend running the scripts within a virtual environment with Python 3.7.4. Install the required packages with pip install -r requirements.txt. (See also Error Sources.)

Models

A standard model of LSC detection executes three consecutive steps:

  1. learn semantic representations from corpora (representations/)
  2. align representations (alignment/)
  3. measure change (measures/)

As an example, consider a very simple model (CNT+CI+CD) going through these steps:

  1. learn count vectors from each corpus to compare (representations/count.py)
  2. align them by intersecting their columns (alignment/ci_align.py)
  3. measure change with cosine distance (measures/cd.py)
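
The change score in step 3 is plain cosine distance between a target's aligned vectors. A minimal sketch of the core computation (the repository's measures/cd.py operates on whole matrices and target lists; this is only the underlying formula):

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two aligned vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Identical vectors score 0.0 (no change); orthogonal vectors score 1.0.
stable  = cosine_distance(np.array([1.0, 0.0]), np.array([1.0, 0.0]))  # 0.0
changed = cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # 1.0
```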

You can apply this model to the testing data using the following commands:

    python3 representations/count.py corpora/test/corpus1/ test_matrix1 1
    python3 representations/count.py corpora/test/corpus2/ test_matrix2 1

    python3 alignment/ci_align.py test_matrix1 test_matrix2 test_matrix1_aligned test_matrix2_aligned

    python3 measures/cd.py -s testsets/test/targets.tsv test_matrix1_aligned test_matrix2_aligned test_results.tsv

Input Format: All the scripts in this repository can handle two types of matrix input formats:

To learn more about how matrices are loaded and stored check out modules/utils_.py.

The scripts assume a corpus format of one sentence per line in UTF-8 encoded (optionally zipped) text files. You can specify either a file path or a folder. In the latter case the scripts will iterate over all files in the folder.
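
A hedged sketch of a reader for this corpus format (the repository's own loading code lives in modules/; the function name here is hypothetical):

```python
import gzip
from pathlib import Path

def iter_sentences(path):
    """Yield one tokenized sentence per line from a UTF-8 text file,
    a gzipped text file, or every file inside a folder."""
    path = Path(path)
    files = sorted(p for p in path.iterdir() if p.is_file()) if path.is_dir() else [path]
    for f in files:
        opener = gzip.open if f.suffix == '.gz' else open
        with opener(f, 'rt', encoding='utf-8') as handle:
            for line in handle:
                tokens = line.strip().split()
                if tokens:
                    yield tokens
```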

Pre-Training

Pre-training is useful when working with small target corpora or when additional semantic information not contained in the target corpus is desired.

To pre-train SGNS models use representations/sgns.py to create embeddings on the chosen pre-training corpus (saved as a .model file). Afterwards alignment/sgns_vi.py or alignment/sgns_vi_l2normalize.py may be used to refine the pre-trained model on the target corpus. See Alignment for differences between the two scripts.

Semantic Representations

| Name  | Code                     | Type | Comment                       |
| ----- | ------------------------ | ---- | ----------------------------- |
| Count | representations/count.py | VSM  |                               |
| PPMI  | representations/ppmi.py  | VSM  |                               |
| SVD   | representations/svd.py   | VSM  |                               |
| RI    | representations/ri.py    | VSM  |                               |
| SGNS  | representations/sgns.py  | VSM  |                               |
| SCAN  | repository               | TPM  | different corpus input format |

Table: VSM = Vector Space Model, TPM = Topic Model

Alignment

| Name | Code                             | Applicability              | Comment                                                                                                                    |
| ---- | -------------------------------- | -------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| CI   | alignment/ci_align.py            | Count, PPMI                |                                                                                                                              |
| SRV  | alignment/srv_align.py           | RI                         | consider using the more powerful TRIPY                                                                                       |
| OP   | alignment/map_embeddings.py      | SVD, RI, SGNS              | drawn from VecMap; for OP- and OP+ see scripts/                                                                              |
| VI   | alignment/sgns_vi.py             | SGNS                       | bug fixes 27/12/19 (see script for details)                                                                                  |
|      | alignment/sgns_vi_l2normalize.py | SGNS                       | additional length normalization between initialization and training; improvements over VI detailed in Kaiser et al. (2021)  |
| WI   | alignment/wi.py                  | Count, PPMI, SVD, RI, SGNS | consider using the more advanced Temporal Referencing                                                                        |

Measures

| Name | Code                | Applicability              | Comment                                                                  |
| ---- | ------------------- | -------------------------- | ------------------------------------------------------------------------ |
| CD   | measures/cd.py      | Count, PPMI, SVD, RI, SGNS |                                                                          |
| LND  | measures/lnd.py     | Count, PPMI, SVD, RI, SGNS |                                                                          |
| JSD  | -                   | SCAN                       |                                                                          |
| FD   | measures/freq.py    | from corpus                | log-transform with measures/trsf.py; get difference with measures/diff.py |
| TD   | measures/typs.py    | Count                      | as above                                                                 |
| HD   | measures/entropy.py | Count                      | as above                                                                 |
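
The frequency-based measures reduce to simple corpus statistics. A hedged sketch of the FD idea with the log-transform and difference steps folded in (the function names here are hypothetical; the actual scripts measures/freq.py, measures/trsf.py and measures/diff.py operate on files and target lists):

```python
import math
from collections import Counter

def relative_frequency(target, sentences):
    """Relative frequency of a target word in a tokenized corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return counts[target] / sum(counts.values())

def fd_score(target, corpus1, corpus2):
    """FD-style score: absolute difference of log-transformed relative frequencies."""
    return abs(math.log(relative_frequency(target, corpus2))
               - math.log(relative_frequency(target, corpus1)))
```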

Post-Processing

| Name   | Code                  | Applicability | Comment |
| ------ | --------------------- | ------------- | ------- |
| SOT    | postprocessing/sot.py | VSM           |         |
| MC+PCR | postprocessing/pcr.py | VSM           |         |

Parameter Settings

Find detailed notes on model performances and optimal parameter settings in the papers listed under BibTex below.

Evaluation

The evaluation framework of this repository is based on the comparison of a set of target words across two corpora. Hence, models can be evaluated on a triple (dataset, corpus1, corpus2), where the dataset provides gold values for the change of target words between corpus1 and corpus2.

Datasets

| Dataset      | Language | Corpus 1           | Corpus 2           | Download         | Comment                                                    |
| ------------ | -------- | ------------------ | ------------------ | ---------------- | ---------------------------------------------------------- |
| DURel        | German   | DTA18              | DTA19              | Dataset, Corpora | version from Schlechtweg et al. (2019) at testsets/durel/  |
| SURel        | German   | SDEWAC             | COOK               | Dataset, Corpora | version from Schlechtweg et al. (2019) at testsets/surel/  |
| SemCor LSC   | English  | SEMCOR1            | SEMCOR2            | Dataset, Corpora |                                                            |
| SemEval Eng  | English  | CCOHA 1810-1860    | CCOHA 1960-2010    | Dataset, Corpora |                                                            |
| SemEval Ger  | German   | DTA 1800-1899      | BZND 1946-1990     | Dataset, Corpora |                                                            |
| SemEval Lat  | Latin    | LatinISE -200-0    | LatinISE 0-2000    | Dataset, Corpora |                                                            |
| SemEval Swe  | Swedish  | Kubhist2 1790-1830 | Kubhist2 1895-1903 | Dataset, Corpora |                                                            |
| RuSemShift1  | Russian  | RNC 1682-1916      | RNC 1918-1990      | Dataset, Corpora |                                                            |
| RuSemShift2  | Russian  | RNC 1918-1990      | RNC 1991-2016      | Dataset, Corpora |                                                            |
| RuShiftEval1 | Russian  | RNC 1682-1916      | RNC 1918-1990      | Dataset, Corpora |                                                            |
| RuShiftEval2 | Russian  | RNC 1918-1990      | RNC 1991-2016      | Dataset, Corpora |                                                            |
| RuShiftEval3 | Russian  | RNC 1682-1916      | RNC 1991-2016      | Dataset, Corpora |                                                            |
| DIACR-Ita    | Italian  | Unità 1945-1970    | Unità 1990-2014    | Dataset, Corpora |                                                            |

We provide several evaluation pipelines that download the corpora and evaluate the models on (most of) the above-mentioned datasets; see Pipelines.

Metrics

| Name                 | Code              | Applicability                            | Comment                                             |
| -------------------- | ----------------- | ---------------------------------------- | --------------------------------------------------- |
| Spearman correlation | evaluation/spr.py | DURel, SURel, SemCor LSC, SemEval*, Ru*  | outputs rho (column 3) and p-value (column 4)       |
| Average Precision    | evaluation/ap.py  | SemCor LSC, SemEval*, DIACR-Ita          | outputs AP (column 3) and random baseline (column 4) |
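
Both metrics are standard and available in scipy and scikit-learn, which the repository already depends on. A sketch with hypothetical gold and predicted values (the actual scripts read and write TSV files as noted in the table):

```python
from scipy.stats import spearmanr
from sklearn.metrics import average_precision_score

# Hypothetical graded gold change values and model scores for five targets.
gold_graded = [0.1, 0.4, 0.5, 0.8, 0.9]
predicted   = [0.2, 0.3, 0.6, 0.7, 0.95]

# Spearman correlation between predicted and gold ranking.
rho, p_value = spearmanr(gold_graded, predicted)

# For binary gold classes (changed vs. stable), Average Precision applies.
gold_binary = [0, 0, 1, 1, 1]
ap = average_precision_score(gold_binary, predicted)
```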

Consider uploading your results for DURel as a submission to the shared task Lexical Semantic Change Detection in German, for SemEval* to SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection and for RuShiftEval to RuShiftEval.

Pipelines

Under scripts/ you will find an example of a full evaluation pipeline running the models on two small test corpora. Assuming you are working on a UNIX-based system, first make the scripts executable with

chmod 755 scripts/*.sh

Then run

bash -e scripts/run_test.sh

The script first reads the two gzipped test corpora corpora/test/corpus1/ and corpora/test/corpus2/. Then it produces model predictions for the targets in testsets/test/targets.tsv and writes them under results/. It finally writes the Spearman correlation between each model's predictions and the gold rank (testsets/test/gold.tsv) under the respective folder in results/. Note that the gold values for the test data are meaningless, as they were randomly assigned.

We also provide a script for each dataset that downloads the necessary data and runs all the models on it. To do so, run one of

bash -e scripts/run_durel.sh
bash -e scripts/run_surel.sh
bash -e scripts/run_semcor.sh
bash -e scripts/run_semeval*.sh

You may want to change the parameters in scripts/parameters_durel.sh, etc. (e.g. vector dimensionality, iterations), as running the scripts on the full parameter set may take several days and require a large amount of disk space.

Important Changes

  • September 1, 2019: Python scripts were updated from Python 2 to Python 3.
  • December 27, 2019: bug fixes in alignment/sgns_vi.py (see script for details)
  • March 23, 2020: updates in representations/ri.py and alignment/srv_align.py (see scripts for details)

Error Sources

  • if you are on a Windows system and get error messages like [bash] $'\r': command not found, consider removing trailing '\r' characters with sed -i 's/\r$//' scripts/*.sh

BibTex

@inproceedings{Schlechtwegetal19,
    title = {{A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains}},
    author = {Dominik Schlechtweg and Anna H\"{a}tty and Marco del Tredici and Sabine {Schulte im Walde}},
    booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
    year = {2019},
    address = {Florence, Italy},
    publisher = {Association for Computational Linguistics},
    pages = {732--746},
    doi = {10.18653/v1/P19-1072}
}
@inproceedings{Kaiser2021effects,
    title = "Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection",
    author = "Kaiser, Jens and Kurtyigit, Sinan and Kotchourko, Serge and Schlechtweg, Dominik",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics",
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics"
}