
synapse-developpement / Discovery

Licence: other
Mining Discourse Markers for Unsupervised Sentence Representation Learning

Programming Languages: Jupyter Notebook, Shell

Projects that are alternatives of or similar to Discovery

DiscourseSenser
Sense Disambiguation of Connectives for PDTB-Style Discourse Parsing
Stars: ✭ 13 (-72.92%)
Mutual labels:  discourse-analysis, discourse-parsing, pdtb
Opencog
A framework for integrated Artificial Intelligence & Artificial General Intelligence (AGI)
Stars: ✭ 2,132 (+4341.67%)
Mutual labels:  unsupervised-learning, natural-language-inference, natural-language-understanding
discopy
End-to-end shallow discourse parser
Stars: ✭ 16 (-66.67%)
Mutual labels:  discourse-analysis, discourse-parsing, pdtb
Awesome Sentence Embedding
A curated list of pretrained sentence and word embedding models
Stars: ✭ 1,973 (+4010.42%)
Mutual labels:  unsupervised-learning, sentence-embeddings
nlp-notebooks
A collection of natural language processing notebooks.
Stars: ✭ 19 (-60.42%)
Mutual labels:  natural-language-inference, natural-language-understanding
Gluon Nlp
NLP made easy
Stars: ✭ 2,344 (+4783.33%)
Mutual labels:  natural-language-inference, natural-language-understanding
Nlp Recipes
Natural Language Processing Best Practices & Examples
Stars: ✭ 5,783 (+11947.92%)
Mutual labels:  natural-language-inference, natural-language-understanding
Fill-the-GAP
[ACL-WS] 4th place solution to gendered pronoun resolution challenge on Kaggle
Stars: ✭ 13 (-72.92%)
Mutual labels:  natural-language-inference, natural-language-understanding
TextFeatureSelection
Python library for feature selection for text features. It has filter method, genetic algorithm and TextFeatureSelectionEnsemble for improving text classification models. Helps improve your machine learning models
Stars: ✭ 42 (-12.5%)
Mutual labels:  natural-language-inference, natural-language-understanding
WSDM-Cup-2019
[ACM-WSDM] 3rd place solution at WSDM Cup 2019, Fake News Classification on Kaggle.
Stars: ✭ 62 (+29.17%)
Mutual labels:  natural-language-inference, natural-language-understanding
T-CorEx
Implementation of linear CorEx and temporal CorEx.
Stars: ✭ 31 (-35.42%)
Mutual labels:  unsupervised-learning
kmeans
A simple implementation of K-means (and Bisecting K-means) clustering algorithm in Python
Stars: ✭ 18 (-62.5%)
Mutual labels:  unsupervised-learning
Deep-Association-Learning
Tensorflow Implementation on Paper [BMVC2018] Deep Association Learning for Unsupervised Video Person Re-identification
Stars: ✭ 68 (+41.67%)
Mutual labels:  unsupervised-learning
machine-learning
Programming Assignments and Lectures for Andrew Ng's "Machine Learning" Coursera course
Stars: ✭ 83 (+72.92%)
Mutual labels:  unsupervised-learning
adenine
ADENINE: A Data ExploratioN PipelINE
Stars: ✭ 15 (-68.75%)
Mutual labels:  unsupervised-learning
Manhattan-LSTM
Keras and PyTorch implementations of the MaLSTM model for computing Semantic Similarity.
Stars: ✭ 28 (-41.67%)
Mutual labels:  natural-language-understanding
ConveRT
Dual Encoders for State-of-the-art Natural Language Processing.
Stars: ✭ 44 (-8.33%)
Mutual labels:  natural-language-understanding
PlanSum
[AAAI2021] Unsupervised Opinion Summarization with Content Planning
Stars: ✭ 25 (-47.92%)
Mutual labels:  unsupervised-learning
awesome-contrastive-self-supervised-learning
A comprehensive list of awesome contrastive self-supervised learning papers.
Stars: ✭ 748 (+1458.33%)
Mutual labels:  unsupervised-learning
DRNET
PyTorch implementation of the NIPS 2017 paper - Unsupervised Learning of Disentangled Representations from Video
Stars: ✭ 45 (-6.25%)
Mutual labels:  unsupervised-learning

Discovery: Mining Discourse Markers for Unsupervised Sentence Representation Learning

This is a data/code release accompanying this paper:

  • Title: "Mining Discourse Markers for Unsupervised Sentence Representation Learning"
  • Authors: Damien Sileo, Tim Van de Cruys, Camille Pradel and Philippe Muller
  • https://www.aclweb.org/anthology/N19-1351/
  • Presented at NAACL 2019

Contents

The Discovery datasets consist of adjacent sentence pairs (s1, s2) together with a discourse marker (y) that occurred at the beginning of s2. The pairs were extracted from the depcc web corpus.

Marker prediction can be used to train sentence encoders. Discourse markers can be considered noisy labels for various semantic tasks, such as entailment (y=therefore), subjectivity analysis (y=personally), sentiment analysis (y=sadly), similarity (y=similarly), or typicality (y=curiously).
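As an illustration of this training signal, here is a minimal sketch in PyTorch: encode s1 and s2 separately, concatenate the two representations, and classify the pair into one of the 174 markers. This is not the architecture from the paper; the BiGRU encoder, every dimension, and the fake batch below are illustrative assumptions.

import torch
import torch.nn as nn

NUM_MARKERS = 174  # number of discourse marker classes in Discovery

class MarkerClassifier(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Illustrative sentence encoder: a bidirectional GRU with mean pooling.
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # The classifier sees both sentence representations concatenated.
        self.classifier = nn.Linear(4 * hidden_dim, NUM_MARKERS)

    def encode(self, token_ids):
        states, _ = self.encoder(self.embedding(token_ids))
        return states.mean(dim=1)  # mean pooling over time steps

    def forward(self, s1_ids, s2_ids):
        u, v = self.encode(s1_ids), self.encode(s2_ids)
        return self.classifier(torch.cat([u, v], dim=-1))

# One training step: predict the marker y from (s1, s2) with cross-entropy.
model = MarkerClassifier()
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

s1 = torch.randint(1, 30000, (8, 20))    # fake batch of tokenized first sentences
s2 = torch.randint(1, 30000, (8, 20))    # fake batch of tokenized second sentences
y = torch.randint(0, NUM_MARKERS, (8,))  # fake marker labels

optimizer.zero_grad()
loss = loss_fn(model(s1, s2), y)
loss.backward()
optimizer.step()

After training on the full dataset, the encode() function alone would serve as the sentence encoder for transfer tasks.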

What sets this dataset apart is the diversity of its markers: previously used datasets relied on only ~10 imbalanced classes. In this repository, you can find:

  • a list of the 174 discourse markers we obtained
  • a Base version of our dataset with 1.74 million pairs (10k examples per marker)
  • a Big version with 3.4 million pairs
  • a Hard version with 1.74 million pairs where the connective could not be predicted with a fastText linear model (see the sketch after this list)
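For the Hard version, here is a hedged sketch of the kind of filtering step involved, using the fastText Python library: train a linear classifier to predict the marker from the concatenated pair, then keep only the pairs it misclassifies. The toy pairs, file name, and hyperparameters are assumptions, not the released extraction pipeline.

import fasttext

# Toy list of (s1, s2, marker) pairs; in practice these would come from the Base version.
pairs = [
    ("It rained all week .", "the match was cancelled .", "consequently"),
    ("He trained for months .", "he finished last .", "surprisingly"),
]

# 1. Write the pairs in fastText's supervised format: __label__<marker> <text>
with open("markers.train", "w", encoding="utf-8") as f:
    for s1, s2, marker in pairs:
        f.write(f"__label__{marker} {s1} {s2}\n")

# 2. Train a linear bag-of-words classifier.
model = fasttext.train_supervised(input="markers.train", epoch=5, lr=0.5)

# 3. Keep only the pairs whose marker the linear model fails to predict.
hard_pairs = []
for s1, s2, marker in pairs:
    predicted_labels, _ = model.predict(f"{s1} {s2}")
    if predicted_labels[0] != f"__label__{marker}":
        hard_pairs.append((s1, s2, marker))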

Examples from the Discovery dataset:

  • s1: The motivations for playing are vastly different , and yet Spin the Bottle manages to meet the needs of all its players .
    s2: It is a well crafted game .
    y: truly,
  • s1: Prefiguring The General many years later , Bernard liked nothing better than to cock a snoot at the law .
    s2: Men working on a bog , less than a mile from the Kirwan farm , dug up a human torso .
    y: eventually,
  • s1: Think a certain vertical market or knowledge about multilocations ' unique needs .
    s2: Ernest 's strength lay in the multilocation arena and gives Birch a new capability .
    y: indeed,
  • s1: @ Sklivvz : but you are implicitly using one such interpretation yourself .
    s2: One that tells you that it 's unphysical to ask anything except measurements .
    y: namely,
  • s1: Perhaps the Jeanneau 's are a bargain compared to similarly capable boats from B or C. .
    s2: Seattle , the prices for the 36 and 39 went down about 20G , a 39 now sells for a bit more than the 36 did .
    y: locally,

Instructions

NOW AVAILABLE THROUGH THE HUGGING FACE DATASETS LIBRARY (GLUE-COMPATIBLE FORMAT):

import datasets

# Load the base Discovery configuration (GLUE-compatible splits)
dataset = datasets.load_dataset("discovery", "discovery")
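Once loaded, the splits can be inspected as usual. Continuing from the snippet above, and assuming GLUE-style field names (sentence1, sentence2, label) with a ClassLabel feature for the markers:

print(dataset["train"][0])                          # one (s1, s2, y) example as a dict
markers = dataset["train"].features["label"].names  # assumed: the 174 marker strings
print(len(markers))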

Alternatively, run bash get_data.bash in the data directory, or download the data directly from this link: https://drive.google.com/file/d/1yOJvkrYbGED9yFrSgo7297jW_47e55g6/view?usp=sharing

demo.ipynb shows an example of how to read the data and export it in a different format.
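If you only need a flat file, the Hugging Face copy can also be exported directly, for instance to CSV; a minimal sketch (the output file name is illustrative):

import datasets

dataset = datasets.load_dataset("discovery", "discovery")
# Convert the training split to a pandas DataFrame and write it out as CSV.
dataset["train"].to_pandas().to_csv("discovery_train.csv", index=False)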

Citation

@inproceedings{sileo-etal-2019-mining,
    title = "Mining Discourse Markers for Unsupervised Sentence Representation Learning",
    author = "Sileo, Damien  and
      Van De Cruys, Tim  and
      Pradel, Camille  and
      Muller, Philippe",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1351",
    pages = "3477--3486",
    abstract = "Current state of the art systems in NLP heavily rely on manually annotated datasets, which are expensive to construct. Very little work adequately exploits unannotated data {--} such as discourse markers between sentences {--} mainly because of data sparseness and ineffective extraction methods. In the present work, we propose a method to automatically discover sentence pairs with relevant discourse markers, and apply it to massive amounts of data. Our resulting dataset contains 174 discourse markers with at least 10k examples each, even for rare markers such as {``}coincidentally{''} or {``}amazingly{''}. We use the resulting data as supervision for learning transferable sentence embeddings. In addition, we show that even though sentence representation learning through prediction of discourse marker yields state of the art results across different transfer tasks, it{'}s not clear that our models made use of the semantic relation between sentences, thus leaving room for further improvements.",
}
[Figure: the list of markers we used. PDTB markers are shown in black; markers discovered in our work are shown in red.]

Contact

For further information, you can contact:

damien dot sileo at gmail dot com
