
AbhilashaRavichander / PrivacyQA_EMNLP

License: MIT
PrivacyQA, a resource to support question-answering over privacy policies.

Projects that are alternatives of or similar to PrivacyQA EMNLP

COCO-LM
[NeurIPS 2021] COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining
Stars: ✭ 109 (+354.17%)
Mutual labels:  natural-language-understanding
Fill-the-GAP
[ACL-WS] 4th place solution to gendered pronoun resolution challenge on Kaggle
Stars: ✭ 13 (-45.83%)
Mutual labels:  natural-language-understanding
viky-ai
Natural Language Processing platform for extracting information from unstructured text.
Stars: ✭ 38 (+58.33%)
Mutual labels:  natural-language-understanding
hyperdome
the safest place to reach out
Stars: ✭ 26 (+8.33%)
Mutual labels:  privacy-enhancing-technologies
MHPC-Natural-Language-Processing-Lectures
This is the second part of the Deep Learning Course for the Master in High-Performance Computing (SISSA/ICTP).
Stars: ✭ 33 (+37.5%)
Mutual labels:  natural-language-understanding
privapi
Detect Sensitive REST API communication using Deep Neural Networks
Stars: ✭ 42 (+75%)
Mutual labels:  privacy-enhancing-technologies
Catalyst
🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
Stars: ✭ 224 (+833.33%)
Mutual labels:  natural-language-understanding
HElib
HElib is an open-source software library that implements homomorphic encryption. It supports the BGV scheme with bootstrapping and the Approximate Number CKKS scheme. HElib also includes optimizations for efficient homomorphic evaluation, focusing on effective use of ciphertext packing techniques and on the Gentry-Halevi-Smart optimizations.
Stars: ✭ 2,913 (+12037.5%)
Mutual labels:  privacy-enhancing-technologies
OKD-Reading-List
Papers for Open Knowledge Discovery
Stars: ✭ 102 (+325%)
Mutual labels:  natural-language-understanding
bert extension tf
BERT Extension in TensorFlow
Stars: ✭ 29 (+20.83%)
Mutual labels:  natural-language-understanding
FUTURE
A private, free, open-source search engine built on a P2P network
Stars: ✭ 19 (-20.83%)
Mutual labels:  natural-language-understanding
auto-gfqg
Automatic Gap-Fill Question Generation
Stars: ✭ 17 (-29.17%)
Mutual labels:  natural-language-understanding
gpc-optmeowt
Browser extension for opting out from the sale and sharing of personal information per the California Consumer Privacy Act and other privacy laws
Stars: ✭ 75 (+212.5%)
Mutual labels:  privacy-enhancing-technologies
conclave
Query compiler for secure multi-party computation.
Stars: ✭ 86 (+258.33%)
Mutual labels:  privacy-enhancing-technologies
corpusexplorer2.0
Corpus linguistics has never been so easy...
Stars: ✭ 16 (-33.33%)
Mutual labels:  natural-language-understanding
GLUE-bert4keras
GLUE benchmark code based on bert4keras.
Stars: ✭ 59 (+145.83%)
Mutual labels:  natural-language-understanding
mobiletrackers
A repository of telemetry domains and URLs used by mobile location tracking, user profiling, targeted marketing, and aggressive advertising libraries.
Stars: ✭ 118 (+391.67%)
Mutual labels:  privacy-enhancing-technologies
shifting
A privacy-focused list of alternatives to mainstream services to help the competition.
Stars: ✭ 31 (+29.17%)
Mutual labels:  privacy-enhancing-technologies
linguistics problems
Natural language processing in examples and games
Stars: ✭ 23 (-4.17%)
Mutual labels:  natural-language-understanding
Luci
Logical Unity for Communicational Interactivity
Stars: ✭ 25 (+4.17%)
Mutual labels:  natural-language-understanding

Question Answering for Privacy Policies

This repository contains the PrivacyQA dataset described in the EMNLP 2019 paper, Question Answering for Privacy Policies: Combining Computational and Legal Perspectives. PrivacyQA is a corpus consisting of 1750 questions about the contents of privacy policies, paired with expert annotations. The goal of this effort is to kickstart the development of question-answering methods for this domain, to address the (unrealistic) expectation that a large population should be reading many policies per day.

The data is partitioned into train and test sets; the same split was used in the experiments reported in the paper. You can also download all the relevant data from here.

The data is in a tab-separated format with the following fields:

  1. Folder : Physical location of app data and metadata.
  2. DocID : Unique identifier for privacy policy
  3. QueryID : Unique identifier for query
  4. SentID : Unique identifier for sentence
  5. Split : Train or test split
  6. Query : Text field consisting of crowdsourced question against policy
  7. Segment : Sentence from privacy policy
  8. Label : {Relevant, Irrelevant} in the train file; Ann1, Ann2, Ann3, Ann4, Ann5 and Ann6 : {Relevant, Irrelevant, None} in the test file, one column per annotator

None: the annotation should not be considered. Relevant: the segment is relevant to the query. Irrelevant: the segment is irrelevant to the query. In addition, the test file contains a meta-annotation, 'Any_Relevant', indicating whether a segment was considered relevant by any annotator.
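
The sketch below shows one way to load the splits and aggregate the test annotations. The file paths ("data/policy_train_data.csv", "data/policy_test_data.csv") and the exact column headers are assumptions based on the field list above, not confirmed by this README; adjust them to match the files in the repository.

    # Minimal loading sketch. File paths and column names below are assumptions
    # based on the field description above; adjust to match the actual files.
    import pandas as pd

    train = pd.read_csv("data/policy_train_data.csv", sep="\t")  # assumed path
    test = pd.read_csv("data/policy_test_data.csv", sep="\t")    # assumed path

    # Train split: one gold label per (query, segment) pair.
    relevant_train = train[train["Label"] == "Relevant"]

    # Test split: aggregate the six annotator columns, ignoring "None" votes.
    ann_cols = [c for c in test.columns if c.startswith("Ann")]

    def any_relevant(row):
        votes = [row[c] for c in ann_cols if row[c] != "None"]
        return "Relevant" if "Relevant" in votes else "Irrelevant"

    test["Derived_Any_Relevant"] = test.apply(any_relevant, axis=1)
    # This derived column should agree with the provided 'Any_Relevant' meta-annotation.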

Additionally, we include annotations of each user query with the applicable OPP-115 categories. The categories are sourced from the OPP-115 Corpus annotation scheme (Wilson et al., 2016), and the annotations for both train and test splits can be found in the meta-annotations folder. Each column corresponding to an OPP-115 category contains a "1" if the category is considered relevant to the question, as described in the paper, and "0" otherwise (a short reading sketch follows the category list below). A brief description of the OPP-115 categories follows:

  1. First Party Collection/Use: What, why and how information is collected by the service provider
  2. Third Party Sharing/Collection: What, why and how information is shared with or collected by third parties
  3. Data Security: Protection measures for user information
  4. Data Retention: How long user information will be stored
  5. User Choice/Control: Control options available to users
  6. User Access, Edit and Deletion: If/how users can access, edit or delete information
  7. Policy Change: Informing users if policy information has been changed
  8. International and Specific Audiences: Practices pertaining to a specific group of users
  9. Other: General text, contact information or practices not covered by other categories.
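
The sketch below tallies how many questions fall under each category. The meta-annotations file name is an assumption, and the category column headers are taken from the list above; both may differ in the actual files.

    # Sketch of reading the OPP-115 meta-annotations. The file path and the exact
    # category column headers are assumptions; adjust them to the files in the
    # meta-annotations folder.
    import pandas as pd

    categories = [
        "First Party Collection/Use",
        "Third Party Sharing/Collection",
        "Data Security",
        "Data Retention",
        "User Choice/Control",
        "User Access, Edit and Deletion",
        "Policy Change",
        "International and Specific Audiences",
        "Other",
    ]

    meta = pd.read_csv("meta-annotations/train_meta_annotations.csv", sep="\t")  # assumed path

    # Each category column holds 1 if the category applies to the question, else 0,
    # so column sums give the number of questions per category.
    counts = {c: int(meta[c].sum()) for c in categories if c in meta.columns}
    print(counts)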

If you make use of this dataset in your research, we ask that you please cite our paper:

@inproceedings{ravichander-etal-2019-question,
    title = "Question Answering for Privacy Policies: Combining Computational and Legal Perspectives",
    author = "Ravichander, Abhilasha  and
      Black, Alan W  and
      Wilson, Shomir  and
      Norton, Thomas  and
      Sadeh, Norman",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1500",
    doi = "10.18653/v1/D19-1500",
    pages = "4949--4959",
    abstract = "Privacy policies are long and complex documents that are difficult for users to read and understand. Yet, they have legal effects on how user data can be collected, managed and used. Ideally, we would like to empower users to inform themselves about the issues that matter to them, and enable them to selectively explore these issues. We present PrivacyQA, a corpus consisting of 1750 questions about the privacy policies of mobile applications, and over 3500 expert annotations of relevant answers. We observe that a strong neural baseline underperforms human performance by almost 0.3 F1 on PrivacyQA, suggesting considerable room for improvement for future systems. Further, we use this dataset to categorically identify challenges to question answerability, with domain-general implications for any question answering system. The PrivacyQA corpus offers a challenging corpus for question answering, with genuine real world utility.",
}