uhh-lt / Taxi

Licence: apache-2.0
TAXI: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling

This repository contains an implementation of a taxonomy induction method that took first place in the SemEval 2016 challenge on taxonomy extraction evaluation. The method builds a taxonomy from a domain vocabulary: it extracts hypernyms from substrings and from large domain-specific corpora bootstrapped from the input vocabulary. Evaluations on the SemEval taxonomy extraction datasets, covering four languages and three domains, show state-of-the-art performance. The repository includes all resources needed to reproduce the experiments described in the following paper, presented in San Diego at SemEval, co-located with NAACL 2016:

@inproceedings{panchenko2016taxi,
  title={TAXI at SemEval-2016 Task 13: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling},
  author={Panchenko, Alexander and Faralli, Stefano and Ruppert, Eugen and Remus, Steffen and Naets, Hubert and Fairon, Cedrick and Ponzetto, Simone Paolo and Biemann, Chris},
  booktitle={Proceedings of the 10th International Workshop on Semantic Evaluation},
  year={2016},
  address={San Diego, CA, USA},
  organization={Association for Computational Linguistics}
}

If you would like to refer to the system, please use this citation. More information about the approach can be found on the TAXI website.

System Requirements

The system was tested on Debian/Ubuntu Linux and Mac OS X. Loading all resources into memory requires about 64 GB of RAM.

Installation

  1. Clone the repository:
$ git clone https://github.com/tudarmstadt-lt/taxi.git
  2. Download the resources into the repository (4.4 GB compressed with gzip):
$ cd taxi && wget http://panchenko.me/data/joint/taxi/res/resources.tgz && tar xzf resources.tgz
  3. Install the system dependencies required by pygraphviz:
$ sudo apt-get install python-dev graphviz libgraphviz-dev pkg-config
  4. Install the project dependencies:
$ pip install -r requirements.txt
  5. Set up spaCy by downloading the language models for English, Dutch, French and Italian:
$ python -m spacy download en
$ python -m spacy download nl
$ python -m spacy download fr
$ python -m spacy download it
  6. Set up NLTK:
$ python -m nltk.downloader stopwords
$ python -m nltk.downloader wordnet
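
Before starting a long run, it can help to confirm that the Python packages are importable. The following is a minimal sketch and not part of the repository; the package list is an assumption based on the tools named in this README, and requirements.txt remains the authoritative list.

```python
import importlib.util

# Assumed package names, taken from the tools this README mentions;
# requirements.txt is the authoritative list.
ASSUMED_PACKAGES = ["spacy", "nltk", "networkx", "pygraphviz"]


def find_missing(packages):
    """Return the names in `packages` that cannot be located for import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(ASSUMED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All assumed packages are importable.")
```

The check only locates the packages; it does not verify that the spaCy models or NLTK corpora have been downloaded.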

Induction of SemEval Taxonomies

Run semeval.py to reproduce the experimental results, for example:

For a test run (few resources loaded, quick):

python semeval.py vocabularies/science_en.csv en simple --test

For a normal run (all resources are loaded, requires 64 GB of RAM):

python semeval.py vocabularies/science_en.csv en simple

This produces a noisy graph. Clean the output by running run.sh on it (this example uses the input file science_en.csv-relations.csv-taxo-knn1.csv):

./run.sh taxi_output/simple_full/science_en.csv-relations.csv-taxo-knn1.csv

The vocabularies directory contains input terms for different domains and languages. The script lets you reproduce the results of the SemEval 2016 Task 13 Taxonomy Extraction Evaluation described in our paper. It loads hypernyms from the downloaded resources and constructs a taxonomy for every input vocabulary of the SemEval datasets, e.g. the English Food domain. In general, the TAXI approach takes a vocabulary as input and outputs a taxonomy for a linked subset of the terms in this vocabulary. Currently, the main purpose of this repository is to ensure reproducibility of the SemEval results. The resulting taxonomies are generated next to the corresponding input vocabulary file. If you need to adapt the script to your needs and require help, do not hesitate to contact us.
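
To make the substring idea concrete, here is a rough illustration of how a hypernym candidate can be read off a multi-word term. It is a simplified trailing-word heuristic assumed for illustration only; the function name is hypothetical, and the matching and scoring actually used by TAXI are described in the paper and may differ.

```python
def substring_hypernyms(vocabulary):
    """Propose (hyponym, hypernym) pairs where the hypernym is the trailing
    word sequence of a longer term, e.g. "computer science" -> "science".

    Simplified illustration, not the repository's implementation.
    """
    terms = {t.strip().lower() for t in vocabulary if t.strip()}
    pairs = set()
    for hypo in terms:
        for hyper in terms:
            # Require a word boundary so "neuroscience" does not match "science".
            if hypo != hyper and hypo.endswith(" " + hyper):
                pairs.add((hypo, hyper))
    return sorted(pairs)


if __name__ == "__main__":
    vocab = ["science", "computer science", "theoretical computer science", "biology"]
    for hypo, hyper in substring_hypernyms(vocab):
        print(f"{hypo} -> {hyper}")
```

Terms with no matching substring in the vocabulary, such as "biology" here, receive no candidate from this heuristic, which is one reason the method also draws on corpus-based hypernyms.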

Distributional Semantics

  1. Download the required embeddings:
  • $ wget http://ltdata1.informatik.uni-hamburg.de/taxi/embeddings/embeddings_poincare_wordnet
  • $ wget http://ltdata1.informatik.uni-hamburg.de/taxi/embeddings/own_embeddings_w2v
  • $ wget http://ltdata1.informatik.uni-hamburg.de/taxi/embeddings/own_embeddings_w2v.trainables.syn1neg.npy
  • $ wget http://ltdata1.informatik.uni-hamburg.de/taxi/embeddings/own_embeddings_w2v.wv.vectors.npy
  2. Set the directory path in line 45 of distributional_semantics.py to the directory containing the embeddings downloaded above.

  3. To apply distributional semantics to the generated taxonomy, use the script distributional_semantics.py or the notebook distributional_semantics.py.ipynb.

The script can be used with the following options:

| Option | Alternate | Description | Default | Choices |
|---|---|---|---|---|
| --taxonomy | -t | Input file containing the taxonomy | - | - |
| --mode | -m | Mode of the algorithm | ds | ds, root, remove |
| --domain | -d | Domain of the taxonomy | science | science, science_wordnet, food, environment_eurovoc |
| --exparent | -ep | Exclude parent while calculating cluster similarity | False | - |
| --exfamily | -ef | Exclude family while calculating cluster similarity | False | - |

Example:
$ python distributional_semantics.py -t taxi_output/simple_full/science_en.csv-relations.csv-taxo-knn1.csv -d food -ep
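
For readers adapting the script, the options listed above could be declared with argparse roughly as follows. This is a sketch that mirrors the documented options, not the code in distributional_semantics.py; treating --taxonomy as optional with no default is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Sketch of the documented options; not the repository's code."
    )
    # Assumption: no default is documented for the taxonomy file, so None is used.
    parser.add_argument("-t", "--taxonomy", default=None,
                        help="Input file containing the taxonomy")
    parser.add_argument("-m", "--mode", default="ds",
                        choices=["ds", "root", "remove"],
                        help="Mode of the algorithm")
    parser.add_argument("-d", "--domain", default="science",
                        choices=["science", "science_wordnet", "food",
                                 "environment_eurovoc"],
                        help="Domain of the taxonomy")
    # Boolean flags default to False and become True when present.
    parser.add_argument("-ep", "--exparent", action="store_true",
                        help="Exclude parent while calculating cluster similarity")
    parser.add_argument("-ef", "--exfamily", action="store_true",
                        help="Exclude family while calculating cluster similarity")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Declaring --exparent and --exfamily as store_true flags matches the documented default of False and the bare -ep in the example above.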

Visualizing taxonomies

To visualize the taxonomy structure stored in a .csv file, you must have NetworkX and PyGraphviz installed in your environment.

To construct a hierarchical taxonomy structure:
$ python visualize_taxonomy.py --file <csv filename>

The generated images will be very large; alternatively, the graph can be constructed inside the notebook networkx_graph.ipynb.
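
When an image is too large to inspect, a plain-text outline can be a lighter way to browse a taxonomy. The sketch below uses only the standard library and is not part of the repository. It assumes the file holds tab-separated rows of relation id, hyponym, hypernym; check your output file before relying on that layout. The function names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_children(handle):
    """Map each hypernym to its sorted hyponyms.

    Assumes tab-separated rows of: relation id, hyponym, hypernym.
    """
    children = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        _, hypo, hyper = row[:3]
        children[hyper].add(hypo)
    return {parent: sorted(kids) for parent, kids in children.items()}


def render(children, node, depth=0, seen=None):
    """Return indented lines for the subtree under `node`.

    A node that was already expanded is printed again but not re-expanded,
    which also guards against cycles in a noisy graph.
    """
    seen = set() if seen is None else seen
    lines = ["  " * depth + node]
    if node in seen:
        return lines
    seen.add(node)
    for kid in children.get(node, []):
        lines.extend(render(children, kid, depth + 1, seen))
    return lines


if __name__ == "__main__":
    # Made-up sample rows for illustration; real input would be opened with open().
    sample = ("1\tcomputer science\tscience\n"
              "2\tbiology\tscience\n"
              "3\talgorithms\tcomputer science\n")
    tree = load_children(io.StringIO(sample))
    print("\n".join(render(tree, "science")))
```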
