wellcometrust / WellcomeML

Licence: MIT license
Repository for Machine Learning utils at the Wellcome Trust

WellcomeML utils

This package contains common utility functions for everyday tasks at the Wellcome Trust, in particular for processing, embedding and classifying text data. It includes:

  • An intuitive sklearn-like API wrapping text vectorizers, such as Doc2Vec, BERT and SciBERT
  • A common API for off-the-shelf classifiers to allow quick iteration (e.g. Frequency Vectorizer, BERT, SciBERT, basic CNN, BiLSTM, SemanticSimilarity)
  • Utils to download and convert academic text datasets for benchmarking
  • Utils to download data from the EPMC API

For more information read the official docs.

1. Quickstart

Installing from PyPI

pip install wellcomeml

This installs the "vanilla" package, which has limited functionality (for example io and dataset download).

If space is not a problem, you can install the full package (around 2.2GB):

pip install wellcomeml[all]

The full package is relatively big, so there are also fine-grained installations if you only need specific modules. The available extras are core, transformers, tensorflow, torch and spacy. You can install one or more of them, separated by commas with no spaces, e.g.:

pip install wellcomeml[tensorflow,core]

To check that your installation allows you to use a specific module, try (for example):

python -c "import wellcomeml.ml.bert_vectorizer"

If you don't have the correct dependencies installed for a module, an error will appear and point you to the right dependencies.
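
The same check can be run across several modules at once. The following is a minimal sketch using only the standard library. It assumes that missing dependencies surface as an ImportError; the helper name check_modules is hypothetical and is not part of WellcomeML.

```python
import importlib


def check_modules(module_names):
    """Try to import each module and report whether it succeeded.

    Returns a dict mapping each module name to None (import worked) or to
    the error message raised while importing.
    """
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except ImportError as error:
            # ModuleNotFoundError is a subclass of ImportError, so it is
            # caught here as well.
            results[name] = str(error)
    return results


# Example call (module names taken from the Extras section of this README):
# check_modules(["wellcomeml.ml.bert_vectorizer", "wellcomeml.ml.cnn"])
```

Any entry whose value is not None names a module that still needs its extras installed.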

1.1 Installing wellcomeml[all] on Windows

Torch has a different installation for Windows, so it will not be installed automatically with wellcomeml[all]. Install it first. The command below is for machines without the CUDA parallel computing platform; for machines with CUDA, see https://pytorch.org/ for the correct installation command:

pip install torch==1.5.1+cpu torchvision==0.6.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
pip install wellcomeml[all]

2. Development

2.1 Build local virtualenv

make

2.2 Contributing to the docs

Make changes to the .rst files in /docs (please do not change the ones starting with wellcomeml, as those are generated automatically).

Navigate to the repository root and run

make update-docs

Verify that _build/html/index.html has generated correctly and submit a PR.

2.3 Release a new version (and upload to AWS S3/PyPI/GitHub)

First create a GitHub token with artifact write access, if you don't already have one, and export it as an environment variable:

export GITHUB_TOKEN=...

The checklist for a new release is:

  • Change wellcomeml/__version__.py
  • Add changelog
  • make dist
  • Verify that the new package was published correctly on PyPI and GitHub releases
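
Before running make dist, it can help to confirm which version string is about to be released. The following is a minimal sketch, assuming wellcomeml/__version__.py contains a line of the form __version__ = "x.y.z" (a common convention; the file's contents are not shown in this README). The function names parse_version and read_version are hypothetical.

```python
import re
from pathlib import Path

# Matches a line such as: __version__ = "1.2.3"  (single or double quotes)
_VERSION_RE = re.compile(r"""^__version__\s*=\s*['"]([^'"]+)['"]""", re.MULTILINE)


def parse_version(source):
    """Return the version string found in the text of a __version__.py file."""
    match = _VERSION_RE.search(source)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def read_version(path="wellcomeml/__version__.py"):
    """Read the version file (path taken from the checklist) and parse it."""
    return parse_version(Path(path).read_text(encoding="utf-8"))
```

Calling read_version() from the repository root returns the string to compare against the changelog entry.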

2.4 (Optional) Installing from other locations

pip3 install <relative path to this folder>

2.5 Transformers

On OSX, if you get a message complaining about the Rust compiler, install and initialise it with:

brew install rustup
rustup-init

3. Example usage of some modules

Examples can be found in the subfolder examples.

4. Troubleshooting

If you experience a problem installing or using WellcomeML, please open an issue. It can be worth setting the logging level to DEBUG first (export LOGGING_LEVEL=DEBUG), as this often exposes more information that helps resolve the issue.
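
For context, the following is a minimal sketch of how an environment variable such as LOGGING_LEVEL can drive Python's standard logging module. It illustrates the general pattern only: how WellcomeML reads the variable internally is an assumption, and level_from_env and configure_logging are hypothetical helpers.

```python
import logging
import os


def level_from_env(environ=None, default="INFO"):
    """Translate LOGGING_LEVEL (e.g. "DEBUG") into a logging level number."""
    environ = os.environ if environ is None else environ
    name = environ.get("LOGGING_LEVEL", default).upper()
    level = logging.getLevelName(name)
    # getLevelName returns an int for known names and a string otherwise,
    # so fall back to INFO when the name is not recognised.
    return level if isinstance(level, int) else logging.INFO


def configure_logging():
    """Configure the root logger from the environment and return the level."""
    level = level_from_env()
    logging.basicConfig(level=level)
    return level
```

Falling back to INFO on an unrecognised name keeps a typo in the variable from raising at import time.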

5. Extras

| Module | Description | Extras needed |
| --- | --- | --- |
| wellcomeml.ml.attention | Classes that implement keras layers for attention/self-attention | tensorflow |
| wellcomeml.ml.bert_classifier | Classifier to facilitate fine-tuning bert/scibert | tensorflow |
| wellcomeml.ml.bert_semantic_equivalence | Classifier to learn semantic equivalence between pairs of documents | tensorflow |
| wellcomeml.ml.bert_vectorizer | Text vectorizer based on bert/scibert | torch |
| wellcomeml.ml.bilstm | BiLSTM text classifier | tensorflow |
| wellcomeml.ml.clustering | Text clustering pipeline | NA |
| wellcomeml.ml.cnn | CNN text classifier | tensorflow |
| wellcomeml.ml.doc2vec_vectorizer | Text vectorizer based on doc2vec | NA |
| wellcomeml.ml.frequency_vectorizer | Text vectorizer based on TF-IDF | NA |
| wellcomeml.ml.keras_utils | Utils for computing metrics during training | tensorflow |
| wellcomeml.ml.keras_vectorizer | Text vectorizer based on Keras | tensorflow |
| wellcomeml.ml.sent2vec_vectorizer | Text vectorizer based on Sent2Vec | (Requires sent2vec, a non-pypi package) |
| wellcomeml.ml.similarity_entity_linking | A class to find most similar documents to a sentence in a corpus | tensorflow |
| wellcomeml.ml.spacy_classifier | A text classifier based on spacy | spacy, torch |
| wellcomeml.ml.spacy_entity_linking | Similar to similarity_entity_linking, but uses spacy | spacy |
| wellcomeml.ml.spacy_knowledge_base | Creates a knowledge base of entities, based on spacy | spacy |
| wellcomeml.ml.spacy_ner | Named entity recognition classifier based on spacy | spacy |
| wellcomeml.ml.transformers_tokenizer | Bespoke tokenizer based on transformers | transformers |
| wellcomeml.ml.vectorizer | Abstract class for vectorizers | NA |
| wellcomeml.ml.voting_classifier | Meta-classifier based on majority voting | NA |
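
The table can also be used programmatically to work out which extras to install. The following is a minimal sketch that copies a few rows into a dictionary and builds the matching pip command. EXTRAS_BY_MODULE and pip_command are hypothetical names, and only modules with a named extra are included, since this README does not state whether modules marked NA work with the vanilla install.

```python
# A few rows copied from the Extras table: module -> extras it needs.
EXTRAS_BY_MODULE = {
    "wellcomeml.ml.bert_classifier": ["tensorflow"],
    "wellcomeml.ml.bert_vectorizer": ["torch"],
    "wellcomeml.ml.spacy_classifier": ["spacy", "torch"],
    "wellcomeml.ml.spacy_ner": ["spacy"],
}


def pip_command(modules):
    """Build a pip install command covering the extras for the given modules."""
    # Collect each extra once and sort for a stable, readable command.
    extras = sorted({extra for m in modules for extra in EXTRAS_BY_MODULE[m]})
    if not extras:
        return "pip install wellcomeml"
    # No spaces inside the brackets, otherwise the shell splits the argument.
    return "pip install wellcomeml[{}]".format(",".join(extras))
```

For example, passing the spacy_classifier and bert_classifier module names yields a single command that requests the spacy, tensorflow and torch extras together.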