
seyonechithrananda / Bert Loves Chemistry

License: MIT
bert-loves-chemistry: a repository of HuggingFace models applied to chemical SMILES data for drug design, chemical modelling, etc.

Projects that are alternatives to or similar to Bert Loves Chemistry

Solution Accelerator Many Models
Stars: ✭ 104 (+0.97%)
Mutual labels:  jupyter-notebook
Ossdc Visionbasedacc
Discuss requirements and develop code for #1-mvp-vbacc MVP (see also this channel on ossdc.org Slack)
Stars: ✭ 104 (+0.97%)
Mutual labels:  jupyter-notebook
Satimg
Satellite data processing experiments
Stars: ✭ 104 (+0.97%)
Mutual labels:  jupyter-notebook
Ec2 Spot Workshops
Collection of workshops to demonstrate best practices in using Amazon EC2 Spot Instances. https://aws.amazon.com/ec2/spot/
Stars: ✭ 104 (+0.97%)
Mutual labels:  jupyter-notebook
Practical Ml W Python
Source code for 'Practical Machine Learning with Python' by Dipanjan Sarkar, Raghav Bali, and Tushar Sharma
Stars: ✭ 104 (+0.97%)
Mutual labels:  jupyter-notebook
Sharing isl python
An Introduction to Statistical Learning with Applications in Python
Stars: ✭ 105 (+1.94%)
Mutual labels:  jupyter-notebook
Nlp essentials
Essential and fundamental aspects of Natural Language Processing, with hands-on examples and case studies
Stars: ✭ 104 (+0.97%)
Mutual labels:  jupyter-notebook
Tensorflow2.0 Examples
🙄 Difficult algorithm, Simple code.
Stars: ✭ 1,397 (+1256.31%)
Mutual labels:  jupyter-notebook
Yabox
Yet another black-box optimization library for Python
Stars: ✭ 103 (+0%)
Mutual labels:  jupyter-notebook
Python Fundamentals
Introductory Python Series for UC Berkeley's D-Lab
Stars: ✭ 104 (+0.97%)
Mutual labels:  jupyter-notebook
Gen Quickstart
Docker file for building Gen and Jupyter notebooks for tutorials and case studies
Stars: ✭ 104 (+0.97%)
Mutual labels:  jupyter-notebook
Pose Interpreter Networks
Real-Time Object Pose Estimation with Pose Interpreter Networks (IROS 2018)
Stars: ✭ 104 (+0.97%)
Mutual labels:  jupyter-notebook
Circle Line Analytics
Stars: ✭ 104 (+0.97%)
Mutual labels:  jupyter-notebook
Intro To Deep Learning
A collection of materials to help you learn about deep learning
Stars: ✭ 103 (+0%)
Mutual labels:  jupyter-notebook
Keras Hello World
Stars: ✭ 104 (+0.97%)
Mutual labels:  jupyter-notebook
Face Id With Medical Masks
Face ID recognition with medical masks
Stars: ✭ 103 (+0%)
Mutual labels:  jupyter-notebook
Dmm
Deep Markov Models
Stars: ✭ 103 (+0%)
Mutual labels:  jupyter-notebook
Cenpy
Explore and download data from Census APIs
Stars: ✭ 104 (+0.97%)
Mutual labels:  jupyter-notebook
Team Learning
Mainly showcases Datawhale's team learning plans.
Stars: ✭ 1,397 (+1256.31%)
Mutual labels:  jupyter-notebook
Partia Computing Michaelmas
Activities and exercises for the Part IA computing course in Michaelmas Term
Stars: ✭ 104 (+0.97%)
Mutual labels:  jupyter-notebook

ChemBERTa

ChemBERTa: A collection of BERT-like models applied to chemical SMILES data for drug design, chemical modelling, and property prediction. To be presented at Baylearn and the Royal Society of Chemistry's Chemical Science Symposium.

Tutorial
Arxiv Paper
Poster
Abstract
BibTex

License: MIT

Right now the notebooks are all for the RoBERTa model (a variant of BERT) trained on the task of masked-language modelling (MLM). Training was run for 10 epochs, until the loss converged to around 0.26 on the ZINC 250k dataset. Model weights for ChemBERTa pre-trained on various datasets (ZINC 100k, ZINC 250k, PubChem 100k, PubChem 250k, PubChem 1M, PubChem 10M) are available via HuggingFace. We expect to continue to release larger models pre-trained on even larger subsets of ZINC, ChEMBL, and PubChem in the near future.
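As a quick illustration of how the pre-trained tokenizer handles SMILES input, the snippet below (a minimal sketch, not taken from the repository's notebooks) loads the seyonec/ChemBERTa-zinc-base-v1 checkpoint referenced in the Example section and tokenizes the SMILES string for aspirin; the exact token split depends on the learned BPE vocabulary:

from transformers import AutoTokenizer

# Load the pre-trained SMILES tokenizer from HuggingFace.
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

# Tokenize the SMILES string for aspirin into subword tokens.
print(tokenizer.tokenize("CC(=O)Oc1ccccc1C(=O)O"))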

This library is currently primarily a set of notebooks covering our pre-training and fine-tuning setup; it will be updated soon with model implementation and attention-visualization code, likely after the arXiv publication. Stay tuned!

I hope this is of use to developers, students, and researchers exploring the use of transformers and the attention mechanism for chemistry!

Citing Our Work

Please cite ChemBERTa's arXiv paper if you have used these models, notebooks, or examples in any way. The link to the BibTeX is available here.

Example

You can load the tokenizer + model for MLM prediction tasks using the following code:

from transformers import AutoModelWithLMHead, AutoTokenizer, pipeline

# Any model weights from the link above will work here.
# (On recent versions of transformers, AutoModelForMaskedLM is the
# preferred replacement for the deprecated AutoModelWithLMHead.)
model = AutoModelWithLMHead.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
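
The pipeline can then rank candidate completions for a masked SMILES string. The snippet below is a minimal sketch: the masked aspirin string is an illustrative input of our choosing, and <mask> is the mask token used by RoBERTa-style tokenizers:

# Aspirin with one token masked (illustrative input).
results = fill_mask("CC(=O)Oc1ccccc1<mask>(=O)O")
for prediction in results:
    # Each prediction is a dict containing the filled-in sequence and its score.
    print(prediction["sequence"], prediction["score"])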

The abstract for this method is detailed here. We expect to release the full paper on arXiv in late August.

Todo:

  • [ ] Official DeepChem implementation of ChemBERTa using the model API (in progress)
  • [x] Open-source the attention-visualization suite used in the paper (after formal publication, beginning of September).
  • [x] Release larger pre-trained models and support for a wider array of property prediction tasks (BBBP, etc.) - see HuggingFace, and the fine-tuning sketch after this list
  • [x] Finish writing the notebook to train the model
  • [x] Finish the notebook to preload and run predictions on a single molecule → test that HuggingFace works
  • [x] Train the RoBERTa model until convergence
  • [x] Upload weights onto HuggingFace
  • [x] Create a tutorial using the evaluation + fine-tuning notebook.
  • [x] Create documentation + writing, visualizations for the notebook.
  • [x] Set up PR into DeepChem
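
For the property prediction tasks mentioned above (e.g. BBBP), the snippet below is a minimal sketch, not the repository's fine-tuning notebook: it is a generic transformers sequence-classification setup, shown only to illustrate wiring a classification head onto the pre-trained checkpoint. The training loop and dataset loading are omitted, and num_labels=2 assumes a binary task.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

# Attach a randomly initialized classification head to the pre-trained encoder.
# num_labels=2 assumes a binary task such as BBBP (permeable vs. not permeable).
model = AutoModelForSequenceClassification.from_pretrained(
    "seyonec/ChemBERTa-zinc-base-v1", num_labels=2
)

# Tokenize a batch of SMILES strings and run a forward pass.
inputs = tokenizer(["CC(=O)Oc1ccccc1C(=O)O"], return_tensors="pt", padding=True)
logits = model(**inputs).logits  # shape: (batch_size, num_labels)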