ProteinQure / cbh21-protein-solubility-challenge

Licence: MIT License
Template with code & dataset for the "Structural basis for solubility in protein expression systems" challenge at the Copenhagen Bioinformatics Hackathon 2021.

Structural basis for solubility in protein expression systems

Large-scale protein production for biotechnology and biopharmaceutical applications relies on high protein solubility in expression systems. Solubility has been measured for a significant fraction of the E. coli and S. cerevisiae proteomes, and these datasets are routinely used to train predictors of protein solubility in different organisms. Thanks to continued advances in experimental structure determination and modelling, many of these solubility measurements can now be paired with accurate structural models.

The challenge is mentored by Christopher Ing and Mark Fingerhuth.

Aim of the challenge

The objective of this project is to use the provided dataset of paired protein structures and solubility values to build a solubility predictor whose accuracy is comparable to that of sequence-based predictors reported in the literature. The dataset was created by following the curation procedure described in the SOLart paper, and this hackathon project has a similar aim to that manuscript.

The dataset

The process of generating the dataset is described in the SOLart manuscript. At a high level, all experimentally tested E. coli and S. cerevisiae proteins were matched through Uniprot IDs to known crystallographic structures or to homology models with high sequence similarity. After balancing the fold types using CATH, a dataset containing a balanced spread of solubility values was produced. The resulting proteins for training and testing were disclosed in the supplemental material of that paper as a list of (Uniprot, PDB, Chain, Solubility) records. The PDB files were not included in that work, so we had to re-extract them from SWISS-MODEL. Whenever a crystallographic structure was available and had high coverage of the Uniprot sequence, it was used. In some cases, the PDB templates used in the original SOLart paper had been superseded by improved templates, and we opted to take the models with the highest resolution and highest sequence identity in our updated dataset. We stripped away all irrelevant chains and heteroatoms.
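The last step above (keeping one chain and dropping heteroatoms) can be illustrated with a small filter over PDB text. This is only a sketch of the general idea, not the script that was used to build the dataset; the function name `keep_chain_atoms` is hypothetical. It relies on the fixed-column PDB format, in which the chain identifier is stored in column 22.

```python
def keep_chain_atoms(pdb_text: str, chain_id: str) -> str:
    """Keep only the ATOM records of one chain (illustrative sketch).

    HETATM records (ligands, waters, ions) and atoms that belong to other
    chains are dropped. All non-coordinate records are dropped as well.
    """
    kept = []
    for line in pdb_text.splitlines():
        # PDB is a fixed-column format: the record name occupies columns 1-6
        # and the chain identifier sits in column 22 (string index 21).
        if line.startswith("ATOM") and len(line) > 21 and line[21] == chain_id:
            kept.append(line)
    kept.append("END")
    return "\n".join(kept) + "\n"
```

A full-featured parser such as the one in Biopython would be a more robust choice for real structures (alternate locations, multiple models, and so on); the filter above only shows what "stripping" means at the level of PDB records.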

If issues are identified with individual structures, please refer to the Uniprot ID and manually investigate the best template. In some cases, we needed to improve structure correctness by modelling missing atoms/residues inside the Chemical Computing Group software MOE on a case-by-case basis.

The dataset can be found in the data/ subdirectory - it is already divided into training/ and test/ data. The training/ data comes with solubility_values.csv and solublity_values.yaml (same content just different format) which both contain the solubility target values for all the PDB files provided in that directory. Note that each PDB file is named after the Uniprot identifier of the respective protein and the protein column in the solubility_values.csv also contains the Uniprot identifiers.
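As a starting point, the training labels can be paired with their structure files using only the standard library. This is a minimal sketch under two assumptions that are not stated above: that the structure files use the extension `.pdb`, and that the target column in solubility_values.csv is named `solubility` (the `protein` column is described above). The function name `load_training_pairs` is hypothetical.

```python
import csv
from pathlib import Path


def load_training_pairs(training_dir):
    """Return (uniprot_id, pdb_path, solubility) for each labelled structure."""
    training_dir = Path(training_dir)
    pairs = []
    with open(training_dir / "solubility_values.csv", newline="") as handle:
        for row in csv.DictReader(handle):
            uniprot_id = row["protein"]
            # Assumption: each structure file is named "<Uniprot ID>.pdb".
            pdb_path = training_dir / f"{uniprot_id}.pdb"
            if not pdb_path.exists():
                continue  # skip labels without a matching structure file
            # Assumption: the target column is called "solubility".
            pairs.append((uniprot_id, pdb_path, float(row["solubility"])))
    return pairs
```

Calling `load_training_pairs("data/training")` from the repository root would then give one tuple per protein, ready to be turned into features and targets.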

The test/ dataset consists of three different subdirectories (protein structures derived from different organisms and with different approaches) and you should NOT use them for any training. Only the yeast_crystal_structs/ directory contains solubility_values.csv and solublity_values.yaml (same content just different format) files which you can use for some local testing & validation. In order to find out your performance on the entire test dataset you need to use the automated benchmarking system (see below).
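For local validation against the yeast_crystal_structs/ labels, predictions can be compared with the reference values by matching on the Uniprot ID. The sketch below reports RMSE and Pearson correlation purely as illustrative metrics; the metric used by the benchmarking system is not stated here, so treat these numbers as a rough guide only. The function names and the `solubility` column name in the reference CSV are assumptions.

```python
import csv
import math


def read_solubility_csv(path):
    """Read a protein,solubility CSV into a {uniprot_id: value} dict."""
    with open(path, newline="") as handle:
        # Assumption: the value column is called "solubility".
        return {row["protein"]: float(row["solubility"])
                for row in csv.DictReader(handle)}


def evaluate(predicted, reference):
    """Compare two {uniprot_id: value} dicts on their shared proteins."""
    shared = sorted(set(predicted) & set(reference))
    if len(shared) < 2:
        raise ValueError("need at least two shared proteins to evaluate")
    x = [predicted[key] for key in shared]
    y = [reference[key] for key in shared]
    n = len(shared)
    rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / n)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # Pearson correlation is undefined when either side is constant.
    pearson = cov / (norm_x * norm_y) if norm_x and norm_y else float("nan")
    return {"n": n, "rmse": rmse, "pearson": pearson}
```

A typical local check would read predictions.csv and the solubility_values.csv in yeast_crystal_structs/ with `read_solubility_csv` and pass both dicts to `evaluate`; proteins from the other two test subsets are simply ignored because they have no local labels.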

Example output

Your code should output a file called predictions.csv in the following format:

protein,solubility
P69829,83
P31133,62

where the protein column contains the Uniprot ID (which corresponds to the filename of the PDB file) and the solubility column contains the predicted solubility value (int or float).

Note that there are three (!) test subsets, but you are expected to submit all predictions in a single file (not three) for the benchmarking system to work.
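One way to produce a single predictions.csv covering all test subsets is to walk every subdirectory of test/ and write one row per structure. This is a minimal sketch: `predict` stands for your own model (any callable that maps a PDB path to a number), the function names are hypothetical, and the `.pdb` extension is an assumption.

```python
import csv
from pathlib import Path


def collect_test_structures(test_dir):
    """Find every PDB file in every subdirectory of the test directory."""
    # Assumption: structure files use the ".pdb" extension.
    return sorted(Path(test_dir).glob("*/*.pdb"))


def write_predictions(pdb_paths, predict, out_path="predictions.csv"):
    """Write one protein,solubility row per structure into a single file."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["protein", "solubility"])
        for pdb_path in pdb_paths:
            # The file stem is the Uniprot ID, e.g. "P69829" for "P69829.pdb".
            writer.writerow([pdb_path.stem, predict(pdb_path)])
```

For example, `write_predictions(collect_test_structures("data/test"), my_model)` would cover all three subsets in one pass, with `my_model` being whatever predictor you have trained.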

Automated benchmarking system

The continuous integration script in .github/workflows/ci.yml automatically builds the Dockerfile on every commit to the main branch. The resulting Docker image is published as your hackathon submission to https://biolib.com/<YourTeam>/<TeamName>. For this to work, make sure you set BIOLIB_TOKEN and BIOLIB_PROJECT_URI as repository secrets.

To read more about the benchmarking system click here.

Say thanks

Give this repo a star.

Star the ProteinQure org on GitHub.
