aqlaboratory / Proteinnet

Licence: MIT
Standardized data set for machine learning of protein structure

Programming Languages

python

Projects that are alternatives of or similar to Proteinnet

Cluepretrainedmodels
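A collection of high-quality Chinese pre-trained models: state-of-the-art large models, the fastest small models, and specialized similarity models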
Stars: ✭ 493 (-25.75%)
Mutual labels:  dataset
Total Text Dataset
Total Text Dataset. It consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.
Stars: ✭ 580 (-12.65%)
Mutual labels:  dataset
Awesome chinese medical nlp
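A curated collection of public Chinese medical NLP resources: terminologies / corpora / word embeddings / pre-trained models / knowledge graphs / named entity recognition / QA / information extraction / models / papers / etc.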
Stars: ✭ 623 (-6.17%)
Mutual labels:  dataset
Cdap
An open source framework for building data analytic applications.
Stars: ✭ 509 (-23.34%)
Mutual labels:  dataset
Nas Bench 201
NAS-Bench-201 API and Instruction
Stars: ✭ 537 (-19.13%)
Mutual labels:  dataset
Cvat
Powerful and efficient Computer Vision Annotation Tool (CVAT)
Stars: ✭ 6,557 (+887.5%)
Mutual labels:  dataset
Tensorflow object tracking video
Object Tracking in Tensorflow (Localization, Detection, Classification), developed to participate in the ImageNET VID competition
Stars: ✭ 491 (-26.05%)
Mutual labels:  dataset
Devblogs
+2600 developer-related blogs and publications.
Stars: ✭ 637 (-4.07%)
Mutual labels:  dataset
Hate Speech And Offensive Language
Repository for the paper "Automated Hate Speech Detection and the Problem of Offensive Language", ICWSM 2017
Stars: ✭ 543 (-18.22%)
Mutual labels:  dataset
Gensim Data
Data repository for pretrained NLP models and NLP corpora.
Stars: ✭ 622 (-6.33%)
Mutual labels:  dataset
Pokemon.json
Pokemon dataset in JSON.
Stars: ✭ 511 (-23.04%)
Mutual labels:  dataset
Awesome Twitter Data
A list of Twitter datasets and related resources.
Stars: ✭ 533 (-19.73%)
Mutual labels:  dataset
Couplet Dataset
Dataset for couplets. A database of 700,000 couplets.
Stars: ✭ 589 (-11.3%)
Mutual labels:  dataset
Voice datasets
🔊 A comprehensive list of open-source datasets for voice and sound computing (50+ datasets).
Stars: ✭ 494 (-25.6%)
Mutual labels:  dataset
Esc 50
ESC-50: Dataset for Environmental Sound Classification
Stars: ✭ 631 (-4.97%)
Mutual labels:  dataset
Doccano
Open source annotation tool for machine learning practitioners.
Stars: ✭ 5,600 (+743.37%)
Mutual labels:  dataset
Open stt
Open STT
Stars: ✭ 584 (-12.05%)
Mutual labels:  dataset
Awesome Project Ideas
Curated list of Machine Learning, NLP, Vision, Recommender Systems Project Ideas
Stars: ✭ 6,114 (+820.78%)
Mutual labels:  dataset
Uhttbarcodereference
Universe-HTT barcode reference
Stars: ✭ 634 (-4.52%)
Mutual labels:  dataset
Label Studio
Label Studio is a multi-type data labeling and annotation tool with standardized output format
Stars: ✭ 7,264 (+993.98%)
Mutual labels:  dataset

ProteinNet

ProteinNet is a standardized data set for machine learning of protein structure. It provides protein sequences, structures (secondary and tertiary), multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and standardized training / validation / test splits. ProteinNet builds on the biennial CASP assessments, which carry out blind predictions of recently solved but publicly unavailable protein structures, to provide test sets that push the frontiers of computational methodology. It is organized as a series of data sets, spanning CASP 7 through 12 (covering a ten-year period), to provide a range of data set sizes that enable assessment of new methods in relatively data-poor and data-rich regimes.

Note that this is a preliminary release. The raw data used for construction of the data sets, as well as the MSAs, are not yet generally available. However, the raw MSA data (4TB) for ProteinNet 12 is available upon request. Transfer requires downloading a Globus client. See the raw data section for more information.

Motivation

Protein structure prediction is one of the central problems of biochemistry. While the problem is well-studied within the biological and chemical sciences, it is less well represented within the machine learning community. We suspect this is due to two reasons: 1) a high barrier to entry for non-domain experts, and 2) a lack of standardized training / validation / test splits that would make fair and consistent comparisons across methods possible. If these two issues are addressed, protein structure prediction can become a major source of innovation in ML research, alongside the canonical tasks of computer vision, NLP, and speech recognition. Much like ImageNet helped spur the development of new computer vision techniques, ProteinNet aims to facilitate ML research on protein structure by providing a standardized data set, and standardized training / validation / test splits, that any group can use with minimal effort to get started.

Approach

Once every two years the CASP assessment is held. During this competition, structure predictors from across the globe are presented with protein sequences whose structures have been recently solved but have not yet been made publicly available. The predictors make blind predictions of these structures, which are then assessed for their accuracy. The CASP structures thus provide a standardized benchmark for how well prediction methods perform at a given moment in time.

The basic idea behind ProteinNet is to piggyback on CASP by using CASP structures as test sets. ProteinNet augments these test sets with training / validation sets that reset the historical record to the conditions preceding each CASP experiment. In particular, ProteinNet restricts the set of sequences (used for building PSSMs and MSAs) and structures to those available prior to the commencement of each CASP. This is critical because standard sequence databases, such as those used with BLAST, do not maintain historical versions. We use time-reset versions of the UniParc dataset as well as metagenomic sequences from the JGI to build sequence databases for deriving MSAs.

ProteinNet further provides carefully split validation sets that range in difficulty from easy (>90% seq. id.), useful for assessing a model's ability to predict minor changes in protein structure such as mutations, to extremely difficult (<10% seq. id.), useful for assessing a model's ability to predict entirely new protein folds, as in the CASP Free Modeling (FM) category. In a sense, our validation sets provide a series of transferability challenges to test how well a model can withstand distributional shifts in the data set. We have found that our most difficult validation subsets exceed the difficulty of CASP FM targets.
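
The validation tiers described above are defined by sequence identity. The following is a minimal sketch of how a percent identity between two already-aligned sequences could be computed and mapped to a difficulty tier. It is not ProteinNet's actual clustering procedure: the function names, the convention of ignoring gapped columns, and the intermediate thresholds between 10% and 90% are assumptions made for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: gapped columns are excluded from the denominator. Other
    conventions exist and give different numbers.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def difficulty_bucket(identity: float,
                      thresholds=(10, 20, 30, 40, 50, 70, 90)) -> str:
    """Map a percent identity to the first threshold it falls below.

    Only the <10% and >90% tiers are stated in the text; the thresholds
    in between are hypothetical.
    """
    for t in thresholds:
        if identity < t:
            return f"<{t}%"
    return f">={thresholds[-1]}%"


# Example: three of four ungapped columns match.
identity = percent_identity("ACDE", "ACDF")
bucket = difficulty_bucket(identity)
```

A validation protein whose closest training sequence falls in the lowest bucket corresponds to the "extremely difficult" end of the range, while one in the highest bucket corresponds to the "easy" end.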

Download

ProteinNet records are provided in two forms: human- and machine-readable text files that can be used programmatically by any tool, and TensorFlow-specific TFRecord files. More information on the file format can be found in the documentation here.

CASP7 CASP8 CASP9 CASP10 CASP11 CASP12*
Text-based Text-based Text-based Text-based Text-based Text-based
TF Records TF Records TF Records TF Records TF Records TF Records
Secondary Structure Data
ASTRAL entries
PDB entries

* CASP12 test set is incomplete due to embargoed structures. Once the embargo is lifted we will release all structures.
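
As a rough illustration of how the text-based records might be consumed programmatically, the sketch below reads a stream of records into dictionaries. It assumes that each record is a run of bracketed section labels (for example `[ID]` or `[PRIMARY]`), each followed by one or more data lines, and that records are separated by blank lines. The sample record, its labels, and the function name `read_records` are hypothetical; the file format documentation remains the authority on the actual layout.

```python
import io
from typing import Dict, Iterator, List


def read_records(handle) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per record, mapping a section label to its raw lines.

    Assumed layout: "[LABEL]" lines introduce sections, blank lines
    separate records.
    """
    record: Dict[str, List[str]] = {}
    section = None
    for raw in handle:
        line = raw.strip()
        if not line:
            # A blank line closes the current record, if any.
            if record:
                yield record
            record, section = {}, None
        elif line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            record[section] = []
        elif section is not None:
            record[section].append(line)
    if record:
        yield record  # last record may not be followed by a blank line


# Hypothetical two-record sample, used only to exercise the reader.
sample = (
    "[ID]\n1ABC_1_A\n[PRIMARY]\nMKV\n[MASK]\n-++\n"
    "\n"
    "[ID]\n2XYZ_1_B\n[PRIMARY]\nGA\n[MASK]\n++\n"
)
records = list(read_records(io.StringIO(sample)))
```

Keeping each section as a list of raw lines avoids committing to a numeric layout for any particular section; converting sections to arrays is left to the caller once the format has been checked against the documentation.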

Documentation

PyTorch Parser

ProteinNet includes an official TensorFlow-based parser. Jeppe Hallgren has kindly created a PyTorch-based parser that is available here.

Extensions

SideChainNet extends ProteinNet by adding angle and atomic coordinate information for side chain atoms.

Citation

Please cite the ProteinNet paper in BMC Bioinformatics.

Acknowledgements

Construction of this data set consumed millions of compute hours and was possible thanks to the generous support of the HMS Laboratory of Systems Pharmacology, the Harvard Program in Therapeutic Science, and the Research Computing group at Harvard Medical School. We also thank Martin Steinegger and Milot Mirdita for their extensive help with the MMseqs2 and HHblits software packages, Sergey Ovchinnikov for providing metagenomic sequences, Andriy Kryshtafovych for his assistance with CASP data, and Sean Eddy for his help with the HMMer software package. This data set is hosted by the HMS Research Information Technology Solutions group at Harvard University.

Funding

This work was supported by NIGMS grant P50GM107618 and NCI grant U54-CA225088.
