RK900 / Flu Prediction

Licence: gpl-3.0
Predicting Future Influenza Virus Sequences with Machine Learning

Flu-Prediction

Predicting Future Flu Virus Strains with Machine Learning. These programs predict future influenza virus strains based on previous trends in flu mutations.

Talks

Check out my talks at PyData and PyGotham.

License

Flu-Prediction is available under the GPLv3 License.

Dependencies

Python 2 or 3 with the NumPy, Biopython, and scikit-learn libraries installed.

To use:

Clone or download the repository. Install the dependencies by running pip install -r requirements.txt.

Input any HA (hemagglutinin) or NA (neuraminidase) flu protein sequence and its corresponding child sequence into the program, and it will output a predicted offspring of that specific flu strain.

Reading in a FASTA file with Biopython

Use the Biopython library to import a sequence from a FASTA file. You can use any flu FASTA file of your choosing, or one of those in the Flu-Data folder. That folder contains a wide variety of flu FASTA files, from single flu strains up to 1000 flu strains, grouped by flu subtype and protein. The data was obtained from the Influenza Research Database (IRD).

from Bio import SeqIO

# Put your FASTA file here. This assumes its first record is the
# parent strain and its second record is the child strain.
records = list(SeqIO.parse('myfasta.fasta', 'fasta'))
parent, child = records[0], records[1]
parent_seq = parent.seq
child_seq = child.seq
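
As a minimal sketch of what Biopython returns, the snippet below parses two records from an in-memory string instead of a file. The record names and the short sequences are made up for illustration and are not taken from the Flu-Data folder; treating the first record as the parent and the second as the child is also an assumption.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up records standing in for a parent strain and its child
fasta_text = """>parent_strain
MKTIIALSYI
>child_strain
MKTIIALSHI
"""

# SeqIO.parse accepts a file name or an open handle and yields SeqRecord objects
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
parent, child = records  # assumption: exactly two records, parent first

parent_seq = str(parent.seq)
child_seq = str(child.seq)

print(parent.id, parent_seq)
print(child.id, child_seq)
```

Each record exposes the header text as `id` and the residues as `seq`, which is what the encoding step consumes.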

Encoding

Then encode it with the Encoding_v2 module:

from Encoding_v2 import encoding

# Encode every position of the parent sequence
parent = []
for k in range(len(parent_seq)):
    encoded_parent = encoding(parent_seq[k])
    parent.append(encoded_parent)

# Encode every position of the child sequence
child = []
for k in range(len(child_seq)):
    encoded_child = encoding(child_seq[k])
    child.append(encoded_child)

This turns each sequence into a list of float64s. Then give X and y to the machine learning algorithm. Put any scikit-learn regressor (e.g., RandomForestRegressor or DecisionTreeRegressor) in the 'algorithm' parts of the code.
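
The encoding step can be sketched without the Encoding_v2 module, whose implementation is not shown here. The helper `encode_sequence` and the mapping table below are hypothetical; the mapping simply reuses the letter-to-number correspondence this README gives for the prediction output, and the real module may behave differently.

```python
# Hypothetical mapping (A=1, T=2, G=3, C=4); the real Encoding_v2 may differ
LETTER_TO_NUMBER = {"A": 1.0, "T": 2.0, "G": 3.0, "C": 4.0}


def encode_sequence(seq):
    """Return one float per character of the sequence."""
    return [LETTER_TO_NUMBER[letter] for letter in str(seq).upper()]


# Made-up parent/child pairs of equal length
parent_seqs = ["ATGC", "ATGG"]
child_seqs = ["ATGG", "ATCG"]

# One row per sequence, one column per position
X = [encode_sequence(s) for s in parent_seqs]
y = [encode_sequence(s) for s in child_seqs]

print(X)
print(y)
```

Keeping one row per sequence gives X and y the two-dimensional shape that scikit-learn estimators expect.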

Fitting the model

Substitute any scikit-learn model of your choosing for algorithm.

# "algorithm" is a placeholder; DecisionTreeRegressor is used here as an example
from sklearn.tree import DecisionTreeRegressor as algorithm
alg = algorithm()
alg.fit(X, y)
alg.predict(new_X)

The algorithm I use in this project is a Random Forests Regressor model:

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor() # Specify any parameters in the parentheses
rfr.fit(X, y)
rfr.predict(new_X)
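
For a sense of the shapes involved, here is a small sketch that fits a RandomForestRegressor on made-up encoded sequences. The numbers, `n_estimators=10`, and `random_state=0` are illustrative assumptions and not the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up encoded parents (X) and children (y): one row per sequence
X = np.array([[1, 2, 3, 4],
              [1, 2, 3, 3],
              [1, 2, 4, 3],
              [2, 2, 4, 3]], dtype=float)
y = np.array([[1, 2, 3, 3],
              [1, 2, 4, 3],
              [2, 2, 4, 3],
              [2, 2, 4, 4]], dtype=float)

# A 2-D y makes this a multi-output regression: one output per position
rfr = RandomForestRegressor(n_estimators=10, random_state=0)
rfr.fit(X, y)

# Predict the child of a new, unseen parent sequence
new_X = np.array([[2, 2, 4, 4]], dtype=float)
prediction = rfr.predict(new_X)

print(prediction.shape)
print(prediction)
```

The prediction has one row per input sequence and one column per position, and each value is an average over the trees, so it is generally not a whole number.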

Computing accuracy using K-Fold cross-validation:

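# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_score
# "algorithm" is your estimator instance, e.g. alg or rfr from above
algorithm_scores = cross_val_score(algorithm, X, y, cv=2)
print('Algorithm Trees: %s' % algorithm_scores)
print("Average Accuracy: %0.2f (+/- %0.2f)" % (algorithm_scores.mean()*100, algorithm_scores.std()*100))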
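
A runnable sketch of the cross-validation step on random stand-in data follows. The data and estimator settings are assumptions for illustration only. Note that when no `scoring` argument is given, `cross_val_score` uses the estimator's own `score` method, which for scikit-learn regressors is R2 rather than classification accuracy.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Random stand-in data: 8 encoded sequences with 6 positions each
rng = np.random.RandomState(0)
X = rng.randint(1, 5, size=(8, 6)).astype(float)
y = rng.randint(1, 5, size=(8, 6)).astype(float)

model = RandomForestRegressor(n_estimators=10, random_state=0)

# cv=2 splits the data into two folds; each fold is held out once for scoring
scores = cross_val_score(model, X, y, cv=2)

print(scores)
print("Mean score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
```

Because the targets here are random, the scores are expected to be poor; the point is only the shape of the call and of the result, one score per fold.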

Computing accuracy using R2 (for linear models):

from sklearn import metrics
y_pred = algorithm.predict(X_test)
print('Algorithm R2 score: %s' % metrics.r2_score(y_test, y_pred, multioutput='variance_weighted'))

Computing accuracy using Mean Squared Error (MSE):

from sklearn import metrics
y_pred = algorithm.predict(X_test)
# mean_squared_error does not accept 'variance_weighted'; 'uniform_average' is its default
print('Algorithm mean squared error: %s' % metrics.mean_squared_error(y_test, y_pred, multioutput='uniform_average'))
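
To make the two metrics concrete, the sketch below computes them by hand for a single output column. The helper names `mse_by_hand` and `r2_by_hand` and the sample values are hypothetical. With several output columns, scikit-learn combines the per-column results according to the `multioutput` argument.

```python
def mse_by_hand(y_true, y_pred):
    """Mean of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values: only the last position is predicted wrongly (3 instead of 4)
y_test = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 3.0]

mse = mse_by_hand(y_test, y_pred)
r2 = r2_by_hand(y_test, y_pred)

print("MSE: %0.2f" % mse)
print("R2: %0.2f" % r2)
```

A lower MSE and an R2 closer to 1 both indicate predictions closer to the true child sequence.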

Predicting Flu Strains:

y_pred = algorithm.predict(X)
print(y_pred)

The prediction output is a list of floats. Each number in the list corresponds to a base: A to 1, T to 2, G to 3, and C to 4.
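
One way to turn such a list back into letters is to round each float to the nearest code and look it up. This is only a sketch: the helper `decode_prediction` is hypothetical, and rounding with clamping to the range 1 to 4 is an assumption, not necessarily how this project decodes its output.

```python
# Inverse of the mapping described above
NUMBER_TO_LETTER = {1: "A", 2: "T", 3: "G", 4: "C"}


def decode_prediction(values):
    """Round each float to the nearest code, clamp to 1..4, and map to a letter."""
    letters = []
    for value in values:
        code = int(round(value))
        code = min(4, max(1, code))  # keep out-of-range predictions valid
        letters.append(NUMBER_TO_LETTER[code])
    return "".join(letters)


# Made-up prediction values
y_pred = [1.1, 1.9, 3.2, 3.8]
decoded = decode_prediction(y_pred)
print(decoded)
```

Values that fall exactly halfway between two codes are ambiguous, so a real pipeline may want an explicit tie-breaking rule.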
