nageshsinghc4 / DNA-Sequence-Machine-learning

Licence: other
Understand DNA structure and how machine learning can be used to work with DNA sequence data.

DNA Sequencing using Machine learning

The double helix is the correct chemical representation of DNA. But DNA is special. It is a chain of nucleotides, each carrying one of four nitrogenous bases: adenine (A), thymine (T), guanine (G), and cytosine (C). We usually refer to them simply as A, C, G, and T.

A genome is a complete collection of DNA in an organism. All living species possess a genome, but they differ considerably in size.

As a data-driven science, genomics makes extensive use of machine learning to capture dependencies in data and to suggest new biological hypotheses. However, extracting new insights from the exponentially growing volume of genomics data requires more powerful machine learning models. By efficiently leveraging large data sets, deep learning has transformed fields such as computer vision and natural language processing. It has become the method of choice for many genomics modeling tasks, including predicting the influence of genetic variation on gene regulatory mechanisms such as DNA accessibility and splicing.

Here, we look at DNA structure and at how machine learning can be used to work with DNA sequence data.

Prerequisites:

  1. Biopython: a collection of Python modules that provide functions for working with DNA, RNA, and protein sequences.

pip install biopython

  2. Squiggle: a software tool that automatically generates interactive, web-based, two-dimensional graphical representations of raw DNA sequences.

pip install Squiggle

DNA sequence data is usually stored in a file format called FASTA. A FASTA record consists of a header line that starts with the greater-than symbol (>) and contains annotations, followed by one or more lines that contain the sequence itself:

“AAGGTGAGTGAAATCTCAACACGAGTATGGTTCTGAGAGTAGCTCTGTAACTCTGAGG”
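To make the format concrete, here is a minimal sketch of a FASTA reader written with the Python standard library only. The function name `parse_fasta` and the header text in the example record are hypothetical and chosen for illustration; in practice, Biopython's `Bio.SeqIO.parse` handles this job, including edge cases this sketch ignores.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = ""
        elif header is not None:
            # A sequence may wrap over several lines, so append to the current record.
            records[header] += line.upper()
    return records


# Hypothetical record: the header text is made up for illustration.
example = """>example_seq hypothetical annotation
AAGGTGAGTGAAATCTCAACACGAGTATGG
TTCTGAGAGTAGCTCTGTAACTCTGAGG
"""

records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

The sketch keeps the whole header as the dictionary key and joins wrapped sequence lines into a single string, which is usually the form needed before any feature extraction.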

In this repository, we build a classification model that is trained on human DNA sequences and predicts a gene family from the DNA sequence of the coding region. To test the model, we use DNA sequences from humans, dogs, and chimpanzees and compare the accuracies.
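A classifier needs fixed-size inputs, while DNA sequences vary in length. This README does not say how the sequences are encoded, so the following is only a sketch of one common option, assumed here for illustration: splitting each sequence into overlapping k-mers ("words" of length k) and counting them. The function names `kmers` and `kmer_counts` and the default `k=6` are hypothetical choices, not taken from the project.

```python
from collections import Counter
from typing import List


def kmers(sequence: str, k: int = 6) -> List[str]:
    """Return all overlapping substrings of length k, in order of appearance."""
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(sequence: str, k: int = 6) -> Counter:
    """Count how often each k-mer occurs in the sequence."""
    return Counter(kmers(sequence, k))


# Toy sequence, made up for illustration.
words = kmers("ATGCATGCA", k=4)
counts = kmer_counts("ATGCATGCA", k=4)
print(words)
print(counts)
```

Counts like these can be arranged into a feature vector per sequence and passed to any standard classifier; which classifier and which k work best are empirical questions for the data at hand.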

For a step-by-step walkthrough of the project, read the article on www.theaidream.com; for the implementation, see my Kaggle notebook.
