Edinburgh-Genome-Foundry / crazydoc

Licence: MIT License
Read DNA sequences from colourful Microsoft Word documents

Projects that are alternatives of or similar to crazydoc

poly
A Go package for engineering organisms.
Stars: ✭ 270 (+1400%)
Mutual labels:  synthetic-biology, molecular-biology
reg-gen
Regulatory Genomics Toolbox: Python library and set of tools for the integrative analysis of high throughput regulatory genomics data.
Stars: ✭ 64 (+255.56%)
Mutual labels:  bioinformatics
OpenGene.jl
(No maintenance) OpenGene, core libraries for NGS data analysis and bioinformatics in Julia
Stars: ✭ 60 (+233.33%)
Mutual labels:  bioinformatics
GenomicDataCommons
Provide R access to the NCI Genomic Data Commons portal.
Stars: ✭ 64 (+255.56%)
Mutual labels:  bioinformatics
full spectrum bioinformatics
An open-access bioinformatics text
Stars: ✭ 26 (+44.44%)
Mutual labels:  bioinformatics
ccs
CCS: Generate Highly Accurate Single-Molecule Consensus Reads (HiFi Reads)
Stars: ✭ 79 (+338.89%)
Mutual labels:  bioinformatics
dna-sculpture
3D printed sculpture of a DNA molecule, showing my own genome
Stars: ✭ 22 (+22.22%)
Mutual labels:  bioinformatics
chromap
Fast alignment and preprocessing of chromatin profiles
Stars: ✭ 93 (+416.67%)
Mutual labels:  bioinformatics
geneview
Genomics data visualization in Python by using matplotlib.
Stars: ✭ 38 (+111.11%)
Mutual labels:  bioinformatics
StackedDAE
Stacked Denoising AutoEncoder based on TensorFlow
Stars: ✭ 23 (+27.78%)
Mutual labels:  bioinformatics
bio tools
Useful bioinformatic scripts
Stars: ✭ 35 (+94.44%)
Mutual labels:  bioinformatics
referenceseeker
Rapid determination of appropriate reference genomes.
Stars: ✭ 65 (+261.11%)
Mutual labels:  bioinformatics
flexidot
Highly customizable, ambiguity-aware dotplots for visual sequence analyses
Stars: ✭ 73 (+305.56%)
Mutual labels:  bioinformatics
SSAKE
🍶Genome assembly with short sequence reads
Stars: ✭ 20 (+11.11%)
Mutual labels:  dna-sequences
perbase
Per-base per-nucleotide depth analysis
Stars: ✭ 46 (+155.56%)
Mutual labels:  bioinformatics
adversarial-relation-classification
Unsupervised domain adaptation method for relation extraction
Stars: ✭ 18 (+0%)
Mutual labels:  bioinformatics
SumStatsRehab
GWAS summary statistics files QC tool
Stars: ✭ 19 (+5.56%)
Mutual labels:  bioinformatics
plasmidtron
Assembling the cause of phenotypes and genotypes from NGS data
Stars: ✭ 27 (+50%)
Mutual labels:  bioinformatics
epiviz
EpiViz is a scientific information visualization tool for genetic and epigenetic data, used to aid in the exploration and understanding of correlations between various genome features.
Stars: ✭ 65 (+261.11%)
Mutual labels:  bioinformatics
seqviz
DNA sequence viewer supporting custom, GenBank, FASTA, NCBI accession, and iGEM input.
Stars: ✭ 99 (+450%)
Mutual labels:  synthetic-biology

Crazydoc is a Python library to parse one of the most common DNA representation formats: the joyfully coloured and stylishly annotated MS-Word document.

Crazydoc returns Biopython records of the sequences contained in an MS-Word document, with record features corresponding to the various sequence highlightings (background color, boldness, italics, case change, etc.). The records can be saved as GenBank files or easily plotted.
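As a rough illustration of the idea, the sketch below turns a list of styled text runs into a plain sequence plus feature spans. It is not crazydoc's implementation: the `Feature` class, the `runs_to_features` function and the run format (a text fragment paired with a dict of styles) are hypothetical names chosen for this example, and only the standard library is used.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    start: int  # 0-based, inclusive
    end: int    # exclusive
    label: str


def runs_to_features(runs):
    """Concatenate styled runs and record where each style applies.

    `runs` is a list of (text, styles) pairs, for example
    ("ATG", {"bold": True}). Adjacent runs sharing the same value for a
    style are merged into a single feature. Hypothetical helper, for
    illustration only.
    """
    parts = []
    features = []
    open_styles = {}  # style name -> (value, start position)
    position = 0
    for text, styles in runs:
        # Close every open style whose value differs in this run.
        for name in list(open_styles):
            value, start = open_styles[name]
            if styles.get(name) != value:
                features.append(Feature(start, position, f"{name}={value}"))
                del open_styles[name]
        # Open styles that start with this run.
        for name, value in styles.items():
            if value and name not in open_styles:
                open_styles[name] = (value, position)
        parts.append(text)
        position += len(text)
    # Close whatever is still open at the end of the sequence.
    for name, (value, start) in open_styles.items():
        features.append(Feature(start, position, f"{name}={value}"))
    features.sort(key=lambda f: (f.start, f.end, f.label))
    return "".join(parts), features


sequence, features = runs_to_features([
    ("ATG", {"bold": True}),
    ("CCC", {"bold": True, "highlight_color": "yellow"}),
    ("TAA", {}),
])
```

Here the bold style spans the first two runs and the yellow highlight only the second, so the two resulting features overlap, much as overlapping annotations would in a Word document.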

Motivation

While other standards such as FASTA or GenBank are better supported by modern sequence editors, none enjoys the same popularity among molecular biologists as MS-Word's .docx format, which is limited only by the sophistication and creativity of the user.

Relying on a loose syntax and unclear specifications, this format has however suffered from a lack of support in the developer community and is generally incompatible with mainstream software pipelines. This library converts MS-Word DNA sequences to more computing-friendly formats: Biopython records, FASTA, or annotated GenBank files.

Usage

To obtain all sequences contained in a docx as annotated Biopython records (such as this one):

from crazydoc import CrazydocParser
parser = CrazydocParser(['highlight_color', 'bold', 'underline'])
biopython_records = parser.parse_doc_file("./example.docx")

You can then plot the obtained records:

from crazydoc import CrazydocSketcher
sketcher = CrazydocSketcher()
for record in biopython_records:
    sketch = sketcher.translate_record(record)
    ax, _ = sketch.plot()
    ax.set_title(record.id)
    ax.figure.savefig('%s.png' % record.id)

To write the sequences down as GenBank records, with annotations:

from crazydoc import records_to_genbank
records_to_genbank(biopython_records)

Note that records_to_genbank() will truncate the record name to 20 characters, to fit in the GenBank format. Additionally, slashes (/) will be replaced with hyphens (-) in the filenames.

To read protein sequences, pass is_protein=True:

biopython_records = parser.parse_doc_file(protein_path, is_protein=True)

This will return protein records, which will be saved with a GenPept extension (.gp) by records_to_genbank(biopython_records, is_protein=True), unless another extension is specified with the extension parameter.
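The naming rules described above (truncation to 20 characters, slashes replaced with hyphens, .gp for proteins) can be sketched in plain Python. This is only an illustration of the stated behaviour, not the code of records_to_genbank(): the helpers `genbank_record_name` and `genbank_filename` are hypothetical, and the ".gb" default for DNA records is an assumption, since only the ".gp" protein extension is stated here.

```python
def genbank_record_name(record_name, max_length=20):
    """Truncate a record name to the length quoted above (20 characters)."""
    return record_name[:max_length]


def genbank_filename(record_name, is_protein=False, extension=None):
    """Build a file name following the rules described above.

    Hypothetical helper. Assumption: DNA records default to ".gb";
    the text above only states ".gp" for protein records.
    """
    if extension is None:
        extension = ".gp" if is_protein else ".gb"
    # Slashes would be read as path separators, so swap them for hyphens.
    safe_name = record_name.replace("/", "-")
    return safe_name + extension


short_name = genbank_record_name("a_very_long_construct_name_v2")
dna_file = genbank_filename("pEGF/insert1")
protein_file = genbank_filename("pEGF/insert1", is_protein=True)
custom_file = genbank_filename("pEGF/insert1", extension=".genbank")
```

Whether the file name is built from the truncated or the full record name is not specified above, so the two rules are kept in separate functions.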

Installation

You can install crazydoc through pip:

sudo pip install crazydoc

Alternatively, you can unzip the sources in a folder and type:

sudo python setup.py install

License = MIT

Crazydoc is open-source software originally written at the Edinburgh Genome Foundry by Zulko and released on GitHub under the MIT licence (Copyright 2018 Edinburgh Genome Foundry).

Everyone is welcome to contribute!

More biology software

Crazydoc is part of the EGF Codons synthetic biology software suite for DNA design, manufacturing and validation.