HallLab / pandas-genomics

License: BSD-3-Clause
Pandas ExtensionDtypes for dealing with genomics data

Pandas ExtensionDtypes and ExtensionArray for working with genomics data

Quickstart

Variant objects hold information about a particular variant:

from pandas_genomics.scalars import Variant
variant = Variant('12', 112161652, id='rs12462', ref='A', alt=['C', 'T'])
print(variant)
rs12462[chr=12;pos=112161652;ref=A;alt=C,T]

Each variant should have a unique ID, and a random ID is generated if one is not specified.
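The fallback to a random ID can be pictured with a small standalone sketch. The class below, `SketchVariant`, is a hypothetical stand-in and not the pandas_genomics `Variant` class; it assumes a UUID-based ID, which may differ from how the library actually generates one.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchVariant:
    """Hypothetical stand-in for a variant record (not the library class)."""
    chromosome: str
    position: int
    id: Optional[str] = None
    ref: str = "N"
    alt: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Assumption: use a random UUID when the caller gives no ID
        if self.id is None:
            self.id = uuid.uuid4().hex

    def __str__(self):
        alts = ",".join(self.alt)
        return f"{self.id}[chr={self.chromosome};pos={self.position};ref={self.ref};alt={alts}]"


named = SketchVariant("12", 112161652, id="rs12462", ref="A", alt=["C", "T"])
anonymous = SketchVariant("12", 112161652, ref="A", alt=["C"])
print(named)         # uses the ID that was passed in
print(anonymous.id)  # a random 32-character hex string, different on every run
```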

Genotype objects are associated with a particular Variant:

gt = variant.make_genotype("A", "C")
print(gt)
A/C

The GenotypeArray stores genotypes with an associated variant and has useful methods and properties:

from pandas_genomics.scalars import Variant
from pandas_genomics.arrays import GenotypeArray
variant = Variant('12', 112161652, id='rs12462', ref='A', alt=['C'])
gt_array = GenotypeArray([variant.make_genotype_from_str(s) for s in ["C/C", "A/C", "A/A"]])
print(gt_array)
<GenotypeArray>
[Genotype(variant=rs12462[chr=12;pos=112161652;ref=A;alt=C], allele1=1, allele2=1),
Genotype(variant=rs12462[chr=12;pos=112161652;ref=A;alt=C], allele1=0, allele2=1),
Genotype(variant=rs12462[chr=12;pos=112161652;ref=A;alt=C], allele1=0, allele2=0)]
Length: 3, dtype: genotype[12; 112161652; rs12462; A; C]
print(gt_array.astype(str))
    ['C/C' 'A/C' 'A/A']
print(gt_array.encode_dominant())
    <IntegerArray>
    [1.0, 1.0, 0.0]
    Length: 3, dtype: float
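
To make the dominant encoding concrete, here is a minimal plain-Python sketch that does not use pandas_genomics. The helpers `allele_indexes` and `encode_dominant_sketch` are hypothetical names, and the sketch assumes that any non-reference allele counts as an alternate allele. Allele indexes follow the convention visible in the output above: 0 for the reference allele, 1 for the first alternate allele.

```python
def allele_indexes(genotype, ref, alts):
    """Map a genotype string such as "A/C" to allele indexes (0 = reference)."""
    alleles = [ref] + list(alts)
    return [alleles.index(allele) for allele in genotype.split("/")]


def encode_dominant_sketch(genotypes, ref, alts):
    """Return 1.0 when at least one alternate allele is present, else 0.0."""
    return [
        1.0 if any(i > 0 for i in allele_indexes(g, ref, alts)) else 0.0
        for g in genotypes
    ]


indexes = [allele_indexes(g, "A", ["C"]) for g in ["C/C", "A/C", "A/A"]]
codes = encode_dominant_sketch(["C/C", "A/C", "A/A"], ref="A", alts=["C"])
print(indexes)  # [[1, 1], [0, 1], [0, 0]]
print(codes)    # [1.0, 1.0, 0.0]
```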

There are also genomics accessors for Series and DataFrame:

import pandas as pd
print(pd.Series(gt_array).genomics.encode_codominant())
    0    Hom
    1    Het
    2    Ref
    Name: rs12462_C, dtype: category
    Categories (3, object): ['Ref' < 'Het' < 'Hom']
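
The ordered categories in the output above (`'Ref' < 'Het' < 'Hom'`) can be reproduced with plain pandas. The function `encode_codominant_sketch` below is a hypothetical illustration, not the accessor's implementation, and it assumes diploid genotypes written as `allele/allele`.

```python
import pandas as pd


def encode_codominant_sketch(genotypes, ref):
    """Label each diploid genotype by its number of non-reference alleles."""
    names = ["Ref", "Het", "Hom"]  # 0, 1 or 2 non-reference alleles
    labels = []
    for genotype in genotypes:
        n_alt = sum(allele != ref for allele in genotype.split("/"))
        labels.append(names[n_alt])
    dtype = pd.CategoricalDtype(categories=names, ordered=True)
    return pd.Series(labels, dtype=dtype)


encoded = encode_codominant_sketch(["C/C", "A/C", "A/A"], ref="A")
print(encoded)
# Because the categories are ordered, comparisons against a category work
print((encoded > "Ref").tolist())  # [True, True, False]
```

Keeping the categories ordered is what allows comparisons and sorting by genotype class, which an unordered categorical would reject.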