corneliusroemer / desh-data

License: Unlicense
Sequence lineage information extracted from RKI sequence data repo

Pango lineage information for German SARS-CoV-2 sequences

This repository contains a join of the metadata and Pango lineage tables of all German SARS-CoV-2 sequences published by the Robert Koch-Institut (RKI) on GitHub.

The data here is updated automatically every hour through a GitHub Action, so whenever new data appears in the RKI repo, it shows up here within at most an hour.

The resulting dataset can be downloaded here (note that it is currently around 50 MB in size): https://raw.githubusercontent.com/corneliusroemer/desh-data/main/data/meta_lineages.csv

Omicron share plot

Type N denotes representative (random) surveillance sampling. Type X denotes an unknown sequencing reason; because such samples are unlikely to be heavily targeted and come from a considerable number of labs, I now include them in the main plot (hence type NX).
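One way to compute a daily Omicron share restricted to types N and X, together with its logit, is sketched below. This is a minimal illustration, not the code behind the plots: the function names are hypothetical, the rows are invented toy values rather than real data, and treating a `scorpio_call` that starts with "Omicron" as an Omicron sequence is an assumption. It uses only the standard library.

```python
import math
from collections import Counter

# Invented toy rows (DATE_DRAW, SEQ_REASON, scorpio_call). NOT real data.
ROWS = [
    ("2021-12-01", "N", "Delta (B.1.617.2-like)"),
    ("2021-12-01", "N", "Delta (B.1.617.2-like)"),
    ("2021-12-01", "X", "Delta (B.1.617.2-like)"),
    ("2021-12-01", "N", "Omicron (BA.1-like)"),
    ("2021-12-01", "A[toy]", "Omicron (BA.1-like)"),  # targeted, excluded
    ("2021-12-02", "N", "Omicron (BA.1-like)"),
    ("2021-12-02", "X", "Omicron (BA.1-like)"),
    ("2021-12-02", "N", "Delta (B.1.617.2-like)"),
    ("2021-12-02", "X", ""),                          # no scorpio call
    ("2021-12-02", "Y", "Omicron (BA.1-like)"),       # targeted, excluded
]


def omicron_share(rows, reasons=("N", "X")):
    """Return {date: Omicron share} using only rows with the given reasons."""
    totals, hits = Counter(), Counter()
    for day, reason, scorpio in rows:
        if reason not in reasons:
            continue  # drop targeted sequencing (Y, A[...])
        totals[day] += 1
        # Assumption: Omicron sequences have a scorpio call starting with "Omicron".
        if scorpio.startswith("Omicron"):
            hits[day] += 1
    return {day: hits[day] / totals[day] for day in sorted(totals)}


def logit(p):
    """Log-odds of a share; undefined for p == 0 or p == 1."""
    return math.log(p / (1 - p))


shares = omicron_share(ROWS)
for day, share in shares.items():
    print(day, share, logit(share))
```

On the logit scale, logistic growth of a variant's share appears as a straight line, which is why a logit plot is a convenient companion to the plain share plot. Days on which the share is exactly 0 or 1 have to be skipped or smoothed before taking the logit.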

Omicron Logit Plot

Omicron share by zip code area

Description of data

Column description:

  • IMS_ID: Unique identifier of the sequence
  • DATE_DRAW: Date the sample was taken from the patient
  • SEQ_REASON: Reason for sequencing, one of:
    • X: Unknown
    • N: Random sampling
    • Y: Targeted sequencing (exact reason unknown)
    • A[<reason>]: Targeted sequencing because variant PCR indicated VOC
  • PROCESSING_DATE: Date the sample was processed by the RKI and added to the GitHub repo
  • SENDING_LAB_PC: Postcode (PLZ) of lab that did the initial PCR
  • SEQUENCING_LAB_PC: Postcode (PLZ) of lab that did the sequencing
  • lineage: Pango lineage as reported by pangolin
  • scorpio_call: Alternative, coarser variant call determined by scorpio (part of pangolin); it is less precise but somewhat more robust than the pangolin lineage.
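The SEQ_REASON codes listed above can be turned into readable labels with a small helper. The following is a minimal sketch: the function name `describe_seq_reason` is hypothetical, the example value `A[B.1.1.7]` is only illustrative of the `A[<reason>]` pattern, and anything not covered by the column description is reported as unrecognised rather than guessed.

```python
import re

# Pattern for "A[<reason>]" as given in the column description.
_TARGETED_VOC = re.compile(r"^A\[(?P<reason>.*)\]$")

# Fixed single-letter codes from the column description.
_SIMPLE_CODES = {
    "X": "unknown",
    "N": "random sampling",
    "Y": "targeted (exact reason unknown)",
}


def describe_seq_reason(code: str) -> str:
    """Map a raw SEQ_REASON value to a human-readable label."""
    code = code.strip()
    if code in _SIMPLE_CODES:
        return _SIMPLE_CODES[code]
    match = _TARGETED_VOC.match(code)
    if match:
        return f"targeted (variant PCR indicated VOC: {match.group('reason')})"
    # Values outside the documented scheme are flagged, not interpreted.
    return f"unrecognised code: {code!r}"


print(describe_seq_reason("N"))
print(describe_seq_reason("A[B.1.1.7]"))  # illustrative value
```

Keeping an explicit "unrecognised" branch makes it easy to spot codes that the upstream data introduces later.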

Excerpt

Here are the header and the first 12 rows of the dataset.

IMS_ID,DATE_DRAW,SEQ_REASON,PROCESSING_DATE,SENDING_LAB_PC,SEQUENCING_LAB_PC,lineage,scorpio_call
IMS-10294-CVDP-00001,2021-01-14,X,2021-01-25,40225,40225,B.1.1.297,
IMS-10025-CVDP-00001,2021-01-17,N,2021-01-26,10409,10409,B.1.389,
IMS-10025-CVDP-00002,2021-01-17,N,2021-01-26,10409,10409,B.1.258,
IMS-10025-CVDP-00003,2021-01-17,N,2021-01-26,10409,10409,B.1.177.86,
IMS-10025-CVDP-00004,2021-01-17,N,2021-01-26,10409,10409,B.1.389,
IMS-10025-CVDP-00005,2021-01-18,N,2021-01-26,10409,10409,B.1.160,
IMS-10025-CVDP-00006,2021-01-17,N,2021-01-26,10409,10409,B.1.1.297,
IMS-10025-CVDP-00007,2021-01-18,N,2021-01-26,10409,10409,B.1.177.81,
IMS-10025-CVDP-00008,2021-01-18,N,2021-01-26,10409,10409,B.1.177,
IMS-10025-CVDP-00009,2021-01-18,N,2021-01-26,10409,10409,B.1.1.7,Alpha (B.1.1.7-like)
IMS-10025-CVDP-00010,2021-01-17,N,2021-01-26,10409,10409,B.1.1.7,Alpha (B.1.1.7-like)
IMS-10025-CVDP-00011,2021-01-17,N,2021-01-26,10409,10409,B.1.389,
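As a small illustration of working with these columns, the sketch below computes the number of days between sampling (DATE_DRAW) and processing (PROCESSING_DATE) for a few of the excerpt rows, using only the standard library. The function name `processing_lags` is hypothetical, and the code assumes both date columns are populated in ISO format, which holds for the excerpt but may not hold for every row of the full file.

```python
import csv
import io
from datetime import date

# Four rows copied from the excerpt above, including the header.
EXCERPT = """\
IMS_ID,DATE_DRAW,SEQ_REASON,PROCESSING_DATE,SENDING_LAB_PC,SEQUENCING_LAB_PC,lineage,scorpio_call
IMS-10294-CVDP-00001,2021-01-14,X,2021-01-25,40225,40225,B.1.1.297,
IMS-10025-CVDP-00001,2021-01-17,N,2021-01-26,10409,10409,B.1.389,
IMS-10025-CVDP-00005,2021-01-18,N,2021-01-26,10409,10409,B.1.160,
IMS-10025-CVDP-00009,2021-01-18,N,2021-01-26,10409,10409,B.1.1.7,Alpha (B.1.1.7-like)
"""


def processing_lags(csv_text: str) -> dict[str, int]:
    """Return {IMS_ID: days between DATE_DRAW and PROCESSING_DATE}."""
    lags = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        drawn = date.fromisoformat(row["DATE_DRAW"])
        processed = date.fromisoformat(row["PROCESSING_DATE"])
        lags[row["IMS_ID"]] = (processed - drawn).days
    return lags


lags = processing_lags(EXCERPT)
for ims_id, days in lags.items():
    print(ims_id, days)
```

This lag matters when reading recent data: the most recent sampling dates are always incomplete, because sequences for them are still being processed.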

Suggested import into pandas

You can import the data into pandas as follows:

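#%%
import pandas as pd

#%%
# Read the joined metadata/lineage table straight from GitHub (around 50 MB).
df = pd.read_csv(
    'https://raw.githubusercontent.com/corneliusroemer/desh-data/main/data/meta_lineages.csv',
    index_col='IMS_ID',
    # Parse both date columns by name (ISO format, YYYY-MM-DD).
    # infer_datetime_format is deprecated since pandas 2.0 and is not needed;
    # cache_dates=True is already the default.
    parse_dates=['DATE_DRAW', 'PROCESSING_DATE'],
    # Store repetitive string columns as categoricals to save memory.
    # Postcodes stay strings this way, so leading zeros are preserved.
    dtype={
        'SEQ_REASON': 'category',
        'SENDING_LAB_PC': 'category',
        'SEQUENCING_LAB_PC': 'category',
        'lineage': 'category',
        'scorpio_call': 'category',
    },
)
#%%
# Shorter, lower-case column names ('lineage' already has the desired name).
df = df.rename(columns={
    'DATE_DRAW': 'date',
    'PROCESSING_DATE': 'processing_date',
    'SEQ_REASON': 'reason',
    'SENDING_LAB_PC': 'sending_pc',
    'SEQUENCING_LAB_PC': 'sequencing_pc',
    'scorpio_call': 'scorpio',
})
df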

License

The underlying files that I use as input are licensed by RKI under CC-BY 4.0, see more details here: https://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland#lizenz.

The software here is licensed under the "Unlicense". You can do with it whatever you want.

For the data, just cite the original source. There is no need to cite this repo, since it is just a trivial join.
