
cutterkom / destatiscleanr

Licence: MIT License
Imports and cleans data from official German statistical offices to jump-start the data analysis


destatiscleanr


Update May 2020: This package is no longer needed. The Federal Statistical Office of Germany, Destatis, listened to its users: you can now download data as a flat CSV file or use an API.

Thanks for the nudge, @cutterkom, and for the further suggestions from the #ddj community, and thanks to @destatis for implementing it https://t.co/XIHG5Iml64

— Susanne Hagenkort-Rieger (@hagrie) May 29, 2020

This package as an online tool


Destatis is the Federal Statistical Office of Germany. It publishes many datasets covering a wide range of data, from area sizes to international economic indicators, in its database called Genesis.

Unfortunately, the downloadable csv files don't comply with common standards of a tidy, ready-to-use machine-readable dataset:

  • The tables have double, triple, quadruple, quintuple ... headers.
  • Every file includes copyright information at the end of the file.
  • Positive numeric values have a + sign.
  • ...

These problems exist throughout the federal system of statistical offices. Therefore destatiscleanr also works on data from regionalstatistik.de and other statistical offices.

The consequence of these messy files is time-consuming data cleaning. Every time you want to use data from Destatis, you have to do the same (or at least very similar) tasks. This package helps by doing four things:

  1. it imports the file, taking care of German peculiarities concerning encoding and decimal marks
  2. it deletes the copyright and metadata part
  3. it combines multiline headers into a regular column name
  4. it converts numeric values to numeric type with as.numeric
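
The four steps above can be sketched in base R. This is only an illustration of the idea, not the package's actual implementation: the sample lines, the footer separator pattern, and the helper names combine_header and to_number are all assumptions made for this example.

```r
# Hypothetical raw export, invented for illustration; real Genesis files differ.
raw_lines <- c(
  "Verbraucherpreisindex fuer Deutschland",
  "Jahr;Monat;Index;Veraenderung",
  ";;2015=100;in %",
  "2018;Januar;102,0;+1,4",
  "2018;Februar;102,3;-0,1",
  "__________",
  "(C) Statistisches Bundesamt"
)

# Step 2: drop the title line and everything from the footer separator onwards.
# (The underscore separator is an assumption of this sketch.)
footer_start <- grep("^_+$", raw_lines)[1]
body <- raw_lines[2:(footer_start - 1)]

# Step 1: parse semicolon-separated text, keeping every cell as character for now.
# (With a real file you would read from a path and also set the file encoding.)
tab <- read.csv2(text = body, header = FALSE, colClasses = "character")

# Step 3: collapse the two header rows into one cleaned name per column.
combine_header <- function(cells) {
  name <- paste(cells[cells != ""], collapse = "_")
  name <- gsub("[^a-z0-9]+", "_", tolower(name))
  gsub("^_+|_+$", "", name)
}
col_names <- vapply(tab[1:2, ], combine_header, character(1), USE.NAMES = FALSE)

# Step 4: strip the leading "+", swap the decimal comma, convert to numeric.
to_number <- function(x) as.numeric(sub(",", ".", sub("^\\+", "", x), fixed = TRUE))

cleaned <- tab[-(1:2), ]
names(cleaned) <- col_names
cleaned[3:4] <- lapply(cleaned[3:4], to_number)
rownames(cleaned) <- NULL
cleaned
```

Keeping all cells as character until the last step avoids R guessing column types from header rows that are still mixed in with the data.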

Ideally, you can start your analysis right after calling destatiscleanr("destatis_file.csv").

Install

The package can be installed with devtools:

devtools::install_github("cutterkom/destatiscleanr")

Usage

Download a csv file from the official Destatis/Genesis database and provide its path to the destatiscleanr function.

library(destatiscleanr)

df <- destatiscleanr("path/to/destatis_file.csv")

Example

A short example to illustrate the advantage of the package is the table for Verbraucherpreise, German for consumer prices aka inflation.

Without destatiscleanr

With destatiscleanr

The column name na_na arises because the column names are built from rows four and five of the original "Verbraucherpreise" table. For this column both cells are empty, hence na_na.
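
One plausible way such a name can come about is shown below. This is a guess at the mechanism, not the package's confirmed code, and the header values are hypothetical: pasting two missing header cells together and lower-casing the result produces exactly this kind of name.

```r
# Hypothetical header cells for two columns; NA marks an empty cell.
header_row_4 <- c(NA, "Verbraucherpreisindex")
header_row_5 <- c(NA, "2015=100")

# paste() turns NA into the string "NA", so two empty cells become "NA_NA".
combined <- tolower(paste(header_row_4, header_row_5, sep = "_"))
combined
```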

Caution

The goal is to jump-start the analysis of Destatis data. This comes with two caveats: the automatic creation of column names and the handling of missing values.

Column names

Be aware that the automatic renaming of columns doesn't work perfectly, and the column names are probably not as specific as you would wish. The package combines multiline headers into a unique column name that includes a name and a unit, so you can start your analysis immediately. You may still have to adjust at least some column names.
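
Adjusting names after cleaning can be done in base R with a small lookup vector. The data frame below is invented to stand in for the output of destatiscleanr, and the new names are only suggestions.

```r
# Stand-in for a cleaned table; column names and values are invented.
df <- data.frame(
  na_na = c(2018L, 2019L),
  verbraucherpreisindex_2015_100 = c(103.8, 105.3)
)

# Map old names to more specific ones; unlisted columns stay unchanged.
new_names <- c(
  na_na = "year",
  verbraucherpreisindex_2015_100 = "cpi_2015_100"
)
hits <- names(df) %in% names(new_names)
names(df)[hits] <- unname(new_names[names(df)[hits]])
names(df)
```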

Missing values

An NA value can have many different meanings: for example, - means no data available, and ... means the value will be reported later. These distinctions aren't represented in the data cleaned by destatiscleanr: every missing value, no matter the reason, becomes NA.
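
If the reason for a missing value matters for your analysis, you can record it from the raw values before converting them. The sketch below is not part of destatiscleanr; it uses only the two codes mentioned above, and the sample values are invented.

```r
# Invented raw cells as they might appear in a downloaded file.
raw_values <- c("102,0", "-", "...", "+1,4")

# The two codes named above; other codes would need to be added here.
missing_codes <- c(
  "-"   = "no data available",
  "..." = "value will be reported later"
)

# Look up the reason; cells that are not a known code get NA as reason.
reason <- unname(missing_codes[raw_values])

# Convert to numeric: drop "+", use "." as decimal mark, codes become NA.
value <- suppressWarnings(
  as.numeric(sub(",", ".", sub("^\\+", "", raw_values), fixed = TRUE))
)

result <- data.frame(raw = raw_values, value = value, missing_reason = reason)
result
```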

Possible reasons for missing values:

More resources

The package wiesbaden offers a way to get Destatis data directly from the database. For the main Destatis database this used to be a paid service. Destatis now offers its API as a free service (see documentation here). Just like Regionalstatistik.de, it can now be accessed free of charge as a registered user.

Wishlist

  • More dynamic creation of column names 🙄
  • Clever guessing of the year/date column
  • Shiny app to offer it to non-R users