alan-turing-institute / ptype

Licence: MIT license
Probabilistic type inference



1   Introduction

ptype is a probabilistic approach to type inference, which is the task of identifying the data type (e.g. Boolean, date, integer or string) of a given column of data.

Existing approaches often fail on type inference for messy datasets where data is missing or anomalous. With ptype, our goal is to develop a robust method that can deal with such data.

https://raw.githubusercontent.com/alan-turing-institute/ptype/release/notes/motivation.png

In the right-hand figure, normal, missing and anomalous values are shown in green, yellow and red, respectively.

ptype uses Probabilistic Finite-State Machines (PFSMs) to model known data types as well as missing and anomalous data. Given a column of data, we can infer a plausible column type and also identify any values which (conditional on that type) are deemed missing or anomalous. In contrast to more familiar finite-state machines, such as regular expressions, which either accept or reject a given data value, PFSMs assign probabilities to different values. They therefore offer the advantage of generating weighted predictions when a column of messy data is consistent with more than one type assignment.
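To make the idea concrete, here is a minimal, self-contained sketch of type inference with toy PFSMs. It is not ptype's implementation or API: the `PFSM` class, the helper functions, the two candidate types, the list of missing-value tokens and the mixture weights (0.90 / 0.05 / 0.05) are all assumptions chosen for illustration. Each machine assigns a probability to a string, each value is scored as a mixture of "normal", "missing" and "anomalous", and the column type with the highest posterior (under a uniform prior) is selected.

```python
import math
import string
from collections import defaultdict


class PFSM:
    """Toy probabilistic finite-state machine over characters (illustrative only)."""

    def __init__(self, start, transitions, stop):
        self.start = start              # {state: initial probability}
        self.transitions = transitions  # {(state, char): [(next_state, prob), ...]}
        self.stop = stop                # {state: stopping probability}

    def probability(self, text):
        # Forward algorithm: track the probability mass in each state.
        alpha = dict(self.start)
        for ch in text:
            nxt = defaultdict(float)
            for state, p in alpha.items():
                for target, q in self.transitions.get((state, ch), []):
                    nxt[target] += p * q
            alpha = nxt
        return sum(p * self.stop.get(s, 0.0) for s, p in alpha.items())


def integer_machine(continue_prob=0.8):
    # One or more digits; the length follows a geometric distribution.
    transitions = {}
    for d in "0123456789":
        transitions[("start", d)] = [("digits", 1 / 10)]
        transitions[("digits", d)] = [("digits", continue_prob / 10)]
    return PFSM({"start": 1.0}, transitions, {"digits": 1 - continue_prob})


def word_machine(words):
    # Trie-shaped machine giving equal probability to each listed word.
    words = set(words)
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}
    count = {p: sum(w.startswith(p) for w in words) for p in prefixes}
    transitions = {(p[:-1], p[-1]): [(p, count[p] / count[p[:-1]])]
                   for p in prefixes if p}
    stop = {p: (p in words) / count[p] for p in prefixes}
    return PFSM({"": 1.0}, transitions, stop)


def anomaly_machine(continue_prob=0.9):
    # Accepts any printable string, but with low probability per character.
    alphabet = string.printable
    transitions = {("any", c): [("any", continue_prob / len(alphabet))]
                   for c in alphabet}
    return PFSM({"any": 1.0}, transitions, {"any": 1 - continue_prob})


def infer_column_type(values, type_machines, missing, anomaly,
                      weights=(0.90, 0.05, 0.05)):
    # weights = assumed mixture weights for (normal, missing, anomalous).
    w_type, w_missing, w_anomaly = weights
    log_lik, roles = {}, {}
    for name, machine in type_machines.items():
        total, labels = 0.0, []
        for v in values:
            parts = {
                "normal": w_type * machine.probability(v),
                "missing": w_missing * missing.probability(v),
                "anomalous": w_anomaly * anomaly.probability(v),
            }
            total += math.log(sum(parts.values()))
            labels.append(max(parts, key=parts.get))
        log_lik[name], roles[name] = total, labels
    # Normalise in log space to obtain a posterior under a uniform prior.
    top = max(log_lik.values())
    unnorm = {n: math.exp(l - top) for n, l in log_lik.items()}
    z = sum(unnorm.values())
    posterior = {n: u / z for n, u in unnorm.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior, roles[best]


column = ["12", "7", "NA", "305", "x1y", "42"]
machines = {
    "integer": integer_machine(),
    "boolean": word_machine(["true", "false", "yes", "no"]),
}
best, posterior, labels = infer_column_type(
    column, machines, word_machine(["", "NA", "null", "-"]), anomaly_machine()
)
print(best, labels)
```

In this toy setup the column is assigned the integer type, with "NA" labelled missing and "x1y" labelled anomalous. Unlike a regular expression, which would simply reject "x1y", the machines here return graded probabilities, so competing type assignments can be compared by their posterior weight.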

If you use this package, please cite the ptype paper, using the following BibTeX entry:

@article{ceritli2020ptype,
  title={ptype: probabilistic type inference},
  author={Ceritli, Taha and Williams, Christopher KI and Geddes, James},
  journal={Data Mining and Knowledge Discovery},
  year={2020},
  volume = {34},
  number = {3},
  pages={870--904},
  doi = {10.1007/s10618-020-00680-1},
}

2   Install requirements

You can simply install ptype from PyPI:

pip install ptype

3   Usage

See the demo notebooks in the notebooks folder. They can also be viewed online via Binder.
