rwnx / pynonymizer

Licence: MIT license
A universal tool for translating sensitive production database dumps into anonymized copies.

pynonymizer

pynonymizer is a universal tool for translating sensitive production database dumps into anonymized copies.

This can help you support GDPR/Data Protection in your organization without compromising on the quality of your test data.

Why are anonymized databases important?

The primary source of information on how your database is used is the production database itself. The production dataset is usually significantly larger than any development copy and contains a wider range of data.

From time to time, it is prudent to run a new feature or stage a test against this dataset, rather than one created artificially by developers or testing frameworks. Anonymized databases let us use the structures present in production while stripping out any personally identifiable data that would constitute a breach of privacy for end users and, in turn, a breach of GDPR.

With anonymized databases, copies can be processed regularly and distributed easily, giving your developers and testers a rich source of information on the volume and general makeup of the production system. They can be used to run better staging environments and integration tests, and even to simulate database migrations.

Below is an excerpt from an anonymized database:

id  salutation  firstname  surname    email              dob
1   Dr.         Bernard    Gough      [email protected]  2000-07-03
2   Mr.         Molly      Bennett    [email protected]  2014-05-19
3   Mrs.        Chelsea    Reid       [email protected]  1974-09-08
4   Dr.         Grace      Armstrong  [email protected]  1963-12-15
5   Dr.         Stanley    James      [email protected]  1976-09-16
6   Dr.         Mark       Walsh      [email protected]  2004-08-28
7   Mrs.        Josephine  Chambers   [email protected]  1916-04-04
8   Dr.         Stephen    Thomas     [email protected]  1995-04-17
9   Ms.         Damian     Thompson   [email protected]  2016-10-02
10  Miss        Geraldine  Harris     [email protected]  1910-09-28
11  Ms.         Gemma      Jones      [email protected]  1990-06-03
12  Dr.         Glenn      Carr       [email protected]  1998-04-19

How does it work?

pynonymizer replaces personally identifiable data in your database with realistic pseudorandom data, generated by the Faker library or by other functions. A wide variety of data types is available to suit the column in question, for example:

  • unique_email
  • company
  • file_path
  • [...]

Pynonymizer's main data replacement mechanism, fake_update, selects values at random from a small pool of generated data (--seed-rows controls how many rows of Faker data are available). This approach was chosen for compatibility and speed of operation, but it does not guarantee uniqueness, so it may or may not suit your exact use case. For a full list of data generation strategies, see the docs on strategyfiles.
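
To make the uniqueness caveat concrete, here is a minimal sketch in plain Python of drawing replacement values from a small seed pool. It does not use pynonymizer or Faker: `build_seed_pool` and the `fake_update` function below are hypothetical names, and the pool values are placeholders, so treat it as an illustration of the idea rather than the tool's actual implementation.

```python
import random


def build_seed_pool(seed_rows, rng):
    """Build a small pool of stand-in fake values (hypothetical helper).

    pynonymizer fills its pool from Faker; placeholder strings are used
    here so the sketch runs with the standard library only.
    """
    return [f"user{rng.randrange(10**6):06d}@example.com" for _ in range(seed_rows)]


def fake_update(rows, pool, rng):
    """Replace each row's value with a random pick from the pool."""
    return [rng.choice(pool) for _ in rows]


rng = random.Random(42)
pool = build_seed_pool(seed_rows=150, rng=rng)
original = [f"real-address-{i}" for i in range(1000)]
anonymized = fake_update(original, pool, rng)

# 1000 rows drawn from at most 150 pool entries must contain repeats.
print(len(anonymized), len(set(anonymized)))
```

A larger pool makes repeats less frequent but cannot rule them out, which is why a column that must hold distinct values probably needs a different strategy than this one.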

Examples

You can see strategyfile examples for existing databases, such as WordPress or the AdventureWorks sample database, in the examples folder.

Process outline

  1. Restore from dumpfile to temporary database.
  2. Anonymize temporary database with strategy.
  3. Dump resulting data to file.
  4. Drop temporary database.

If this workflow doesn't work for you, see process control to check whether it can be adjusted to suit your needs.
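
As a rough illustration of how the outline could be narrowed down with the --start-at, --stop-at, --skip-steps and --only-step options listed under Usage, the sketch below selects steps from an ordered list. The step names are assumptions made for this example (only RESTORE_DB and DUMP_DB appear in this README), and `select_steps` is a hypothetical helper, not part of pynonymizer.

```python
# Step names are assumptions for illustration; only RESTORE_DB and DUMP_DB
# are named in this README.
STEPS = ["RESTORE_DB", "ANONYMIZE_DB", "DUMP_DB", "DROP_DB"]


def select_steps(start_at=None, stop_at=None, skip_steps=(), only_step=None):
    """Return the steps that would run, in order.

    start_at and stop_at are inclusive, matching the help text.
    """
    if only_step is not None:
        return [only_step] if only_step in STEPS else []
    first = STEPS.index(start_at) if start_at else 0
    last = STEPS.index(stop_at) if stop_at else len(STEPS) - 1
    return [step for step in STEPS[first:last + 1] if step not in skip_steps]


print(select_steps())
print(select_steps(start_at="ANONYMIZE_DB", stop_at="DUMP_DB"))
print(select_steps(skip_steps=["DROP_DB"]))
```

Skipping the final step in this model corresponds to keeping the temporary database around for inspection after the dump has been written.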

Requirements

  • Python >= 3.6

mysql

  • mysql/mysqldump must be in $PATH
  • Local or remote mysql >= 5.5
  • Supported Inputs:
    • Plain SQL over stdin
    • Plain SQL file .sql
    • GZip-compressed SQL file .gz
  • Supported Outputs:
    • Plain SQL over stdout
    • Plain SQL file .sql
    • GZip-compressed SQL file .gz
    • LZMA-compressed SQL file .xz

mssql

  • Requires extra dependencies: install package pynonymizer[mssql]
  • MSSQL >= 2008
  • For RESTORE_DB/DUMP_DB operations, the database server must be running locally with pynonymizer. This is because MSSQL RESTORE and BACKUP instructions are received by the database, so piping a local backup to a remote server is not possible.
  • The anonymize process can be performed on remote servers, but you are responsible for creating/managing the target database.
  • Supported Inputs:
    • Local backup file
  • Supported Outputs:
    • Local backup file

postgres

  • psql/pg_dump must be in $PATH
  • Local or remote postgres server
  • Supported Inputs:
    • Plain SQL over stdin
    • Plain SQL file .sql
    • GZip-compressed SQL file .gz
  • Supported Outputs:
    • Plain SQL over stdout
    • Plain SQL file .sql
    • GZip-compressed SQL file .gz
    • LZMA-compressed SQL file .xz
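
For readers who want to inspect or prepare dump files in the formats listed above, the following standard-library sketch picks a codec from the file extension. `open_dump` is a hypothetical helper written for this example and says nothing about how pynonymizer handles files internally. Note that the lists above name .xz as an output format only.

```python
import gzip
import lzma
import os
import tempfile


def open_dump(path, mode="rt"):
    """Open a dump file as text, picking a codec from the file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, mode, encoding="utf-8")
    if path.endswith(".xz"):
        return lzma.open(path, mode, encoding="utf-8")
    return open(path, mode, encoding="utf-8")


sql = "INSERT INTO users VALUES (1, 'example');\n"
with tempfile.TemporaryDirectory() as tmp:
    for name in ("dump.sql", "dump.sql.gz", "dump.sql.xz"):
        path = os.path.join(tmp, name)
        # Write the same statement in each format, then read it back.
        with open_dump(path, "wt") as handle:
            handle.write(sql)
        with open_dump(path) as handle:
            print(name, handle.read() == sql)
```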

Getting Started

Usage

CLI

  1. Write a strategyfile for your database
  2. Start Anonymizing!
usage: pynonymizer [-h] [--input INPUT] [--strategy STRATEGYFILE]
                   [--output OUTPUT] [--db-type DB_TYPE] [--db-host DB_HOST]
                   [--db-port DB_PORT] [--db-name DB_NAME] [--db-user DB_USER]
                   [--db-password DB_PASSWORD] [--fake-locale FAKE_LOCALE]
                   [--start-at STEP] [--only-step STEP]
                   [--skip-steps STEP [STEP ...]] [--stop-at STEP]
                   [--seed-rows SEED_ROWS] [--mssql-driver MSSQL_DRIVER]
                   [--mssql-backup-compression]
                   [--mysql-cmd-opts MYSQL_CMD_OPTS]
                   [--mysql-dump-opts MYSQL_DUMP_OPTS]
                   [--postgres-cmd-opts POSTGRES_CMD_OPTS]
                   [--postgres-dump-opts POSTGRES_DUMP_OPTS] [-v] [--verbose]
                   [--dry-run] [--ignore-anonymization-errors]

A tool for writing better anonymization strategies for your production
databases.

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT, -i INPUT
                        The source dump filepath to read from. Use `-` for
                        stdin. [$PYNONYMIZER_INPUT]
  --strategy STRATEGYFILE, -s STRATEGYFILE
                        A strategyfile to use during anonymization.
                        [$PYNONYMIZER_STRATEGY]
  --output OUTPUT, -o OUTPUT
                        The destination filepath to write the dumped output
                        to. Use `-` for stdout. [$PYNONYMIZER_OUTPUT]
  --db-type DB_TYPE, -t DB_TYPE
                        Type of database to interact with. More databases will
                        be supported in future versions. default: mysql
                        [$PYNONYMIZER_DB_TYPE]
  --db-host DB_HOST, -d DB_HOST
                        Database hostname or IP address.
                        [$PYNONYMIZER_DB_HOST]
  --db-port DB_PORT, -P DB_PORT
                        Database port. Defaults to provider default.
                        [$PYNONYMIZER_DB_PORT]
  --db-name DB_NAME, -n DB_NAME
                        Name of database to restore and anonymize in. If not
                        provided, a unique name will be generated from the
                        strategy name. This will be dropped at the end of the
                        run. [$PYNONYMIZER_DB_NAME]
  --db-user DB_USER, -u DB_USER
                        Database credentials: username. [$PYNONYMIZER_DB_USER]
  --db-password DB_PASSWORD, -p DB_PASSWORD
                        Database credentials: password.
                        [$PYNONYMIZER_DB_PASSWORD]
  --fake-locale FAKE_LOCALE, -l FAKE_LOCALE
                        Locale setting to initialize fake data generation.
                        Affects Names, addresses, formats, etc.
                        [$PYNONYMIZER_FAKE_LOCALE]
  --start-at STEP       Choose a step to begin the process (inclusive).
                        [$PYNONYMIZER_START_AT]
  --only-step STEP      Choose one step to perform. [$PYNONYMIZER_ONLY_STEP]
  --skip-steps STEP [STEP ...]
                        Choose one or more steps to skip.
                        [$PYNONYMIZER_SKIP_STEPS]
  --stop-at STEP        Choose a step to stop at (inclusive).
                        [$PYNONYMIZER_STOP_AT]
  --seed-rows SEED_ROWS
                        Specify a number of rows to populate the fake data
                        table used during anonymization. Defaults to 150.
                        [$PYNONYMIZER_SEED_ROWS]
  --mssql-driver MSSQL_DRIVER
                        [MSSQL] ODBC driver to use for database connection
                        [$PYNONYMIZER_MSSQL_DRIVER]
  --mssql-backup-compression
                        [MSSQL] Use compression when backing up the database.
                        [$PYNONYMIZER_MSSQL_BACKUP_COMPRESSION]
  --mysql-cmd-opts MYSQL_CMD_OPTS
                        [MYSQL] pass additional arguments to the restore
                        process (advanced use only!).
                        [$PYNONYMIZER_MYSQL_CMD_OPTS]
  --mysql-dump-opts MYSQL_DUMP_OPTS
                        [MYSQL] pass additional arguments to the dump process
                        (advanced use only!). [$PYNONYMIZER_MYSQL_DUMP_OPTS]
  --postgres-cmd-opts POSTGRES_CMD_OPTS
                        [POSTGRES] pass additional arguments to the restore
                        process (advanced use only!).
                        [$PYNONYMIZER_POSTGRES_CMD_OPTS]
  --postgres-dump-opts POSTGRES_DUMP_OPTS
                        [POSTGRES] pass additional arguments to the dump
                        process (advanced use only!).
                        [$PYNONYMIZER_POSTGRES_DUMP_OPTS]
  -v, --version         show program's version number and exit
  --verbose             Increases the verbosity of the logging feature, to
                        help when troubleshooting issues.
                        [$PYNONYMIZER_VERBOSE]
  --dry-run             Instruct pynonymizer to skip all process steps. Useful
                        for testing safely. [$PYNONYMIZER_DRY_RUN]
  --ignore-anonymization-errors
                        Instruct pynonymizer to ignore errors during the
                        anonymization process and continue as normal.
                        [$PYNONYMIZER_IGNORE_ANONYMIZATION_ERRORS]
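
The bracketed names in the help text appear to be environment-variable equivalents of each option. Assuming that reading is correct, the sketch below assembles a command line from flags shown above and passes the credentials through the environment instead of argv. `build_command` and `build_env` are hypothetical helpers, and the paths and credentials are placeholders.

```python
import os
import shlex


def build_command(input_path, strategy_path, output_path, db_type="mysql"):
    """Assemble a pynonymizer argv list from flags shown in the help text."""
    return [
        "pynonymizer",
        "--input", input_path,
        "--strategy", strategy_path,
        "--output", output_path,
        "--db-type", db_type,
    ]


def build_env(db_user, db_password, base_env=None):
    """Copy the environment and add credentials via the documented variables."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYNONYMIZER_DB_USER"] = db_user
    env["PYNONYMIZER_DB_PASSWORD"] = db_password
    return env


argv = build_command("./backup.sql.gz", "./strategy.yml", "./anonymized.sql.gz")
env = build_env("root", "example-password")  # placeholder credentials
print(shlex.join(argv))

# To actually run it (requires pynonymizer and a reachable database):
# import subprocess
# subprocess.run(argv, env=env, check=True)
```

Keeping the password out of argv avoids exposing it in process listings, which is one reason to prefer the environment variable where it is supported.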

Package

Pynonymizer can also be invoked programmatically from other Python code. See the module entrypoint pynonymizer or pynonymizer/pynonymize.py.

import pynonymizer

pynonymizer.run(
    input_path="./backup.sql",
    strategyfile_path="./strategy.yml",
    # [...] further keyword arguments are elided in the original
)