GWAS_Flow

Citing

GWAS-Flow was written and published in the hope that you might find it useful. If you do and use it for your research, please cite the paper published alongside the software, which is currently publicly accessible on the bioRxiv preprint server: https://www.biorxiv.org/content/10.1101/783100v1 (doi: 10.1101/783100).

Introduction

GWAS_Flow is an open-source, Python-based software providing a GPU-accelerated framework for performing genome-wide association studies (GWAS), published under the MIT License. GWAS is a set of major algorithms in quantitative genetics for finding associations between phenotypes and their respective genotypes, with a broad range of applications ranging from plant breeding to medicine. In recent years the data sets used for those studies have increased rapidly in size, and the time necessary to perform them on conventional CPU-powered machines has increased accordingly. Here we use TensorFlow, a framework that is commonly used for machine learning applications, to utilize graphics processing units (GPUs) for GWAS.

Requirements

Required Software

python (v.3.7.3)
anaconda
git

Required python packages

tensorflow (v.1.14.0)
numpy (v.1.16.4)
pandas (v.0.24.2)
scipy (v.1.3.0)
h5py (v.2.9.0)
matplotlib
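
To compare an existing environment against the versions listed above, a small check like the following can help. This is only a sketch: the pandas entry assumes the listed version means 0.24.2, and importlib.metadata is part of the standard library from Python 3.8 on (on Python 3.7 the importlib_metadata backport offers the same interface).

```python
from importlib import metadata

# Versions as listed in the requirements above; matplotlib is unpinned there.
# Assumption: the pandas entry refers to release 0.24.2.
EXPECTED = {
    "tensorflow": "1.14.0",
    "numpy": "1.16.4",
    "pandas": "0.24.2",
    "scipy": "1.3.0",
    "h5py": "2.9.0",
    "matplotlib": None,
}


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions(EXPECTED).items():
        listed = EXPECTED[name] or "any"
        print(f"{name}: installed={version}, listed={listed}")
```

Newer versions may or may not work; the listed versions are the ones stated in this README.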

Docker and Singularity

Docker (v.19.03.1)
Singularity (v.2.5.2)

Installation

git and anaconda

This has been tested on multiple Linux systems with anaconda versions > 4.7.

Clone the repository directly with git:

git clone https://github.com/Joyvalley/GWAS_Flow

Create an anaconda environment (optional) and install the necessary packages with pip using the requirements.txt file:

cd GWAS_Flow
# optional: create and activate a conda environment
conda create -n gwas_flow
conda activate gwas_flow
# install the required packages with pip
pip install -r requirements.txt

docker

For the installation with docker the only required software is docker itself.

git clone https://github.com/Joyvalley/GWAS_Flow.git 
cd GWAS_Flow
docker build  -t gwas_flow .

Then you can run GWAS_Flow using your user id and files in your current working directory like this:

docker run -u $UID:$GID -v $PWD:/data --rm gwas_flow -x gwas_sample_data/G_sample.csv -y gwas_sample_data/Y_sample.csv -k gwas_sample_data/K_sample.csv -o docker_out.csv

singularity

git clone https://github.com/Joyvalley/GWAS_Flow.git
cd GWAS_Flow

docker build -t gwas_flow .

Convert the docker image to a singularity image. Make sure to change /PATH/TO/FOLDER to the folder where the image should be written:

docker run -v /var/run/docker.sock:/var/run/docker.sock -v /PATH/TO/FOLDER:/output --privileged -t singularityware/docker2singularity:1.11 gwas_flow:latest

Then rename the generated image, e.g. gwas_flow_latest-2019-08-19-8c98f492dd54.img, to gwas_flow_sing.img.

Execution with anaconda installation

Input data

GWAS_Flow is designed to work with several different input data formats. For all of them, sample data are available in the folder gwas_sample_data/. The minimal requirement is to provide a genotype, a phenotype and a kinship file. ⚠️ In previous versions of GWAS_Flow (<= 1.1.2) a kinship matrix according to VanRaden was calculated from the provided marker information. There might have been an error in the implementation (see #27). Therefore, the recommendation to provide a kinship matrix was changed to a requirement.
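
Since a kinship matrix now has to be supplied, the following sketch shows the textbook VanRaden (method 1) formula, K = ZZ' / (2 Σ p(1 − p)), for a small marker matrix. It is not the former GWAS_Flow implementation. It assumes markers are coded as allele counts 0/1/2 with one row per individual, and the function name vanraden_kinship is made up for this example. It is written in plain Python for readability; for real marker matrices a vectorized numpy version would be the practical choice.

```python
def vanraden_kinship(genotypes):
    """Compute a VanRaden (method 1) kinship matrix.

    genotypes: list of rows, one per individual, each a list of
    allele counts (0, 1 or 2) per marker. This coding is an assumption.
    """
    n_individuals = len(genotypes)
    n_markers = len(genotypes[0])

    # Allele frequency p_j of the counted allele at each marker.
    freqs = [
        sum(row[j] for row in genotypes) / (2 * n_individuals)
        for j in range(n_markers)
    ]

    # Center each marker column by its expected value 2 * p_j.
    centered = [
        [row[j] - 2 * freqs[j] for j in range(n_markers)]
        for row in genotypes
    ]

    # Scaling factor 2 * sum_j p_j * (1 - p_j).
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all markers are monomorphic; kinship is undefined")

    # K = Z Z^T / scale, computed entry by entry.
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / scale for row_j in centered]
        for row_i in centered
    ]


if __name__ == "__main__":
    toy = [[0, 1, 2], [2, 1, 0], [1, 1, 1]]
    for row in vanraden_kinship(toy):
        print(row)
```

The result is a symmetric k × k matrix, matching the shape expected by the -k option.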

hdf5 input

python gwas.py -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py

csv input

python gwas.py -x gwas_sample_data/G_sample.csv -y gwas_sample_data/Y_sample.csv -k gwas_sample_data/K_sample.csv

plink input

To use the PLINK data format, add a bed, bim and fam file with the same prefix to the folder. You can tell GWAS-Flow to use those files by passing prefix.plink as the genotype file:

python gwas.py -x gwas_sample_data/my_plink.plink -y gwas_sample_data/pheno2.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py
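
Before starting a run, it can be useful to confirm that all three PLINK files exist for the given prefix. The helper below is a hypothetical convenience and not part of GWAS_Flow; it only encodes the naming rule described above (prefix.plink stands for prefix.bed, prefix.bim and prefix.fam).

```python
from pathlib import Path

PLINK_SUFFIXES = (".bed", ".bim", ".fam")


def missing_plink_files(genotype_arg):
    """Return the bed/bim/fam paths that are missing for a 'prefix.plink' argument."""
    arg = Path(genotype_arg)
    if arg.suffix != ".plink":
        raise ValueError("expected an argument of the form prefix.plink")

    # Strip the trailing ".plink" to obtain the shared prefix.
    prefix = arg.with_suffix("")

    # Append each suffix to the full prefix so dots inside the prefix survive.
    candidates = [Path(str(prefix) + suffix) for suffix in PLINK_SUFFIXES]
    return [path for path in candidates if not path.is_file()]


if __name__ == "__main__":
    missing = missing_plink_files("gwas_sample_data/my_plink.plink")
    if missing:
        print("missing files:", ", ".join(str(p) for p in missing))
    else:
        print("all PLINK files found")
```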

Flags and options are:

-x , --genotype : file containing marker information in csv or hdf5 format of size
-y , --phenotype : file containing phenotype information in csv format
-k , --kinship : file containing kinship matrix of size k X k in csv or hdf5 format
-m : name of column to be used in phenotype file. Default m='phenotype_value' 
-a , --mac_min : integer specifying the minimum minor allele count necessary for a marker to be included. Default a = 1
-bs, --batch-size : integer specifying the number of markers processed at once. Default -bs 500000
-p , --perm : perform n permutations
--out_perm : output individual results of the permutation. Default False, enable with arbitrary string (e.g. --out_perm yo)
--plot : create a Manhattan plot
-o , --out : name of output file. Default -o results.csv  
-h , --help : prints help and command line options

Use python gwas.py -h to see the command line options.
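
When GWAS_Flow is called from another Python script, for example in a loop over several phenotype files, the command line can be assembled programmatically. The sketch below uses only flags documented above; the function names build_gwas_command and run_gwas and the script parameter are made up for this example, and it assumes gwas.py is reachable from the current working directory.

```python
import subprocess
import sys


def build_gwas_command(genotype, phenotype, kinship, out="results.csv",
                       perm=None, plot=False, script="gwas.py"):
    """Assemble the argument list for one GWAS_Flow run."""
    cmd = [
        sys.executable, script,
        "-x", genotype,
        "-y", phenotype,
        "-k", kinship,
        "-o", out,
    ]
    if perm is not None:
        # Number of permutations for the permutation-based threshold.
        cmd += ["--perm", str(perm)]
    if plot:
        # The README enables plotting with "--plot True".
        cmd += ["--plot", "True"]
    return cmd


def run_gwas(**kwargs):
    """Run GWAS_Flow and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_gwas_command(**kwargs), check=True)


if __name__ == "__main__":
    command = build_gwas_command(
        genotype="gwas_sample_data/G_sample.csv",
        phenotype="gwas_sample_data/Y_sample.csv",
        kinship="gwas_sample_data/K_sample.csv",
        perm=100,
    )
    print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.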

Execution with docker and singularity

Execute the docker container with the sample data

docker run --rm -u $UID:$GID -v $PWD:/data gwas_flow:latest  -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py

On Windows you can use something like this after activating the file sharing for the drive the repo is stored on:

cd c:\PATH\TO\REPO\GWAS_Flow
docker run -v c:/PATH/TO/REPO/GWAS_Flow:/data gwas_flow:latest -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py

!! The GPU versions of docker and singularity are still under development and might or might not work properly with your setup. To run GWAS-Flow on GPUs, as of now we recommend using anaconda environments.

Execute the singularity image with the sample data

singularity run  gwas_flow_sing.img -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py

further options

Co-factor

Previous versions of GWAS_Flow (<= 1.1.2) had experimental support for one Co-Factor. This functionality was dropped in v1.2.0 (see #30).

Permutation

Add the flag --perm 100 to calculate a significance threshold based on 100 permutations. Change 100 to any integer larger than 2 to perform n permutations.
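
How GWAS_Flow derives the threshold internally is not described here. A common approach, assumed in this sketch, is to shuffle the phenotype values, rerun the association test, record the smallest p-value of each permutation, and take the lower alpha quantile of those minima as the threshold. The callback pvalue_fn is a hypothetical stand-in for the association test, and alpha = 0.05 is an assumption.

```python
import math
import random


def permutation_threshold(phenotype, pvalue_fn, n_perm=100, alpha=0.05, seed=0):
    """Estimate a significance threshold from permuted phenotypes.

    pvalue_fn takes a list of phenotype values and returns one p-value per
    marker. It is a placeholder for the actual association test.
    """
    rng = random.Random(seed)
    shuffled = list(phenotype)
    min_pvalues = []
    for _ in range(n_perm):
        # Shuffling breaks any real genotype-phenotype association.
        rng.shuffle(shuffled)
        min_pvalues.append(min(pvalue_fn(shuffled)))

    # Take the lower alpha quantile of the per-permutation minima.
    min_pvalues.sort()
    index = max(math.floor(alpha * n_perm) - 1, 0)
    return min_pvalues[index]


if __name__ == "__main__":
    # Toy stand-in for an association test: p-values depend on the ordering.
    def toy_pvalues(values):
        return [(abs(v) % 10 + 1) / 20 for v in values[:3]]

    print(permutation_threshold([1.5, 2.0, 3.5, 4.0, 7.5], toy_pvalues, n_perm=100))
```

With only a handful of permutations the quantile is estimated very coarsely, which is one reason to use a larger n.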

Manhattan plot

By default no plot is generated. If you add --plot True, a Manhattan plot is generated.

The dash-dotted line is the Bonferroni threshold of significance and the dashed line is the permutation-based threshold. The latter is only calculated if the flag --perm n was used with n > 2.
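
For reference, the Bonferroni threshold divides the significance level by the number of tests, here the number of markers. The sketch below computes it and converts it to the -log10 scale that Manhattan plots commonly use. The significance level alpha = 0.05 is an assumption; the value used by GWAS_Flow is not stated here.

```python
import math


def bonferroni_threshold(n_markers, alpha=0.05):
    """Return the Bonferroni-corrected p-value threshold for n_markers tests."""
    if n_markers < 1:
        raise ValueError("n_markers must be at least 1")
    return alpha / n_markers


def to_manhattan_scale(p_value):
    """Convert a p-value to the -log10 scale common on Manhattan plot y-axes."""
    return -math.log10(p_value)


if __name__ == "__main__":
    threshold = bonferroni_threshold(10000)
    print(threshold, to_manhattan_scale(threshold))
```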

Performance Benchmarking and Recommendations

The image shows the average time of 10 runs with 10000 markers each and a varying number of phenotypes for GWAS_Flow on GPU and CPUs and for a standard R script for GWAS. The computational time grows exponentially with an increasing number of phenotypes. With lower numbers of phenotypes (< 800), the CPU version is faster than the GPU version. Beyond that, the balance shifts towards the GPU version, increasingly so the more phenotypes are included. All calculations were performed on 16 i9 vCPUs and an NVIDIA Tesla P100 graphics card.
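
To repeat such a comparison on your own hardware, the average over several runs can be measured with a small timing helper. This is a generic sketch and not the script used for the benchmark above; the callable passed as task is a placeholder for whatever starts one GWAS run.

```python
import statistics
import time


def average_runtime(task, runs=10):
    """Call task() `runs` times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


if __name__ == "__main__":
    # Placeholder workload; replace with a function that runs one analysis.
    print(average_runtime(lambda: sum(range(100000)), runs=10))
```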

Unit tests

The unit tests can be run on the console with:

python -m unittest tests/test.py

All the necessary test data is stored in test_data.

Changes

v1.2.0

providing a kinship matrix via -k is now required (#27)
fix degrees of freedom (#29)
drop co-factor support (--cof no longer works, see #30)
standard error is no longer reported in the output files (#28)
create plots in png and pdf format (related to #16)
fix bug with permutation output when a path was given with --out (rather than a filename, related to #16)
