GWAS_Flow
Citing
GWAS-Flow was written and published in the hope that you might find it useful. If you do and use it for your research, please cite the paper published alongside the software, which is currently publicly accessible on the bioRxiv preprint server. https://www.biorxiv.org/content/10.1101/783100v1 doi: 10.1101/783100
Introduction
GWAS_Flow is an open source, Python-based software providing a GPU-accelerated framework for performing genome-wide association studies (GWAS), published under the MIT License.
GWAS is a set of major algorithms in quantitative genetics to find associations between phenotypes and their respective genotypes, with a broad range of applications ranging from plant breeding to medicine.
In recent years the data sets used for those studies have increased rapidly in size, and the time necessary to perform them on conventional CPU-powered machines has increased sharply as a result.
Here we use TensorFlow, a framework that is commonly used for machine learning applications, to utilize graphics processing units (GPUs) for GWAS.
Requirements
Required Software
Required python packages
- tensorflow (v.1.14.0)
- numpy (v.1.16.4)
- pandas (v.0.24.2)
- scipy (v.1.3.0)
- h5py (v.2.9.0)
- matplotlib
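To compare an existing environment against the versions listed above, a small script along the following lines may help. It is a sketch and not part of GWAS_Flow: `PINNED` simply restates the list (the pandas entry is taken to mean 0.24.2, which is an assumption), and `installed_version` and `check_versions` are hypothetical helper names.

```python
import importlib

# Versions restated from the requirements list. The pandas entry is an
# assumption (interpreted as 0.24.2). matplotlib has no pinned version.
PINNED = {
    "tensorflow": "1.14.0",
    "numpy": "1.16.4",
    "pandas": "0.24.2",
    "scipy": "1.3.0",
    "h5py": "2.9.0",
}


def installed_version(name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def check_versions(pinned, lookup=installed_version):
    """Map each package name to 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found is None:
            report[name] = "missing"
        elif found == wanted:
            report[name] = "ok"
        else:
            report[name] = "mismatch (found {})".format(found)
    return report


if __name__ == "__main__":
    for package, status in check_versions(PINNED).items():
        print("{:<12} {}".format(package, status))
```

A mismatch does not necessarily mean GWAS_Flow will fail; the listed versions are the ones stated in this README.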
Docker and Singularity
- Docker (v.19.03.1)
- Singularity (v.2.5.2)
Installation
git and anaconda
This has been tested on multiple Linux systems with Anaconda versions > 4.7.
Clone the repository directly with git:
git clone https://github.com/Joyvalley/GWAS_Flow
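Create an Anaconda environment and install the necessary packages. The text of this README refers to the gwas_flow_env.yaml configuration file for this step; the commands below instead create an empty environment and install from requirements.txt with pip: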
### optional:
conda create -n gwas_flow
conda activate gwas_flow
### set up environment with pip
pip install -r requirements.txt
docker
For the installation with docker the only required software is docker itself.
git clone https://github.com/Joyvalley/GWAS_Flow.git
cd GWAS_Flow
docker build -t gwas_flow .
Then you can run GWAS_Flow using your user id and files in your current working directory like this:
docker run -u $UID:$GID -v $PWD:/data --rm gwas_flow -x gwas_sample_data/G_sample.csv -y gwas_sample_data/Y_sample.csv -k gwas_sample_data/K_sample.csv -o docker_out.csv
singularity
git clone https://github.com/Joyvalley/GWAS_Flow.git
cd GWAS_Flow
docker build -t gwas_flow .
!! Make sure to change /PATH/TO/FOLDER to the folder that should receive the image
docker run -v /var/run/docker.sock:/var/run/docker.sock -v /PATH/TO/FOLDER:/output --privileged -t singularityware/docker2singularity:1.11 gwas_flow:latest
Rename the resulting image, e.g. gwas_flow_latest-2019-08-19-8c98f492dd54.img, to gwas_flow_sing.img
Execution with anaconda installation
Input data
GWAS_Flow is designed to work with several different input data formats. Sample data for all of them are available in the folder gwas_sample_data/
The minimal requirement is to provide a genotype, phenotype and a kinship file.
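Because the kinship matrix has to match the genotyped individuals, a quick sanity check before a run can save time. The sketch below assumes the matrix is already loaded into a NumPy array and that k equals the number of individuals; how the sample files are laid out on disk is not covered here. `check_kinship` is a hypothetical helper, not a GWAS_Flow function.

```python
import numpy as np


def check_kinship(kinship, n_individuals, tol=1e-8):
    """Return the matrix as a float array, or raise ValueError if it is
    not a symmetric n_individuals x n_individuals matrix."""
    kinship = np.asarray(kinship, dtype=float)
    if kinship.ndim != 2 or kinship.shape[0] != kinship.shape[1]:
        raise ValueError(
            "kinship matrix must be square, got shape {}".format(kinship.shape)
        )
    if kinship.shape[0] != n_individuals:
        raise ValueError(
            "kinship matrix has {} rows but there are {} individuals".format(
                kinship.shape[0], n_individuals
            )
        )
    if not np.allclose(kinship, kinship.T, atol=tol):
        raise ValueError("kinship matrix is not symmetric")
    return kinship


if __name__ == "__main__":
    # Toy data: 5 individuals, 20 markers coded 0/1 (illustrative only).
    rng = np.random.RandomState(0)
    genotypes = rng.randint(0, 2, size=(5, 20)).astype(float)
    toy_kinship = genotypes.dot(genotypes.T) / genotypes.shape[1]
    checked = check_kinship(toy_kinship, n_individuals=genotypes.shape[0])
    print("kinship shape:", checked.shape)
```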
hdf5 input
python gwas.py -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py
csv input
python gwas.py -x gwas_sample_data/G_sample.csv -y gwas_sample_data/Y_sample.csv -k gwas_sample_data/K_sample.csv
plink input
To use the PLINK data format, add a bed, bim and fam file with the same prefix to the folder. You can tell GWAS-Flow to use those files by passing prefix.plink as the option for the genotype file:
python gwas.py -x gwas_sample_data/my_plink.plink -y gwas_sample_data/pheno2.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py
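The three calls above differ only in their file arguments, so for scripted runs the command can be assembled in Python. This is a minimal sketch that uses only the -x, -y, -k and -o flags from this README; `build_gwas_command` and `run_gwas` are hypothetical helper names, and the script is assumed to be started from the repository root where gwas.py lives.

```python
import subprocess
import sys


def build_gwas_command(genotype, phenotype, kinship, out="results.csv"):
    """Assemble the argument list for gwas.py from the documented flags."""
    return [
        sys.executable, "gwas.py",
        "-x", genotype,
        "-y", phenotype,
        "-k", kinship,
        "-o", out,
    ]


def run_gwas(genotype, phenotype, kinship, out="results.csv"):
    """Run gwas.py and raise CalledProcessError if it exits with an error."""
    cmd = build_gwas_command(genotype, phenotype, kinship, out)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the command here; call run_gwas(...) to actually execute it.
    command = build_gwas_command(
        "gwas_sample_data/G_sample.csv",
        "gwas_sample_data/Y_sample.csv",
        "gwas_sample_data/K_sample.csv",
    )
    print(" ".join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.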
Flags and options are
-x , --genotype : file containing marker information in csv or hdf5 format of size
-y , --phenotype : file containing phenotype information in csv format
-k , --kinship : file containing kinship matrix of size k X k in csv or hdf5 format
-m : name of column to be used in phenotype file. Default m='phenotype_value'
-a , --mac_min : integer specifying the minimum minor allele count necessary for a marker to be included. Default a = 1
-bs, --batch-size : integer specifying the number of markers processed at once. Default -bs 500000
-p , --perm : perform n permutations
--out_perm : output individual results of the permutations. Default False, enable with an arbitrary string (e.g. --out_perm yo)
--plot : create a Manhattan plot
-o , --out : name of output file. Default -o results.csv
-h , --help : prints help and command line options
Use python gwas.py -h to see the command line options.
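To illustrate what the -a / --mac_min option means, the following sketch computes the minor allele count per marker and drops markers below the cutoff. It assumes a 0/1-coded genotype matrix with individuals in rows and markers in columns, and that markers with a count equal to the cutoff are kept; both are assumptions for illustration, and the function names are hypothetical rather than taken from GWAS_Flow.

```python
import numpy as np


def minor_allele_count(genotypes):
    """Per-marker minor allele count for a 0/1-coded matrix
    (rows: individuals, columns: markers)."""
    genotypes = np.asarray(genotypes)
    ones = genotypes.sum(axis=0)
    zeros = genotypes.shape[0] - ones
    return np.minimum(ones, zeros)


def filter_by_mac(genotypes, mac_min=1):
    """Keep only markers whose minor allele count is at least mac_min.

    Returns the filtered matrix and the boolean mask of kept markers."""
    genotypes = np.asarray(genotypes)
    keep = minor_allele_count(genotypes) >= mac_min
    return genotypes[:, keep], keep


if __name__ == "__main__":
    toy = np.array([
        [0, 1, 1],
        [0, 1, 0],
        [0, 0, 1],
        [0, 1, 0],
    ])
    print(minor_allele_count(toy))        # [0 1 2]
    filtered, mask = filter_by_mac(toy, mac_min=1)
    print(mask)                           # [False  True  True]
```

The first marker is monomorphic in this toy matrix, so it carries no information for an association test and is removed with the default cutoff of 1.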
Execution with docker and singularity
Execute the docker container with the sample data
docker run --rm -u $UID:$GID -v $PWD:/data gwas_flow:latest -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py
On Windows you can use something like this after activating the file sharing for the drive the repo is stored on:
cd c:\PATH\TO\REPO\GWAS_Flow
docker run -v c:/PATH/TO/REPO/GWAS_Flow:/data gwas_flow:latest -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py
!! The GPU versions of docker and singularity are still under development and might or might not work properly with your setup. To run GWAS-Flow on GPUs, as of now we recommend the usage of anaconda environments
Execute the singularity image with the sample data
singularity run gwas_flow_sing.img -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py
further options
Co-factor
Previous versions of GWAS_Flow (<= 1.1.2) had experimental support for one Co-Factor. This functionality was dropped in v1.2.0 (see #30).
Permutation
Add the flag --perm 100 to calculate a significance threshold based on 100 permutations. Change 100 to any integer larger than 2 to perform n permutations.
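The general idea behind a permutation-based threshold can be sketched as follows: shuffle the phenotype, rerun the scan, record the smallest p-value, and take a low quantile of those minima. The code below is an illustration only. It uses a plain per-marker correlation test as a stand-in for the mixed model with kinship, and the choice of the 5% quantile of the minimum p-values is an assumption about the procedure, not a statement of what GWAS_Flow implements.

```python
import numpy as np
from scipy import stats


def marker_pvalues(genotypes, phenotype):
    """Two-sided p-values of the Pearson correlation between each marker
    column and the phenotype. Assumes no monomorphic markers."""
    g = genotypes - genotypes.mean(axis=0)
    y = phenotype - phenotype.mean()
    r = (g * y[:, None]).sum(axis=0) / np.sqrt((g ** 2).sum(axis=0) * (y ** 2).sum())
    r = np.clip(r, -0.999999, 0.999999)  # keeps 1 - r**2 away from zero
    n = len(phenotype)
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return 2.0 * stats.t.sf(np.abs(t), df=n - 2)


def permutation_threshold(genotypes, phenotype, n_perm=100, alpha=0.05, seed=0):
    """Return the alpha quantile of the minimum p-value over n_perm
    scans with a randomly shuffled phenotype."""
    rng = np.random.RandomState(seed)
    min_p = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(phenotype)
        min_p[i] = marker_pvalues(genotypes, shuffled).min()
    return np.quantile(min_p, alpha)


if __name__ == "__main__":
    # Toy data: 60 individuals, 50 markers coded 0/1 (illustrative only).
    rng = np.random.RandomState(1)
    toy_genotypes = rng.randint(0, 2, size=(60, 50)).astype(float)
    toy_phenotype = rng.normal(size=60)
    print(permutation_threshold(toy_genotypes, toy_phenotype, n_perm=100))
```

Because shuffling breaks any real link between genotype and phenotype, the minima describe how small a p-value can get by chance alone across all markers, which is why more permutations give a more stable threshold.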
Manhattan plot
By default no plot is generated. If you add --plot True, a Manhattan plot is generated.
The dash-dotted line is the Bonferroni threshold of significance and the dashed line is the permutation-based threshold.
The latter is only calculated if the flag --perm n was used with n > 2.
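For reference, a Bonferroni threshold divides the significance level by the number of tests. The sketch below assumes a significance level of 0.05 and the usual -log10(p) y-axis of a Manhattan plot; neither value is stated in this README, and the function names are hypothetical.

```python
import math


def bonferroni_threshold(n_markers, alpha=0.05):
    """Per-marker p-value cutoff that keeps the family-wise error rate at alpha."""
    if n_markers < 1:
        raise ValueError("n_markers must be at least 1")
    return alpha / n_markers


def neg_log10(p_value):
    """Height of a p-value on a -log10 scaled y-axis."""
    return -math.log10(p_value)


if __name__ == "__main__":
    cutoff = bonferroni_threshold(10000)
    print("p-value cutoff:", cutoff)
    print("height on the plot:", round(neg_log10(cutoff), 2))
```

Markers whose points lie above the threshold line have p-values below the cutoff.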
Performance Benchmarking and Recommendations
The image displays the average time of 10 runs with 10000 markers each and a varying number of phenotypes for GWAS_Flow on GPU and CPUs and a standard R script for GWAS.
The computational time grows exponentially with an increasing number of phenotypes.
With lower numbers of phenotypes (< 800), the CPU version is faster than the GPU version.
Above that, the balance shifts more and more in favor of the GPU version the more phenotypes are included.
All calculations have been performed on 16 i9 vCPUs and an NVIDIA Tesla P100 graphics card.
Unit tests
The unit tests can be run on the console with:
python -m unittest tests/test.py
All the necessary test data is stored in test_data
Changes
v1.2.0
- providing a kinship matrix via -k is now required (#27)
- fix degrees of freedom (#29)
- drop co-factor support (--cof no longer works, see #30)
- standard error is no longer reported in the output files (#28)
- create plots in png and pdf format (related to #16)
- fix bug with permutation output when a path was given with --out (rather than a filename, related to #16)